Daily arXiv Papers - 2025-07-22

Summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] DeepWriter: A Fact-Grounded Multimodal Writing Assistant Based On Offline Knowledge Base

Song Mao, Lejun Cheng, Pinlong Cai, Guohang Yan, Ding Wang, Botian Shi

Main category: cs.CL

TL;DR: DeepWriter is a customizable, multimodal writing assistant for specialized domains, using a curated offline knowledge base to generate high-quality, factually accurate documents.

Motivation: LLMs struggle in specialized domains due to lack of deep knowledge and hallucination. Existing solutions like RAG or online search are inconsistent or unreliable.

Method: DeepWriter uses a novel pipeline: task decomposition, outline generation, multimodal retrieval, and section-by-section composition with reflection. It employs a hierarchical knowledge representation for efficient retrieval.

Result: DeepWriter outperforms baselines in financial report generation, producing verifiable, high-quality content with better factual accuracy.

Conclusion: DeepWriter addresses LLM limitations in specialized writing tasks by leveraging structured knowledge and multimodal retrieval, achieving superior results.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various applications. However, their use as writing assistants in specialized domains like finance, medicine, and law is often hampered by a lack of deep domain-specific knowledge and a tendency to hallucinate. Existing solutions, such as Retrieval-Augmented Generation (RAG), can suffer from inconsistency across multiple retrieval steps, while online search-based methods often degrade quality due to unreliable web content. To address these challenges, we introduce DeepWriter, a customizable, multimodal, long-form writing assistant that operates on a curated, offline knowledge base. DeepWriter leverages a novel pipeline that involves task decomposition, outline generation, multimodal retrieval, and section-by-section composition with reflection. By deeply mining information from a structured corpus and incorporating both textual and visual elements, DeepWriter generates coherent, factually grounded, and professional-grade documents. We also propose a hierarchical knowledge representation to enhance retrieval efficiency and accuracy. Our experiments on financial report generation demonstrate that DeepWriter produces high-quality, verifiable articles that surpass existing baselines in factual accuracy and generated content quality.
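
The pipeline described above maps onto a simple orchestration loop. Below is a minimal, hypothetical sketch of the decompose → outline → retrieve → compose-with-reflection flow; the `llm` and `kb` interfaces and all prompt strings are illustrative assumptions, not the authors' implementation.

```python
def deepwriter_draft(task: str, llm, kb) -> str:
    """Compose a long-form document section by section with reflection."""
    subtasks = llm.generate(f"Decompose into subtasks: {task}")
    outline = llm.generate(f"Write an outline for: {task}\nSubtasks: {subtasks}")

    sections = []
    for heading in outline.splitlines():
        # Multimodal retrieval over the curated offline knowledge base:
        # returns text passages plus figure/table references.
        evidence = kb.retrieve(heading, modalities=("text", "image"), top_k=5)
        draft = llm.generate(f"Write section '{heading}' using only:\n{evidence}")
        # Reflection pass: check the draft against the retrieved evidence.
        critique = llm.generate(
            f"List unsupported claims in:\n{draft}\nEvidence:\n{evidence}")
        sections.append(llm.generate(f"Revise to fix:\n{critique}\n\nDraft:\n{draft}"))
    return "\n\n".join(sections)
```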

[2] Retention analysis of edited knowledge after fine-tuning

Fufang Wen, Shichang Zhang

Main category: cs.CL

TL;DR: Fine-tuning affects edited knowledge in LLMs more than intrinsic knowledge, making it susceptible to forgetting. Freezing edited layers can improve retention.

Motivation: To understand how fine-tuning impacts edited knowledge in LLMs, given the lack of prior research on this interaction.

Method: Systematically investigate interactions between fine-tuning objectives and model editing techniques, including freezing layers with edited content.

Result: Edited knowledge is more prone to forgetting during fine-tuning than intrinsic knowledge. Freezing edited layers enhances retention.

Conclusion: Current editing methods need robustness evaluation under fine-tuning. Freezing layers offers a potential solution for improving knowledge retention.

Abstract: Large language models (LLMs) store vast amounts of knowledge, which often requires updates to correct factual errors, incorporate newly acquired information, or adapt model behavior. Model editing methods have emerged as efficient solutions for such updates, offering localized and precise knowledge modification at significantly lower computational cost than continual training. In parallel, LLMs are frequently fine-tuned for a wide range of downstream tasks. However, the effect of fine-tuning on previously edited knowledge remains poorly understood. In this work, we systematically investigate how different fine-tuning objectives interact with various model editing techniques. Our findings show that edited knowledge is substantially more susceptible to forgetting during fine-tuning than intrinsic knowledge acquired through pre-training. This analysis highlights a key limitation of current editing approaches and suggests that evaluating edit robustness under downstream fine-tuning is critical for their practical deployment. We further find that freezing layers associated with edited content can significantly improve knowledge retention, offering insight into how future editing methods might be made more robust.
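
Because the retention fix is concrete, it is easy to sketch. A minimal PyTorch illustration follows, assuming a GPT-2-style parameter naming scheme and hypothetical layer indices; this is not the paper's code.

```python
import torch

EDITED_LAYERS = {5, 6}  # hypothetical indices of the blocks holding edits

def freeze_edited_layers(model: torch.nn.Module) -> None:
    """Exclude edited layers from fine-tuning updates."""
    for name, param in model.named_parameters():
        # Matches e.g. "transformer.h.5.mlp.c_proj.weight" in GPT-2-style models.
        if any(f".h.{i}." in name for i in EDITED_LAYERS):
            param.requires_grad = False

# Then build the optimizer over the remaining trainable parameters only:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-5)
```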

[3] Open-Source LLMs Collaboration Beats Closed-Source LLMs: A Scalable Multi-Agent System

Shengji Tang, Jianjian Cao, Weihao Lin, Jiale Hong, Bo Zhang, Shuyue Hu, Lei Bai, Tao Chen, Wanli Ouyang, Peng Ye

Main category: cs.CL

TL;DR: SMACS, a scalable multi-agent collaboration system, integrates open-source LLMs to outperform closed-source models like Claude-3.7-Sonnet and GPT-4.1, achieving higher performance across benchmarks.

Motivation: To explore whether multiple open-source LLMs can surpass closed-source LLMs in performance.

Method: Proposes SMACS with Retrieval-based Prior Selection (RPS) for LLM selection and Exploration-Exploitation-Driven Posterior Enhancement (EPE) for diverse, high-quality responses.

Result: SMACS outperforms leading closed-source LLMs by significant margins (e.g., +12.73% over Claude-3.7-Sonnet) and exceeds the best results from both open and closed-source models.

Conclusion: SMACS demonstrates the potential of open-source LLM collectives to surpass closed-source models, pushing the boundaries of AI performance.

Abstract: This paper aims to demonstrate the potential and strengths of open-source collectives. It leads to a promising question: Can we harness multiple open-source LLMs to match or even beat the closed-source LLMs? To answer this, we propose SMACS, a scalable multi-agent collaboration system (MACS) framework with high performance. Specifically, for continuous integration of new LLMs and generalization to diverse questions, we first propose a Retrieval-based Prior Selection (RPS), which assigns a proxy performance score to each LLM to select the Top-k LLMs at the instance level for any given question. Then, we propose an Exploration-Exploitation-Driven Posterior Enhancement (EPE), encouraging the generation of diverse responses through prior dropping and selecting the high-quality response via a hybrid posterior score. Experiments on eight mainstream benchmarks validate the effectiveness of our SMACS: by integrating fifteen open-source LLMs, SMACS outperforms leading closed-source LLMs in 2025, e.g., Claude-3.7-Sonnet (+12.73%), GPT-4.1 (+5.36%) and GPT-o3-mini (+5.28%) across multiple tasks. Remarkably, it even exceeds the average of best results of different datasets from both open-source LLMs (+2.86%) and closed-source LLMs (+2.04%), pushing the upper bound of intelligence. Code will be released at https://github.com/magent4aci/SMACS.
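
The two stages reduce to a short select-then-aggregate loop. The sketch below is schematic: `proxy_score` and `posterior_score` are stand-ins for the paper's retrieval-based prior and hybrid posterior scoring, and the `generate` interface is assumed.

```python
def rps_select(question, llms, proxy_score, k=5):
    """Retrieval-based Prior Selection: pick the Top-k LLMs for this
    question by an instance-level proxy performance score."""
    return sorted(llms, key=lambda m: proxy_score(m, question), reverse=True)[:k]

def epe_answer(question, selected, posterior_score):
    """Exploration-Exploitation-Driven Posterior Enhancement: collect diverse
    candidates (the paper encourages diversity via prior dropping), then keep
    the response with the best hybrid posterior score."""
    candidates = [m.generate(question) for m in selected]
    return max(candidates, key=lambda r: posterior_score(r, question, candidates))
```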

[4] Let’s Measure the Elephant in the Room: Facilitating Personalized Automated Analysis of Privacy Policies at Scale

Rui Zhao, Vladyslav Melnychuk, Jun Zhao, Jesse Wright, Nigel Shadbolt

Main category: cs.CL

TL;DR: PoliAnalyzer, a neuro-symbolic system, uses NLP and logical inference to analyze privacy policies against user preferences, reducing cognitive burden by highlighting non-compliant segments.

Motivation: Users rarely read privacy policies despite their importance, leading to a lack of awareness about data usage practices. PoliAnalyzer aims to automate and personalize policy analysis to empower users.

Method: Extends a formal policy language, uses NLP to extract data usage practices, and applies logical inference to compare policies with user preferences. Evaluated on the PolicyIE dataset and top 100 websites.

Result: Achieved 90-100% F1-score in identifying practices; 95.2% of policy segments complied with preferences, reducing focus to 4.8% non-compliant segments. Highlighted common violations like location data sharing.

Conclusion: PoliAnalyzer enables scalable, automated privacy policy analysis, empowering users and fostering discussions on fair data practices.

Abstract: In modern times, people have numerous online accounts, but they rarely read the Terms of Service or Privacy Policy of those sites despite claiming otherwise. This paper introduces PoliAnalyzer, a neuro-symbolic system that assists users with personalized privacy policy analysis. PoliAnalyzer uses Natural Language Processing (NLP) to extract formal representations of data usage practices from policy texts. In favor of determinism, logical inference is applied to compare user preferences with the formal privacy policy representation and produce a compliance report. To achieve this, we extend an existing formal Data Terms of Use policy language to model privacy policies as app policies and user preferences as data policies. In our evaluation using our enriched PolicyIE dataset curated by legal experts, PoliAnalyzer demonstrated high accuracy in identifying relevant data usage practices, achieving an F1-score of 90-100% across most tasks. Additionally, we demonstrate how PoliAnalyzer can model diverse user data-sharing preferences, derived from prior research as 23 user profiles, and perform compliance analysis against the top 100 most-visited websites. This analysis revealed that, on average, 95.2% of a privacy policy’s segments do not conflict with the analyzed user preferences, enabling users to concentrate on understanding the 4.8% (636 / 13205) that violates preferences, significantly reducing cognitive burden. Further, we identified common practices in privacy policies that violate user expectations - such as the sharing of location data with 3rd parties. This paper demonstrates that PoliAnalyzer can support automated personalized privacy policy analysis at scale using off-the-shelf NLP tools. This sheds light on a pathway to help individuals regain control over their data and encourage societal discussions on platform data practices to promote a fairer power dynamic.
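
The final compliance step is, in essence, a comparison between two sets of formalized practices. A toy sketch of that step follows; the tuple schema is invented for illustration, whereas the actual system uses an extended Data Terms of Use policy language and logical inference.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Practice:
    data_type: str   # e.g., "location"
    purpose: str     # e.g., "advertising"
    recipient: str   # e.g., "third_party"

def compliance_report(app_policy: set, forbidden: set) -> dict:
    """Flag extracted policy practices that conflict with user preferences."""
    violations = app_policy & forbidden
    return {"compliant": app_policy - violations, "violations": violations}

report = compliance_report(
    app_policy={Practice("location", "advertising", "third_party"),
                Practice("email", "account_management", "first_party")},
    forbidden={Practice("location", "advertising", "third_party")},
)
print(report["violations"])  # the small slice a user actually needs to read
```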

[5] Beyond Architectures: Evaluating the Role of Contextual Embeddings in Detecting Bipolar Disorder on Social Media

Khalid Hasan, Jamil Saquer

Main category: cs.CL

TL;DR: The paper evaluates NLP models for detecting bipolar disorder from social media text, finding RoBERTa and LSTM with BERT embeddings perform best, while static embeddings fail. DistilBERT balances efficiency and accuracy.

Motivation: Bipolar disorder is often underdiagnosed due to subtle symptoms and stigma. The study aims to leverage NLP for early detection using social media data.

Method: Evaluated transformer models (BERT, RoBERTa, ALBERT, ELECTRA, DistilBERT) and LSTM models with contextualized (BERT) and static (GloVe, Word2Vec) embeddings on annotated Reddit posts.

Result: RoBERTa achieved the highest F1 score (~98%), while LSTM with BERT embeddings performed similarly. Static embeddings scored near-zero F1. DistilBERT offered efficiency-accuracy balance.

Conclusion: Contextual language models are crucial for bipolar disorder detection. The study provides insights for model selection in mental health NLP and supports early screening.

Abstract: Bipolar disorder is a chronic mental illness frequently underdiagnosed due to subtle early symptoms and social stigma. This paper explores advanced natural language processing (NLP) models for recognizing signs of bipolar disorder based on user-generated social media text. We conduct a comprehensive evaluation of transformer-based models (BERT, RoBERTa, ALBERT, ELECTRA, DistilBERT) and Long Short Term Memory (LSTM) models based on contextualized (BERT) and static (GloVe, Word2Vec) word embeddings. Experiments were performed on a large, annotated dataset of Reddit posts after confirming their validity through sentiment variance and judgmental analysis. Our results demonstrate that RoBERTa achieves the highest performance among transformer models with an F1 score of ~98% while LSTM models using BERT embeddings yield nearly identical results. In contrast, LSTMs trained on static embeddings fail to capture meaningful patterns, scoring near-zero F1. These findings underscore the critical role of contextual language modeling in detecting bipolar disorder. In addition, we report model training times and highlight that DistilBERT offers an optimal balance between efficiency and accuracy. In general, our study offers actionable insights for model selection in mental health NLP applications and validates the potential of contextualized language models to support early bipolar disorder screening.

[6] Language Models Change Facts Based on the Way You Talk

Matthew Kearney, Reuben Binns, Yarin Gal

Main category: cs.CL

TL;DR: LLMs show bias in high-stakes applications like medicine, law, and job salaries, influenced by identity markers like race, gender, and age, leading to harmful disparities.

Motivation: To analyze how identity markers in user queries bias LLM responses in critical applications and assess the implications.

Method: Comprehensive analysis of LLM responses across five high-stakes domains (medicine, law, politics, government benefits, job salaries) using identity markers.

Result: LLMs exhibit significant bias, applying different standards of care, altering answers based on political views, and recommending unequal salaries based on identity.

Conclusion: Off-the-shelf LLMs can cause harmful disparities; thorough assessments are needed before deployment in user-facing applications.

Abstract: Large language models (LLMs) are increasingly being used in user-facing applications, from providing medical consultations to job interview advice. Recent research suggests that these models are becoming increasingly proficient at inferring identity information about the author of a piece of text from linguistic patterns as subtle as the choice of a few words. However, little is known about how LLMs use this information in their decision-making in real-world applications. We perform the first comprehensive analysis of how identity markers present in a user’s writing bias LLM responses across five different high-stakes LLM applications in the domains of medicine, law, politics, government benefits, and job salaries. We find that LLMs are extremely sensitive to markers of identity in user queries and that race, gender, and age consistently influence LLM responses in these applications. For instance, when providing medical advice, we find that models apply different standards of care to individuals of different ethnicities for the same symptoms; we find that LLMs are more likely to alter answers to align with a conservative (liberal) political worldview when asked factual questions by older (younger) individuals; and that LLMs recommend lower salaries for non-White job applicants and higher salaries for women compared to men. Taken together, these biases mean that the use of off-the-shelf LLMs for these applications may cause harmful differences in medical care, foster wage gaps, and create different political factual realities for people of different identities. Beyond providing an analysis, we also provide new tools for evaluating how subtle encoding of identity in users’ language choices impacts model decisions. Given the serious implications of these findings, we recommend that similar thorough assessments of LLM use in user-facing applications are conducted before future deployment.

[7] CCL-XCoT: An Efficient Cross-Lingual Knowledge Transfer Method for Mitigating Hallucination Generation

Weihua Zheng, Roy Ka-Wei Lee, Zhengyuan Liu, Kui Wu, AiTi Aw, Bowei Zou

Main category: cs.CL

TL;DR: CCL-XCoT, a two-stage fine-tuning framework, reduces hallucinations in Multilingual Large Language Models (MLLMs) by 62% using curriculum-based contrastive learning and cross-lingual Chain-of-Thought prompting.

Motivation: MLLMs suffer from hallucinations, especially in low-resource languages, due to training data imbalances, impacting domain-specific tasks.

Method: CCL-XCoT combines curriculum-based contrastive learning for semantic alignment and XCoT prompting for reasoning in high-resource languages before generating in low-resource ones.

Result: The framework reduces hallucination rates by up to 62% and improves factual knowledge transfer across languages.

Conclusion: CCL-XCoT effectively mitigates hallucinations in MLLMs without external tools, enhancing cross-lingual performance.

Abstract: Multilingual Large Language Models (MLLMs) demonstrate strong generalization across languages, yet they remain prone to hallucinations, especially in low-resource languages, due to training data imbalances. These hallucinations, which include inaccurate or fabricated outputs, are particularly problematic in domain-specific generation tasks (Chataigner et al., 2024). To address this challenge, we propose CCL-XCoT (Curriculum-based Contrastive Learning-based Cross-lingual Chain-of-Thought), a two-stage fine-tuning framework for mitigating hallucination in MLLMs. Our approach first enhances cross-lingual semantic alignment through curriculum-based contrastive learning combined with next-token prediction during continued pre-training. Building on this foundation, we then introduce a cross-lingual Chain-of-Thought (XCoT) prompting strategy during instruction fine-tuning, which guides the model to reason in a high-resource language before generating answers in the target low-resource language. Experimental results show that CCL-XCoT reduces hallucination rates by up to 62% and substantially improves factual knowledge transfer across language pairs, without relying on external retrieval or multi-model ensembles.
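
The XCoT stage can be pictured as a prompt template: reason step by step in a high-resource pivot language, then answer in the low-resource target. The wording and the language pair below are assumptions, not the paper's exact prompt.

```python
XCOT_TEMPLATE = """Question ({target_lang}): {question}

First, reason through the problem step by step in {pivot_lang}.
Then give the final answer in {target_lang} only.

Reasoning ({pivot_lang}):"""

prompt = XCOT_TEMPLATE.format(
    question="...",          # question text in the target language
    target_lang="Swahili",   # hypothetical low-resource target
    pivot_lang="English",    # high-resource reasoning language
)
```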

[8] HuggingGraph: Understanding the Supply Chain of LLM Ecosystem

Mohammad Shahedur Rahman, Peng Gao, Yuede Ji

Main category: cs.CL

TL;DR: The paper analyzes the LLM supply chain, revealing its large, sparse, and dynamic nature, with datasets playing critical roles and strong interdependencies between models and datasets.

Motivation: The growing complexity and resource demands of LLMs create barriers, and inherited vulnerabilities or biases from base models and datasets necessitate understanding their origins to mitigate risks.

Method: The study collects LLM supply chain data, constructs a directed heterogeneous graph (397,376 nodes, 453,469 edges), and analyzes its structure and dynamics.

Result: Findings include the graph’s power-law distribution, dense core, fragmented periphery, pivotal dataset roles, strong model-dataset interdependence, and daily updates.

Conclusion: Understanding the LLM supply chain’s structure and dynamics is crucial for risk detection, fairness improvement, and compliance.

Abstract: Large language models (LLMs) leverage deep learning to process and predict sequences of words from context, enabling them to perform various NLP tasks, such as translation, summarization, question answering, and content generation. However, the growing size and complexity of developing, training, and deploying advanced LLMs require extensive computational resources and large datasets. This creates a barrier for users. As a result, platforms that host models and datasets are widely used. For example, Hugging Face, one of the most popular platforms, hosted 1.8 million models and 450K datasets by June 2025, with no sign of slowing down. Since many LLMs are built from base models, pre-trained models, and external datasets, they can inherit vulnerabilities, biases, or malicious components from earlier models or datasets. Therefore, it is critical to understand the origin and development of these components to better detect potential risks, improve model fairness, and ensure compliance. Motivated by this, our project aims to study the relationships between models and datasets, which are core components of the LLM supply chain. First, we design a method to systematically collect LLM supply chain data. Using this data, we build a directed heterogeneous graph to model the relationships between models and datasets, resulting in a structure with 397,376 nodes and 453,469 edges. We then perform various analyses and uncover several findings, such as: (i) the LLM supply chain graph is large, sparse, and follows a power-law degree distribution; (ii) it features a densely connected core and a fragmented periphery; (iii) datasets play pivotal roles in training; (iv) strong interdependence exists between models and datasets; and (v) the graph is dynamic, with daily updates reflecting the ecosystem’s ongoing evolution.
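
The graph construction is easy to picture at toy scale with networkx; the node names and attributes below are invented, while the real graph has 397,376 nodes and 453,469 edges.

```python
import networkx as nx

G = nx.DiGraph()  # directed heterogeneous graph: model and dataset nodes
G.add_node("corpus-A", kind="dataset")
G.add_node("base-model", kind="model")
G.add_node("finetuned-model", kind="model")
G.add_edge("corpus-A", "base-model", relation="trained_on")
G.add_edge("base-model", "finetuned-model", relation="finetuned_from")

# Structural analyses such as the reported power-law finding start from
# simple statistics like the degree distribution:
print(sorted(d for _, d in G.degree()))
```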

[9] Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models

Rithesh Murthy, Ming Zhu, Liangwei Yang, Jielin Qiu, Juntao Tan, Shelby Heinecke, Huan Wang, Caiming Xiong, Silvio Savarese

Main category: cs.CL

TL;DR: Promptomatix automates prompt optimization for LLMs, improving performance and accessibility without manual tuning.

Motivation: Manual prompt engineering is inconsistent and inaccessible to non-experts, necessitating an automated solution.

Method: Uses meta-prompt-based optimization and a DSPy-powered compiler; analyzes user intent, generates synthetic training data, and refines prompts with cost-aware objectives.

Result: Competitive/superior performance in 5 task categories, with reduced prompt length and computational overhead.

Conclusion: Promptomatix makes prompt optimization scalable, efficient, and accessible.

Abstract: Large Language Models (LLMs) perform best with well-crafted prompts, yet prompt engineering remains manual, inconsistent, and inaccessible to non-experts. We introduce Promptomatix, an automatic prompt optimization framework that transforms natural language task descriptions into high-quality prompts without requiring manual tuning or domain expertise. Promptomatix supports both a lightweight meta-prompt-based optimizer and a DSPy-powered compiler, with modular design enabling future extension to more advanced frameworks. The system analyzes user intent, generates synthetic training data, selects prompting strategies, and refines prompts using cost-aware objectives. Evaluated across 5 task categories, Promptomatix achieves competitive or superior performance compared to existing libraries, while reducing prompt length and computational overhead, making prompt optimization scalable and efficient.
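
In outline, such an optimizer is a generate-evaluate-refine loop over candidate prompts. The sketch below is generic and assumes an `llm` interface and a scoring scheme of its own invention; it is not Promptomatix's internals.

```python
def optimize_prompt(task_description, llm, eval_set, rounds=3):
    """eval_set: (input, expected_output) pairs, e.g. synthetic training data."""
    prompt = llm.generate(f"Write a concise prompt for this task:\n{task_description}")
    for _ in range(rounds):
        correct = sum(llm.generate(f"{prompt}\n{x}") == y for x, y in eval_set)
        accuracy = correct / len(eval_set)
        # Cost-aware refinement: reward accuracy while pushing for brevity.
        prompt = llm.generate(
            f"Prompt:\n{prompt}\nAccuracy: {accuracy:.2f}.\n"
            "Rewrite the prompt to be shorter while keeping or improving accuracy."
        )
    return prompt
```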

[10] In-Depth and In-Breadth: Pre-training Multimodal Language Models Customized for Comprehensive Chart Understanding

Wan-Cyuan Fan, Yen-Chun Chen, Mengchen Liu, Alexander Jacobson, Lu Yuan, Leonid Sigal

Main category: cs.CL

TL;DR: ChartScope is a new LVLM for chart comprehension, addressing limitations of existing methods by using a data generation pipeline and Dual-Path training. It outperforms on diverse chart types and introduces a new benchmark, ChartDQA.

Motivation: Existing LVLMs for chart comprehension lack generalization across chart types and targeted pre-training for data alignment.

Method: Proposes a data generation pipeline for diverse chart types and a Dual-Path training strategy to enhance data understanding and reasoning.

Result: ChartScope improves comprehension across various chart types and introduces the ChartDQA benchmark.

Conclusion: ChartScope advances chart comprehension by addressing data diversity and alignment, validated by the new benchmark.

Abstract: Recent methods for customizing Large Vision Language Models (LVLMs) for domain-specific tasks have shown promising results in scientific chart comprehension. However, existing approaches face two major limitations: First, they rely on paired data from only a few chart types, limiting generalization to a wide range of chart types. Secondly, they lack targeted pre-training for chart-data alignment, which hampers the model’s understanding of underlying data. In this paper, we introduce ChartScope, an LVLM optimized for in-depth chart comprehension across diverse chart types. We propose an efficient data generation pipeline that synthesizes paired data for a wide range of chart types, along with a novel Dual-Path training strategy that enables the model to succinctly capture essential data details while preserving robust reasoning capabilities by incorporating reasoning over the underlying data. Lastly, we establish ChartDQA, a new benchmark for evaluating not only question-answering at different levels but also underlying data understanding. Experimental results demonstrate that ChartScope significantly enhances comprehension on a wide range of chart types. The code and data are available at https://davidhalladay.github.io/chartscope_demo.

[11] Aligning Large Language Models to Low-Resource Languages through LLM-Based Selective Translation: A Systematic Study

Rakesh Paul, Anusha Kamath, Kanishk Singla, Raviraj Joshi, Utkarsh Vaidya, Sanjay Singh Chauhan, Niranjan Wartikar

Main category: cs.CL

TL;DR: Selective translation improves multilingual LLM alignment by preserving non-translatable content, outperforming vanilla translation for low-resource languages like Hindi.

Motivation: Addressing the performance gap in multilingual LLMs for low-resource languages due to limited high-quality alignment data.

Method: Investigates LLM-based selective translation, comparing it with vanilla translation and evaluating the impact of filtering and mixed data.

Result: Selective translation proves effective for Hindi, outperforming vanilla translation in experiments comparing translations from Google Cloud Translation (GCP) and Llama-3.1-405B.

Conclusion: Selective translation is a practical solution for enhancing multilingual LLM alignment in low-resource settings.

Abstract: Multilingual large language models (LLMs) often demonstrate a performance gap between English and non-English languages, particularly in low-resource settings. Aligning these models to low-resource languages is essential yet challenging due to limited high-quality data. While English alignment datasets are readily available, curating equivalent data in other languages is expensive and time-consuming. A common workaround is to translate existing English alignment data; however, standard translation techniques often fail to preserve critical elements such as code, mathematical expressions, and structured formats like JSON. In this work, we investigate LLM-based selective translation, a technique that selectively translates only the translatable parts of a text while preserving non-translatable content and sentence structure. We conduct a systematic study to explore key questions around this approach, including its effectiveness compared to vanilla translation, the importance of filtering noisy outputs, and the benefits of mixing translated samples with original English data during alignment. Our experiments focus on the low-resource Indic language Hindi and compare translations generated by Google Cloud Translation (GCP) and Llama-3.1-405B. The results highlight the promise of selective translation as a practical and effective method for improving multilingual alignment in LLMs.
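
The mechanics of selective translation can be sketched as mask, translate, restore: shield non-translatable spans behind placeholders, translate the remainder, then substitute the spans back. The regex (inline code and simple JSON-like objects) and the placeholder scheme are illustrative assumptions.

```python
import re

NON_TRANSLATABLE = re.compile(r"`[^`]+`|\{[^{}]*\}")  # inline code, flat JSON

def selective_translate(text: str, translate) -> str:
    """`translate` is a stand-in for a GCP or LLM translation call."""
    protected = []
    def shield(match):
        protected.append(match.group(0))
        return f"<KEEP_{len(protected) - 1}>"
    masked = NON_TRANSLATABLE.sub(shield, text)
    translated = translate(masked)  # placeholders should pass through untouched
    for i, span in enumerate(protected):
        translated = translated.replace(f"<KEEP_{i}>", span)
    return translated
```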

[12] How LLMs Comprehend Temporal Meaning in Narratives: A Case Study in Cognitive Evaluation of LLMs

Karin de Langis, Jong Inn Park, Andreas Schramm, Bin Hu, Khanh Chi Le, Michael Mensink, Ahn Thu Tong, Dongyeop Kang

Main category: cs.CL

TL;DR: LLMs process linguistic aspect differently from humans, relying on prototypicality and struggling with causal reasoning, indicating a lack of robust narrative understanding.

Motivation: To determine if LLMs exhibit human-like cognition in processing linguistic aspect in narratives.

Method: Expert-in-the-Loop probing pipeline with targeted experiments to assess semantic and pragmatic processing.

Result: LLMs over-rely on prototypicality, produce inconsistent judgments, and struggle with causal reasoning.

Conclusion: LLMs process aspect differently from humans, lacking robust narrative comprehension; a standardized framework for assessment is proposed.

Abstract: Large language models (LLMs) exhibit increasingly sophisticated linguistic capabilities, yet the extent to which these behaviors reflect human-like cognition versus advanced pattern recognition remains an open question. In this study, we investigate how LLMs process the temporal meaning of linguistic aspect in narratives that were previously used in human studies. Using an Expert-in-the-Loop probing pipeline, we conduct a series of targeted experiments to assess whether LLMs construct semantic representations and pragmatic inferences in a human-like manner. Our findings show that LLMs over-rely on prototypicality, produce inconsistent aspectual judgments, and struggle with causal reasoning derived from aspect, raising concerns about their ability to fully comprehend narratives. These results suggest that LLMs process aspect fundamentally differently from humans and lack robust narrative understanding. Beyond these empirical findings, we develop a standardized experimental framework for the reliable assessment of LLMs’ cognitive and linguistic capabilities.

[13] STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models

Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang

Main category: cs.CL

TL;DR: Stitch is a novel method enabling Spoken Language Models (SLMs) to alternate between unspoken reasoning and spoken responses, reducing latency while improving reasoning performance.

Motivation: Current SLMs lack internal reasoning like humans, causing delays if full chain-of-thought (CoT) is generated before speaking.

Method: Stitch alternates between generating unspoken reasoning chunks and spoken response chunks, leveraging audio playback time for reasoning.

Result: Stitch matches the latency of baselines while outperforming them by 15% on math reasoning tasks, and performs equally well on non-reasoning tasks.

Conclusion: Stitch successfully integrates unspoken reasoning into SLMs without added latency, enhancing performance on reasoning tasks.

Abstract: Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting to talk can enable thinking for SLMs, this induces additional latency for the speech response, as the CoT reasoning can be arbitrarily long. To solve this issue, we propose Stitch, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks. Since the audio duration of a chunk of spoken response is much longer than the time to generate the tokens in a chunk of spoken response, we use the remaining free time to generate the unspoken reasoning tokens. When a chunk of audio is played to the user, the model continues to generate the next unspoken reasoning chunk, achieving simultaneous thinking and talking. Remarkably, Stitch matches the latency of baselines that cannot generate unspoken CoT by design while outperforming those baselines by 15% on math reasoning datasets; Stitch also performs equally well on non-reasoning datasets as those baseline models. Some animations and demonstrations are on the project page: https://d223302.github.io/STITCH.
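
The interleaving idea can be expressed as a loop that spends audio-playback time on hidden reasoning. The sketch below is conceptual: the `model` and `tts` interfaces and the chunk-size parameters are assumptions, not the paper's API.

```python
def stitch_generate(question, model, tts, reason_budget=64, speech_chunk=32):
    """Alternate unspoken reasoning chunks with spoken response chunks."""
    reasoning, transcript = [], []
    while not model.done(transcript):
        # 1) Use the "free" playback time to extend the hidden chain of thought.
        reasoning += model.generate(question, reasoning, transcript,
                                    kind="reasoning", max_tokens=reason_budget)
        # 2) Emit the next spoken chunk, conditioned on the reasoning so far.
        chunk = model.generate(question, reasoning, transcript,
                               kind="speech", max_tokens=speech_chunk)
        transcript += chunk
        tts.play_async(chunk)  # playback overlaps the next reasoning step
    return transcript
```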

[14] What Makes You CLIC: Detection of Croatian Clickbait Headlines

Marija Anđedelić, Dominik Šipek, Laura Majer, Jan Šnajder

Main category: cs.CL

TL;DR: The paper compares fine-tuned BERTić and LLM-based in-context learning for clickbait detection in Croatian, using the CLIC dataset, finding fine-tuned models outperform general LLMs.

Motivation: To address the need for automatic clickbait detection in less-resourced languages like Croatian, preserving information quality and reader trust.

Method: Compiled the CLIC dataset, fine-tuned BERTić, and compared it to LLM-based in-context learning with Croatian and English prompts.

Result: Nearly half of headlines contained clickbait; fine-tuned models outperformed general LLMs.

Conclusion: Fine-tuned models are more effective for clickbait detection in less-resourced languages like Croatian.

Abstract: Online news outlets operate predominantly on an advertising-based revenue model, compelling journalists to create headlines that are often scandalous, intriguing, and provocative – commonly referred to as clickbait. Automatic detection of clickbait headlines is essential for preserving information quality and reader trust in digital media and requires both contextual understanding and world knowledge. For this task, particularly in less-resourced languages, it remains unclear whether fine-tuned methods or in-context learning (ICL) yield better results. In this paper, we compile CLIC, a novel dataset for clickbait detection of Croatian news headlines spanning a 20-year period and encompassing mainstream and fringe outlets. We fine-tune the BERTić model on this task and compare its performance to LLM-based ICL methods with prompts both in Croatian and English. Finally, we analyze the linguistic properties of clickbait. We find that nearly half of the analyzed headlines contain clickbait, and that fine-tuned models deliver better results than general LLMs.

[15] Can LLMs Infer Personality from Real World Conversations?

Jianfeng Zhu, Ruoming Jin, Karin G. Coifman

Main category: cs.CL

TL;DR: LLMs like GPT-4 and LLaMA show promise for personality assessment but struggle with accuracy and validity. A benchmark of 555 interviews with BFI-10 scores tested three LLMs, revealing weak correlations with ground truth and biases. Chain-of-thought prompting helped slightly but not enough.

Motivation: To evaluate the effectiveness of LLMs in inferring personality traits from real-world data, addressing gaps in earlier work relying on synthetic or invalid data.

Method: Tested three LLMs (GPT-4.1 Mini, Meta-LLaMA, DeepSeek) using zero-shot and chain-of-thought prompting on a benchmark of 555 interviews with BFI-10 scores.

Result: Models had high reliability but weak validity (max Pearson’s r = 0.27), low interrater agreement, and bias toward moderate/high traits. Chain-of-thought improved alignment slightly.

Conclusion: Current LLMs have limitations for personality inference, emphasizing the need for evidence-based development in psychological applications.

Abstract: Large Language Models (LLMs) such as OpenAI’s GPT-4 and Meta’s LLaMA offer a promising approach for scalable personality assessment from open-ended language. However, inferring personality traits remains challenging, and earlier work often relied on synthetic data or social media text lacking psychometric validity. We introduce a real-world benchmark of 555 semi-structured interviews with BFI-10 self-report scores for evaluating LLM-based personality inference. Three state-of-the-art LLMs (GPT-4.1 Mini, Meta-LLaMA, and DeepSeek) were tested using zero-shot prompting for BFI-10 item prediction and both zero-shot and chain-of-thought prompting for Big Five trait inference. All models showed high test-retest reliability, but construct validity was limited: correlations with ground-truth scores were weak (max Pearson’s $r = 0.27$), interrater agreement was low (Cohen’s $\kappa < 0.10$), and predictions were biased toward moderate or high trait levels. Chain-of-thought prompting and longer input context modestly improved distributional alignment, but not trait-level accuracy. These results underscore limitations in current LLM-based personality inference and highlight the need for evidence-based development for psychological applications.
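
The validity statistics quoted above are standard and straightforward to reproduce for any predicted-vs-self-report trait scores; the toy numbers below are invented.

```python
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

self_report = [3, 4, 2, 5, 3, 4]   # e.g., BFI-10-derived trait levels
llm_scores  = [4, 4, 3, 4, 4, 5]   # model-inferred levels for the same people

r, p = pearsonr(self_report, llm_scores)             # construct validity
kappa = cohen_kappa_score(self_report, llm_scores)   # agreement
print(f"Pearson r={r:.2f} (p={p:.3f}), Cohen's kappa={kappa:.2f}")
```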

[16] Text-to-SQL for Enterprise Data Analytics

Albert Chen, Manas Bundele, Gaurav Ahlawat, Patrick Stetz, Zhitao Wang, Qiang Fei, Donghoon Jung, Audrey Chu, Bharadwaj Jayaraman, Ayushi Panth, Yatin Arora, Sourav Jain, Renjith Varma, Alexey Ilin, Iuliia Melnychuk, Chelsea Chueh, Joyan Sil, Xiaofeng Wang

Main category: cs.CL

TL;DR: The paper presents a practical approach for building an enterprise Text-to-SQL chatbot, leveraging a knowledge graph, a Text-to-SQL agent, and an interactive interface, achieving 53% accuracy on internal benchmarks.

Motivation: To address the challenge of creating a functional enterprise Text-to-SQL solution despite rapid progress in benchmarks, focusing on self-service data insights for teams.

Method: 1. Construct a dynamic knowledge graph from metadata, logs, wikis, and code. 2. Develop a Text-to-SQL agent for context retrieval, query writing, and error correction. 3. Build an interactive chatbot supporting diverse user intents with rich UI.

Result: The chatbot has 300+ weekly users and achieves 53% accuracy on expert-reviewed benchmarks. Ablation studies highlight key components for enterprise solutions.

Conclusion: The approach offers a practical framework for developing effective enterprise Text-to-SQL systems, emphasizing dynamic knowledge integration and user interaction.

Abstract: The introduction of large language models has brought rapid progress on Text-to-SQL benchmarks, but it is not yet easy to build a working enterprise solution. In this paper, we present insights from building an internal chatbot that enables LinkedIn’s product managers, engineers, and operations teams to self-serve data insights from a large, dynamic data lake. Our approach features three components. First, we construct a knowledge graph that captures up-to-date semantics by indexing database metadata, historical query logs, wikis, and code. We apply clustering to identify relevant tables for each team or product area. Second, we build a Text-to-SQL agent that retrieves and ranks context from the knowledge graph, writes a query, and automatically corrects hallucinations and syntax errors. Third, we build an interactive chatbot that supports various user intents, from data discovery to query writing to debugging, and displays responses in rich UI elements to encourage follow-up chats. Our chatbot has over 300 weekly users. Expert review shows that 53% of its responses are correct or close to correct on an internal benchmark set. Through ablation studies, we identify the most important knowledge graph and modeling components, offering a practical path for developing enterprise Text-to-SQL solutions.
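
The agent component reduces to a retrieve, write, self-correct loop. A minimal sketch under assumed interfaces follows; `kg`, `llm`, and `db` are hypothetical stand-ins, not LinkedIn's implementation.

```python
def text_to_sql(question, kg, llm, db, max_fixes=2):
    # Retrieve and rank context (tables, columns, example queries) from the
    # knowledge graph built over metadata, query logs, wikis, and code.
    context = kg.retrieve_and_rank(question, top_k=10)
    sql = llm.generate(f"Context:\n{context}\n\nWrite SQL for: {question}")
    for _ in range(max_fixes):
        error = db.dry_run(sql)  # surfaces syntax errors / unknown columns
        if error is None:
            return sql
        sql = llm.generate(f"Fix this SQL.\nError: {error}\nQuery:\n{sql}")
    return sql
```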

[17] Error-Aware Curriculum Learning for Biomedical Relation Classification

Sinchani Chakraborty, Sudeshna Sarkar, Pawan Goyal

Main category: cs.CL

TL;DR: An error-aware teacher-student framework using GPT-4o improves biomedical relation classification by analyzing errors, generating remediations, and training models progressively.

Motivation: Enhancing relation classification in biomedical texts to support knowledge graphs and applications like drug repurposing and clinical decision-making.

Method: Uses a teacher-student framework where GPT-4o analyzes errors, generates remediations, and trains two student models via instruction tuning and curriculum learning. A knowledge graph from PubMed abstracts supports context-aware classification.

Result: Achieves state-of-the-art performance on 4 of 5 PPI datasets and the DDI dataset, remaining competitive on ChemProt.

Conclusion: The proposed framework effectively improves biomedical relation classification through structured error analysis and progressive learning.

Abstract: Relation Classification (RC) in biomedical texts is essential for constructing knowledge graphs and enabling applications such as drug repurposing and clinical decision-making. We propose an error-aware teacher–student framework that improves RC through structured guidance from a large language model (GPT-4o). Prediction failures from a baseline student model are analyzed by the teacher to classify error types, assign difficulty scores, and generate targeted remediations, including sentence rewrites and suggestions for KG-based enrichment. These enriched annotations are used to train a first student model via instruction tuning. This model then annotates a broader dataset with difficulty scores and remediation-enhanced inputs. A second student is subsequently trained via curriculum learning on this dataset, ordered by difficulty, to promote robust and progressive learning. We also construct a heterogeneous biomedical knowledge graph from PubMed abstracts to support context-aware RC. Our approach achieves new state-of-the-art performance on 4 of 5 PPI datasets and the DDI dataset, while remaining competitive on ChemProt.
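
The curriculum stage itself is simple to sketch: order examples by the teacher-assigned difficulty score and train on progressively harder slices. The example schema and the bucketing are assumptions for illustration.

```python
examples = [  # difficulty scores assigned by the GPT-4o teacher
    {"text": "...", "label": "inhibits",  "difficulty": 0.8},
    {"text": "...", "label": "binds",     "difficulty": 0.2},
    {"text": "...", "label": "activates", "difficulty": 0.5},
]

curriculum = sorted(examples, key=lambda ex: ex["difficulty"])

n = len(curriculum)
for stage, end in enumerate((n // 3, 2 * n // 3, n), start=1):
    batch = curriculum[:end]  # cumulative slices: easy examples stay in the mix
    # student.train(batch)    # placeholder for the instruction-tuning step
    print(f"stage {stage}: {len(batch)} examples")
```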

[18] X-Intelligence 3.0: Training and Evaluating Reasoning LLM for Semiconductor Display

Xiaolin Yan, Yangxing Liu, Jiazhang Zheng, Chi Liu, Mingyu Du, Caisheng Chen, Haoyang Liu, Ming Ding, Yuan Li, Qiuping Liao, Linfeng Li, Zhili Mei, Siyu Wan, Li Li, Ruyi Zhong, Jiangling Yu, Xule Liu, Huihui Hu, Jiameng Yue, Ruohui Cheng, Qi Yang, Liangqing Wu, Ke Zhu, Chi Zhang, Chufei Jing, Yifan Zhou, Yan Liang, Dongdong Li, Zhaohui Wang, Bin Zhao, Mingzhou Wu, Mingzhong Zhou, Peng Du, Zuomin Liao, Chao Dai, Pengfei Liang, Xiaoguang Zhu, Yu Zhang, Yu Gu, Kun Pan, Yuan Wu, Yanqing Guan, Shaojing Wu, Zikang Feng, Xianze Ma, Peishan Cheng, Wenjuan Jiang, Jing Ba, Huihao Yu, Zeping Hu, Yuan Xu, Zhiwei Liu, He Wang, Zhenguo Lin, Ming Liu, Yanhong Meng

Main category: cs.CL

TL;DR: X-Intelligence 3.0 is a specialized 32B-parameter LLM for the semiconductor display industry, outperforming larger models like DeepSeek-R1-671B through domain-specific training and RAG.

Motivation: Current LLMs lack domain-specific expertise for the semiconductor display industry, limiting their effectiveness in solving its complex challenges.

Method: The model uses supervised fine-tuning, reinforcement learning, and a domain-specific RAG mechanism, supported by an automated evaluation framework.

Result: X-Intelligence 3.0 outperforms SOTA models like DeepSeek-R1-671B on benchmarks despite its smaller size.

Conclusion: The model addresses industry-specific reasoning challenges efficiently, proving its value as a specialized solution.

Abstract: Large language models (LLMs) have recently achieved significant advances in reasoning and demonstrated their advantages in solving challenging problems. Yet, their effectiveness in the semiconductor display industry remains limited due to a lack of domain-specific training and expertise. To bridge this gap, we present X-Intelligence 3.0, the first high-performance reasoning model specifically developed for the semiconductor display industry. This model is designed to deliver expert-level understanding and reasoning for the industry’s complex challenges. Leveraging a carefully curated industry knowledge base, the model undergoes supervised fine-tuning and reinforcement learning to enhance its reasoning and comprehension capabilities. To further accelerate development, we implemented an automated evaluation framework that simulates expert-level assessments. We also integrated a domain-specific retrieval-augmented generation (RAG) mechanism, resulting in notable performance gains on benchmark datasets. Despite its relatively compact size of 32 billion parameters, X-Intelligence 3.0 outperforms SOTA DeepSeek-R1-671B across multiple evaluations. This demonstrates its exceptional efficiency and establishes it as a powerful solution to the longstanding reasoning challenges faced by the semiconductor display industry.

[19] End-to-end Joint Punctuated and Normalized ASR with a Limited Amount of Punctuated Training Data

Can Cui, Imran Ahamad Sheikh, Mostafa Sadeghi, Emmanuel Vincent

Main category: cs.CL

TL;DR: The paper proposes two methods for joint punctuated and normalized ASR, improving performance with limited punctuated data.

Motivation: The lack of paired speech and punctuated text data in ASR corpora makes joint punctuated and normalized ASR challenging.

Method: Two approaches: 1) Using a language model to convert normalized transcripts into punctuated ones. 2) A single decoder conditioned on output type.

Result: The first method reduces PC-WER by 17% on out-of-domain data; the second reduces PC-WER by 42% (vs. Whisper-base) and normalized WER by 4%. The joint system works with as little as 5% punctuated training data.

Conclusion: The proposed methods effectively improve joint ASR performance with limited punctuated data, demonstrating feasibility even with minimal resources.

Abstract: Joint punctuated and normalized automatic speech recognition (ASR) aims at outputting transcripts with and without punctuation and casing. This task remains challenging due to the lack of paired speech and punctuated text data in most ASR corpora. We propose two approaches to train an end-to-end joint punctuated and normalized ASR system using limited punctuated data. The first approach uses a language model to convert normalized training transcripts into punctuated transcripts. This achieves a better performance on out-of-domain test data, with up to 17% relative Punctuation-Case-aware Word Error Rate (PC-WER) reduction. The second approach uses a single decoder conditioned on the type of output. This yields a 42% relative PC-WER reduction compared to Whisper-base and a 4% relative (normalized) WER reduction compared to the normalized output of a punctuated-only model. Additionally, our proposed model demonstrates the feasibility of a joint ASR system using as little as 5% punctuated training data with a moderate (2.42% absolute) PC-WER increase.

[20] XL-DURel: Finetuning Sentence Transformers for Ordinal Word-in-Context Classification

Sachin Yadav, Dominik Schlechtweg

Main category: cs.CL

TL;DR: XL-DURel is a multilingual Sentence Transformer model for ordinal Word-in-Context classification, outperforming previous models using angular distance-based ranking. It unifies binary and ordinal WiC tasks.

Motivation: To improve performance on ordinal and binary Word-in-Context (WiC) tasks by treating binary WiC as a special case of ordinal WiC.

Method: Finetuned multilingual Sentence Transformer model with ranking objectives based on angular distance in complex space.

Result: Outperforms previous models on ordinal and binary WiC tasks, showing improved performance on binary tasks when optimized for ordinal tasks.

Conclusion: XL-DURel enables a unified approach to WiC modeling, improving performance across task formulations.

Abstract: We propose XL-DURel, a finetuned, multilingual Sentence Transformer model optimized for ordinal Word-in-Context classification. We test several loss functions for regression and ranking tasks managing to outperform previous models on ordinal and binary data with a ranking objective based on angular distance in complex space. We further show that binary WiC can be treated as a special case of ordinal WiC and that optimizing models for the general ordinal task improves performance on the more specific binary task. This paves the way for a unified treatment of WiC modeling across different task formulations.

[21] Towards the Next Frontier in Speech Representation Learning Using Disentanglement

Varun Krishna, Sriram Ganapathy

Main category: cs.CL

TL;DR: Learn2Diss is a framework for self-supervised learning of speech representations, combining frame-level and utterance-level encoders to disentangle phonemic and speaker-specific features, achieving state-of-the-art results.

Motivation: Existing self-supervised learning methods focus on frame-level masked prediction, ignoring coarser speech factors like speaker or channel characteristics. Learn2Diss addresses this gap.

Method: The framework uses independent frame-level (pseudo-phonemic) and utterance-level (pseudo-speaker) encoders, later disentangled via mutual information.

Result: Learn2Diss achieves state-of-the-art performance, with frame-level encoder aiding semantic tasks and utterance-level encoder improving non-semantic tasks.

Conclusion: Learn2Diss effectively disentangles speech representations, enhancing performance across diverse downstream tasks.

Abstract: The popular frameworks for self-supervised learning of speech representations have largely focused on frame-level masked prediction of speech regions. While this has shown promising downstream task performance for speech recognition and related tasks, it has largely ignored factors of speech that are encoded at a coarser level, like characteristics of the speaker or channel that remain consistent throughout a speech utterance. In this work, we propose a framework for Learning Disentangled Self Supervised (termed as Learn2Diss) representations of speech, which consists of frame-level and utterance-level encoder modules. The two encoders are initially learned independently, where the frame-level model is largely inspired by existing self-supervision techniques, thereby learning pseudo-phonemic representations, while the utterance-level encoder is inspired by contrastive learning of pooled embeddings, thereby learning pseudo-speaker representations. The joint learning of these two modules consists of disentangling the two encoders using a mutual information based criterion. With several downstream evaluation experiments, we show that the proposed Learn2Diss achieves state-of-the-art results on a variety of tasks, with the frame-level encoder representations improving semantic tasks, while the utterance-level representations improve non-semantic tasks.

[22] Exploring Human-AI Complementarity in CPS Diagnosis Using Unimodal and Multimodal BERT Models

Kester Wong, Sahan Bulathwela, Mutlu Cukurova

Main category: cs.CL

TL;DR: The paper explores using multimodal BERT (AudiBERT) for detecting CPS indicators, showing significant improvements in social-cognitive dimensions but not affective ones. It emphasizes data size and human-AI collaboration for better performance.

Motivation: To enhance CPS diagnosis by integrating multimodal data (speech and acoustic-prosodic features) and improving human-AI collaboration.

Method: Uses AudiBERT, a multimodal BERT variant, and compares its performance with BERT on CPS indicator detection, analyzing statistical significance and correlations with data size and human coder agreement.

Result: AudiBERT improved sparse class classification and showed significant social-cognitive improvements over BERT, but not in affective dimensions. Data size and human coder agreement influenced model performance.

Conclusion: Proposes a structured approach for human-AI complementarity in CPS diagnosis, stressing model explainability to support human agency in coding.

Abstract: Detecting collaborative problem solving (CPS) indicators from dialogue using machine learning techniques is a significant challenge for the field of AI in Education. Recent studies have explored the use of Bidirectional Encoder Representations from Transformers (BERT) models on transcription data to reliably detect meaningful CPS indicators. A notable advancement involved the multimodal BERT variant, AudiBERT, which integrates speech and acoustic-prosodic audio features to enhance CPS diagnosis. Although initial results demonstrated multimodal improvements, the statistical significance of these enhancements remained unclear, and there was insufficient guidance on leveraging human-AI complementarity for CPS diagnosis tasks. This workshop paper extends the previous research by highlighting that the AudiBERT model not only improved the classification of classes that were sparse in the dataset, but it also had statistically significant class-wise improvements over the BERT model for classifications in the social-cognitive dimension. However, similar significant class-wise improvements over the BERT model were not observed for classifications in the affective dimension. A correlation analysis highlighted that larger training data was significantly associated with higher recall performance for both the AudiBERT and BERT models. Additionally, the precision of the BERT model was significantly associated with high inter-rater agreement among human coders. When employing the BERT model to diagnose indicators within these subskills that were well-detected by the AudiBERT model, the performance across all indicators was inconsistent. We conclude the paper by outlining a structured approach towards achieving human-AI complementarity for CPS diagnosis, highlighting the crucial inclusion of model explainability to support human agency and engagement in the reflective coding process.

[23] Explainable Collaborative Problem Solving Diagnosis with BERT using SHAP and its Implications for Teacher Adoption

Kester Wong, Sahan Bulathwela, Mutlu Cukurova

Main category: cs.CL

TL;DR: The paper explores BERT model explainability for CPS classification using SHAP, revealing that high performance doesn’t guarantee reasonable explanations and identifies spurious word contributions.

Motivation: Enhancing BERT-based CPS diagnostics' explainability to foster trust and adoption in education by understanding token contributions.

Method: Used SHAP to analyze tokenized words’ impact on BERT’s CPS classification decisions.

Result: Found well-performing classifications lacked reasonable explanations, with some spurious word contributions.

Conclusion: Suggests ensemble models and human-AI collaboration for better CPS diagnosis, as human reasoning is still crucial.

Abstract: The use of the Bidirectional Encoder Representations from Transformers (BERT) model and its variants for classifying collaborative problem solving (CPS) has been extensively explored within the AI in Education community. However, limited attention has been given to understanding how individual tokenised words in the dataset contribute to the model’s classification decisions. Enhancing the explainability of BERT-based CPS diagnostics is essential to better inform end users such as teachers, thereby fostering greater trust and facilitating wider adoption in education. This study undertook a preliminary step towards model transparency and explainability by using SHapley Additive exPlanations (SHAP) to examine how different tokenised words in transcription data contributed to a BERT model’s classification of CPS processes. The findings suggested that well-performing classifications did not necessarily equate to a reasonable explanation for the classification decisions. Particular tokenised words were used frequently to affect classifications. The analysis also identified a spurious word, which contributed positively to the classification but was not semantically meaningful to the class. While such model transparency is unlikely to be useful to an end user to improve their practice, it can help them not to over-rely on LLM diagnostics and ignore their human expertise. We conclude the workshop paper by noting that the extent to which the model appropriately uses the tokens for its classification is associated with the number of classes involved. It calls for an investigation into ensemble model architectures and the involvement of human-AI complementarity for CPS diagnosis, since considerable human reasoning is still required for fine-grained discrimination of CPS subskills.
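
The SHAP analysis follows the library's standard recipe for transformer text classifiers. In the sketch below a public sentiment model stands in for the CPS classifier, which the paper does not release.

```python
import shap
from transformers import pipeline

clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english",
               top_k=None)  # return scores for every class
explainer = shap.Explainer(clf)
shap_values = explainer(["We should split the task and compare our answers."])

# Per-token contributions to each class; large positive values flag the
# words driving a classification, including potentially spurious ones.
print(shap_values.values[0])
```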

[24] Backtranslation and paraphrasing in the LLM era? Comparing data augmentation methods for emotion classification

Łukasz Radliński, Mateusz Guściora, Jan Kocoń

Main category: cs.CL

TL;DR: The paper explores data augmentation in NLP using methods like paraphrasing and backtranslation with large language models (e.g., GPT), comparing them to generative methods. Findings show traditional methods can match or outperform few-shot generation.

Motivation: Address data scarcity and class imbalance in NLP tasks by evaluating traditional augmentation methods (paraphrasing, backtranslation) against newer generative approaches.

Method: Compare four data augmentation approaches (including backtranslation and paraphrasing) using ChatGPT and an exemplary dataset, assessing data quality and classification performance.

Result: Backtranslation and paraphrasing achieve comparable or superior results to zero/few-shot generative methods.

Conclusion: Traditional augmentation methods, when combined with modern models, can effectively address data scarcity without relying solely on generative techniques.

Abstract: Numerous domain-specific machine learning tasks struggle with data scarcity and class imbalance. This paper systematically explores data augmentation methods for NLP, particularly through large language models like GPT. The purpose of this paper is to examine and evaluate whether traditional methods such as paraphrasing and backtranslation can leverage a new generation of models to achieve comparable performance to purely generative methods. Methods aimed at solving the problem of data scarcity and utilizing ChatGPT were chosen, as well as an exemplary dataset. We conducted a series of experiments comparing four different approaches to data augmentation in multiple experimental setups. We then evaluated the results both in terms of the quality of generated data and its impact on classification performance. The key findings indicate that backtranslation and paraphrasing can yield comparable or even better results than zero-shot and few-shot generation of examples.
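
Backtranslation itself is easy to demonstrate with off-the-shelf MT models; the paper's experiments use ChatGPT, so the open models below are only a stand-in to show the mechanics.

```python
from transformers import pipeline

en_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
de_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def backtranslate(text: str) -> str:
    """English -> German -> English round trip yields a paraphrase-like variant."""
    german = en_de(text)[0]["translation_text"]
    return de_en(german)[0]["translation_text"]

print(backtranslate("I was thrilled when the package finally arrived."))
```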

[25] Retrieval-Augmented Clinical Benchmarking for Contextual Model Testing in Kenyan Primary Care: A Methodology Paper

Fred Mutisya, Shikoh Gitau, Christine Syovata, Diana Oigara, Ibrahim Matende, Muna Aden, Munira Ali, Ryan Nyotu, Diana Marion, Job Nyangena, Nasubo Ongoma, Keith Mbae, Elizabeth Wamicha, Eric Mibuari, Jean Philbert Nsengemana, Talkmore Chidede

Main category: cs.CL

TL;DR: The paper explores using LLMs in African primary care, creating a benchmark dataset for Kenyan clinical care using retrieval augmented generation (RAG) and involving local physicians for accuracy and cultural relevance. Results show LLMs perform worse in localized African scenarios compared to US benchmarks.

DetailsMotivation: To address the underexplored effectiveness of LLMs in African primary care and improve healthcare access in low-resource settings by aligning with local standards.

Method: Uses RAG to ground clinical questions in Kenya’s national guidelines, digitizes and indexes them, and generates clinical scenarios, questions, and answers in English and Swahili with physician input and expert review.

Result: Reveals significant performance gaps in LLMs for localized African medical content compared to US benchmarks.

Conclusion: Provides a replicable model for guideline-driven benchmarking to support safe AI deployment in African health systems.

Abstract: Large Language Models (LLMs) hold promise for improving healthcare access in low-resource settings, but their effectiveness in African primary care remains underexplored. We present a methodology for creating a benchmark dataset and evaluation framework focused on Kenyan Level 2 and 3 clinical care. Our approach uses retrieval augmented generation (RAG) to ground clinical questions in Kenya’s national guidelines, ensuring alignment with local standards. These guidelines were digitized, chunked, and indexed for semantic retrieval. Gemini Flash 2.0 Lite was then prompted with guideline excerpts to generate realistic clinical scenarios, multiple-choice questions, and rationale-based answers in English and Swahili. Kenyan physicians co-created and refined the dataset, and a blinded expert review process ensured clinical accuracy, clarity, and cultural appropriateness. The resulting Alama Health QA dataset includes thousands of regulator-aligned question-answer pairs across common outpatient conditions. Beyond accuracy, we introduce evaluation metrics that test clinical reasoning, safety, and adaptability, such as rare case detection (Needle in the Haystack), stepwise logic (Decision Points), and contextual adaptability. Initial results reveal significant performance gaps when LLMs are applied to localized scenarios, consistent with findings that LLM accuracy is lower on African medical content than on US-based benchmarks. This work offers a replicable model for guideline-driven, dynamic benchmarking to support safe AI deployment in African health systems.
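
The generation loop the methodology describes (retrieve guideline chunks, then prompt an LLM to write grounded items) has roughly the following shape. This is a schematic sketch: embed and call_llm stand in for the embedding model and LLM (the paper uses OpenAI's text-embedding-3-large and Gemini Flash 2.0 Lite), and the prompt wording is invented:

```python
# Guideline-grounded question generation, schematically.
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, chunks, k=3):
    # cosine similarity against pre-embedded guideline chunks
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec))
    return [chunks[i] for i in np.argsort(-sims)[:k]]

PROMPT = """Using ONLY the guideline excerpts below, write a realistic
clinical scenario, one multiple-choice question, and a rationale,
in English and in Swahili.

Excerpts:
{context}"""

def generate_item(condition, embed, call_llm, chunks, chunk_vecs):
    context = "\n---\n".join(top_k_chunks(embed(condition), chunk_vecs, chunks))
    return call_llm(PROMPT.format(context=context))  # then physician review
```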

[26] Preventing Rogue Agents Improves Multi-Agent Collaboration

Ohav Barbi, Ori Yoran, Mor Geva

Main category: cs.CL

TL;DR: Proposes monitoring agents in multi-agent systems to detect and intervene when rogue agents may cause system failure, showing performance gains in experiments.

DetailsMotivation: Multi-agent systems are prone to failure if a single agent acts incorrectly, highlighting the need for preemptive detection and intervention.

Method: Introduces monitoring during action prediction and interventions to prevent errors, tested in WhoDunitEnv, code generation, and GovSim environments.

Result: Achieves performance gains of 17.4%, 2.5%, and 20% in respective tasks, with monitors effectively identifying agent confusion.

Conclusion: The approach successfully prevents agent errors from propagating, enhancing system reliability.

Abstract: Multi-agent systems, where specialized agents collaborate to solve a shared task, hold great potential, from increased modularity to simulating complex environments. However, they also have a major caveat – a single agent can cause the entire system to fail. Consider a simple game where the knowledge to solve the task is distributed between agents, which share information in a communication channel. At each round, any of the agents can terminate the game and make the final prediction, even if they are uncertain about the outcome of their action. Detection of such rogue agents before they act may prevent the system’s failure. In this work, we propose to monitor agents during action prediction and intervene when a future error is likely to occur. To test our approach, we introduce WhoDunitEnv, a multi-agent collaboration environment that allows modular control over task complexity and communication structure. Experiments on WhoDunitEnv, code generation tasks, and the GovSim environment for resource sustainability show that our approach leads to substantial performance gains of up to 17.4%, 2.5%, and 20%, respectively. Thorough analysis shows that our monitors successfully identify critical points of agent confusion and our interventions effectively stop agent errors from propagating.
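
The proposed mechanism reduces to a monitor-then-intervene loop around each agent's next action. A schematic sketch (the agent, monitor, and threshold objects are placeholders for the paper's learned components):

```python
# Shape of the rogue-agent monitoring loop, schematically.
def run_round(agents, monitor, channel, threshold=0.5):
    for agent in agents:
        proposed = agent.propose_action(channel)        # message or final answer
        risk = monitor.failure_probability(agent, proposed, channel)
        if risk > threshold:
            # Intervene before the error propagates, e.g. block a premature
            # final prediction and ask the agent to keep communicating.
            channel.append(monitor.intervention_message(agent))
        else:
            channel.append(proposed)
```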

[27] Linear Relational Decoding of Morphology in Language Models

Eric Xia, Jugal Kalita

Main category: cs.CL

TL;DR: A two-part affine approximation reproduces transformer computations for certain subject-object relations, achieving 90% faithfulness on morphological relations.

DetailsMotivation: To explore interpretability of conceptual relationships in language models, such as morphology, from latent space.

Method: Adapting the Bigger Analogy Test Set, using linear transformation Ws (where s is a middle layer representation and W is derived from model derivatives) to reproduce final object states.

Result: The linear technique achieves 90% faithfulness on morphological relations, with similar results across languages and models.

Conclusion: Some conceptual relationships in language models are interpretable and sparsely encoded by cross-layer linear transformations.

Abstract: A two-part affine approximation has been found to be a good approximation for transformer computations over certain subject-object relations. Adapting the Bigger Analogy Test Set, we show that the linear transformation Ws, where s is a middle-layer representation of a subject token and W is derived from model derivatives, is also able to accurately reproduce final object states for many relations. This linear technique achieves 90% faithfulness on morphological relations, and we show similar findings multilingually and across models. Our findings indicate that some conceptual relationships in language models, such as morphology, are readily interpretable from latent space, and are sparsely encoded by cross-layer linear transformations.
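
The technique itself is a single affine map applied to a hidden state. Schematically, in torch (W and b are assumed to be precomputed, relation-specific matrices estimated from model derivatives, as the abstract describes):

```python
import torch

def predict_object_state(s: torch.Tensor, W: torch.Tensor, b: torch.Tensor):
    # s: middle-layer hidden state of the subject token, shape (d,)
    # W, b: relation-specific affine map estimated from model Jacobians
    return W @ s + b  # approximates the final-layer object representation

# Faithfulness is then the rate at which decoding this prediction yields the
# correct surface form, e.g. mapping "run" to "running" for a gerund relation.
```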

[28] Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLMs

Minsuh Joo, Hyunsoo Cho

Main category: cs.CL

TL;DR: The paper introduces Cleanse, a clustering-based method to estimate uncertainty in LLM responses to detect hallucinations, validated on models like LLaMA and Mistral.

DetailsMotivation: Hallucinations in LLMs undermine reliability, and uncertainty estimation is key to distinguishing accurate responses.

Method: Cleanse uses clustering to measure semantic consistency in hidden embeddings of LLM responses.

Result: Validated on LLaMA and Mistral models using SQuAD and CoQA benchmarks, Cleanse effectively detects hallucinations.

Conclusion: Cleanse provides a reliable approach to quantify uncertainty and improve LLM safety.

Abstract: Despite the outstanding performance of large language models (LLMs) across various NLP tasks, hallucination in LLMs–where an LLM generates inaccurate responses–remains a critical problem, as it bears directly on building safe and reliable LLMs. Uncertainty estimation is primarily used to measure the hallucination level of LLM responses so that correct and incorrect answers can be clearly distinguished. This study proposes an effective uncertainty estimation approach, Clustering-based semantic consistency (Cleanse). By employing clustering, Cleanse quantifies uncertainty as the proportion of intra-cluster consistency within the total consistency between LLM hidden embeddings, which carry adequate semantic information about the generations. The effectiveness of Cleanse for detecting hallucination is validated using four off-the-shelf models–LLaMA-7B, LLaMA-13B, LLaMA2-7B, and Mistral-7B–and two question-answering benchmarks, SQuAD and CoQA.
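
In outline, Cleanse samples several generations, embeds them, clusters the embeddings, and scores uncertainty as one minus the share of pairwise similarity that stays within clusters. A sketch under assumed choices (the clustering algorithm and similarity measure are ours; the paper's exact definitions may differ):

```python
# Clustering-based semantic consistency over sampled generations, schematically.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

def cleanse_uncertainty(embs: np.ndarray, n_clusters: int = 2) -> float:
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(embs)
    sim = cosine_similarity(embs)
    iu = np.triu_indices_from(sim, k=1)              # unique generation pairs
    pair_sims = sim[iu]
    same_cluster = labels[iu[0]] == labels[iu[1]]
    consistency = pair_sims[same_cluster].sum() / pair_sims.sum()
    return 1.0 - consistency   # high when generations disagree semantically
```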

[29] Mangosteen: An Open Thai Corpus for Language Model Pretraining

Wannaphong Phatthiyaphaibun, Can Udomcharoenchaikit, Pakpoom Singkorapoom, Kunat Pipatanakul, Ekapol Chuangsuwanich, Peerat Limkonchotiwat, Sarana Nutanong

Main category: cs.CL

TL;DR: Mangosteen is a 47B-token Thai corpus built with a Thai-adapted pipeline, improving language model performance on Thai benchmarks and ensuring reproducibility.

DetailsMotivation: Existing corpora lack Thai-specific cleaning, risking harmful content and hindering reproducibility. Mangosteen addresses this gap.

Method: Adapted Dolma pipeline with custom language ID, quality filters, and Thai-trained content filters, plus curated non-web sources.

Result: Pipeline reduces CommonCrawl docs from 202M to 25M, improves SEA-HELM NLG from 3 to 11, and enhances model performance on Thai benchmarks.

Conclusion: Mangosteen provides a transparent, high-quality Thai corpus and reproducible pipeline, advancing Thai and regional LLM research.

Abstract: Pre-training data shapes a language model’s quality, but raw web text is noisy and demands careful cleaning. Existing large-scale corpora rely on English-centric or language-agnostic pipelines whose heuristics do not capture Thai script or cultural nuances, leaving risky material such as gambling content untreated. Prior Thai-specific efforts customize pipelines or build new ones, yet seldom release their data or document design choices, hindering reproducibility and raising the question of how to construct a transparent, high-quality Thai corpus. We introduce Mangosteen: a 47 billion-token Thai corpus built through a Thai-adapted Dolma pipeline that includes custom rule-based language ID, revised C4/Gopher quality filters, and Thai-trained content filters, plus curated non-web sources such as Wikipedia, Royal Gazette texts, OCR-extracted books, and CC-licensed YouTube subtitles. Systematic ablations using GPT-2 show the pipeline trims CommonCrawl from 202M to 25M documents while raising SEA-HELM NLG from 3 to 11; an 8B-parameter SEA-LION model continually pre-trained on Mangosteen then surpasses SEA-LION-v3 and Llama-3.1 by about four points on Thai benchmarks. We release the full pipeline code, cleaning manifests, corpus snapshot, and all checkpoints, providing a fully reproducible foundation for future Thai and regional LLM research.

[30] Large Language Models as Medical Codes Selectors: a benchmark using the International Classification of Primary Care

Vinicius Anjos de Almeida, Vinicius de Camargo, Raquel Gómez-Bravo, Egbert van der Haring, Kees van Boven, Marcelo Finger, Luis Fernandez Lopez

Main category: cs.CL

TL;DR: LLMs demonstrate strong potential for automating ICPC-2 coding without fine-tuning, with top models achieving high F1-scores. Challenges include dataset limitations and the need for broader evaluations.

DetailsMotivation: To assess the feasibility of using LLMs for automating medical coding (ICPC-2) to improve efficiency in healthcare data processing.

Method: Used a dataset of 437 clinical expressions annotated with ICPC-2 codes. A semantic search engine retrieved candidates, and 33 LLMs were prompted to select the best-matching code. Performance was evaluated using F1-score, token usage, cost, response time, and format adherence.

Result: 28 models achieved F1-score > 0.8; 10 exceeded 0.85. Top performers included gpt-4.5-preview and gemini-2.5-pro. Retriever optimization improved performance by up to 4 points. Smaller models struggled with formatting and input length.

Conclusion: LLMs are promising for ICPC-2 coding automation, but broader, multilingual, and end-to-end evaluations are needed for clinical validation.

Abstract: Background: Medical coding structures healthcare data for research, quality monitoring, and policy. This study assesses the potential of large language models (LLMs) to assign ICPC-2 codes using the output of a domain-specific search engine. Methods: A dataset of 437 Brazilian Portuguese clinical expressions, each annotated with ICPC-2 codes, was used. A semantic search engine (OpenAI’s text-embedding-3-large) retrieved candidates from 73,563 labeled concepts. Thirty-three LLMs were prompted with each query and retrieved results to select the best-matching ICPC-2 code. Performance was evaluated using F1-score, along with token usage, cost, response time, and format adherence. Results: Twenty-eight models achieved F1-score > 0.8; ten exceeded 0.85. Top performers included gpt-4.5-preview, o3, and gemini-2.5-pro. Retriever optimization can improve performance by up to 4 points. Most models returned valid codes in the expected format, with reduced hallucinations. Smaller models (<3B) struggled with formatting and input length. Conclusions: LLMs show strong potential for automating ICPC-2 coding, even without fine-tuning. This work offers a benchmark and highlights challenges, but findings are limited by dataset scope and setup. Broader, multilingual, end-to-end evaluations are needed for clinical validation.
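
The evaluated pipeline is retrieve-then-select: a semantic search engine proposes candidate codes and the LLM picks one. A schematic sketch (embed and call_llm are placeholders; the regex check mirrors the paper's concern with format adherence):

```python
# Retrieve-then-select ICPC-2 coding, schematically.
import re
import numpy as np

def assign_icpc2(expression, embed, call_llm, code_vecs, codes, k=10):
    q = embed(expression)
    sims = code_vecs @ q / (np.linalg.norm(code_vecs, axis=1) * np.linalg.norm(q))
    candidates = [codes[i] for i in np.argsort(-sims)[:k]]
    prompt = (f"Clinical expression: {expression}\n"
              f"Candidate ICPC-2 codes: {candidates}\n"
              "Return only the single best-matching code.")
    answer = call_llm(prompt)
    match = re.search(r"\b[A-Z]\d{2}\b", answer)   # ICPC-2 codes look like "K86"
    return match.group(0) if match else None       # None = format failure
```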

[31] MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization

Xingxuan Li, Yao Xiao, Dianwen Ng, Hai Ye, Yue Deng, Xiang Lin, Bin Wang, Zhanfeng Mo, Chong Zhang, Yueyi Zhang, Zonglin Yang, Ruilin Li, Lei Lei, Shihao Xu, Han Zhao, Weiling Chen, Feng Ji, Lidong Bing

Main category: cs.CL

TL;DR: The paper introduces MiroMind-M1, a fully open-source reasoning language model (RLM) series, addressing transparency and reproducibility gaps in existing RLMs by releasing models, datasets, and training configurations.

DetailsMotivation: To enhance transparency and reproducibility in RLM development, as current closed-source and open-source models lack critical resources like datasets and training details.

Method: Two-stage training: SFT on 719K math-reasoning problems with verified CoT trajectories, followed by RLVR on 62K problems, using Context-Aware Multi-Stage Policy Optimization for robust RL training.

Result: State-of-the-art or competitive performance on AIME24, AIME25, and MATH benchmarks, with superior token efficiency for Qwen-2.5-based 7B and 32B models.

Conclusion: The release of MiroMind-M1 models, datasets, and configurations aims to support research and community advancement in RLMs.

Abstract: Large language models have recently evolved from fluent text generation to advanced reasoning across diverse domains, giving rise to reasoning language models. Among these domains, mathematical reasoning serves as a representative benchmark as it requires precise multi-step logic and abstract reasoning, which can be generalized to other tasks. While closed-source RLMs such as GPT-o3 demonstrate impressive reasoning capabilities, their proprietary nature limits transparency and reproducibility. Although many open-source projects aim to close this gap, most of them lack sufficient openness by omitting critical resources such as datasets and detailed training configurations, which hinders reproducibility. To contribute toward greater transparency in RLM development, we introduce the MiroMind-M1 series, a set of fully open-source RLMs built on the Qwen-2.5 backbone that match or exceed the performance of existing open-source RLMs. Specifically, our models are trained in two stages: SFT on a carefully curated corpus of 719K math-reasoning problems with verified CoT trajectories, followed by RLVR on 62K challenging and verifiable problems. To enhance the robustness and efficiency of the RLVR process, we introduce Context-Aware Multi-Stage Policy Optimization, an algorithm that integrates length-progressive training with an adaptive repetition penalty to encourage context-aware RL training. Our model achieves state-of-the-art or competitive performance and superior token efficiency among Qwen-2.5-based open-source 7B and 32B models on the AIME24, AIME25, and MATH benchmarks. To facilitate reproducibility, we release the complete stack: models (MiroMind-M1-SFT-7B, MiroMind-M1-RL-7B, MiroMind-M1-RL-32B); datasets (MiroMind-M1-SFT-719K, MiroMind-M1-RL-62K); and all training and evaluation configurations. We hope these resources will support further research and foster community advancement.

[32] Mind the Gap: A Review of Arabic Post-Training Datasets and Their Limitations

Mohammed Alkhowaiter, Norah Alshahrani, Saied Alshahrani, Reem I. Masoud, Alaa Alzahrani, Deema Alnuhait, Emad A. Alghamdi, Khalid Almubarak

Main category: cs.CL

TL;DR: The paper reviews Arabic post-training datasets on Hugging Face Hub, highlighting gaps in task diversity, documentation, and adoption, and offers recommendations for improvement.

DetailsMotivation: To assess and improve the quality and diversity of Arabic post-training datasets for better alignment of LLMs with human instructions.

Method: Evaluation of datasets along four dimensions: LLM Capabilities, Steerability, Alignment, and Robustness, based on popularity, adoption, recency, documentation, licensing, and scientific contribution.

Result: Identified gaps include limited task diversity, poor documentation, and low community adoption.

Conclusion: The findings highlight the need for better Arabic post-training datasets and provide actionable recommendations for future development.

Abstract: Post-training has emerged as a crucial technique for aligning pre-trained Large Language Models (LLMs) with human instructions, significantly enhancing their performance across a wide range of tasks. Central to this process is the quality and diversity of post-training datasets. This paper presents a review of publicly available Arabic post-training datasets on the Hugging Face Hub, organized along four key dimensions: (1) LLM Capabilities (e.g., Question Answering, Translation, Reasoning, Summarization, Dialogue, Code Generation, and Function Calling); (2) Steerability (e.g., persona and system prompts); (3) Alignment (e.g., cultural, safety, ethics, and fairness), and (4) Robustness. Each dataset is rigorously evaluated based on popularity, practical adoption, recency and maintenance, documentation and annotation quality, licensing transparency, and scientific contribution. Our review revealed critical gaps in the development of Arabic post-training datasets, including limited task diversity, inconsistent or missing documentation and annotation, and low adoption across the community. Finally, the paper discusses the implications of these gaps on the progress of Arabic LLMs and applications while providing concrete recommendations for future efforts in post-training dataset development.

[33] Rethinking Suicidal Ideation Detection: A Trustworthy Annotation Framework and Cross-Lingual Model Evaluation

Amina Dzafic, Merve Kavut, Ulya Bayram

Main category: cs.CL

TL;DR: The paper addresses challenges in suicidal ideation detection, focusing on language diversity and annotation reliability, by creating a Turkish corpus and evaluating label and model consistency across datasets.

DetailsMotivation: Limited language coverage and unreliable annotation practices hinder progress in suicidal ideation detection, especially in non-English languages.

Method: Constructed a Turkish suicidal ideation corpus with a resource-efficient annotation framework, evaluated label reliability and model consistency across datasets using transfer learning.

Result: Highlighted the need for rigorous, language-inclusive annotation and evaluation, showing poor zero-shot transfer learning performance of popular models.

Conclusion: Advocates for transparency in dataset construction and model training in mental health NLP, emphasizing reliability.

Abstract: Suicidal ideation detection is critical for real-time suicide prevention, yet its progress faces two under-explored challenges: limited language coverage and unreliable annotation practices. Most available datasets are in English, but even among these, high-quality, human-annotated data remains scarce. As a result, many studies rely on available pre-labeled datasets without examining their annotation process or label reliability. The lack of datasets in other languages further limits the global realization of suicide prevention via artificial intelligence (AI). In this study, we address one of these gaps by constructing a novel Turkish suicidal ideation corpus derived from social media posts and introducing a resource-efficient annotation framework involving three human annotators and two large language models (LLMs). We then address the remaining gaps by performing a bidirectional evaluation of label reliability and model consistency across this dataset and three popular English suicidal ideation detection datasets, using transfer learning through eight pre-trained sentiment and emotion classifiers. These transformers help assess annotation consistency and benchmark model performance against manually labeled data. Our findings underscore the need for more rigorous, language-inclusive approaches to annotation and evaluation in mental health natural language processing (NLP) while demonstrating the questionable performance of popular models with zero-shot transfer learning. We advocate for transparency in model training and dataset construction in mental health NLP, prioritizing data and model reliability.

[34] Disparities in Peer Review Tone and the Role of Reviewer Anonymity

Maria Sahakyan, Bedoor AlShebli

Main category: cs.CL

TL;DR: The study analyzes linguistic biases in peer review using NLP and statistical modeling on 80,000 reviews, revealing disparities tied to author demographics and reviewer anonymity.

DetailsMotivation: To uncover subtle linguistic biases in peer review and their impact on fairness, given the lack of attention to language disparities.

Method: Natural language processing and large-scale statistical modeling on 80,000 reviews from two major journals, comparing anonymous and signed reviews.

Result: Reveals variations in review tone, sentiment, and supportive language based on author demographics and reviewer anonymity, challenging assumptions about fairness.

Conclusion: The findings highlight hidden biases in peer review, urging reconsideration of review policies to ensure equity in scientific careers and progress.

Abstract: The peer review process is often regarded as the gatekeeper of scientific integrity, yet increasing evidence suggests that it is not immune to bias. Although structural inequities in peer review have been widely debated, much less attention has been paid to the subtle ways in which language itself may reinforce disparities. This study undertakes one of the most comprehensive linguistic analyses of peer review to date, examining more than 80,000 reviews in two major journals. Using natural language processing and large-scale statistical modeling, it uncovers how review tone, sentiment, and supportive language vary across author demographics, including gender, race, and institutional affiliation. Using a data set that includes both anonymous and signed reviews, this research also reveals how the disclosure of reviewer identity shapes the language of evaluation. The findings not only expose hidden biases in peer feedback, but also challenge conventional assumptions about anonymity’s role in fairness. As academic publishing grapples with reform, these insights raise critical questions about how review policies shape career trajectories and scientific progress.

[35] On the robustness of modeling grounded word learning through a child’s egocentric input

Wai Keen Vong, Brenden M. Lake

Main category: cs.CL

TL;DR: Neural networks trained on child-like input can learn word-referent mappings, validated across multiple children’s data and network architectures.

DetailsMotivation: To bridge the gap between machine learning models and human language acquisition by testing if models can learn robustly from limited, child-like input.

Method: Used automated speech transcription on the SAYCam dataset (500+ hours of video from 3 children) to create multimodal datasets, testing various neural network configurations.

Result: Networks successfully learned and generalized word-referent mappings across architectures, showing robustness but also individual learning differences.

Conclusion: Multimodal neural networks are robust for grounded word learning, though individual developmental experiences influence learning patterns.

Abstract: What insights can machine learning bring to understanding human language acquisition? Large language and multimodal models have achieved remarkable capabilities, but their reliance on massive training datasets creates a fundamental mismatch with children, who succeed in acquiring language from comparatively limited input. To help bridge this gap, researchers have increasingly trained neural networks using data similar in quantity and quality to children’s input. Taking this approach to the limit, Vong et al. (2024) showed that a multimodal neural network trained on 61 hours of visual and linguistic input extracted from just one child’s developmental experience could acquire word-referent mappings. However, whether this approach’s success reflects the idiosyncrasies of a single child’s experience, or whether it would show consistent and robust learning patterns across multiple children’s experiences was not explored. In this article, we applied automated speech transcription methods to the entirety of the SAYCam dataset, consisting of over 500 hours of video data spread across all three children. Using these automated transcriptions, we generated multi-modal vision-and-language datasets for both training and evaluation, and explored a range of neural network configurations to examine the robustness of simulated word learning. Our findings demonstrate that networks trained on automatically transcribed data from each child can acquire and generalize word-referent mappings across multiple network architectures. These results validate the robustness of multimodal neural networks for grounded word learning, while highlighting the individual differences that emerge in how models learn when trained on each child’s developmental experiences.

[36] GRACE: Generative Recommendation via Journey-Aware Sparse Attention on Chain-of-Thought Tokenization

Luyi Ma, Wanjia Zhang, Kai Zhao, Abhishek Kulkarni, Lalitesh Morishetti, Anjana Ganesh, Ashish Ranjan, Aashika Padmanabhan, Jianpeng Xu, Jason Cho, Praveen Kanumala, Kaushiki Nag, Sumit Dutta, Kamiya Motwani, Malay Patel, Evren Korpeoglu, Sushant Kumar, Kannan Achan

Main category: cs.CL

TL;DR: GRACE improves multi-behavior recommendation by combining Chain-of-Thought tokenization and sparse attention, achieving significant performance gains and computational efficiency.

DetailsMotivation: Existing generative models for multi-behavior recommendation lack interpretability, efficiency, and multi-scale modeling, hindering their adoption.

Method: GRACE uses hybrid Chain-of-Thought tokenization with explicit attributes and Journey-Aware Sparse Attention for efficient, interpretable recommendations.

Result: GRACE outperforms baselines by up to +106.9% HR@10 and reduces attention computation by 48%.

Conclusion: GRACE addresses key limitations in generative recommendation, offering interpretability, efficiency, and superior performance.

Abstract: Generative models have recently demonstrated strong potential in multi-behavior recommendation systems, leveraging the expressive power of transformers and tokenization to generate personalized item sequences. However, their adoption is hindered by (1) the lack of explicit information for token reasoning, (2) high computational costs due to quadratic attention complexity and dense sequence representations after tokenization, and (3) limited multi-scale modeling over user history. In this work, we propose GRACE (Generative Recommendation via journey-aware sparse Attention on Chain-of-thought tokEnization), a novel generative framework for multi-behavior sequential recommendation. GRACE introduces a hybrid Chain-of-Thought (CoT) tokenization method that encodes user-item interactions with explicit attributes from product knowledge graphs (e.g., category, brand, price) over semantic tokenization, enabling interpretable and behavior-aligned generation. To address the inefficiency of standard attention, we design a Journey-Aware Sparse Attention (JSA) mechanism, which selectively attends to compressed, intra-, inter-, and current-context segments in the tokenized sequence. Experiments on two real-world datasets show that GRACE significantly outperforms state-of-the-art baselines, achieving up to +106.9% HR@10 and +106.7% NDCG@10 improvement over the state-of-the-art baseline on the Home domain, and +22.1% HR@10 on the Electronics domain. GRACE also reduces attention computation by up to 48% with long sequences.

[37] FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing

Shoutao Guo, Shaolei Zhang, Qingkai Fang, Zhengrui Ma, Min Zhang, Yang Feng

Main category: cs.CL

TL;DR: FastLongSpeech is a framework for efficient long-speech processing in LSLMs without needing long-speech training data, using iterative fusion and dynamic compression training.

DetailsMotivation: Existing LSLMs focus on short-speech tasks, leaving long-form speech processing underexplored due to data scarcity and high computational costs.

Method: Introduces FastLongSpeech with iterative fusion for sequence compression and dynamic compression training to adapt LSLMs for long-speech tasks.

Result: Demonstrates strong performance in long- and short-speech tasks while improving inference efficiency.

Conclusion: FastLongSpeech effectively bridges the gap in long-speech processing for LSLMs without requiring dedicated training data.

Abstract: The rapid advancement of Large Language Models (LLMs) has spurred significant progress in Large Speech-Language Models (LSLMs), enhancing their capabilities in both speech understanding and generation. While existing LSLMs often concentrate on augmenting speech generation or tackling a diverse array of short-speech tasks, the efficient processing of long-form speech remains a critical yet underexplored challenge. This gap is primarily attributed to the scarcity of long-speech training datasets and the high computational costs associated with long sequences. To address these limitations, we introduce FastLongSpeech, a novel framework designed to extend LSLM capabilities for efficient long-speech processing without necessitating dedicated long-speech training data. FastLongSpeech incorporates an iterative fusion strategy that can compress excessively long-speech sequences into manageable lengths. To adapt LSLMs for long-speech inputs, it introduces a dynamic compression training approach, which exposes the model to short-speech sequences at varying compression ratios, thereby transferring the capabilities of LSLMs to long-speech tasks. To assess the long-speech capabilities of LSLMs, we develop a long-speech understanding benchmark called LongSpeech-Eval. Experiments show that our method exhibits strong performance in both long-speech and short-speech tasks, while greatly improving inference efficiency.
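
One way to picture the iterative fusion strategy: repeatedly merge the most redundant adjacent pair of speech frames until the sequence fits a target length. The paper's exact fusion rule may differ; this sketch only shows the general shape:

```python
import torch
import torch.nn.functional as F

def iterative_fusion(frames: torch.Tensor, target_len: int) -> torch.Tensor:
    # frames: (T, d) speech features; merge until T <= target_len
    frames = frames.clone()
    while frames.size(0) > target_len:
        sims = F.cosine_similarity(frames[:-1], frames[1:], dim=-1)
        i = int(sims.argmax())                    # most redundant adjacent pair
        fused = (frames[i] + frames[i + 1]) / 2   # average the pair
        frames = torch.cat([frames[:i], fused[None], frames[i + 2:]])
    return frames
```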

[38] Doc2Chart: Intent-Driven Zero-Shot Chart Generation from Documents

Akriti Jain, Pritika Ramu, Aparna Garimella, Apoorv Saxena

Main category: cs.CL

TL;DR: The paper introduces intent-based chart generation from documents using a two-staged LLM framework, outperforming baselines in accuracy and chart type selection.

DetailsMotivation: Existing methods for text-to-visualization struggle with real-world use cases where users provide intents for chart generation from long documents without pre-selecting content.

Method: An unsupervised two-staged framework: (1) LLM extracts and refines data from documents based on intent, (2) heuristic-guided module selects chart type and generates code.

Result: The method outperforms baselines by up to 9 points in data accuracy and 17 points in chart type selection, validated on a curated dataset of 1,242 tuples.

Conclusion: The proposed framework effectively addresses intent-based chart generation from documents, demonstrating superior performance over existing methods.

Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in transforming text descriptions or tables to data visualizations via instruction-tuning methods. However, it is not straightforward to apply these methods directly for a more real-world use case of visualizing data from long documents based on user-given intents, as opposed to the user pre-selecting the relevant content manually. We introduce the task of intent-based chart generation from documents: given a user-specified intent and document(s), the goal is to generate a chart adhering to the intent and grounded on the document(s) in a zero-shot setting. We propose an unsupervised, two-staged framework in which an LLM first extracts relevant information from the document(s) by decomposing the intent and iteratively validates and refines this data. Next, a heuristic-guided module selects an appropriate chart type before final code generation. To assess the data accuracy of the generated charts, we propose an attribution-based metric that uses a structured textual representation of charts, instead of relying on visual decoding metrics that often fail to capture the chart data effectively. To validate our approach, we curate a dataset comprising 1,242 <intent, document, chart> tuples from two domains, finance and scientific, in contrast to the existing datasets that are largely limited to parallel text descriptions/tables and their corresponding charts. We compare our approach with baselines using single-shot chart generation using LLMs and query-based retrieval methods; our method outperforms the best baselines by up to 9 points in chart data accuracy and 17 points in chart type selection.
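
The second stage's heuristic chart-type module can be pictured as a small rule table over the shape of the extracted data. The rules below are illustrative only, not the paper's actual heuristics:

```python
# Toy heuristic chart-type selection over stage-one extracted data.
def select_chart_type(data: dict) -> str:
    x, series = data["x"], data["series"]
    if all(isinstance(v, (int, float)) for v in x):
        return "line"           # numeric or temporal x-axis: show a trend
    if len(series) == 1 and len(x) <= 6:
        return "pie"            # one small categorical series: show shares
    return "grouped_bar"        # otherwise: categorical comparison

print(select_chart_type({"x": ["Q1", "Q2", "Q3"], "series": [[1.2, 1.5, 1.4]]}))
```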

[39] Beyond Isolated Capabilities: Bridging Long CoT Reasoning and Long-Context Understanding

Yifei Wang

Main category: cs.CL

TL;DR: Reasoning distillation improves smaller models’ long-context understanding, addressing the ’lost in the middle’ issue in Retrieval-Augmented Generation systems.

DetailsMotivation: To explore how large-scale reasoning distillation affects in-context retrieval and reasoning, crucial for RAG systems.

Method: Comprehensive investigation using open-source models distilled from Deepseek-R1, evaluated through multi-document QA tasks.

Result: Distilled reasoning patterns enhance long-context comprehension by promoting detailed reasoning during context analysis.

Conclusion: Reasoning distillation significantly improves long-context awareness and mitigates the ’lost in the middle’ problem.

Abstract: Reasoning distillation has emerged as an effective approach to enhance the reasoning capabilities of smaller language models. However, the impact of large-scale reasoning distillation on other critical abilities, particularly in-context retrieval and reasoning, remains unexplored. This gap in understanding is particularly significant given the increasing importance of Retrieval-Augmented Generation (RAG) systems, where efficient acquisition and utilization of contextual information are paramount for generating reliable responses. Motivated by the need to understand how the extended long-CoT process influences long-context comprehension, we conduct a comprehensive investigation using a series of open-source models distilled from Deepseek-R1, renowned for its exceptional reasoning capabilities. Our study focuses on evaluating these models’ performance in extracting and integrating relevant information from extended contexts through multi-document question and answering tasks. Through rigorous experimentation, we demonstrate that distilled reasoning patterns significantly improve long-context understanding. Our analysis reveals that distillation fosters greater long-context awareness by promoting more detailed and explicit reasoning processes during context analysis and information parsing. This advancement effectively mitigates the persistent “lost in the middle” issue that has hindered long-context models.

[40] Tiny language models

Ronit D. Gross, Yarden Tzach, Tal Halevi, Ella Koresh, Ido Kanter

Main category: cs.CL

TL;DR: The paper explores whether tiny language models (TLMs) share key features of large language models (LLMs), finding that pre-trained TLMs outperform non-pre-trained ones, with performance scaling with dataset size and token overlap.

DetailsMotivation: The high computational cost of LLMs limits research participation, creating a need for accessible alternatives like TLMs.

Method: Pre-training BERT-6 and BERT-1 variants on Wikipedia subsets and evaluating on FewRel, AGNews, and DBPedia tasks.

Result: Pre-trained TLMs show a performance gap over non-pre-trained models, replicable via a soft committee of shallow architectures.

Conclusion: TLMs offer a viable, efficient alternative to LLMs, with potential insights into NLP mechanisms and language development.

Abstract: A prominent achievement of natural language processing (NLP) is its ability to understand and generate meaningful human language. This capability relies on complex feedforward transformer-block architectures pre-trained as large language models (LLMs). However, LLM pre-training is currently feasible only for a few dominant companies due to the immense computational resources required, limiting broader research participation. This creates a critical need for more accessible alternatives. In this study, we explore whether tiny language models (TLMs) exhibit the same key qualitative features of LLMs. We demonstrate that TLMs exhibit a clear performance gap between pre-trained and non-pre-trained models across classification tasks, indicating the effectiveness of pre-training, even at a tiny scale. The performance gap increases with the size of the pre-training dataset and with greater overlap between tokens in the pre-training and classification datasets. Furthermore, the classification accuracy achieved by a pre-trained deep TLM architecture can be replicated through a soft committee of multiple, independently pre-trained shallow architectures, enabling low-latency TLMs without affecting classification accuracy. Our results are based on pre-training BERT-6 and variants of BERT-1 on subsets of the Wikipedia dataset and evaluating their performance on FewRel, AGNews, and DBPedia classification tasks. Future research on TLMs is expected to further illuminate the mechanisms underlying NLP, especially given that these biologically inspired models suggest TLMs may be sufficient for children or adolescents to develop language.
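
The soft committee finding has a simple operational form: average the class probabilities of several independently pre-trained shallow models. A schematic sketch assuming Hugging Face-style sequence classifiers (the model list is a placeholder):

```python
import torch

def committee_predict(models, input_ids, attention_mask):
    # Average softmax outputs of independently pre-trained shallow TLMs.
    probs = [torch.softmax(m(input_ids=input_ids,
                             attention_mask=attention_mask).logits, dim=-1)
             for m in models]
    return torch.stack(probs).mean(dim=0).argmax(dim=-1)  # committee vote
```

Because the members are shallow and independent, they can run in parallel, which is what makes the low-latency claim possible without a loss in classification accuracy.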

[41] MEKiT: Multi-source Heterogeneous Knowledge Injection Method via Instruction Tuning for Emotion-Cause Pair Extraction

Shiyi Mu, Yongkang Liu, Shi Feng, Xiaocui Yang, Daling Wang, Yifei Zhang

Main category: cs.CL

TL;DR: MEKiT improves LLMs’ performance on the ECPE task by injecting multi-source heterogeneous knowledge, addressing their lack of auxiliary knowledge for emotion-cause reasoning.

DetailsMotivation: LLMs underperform in the ECPE task due to insufficient auxiliary knowledge for emotion perception and cause reasoning.

Method: MEKiT integrates internal emotional and external causal knowledge using instruction templates and mixed data for instruction-tuning.

Result: MEKiT outperforms baselines, significantly enhancing LLMs’ performance on the ECPE task.

Conclusion: MEKiT is an effective and adaptable solution for improving LLMs’ reasoning in the ECPE task.

Abstract: Although large language models (LLMs) excel in text comprehension and generation, their performance on the Emotion-Cause Pair Extraction (ECPE) task, which requires reasoning ability, often falls short of that of smaller language models. The main reason is the lack of auxiliary knowledge, which limits LLMs’ ability to effectively perceive emotions and reason about causes. To address this issue, we propose a novel Multi-source hEterogeneous Knowledge injection meThod, MEKiT, which integrates heterogeneous internal emotional knowledge and external causal knowledge. Specifically, for these two distinct aspects and structures of knowledge, we apply the approaches of incorporating instruction templates and mixing data for instruction-tuning, which respectively facilitate LLMs in more comprehensively identifying emotions and accurately reasoning about causes. Experimental results demonstrate that MEKiT provides a more effective and adaptable solution for the ECPE task, exhibiting an absolute performance advantage over the compared baselines and dramatically improving the performance of LLMs on the ECPE task.

[42] Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs

Boyi Deng, Yu Wan, Baosong Yang, Fei Huang, Wenjie Wang, Fuli Feng

Main category: cs.CL

TL;DR: SASFT reduces unexpected code-switching in LLMs by over 50%, eliminating it in four cases, while maintaining multilingual performance.

DetailsMotivation: LLMs exhibit unexpected code-switching, degrading readability and usability, but existing solutions lack mechanistic analysis and effectiveness.

Method: Uses sparse autoencoders to analyze language feature pre-activation, then proposes SASFT for supervised fine-tuning to control these values.

Result: SASFT reduces code-switching by over 50%, eliminates it in four cases, and maintains or improves performance on multilingual benchmarks.

Conclusion: SASFT effectively addresses code-switching without compromising multilingual capabilities, offering a practical solution for LLMs.

Abstract: Large Language Models (LLMs) have impressive multilingual capabilities, but they suffer from unexpected code-switching, also known as language mixing, which involves switching to unexpected languages in the model response. This problem leads to poor readability and degrades the usability of model responses. However, existing work on this issue lacks a mechanistic analysis and shows limited effectiveness. In this paper, we first provide an in-depth analysis of unexpected code-switching using sparse autoencoders and find that when LLMs switch to a language, the features of that language exhibit excessive pre-activation values. Based on our findings, we propose Sparse Autoencoder-guided Supervised Finetuning (SASFT), which teaches LLMs to maintain appropriate pre-activation values of specific language features during training. Experiments on five models across three languages demonstrate that SASFT consistently reduces unexpected code-switching by more than 50% compared to standard supervised fine-tuning, with complete elimination in four cases. Moreover, SASFT maintains or even improves the models’ performance on six multilingual benchmarks, showing its effectiveness in addressing code-switching while preserving multilingual capabilities.
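
The training objective can be pictured as the standard SFT loss plus a penalty whenever pre-activations of the targeted SAE language features drift above a reference value. A sketch under assumed shapes (the SAE attribute names, target values, and weighting are placeholders, not the paper's exact formulation):

```python
import torch

def sasft_loss(sft_loss, sae, hidden_states, feat_idx, target, lam=0.1):
    # Pre-activations of selected language features under a frozen SAE.
    # sae.W_enc: (d_model, n_features), sae.b_enc: (n_features,) -- assumed names.
    pre_acts = hidden_states @ sae.W_enc[:, feat_idx] + sae.b_enc[feat_idx]
    feature_penalty = torch.relu(pre_acts - target).mean()  # punish only excess
    return sft_loss + lam * feature_penalty
```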

[43] From Neurons to Semantics: Evaluating Cross-Linguistic Alignment Capabilities of Large Language Models via Neurons Alignment

Chongxuan Huang, Yongshi Ye, Biao Fu, Qifeng Su, Xiaodong Shi

Main category: cs.CL

TL;DR: NeuronXA is a novel method for evaluating cross-lingual alignment in LLMs, achieving high correlation with downstream tasks and transferability using minimal data.

DetailsMotivation: Existing benchmarks for cross-lingual alignment focus on sentence embeddings, which may not capture semantic alignment well, especially for low-resource languages.

Method: Proposes NeuronXA, a neuron state-based approach inspired by neuroscientific findings, to assess cross-lingual alignment in LLMs.

Result: NeuronXA achieves Pearson correlations of 0.9556 with downstream tasks and 0.8514 with transferability using only 100 parallel sentence pairs.

Conclusion: NeuronXA effectively evaluates cross-lingual alignment and transferability, advancing research and improving multilingual LLMs’ semantic understanding.

Abstract: Large language models (LLMs) have demonstrated remarkable multilingual capabilities; however, how to evaluate cross-lingual alignment remains underexplored. Existing alignment benchmarks primarily focus on sentence embeddings, but prior research has shown that neural models tend to induce a non-smooth representation space, which impairs semantic alignment evaluation on low-resource languages. Inspired by neuroscientific findings that similar information activates overlapping neuronal regions, we propose Neuron State-Based Cross-Lingual Alignment (NeuronXA), which offers a more semantically grounded approach to assessing the cross-lingual alignment capabilities of LLMs. We evaluate NeuronXA on several prominent multilingual LLMs (LLaMA, Qwen, Mistral, GLM, and OLMo) across two transfer tasks and three multilingual benchmarks. The results demonstrate that with only 100 parallel sentence pairs, NeuronXA achieves a Pearson correlation of 0.9556 with downstream task performance and 0.8514 with transferability. These findings demonstrate NeuronXA’s effectiveness in assessing both cross-lingual alignment and transferability, even with a small dataset. This highlights its potential to advance cross-lingual alignment research and to improve the semantic understanding of multilingual LLMs.

[44] PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation

Eliya Habba, Noam Dahan, Gili Lior, Gabriel Stanovsky

Main category: cs.CL

TL;DR: PromptSuite is a framework for automatically generating diverse prompts to improve LLM evaluation reliability, offering modular design and extensibility.

DetailsMotivation: Single-prompt evaluations of LLMs are unreliable due to performance sensitivity to small changes, necessitating a robust multi-prompt approach.

Method: PromptSuite automates prompt generation with modular design, controlled perturbations, and extensibility for new components.

Result: Case studies demonstrate PromptSuite’s ability to provide meaningful prompt variations for robust evaluation.

Conclusion: PromptSuite enhances LLM evaluation practices and is accessible via a Python API and web interface.

Abstract: Evaluating LLMs with a single prompt has proven unreliable, with small changes leading to significant performance differences. However, generating the prompt variations needed for a more robust multi-prompt evaluation is challenging, limiting its adoption in practice. To address this, we introduce PromptSuite, a framework that enables the automatic generation of various prompts. PromptSuite is flexible - working out of the box on a wide range of tasks and benchmarks. It follows a modular prompt design, allowing controlled perturbations to each component, and is extensible, supporting the addition of new components and perturbation types. Through a series of case studies, we show that PromptSuite provides meaningful variations to support strong evaluation practices. It is available through both a Python API: https://github.com/eliyahabba/PromptSuite, and a user-friendly web interface: https://promptsuite.streamlit.app/

[45] SYNTHIA: Synthetic Yet Naturally Tailored Human-Inspired PersonAs

Vahid Rahimzadeh, Erfan Moosavi Monazzah, Mohammad Taher Pilehvar, Yadollah Yaghoobzadeh

Main category: cs.CL

TL;DR: SYNTHIA is a dataset of 30,000 backstories from 10,000 real social media users, combining synthetic generation with authentic data to improve consistency and realism in persona-driven LLMs.

DetailsMotivation: Existing methods for persona-driven LLMs are either costly (human-curated) or lack realism (synthetic). SYNTHIA aims to bridge this gap.

Method: SYNTHIA uses backstories derived from real social media users (BlueSky) across three time windows, incorporating temporal and social interaction metadata.

Result: SYNTHIA matches state-of-the-art methods in demographic diversity and survey alignment but excels in narrative consistency.

Conclusion: SYNTHIA enables new research in computational social science and persona-driven language modeling by providing realistic, temporally grounded data.

Abstract: Persona-driven LLMs have emerged as powerful tools in computational social science, yet existing approaches fall at opposite extremes, either relying on costly human-curated data or producing synthetic personas that lack consistency and realism. We introduce SYNTHIA, a dataset of 30,000 backstories derived from 10,000 real social media users from BlueSky open platform across three time windows, bridging this spectrum by grounding synthetic generation in authentic user activity. Our evaluation demonstrates that SYNTHIA achieves competitive performance with state-of-the-art methods in demographic diversity and social survey alignment while significantly outperforming them in narrative consistency. Uniquely, SYNTHIA incorporates temporal dimensionality and provides rich social interaction metadata from the underlying network, enabling new research directions in computational social science and persona-driven language modeling.

[46] MUR: Momentum Uncertainty guided Reasoning for Large Language Models

Hang Yan, Fangzhi Xu, Rongman Xu, Yifei Li, Jian Zhang, Haoran Luo, Xiaobao Wu, Luu Anh Tuan, Haiteng Zhao, Qika Lin, Jun Liu

Main category: cs.CL

TL;DR: MUR dynamically allocates thinking budgets to critical reasoning steps in LLMs, reducing computation by 50% while improving accuracy.

DetailsMotivation: Optimizing reasoning efficiency in LLMs without additional training, addressing overthinking in Test-Time Scaling (TTS).

Method: Proposes Momentum Uncertainty-guided Reasoning (MUR), tracking stepwise uncertainty and introducing gamma-control for flexible inference-time tuning.

Result: MUR reduces computation by 50% on average and improves accuracy by 0.62-3.37% across benchmarks.

Conclusion: MUR efficiently guides LLM reasoning, balancing computation and accuracy, validated by theoretical and empirical results.

Abstract: Large Language Models (LLMs) have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide LLM test-time scaling without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proof to support the superiority of MUR in terms of stability and biases. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by over 50% on average while improving accuracy by 0.62-3.37%.
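
The momentum idea can be sketched in a few lines: keep an exponential moving average of stepwise uncertainty and spend extra test-time compute only on steps that rise above it. This is a schematic reading of the method, with gamma in the role of the paper's control knob:

```python
def should_scale(step_uncertainties, gamma=0.9):
    # Flag reasoning steps whose uncertainty rises above the momentum track.
    momentum = step_uncertainties[0]
    decisions = []
    for u in step_uncertainties:
        decisions.append(u > momentum)            # allocate budget to this step?
        momentum = gamma * momentum + (1 - gamma) * u
    return decisions

print(should_scale([0.2, 0.1, 0.6, 0.2]))  # [False, False, True, False]
```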

[47] RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback

Qiaoyu Tang, Hao Xiang, Le Yu, Bowen Yu, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun, Junyang Lin

Main category: cs.CL

TL;DR: RefCritic, a reinforcement learning-based critic module with dual rule-based rewards, outperforms supervised fine-tuning methods in generating high-quality critiques and actionable feedback for LLMs.

DetailsMotivation: Supervised fine-tuning for critic modules fails to enhance critique abilities, producing superficial feedback. RefCritic aims to unlock superior critique capabilities.

Method: RefCritic uses reinforcement learning with dual rewards: instance-level correctness and refinement accuracy, applied to models like Qwen2.5-14B-Instruct.

Result: RefCritic achieves consistent gains (e.g., 6.8% and 7.2% on AIME25) and outperforms step-level supervised methods on ProcessBench.

Conclusion: RefCritic effectively enhances critique quality and model refinement, demonstrating scalability and superiority over traditional methods.

Abstract: With the rapid advancement of Large Language Models (LLMs), developing effective critic modules for precise guidance has become crucial yet challenging. In this paper, we initially demonstrate that supervised fine-tuning for building critic modules (which is widely adopted in current solutions) fails to genuinely enhance models’ critique abilities, producing superficial critiques with insufficient reflections and verifications. To unlock the unprecedented critique capabilities, we propose RefCritic, a long-chain-of-thought critic module based on reinforcement learning with dual rule-based rewards: (1) instance-level correctness of solution judgments and (2) refinement accuracies of the policy model based on critiques, aiming to generate high-quality evaluations with actionable feedback that effectively guides model refinement. We evaluate RefCritic on Qwen2.5-14B-Instruct and DeepSeek-R1-Distill-Qwen-14B across five benchmarks. On critique and refinement settings, RefCritic demonstrates consistent advantages across all benchmarks, e.g., 6.8% and 7.2% gains on AIME25 for the respective base models. Notably, under majority voting, policy models filtered by RefCritic show superior scaling with increased voting numbers. Moreover, despite training on solution-level supervision, RefCritic outperforms step-level supervised approaches on ProcessBench, a benchmark to identify erroneous steps in mathematical reasoning.

[48] WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou

Main category: cs.CL

TL;DR: WebShaper introduces a formalization-driven framework for synthesizing high-quality training data for LLM-powered information-seeking agents, improving consistency and performance.

DetailsMotivation: Addressing the scarcity of high-quality training data and inconsistency in existing information-driven paradigms for IS agents.

Method: Proposes WebShaper, a framework using set theory and Knowledge Projections (KP) to formalize tasks, followed by multi-step expansion for dataset synthesis.

Result: Achieves state-of-the-art performance on GAIA and WebWalkerQA benchmarks.

Conclusion: WebShaper effectively mitigates data inconsistency and enhances IS agent performance through formalization-driven synthesis.

Abstract: The advent of Large Language Model (LLM)-powered agents has revolutionized artificial intelligence by enabling solutions to complex, open-ended tasks through web-based information-seeking (IS) capabilities. The scarcity of high-quality training data has limited the development of IS agents. Existing approaches typically adopt an information-driven paradigm that first collects web data and then generates questions based on the retrieval. However, this may lead to inconsistency between the information structure and the reasoning structure, and between the question and the answer. To mitigate this, we propose a formalization-driven IS data synthesis framework, WebShaper, to construct a dataset. WebShaper systematically formalizes IS tasks through set theory. Central to the formalization is the concept of Knowledge Projections (KP), which enables precise control over reasoning structure through compositions of KP operations. During synthesis, we begin by creating seed tasks, then use a multi-step expansion process. At each step, an agentic Expander makes the current formal question more complex using retrieval and validation tools based on our formalization. We train our model on the synthesized dataset. Experiment results demonstrate that WebShaper achieves state-of-the-art performance among open-sourced IS agents on the GAIA and WebWalkerQA benchmarks.

[49] Evaluation of Coding Schemes for Transformer-based Gene Sequence Modeling

Chenlei Gong, Yuanhe Tian, Lei Mao, Yan Song

Main category: cs.CL

TL;DR: The paper compares k-mer segmentation, BPE tokenization, and positional encoding methods in DNA Transformer models, finding BPE superior for performance and stability, RoPE for periodic motifs, and AliBi for local dependencies. Optimal depth is 12 layers.

DetailsMotivation: To systematically evaluate and compare k-mer segmentation, BPE tokenization, and positional encoding methods in DNA Transformer models to determine the best configurations for performance and generalization.

Method: Compared k-mer segmentation (k=1,3,4,5,6), BPE tokenization (4,096-token vocabulary), and three positional encoding methods (sinusoidal, AliBi, RoPE) in 3, 6, 12, and 24-layer Transformer encoders. Evaluated on the GUE benchmark dataset.

Result: BPE outperforms k-mer segmentation by compressing frequent motifs and reducing sequence length. RoPE excels for periodic motifs, while AliBi works well for local dependencies. Increasing layers to 12 improves performance, but gains diminish at 24 layers.

Conclusion: BPE is recommended for DNA Transformer models due to its stability and performance. RoPE and AliBi are suitable for specific tasks. Optimal model depth is 12 layers, with diminishing returns beyond.

Abstract: Currently, many studies view DNA sequences as a special type of language and utilize Transformers to model them. These studies use fixed-length k-mer segmentation and BPE subword tokenization but lack a systematic evaluation to determine which is superior. We compare k-mer segmentation with k=1,3,4,5,6, a 4,096-token BPE vocabulary, and three positional encoding methods: sinusoidal, AliBi, and RoPE. Each configuration is trained from scratch in 3-, 6-, 12-, and 24-layer Transformer encoders and evaluated on the GUE benchmark. In general, BPE delivers higher and more stable performance across tasks by compressing frequent motifs into variable-length tokens, reducing sequence length, and improving model generalization. RoPE excels at capturing periodic motifs and extrapolating to long sequences, while AliBi also performs well on tasks driven by local dependencies. In terms of depth, we observe significant gains when increasing layers from 3 to 12, with only marginal improvements or slight overfitting at 24 layers. This study provides practical guidance for designing tokenization and positional encoding in DNA Transformer models.
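
The two segmentation schemes are easy to contrast on a toy sequence. A sketch of overlapping and non-overlapping k-mer segmentation; BPE would instead learn a 4,096-entry vocabulary of variable-length, frequency-merged motifs (training one is a standard tokenizer-library call and is omitted here, and which k-mer variant the paper uses is not stated in the abstract):

```python
def kmers_overlapping(seq: str, k: int) -> list:
    """Sliding-window k-mers: every position starts a token."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kmers_disjoint(seq: str, k: int) -> list:
    """Non-overlapping k-mers: the sequence is chopped into fixed-size blocks."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

seq = "ATGCGTACGT"
print(kmers_overlapping(seq, 3))  # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC', 'ACG', 'CGT']
print(kmers_disjoint(seq, 3))     # ['ATG', 'CGT', 'ACG'] (trailing 'T' is dropped)
```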

[50] A Penalty Goes a Long Way: Measuring Lexical Diversity in Synthetic Texts Under Prompt-Influenced Length Variations

Vijeta Deshpande, Ishita Dasgupta, Uttaran Bhattacharya, Somdeb Sarkhel, Saayan Mitra, Anna Rumshisky

Main category: cs.CL

TL;DR: The paper introduces PATTR, a new diversity metric for synthetic text, addressing biases from text length variations in existing metrics like MATTR and CR.

DetailsMotivation: To improve lexical diversity measurement in synthetic text by accounting for length variations, which current metrics fail to address adequately.

Method: Proposes Penalty-Adjusted Type-Token Ratio (PATTR), tests it on a 20M-word synthetic corpus from seven LLMs, and compares it to MATTR and CR.

Result: PATTR outperforms existing metrics by mitigating length biases and maintaining task-specific target response lengths.

Conclusion: PATTR is a robust diversity metric for synthetic text, offering better performance and adherence to desired response lengths than current methods.

Abstract: Synthetic text generated by Large Language Models (LLMs) is increasingly used for further training and improvement of LLMs. Diversity is crucial for the effectiveness of synthetic data, and researchers rely on prompt engineering to improve diversity. However, the impact of prompt variations on response text length, and, more importantly, the consequential effect on lexical diversity measurements, remain underexplored. In this work, we propose Penalty-Adjusted Type-Token Ratio (PATTR), a diversity metric robust to length variations. We generate a large synthetic corpus of over 20M words using seven models from the LLaMA, OLMo, and Phi families, focusing on a creative writing task of video script generation, where diversity is crucial. We evaluate per-response lexical diversity using PATTR and compare it against existing metrics of Moving-Average TTR (MATTR) and Compression Ratio (CR). Our analysis highlights how text length variations introduce biases favoring shorter responses. Unlike existing metrics, PATTR explicitly considers the task-specific target response length ($L_T$) to effectively mitigate length biases. We further demonstrate the utility of PATTR in filtering the top-10/100/1,000 most lexically diverse responses, showing that it consistently outperforms MATTR and CR by yielding on par or better diversity with high adherence to $L_T$.
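
The abstract does not give PATTR's exact formula, so here is a minimal sketch of a length-penalized type-token ratio in its spirit: plain TTR mechanically favors short responses, so the score is discounted when a response falls short of the task-specific target length $L_T$. The multiplicative penalty form is an assumption:

```python
def pattr_sketch(text: str, target_len: int) -> float:
    """Type-token ratio discounted for responses shorter than the target L_T.
    Illustrative only; the paper's penalty may take a different form."""
    tokens = text.split()
    if not tokens:
        return 0.0
    ttr = len(set(tokens)) / len(tokens)          # classic lexical diversity
    penalty = min(len(tokens) / target_len, 1.0)  # < 1 only for short responses
    return ttr * penalty
```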

[51] Filling the Gap: Is Commonsense Knowledge Generation useful for Natural Language Inference?

Chathuri Jayaweera, Brianna Yanqui, Bonnie Dorr

Main category: cs.CL

TL;DR: The paper explores using Large Language Models (LLMs) as commonsense knowledge generators for Natural Language Inference (NLI), assessing their reliability and impact on prediction accuracy.

DetailsMotivation: Existing commonsense resources lack coverage for diverse premise-hypothesis pairs, prompting the use of LLMs to fill this gap.

Method: The study adapts metrics to evaluate LLM-generated commonsense knowledge for factuality and consistency in NLI.

Result: Incorporating commonsense knowledge doesn’t consistently boost overall accuracy but helps distinguish entailing instances and moderately improves contradictory/neutral inferences.

Conclusion: LLMs show promise as commonsense generators for NLI, though their impact varies by inference type.

Abstract: Natural Language Inference (NLI) is the task of determining the semantic entailment of a premise for a given hypothesis. The task aims to develop systems that emulate natural human inferential processes where commonsense knowledge plays a major role. However, existing commonsense resources lack sufficient coverage for a variety of premise-hypothesis pairs. This study explores the potential of Large Language Models as commonsense knowledge generators for NLI along two key dimensions: their reliability in generating such knowledge and the impact of that knowledge on prediction accuracy. We adapt and modify existing metrics to assess LLM factuality and consistency in generating commonsense knowledge in this context. While explicitly incorporating commonsense knowledge does not consistently improve overall results, it effectively helps distinguish entailing instances and moderately improves distinguishing contradictory and neutral inferences.

[52] From Disagreement to Understanding: The Case for Ambiguity Detection in NLI

Chathuri Jayaweera, Bonnie Dorr

Main category: cs.CL

TL;DR: The paper argues that annotation disagreement in NLI reflects meaningful interpretive variation due to ambiguity, not just noise. It proposes an ambiguity-aware NLI framework and highlights the need for datasets annotated for ambiguity.

DetailsMotivation: To address the issue of annotation disagreement in NLI by recognizing ambiguity as a meaningful signal of divergent human perspectives, rather than dismissing it as noise.

Method: Proposes a unified framework integrating existing taxonomies to classify ambiguity types, supported by concrete examples. Suggests new annotated resources and unsupervised approaches for ambiguity detection.

Result: Illustrates how ambiguity influences annotator decisions and identifies the lack of ambiguity-annotated datasets as a key limitation.

Conclusion: Calls for ambiguity-aware NLI systems, emphasizing the need for better datasets and detection methods to align models with human interpretation.

Abstract: This position paper argues that annotation disagreement in Natural Language Inference (NLI) is not mere noise but often reflects meaningful interpretive variation, especially when triggered by ambiguity in the premise or hypothesis. While underspecified guidelines and annotator behavior can contribute to variation, content-based ambiguity offers a process-independent signal of divergent human perspectives. We call for a shift toward ambiguity-aware NLI by systematically identifying ambiguous input pairs and classifying ambiguity types. To support this, we present a unified framework that integrates existing taxonomies and illustrate key ambiguity subtypes through concrete examples. These examples reveal how ambiguity shapes annotator decisions and motivate the need for targeted detection methods that better align models with human interpretation. A key limitation is the lack of datasets annotated for ambiguity and subtypes. We propose addressing this gap through new annotated resources and unsupervised approaches to ambiguity detection – paving the way for more robust, explainable, and human-aligned NLI systems.

[53] A Case Against Implicit Standards: Homophone Normalization in Machine Translation for Languages that use the Ge’ez Script

Hellina Hailu Nigatu, Atnafu Lambebo Tonja, Henok Biadglign Ademtew, Hizkel Mitiku Alemayehu, Negasi Haile Abadi, Tadesse Destaw Belay, Seid Muhie Yimam

Main category: cs.CL

TL;DR: The paper examines the impact of homophone normalization in Amharic NLP, proposing post-inference normalization to improve BLEU scores while preserving language features.

DetailsMotivation: To address the drawbacks of homophone normalization, which can hinder model generalization and language understanding.

Method: Experiments with monolingual training and cross-lingual transfer, followed by post-inference normalization of model predictions.

Result: Achieved a BLEU score increase of up to 1.03 while maintaining language features.

Conclusion: Advocates for language-aware interventions and contributes to discussions on language change in NLP.

Abstract: Homophone normalization, where characters that have the same sound in a writing script are mapped to one character, is a pre-processing step applied in Amharic Natural Language Processing (NLP) literature. While this may improve performance reported by automatic metrics, it also results in models that are not able to understand different forms of writing in a single language. Further, there might be impacts in transfer learning, where models trained on normalized data do not generalize well to other languages. In this paper, we experiment with monolingual training and cross-lingual transfer to understand the impacts of normalization on languages that use the Ge’ez script. We then propose a post-inference intervention in which normalization is applied to model predictions instead of training data. With our simple scheme of post-inference normalization, we show that we can achieve an increase in BLEU score of up to 1.03 while preserving language features in training. Our work contributes to the broader discussion on technology-facilitated language change and calls for more language-aware interventions.
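
A minimal sketch of the post-inference intervention: homophone characters are mapped to a canonical form only at evaluation time, leaving training data and model outputs untouched. The mapping below is a tiny illustrative subset (real Amharic normalization tables are larger), and applying the map to references as well as hypotheses is an assumption about the scoring setup; `bleu_fn` stands in for any BLEU implementation:

```python
# Illustrative subset of a Ge'ez-script homophone table (real tables are larger).
HOMOPHONE_MAP = {"ሠ": "ሰ", "ሐ": "ሀ", "ኀ": "ሀ", "ዐ": "አ"}

def normalize(text: str) -> str:
    return "".join(HOMOPHONE_MAP.get(ch, ch) for ch in text)

def post_inference_bleu(hypotheses, references, bleu_fn):
    # Normalization touches only the evaluation step, so the model is trained
    # on (and generates) text that preserves the script's distinctions.
    return bleu_fn([normalize(h) for h in hypotheses],
                   [normalize(r) for r in references])
```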

[54] What Level of Automation is “Good Enough”? A Benchmark of Large Language Models for Meta-Analysis Data Extraction

Lingbo Li, Anuradha Mathrani, Teo Susnjak

Main category: cs.CL

TL;DR: The study evaluates three LLMs (Gemini-2.0-flash, Grok-3, GPT-4o-mini) for automating data extraction from RCTs in meta-analyses, focusing on precision and recall. Customised prompts improved recall by up to 15%, leading to proposed guidelines for task-specific automation.

DetailsMotivation: Automating data extraction from RCTs for meta-analysis is challenging, requiring evaluation of LLM performance to improve efficiency and accuracy.

Method: Tested three LLMs across medical domains (hypertension, diabetes, orthopaedics) using four prompting strategies (basic, self-reflective, model ensemble, customised) to assess extraction quality.

Result: High precision but poor recall across models; customised prompts boosted recall by up to 15%. Guidelines proposed for task-specific automation.

Conclusion: Customised prompts and tiered guidelines optimize LLM use for data extraction, balancing automation with expert oversight in meta-analyses.

Abstract: Automating data extraction from full-text randomised controlled trials (RCTs) for meta-analysis remains a significant challenge. This study evaluates the practical performance of three LLMs (Gemini-2.0-flash, Grok-3, GPT-4o-mini) across tasks involving statistical results, risk-of-bias assessments, and study-level characteristics in three medical domains: hypertension, diabetes, and orthopaedics. We tested four distinct prompting strategies (basic prompting, self-reflective prompting, model ensemble, and customised prompts) to determine how to improve extraction quality. All models demonstrate high precision but consistently suffer from poor recall by omitting key information. We found that customised prompts were the most effective, boosting recall by up to 15%. Based on this analysis, we propose a three-tiered set of guidelines for using LLMs in data extraction, matching data types to appropriate levels of automation based on task complexity and risk. Our study offers practical advice for automating data extraction in real-world meta-analyses, balancing LLM efficiency with expert oversight through targeted, task-specific automation.

[55] Collaborative Distillation Strategies for Parameter-Efficient Language Model Deployment

Xiandong Meng, Yan Wu, Yexin Tian, Xin Hu, Tianze Kang, Junliang Du

Main category: cs.CL

TL;DR: The paper proposes a multi-teacher guided distillation method to reduce computational costs and improve inference speed in large language models, achieving strong performance with a smaller student model.

DetailsMotivation: Addressing the high computational cost and slow inference of large language models by leveraging knowledge from multiple teacher models.

Method: Uses a distillation strategy with weighted output fusion, feature alignment loss, and dynamic teacher weighting to guide the student model.

Result: The student model outperforms other distillation methods in perplexity, distillation loss, and generation quality.

Conclusion: The method provides an efficient way to compress large language models and highlights the effectiveness of multi-teacher collaboration.

Abstract: This paper addresses the challenges of high computational cost and slow inference in deploying large language models. It proposes a distillation strategy guided by multiple teacher models. The method constructs several teacher models and integrates their output probability distributions and intermediate semantic features. This guides the student model to learn from multiple sources of knowledge. As a result, the student model gains stronger language understanding and generation ability while maintaining a small parameter size. To achieve this, the paper introduces a weighted output fusion mechanism, a feature alignment loss function, and an entropy-driven dynamic teacher weighting strategy. These components improve the quality and stability of knowledge transfer during distillation. Under multi-teacher guidance, the student model captures semantic information more effectively and demonstrates strong performance across multiple evaluation metrics. In particular, the method shows high consistency in expression, generalization ability, and task adaptability in tasks such as language modeling, text generation, and multi-task learning. The experiments compare the proposed method with several widely adopted distillation approaches. The results further confirm its overall advantages in perplexity, distillation loss, and generation quality. This study provides a feasible technical path for the efficient compression of large-scale language models. It also demonstrates the effectiveness of multi-teacher collaborative mechanisms in complex language modeling tasks.
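
A sketch of how entropy-driven teacher weighting and weighted output fusion could fit together in PyTorch. The temperature, the softmax-over-negative-entropy weighting, and the omission of the feature-alignment term are simplifications of what the paper describes:

```python
import torch
import torch.nn.functional as F

def fuse_teachers(teacher_logits, tau: float = 2.0):
    """Weight each teacher's softened distribution by its (negated) average
    entropy, so more confident teachers contribute more to the fused target."""
    probs = [F.softmax(t / tau, dim=-1) for t in teacher_logits]
    entropies = torch.stack(
        [-(p * p.clamp_min(1e-9).log()).sum(-1).mean() for p in probs])
    weights = F.softmax(-entropies, dim=0)
    fused = sum(w * p for w, p in zip(weights, probs))
    return fused, weights

def distillation_loss(student_logits, teacher_logits, tau: float = 2.0):
    fused, _ = fuse_teachers(teacher_logits, tau)
    log_student = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_student, fused, reduction="batchmean") * tau ** 2
```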

[56] SOI Matters: Analyzing Multi-Setting Training Dynamics in Pretrained Language Models via Subsets of Interest

Shayan Vassef, Amirhossein Dabiriaghdam, Mohammadreza Bakhtiari, Yadollah Yaghoobzadeh

Main category: cs.CL

TL;DR: The paper explores how multi-task, multi-lingual, and multi-source learning affect pretrained language models, introducing a framework (SOI) to categorize learning behaviors. Experiments show multi-source learning boosts out-of-distribution performance, while multi-task learning has mixed results. A two-stage fine-tuning method using SOI further improves performance.

DetailsMotivation: To understand and improve the robustness and performance of pretrained language models in multi-setting configurations (multi-task, multi-lingual, multi-source).

Method: Introduces Subsets of Interest (SOI) to categorize learning behaviors, uses transition heatmaps and dataset cartography for analysis, and conducts experiments comparing single vs. multi-setting learning across tasks, sources, and languages.

Result: Multi-source learning improves out-of-distribution performance by up to 7%, while multi-task learning shows mixed results. SOI-based subset selection in fine-tuning further enhances performance.

Conclusion: The study provides insights into training dynamics and practical methods for optimizing language models in multi-setting scenarios, with multi-source learning being particularly effective.

Abstract: This work investigates the impact of multi-task, multi-lingual, and multi-source learning approaches on the robustness and performance of pretrained language models. To enhance this analysis, we introduce Subsets of Interest (SOI), a novel categorization framework that identifies six distinct learning behavior patterns during training, including forgettable examples, unlearned examples, and always correct examples. Through SOI transition heatmaps and dataset cartography visualization, we analyze how examples shift between these categories when transitioning from single-setting to multi-setting configurations. We perform comprehensive experiments across three parallel comparisons: multi-task vs. single-task learning using English tasks (entailment, paraphrase, sentiment), multi-source vs. single-source learning using sentiment analysis datasets, and multi-lingual vs. single-lingual learning using intent classification in French, English, and Persian. Our results demonstrate that multi-source learning consistently improves out-of-distribution performance by up to 7%, while multi-task learning shows mixed results with notable gains in similar task combinations. We further introduce a two-stage fine-tuning approach where the second stage leverages SOI-based subset selection to achieve additional performance improvements. These findings provide new insights into training dynamics and offer practical approaches for optimizing multi-setting language model performance.
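
An SOI-style category can be read off each example's per-epoch correctness trajectory. The paper defines six behavior patterns; the rules below are an illustrative approximation of that idea, not its exact definitions:

```python
def soi_category(correct_per_epoch: list) -> str:
    """Map a training example's per-epoch correctness record to a coarse
    learning-behavior label (illustrative thresholds)."""
    if all(correct_per_epoch):
        return "always_correct"
    if not any(correct_per_epoch):
        return "unlearned"
    if correct_per_epoch[-1] and not correct_per_epoch[0]:
        return "learned"
    if not correct_per_epoch[-1]:
        return "forgettable"      # was correct at some point, then lost
    return "fluctuating"

print(soi_category([True, True, True]))    # always_correct
print(soi_category([False, True, False]))  # forgettable
```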

[57] ChiMed 2.0: Advancing Chinese Medical Dataset in Facilitating Large Language Modeling

Yuanhe Tian, Junjie Liu, Zhizhou Kou, Yuxiang Li, Yan Song

Main category: cs.CL

TL;DR: ChiMed 2.0 is a large-scale Chinese medical dataset designed for pre-training, fine-tuning, and RLHF, addressing limitations of existing datasets.

DetailsMotivation: Existing Chinese medical datasets are small and narrow, lacking diversity for effective AI training. ChiMed 2.0 aims to fill this gap.

Method: ChiMed 2.0 combines data from Chinese medical platforms and LLM-generated content, offering pre-training, SFT, and RLHF support.

Result: Experiments show performance gains across model scales, validating ChiMed 2.0’s effectiveness.

Conclusion: ChiMed 2.0 successfully addresses dataset limitations and supports diverse training needs for Chinese medical AI.

Abstract: Building high-quality data resources is crucial for advancing artificial intelligence research and applications in specific domains, particularly in the Chinese medical domain. Existing Chinese medical datasets are limited in size and narrow in domain coverage, falling short of the diverse corpora required for effective pre-training. Moreover, most datasets are designed solely for LLM fine-tuning and do not support pre-training and reinforcement learning from human feedback (RLHF). In this paper, we propose a Chinese medical dataset named ChiMed 2.0, which extends our previous work ChiMed, and covers data collected from Chinese medical online platforms and generated by LLMs. ChiMed 2.0 contains 204.4M Chinese characters covering both traditional Chinese medicine classics and modern general medical data, where there are 164.8K documents for pre-training, 351.6K question-answering pairs for supervised fine-tuning (SFT), and 41.7K preference data tuples for RLHF. To validate the effectiveness of our approach for training a Chinese medical LLM, we conduct further pre-training, SFT, and RLHF experiments on representative general domain LLMs and evaluate their performance on medical benchmark datasets. The results show performance gains across different model scales, validating the dataset’s effectiveness and applicability.

[58] A Novel Self-Evolution Framework for Large Language Models

Haoran Sun, Zekun Zhang, Shaoning Zeng

Main category: cs.CL

TL;DR: A Dual-Phase Self-Evolution (DPSE) framework is proposed to enhance LLMs by jointly optimizing user preference adaptation and domain-specific competence, outperforming existing methods.

DetailsMotivation: Existing post-training strategies improve user alignment but lack domain cognition enhancement, creating a gap DPSE aims to bridge.

Method: DPSE uses a Censor module to extract interaction signals and guide structured data expansion, followed by a two-stage fine-tuning pipeline for domain grounding and preference optimization.

Result: DPSE outperforms baselines like Supervised Fine-Tuning and Preference Optimization in general NLP benchmarks and long-term dialogue tasks.

Conclusion: DPSE offers an autonomous path for LLMs’ continual self-evolution, validated by ablation studies.

Abstract: The capabilities of Large Language Models (LLMs) are limited to some extent by pre-training, so some researchers optimize LLMs through post-training. Existing post-training strategies, such as memory-based retrieval or preference optimization, improve user alignment yet fail to enhance the model’s domain cognition. To bridge this gap, we propose a novel Dual-Phase Self-Evolution (DPSE) framework that jointly optimizes user preference adaptation and domain-specific competence. DPSE introduces a Censor module to extract multi-dimensional interaction signals and estimate satisfaction scores, which guide structured data expansion via topic-aware and preference-driven strategies. These expanded datasets support a two-stage fine-tuning pipeline: supervised domain grounding followed by frequency-aware preference optimization. Experiments across general NLP benchmarks and long-term dialogue tasks demonstrate that DPSE consistently outperforms Supervised Fine-Tuning, Preference Optimization, and Memory-Augmented baselines. Ablation studies validate the contribution of each module. In this way, our framework provides an autonomous path toward continual self-evolution of LLMs.

[59] Beyond Easy Wins: A Text Hardness-Aware Benchmark for LLM-generated Text Detection

Navid Ayoobi, Sadat Shahriar, Arjun Mukherjee

Main category: cs.CL

TL;DR: The paper introduces SHIELD, a benchmark for AI text detectors that evaluates reliability and stability, addressing gaps in current metrics. It also proposes a humanification framework to challenge detectors.

DetailsMotivation: Current AI text detector evaluations focus on conventional metrics like AUROC, ignoring practical issues like false positives and stability across domains. These gaps hinder real-world deployment.

Method: The authors develop SHIELD, a benchmark integrating reliability and stability metrics. They also create a model-agnostic humanification framework with a hardness parameter to test detectors.

Result: SHIELD highlights limitations in current detectors, and the humanification framework effectively challenges state-of-the-art zero-shot detection methods.

Conclusion: The paper emphasizes the need for practical evaluation metrics in AI text detection and demonstrates SHIELD’s effectiveness in addressing these needs.

Abstract: We present a novel evaluation paradigm for AI text detectors that prioritizes real-world and equitable assessment. Current approaches predominantly report conventional metrics like AUROC, overlooking that even modest false positive rates constitute a critical impediment to practical deployment of detection systems. Furthermore, real-world deployment necessitates predetermined threshold configuration, making detector stability (i.e. the maintenance of consistent performance across diverse domains and adversarial scenarios), a critical factor. These aspects have been largely ignored in previous research and benchmarks. Our benchmark, SHIELD, addresses these limitations by integrating both reliability and stability factors into a unified evaluation metric designed for practical assessment. Furthermore, we develop a post-hoc, model-agnostic humanification framework that modifies AI text to more closely resemble human authorship, incorporating a controllable hardness parameter. This hardness-aware approach effectively challenges current SOTA zero-shot detection methods in maintaining both reliability and stability. (Data and code: https://github.com/navid-aub/SHIELD-Benchmark)
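
The abstract does not spell out the unified metric, but its two ingredients suggest a simple shape: average detection power at a strict false-positive budget, discounted by how much that power varies across domains and adversarial settings. A hedged sketch of that shape, not the paper's formula:

```python
import numpy as np

def shield_style_score(tpr_at_low_fpr: dict, lam: float = 1.0) -> float:
    """Reliability (mean TPR at a fixed, small FPR) minus a stability penalty
    (spread across evaluation conditions). The weighting is an assumption."""
    values = np.array(list(tpr_at_low_fpr.values()))
    return float(values.mean() - lam * values.std())

# Per-domain TPR measured at, e.g., 1% FPR with a single pre-set threshold.
print(shield_style_score({"news": 0.92, "essays": 0.88, "humanified": 0.41}))
```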

[60] On the Inevitability of Left-Leaning Political Bias in Aligned Language Models

Thilo Hagendorff

Main category: cs.CL

TL;DR: The paper argues that left-wing political bias in LLMs is inherent to AI alignment goals like harmlessness and honesty, conflicting with right-wing ideologies, and critiques framing this bias as problematic.

DetailsMotivation: To reconcile the inherent left-wing bias in AI alignment with critiques of such bias, highlighting the normative alignment of HHH principles with progressive values.

Method: Theoretical analysis of AI alignment principles (HHH) and their alignment with progressive moral frameworks, contrasting with right-wing ideologies.

Result: Left-wing bias in LLMs is shown as a natural outcome of alignment goals, while critiques of this bias undermine HHH principles.

Conclusion: Framing left-wing bias as problematic contradicts AI alignment objectives, which inherently align with progressive values.

Abstract: The guiding principle of AI alignment is to train large language models (LLMs) to be harmless, helpful, and honest (HHH). At the same time, there are mounting concerns that LLMs exhibit a left-wing political bias. Yet, the commitment to AI alignment cannot be harmonized with the latter critique. In this article, I argue that intelligent systems that are trained to be harmless and honest must necessarily exhibit left-wing political bias. Normative assumptions underlying alignment objectives inherently concur with progressive moral frameworks and left-wing principles, emphasizing harm avoidance, inclusivity, fairness, and empirical truthfulness. Conversely, right-wing ideologies often conflict with alignment guidelines. Yet, research on political bias in LLMs is consistently framing its insights about left-leaning tendencies as a risk, as problematic, or concerning. This way, researchers are actively arguing against AI alignment, tacitly fostering the violation of HHH principles.

[61] Reasoning Models are Test Exploiters: Rethinking Multiple-Choice

Narun Raman, Taylor Lundy, Kevin Leyton-Brown

Main category: cs.CL

TL;DR: The paper evaluates the effectiveness of multiple-choice question-answering (MCQA) as a proxy for downstream performance of LLMs, finding it reliable only when models perform chain-of-thought reasoning before seeing options.

DetailsMotivation: To assess whether MCQA remains a valid benchmark for state-of-the-art LLMs, given their evolving reasoning capabilities.

Method: Systematic evaluation of 15 QA benchmarks and 25 LLMs, testing 5 question-presentation variations, including chain-of-thought reasoning timing and option availability.

Result: MCQA works well if reasoning occurs before seeing options, but models exploiting options post-reasoning outperform free-text performance, undermining MCQA’s validity.

Conclusion: MCQA is no longer a reliable proxy for downstream performance; new benchmark designs are needed to better assess LLMs’ reasoning.

Abstract: When evaluating Large Language Models (LLMs) in question-answering domains, it is common to ask the model to choose among a fixed set of choices (so-called multiple-choice question-answering, or MCQA). Although downstream tasks of interest typically do not provide systems with explicit options among which to choose, this approach is nevertheless widely used because it makes automatic grading straightforward and has tended to produce challenging benchmarks that correlate sufficiently well with downstream performance. This paper investigates the extent to which this trend continues to hold for state-of-the-art reasoning models, describing a systematic evaluation of 15 different question-answering benchmarks (e.g., MMLU, HLE) and 25 different LLMs (including small models such as Qwen 7B and relatively large models such as Llama 70B). For each model-benchmark pair, we considered 5 ways of presenting the model with questions, including variations on whether multiple choices were offered to the model at all; whether “none of the above” sometimes replaced the right answer; and whether the model was permitted to perform chain-of-thought reasoning before and/or after the choices were presented. MCQA remained a good proxy for the downstream performance of models as long as they were allowed to perform chain-of-thought reasoning only before being presented with the options among which they had to select. On the other hand, large models that were able to perform reasoning after being given a set of options tended to significantly outperform their free-text performance due to exploiting the information in the options. We conclude that MCQA is no longer a good proxy for assessing downstream performance of state-of-the-art models, and offer practical guidelines for designing more robust, bias-resistant benchmarks that better reflect LLMs’ genuine reasoning capabilities.

[62] LionGuard 2: Building Lightweight, Data-Efficient & Localised Multilingual Content Moderators

Leanne Tan, Gabriel Chua, Ziyu Ge, Roy Ka-Wei Lee

Main category: cs.CL

TL;DR: LionGuard 2 is a lightweight, multilingual moderation classifier for Singapore, outperforming commercial systems without fine-tuning large models.

DetailsMotivation: Addressing gaps in multilingual moderation, especially for low-resource languages, in real-world deployments.

Method: Uses pre-trained OpenAI embeddings and a multi-head ordinal classifier, tailored for English, Chinese, Malay, and Tamil.

Result: Outperforms commercial and open-source systems across 17 benchmarks, including Singapore-specific datasets.

Conclusion: High-quality local data and multilingual embeddings enable strong moderation without large models, with practical deployment in Singapore.

Abstract: Modern moderation systems increasingly support multiple languages, but often fail to address localisation and low-resource variants - creating safety gaps in real-world deployments. Small models offer a potential alternative to large LLMs, yet still demand considerable data and compute. We present LionGuard 2, a lightweight, multilingual moderation classifier tailored to the Singapore context, supporting English, Chinese, Malay, and partial Tamil. Built on pre-trained OpenAI embeddings and a multi-head ordinal classifier, LionGuard 2 outperforms several commercial and open-source systems across 17 benchmarks, including both Singapore-specific and public English datasets. The system is actively deployed within the Singapore Government, demonstrating practical efficacy at scale. Our findings show that high-quality local data and robust multilingual embeddings can achieve strong moderation performance, without fine-tuning large models. We release our model weights and part of our training data to support future work on LLM safety.
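
A sketch of the described architecture: a frozen text embedding goes in, one head per harm category comes out, and each head predicts an ordinal severity level. The hidden size, level count, category names, cumulative-logit formulation, and the 1536-dimensional input (a common OpenAI embedding size) are assumptions:

```python
import torch
import torch.nn as nn

class MultiHeadOrdinalClassifier(nn.Module):
    def __init__(self, emb_dim: int, categories, n_levels: int = 3):
        super().__init__()
        # One small head per category; each emits K-1 cumulative logits.
        self.heads = nn.ModuleDict({
            c: nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU(),
                             nn.Linear(256, n_levels - 1))
            for c in categories})

    def forward(self, emb: torch.Tensor) -> dict:
        # sigmoid(logit_k) ~ P(severity > k); the predicted level is the
        # number of thresholds passed.
        return {c: torch.sigmoid(head(emb)) for c, head in self.heads.items()}

model = MultiHeadOrdinalClassifier(1536, ["hate", "harassment", "sexual"])
scores = model(torch.randn(2, 1536))  # batch of two embedded texts
```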

[63] Probing Information Distribution in Transformer Architectures through Entropy Analysis

Amedeo Buonanno, Alessandro Rivetti, Francesco A. N. Palmieri, Giovanni Di Gennaro, Gianmarco Romano

Main category: cs.CL

TL;DR: The paper explores entropy analysis in Transformer models to study information distribution, using GPT as a case study to reveal model behavior insights.

DetailsMotivation: To understand how information is managed and transformed in Transformer-based architectures.

Method: Quantifies token-level uncertainty and examines entropy patterns across processing stages in a GPT-based model.

Result: Demonstrates potential to reveal insights into model behavior and internal representations.

Conclusion: The approach may aid in developing interpretability and evaluation frameworks for Transformer models.

Abstract: This work explores entropy analysis as a tool for probing information distribution within Transformer-based architectures. By quantifying token-level uncertainty and examining entropy patterns across different stages of processing, we aim to investigate how information is managed and transformed within these models. As a case study, we apply the methodology to a GPT-based large language model, illustrating its potential to reveal insights into model behavior and internal representations. This approach may offer insights into model behavior and contribute to the development of interpretability and evaluation frameworks for Transformer-based models.
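
Token-level entropy falls straight out of a model's output distribution; probing intermediate stages additionally needs a way to turn hidden states into distributions, and a logit-lens-style projection through the LM head (assumed here) is one common choice:

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (in nats) of the next-token distribution at each
    position; logits has shape (seq_len, vocab_size)."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

def entropy_per_layer(hidden_states, lm_head):
    # hidden_states: per-layer tensors of shape (seq_len, hidden_dim), as
    # returned by a GPT-style model with output_hidden_states=True.
    return [token_entropy(lm_head(h)) for h in hidden_states]
```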

[64] Metaphor and Large Language Models: When Surface Features Matter More than Deep Understanding

Elisa Sanchez-Bayona, Rodrigo Agerri

Main category: cs.CL

TL;DR: The paper evaluates LLMs’ metaphor interpretation across diverse datasets and tasks, finding performance influenced by surface-level features rather than metaphorical content.

DetailsMotivation: Address limitations of prior research by evaluating LLMs' metaphor processing in realistic, diverse settings.

Method: Extensive experiments on NLI and QA tasks using publicly available datasets with metaphor annotations.

Result: LLMs’ performance is driven by lexical overlap and sentence length, not metaphorical understanding.

Conclusion: Highlights LLMs’ limitations in figurative language and calls for more realistic evaluation frameworks.

Abstract: This paper presents a comprehensive evaluation of the capabilities of Large Language Models (LLMs) in metaphor interpretation across multiple datasets, tasks, and prompt configurations. Although metaphor processing has gained significant attention in Natural Language Processing (NLP), previous research has been limited to single-dataset evaluations and specific task settings, often using artificially constructed data through lexical replacement. We address these limitations by conducting extensive experiments using diverse publicly available datasets with inference and metaphor annotations, focusing on Natural Language Inference (NLI) and Question Answering (QA) tasks. The results indicate that LLMs’ performance is more influenced by features like lexical overlap and sentence length than by metaphorical content, demonstrating that any alleged emergent abilities of LLMs to understand metaphorical language are the result of a combination of surface-level features, in-context learning, and linguistic knowledge. This work provides critical insights into the current capabilities and limitations of LLMs in processing figurative language, highlighting the need for more realistic evaluation frameworks in metaphor interpretation tasks. Data and code are publicly available.

[65] AlgoSimBench: Identifying Algorithmically Similar Problems for Competitive Programming

Jierui Li, Raymond Mooney

Main category: cs.CL

TL;DR: The paper introduces AlgoSimBench to assess LLMs’ ability to identify algorithmically similar problems (ASPs), finding they struggle. A new method, ASM, improves accuracy by 6.7-11.7%.

DetailsMotivation: To explore whether LLMs' problem-solving abilities generalize to less-seen domains, specifically identifying ASPs.

Method: Introduces AlgoSimBench with 1317 problems and 402 MCQs. Proposes ASM for similarity detection and evaluates code embedding models and retrieval methods.

Result: LLMs struggle with ASP identification (best model: 65.9% accuracy). ASM improves accuracy by 6.7-11.7%. Combining ASM with BM25 yields 52.2% accuracy.

Conclusion: LLMs need improvement in ASP identification. ASM and summarization methods show promise for enhancing performance.

Abstract: Recent progress in LLMs, such as reasoning models, has demonstrated strong abilities to solve complex competitive programming problems, often rivaling top human competitors. However, it remains underexplored whether these abilities generalize to relevant domains that are less seen during training. To address this, we introduce AlgoSimBench, a new benchmark designed to assess LLMs' ability to identify algorithmically similar problems (ASPs): problems that can be solved using similar algorithmic approaches. AlgoSimBench consists of 1317 problems, annotated with 231 distinct fine-grained algorithm tags, from which we curate 402 multiple-choice questions (MCQs), where each question presents one algorithmically similar problem alongside three textually similar but algorithmically dissimilar distractors. Our evaluation reveals that LLMs struggle to identify ASPs, with the best-performing model (o3-mini) achieving only 65.9% accuracy on the MCQ task. To address this challenge, we propose attempted solution matching (ASM), a novel method for improving problem similarity detection. On our MCQ task, ASM yields an absolute accuracy improvement of 6.7% to 11.7% across different models. We also evaluated code embedding models and retrieval methods on similar problem identification. While adversarial selection of problems degrades retrieval performance to below random, we found that simply summarizing the problem to remove narrative elements eliminates the effect, and that combining ASM with a keyword-prioritized method, BM25, can yield up to 52.2% accuracy. Code and data are available at github.com
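
Attempted solution matching lends itself to a compact sketch: draft a solution to every problem with an LLM, then retrieve by similarity between the drafts rather than the statements, since algorithmic kinship surfaces in the attempted solutions. Here BM25 does the matching (the paper also combines ASM with BM25); `llm_solve` is a placeholder for a model call:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def asm_scores(llm_solve, query_problem: str, candidate_problems: list) -> list:
    query_attempt = llm_solve(query_problem)            # draft; need not be correct
    candidate_attempts = [llm_solve(p) for p in candidate_problems]
    bm25 = BM25Okapi([a.split() for a in candidate_attempts])
    return list(bm25.get_scores(query_attempt.split()))
```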

[66] ASPERA: A Simulated Environment to Evaluate Planning for Complex Action Execution

Alexandru Coca, Mark Gaynor, Zhenxing Zhang, Jianpeng Cheng, Bo-Hsiang Tseng, Pete Boothroyd, Héctor Martinez Alonso, Diarmuid Ó Séaghdha, Anders Johannsen

Main category: cs.CL

TL;DR: The paper evaluates LLMs for powering digital assistants via complex action execution, introduces ASPERA for task generation, and benchmarks performance with Asper-Bench.

DetailsMotivation: To address challenges in LLM-based digital assistants for executing multi-step goals using pre-trained programming knowledge.

Method: Developed ASPERA, a framework with a simulation and human-assisted LLM data generation engine to create high-quality tasks.

Result: Showed that program generation grounded in custom libraries is harder for LLMs than dependency-free code generation.

Conclusion: ASPERA and Asper-Bench provide tools to improve LLM performance in complex action execution for digital assistants.

Abstract: This work evaluates the potential of large language models (LLMs) to power digital assistants capable of complex action execution. These assistants rely on pre-trained programming knowledge to execute multi-step goals by composing objects and functions defined in assistant libraries into action execution programs. To achieve this, we develop ASPERA, a framework comprising an assistant library simulation and a human-assisted LLM data generation engine. Our engine allows developers to guide LLM generation of high-quality tasks consisting of complex user queries, simulation state and corresponding validation programs, tackling data availability and evaluation robustness challenges. Alongside the framework we release Asper-Bench, an evaluation dataset of 250 challenging tasks generated using ASPERA, which we use to show that program generation grounded in custom assistant libraries is a significant challenge to LLMs compared to dependency-free code generation.

[67] Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models

Kaiyan Chang, Yonghao Shi, Chenglong Wang, Hang Zhou, Chi Hu, Xiaoqian Liu, Yingfeng Luo, Yuan Ge, Tong Xiao, Jingbo Zhu

Main category: cs.CL

TL;DR: Hybrid Test-Time Scaling (TTS) combines fine-grained sequential and parallel scaling methods to enhance reasoning in LLMs without additional training overhead.

DetailsMotivation: Training-based TTS methods increase computational burden, so the paper focuses on training-free TTS for efficient reasoning.

Method: Develops Conditional Step-level Self-refinement (sequential scaling) and combines it with parallel scaling for Hybrid TTS.

Result: Hybrid TTS significantly improves reasoning performance across various LLMs (3B-14B).

Conclusion: Training-free Hybrid TTS offers a promising paradigm for expanding LLM reasoning capabilities.

Abstract: Test-Time Scaling (TTS) is a promising approach to progressively elicit the model’s intelligence during inference. Recently, training-based TTS methods, such as continued reinforcement learning (RL), have further surged in popularity, while training-free TTS methods are gradually fading from prominence. However, the additional computation overhead of training amplifies the burden on test-time scaling. In this paper, we focus on training-free TTS methods for reasoning. We first design Conditional Step-level Self-refinement, a fine-grained sequential scaling method guided by process verification. On top of its effectiveness, we further combine it with other classical parallel scaling methods at the step level, to introduce a novel inference paradigm called Hybrid Test-Time Scaling. Extensive experiments on five instruction-tuned LLMs across different scales (3B-14B) and families demonstrate that a hybrid strategy incorporating various training-free TTS methods at a fine granularity has considerable potential for expanding the reasoning performance boundaries of LLMs.
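
One way to picture the hybrid paradigm: at each reasoning step, sample several candidate steps in parallel, score them with a process verifier, and trigger self-refinement only when the best candidate falls below a threshold (the "conditional" part). Every function argument and the 0.5 threshold are placeholders, not the paper's exact procedure:

```python
def hybrid_step(generate, verify, refine, context: str, n: int = 4) -> str:
    candidates = [generate(context) for _ in range(n)]    # parallel scaling
    scores = [verify(context, c) for c in candidates]     # process verifier
    best, best_score = max(zip(candidates, scores), key=lambda cs: cs[1])
    if best_score < 0.5:                                  # conditional, step-level
        best = refine(context, best)                      # sequential scaling
    return best

def solve(generate, verify, refine, question: str, max_steps: int = 16) -> str:
    context = question
    for _ in range(max_steps):
        step = hybrid_step(generate, verify, refine, context)
        context += "\n" + step
        if "Final answer" in step:
            break
    return context
```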

[68] Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification

Vitaly Protasov, Nikolay Babakov, Daryna Dementieva, Alexander Panchenko

Main category: cs.CL

TL;DR: The paper addresses the challenge of evaluating text style transfer (TST) in multilingual settings, focusing on text detoxification across nine languages. It compares neural-based evaluation models and LLM-as-a-judge approaches, offering practical insights for reliable evaluation.

DetailsMotivation: The gap between automatic metrics and human judgments in TST evaluation, especially in multilingual contexts, motivates this study. Prior work's focus on English leaves multilingual TST evaluation underexplored.

Method: The study conducts a comprehensive multilingual evaluation of text detoxification systems across nine languages, comparing neural-based models and LLM-as-a-judge approaches.

Result: The findings highlight the effectiveness of modern evaluation methods and provide a practical framework for designing reliable multilingual TST evaluation pipelines.

Conclusion: The study bridges the gap in multilingual TST evaluation, offering actionable insights for improving evaluation pipelines in text detoxification.

Abstract: Despite recent progress in large language models (LLMs), evaluation of text generation tasks such as text style transfer (TST) remains a significant challenge. Recent studies (Dementieva et al., 2024; Pauli et al., 2025) revealed a substantial gap between automatic metrics and human judgments. Moreover, most prior work focuses exclusively on English, leaving multilingual TST evaluation largely unexplored. In this paper, we perform the first comprehensive multilingual study on the evaluation of text detoxification systems across nine languages: English, Spanish, German, Chinese, Arabic, Hindi, Ukrainian, Russian, and Amharic. Drawing inspiration from machine translation, we assess the effectiveness of modern neural-based evaluation models alongside prompting-based LLM-as-a-judge approaches. Our findings provide a practical recipe for designing a more reliable multilingual TST evaluation pipeline for the text detoxification case.

[69] Smart Eyes for Silent Threats: VLMs and In-Context Learning for THz Imaging

Nicolas Poggi, Shashank Agnihotri, Margret Keuper

Main category: cs.CL

TL;DR: The paper introduces In-Context Learning (ICL) with Vision-Language Models (VLMs) for THz image classification, improving performance and interpretability without fine-tuning.

DetailsMotivation: THz imaging faces challenges like limited annotations, low resolution, and visual ambiguity, making effective classification difficult.

Method: Adapts two open-weight VLMs to the THz domain using modality-aligned prompting, evaluated under zero-shot and one-shot settings.

Result: ICL enhances classification and interpretability in low-data regimes, marking the first application of ICL-enhanced VLMs to THz imaging.

Conclusion: ICL with VLMs offers a promising, flexible solution for resource-constrained THz imaging applications.

Abstract: Terahertz (THz) imaging enables non-invasive analysis for applications such as security screening and material classification, but effective image classification remains challenging due to limited annotations, low resolution, and visual ambiguity. We introduce In-Context Learning (ICL) with Vision-Language Models (VLMs) as a flexible, interpretable alternative that requires no fine-tuning. Using a modality-aligned prompting framework, we adapt two open-weight VLMs to the THz domain and evaluate them under zero-shot and one-shot settings. Our results show that ICL improves classification and interpretability in low-data regimes. This is the first application of ICL-enhanced VLMs to THz imaging, offering a promising direction for resource-constrained scientific domains. (Code: https://github.com/Nicolas-Poggi/Project_THz_Classification/tree/main)

[70] Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation

Xinping Zhao, Shouzheng Huang, Yan Zhong, Xinshuo Hu, Baotian Hu, Min Zhang

Main category: cs.CL

TL;DR: LEAR improves RAG by learning to extract rational evidence through explicit reasoning and conscious extraction, enhancing LLM accuracy.

DetailsMotivation: Retrieval noises degrade LLM generation quality, and existing methods lack explicit reasoning, risking key clue omission and poor generalization.

Method: LEAR combines evidence reasoning and extraction into unified training, uses knowledge token masks for disentanglement, and applies verifiable rewards for optimization.

Result: LEAR outperforms on benchmarks, providing compact, high-quality evidence and boosting downstream task accuracy.

Conclusion: LEAR effectively denoises RAG systems, improving evidence quality and LLM performance.

Abstract: Retrieval-Augmented Generation (RAG) effectively improves the accuracy of Large Language Models (LLMs). However, retrieval noises significantly impact the quality of LLMs’ generation, necessitating the development of denoising mechanisms. Previous methods extract evidence straightforwardly without explicit thinking, which risks filtering out key clues and struggles with generalization. To this end, we propose LEAR, which learns to extract rational evidence by (1) explicitly reasoning to identify potential cues within retrieval contents first, and then (2) consciously extracting to avoid omitting any key cues helpful for answering questions. Specifically, we frame evidence reasoning and evidence extraction into one unified response for end-to-end training; apply knowledge token masks for disentanglement to derive reasoning-based and extraction-based answers; and devise three types of verifiable reward functions, including answer, length, and format, to update the model via the policy optimization algorithm. Extensive experiments on three benchmark datasets show the effectiveness of LEAR, providing compact and high-quality evidence, improving the accuracy of downstream tasks, and promoting effective application in online RAG systems.
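
The three reward terms are rule-checkable without a learned judge, which is what makes them "verifiable". A sketch, with the evidence-tag format, length budget, and weights all assumed for illustration:

```python
import re

def lear_style_reward(response: str, extracted_evidence: str,
                      predicted_answer: str, gold_answer: str,
                      max_evidence_tokens: int = 256) -> float:
    """Answer + length + format rewards, combined with assumed weights."""
    r_answer = 1.0 if predicted_answer.strip() == gold_answer.strip() else 0.0
    r_length = 1.0 if len(extracted_evidence.split()) <= max_evidence_tokens else 0.0
    r_format = 1.0 if re.search(r"<evidence>.*</evidence>", response, re.S) else 0.0
    return 0.6 * r_answer + 0.2 * r_length + 0.2 * r_format
```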

[71] Conflicting narratives and polarization on social media

Armin Pournaki

Main category: cs.CL

TL;DR: The paper analyzes conflicting narratives in political discourse on Twitter, revealing polarization and alignment mechanisms.

DetailsMotivation: To understand how conflicting narratives shape political polarization and issue alignment in public discourse.

Method: Analyzed tweets from opposing opinion groups in the German Twittersphere (2021-2023), focusing on issues like Ukraine war, Covid, and climate change.

Result: Identified conflicting narratives through role attributions and emplotment, and found evidence of narrative alignment across issues.

Conclusion: Narratives serve as a valuable analytical tool for studying discursive polarization and alignment strategies.

Abstract: Narratives are key interpretative devices by which humans make sense of political reality. In this work, we show how the analysis of conflicting narratives, i.e. conflicting interpretive lenses through which political reality is experienced and told, provides insight into the discursive mechanisms of polarization and issue alignment in the public sphere. Building upon previous work that has identified ideologically polarized issues in the German Twittersphere between 2021 and 2023, we analyze the discursive dimension of polarization by extracting textual signals of conflicting narratives from tweets of opposing opinion groups. Focusing on a selection of salient issues and events (the war in Ukraine, Covid, climate change), we show evidence for conflicting narratives along two dimensions: (i) different attributions of actantial roles to the same set of actants (e.g. diverging interpretations of the role of NATO in the war in Ukraine), and (ii) emplotment of different actants for the same event (e.g. Bill Gates in the right-leaning Covid narrative). Furthermore, we provide first evidence for patterns of narrative alignment, a discursive strategy that political actors employ to align opinions across issues. These findings demonstrate the use of narratives as an analytical lens into the discursive mechanisms of polarization.

[72] Leveraging Context for Multimodal Fallacy Classification in Political Debates

Alessio Pittiglio

Main category: cs.CL

TL;DR: The paper presents a multimodal approach for detecting logical fallacies in political debates, achieving comparable performance between text and multimodal models.

DetailsMotivation: To advance research in multimodal argument mining, specifically targeting logical fallacies in political debates.

Method: Uses pretrained Transformer-based models and explores leveraging context in multimodal data (text, audio).

Result: Achieved macro F1-scores of 0.4444 (text), 0.3559 (audio), and 0.4403 (multimodal).

Conclusion: The multimodal model’s performance is comparable to text-only, indicating room for improvement in leveraging multimodal data.

Abstract: In this paper, we present our submission to the MM-ArgFallacy2025 shared task, which aims to advance research in multimodal argument mining, focusing on logical fallacies in political debates. Our approach uses pretrained Transformer-based models and proposes several ways to leverage context. In the fallacy classification subtask, our models achieved macro F1-scores of 0.4444 (text), 0.3559 (audio), and 0.4403 (multimodal). Our multimodal model showed performance comparable to the text-only model, suggesting potential for improvements.

[73] P3: Prompts Promote Prompting

Xinyu Zhang, Yuanquan Hu, Fangchao Liu, Zhicheng Dou

Main category: cs.CL

TL;DR: P3 is a self-improvement framework that optimizes both system and user prompts concurrently, outperforming unilateral approaches in LLM applications.

DetailsMotivation: Unilateral optimization of system or user prompts often leads to suboptimal results due to their interdependence.

Method: P3 employs an iterative process to optimize both prompts simultaneously and leverages offline-optimized prompts for online query-dependent optimization.

Result: P3 achieves superior performance in general and reasoning tasks, demonstrating the effectiveness of holistic optimization.

Conclusion: A holistic approach to prompt optimization significantly enhances LLM performance across diverse domains.

Abstract: Current large language model (LLM) applications often employ multi-component prompts, comprising both system and user prompts, to guide model behaviors. While recent advancements have demonstrated the efficacy of automatically optimizing either the system or user prompt to boost performance, such unilateral approaches often yield suboptimal outcomes due to the interdependent nature of these components. In this work, we introduce P3, a novel self-improvement framework that concurrently optimizes both system and user prompts through an iterative process. The offline optimized prompts are further leveraged to promote online prompting by performing query-dependent prompt optimization. Extensive experiments on general tasks (e.g., Arena-hard and Alpaca-eval) and reasoning tasks (e.g., GSM8K and GPQA) demonstrate that P3 achieves superior performance in the realm of automatic prompt optimization. Our results highlight the effectiveness of a holistic optimization strategy in enhancing LLM performance across diverse domains.

[74] CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models

Congmin Zheng, Jiachen Zhu, Jianghao Lin, Xinyi Dai, Yong Yu, Weinan Zhang, Mengyue Yang

Main category: cs.CL

TL;DR: CoLD framework mitigates length bias in Process Reward Models (PRMs) for LLMs, improving reward prediction reliability and conciseness in reasoning.

DetailsMotivation: Existing PRMs exhibit length bias, favoring longer reasoning steps regardless of semantic validity, undermining reliability and output quality.

Method: Proposes CoLD: explicit length-penalty adjustment, learned bias estimator, and joint training for length-invariant rewards, grounded in counterfactual reasoning.

Result: CoLD reduces reward-length correlation, improves step selection accuracy, and promotes concise, valid reasoning in MATH500 and GSM-Plus benchmarks.

Conclusion: CoLD effectively enhances PRM fidelity and robustness, addressing length bias for more reliable and concise reasoning outputs.

Abstract: Process Reward Models (PRMs) play a central role in evaluating and guiding multi-step reasoning in large language models (LLMs), especially for mathematical problem solving. However, we identify a pervasive length bias in existing PRMs: they tend to assign higher scores to longer reasoning steps, even when the semantic content and logical validity are unchanged. This bias undermines the reliability of reward predictions and leads to overly verbose outputs during inference. To address this issue, we propose CoLD (Counterfactually-Guided Length Debiasing), a unified framework that mitigates length bias through three components: an explicit length-penalty adjustment, a learned bias estimator trained to capture spurious length-related signals, and a joint training strategy that enforces length-invariance in reward predictions. Our approach is grounded in counterfactual reasoning and informed by causal graph analysis. Extensive experiments on MATH500 and GSM-Plus show that CoLD consistently reduces reward-length correlation, improves accuracy in step selection, and encourages more concise, logically valid reasoning. These results demonstrate the effectiveness and practicality of CoLD in improving the fidelity and robustness of PRMs.
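
The first two components compose naturally: subtract both an explicit per-token penalty and a learned, length-only bias estimate from the raw PRM score, so that what remains is (ideally) length-invariant. A minimal PyTorch sketch with assumed shapes and coefficients:

```python
import torch
import torch.nn as nn

class LengthBiasEstimator(nn.Module):
    """Predicts the spurious, length-driven part of a PRM score from the
    step length alone (a minimal reading of CoLD's bias estimator)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, step_len: torch.Tensor) -> torch.Tensor:
        return self.net(step_len.unsqueeze(-1)).squeeze(-1)

def debiased_reward(prm_score: torch.Tensor, step_len: torch.Tensor,
                    bias: LengthBiasEstimator, lam: float = 0.01) -> torch.Tensor:
    # step_len is a float tensor of token counts, shape (batch,).
    return prm_score - lam * step_len - bias(step_len)
```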

[75] Compositional Understanding in Signaling Games

David Peter Wallis Freeborn

Main category: cs.CL

TL;DR: The paper introduces two new signaling game models where receivers learn compositional information, addressing a limitation in standard models where compositional understanding fails.

DetailsMotivation: Standard signaling game models fail to enable receivers to learn or retain compositional information, leading to loss of context when parts of messages are forgotten.

Method: Two new models are proposed: a minimalist receiver learning from atomic message components and a generalist receiver utilizing all available information.

Result: The new models are simpler and successfully allow receivers to learn from atomic components of messages.

Conclusion: The proposed models effectively address the compositional learning problem in signaling games, offering simpler and more functional alternatives to existing approaches.

Abstract: Receivers in standard signaling game models struggle with learning compositional information. Even when the signalers send compositional messages, the receivers do not interpret them compositionally. When information from one message component is lost or forgotten, the information from other components is also erased. In this paper I construct signaling game models in which genuine compositional understanding evolves. I present two new models: a minimalist receiver who only learns from the atomic messages of a signal, and a generalist receiver who learns from all of the available information. These models are in many ways simpler than previous alternatives, and allow the receivers to learn from the atomic components of messages.
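
The following toy reproduces the flavor of the minimalist receiver in a two-feature Lewis signaling game with simple Roth-Erev reinforcement: the receiver keeps one urn per atomic symbol, so each message component is learned, and can be forgotten, independently. This is an illustrative reconstruction, not the paper's exact model.

```python
import random
from collections import defaultdict

# Toy two-feature signaling game: states are pairs (f1, f2), signals are
# pairs of atomic symbols. The "minimalist" receiver reinforces one urn per
# atomic symbol per position, so each component is interpreted on its own.
FEATURES, SYMBOLS = [0, 1], ["a", "b"]
sender = defaultdict(lambda: {s: 1.0 for s in SYMBOLS})    # urn per (position, feature)
receiver = defaultdict(lambda: {f: 1.0 for f in FEATURES}) # urn per (position, symbol)

def draw(urn):
    """Draw a key from an urn with probability proportional to its weight."""
    r = random.uniform(0, sum(urn.values()))
    for key, w in urn.items():
        r -= w
        if r <= 0:
            return key
    return key

for _ in range(20000):
    state = (random.choice(FEATURES), random.choice(FEATURES))
    signal = tuple(draw(sender[(i, f)]) for i, f in enumerate(state))
    guess = tuple(draw(receiver[(i, s)]) for i, s in enumerate(signal))
    if guess == state:  # reinforce the atomic choices on success
        for i, f in enumerate(state):
            sender[(i, f)][signal[i]] += 1.0
            receiver[(i, signal[i])][f] += 1.0

# Each atomic symbol should end up mapped to one feature value per position.
print({k: max(v, key=v.get) for k, v in receiver.items()})
```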

[76] Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?

Seok Hwan Song, Mohna Chakraborty, Qi Li, Wallapak Tavanapong

Main category: cs.CL

TL;DR: The study explores how different question types affect LLM accuracy in reasoning tasks, revealing performance variations and influencing factors like question format and wording.

DetailsMotivation: To understand the impact of question types on LLM performance in reasoning tasks, an area not previously explored.

Method: Evaluated five LLMs on three question types (multiple-choice, true/false, short/long answers) using quantitative and deductive reasoning tasks, measuring accuracy in reasoning steps and final answer selection.

Result: (1) Performance varies significantly by question type. (2) Reasoning accuracy doesn’t always align with final answer accuracy. (3) Number of options and wording influence LLM performance.

Conclusion: Question type and design significantly impact LLM reasoning accuracy, highlighting the need for careful evaluation framework design.

Abstract: Large Language Models (LLMs) have been evaluated using diverse question types, e.g., multiple-choice, true/false, and short/long answers. This study answers an unexplored question about the impact of different question types on LLM accuracy on reasoning tasks. We investigate the performance of five LLMs on three different types of questions using quantitative and deductive reasoning tasks. The performance metrics include accuracy in the reasoning steps and choosing the final answer. Key Findings: (1) Significant differences exist in LLM performance across different question types. (2) Reasoning accuracy does not necessarily correlate with the final selection accuracy. (3) The number of options and the choice of words influence LLM performance.

[77] Chinchunmei at SemEval-2025 Task 11: Boosting the Large Language Model’s Capability of Emotion Perception using Contrastive Learning

Tian Li, Yujian Sun, Huizhi Liang

Main category: cs.CL

TL;DR: The paper discusses SemEval-2025 Task 11, focusing on emotion detection in 28 languages using contrastive learning methods. It achieved top-tier rankings in multi-label classification and emotion intensity prediction.

DetailsMotivation: To address challenges in emotion detection due to diverse expressions and backgrounds, the task encourages advanced approaches like contrastive learning.

Method: Two contrastive learning approaches were explored: sample-based (Contrastive Reasoning Calibration) and generation-based (DPO, SimPO), fine-tuned from LLaMa3-Instruct-8B.

Result: The system ranked 9th in Track A (multi-label classification) and 6th in Track B (emotion intensity prediction) for English, with strong performance in other languages.

Conclusion: Contrastive learning methods effectively improve emotion detection, demonstrating competitive performance across multiple languages.

Abstract: The SemEval-2025 Task 11, Bridging the Gap in Text-Based Emotion Detection, introduces an emotion recognition challenge spanning over 28 languages. This competition encourages researchers to explore more advanced approaches to address the challenges posed by the diversity of emotional expressions and background variations. It features two tracks: multi-label classification (Track A) and emotion intensity prediction (Track B), covering six emotion categories: anger, fear, joy, sadness, surprise, and disgust. In our work, we systematically explore the benefits of two contrastive learning approaches: sample-based (Contrastive Reasoning Calibration) and generation-based (DPO, SimPO) contrastive learning. The sample-based contrastive approach trains the model by comparing two samples to generate more reliable predictions. The generation-based contrastive approach trains the model to differentiate between correct and incorrect generations, refining its prediction. All models are fine-tuned from LLaMa3-Instruct-8B. Our system achieves 9th place in Track A and 6th place in Track B for English, while ranking among the top-tier performing systems for other languages.
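
Of the two generation-based objectives named, DPO is the most standard; a minimal sketch of its loss follows (sequence log-probabilities assumed precomputed; this is the generic DPO objective, not the team's training script).

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective: push the policy to prefer the correct
    generation over the incorrect one, relative to a frozen reference."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up sequence log-probabilities:
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-13.0]), torch.tensor([-14.2]))
print(loss.item())
```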

[78] From Queries to Criteria: Understanding How Astronomers Evaluate LLMs

Alina Hyk, Kiera McCormick, Mian Zhong, Ioana Ciucă, Sanjib Sharma, John F Wu, J. E. G. Peek, Kartheik G. Iyer, Ziang Xiao, Anjalie Field

Main category: cs.CL

TL;DR: The paper explores improving LLM evaluation by studying user interactions with an LLM-powered astronomy bot, leading to better benchmarks for scientific research.

DetailsMotivation: There's a gap between how LLMs are benchmarked and how users actually evaluate them, especially in scientific contexts like astronomy.

Method: Inductive coding of 368 queries to an LLM-powered astronomy bot and interviews with 11 astronomers.

Result: Identified user evaluation criteria and question types, leading to recommendations for better benchmarks.

Conclusion: The study provides actionable insights to enhance LLM evaluation and usability in scientific research.

Abstract: There is growing interest in leveraging LLMs to aid in astronomy and other scientific research, but benchmarks for LLM evaluation in general have not kept pace with the increasingly diverse ways that real people evaluate and use these models. In this study, we seek to improve evaluation procedures by building an understanding of how users evaluate LLMs. We focus on a particular use case: an LLM-powered retrieval-augmented generation bot for engaging with astronomical literature, which we deployed via Slack. Our inductive coding of 368 queries to the bot over four weeks and our follow-up interviews with 11 astronomers reveal how humans evaluated this system, including the types of questions asked and the criteria for judging responses. We synthesize our findings into concrete recommendations for building better benchmarks, which we then employ in constructing a sample benchmark for evaluating LLMs for astronomy. Overall, our work offers ways to improve LLM evaluation and ultimately usability, particularly for use in scientific research.

[79] BEnchmarking LLMs for Ophthalmology (BELO) for Ophthalmological Knowledge and Reasoning

Sahana Srinivasan, Xuguang Ai, Thaddaeus Wai Soon Lo, Aidan Gilson, Minjie Zou, Ke Zou, Hyunjae Kim, Mingjia Yang, Krithi Pushpanathan, Samantha Yew, Wan Ting Loke, Jocelyn Goh, Yibing Chen, Yiming Kong, Emily Yuelei Fu, Michelle Ongyong Hui, Kristen Nwanyanwu, Amisha Dave, Kelvin Zhenghao Li, Chen-Hsin Sun, Mark Chia, Gabriel Dawei Yang, Wendy Meihua Wong, David Ziyou Chen, Dianbo Liu, Maxwell Singer, Fares Antaki, Lucian V Del Priore, Jost Jonas, Ron Adelman, Qingyu Chen, Yih-Chung Tham

Main category: cs.CL

TL;DR: BELO is a new benchmark for evaluating LLMs in ophthalmology, focusing on clinical accuracy and reasoning quality. It includes 900 expert-reviewed MCQs and uses multiple metrics for assessment.

DetailsMotivation: Existing benchmarks for LLMs in ophthalmology are limited and overly focused on accuracy, lacking comprehensive evaluation.

Method: BELO was developed by curating ophthalmology-specific MCQs from diverse datasets, refined by expert ophthalmologists, and evaluated using accuracy, macro-F1, and text-generation metrics.

Result: Six LLMs were evaluated using BELO, with a public leaderboard established for transparent reporting. The dataset consists of 900 high-quality questions.

Conclusion: BELO provides a standardized, expert-reviewed benchmark for fair and reproducible evaluation of LLMs in ophthalmology.

Abstract: Current benchmarks evaluating large language models (LLMs) in ophthalmology are limited in scope and disproportionately prioritise accuracy. We introduce BELO (BEnchmarking LLMs for Ophthalmology), a standardized and comprehensive evaluation benchmark developed through multiple rounds of expert checking by 13 ophthalmologists. BELO assesses ophthalmology-related clinical accuracy and reasoning quality. Using keyword matching and a fine-tuned PubMedBERT model, we curated ophthalmology-specific multiple-choice questions (MCQs) from diverse medical datasets (BCSC, MedMCQA, MedQA, BioASQ, and PubMedQA). The dataset underwent multiple rounds of expert checking. Duplicate and substandard questions were systematically removed. Ten ophthalmologists refined the explanations of each MCQ’s correct answer. This was further adjudicated by three senior ophthalmologists. To illustrate BELO’s utility, we evaluated six LLMs (OpenAI o1, o3-mini, GPT-4o, DeepSeek-R1, Llama-3-8B, and Gemini 1.5 Pro) using accuracy, macro-F1, and five text-generation metrics (ROUGE-L, BERTScore, BARTScore, METEOR, and AlignScore). In a further evaluation involving human experts, two ophthalmologists qualitatively reviewed 50 randomly selected outputs for accuracy, comprehensiveness, and completeness. BELO consists of 900 high-quality, expert-reviewed questions aggregated from five sources: BCSC (260), BioASQ (10), MedMCQA (572), MedQA (40), and PubMedQA (18). A public leaderboard has been established to promote transparent evaluation and reporting. Importantly, the BELO dataset will remain a hold-out, evaluation-only benchmark to ensure fair and reproducible comparisons of future models.

[80] Understanding Large Language Models’ Ability on Interdisciplinary Research

Yuanhao Shen, Daniel Xavier de Sousa, Ricardo Marçal, Ali Asad, Hongyu Guo, Xiaodan Zhu

Main category: cs.CL

TL;DR: IDRBench is introduced as a benchmark to evaluate LLMs’ ability in interdisciplinary research (IDR) idea generation, revealing gaps despite some awareness.

DetailsMotivation: The lack of a dedicated benchmark for assessing LLMs' interdisciplinary research capabilities hinders understanding their potential and limitations.

Method: IDRBench includes expert-annotated datasets from six ArXiv disciplines and tasks like IDR Paper Identification, Idea Integration, and Idea Recommendation.

Result: LLMs show limited ability to produce quality IDR ideas, despite some awareness, as tested across 10 models.

Conclusion: IDRBench provides a framework for future LLM development in interdisciplinary research, highlighting current shortcomings.

Abstract: Recent advancements in Large Language Models (LLMs) have revealed their impressive ability to perform multi-step, logic-driven reasoning across complex domains, positioning them as powerful tools and collaborators in scientific discovery while challenging the long-held view that inspiration-driven ideation is uniquely human. However, the lack of a dedicated benchmark that evaluates LLMs’ ability to develop ideas in Interdisciplinary Research (IDR) settings poses a critical barrier to fully understanding their strengths and limitations. To address this gap, we introduce IDRBench – a pioneering benchmark featuring an expert annotated dataset and a suite of tasks tailored to evaluate LLMs’ capabilities in proposing valuable research ideas from different scientific domains for interdisciplinary research. This benchmark aims to provide a systematic framework for assessing LLM performance in complex, cross-domain scientific research. Our dataset consists of scientific publications sourced from the ArXiv platform covering six distinct disciplines, and is annotated by domain experts with diverse academic backgrounds. To ensure high-quality annotations, we emphasize clearly defined dimensions that characterize authentic interdisciplinary research. The design of evaluation tasks in IDRBench follows a progressive, real-world perspective, reflecting the natural stages of interdisciplinary research development, including 1) IDR Paper Identification, 2) IDR Idea Integration, and 3) IDR Idea Recommendation. Using IDRBench, we construct baselines across 10 LLMs and observe that despite fostering some level of IDR awareness, LLMs still struggle to produce quality IDR ideas. These findings could not only spark new research directions, but also help to develop next-generation LLMs that excel in interdisciplinary research.

[81] A Fisher’s exact test justification of the TF-IDF term-weighting scheme

Paul Sheridan, Zeyad Ahmed, Aitazaz A. Farooque

Main category: cs.CL

TL;DR: The paper justifies TF-IDF from a significance testing perspective, linking it to Fisher’s exact test and its p-value.

DetailsMotivation: To provide a theoretical foundation for TF-IDF's effectiveness by connecting it to statistical significance testing.

Method: Demonstrates that TF-ICF (a TF-IDF variant) relates to the negative logarithm of a p-value from Fisher’s exact test under certain conditions.

Result: Establishes a connection between TF-IDF and statistical significance, showing convergence to TF-IDF in large document collections.

Conclusion: The Fisher’s exact test justification offers statisticians a clear explanation for TF-IDF’s long-standing effectiveness.

Abstract: Term frequency-inverse document frequency, or TF-IDF for short, is arguably the most celebrated mathematical expression in the history of information retrieval. Conceived as a simple heuristic quantifying the extent to which a given term’s occurrences are concentrated in any one given document out of many, TF-IDF and its many variants are routinely used as term-weighting schemes in diverse text analysis applications. There is a growing body of scholarship dedicated to placing TF-IDF on a sound theoretical foundation. Building on that tradition, this paper justifies the use of TF-IDF to the statistics community by demonstrating how the famed expression can be understood from a significance testing perspective. We show that the common TF-IDF variant TF-ICF is, under mild regularity conditions, closely related to the negative logarithm of the $p$-value from a one-tailed version of Fisher’s exact test of statistical significance. As a corollary, we establish a connection between TF-IDF and the said negative log-transformed $p$-value under certain idealized assumptions. We further demonstrate, as a limiting case, that this same quantity converges to TF-IDF in the limit of an infinitely large document collection. The Fisher’s exact test justification of TF-IDF equips the working statistician with a ready explanation of the term-weighting scheme’s long-established effectiveness.
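
To make the correspondence concrete, the snippet below computes the one-tailed Fisher's exact p-value for a term-document 2x2 table via the hypergeometric tail, and compares its negative log to a TF-ICF-style score built from the same counts. The counts and the particular TF-ICF variant here are assumptions for illustration; the paper states the exact regularity conditions under which the two quantities agree.

```python
import math
from scipy.stats import hypergeom

# 2x2 table ingredients for term t and document d: a = count of t in d,
# n = total count of t in the collection, N = length of d in tokens,
# M = total tokens in the collection. All numbers are made up.
a, n, N, M = 12, 400, 2000, 5_000_000

# One-tailed Fisher's exact test: probability of seeing >= a occurrences of t
# in a random N-token sample from the collection (hypergeometric tail).
p = hypergeom.sf(a - 1, M, n, N)

# One common TF-ICF variant: term frequency times log(collection size /
# collection frequency). Both quantities grow as t concentrates in d; the
# paper gives the asymptotic regime in which they coincide.
tf_icf = a * math.log(M / n)

print(f"-log p = {-math.log(p):.2f}, TF-ICF = {tf_icf:.2f}")
```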

[82] DialogueForge: LLM Simulation of Human-Chatbot Dialogue

Ruizhe Zhu, Hao Zhu, Yaxuan Li, Syang Zhou, Shijing Cai, Malgorzata Lazuka, Elliott Ash

Main category: cs.CL

TL;DR: DialogueForge is a framework for generating AI-simulated human-chatbot dialogues using seed prompts from real interactions, tested with various LLMs. Proprietary models like GPT-4o perform best, while smaller models (e.g., Llama, Mistral) show promise with fine-tuning. Coherent long-form dialogues remain a challenge.

DetailsMotivation: Manual collection of human-chatbot dialogues is time-consuming and limits conversational AI research. DialogueForge aims to automate this process.

Method: Uses seed prompts from real interactions and tests various LLMs (proprietary and open-source) to generate multi-turn dialogues. Fine-tuning enhances smaller models.

Result: Large proprietary models (e.g., GPT-4o) generate more realistic dialogues, while smaller models improve with fine-tuning. Coherent long-form dialogues are challenging.

Conclusion: DialogueForge offers a scalable solution for dialogue generation, with proprietary models leading in quality and smaller models benefiting from customization. Long-form coherence remains an open challenge.

Abstract: Collecting human-chatbot dialogues typically demands substantial manual effort and is time-consuming, which limits and poses challenges for research on conversational AI. In this work, we propose DialogueForge - a framework for generating AI-simulated conversations in human-chatbot style. To initialize each generated conversation, DialogueForge uses seed prompts extracted from real human-chatbot interactions. We test a variety of LLMs to simulate the human chatbot user, ranging from state-of-the-art proprietary models to small-scale open-source LLMs, and generate multi-turn dialogues tailored to specific tasks. In addition, we explore fine-tuning techniques to enhance the ability of smaller models to produce indistinguishable human-like dialogues. We evaluate the quality of the simulated conversations and compare different models using the UniEval and GTEval evaluation protocols. Our experiments show that large proprietary models (e.g., GPT-4o) generally outperform others in generating more realistic dialogues, while smaller open-source models (e.g., Llama, Mistral) offer promising performance with greater customization. We demonstrate that the performance of smaller models can be significantly improved by employing supervised fine-tuning techniques. Nevertheless, maintaining coherent and natural long-form human-like dialogues remains a common challenge across all models.

[83] Interaction as Intelligence: Deep Research With Human-AI Partnership

Lyumanshan Ye, Xiaojie Cai, Xinkai Wang, Junfei Wang, Xiangkun Hu, Jiadi Su, Yang Nan, Sihan Wang, Bohan Zhang, Xiaoze Fan, Jinbin Luo, Yuxiang Zheng, Tianze Xu, Dayuan Fu, Yunze Wu, Pengrui Lu, Zengzhi Wang, Yiwei Qin, Zhen Huang, Yan Ma, Zhulin Hu, Haoyang Zou, Tiantian Mi, Yixin Ye, Ethan Chern, Pengfei Liu

Main category: cs.CL

TL;DR: The paper redefines human-AI interaction as a core aspect of intelligence, introducing Deep Cognition for cognitive oversight, outperforming traditional systems in key metrics.

DetailsMotivation: Current AI systems treat interaction as a passive interface, leading to inefficiencies like error cascades and missed expertise integration.

Method: Deep Cognition enables transparent, interruptible interaction, fine-grained dialogue, and shared cognitive context for human-guided AI thinking.

Result: Outperforms baselines in transparency (+20.0%), fine-grained interaction (+29.2%), and four other metrics, with improvements of 31.8 to 50.0 percentage points on research tasks.

Conclusion: Interaction is essential for AI intelligence; Deep Cognition’s cognitive oversight paradigm significantly enhances research efficiency and collaboration.

Abstract: This paper introduces the “Interaction as Intelligence” research series, presenting a reconceptualization of human-AI relationships in deep research tasks. Traditional approaches treat interaction merely as an interface for accessing AI capabilities, a conduit between human intent and machine output. We propose that interaction itself constitutes a fundamental dimension of intelligence. As AI systems engage in extended thinking processes for research tasks, meaningful interaction transitions from an optional enhancement to an essential component of effective intelligence. Current deep research systems adopt an “input-wait-output” paradigm where users initiate queries and receive results after black-box processing. This approach leads to error cascade effects, inflexible research boundaries that prevent question refinement during investigation, and missed opportunities for expertise integration. To address these limitations, we introduce Deep Cognition, a system that transforms the human role from giving instructions to cognitive oversight, a mode of engagement where humans guide AI thinking processes through strategic intervention at critical junctures. Deep Cognition implements three key innovations: (1) transparent, controllable, and interruptible interaction that reveals AI reasoning and enables intervention at any point; (2) fine-grained bidirectional dialogue; and (3) shared cognitive context, where the system observes and adapts to user behaviors without explicit instruction. User evaluation demonstrates that this cognitive oversight paradigm outperforms the strongest baseline across six key metrics: Transparency (+20.0%), Fine-Grained Interaction (+29.2%), Real-Time Intervention (+18.5%), Ease of Collaboration (+27.7%), Results-Worth-Effort (+8.8%), and Interruptibility (+20.7%). Evaluations on challenging research problems show improvements of 31.8 to 50.0 percentage points over existing deep research systems.

[84] Supernova: Achieving More with Less in Transformer Architectures

Andrei-Valentin Tanase, Elena Pelican

Main category: cs.CL

TL;DR: Supernova, a 650M-parameter transformer, matches 1B-model performance with fewer parameters and tokens, using innovative architecture and tokenization.

DetailsMotivation: To challenge the scaling paradigm by showing architectural efficiency and tokenization can compensate for reduced model size.

Method: Combines RoPE, GQA (3:1 compression), RMSNorm, SwiGLU, and a custom 128K-vocabulary byte-level BPE tokenizer.

Result: Achieves 90% of 1B-model performance with 53% fewer parameters and 100B training tokens (10x less than competitors).

Conclusion: Architectural and tokenization innovations can replace sheer model size, redefining efficiency in transformer models.

Abstract: We present Supernova, a 650M-parameter decoder-only transformer that demonstrates how careful architectural design and tokenization innovation can achieve the performance of larger models while maintaining computational efficiency. Our architecture combines Rotary Positional Embeddings (RoPE), Grouped Query Attention (GQA) with a 3:1 compression ratio, RMSNorm for computational efficiency, and SwiGLU activation functions. A critical innovation is our custom 128,000-vocabulary byte-level BPE tokenizer, which achieves state-of-the-art compression performance. Through detailed analysis, we show that Supernova achieves 90% of the performance of 1B-parameter models while using 53% fewer parameters and requiring only 100B training tokens, an order of magnitude less than competing models. Our findings challenge the prevailing scaling paradigm, demonstrating that architectural efficiency and tokenization quality can compensate for reduced parameter counts.
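
Two of the named components are easy to show in a few lines of PyTorch. The sketches below (RMSNorm and SwiGLU) plus an illustrative 3:1 GQA head split are generic implementations of those published techniques; all dimensions and head counts are assumptions, not taken from the model.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: scale by root-mean-square, no mean subtraction (cheaper than LayerNorm)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: SiLU-gated linear unit."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)
    def forward(self, x):
        return self.w_down(nn.functional.silu(self.w_gate(x)) * self.w_up(x))

# Illustrative GQA split at the stated 3:1 compression: every three query
# heads share one key/value head (these head counts are assumptions).
config = dict(n_query_heads=24, n_kv_heads=8)
```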

[85] Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

Jiakang Wang, Runze Liu, Fuzheng Zhang, Xiu Li, Guorui Zhou

Main category: cs.CL

TL;DR: Archer introduces an entropy-aware RLVR method with dual-token constraints and synchronous updates, improving reasoning in LLMs by treating knowledge and reasoning tokens differently.

DetailsMotivation: Previous RLVR methods apply uniform training signals to all tokens, ignoring the distinct roles of knowledge and reasoning tokens, which can hinder learning.

Method: Archer uses weaker KL regularization and higher clipping thresholds for reasoning tokens to encourage exploration, while applying stronger constraints on knowledge tokens to preserve facts.

Result: Archer outperforms prior RLVR methods on math and code benchmarks, matching or exceeding state-of-the-art performance for models of similar size.

Conclusion: Archer’s entropy-aware approach effectively balances exploration and factual accuracy, advancing RLVR for LLMs.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method for improving the reasoning abilities of Large Language Models (LLMs), mainly by shaping higher-order behaviors such as reflection and planning. However, previous RLVR algorithms often apply uniform training signals to all tokens, without considering the different roles of low-entropy knowledge-related tokens and high-entropy reasoning-related tokens. Some recent methods try to separate these token types by gradient masking or asynchronous updates, but these approaches may break semantic dependencies in the model output and hinder effective learning. In this work, we propose Archer, an entropy-aware RLVR approach with dual-token constraints and synchronous updates. Specifically, our method applies weaker KL regularization and higher clipping thresholds to reasoning tokens to encourage exploration, while using stronger constraints on knowledge tokens to maintain factual knowledge. Experimental results on several mathematical reasoning and code generation benchmarks show that our approach significantly outperforms previous RLVR methods, reaching or exceeding state-of-the-art performance among models of comparable size. The code is available at https://github.com/wizard-III/ArcherCodeR.
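
A sketch of the dual-token idea in a PPO-style objective: a per-token entropy test routes each token to either a loose clip with weak KL (reasoning) or a tight clip with strong KL (knowledge). The threshold and coefficients are placeholders, not the paper's values; the linked repository has the real implementation.

```python
import torch

def dual_token_loss(logp_new, logp_old, advantages, entropy, ref_kl,
                    entropy_threshold=1.0):
    """Illustrative dual-token objective: high-entropy (reasoning) tokens get
    a wider clip and weaker KL to encourage exploration; low-entropy
    (knowledge) tokens get a tighter clip and stronger KL to preserve facts.
    All tensors are per-token; all coefficients are assumptions."""
    is_reasoning = entropy > entropy_threshold
    clip_eps = torch.where(is_reasoning,
                           torch.full_like(entropy, 0.3),   # looser clipping
                           torch.full_like(entropy, 0.1))   # tighter clipping
    kl_coef = torch.where(is_reasoning,
                          torch.full_like(entropy, 0.01),   # weak KL anchor
                          torch.full_like(entropy, 0.1))    # strong KL anchor

    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    pg_loss = -torch.min(unclipped, clipped)
    return (pg_loss + kl_coef * ref_kl).mean()
```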

[86] Reservoir Computing as a Language Model

Felix Köster, Atsushi Uchida

Main category: cs.CL

TL;DR: The paper compares reservoir computing (RC) and transformer-based models for language tasks, highlighting RC’s efficiency and transformers’ superior performance.

DetailsMotivation: Address the energy and speed bottlenecks of LLMs by exploring RC for efficient natural text processing.

Method: Compare two RC approaches (static linear readout and attention-enhanced) with transformers, varying trainable parameters equally.

Result: Transformers outperform in quality, but RC is faster and more efficient. Attention-enhanced RC shows promise.

Conclusion: RC offers a resource-efficient alternative to transformers, with attention-enhanced RC balancing performance and efficiency.

Abstract: Large Language Models (LLMs) have dominated the science and media landscape due to their impressive performance in processing large amounts of data and producing human-like text. Nevertheless, their huge energy demand and slow processing remain a bottleneck to further increasing quality while also keeping the models accessible to everyone. To address this bottleneck, we investigate how reservoir computing performs on natural text processing, which could enable fast and energy-efficient hardware implementations. Studies investigating the use of reservoir computing as a language model remain sparse. In this paper, we compare three distinct approaches to character-level language modeling: two different reservoir computing approaches, in which only an output layer is trainable, and the well-known transformer-based architectures, which fully learn an attention-based sequence representation. We explore the performance, computational cost, and prediction accuracy of both paradigms by equally varying the number of trainable parameters for all models. Using a consistent pipeline for all three approaches, we demonstrate that transformers excel in prediction quality, whereas reservoir computers remain highly efficient, with faster training and inference. Furthermore, we investigate two types of reservoir computing: a traditional reservoir with a static linear readout, and an attention-enhanced reservoir that dynamically adapts its output weights via an attention mechanism. Our findings underline how these paradigms scale and offer guidelines for balancing resource constraints with performance.
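
The first reservoir variant (a fixed recurrent reservoir with a trainable linear readout) fits in a short NumPy sketch for next-character prediction. Everything below, hyperparameters included, is illustrative rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
text = "hello world " * 200
chars = sorted(set(text))
idx = {c: i for i, c in enumerate(chars)}
V, R = len(chars), 300                     # vocab size, reservoir size

W_in = rng.uniform(-0.5, 0.5, (R, V))      # fixed input weights
W = rng.uniform(-0.5, 0.5, (R, R))         # fixed recurrent weights
W *= 0.9 / np.abs(np.linalg.eigvals(W)).max()  # spectral radius below 1

# Drive the fixed reservoir with one-hot characters; collect states.
states, targets = [], []
x = np.zeros(R)
for c_in, c_out in zip(text[:-1], text[1:]):
    u = np.eye(V)[idx[c_in]]
    x = np.tanh(W_in @ u + W @ x)
    states.append(x.copy())
    targets.append(idx[c_out])

# Only the static linear readout is trained, here by ridge regression.
S = np.array(states)
Y = np.eye(V)[targets]
W_out = np.linalg.solve(S.T @ S + 1e-3 * np.eye(R), S.T @ Y)
pred = (S @ W_out).argmax(axis=1)
print("train accuracy:", (pred == np.array(targets)).mean())
```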

[87] Operationalizing AI for Good: Spotlight on Deployment and Integration of AI Models in Humanitarian Work

Anton Abilov, Ke Zhang, Hemank Lamba, Elizabeth M. Olson, Joel R. Tetreault, Alejandro Jaimes

Main category: cs.CL

TL;DR: The paper highlights the gap in AI for Good literature regarding deployment and collaboration with partner organizations, sharing insights from a real-world humanitarian project.

DetailsMotivation: To address the lack of discussion on deployment and real-world impact in AI for Good publications.

Method: Close collaboration with a humanitarian organization, deploying and maintaining an AI model in resource-constrained settings.

Result: Key takeaways for practitioners on effective deployment and maintenance of AI models in humanitarian contexts.

Conclusion: The work emphasizes the importance of collaboration and practical deployment strategies for AI in humanitarian efforts.

Abstract: Publications in the AI for Good space have tended to focus on the research and model development that can support high-impact applications. However, very few AI for Good papers discuss the process of deploying and collaborating with the partner organization, and the resulting real-world impact. In this work, we share details of our close collaboration with a humanitarian-to-humanitarian (H2H) organization, describing not only how to deploy an AI model in a resource-constrained environment but also how to maintain it for continuous performance updates, and we share key takeaways for practitioners.

[88] The Impact of Language Mixing on Bilingual LLM Reasoning

Yihao Li, Jiayi Xin, Miranda Muqing Miao, Qi Long, Lyle Ungar

Main category: cs.CL

TL;DR: Language mixing in bilingual LLMs enhances reasoning, as shown by a 5.6% accuracy drop when enforcing monolingual decoding. RLVR training drives this behavior, and a probe can predict beneficial switches, boosting accuracy by 6.25%.

DetailsMotivation: To understand why bilingual LLMs mix languages during reasoning and whether this behavior strategically improves performance.

Method: Studied Chinese-English bilingual models, identified RLVR as the cause of language mixing, and tested monolingual vs. mixed decoding. Developed a probe to predict beneficial language switches.

Result: Language mixing improves reasoning accuracy (5.6% drop without it). The probe increased accuracy by up to 6.25%.

Conclusion: Language mixing is a strategic reasoning behavior, not just a training byproduct, and can be leveraged to enhance model performance.

Abstract: Proficient multilingual speakers often intentionally switch languages in the middle of a conversation. Similarly, recent reasoning-focused bilingual large language models (LLMs) with strong capabilities in both languages exhibit language mixing, alternating languages within their chain of thought. Discouraging this behavior in DeepSeek-R1 was found to degrade accuracy, suggesting that language mixing may benefit reasoning. In this work, we study language switching in Chinese-English bilingual reasoning models. We identify reinforcement learning with verifiable rewards (RLVR) as the critical training stage that leads to language mixing. We demonstrate that language mixing can enhance reasoning: enforcing monolingual decoding reduces accuracy by 5.6 percentage points on math reasoning tasks. Additionally, a lightweight probe can be trained to predict whether a potential language switch would benefit or harm reasoning, and when used to guide decoding, increases accuracy by up to 6.25 percentage points. Our findings suggest that language mixing is not merely a byproduct of multilingual training, but is a strategic reasoning behavior.
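
The "lightweight probe" admits a very small sketch: a logistic classifier over the hidden state at a candidate switch point, gating the switch at decode time. The feature choice and labeling scheme below are assumptions (the stand-in data is random); in the paper the labels would come from whether switching helped on paired decoding runs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed setup: h[i] is the hidden state at a candidate switch point and
# y[i] is 1 if allowing a language switch there improved the final answer.
rng = np.random.default_rng(0)
h = rng.normal(size=(500, 64))                               # stand-in hidden states
y = (h[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)   # stand-in labels

probe = LogisticRegression(max_iter=1000).fit(h, y)

def allow_switch(hidden_state, threshold=0.5):
    """At decoding time, permit a switch only when the probe predicts benefit."""
    return probe.predict_proba(hidden_state.reshape(1, -1))[0, 1] > threshold

print(allow_switch(h[0]))
```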

[89] 3LM: Bridging Arabic, STEM, and Code through Benchmarking

Basma El Amel Boussaha, Leen AlQadi, Mugariya Farooq, Shaikha Alsuwaidi, Giulia Campesan, Ahmed Alzubaidi, Mohammed Alyafeai, Hakim Hacid

Main category: cs.CL

TL;DR: The paper introduces 3LM, a suite of three Arabic benchmarks for STEM and code generation, addressing gaps in existing Arabic LLM evaluations.

DetailsMotivation: Limited Arabic LLM benchmarks focus on linguistic/cultural content, neglecting STEM and code domains, which are crucial for real-world applications.

Method: Developed three benchmarks: (1) STEM Q&A from Arabic textbooks, (2) synthetic STEM questions, (3) translated code benchmarks with human review.

Result: Publicly released 3LM benchmarks to support Arabic LLM research in underrepresented STEM and code domains.

Conclusion: 3LM fills a critical gap in Arabic LLM evaluation, promoting broader research and application in STEM and coding.

Abstract: Arabic is one of the most widely spoken languages in the world, yet efforts to develop and evaluate Large Language Models (LLMs) for Arabic remain relatively limited. Most existing Arabic benchmarks focus on linguistic, cultural, or religious content, leaving a significant gap in domains like STEM and code which are increasingly relevant for real-world LLM applications. To help bridge this gap, we present 3LM, a suite of three benchmarks designed specifically for Arabic. The first is a set of STEM-related question-answer pairs, naturally sourced from Arabic textbooks and educational worksheets. The second consists of synthetically generated STEM questions, created using the same sources. The third benchmark focuses on code generation, built through a careful translation of two widely used code benchmarks, incorporating a human-in-the-loop process with several rounds of review to ensure high-quality and faithful translations. We release all three benchmarks publicly to support the growth of Arabic LLM research in these essential but underrepresented areas.

[90] Transformers and Ensemble methods: A solution for Hate Speech Detection in Arabic languages

Angel Felipe Magnossão de Paula, Imene Bensalem, Paolo Rosso, Wajdi Zaghouani

Main category: cs.CL

TL;DR: The paper evaluates six transformer models and ensemble methods for hate speech detection, achieving best results with a majority vote ensemble (F1: 0.60, Accuracy: 0.86).

DetailsMotivation: To address hate speech detection as part of the CERIST NLP Challenge 2022 by leveraging transformer models and ensemble techniques.

Method: Evaluated six transformer models and two ensemble approaches (including majority vote) using five-fold cross-validation.

Result: Majority vote ensemble performed best (F1: 0.60, Accuracy: 0.86) on the test set.

Conclusion: Ensemble methods, particularly majority vote, improve hate speech detection performance.

Abstract: This paper describes our participation in the shared task of hate speech detection, which is one of the subtasks of the CERIST NLP Challenge 2022. Our experiments evaluate the performance of six transformer models and their combination using two ensemble approaches. The best results on the training set, in a five-fold cross-validation scenario, were obtained by using the ensemble approach based on the majority vote. The evaluation of this approach on the test set resulted in an F1-score of 0.60 and an accuracy of 0.86.
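
The winning ensemble is a plain majority vote over the per-model labels; a minimal sketch of such a vote:

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one label list per model, aligned by example index."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

# Toy usage: three models voting on four examples (1 = hate speech).
print(majority_vote([[1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]]))  # -> [1, 0, 0, 0]
```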

[91] Where Do People Tell Stories Online? Story Detection Across Online Communities

Maria Antoniak, Joel Mire, Maarten Sap, Elliott Ash, Andrew Piper

Main category: cs.CL

TL;DR: The paper introduces StorySeeker, a toolkit for detecting stories in online communities, featuring a dataset of 502 Reddit posts, annotations, and models for document- and span-level storytelling detection.

DetailsMotivation: Story detection in online communities is difficult due to scattered and interwoven storytelling. The study aims to address this challenge.

Method: Built the StorySeeker toolkit with annotated Reddit data, a codebook, and predictive models. Evaluated detection methods and analyzed textual features.

Result: Identified distinctive features of online storytelling and its distribution. Demonstrated utility in inter- and intra-community research via a case study.

Conclusion: The toolkit aids narratology and online community studies, with implications for future research.

Abstract: Story detection in online communities is a challenging task as stories are scattered across communities and interwoven with non-storytelling spans within a single text. We address this challenge by building and releasing the StorySeeker toolkit, including a richly annotated dataset of 502 Reddit posts and comments, a detailed codebook adapted to the social media context, and models to predict storytelling at the document and span levels. Our dataset is sampled from hundreds of popular English-language Reddit communities ranging across 33 topic categories, and it contains fine-grained expert annotations, including binary story labels, story spans, and event spans. We evaluate a range of detection methods using our data, and we identify the distinctive textual features of online storytelling, focusing on storytelling spans. We illuminate distributional characteristics of storytelling on a large community-centric social media platform, and we also conduct a case study on r/ChangeMyView, where storytelling is used as one of many persuasive strategies, illustrating that our data and models can be used for both inter- and intra-community research. Finally, we discuss implications of our tools and analyses for narratology and the study of online communities.

[92] A Survey of the Evolution of Language Model-Based Dialogue Systems: Data, Task and Models

Hongru Wang, Lingzhi Wang, Yiming Du, Liang Chen, Jingyan Zhou, Yufei Wang, Kam-Fai Wong

Main category: cs.CL

TL;DR: This survey explores the relationship between language models (LMs) and dialogue systems, reviewing their evolution and discussing future directions for LLM-based systems.

DetailsMotivation: To understand how advancements in language models (PLMs, LLMs) have influenced dialogue systems (TOD, ODD) and to guide future research.

Method: A systematic review of dialogue system history, categorized by LM breakthroughs, and analysis of emerging topics and challenges.

Result: Comprehensive insights into the interplay between LMs and dialogue systems, highlighting state-of-the-art outcomes and open challenges.

Conclusion: The survey provides a roadmap for future developments in LM-based dialogue systems, emphasizing their dynamic relationship.

Abstract: Dialogue systems (DS), including task-oriented dialogue systems (TOD) and open-domain dialogue systems (ODD), have always been a fundamental task in natural language processing (NLP), enabling various applications in practice. Owing to sophisticated training and well-designed model architectures, language models (LMs) are usually adopted as the necessary backbone for building dialogue systems. Consequently, every breakthrough in LMs brings about a shift in the learning paradigm and research attention within dialogue systems, especially the appearance of pre-trained language models (PLMs) and large language models (LLMs). In this paper, we take a deep look at the history of dialogue systems, especially their special relationship with the advancements of language models. Our survey thus offers a systematic perspective, categorizing different stages in chronological order aligned with LM breakthroughs and providing a comprehensive review of state-of-the-art research outcomes. Moreover, we turn our attention to emerging topics and discuss open challenges, providing valuable insights into future directions for LLM-based dialogue systems. In summary, this survey delves into the dynamic interplay between language models and dialogue systems, unraveling the evolutionary path of this essential relationship. Through this exploration, we pave the way for a deeper comprehension of the field, guiding future developments in LM-based dialogue systems.

[93] VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension

Thinh Phuoc Ngo, Khoa Tran Anh Dang, Son T. Luu, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

Main category: cs.CL

TL;DR: The paper introduces VlogQA, a Vietnamese spoken language corpus for MRC, derived from YouTube transcripts, addressing gaps in existing formal corpora. It achieved an F1 score of 75.34% and EM of 53.97%, highlighting challenges in spoken language processing.

DetailsMotivation: Existing Vietnamese MRC corpora focus on formal written documents, neglecting spoken language. VlogQA fills this gap by using real-world YouTube data.

Method: Developed VlogQA with 10,076 QA pairs from 1,230 YouTube transcripts (food/travel topics). Evaluated performance using deep-learning models.

Result: Achieved F1 score of 75.34% and EM of 53.97%, showing progress but also challenges in processing spoken content.

Conclusion: VlogQA is a valuable resource for Vietnamese MRC research, though spoken language processing remains challenging and requires further improvement.

Abstract: This paper presents the development process of a Vietnamese spoken language corpus for machine reading comprehension (MRC) tasks and provides insights into the challenges and opportunities associated with using real-world data for machine reading comprehension tasks. Existing MRC corpora in Vietnamese mainly focus on formal written documents such as Wikipedia articles, online newspapers, or textbooks. In contrast, VlogQA consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube, an extensive source of user-uploaded content covering the topics of food and travel. By capturing the spoken language of native Vietnamese speakers in natural settings, an area largely overlooked in Vietnamese research, the corpus provides a valuable resource for future research on reading comprehension tasks for the Vietnamese language. Regarding performance evaluation, our deep-learning models achieved the highest F1 score of 75.34% on the test set, indicating significant progress in machine reading comprehension for Vietnamese spoken language data. In terms of EM, the highest score we achieved is 53.97%, which reflects the challenge of processing spoken-based content and highlights the need for further improvement.
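
The reported EM and F1 are the standard extractive-QA metrics; for reference, a SQuAD-style token-level computation looks like the following (a generic sketch, not the authors' evaluation script).

```python
from collections import Counter

def exact_match(pred, gold):
    """1 if the normalized strings are identical, else 0."""
    return int(pred.strip().lower() == gold.strip().lower())

def f1_score(pred, gold):
    """Token-overlap F1 between prediction and gold answer."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("ở Hà Nội", "Hà Nội"), round(f1_score("ở Hà Nội", "Hà Nội"), 2))
```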

[94] Meta4XNLI: A Crosslingual Parallel Corpus for Metaphor Detection and Interpretation

Elisa Sanchez-Bayona, Rodrigo Agerri

Main category: cs.CL

TL;DR: Meta4XNLI is a parallel dataset for NLI, annotated for metaphor detection in English and Spanish, showing encoder models outperform decoders in metaphor detection and highlighting translation’s role in metaphor preservation.

DetailsMotivation: To evaluate language models' ability to capture deeper aspects of meaning like metaphor, and to provide a resource for multilingual and cross-lingual metaphor analysis.

Method: Creation of Meta4XNLI dataset for NLI, annotated for metaphor detection and interpretation, comparing encoder- and decoder-based models.

Result: Fine-tuned encoders outperform decoders in metaphor detection; metaphor interpretation performance drops with metaphorical language. Translation affects metaphor preservation.

Conclusion: Meta4XNLI advances metaphor analysis in language models, offering insights into cross-lingual metaphor transferability and translation impacts.

Abstract: Metaphors are a ubiquitous but often overlooked part of everyday language. As a complex cognitive-linguistic phenomenon, they provide a valuable means to evaluate whether language models can capture deeper aspects of meaning, including semantic, pragmatic, and cultural context. In this work, we present Meta4XNLI, the first parallel dataset for Natural Language Inference (NLI) newly annotated for metaphor detection and interpretation in both English and Spanish. Meta4XNLI facilitates the comparison of encoder- and decoder-based models in detecting and understanding metaphorical language in multilingual and cross-lingual settings. Our results show that fine-tuned encoders outperform decoder-only LLMs in metaphor detection. Metaphor interpretation is evaluated via the NLI framework, with masked and autoregressive models performing comparably; performance notably decreases when the inference involves metaphorical language. Our study also finds that translation plays an important role in the preservation or loss of metaphors across languages, introducing shifts that might impact metaphor occurrence and model performance. These findings underscore the importance of resources like Meta4XNLI for advancing the analysis of the capabilities of language models and improving our understanding of metaphor processing across languages. Furthermore, the dataset offers previously unavailable opportunities to investigate metaphor interpretation, cross-lingual metaphor transferability, and the impact of translation on the development of multilingual annotated resources.

[95] Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles

Devichand Budagam, Ashutosh Kumar, Mahsa Khoshnoodi, Sankalp KJ, Vinija Jain, Aman Chadha

Main category: cs.CL

TL;DR: The paper introduces Hierarchical Prompting Taxonomy (HPT) to evaluate LLMs by analyzing task cognitive demands, using a framework (HPF) and index (HPI) for standardized assessment. Experiments show HPT improves LLM performance by 2%-63%, with GSM8k as the most complex task.

DetailsMotivation: To systematically assess LLMs' strengths and weaknesses by understanding the cognitive demands of tasks they perform.

Method: Develops HPT, grounded in human cognition, using HPF (five hierarchical prompting strategies) and HPI (task complexity metric). Evaluates LLMs across datasets.

Result: HPF boosts LLM performance by 2%-63%, with GSM8k having the highest HPI (3.20), indicating its cognitive complexity.

Conclusion: HPT provides a standardized way to evaluate LLMs and task complexity, enhancing performance and offering insights for future research.

Abstract: Assessing the effectiveness of large language models (LLMs) in performing different tasks is crucial for understanding their strengths and weaknesses. This paper presents the Hierarchical Prompting Taxonomy (HPT), grounded in human cognitive principles and designed to assess LLMs by examining the cognitive demands of various tasks. The HPT utilizes the Hierarchical Prompting Framework (HPF), which structures five unique prompting strategies in a hierarchical order based on their cognitive requirements on LLMs compared to human mental capabilities. It assesses the complexity of tasks with the Hierarchical Prompting Index (HPI), which demonstrates the cognitive competencies of LLMs across diverse datasets and offers insights into the cognitive demands that datasets place on different LLMs. This approach enables a comprehensive evaluation of an LLM's problem-solving abilities and the intricacy of a dataset, offering a standardized metric for task complexity. Extensive experiments with multiple datasets and LLMs show that HPF enhances LLM performance by 2% to 63% compared to baseline performance, with GSM8k being the most cognitively complex task among reasoning and coding tasks (average HPI of 3.20), confirming the effectiveness of HPT. To support future research and reproducibility in this domain, the implementations of HPT and HPF are available here.

[96] Why Does New Knowledge Create Messy Ripple Effects in LLMs?

Jiaxin Qin, Zixuan Zhang, Manling Li, Pengfei Yu, Heng Ji

Main category: cs.CL

TL;DR: The paper investigates why post-training knowledge editing (KE) in language models (LMs) often fails to handle ripple effects, introducing GradSim as an indicator to predict these effects.

DetailsMotivation: To understand and address the failure of KE methods in managing ripple effects, where LMs should accurately answer logically related knowledge after edits.

Method: The authors analyze KE methods and introduce GradSim, a gradient similarity metric between original and related knowledge, to predict ripple effects.

Result: GradSim shows a strong positive correlation with ripple effect performance across LMs, KE methods, and metrics, and identifies failure cases like Negation, Over-Ripple, and Multi-Lingual.

Conclusion: GradSim is validated as an effective indicator for predicting ripple effects in LMs, addressing a key challenge in KE.

Abstract: Extensive previous research has focused on post-training knowledge editing (KE) for language models (LMs) to ensure that knowledge remains accurate and up-to-date. One desired property and open question in KE is to let edited LMs correctly handle ripple effects, where LM is expected to answer its logically related knowledge accurately. In this paper, we answer the question of why most KE methods still create messy ripple effects. We conduct extensive analysis and identify a salient indicator, GradSim, that effectively reveals when and why updated knowledge ripples in LMs. GradSim is computed by the cosine similarity between gradients of the original fact and its related knowledge. We observe a strong positive correlation between ripple effect performance and GradSim across different LMs, KE methods, and evaluation metrics. Further investigations into three counter-intuitive failure cases (Negation, Over-Ripple, Multi-Lingual) of ripple effects demonstrate that these failures are often associated with very low GradSim. This finding validates that GradSim is an effective indicator of when knowledge ripples in LMs.
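
GradSim itself is direct to compute; a PyTorch sketch follows, with the model and per-fact loss left abstract (`loss_fn` is a placeholder for the language-modeling loss on a fact).

```python
import torch

def flat_grad(model, loss):
    """Flatten d(loss)/d(params) into a single vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def grad_sim(model, loss_fn, fact_batch, related_batch):
    """Cosine similarity between the gradients of the original fact and a
    logically related fact; higher GradSim ~ the edit is more likely to ripple."""
    g_fact = flat_grad(model, loss_fn(model, fact_batch))
    g_related = flat_grad(model, loss_fn(model, related_batch))
    return torch.nn.functional.cosine_similarity(g_fact, g_related, dim=0)
```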

[97] Enhancing Natural Language Inference Performance with Knowledge Graph for COVID-19 Automated Fact-Checking in Indonesian Language

Arief Purnama Muharram, Ayu Purwarianti

Main category: cs.CL

TL;DR: The paper proposes using a Knowledge Graph (KG) to enhance Natural Language Inference (NLI) performance for automated COVID-19 fact-checking in Indonesian, achieving improved accuracy.

DetailsMotivation: To address performance stagnation in deep learning for fact-checking due to lack of knowledge during training.

Method: A model with three modules: fact (KG processing), NLI (semantic relationships), and classifier (final result).

Result: Incorporating KGs improved NLI performance, achieving an accuracy of 0.8616.

Conclusion: KGs are valuable for enhancing NLI in automated fact-checking.

Abstract: Automated fact-checking is a key strategy to overcome the spread of COVID-19 misinformation on the internet. These systems typically leverage deep learning approaches through Natural Language Inference (NLI) to verify the truthfulness of information based on supporting evidence. However, one challenge that arises in deep learning is performance stagnation due to a lack of knowledge during training. This study proposes using a Knowledge Graph (KG) as external knowledge to enhance NLI performance for automated COVID-19 fact-checking in the Indonesian language. The proposed model architecture comprises three modules: a fact module, an NLI module, and a classifier module. The fact module processes information from the KG, while the NLI module handles semantic relationships between the given premise and hypothesis. The representation vectors from both modules are concatenated and fed into the classifier module to produce the final result. The model was trained using the generated Indonesian COVID-19 fact-checking dataset and the COVID-19 KG Bahasa Indonesia. Our study demonstrates that incorporating KGs can significantly improve NLI performance in fact-checking, achieving the best accuracy of 0.8616. This suggests that KGs are a valuable component for enhancing NLI performance in automated fact-checking.
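
The three-module design maps naturally onto a small PyTorch skeleton: encode the KG evidence, encode the premise-hypothesis pair, concatenate, classify. Encoder internals and dimensions below are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class KGFactCheckModel(nn.Module):
    """Sketch of the three-module design: a fact (KG) encoder and an NLI
    encoder produce vectors that are concatenated and classified. Both
    encoders are placeholders for the paper's actual modules."""
    def __init__(self, fact_encoder, nli_encoder, fact_dim, nli_dim, n_labels=3):
        super().__init__()
        self.fact_encoder = fact_encoder    # processes KG evidence
        self.nli_encoder = nli_encoder      # encodes (premise, hypothesis)
        self.classifier = nn.Linear(fact_dim + nli_dim, n_labels)

    def forward(self, kg_input, premise_hypothesis):
        v_fact = self.fact_encoder(kg_input)
        v_nli = self.nli_encoder(premise_hypothesis)
        return self.classifier(torch.cat([v_fact, v_nli], dim=-1))
```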

[98] DARE: Diverse Visual Question Answering with Robustness Evaluation

Hannah Sterz, Jonas Pfeiffer, Ivan Vulić

Main category: cs.CL

TL;DR: DARE is a new benchmark for evaluating VLMs on diverse VQA tasks and robustness, revealing significant performance gaps and brittleness in state-of-the-art models.

DetailsMotivation: Existing benchmarks fail to assess VLMs' robustness and reasoning abilities, prompting the creation of DARE for comprehensive evaluation.

Method: DARE introduces a multiple-choice VQA benchmark with five diverse categories and four robustness evaluations (prompts, answer options, output format, and correct answers).

Result: VLMs struggle with most categories and robustness tests, with performance drops up to 34%. Closed-source models outperform open-source ones but remain brittle.

Conclusion: DARE highlights VLMs’ limitations in reasoning and robustness, urging further research to improve their reliability and versatility.

Abstract: Vision Language Models (VLMs) extend remarkable capabilities of text-only large language models and vision-only models, and are able to learn from and process multi-modal vision-text input. While modern VLMs perform well on a number of standard image classification and image-text matching tasks, they still struggle with a number of crucial vision-language (VL) reasoning abilities such as counting and spatial reasoning. Moreover, while they might be very brittle to small variations in instructions and/or evaluation protocols, existing benchmarks fail to evaluate their robustness (or rather the lack of it). In order to couple challenging VL scenarios with comprehensive robustness evaluation, we introduce DARE, Diverse Visual Question Answering with Robustness Evaluation, a carefully created and curated multiple-choice VQA benchmark. DARE evaluates VLM performance on five diverse categories and includes four robustness-oriented evaluations based on the variations of: prompts, the subsets of answer options, the output format and the number of correct answers. Among a spectrum of other findings, we report that state-of-the-art VLMs still struggle with questions in most categories and are unable to consistently deliver their peak performance across the tested robustness evaluations. The worst case performance across the subsets of options is up to 34% below the performance in the standard case. The robustness of the open-source VLMs such as LLaVA 1.6 and Idefics2 cannot match the closed-source models such as GPT-4 and Gemini, but even the latter remain very brittle to different variations.

[99] CSSL: Contrastive Self-Supervised Learning for Dependency Parsing on Relatively Free Word Ordered and Morphologically Rich Low Resource Languages

Pretam Ray, Jivnesh Sandhan, Amrith Krishna, Pawan Goyal

Main category: cs.CL

TL;DR: The paper investigates enhancing dependency parsing for morphologically rich languages by making models robust to word order variations, proposing a contrastive self-supervised learning method that improves performance by 3.03/2.95 points (UAS/LAS).

DetailsMotivation: To improve dependency parsing performance for morphologically rich languages with free word order by leveraging their inherent flexibility.

Method: Examines graph-based parsing architectures on 7 free word order languages, using data augmentation and removing position encoding. Proposes a contrastive self-supervised learning method.

Result: Achieves an average gain of 3.03/2.95 points (UAS/LAS) over baselines in 7 languages.

Conclusion: The proposed method effectively enhances robustness to word order variations in dependency parsing for morphologically rich languages.

Abstract: Neural dependency parsing has achieved remarkable performance for low resource morphologically rich languages. It has also been well-studied that morphologically rich languages exhibit relatively free word order. This prompts a fundamental investigation: Is there a way to enhance dependency parsing performance, making the model robust to word order variations utilizing the relatively free word order nature of morphologically rich languages? In this work, we examine the robustness of graph-based parsing architectures on 7 relatively free word order languages. We focus on scrutinizing essential modifications such as data augmentation and the removal of position encoding required to adapt these architectures accordingly. To this end, we propose a contrastive self-supervised learning method to make the model robust to word order variations. Furthermore, our proposed modification demonstrates a substantial average gain of 3.03/2.95 points across 7 relatively free word order languages, as measured by the UAS/LAS metrics, when compared to the best performing baseline.
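
One way to read the proposed objective is as an InfoNCE loss that treats a word-order permutation of a sentence as a positive view of itself; a sketch of that reading is below. The paper's exact loss and augmentation policy may differ, and a real parser would permute words while keeping the dependency tree fixed.

```python
import torch
import torch.nn.functional as F

def order_invariance_loss(encoder, token_embs, temperature=0.1):
    """InfoNCE over a batch: each sentence and a random permutation of its
    word order form a positive pair; other sentences act as negatives.
    Assumes encoder maps (batch, seq, dim) -> (batch, dim)."""
    permuted = token_embs[:, torch.randperm(token_embs.size(1)), :]
    z1 = F.normalize(encoder(token_embs), dim=-1)
    z2 = F.normalize(encoder(permuted), dim=-1)
    logits = z1 @ z2.T / temperature
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)
```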

[100] CCSBench: Evaluating Compositional Controllability in LLMs for Scientific Document Summarization

Yixi Ding, Jiaying Wu, Tongyao Zhu, Yanxia Qin, Qian Liu, Min-Yen Kan

Main category: cs.CL

TL;DR: CCSBench is introduced as the first benchmark for compositional controllable summarization in science, enabling fine-grained control over explicit and implicit attributes. Experiments with LLMs reveal limitations in balancing trade-offs, especially for implicit attributes.

DetailsMotivation: To address the underexplored area of controlling multiple attributes (e.g., length, empirical focus) in scientific document summarization for diverse audiences.

Method: Introduces CCSBench for evaluating compositional control. Tests LLMs under various settings (in-context learning, fine-tuning, modular methods) to balance control over explicit and implicit attributes.

Result: LLMs struggle to balance trade-offs between control attributes, particularly for implicit ones requiring deeper understanding.

Conclusion: CCSBench highlights the need for improved methods in compositional controllable summarization, especially for handling implicit attributes.

Abstract: To broaden the dissemination of scientific knowledge to diverse audiences, it is desirable for scientific document summarization systems to simultaneously control multiple attributes such as length and empirical focus. However, existing research typically focuses on controlling single attributes, leaving the compositional control of multiple attributes underexplored. To address this gap, we introduce CCSBench, the first evaluation benchmark for compositional controllable summarization in the scientific domain. Our benchmark enables fine-grained control over both explicit attributes (e.g., length), which are objective and straightforward, and implicit attributes (e.g., conceptual or empirical focus), which are more subjective and abstract. We conduct extensive experiments with various large language models (LLMs) under different settings, including in-context learning, parameter-efficient fine-tuning, and two-stage modular methods for balancing control over different attributes. Our findings reveal significant limitations in LLMs' capabilities in balancing trade-offs between control attributes, especially implicit ones that require deeper understanding and abstract reasoning.

[101] Vulnerability of LLMs to Vertically Aligned Text Manipulations

Zhecheng Li, Yiwei Wang, Bryan Hooi, Yujun Cai, Zhen Xiong, Nanyun Peng, Kai-wei Chang

Main category: cs.CL

TL;DR: Vertical text input degrades LLM accuracy in classification tasks; CoT reasoning doesn’t help, but few-shot learning does. Tokenization and attention issues are key causes.

DetailsMotivation: To investigate if decoder-based LLMs are vulnerable to vertical text input, given its real-world implications for harmful content detection.

Method: Analyzed impact of vertical text on LLMs across datasets, tested CoT and few-shot learning, and examined tokenization/attention issues.

Result: Vertical input lowers accuracy; CoT fails, but few-shot learning helps. Tokenization and attention matrices are problematic.

Conclusion: LLMs are vulnerable to vertical text; few-shot learning mitigates it, but deeper tokenization/attention fixes are needed.

Abstract: Vertical text input is commonly encountered in various real-world applications, such as mathematical computations and word-based Sudoku puzzles. While current large language models (LLMs) have excelled in natural language tasks, they remain vulnerable to variations in text formatting. Recent research demonstrates that modifying input formats, such as vertically aligning words for encoder-based models, can substantially lower accuracy in text classification tasks. While easily understood by humans, these inputs can significantly mislead models, posing a potential risk of bypassing detection in real-world scenarios involving harmful or sensitive information. With the expanding application of LLMs, a crucial question arises: do decoder-based LLMs exhibit similar vulnerabilities to vertically formatted text input? In this paper, we investigate the impact of vertical text input on the performance of various LLMs across multiple text classification datasets and analyze the underlying causes. Our findings are as follows: (i) Vertical text input significantly degrades the accuracy of LLMs in text classification tasks. (ii) Chain-of-Thought (CoT) reasoning does not help LLMs recognize vertical input or mitigate its vulnerability, but few-shot learning with careful analysis does. (iii) We explore the underlying cause of the vulnerability by analyzing the inherent issues in tokenization and attention matrices.
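
To make the studied input transformation concrete, here is a minimal sketch of one way to verticalize text (the paper's exact formatting may differ): each word is written top-to-bottom, with words laid out as side-by-side columns.

```python
def verticalize(text: str) -> str:
    """Render each word top-to-bottom, with words as side-by-side columns.
    Illustrative only; the paper's exact formatting may differ."""
    words = text.split()
    height = max(len(w) for w in words)
    rows = []
    for i in range(height):
        # Pad short words with spaces so the columns stay aligned.
        rows.append(" ".join(w[i] if i < len(w) else " " for w in words))
    return "\n".join(rows)

print(verticalize("buy now"))
# b n
# u o
# y w
```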

[102] Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking

Mubashara Akhtar, Michael Schlichtkrull, Andreas Vlachos

Main category: cs.CL

TL;DR: Ev²R combines reference-based evaluation and verdict-level proxy scoring to improve automated fact-checking by aligning evidence with gold references and supporting verdicts reliably.

DetailsMotivation: Current AFC methods rely on flawed evaluation metrics and closed knowledge sources, limiting their effectiveness.

Method: Ev²R jointly evaluates evidence alignment with gold references and verdict support, outperforming reference-based, proxy-reference, and reference-less baselines.

Result: Ev²R shows higher accuracy and robustness, correlating better with human judgments and resisting adversarial tests.

Conclusion: Ev²R is a reliable metric for evidence evaluation in AFC, addressing prior limitations.

Abstract: Current automated fact-checking (AFC) approaches typically evaluate evidence either implicitly via the predicted verdicts or through exact matches with predefined closed knowledge sources, such as Wikipedia. However, these methods are limited due to their reliance on evaluation metrics originally designed for other purposes and constraints from closed knowledge sources. In this work, we introduce Ev²R, which combines the strengths of reference-based evaluation and verdict-level proxy scoring. Ev²R jointly assesses how well the evidence aligns with the gold references and how reliably it supports the verdict, addressing the shortcomings of prior methods. We evaluate Ev²R against three types of evidence evaluation approaches: reference-based, proxy-reference, and reference-less baselines. Assessments against human ratings and adversarial tests demonstrate that Ev²R consistently outperforms existing scoring approaches in accuracy and robustness. It achieves stronger correlation with human judgments and greater robustness to adversarial perturbations, establishing it as a reliable metric for evidence evaluation in AFC. Code is available at https://github.com/mubasharaak/fc-evidence-evaluation

[103] DRS: Deep Question Reformulation With Structured Output

Zhecheng Li, Yiwei Wang, Bryan Hooi, Yujun Cai, Nanyun Peng, Kai-Wei Chang

Main category: cs.CL

TL;DR: DRS is a zero-shot method combining LLMs with DFS to improve question reformulation, boosting GPT-3.5’s accuracy from 23.03% to 70.42%.

DetailsMotivation: LLMs struggle to help users reformulate unanswerable questions due to insufficient understanding.

Method: DRS uses a DFS-based algorithm to explore entity combinations and constrain outputs with predefined entities.

Result: DRS improves GPT-3.5’s reformulation accuracy to 70.42% and Gemma2-9B’s to 56.75%.

Conclusion: DRS significantly enhances LLMs’ ability to assist in question reformulation.

Abstract: Question answering represents a core capability of large language models (LLMs). However, when individuals encounter unfamiliar knowledge in texts, they often formulate questions that the text itself cannot answer due to insufficient understanding of the underlying information. Recent studies reveal that while LLMs can detect unanswerable questions, they struggle to assist users in reformulating these questions. Even advanced models like GPT-3.5 demonstrate limited effectiveness in this regard. To address this limitation, we propose DRS: Deep Question Reformulation with Structured Output, a novel zero-shot method aimed at enhancing LLMs' ability to assist users in reformulating questions to extract relevant information from new documents. DRS combines the strengths of LLMs with a DFS-based algorithm to iteratively explore potential entity combinations and constrain outputs using predefined entities. This structured approach significantly enhances the reformulation capabilities of LLMs. Comprehensive experimental evaluations demonstrate that DRS improves the reformulation accuracy of GPT-3.5 from 23.03% to 70.42%, while also enhancing the performance of open-source models, such as Gemma2-9B, from 26.35% to 56.75%.
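
A minimal sketch of the DFS idea described above: explore combinations of document entities depth-first and return the first candidate reformulation that passes an answerability check. The question template and the `is_answerable` helper are hypothetical stand-ins for the LLM components.

```python
def dfs_reformulate(entities, is_answerable, combo=(), max_size=3):
    """Depth-first search over combinations of predefined document entities.
    `is_answerable` stands in for the LLM check that a candidate question
    can be answered from the document (hypothetical helper)."""
    if combo:
        candidate = f"How are {', '.join(combo)} related?"
        if is_answerable(candidate):
            return candidate
    if len(combo) == max_size:
        return None
    # Extend the current combination with each remaining entity in turn.
    for i, ent in enumerate(entities):
        found = dfs_reformulate(entities[i + 1:], is_answerable,
                                combo + (ent,), max_size)
        if found:
            return found
    return None

# Example with a toy answerability check:
doc_entities = ["transformers", "attention", "BERT"]
print(dfs_reformulate(doc_entities, lambda q: "attention" in q))
```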

[104] A Survey on Large Language Model-Based Social Agents in Game-Theoretic Scenarios

Xiachong Feng, Longxu Dou, Ella Li, Qinghao Wang, Haochuan Wang, Yu Guo, Chang Ma, Lingpeng Kong

Main category: cs.CL

TL;DR: A survey reviewing LLM-based social agents in game-theoretic scenarios, organized into Game Framework, Social Agent, and Evaluation Protocol, with insights for future research.

DetailsMotivation: Address the lack of a comprehensive survey on LLM-based social agents in game-theoretic settings.

Method: Systematic review of existing research, organizing findings into three components: Game Framework, Social Agent, and Evaluation Protocol.

Result: Identifies diverse game scenarios, agent traits, and evaluation metrics, analyzing current agent performance.

Conclusion: Provides insights to advance social agent development and evaluation in game-theoretic scenarios.

Abstract: Game-theoretic scenarios have become pivotal in evaluating the social intelligence of Large Language Model (LLM)-based social agents. While numerous studies have explored these agents in such settings, there is a lack of a comprehensive survey summarizing the current progress. To address this gap, we systematically review existing research on LLM-based social agents within game-theoretic scenarios. Our survey organizes the findings into three core components: Game Framework, Social Agent, and Evaluation Protocol. The game framework encompasses diverse game scenarios, ranging from choice-focusing to communication-focusing games. The social agent part explores agents' preferences, beliefs, and reasoning abilities, as well as their interactions and synergistic effects on decision-making. The evaluation protocol covers both game-agnostic and game-specific metrics for assessing agent performance. Additionally, we analyze the performance of current social agents across various game scenarios. By reflecting on the current research and identifying future research directions, this survey provides insights to advance the development and evaluation of social agents in game-theoretic scenarios.

[105] KnowShiftQA: How Robust are RAG Systems when Textbook Knowledge Shifts in K-12 Education?

Tianshi Zheng, Weihan Li, Jiaxin Bai, Weiqi Wang, Yangqiu Song

Main category: cs.CL

TL;DR: KnowShiftQA dataset evaluates RAG systems in K-12 education, revealing performance drops due to knowledge discrepancies between textbooks and LLMs.

DetailsMotivation: To assess RAG system robustness against knowledge shifts between authoritative textbooks and LLM parametric knowledge.

Method: Introduces KnowShiftQA, a dataset simulating knowledge discrepancies via hypothetical updates to answers and sources, covering 3,005 questions across five subjects.

Result: Most RAG systems show significant performance drops when handling knowledge discrepancies, especially in integrating textbook and LLM knowledge.

Conclusion: Current RAG systems struggle with knowledge discrepancies, highlighting the need for improved robustness in integrating contextual and parametric knowledge.

Abstract: Retrieval-Augmented Generation (RAG) systems show remarkable potential as question answering tools in the K-12 Education domain, where knowledge is typically queried within the restricted scope of authoritative textbooks. However, discrepancies between these textbooks and the parametric knowledge inherent in Large Language Models (LLMs) can undermine the effectiveness of RAG systems. To systematically investigate RAG system robustness against such knowledge discrepancies, we introduce KnowShiftQA. This novel question answering dataset simulates these discrepancies by applying deliberate hypothetical knowledge updates to both answers and source documents, reflecting how textbook knowledge can shift. KnowShiftQA comprises 3,005 questions across five subjects, designed with a comprehensive question typology focusing on context utilization and knowledge integration. Our extensive experiments on retrieval and question answering performance reveal that most RAG systems suffer a substantial performance drop when faced with these knowledge discrepancies. Furthermore, questions requiring the integration of contextual (textbook) knowledge with parametric (LLM) knowledge pose a significant challenge to current LLMs.

[106] Finding A Voice: Exploring the Potential of African American Dialect and Voice Generation for Chatbots

Sarah E. Finch, Ellie S. Paek, Ikseon Choi, Jinho D. Choi

Main category: cs.CL

TL;DR: The study explores how linguistic similarity, specifically African American English (AAE), impacts chatbot performance, finding spoken AAE chatbots outperform text-based ones.

DetailsMotivation: To enhance trust, engagement, and inclusivity for the African American community by personalizing chatbots with AAE.

Method: Developed text-based and spoken AAE chatbots using large language models and text-to-speech, then evaluated them against standard English chatbots with AAE speakers.

Result: Spoken AAE chatbots with African American voices and AAE elements performed better and were preferred, while text-based AAE chatbots underperformed.

Conclusion: Linguistic personalization is complex, with speech modality showing promise, but technological limitations in AAE generation need addressing.

Abstract: As chatbots become integral to daily life, personalizing systems is key for fostering trust, engagement, and inclusivity. This study examines how linguistic similarity affects chatbot performance, focusing on integrating African American English (AAE) into virtual agents to better serve the African American community. We develop text-based and spoken chatbots using large language models and text-to-speech technology, then evaluate them with AAE speakers against standard English chatbots. Our results show that while text-based AAE chatbots often underperform, spoken chatbots benefit from an African American voice and AAE elements, improving performance and preference. These findings underscore the complexities of linguistic personalization and the dynamics between text and speech modalities, highlighting technological limitations that affect chatbots’ AAE speech generation and pointing to promising future research directions.

[107] Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark

M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım

Main category: cs.CL

TL;DR: A novel framework evaluates tokenization strategies for NLP, focusing on morphologically rich and low-resource languages, using Turkish MMLU data. Key metrics include vocabulary size, token count, processing time, %TR, and token purity. %TR correlates strongly with downstream performance, outperforming token purity. Tailored tokenization strategies are crucial, as larger models don’t guarantee better results.

DetailsMotivation: To address challenges in tokenization for morphologically rich and low-resource languages, ensuring better linguistic structure preservation and model performance.

Method: A framework evaluates tokenizers using a Turkish MMLU dataset (6,200 questions) and five metrics: vocabulary size, token count, processing time, %TR, and token purity. %TR measures valid words; %Pure assesses alignment with linguistic units.

Result: %TR shows stronger correlation with downstream performance (e.g., MMLU scores) than token purity. Larger model parameters don’t improve tokenization quality, emphasizing the need for tailored strategies.

Conclusion: The framework sets a standard for robust tokenization in complex languages. Future work includes refining morphological analysis, domain customization, and cross-linguistic evaluations.

Abstract: Tokenization is a fundamental preprocessing step in NLP, directly impacting large language models’ (LLMs) ability to capture syntactic, morphosyntactic, and semantic structures. This paper introduces a novel framework for systematically evaluating tokenization strategies, addressing challenges in morphologically rich and low-resource languages. Using a Turkish dataset of 6,200 multiple-choice questions from the Massive Multitask Language Understanding (MMLU) benchmark, the framework assesses tokenizers across five key metrics: vocabulary size, token count, processing time, language-specific token percentages (%TR), and token purity. These metrics provide a structured approach to evaluating how well tokenizers preserve linguistic structures. While %TR measures the proportion of valid words in the target language, %Pure assesses the alignment of tokens with meaningful linguistic units, such as roots and valid morphemes, minimizing semantic fragmentation. The findings reveal that %TR, introduced as a critical metric, exhibits a stronger correlation with downstream performance (e.g., MMLU scores) than token purity, emphasizing its role in improving model accuracy. Additionally, larger model parameters do not necessarily yield better tokenization quality or enhanced results, highlighting the importance of tailored tokenization strategies that prioritize linguistic alignment. This framework sets a new standard for developing robust tokenization methods optimized for morphologically complex and low-resource languages. Future work will refine morphological analysis, explore domain-specific customizations, and conduct cross-linguistic evaluations to further enhance tokenization practices.
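
A toy sketch of how the two headline metrics could be computed from a tokenizer's output; the word list and morpheme inventory below are tiny stand-ins for real Turkish lexical resources, and the exact definitions in the paper may differ.

```python
def language_specific_pct(tokens, lexicon):
    """%TR: share of tokens that are valid words in the target language.
    `lexicon` is a toy stand-in for a real Turkish word list."""
    valid = sum(1 for t in tokens if t.strip("▁ ").lower() in lexicon)
    return 100.0 * valid / len(tokens)

def token_purity(tokens, morphemes):
    """%Pure: share of tokens aligned with meaningful linguistic units
    (roots / valid morphemes), per the paper's description."""
    pure = sum(1 for t in tokens if t.strip("▁ ").lower() in morphemes)
    return 100.0 * pure / len(tokens)

toks = ["▁ev", "ler", "▁git", "ti"]                    # toy tokenization
print(language_specific_pct(toks, {"ev", "git"}))      # 50.0
print(token_purity(toks, {"ev", "ler", "git", "ti"}))  # 100.0
```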

[108] Layerwise Recall and the Geometry of Interwoven Knowledge in LLMs

Ge Lei, Samuel J. Cooper

Main category: cs.CL

TL;DR: LLMs encode scientific knowledge in a 3D spiral structure, reflecting the periodic table’s geometry, with middle layers handling continuous attributes and deeper layers sharpening distinctions.

DetailsMotivation: To understand how LLMs encode and organize scientific knowledge, specifically chemical elements, and their geometric representation.

Method: Analyzed hidden states of LLaMA-series models, identifying a 3D spiral structure and using linear probing to study layer-specific encoding.

Result: Middle layers encode overlapping attributes for indirect recall, while deeper layers enhance categorical distinctions and linguistic context.

Conclusion: LLMs represent symbolic knowledge as structured geometric manifolds, suggesting potential for exploring scientific reasoning in domains like materials science.

Abstract: This study explores how large language models (LLMs) encode interwoven scientific knowledge, using chemical elements and LLaMA-series models as a case study. We identify a 3D spiral structure in the hidden states that aligns with the conceptual structure of the periodic table, suggesting that LLMs can reflect the geometric organization of scientific concepts learned from text. Linear probing reveals that middle layers encode continuous, overlapping attributes that enable indirect recall, while deeper layers sharpen categorical distinctions and incorporate linguistic context. These findings suggest that LLMs represent symbolic knowledge not as isolated facts, but as structured geometric manifolds that intertwine semantic information across layers. We hope this work inspires further exploration of how LLMs represent and reason about scientific knowledge, particularly in domains such as materials science.

[109] FastMCTS: A Simple Sampling Strategy for Data Synthesis

Peiji Li, Kai Lv, Yunfan Shao, Yichuan Ma, Linyang Li, Xiaoqing Zheng, Xipeng Qiu, Qipeng Guo

Main category: cs.CL

TL;DR: FastMCTS is a new data synthesis method inspired by Monte Carlo Tree Search, improving efficiency and balance in generating multi-step reasoning data compared to rejection sampling.

DetailsMotivation: Existing methods like rejection sampling are inefficient and imbalanced for generating multi-step reasoning data.

Method: FastMCTS, inspired by Monte Carlo Tree Search, provides step-level evaluation and balanced sampling.

Result: FastMCTS generates 30% more correct reasoning paths and improves model performance by 3.9% over rejection sampling.

Conclusion: FastMCTS is a practical, efficient alternative for high-quality reasoning data synthesis.

Abstract: Synthetic high-quality multi-step reasoning data can significantly enhance the performance of large language models on various tasks. However, most existing methods rely on rejection sampling, which generates trajectories independently and suffers from inefficiency and imbalanced sampling across problems of varying difficulty. In this work, we introduce FastMCTS, an innovative data synthesis strategy inspired by Monte Carlo Tree Search. FastMCTS provides a more efficient sampling method for multi-step reasoning data, offering step-level evaluation signals and promoting balanced sampling across problems of different difficulty levels. Experiments on both English and Chinese reasoning datasets demonstrate that FastMCTS generates over 30% more correct reasoning paths compared to rejection sampling as the number of generated tokens scales up. Furthermore, under comparable synthetic data budgets, models trained on FastMCTS-generated data outperform those trained on rejection sampling data by 3.9% across multiple benchmarks. As a lightweight sampling strategy, FastMCTS offers a practical and efficient alternative for synthesizing high-quality reasoning data. Our code will be released soon.
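
For intuition, here is a compact sketch of the generic MCTS loop the method builds on (selection by UCB, one-step expansion, step-level scoring, backpropagation). The `expand` and `score` callables stand in for the LLM step proposer and the correctness signal; this is not the authors' implementation.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    """UCB1: prefer unvisited children, then balance value vs. exploration."""
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts_sample(root, expand, score, iters=100):
    """Toy MCTS loop over reasoning steps. `expand(state)` proposes next
    steps; `score(state)` returns a step-level reward -- both hypothetical."""
    for _ in range(iters):
        node = root
        while node.children:                       # selection
            node = max(node.children, key=ucb)
        for step in expand(node.state):            # expansion
            node.children.append(Node(node.state + [step], node))
        leaf = random.choice(node.children) if node.children else node
        reward = score(leaf.state)                 # step-level evaluation
        while leaf:                                # backpropagation
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return root
```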

[110] Commonsense Reasoning in Arab Culture

Abdelrahman Sadallah, Junior Cedric Tonga, Khalid Almubarak, Saeed Almheiri, Farah Atif, Chatrine Qwaider, Karima Kadaoui, Sara Shatnawi, Yaser Alesh, Fajri Koto

Main category: cs.CL

TL;DR: ArabCulture is a new dataset for commonsense reasoning in Modern Standard Arabic, addressing cultural biases in existing datasets by involving native speakers from 13 Arab countries.

DetailsMotivation: Existing Arabic language models rely on machine-translated datasets, which lack cultural depth and introduce Anglocentric biases, failing to represent the diverse Arab world.

Method: ArabCulture was built by native speakers writing and validating culturally relevant questions across 12 daily life domains and 54 subtopics, covering 13 Arab countries.

Result: Zero-shot evaluations show open-weight language models (up to 32B parameters) struggle with diverse Arab cultures, with performance varying by region.

Conclusion: The study underscores the need for culturally aware models and datasets tailored to the Arabic-speaking world.

Abstract: Despite progress in Arabic large language models, such as Jais and AceGPT, their evaluation on commonsense reasoning has largely relied on machine-translated datasets, which lack cultural depth and may introduce Anglocentric biases. Commonsense reasoning is shaped by geographical and cultural contexts, and existing English datasets fail to capture the diversity of the Arab world. To address this, we introduce ArabCulture, a commonsense reasoning dataset in Modern Standard Arabic (MSA), covering cultures of 13 countries across the Gulf, Levant, North Africa, and the Nile Valley. The dataset was built from scratch by engaging native speakers to write and validate culturally relevant questions for their respective countries. ArabCulture spans 12 daily life domains with 54 fine-grained subtopics, reflecting various aspects of social norms, traditions, and everyday experiences. Zero-shot evaluations show that open-weight language models with up to 32B parameters struggle to comprehend diverse Arab cultures, with performance varying across regions. These findings highlight the need for more culturally aware models and datasets tailored to the Arabic-speaking world.

[111] KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan

Mukhammed Togmanov, Nurdaulet Mukhituly, Diana Turmakhan, Jonibek Mansurov, Maiya Goloburda, Akhmed Sakip, Zhuohan Xie, Yuxia Wang, Bekassyl Syzdykov, Nurkhan Laiyk, Alham Fikri Aji, Ekaterina Kochmar, Preslav Nakov, Fajri Koto

Main category: cs.CL

TL;DR: The paper introduces KazMMLU, the first MMLU-style dataset for the Kazakh language, addressing underrepresentation in NLP. It includes 23,000 questions in Kazakh and Russian, evaluating multilingual models like GPT-4, revealing performance gaps.

DetailsMotivation: Kazakh language and culture are underrepresented in NLP, with limited progress in Kazakh-centric LLMs and benchmarks.

Method: Creation of KazMMLU, a dataset with 23,000 questions (10,969 Kazakh, 12,031 Russian) from educational materials, manually validated. Evaluation of multilingual models (Llama-3.1, Qwen-2.5, GPT-4, DeepSeek V3).

Result: State-of-the-art models underperform in Kazakh and Russian, highlighting gaps compared to high-resource languages.

Conclusion: KazMMLU aims to spur research and development of Kazakh-centric LLMs, addressing the lack of resources and benchmarks.

Abstract: Despite having a population of twenty million, Kazakhstan’s culture and language remain underrepresented in the field of natural language processing. Although large language models (LLMs) continue to advance worldwide, progress in Kazakh language has been limited, as seen in the scarcity of dedicated models and benchmark evaluations. To address this gap, we introduce KazMMLU, the first MMLU-style dataset specifically designed for Kazakh language. KazMMLU comprises 23,000 questions that cover various educational levels, including STEM, humanities, and social sciences, sourced from authentic educational materials and manually validated by native speakers and educators. The dataset includes 10,969 Kazakh questions and 12,031 Russian questions, reflecting Kazakhstan’s bilingual education system and rich local context. Our evaluation of several state-of-the-art multilingual models (Llama-3.1, Qwen-2.5, GPT-4, and DeepSeek V3) demonstrates substantial room for improvement, as even the best-performing models struggle to achieve competitive performance in Kazakh and Russian. These findings underscore significant performance gaps compared to high-resource languages. We hope that our dataset will enable further research and development of Kazakh-centric LLMs. Data and code will be made available upon acceptance.

[112] How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation

Rui Li, Heming Xia, Xinfeng Yuan, Qingxiu Dong, Lei Sha, Wenjie Li, Zhifang Sui

Main category: cs.CL

TL;DR: BehaviorChain is a new benchmark for evaluating LLMs’ ability to simulate continuous human behavior, revealing gaps in current models.

DetailsMotivation: Current LLM evaluations focus on dialogue simulation, neglecting human behavior simulation, which is essential for digital twins.

Method: BehaviorChain includes 15,846 persona-based behaviors across 1,001 personas. LLMs are tested by inferring contextually appropriate behaviors in dynamic scenarios.

Result: State-of-the-art LLMs struggle with accurately simulating continuous human behavior.

Conclusion: BehaviorChain highlights the need for improved LLM capabilities in human behavior simulation for digital twin applications.

Abstract: Recently, LLMs have garnered increasing attention across academic disciplines for their potential as human digital twins, virtual proxies designed to replicate individuals and autonomously perform tasks such as decision-making, problem-solving, and reasoning on their behalf. However, current evaluations of LLMs primarily emphasize dialogue simulation while overlooking human behavior simulation, which is crucial for digital twins. To address this gap, we introduce BehaviorChain, the first benchmark for evaluating LLMs’ ability to simulate continuous human behavior. BehaviorChain comprises diverse, high-quality, persona-based behavior chains, totaling 15,846 distinct behaviors across 1,001 unique personas, each with detailed history and profile metadata. For evaluation, we integrate persona metadata into LLMs and employ them to iteratively infer contextually appropriate behaviors within dynamic scenarios provided by BehaviorChain. Comprehensive evaluation results demonstrate that even state-of-the-art models struggle with accurately simulating continuous human behavior.

[113] MKE-Coder: Multi-Axial Knowledge with Evidence Verification in ICD Coding for Chinese EMRs

Xinxin You, Xien Liu, Xue Yang, Ziyi Wang, Ji Wu

Main category: cs.CL

TL;DR: MKE-Coder, a novel framework for automatic ICD coding in Chinese EMRs, leverages multi-axial knowledge and evidence verification to improve accuracy and efficiency.

DetailsMotivation: Existing methods struggle with Chinese EMRs due to concise writing styles and lack of multi-axial knowledge integration.

Method: MKE-Coder identifies candidate codes, categorizes them into four axes, retrieves clinical evidence, and verifies validity using a masked language model.

Result: Experiments show MKE-Coder outperforms existing methods, improving coding accuracy and speed in real scenarios.

Conclusion: MKE-Coder effectively addresses challenges in Chinese EMR ICD coding, offering a robust solution for practical use.

Abstract: Automatic coding of the International Classification of Diseases (ICD) is a well-established task in the medical field that has received much attention. It has been successful for English records but faces challenges when dealing with Chinese electronic medical records (EMRs). The first issue lies in the difficulty of extracting disease code-related information from Chinese EMRs, primarily due to the concise writing style and specific internal structure of the EMRs. The second is that previous methods have failed to leverage disease-based multi-axial knowledge and lack association with the corresponding clinical evidence. This paper introduces a novel framework called MKE-Coder: Multi-axial Knowledge with Evidence verification in ICD coding for Chinese EMRs. Initially, we identify candidate codes for the diagnosis and categorize each of them into knowledge under four coding axes. Subsequently, we retrieve corresponding clinical evidence from the comprehensive content of EMRs and filter credible evidence through a scoring model. Finally, to ensure the validity of the candidate code, we propose an inference module based on the masked language modeling strategy. This module verifies that all the axis knowledge associated with the candidate code is supported by evidence and provides recommendations accordingly. To evaluate the performance of our framework, we conduct experiments using a large-scale Chinese EMR dataset collected from various hospitals. The experimental results demonstrate that MKE-Coder exhibits significant superiority in the task of automatic ICD coding based on Chinese EMRs. In a practical evaluation of our method within simulated real coding scenarios, our approach significantly aids coders in enhancing both their coding accuracy and speed.

[114] Analyze the Neurons, not the Embeddings: Understanding When and Where LLM Representations Align with Humans

Masha Fedzechkina, Eleonora Gualdoni, Sinead Williamson, Katherine Metcalf, Skyler Seto, Barry-John Theobald

Main category: cs.CL

TL;DR: The paper introduces a method to study alignment between LLM and human representations using neuron activation patterns, showing close alignment and outperforming word embeddings.

DetailsMotivation: To understand how well LLM representations align with human representations, given their mixed performance on tasks.

Method: Uses activation steering to identify neurons for specific concepts and analyzes activation patterns.

Result: LLM representations closely align with human ones, matching inter-human alignment levels and outperforming word embeddings.

Conclusion: LLMs organize concepts similarly to humans, providing a granular view of their representation alignment.

Abstract: Modern large language models (LLMs) achieve impressive performance on some tasks, while exhibiting distinctly non-human-like behaviors on others. This raises the question of how well the LLM’s learned representations align with human representations. In this work, we introduce a novel approach to study representation alignment: we adopt a method from research on activation steering to identify neurons responsible for specific concepts (e.g., ‘‘cat’’) and then analyze the corresponding activation patterns. We find that LLM representations captured this way closely align with human representations inferred from behavioral data, matching inter-human alignment levels. Our approach significantly outperforms the alignment captured by word embeddings, which have been the focus of prior work on human-LLM alignment. Additionally, our approach enables a more granular view of how LLMs represent concepts – we show that LLMs organize concepts in a way that mirrors human concept organization.
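
A simplified numpy sketch of the neuron-selection step as described above: contrast mean activations on concept-bearing vs. neutral inputs and keep the most responsive units. The paper's actual activation-steering procedure is more involved; this only illustrates the idea.

```python
import numpy as np

def concept_neurons(acts_concept, acts_other, k=10):
    """Rank neurons by mean activation difference between inputs that
    mention a concept (e.g., "cat") and inputs that do not.
    Inputs: (n_examples, n_neurons) hidden-state matrices."""
    diff = acts_concept.mean(axis=0) - acts_other.mean(axis=0)
    top = np.argsort(-np.abs(diff))[:k]   # most concept-sensitive units
    return top, diff[top]

rng = np.random.default_rng(0)
a = rng.normal(size=(32, 256))
a[:, 7] += 2.0                            # neuron 7 "fires" for the concept
b = rng.normal(size=(32, 256))
idx, scores = concept_neurons(a, b)
print(idx[0])                             # likely 7
```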

[115] Steering into New Embedding Spaces: Analyzing Cross-Lingual Alignment Induced by Model Interventions in Multilingual Language Models

Anirudh Sundar, Sinead Williamson, Katherine Metcalf, Barry-John Theobald, Skyler Seto, Masha Fedzechkina

Main category: cs.CL

TL;DR: The paper explores model interventions as a data-efficient alternative to fine-tuning for improving cross-lingual alignment in multilingual large language models (mLLMs), demonstrating enhanced alignment and downstream performance.

DetailsMotivation: Aligning representations across languages in mLLMs is crucial for cross-lingual tasks, but fine-tuning is computationally expensive and data-intensive.

Method: The study uses model interventions (manipulating activations) to steer generation, identifies neurons for manipulation, and analyzes embedding space changes pre- and post-intervention.

Result: Interventions improve cross-lingual alignment and boost downstream performance, achieving up to 2x top-1 accuracy in cross-lingual retrieval.

Conclusion: Model interventions offer a viable, data-efficient method to enhance cross-lingual alignment in mLLMs without extensive fine-tuning.

Abstract: Aligned representations across languages is a desired property in multilingual large language models (mLLMs), as alignment can improve performance in cross-lingual tasks. Typically alignment requires fine-tuning a model, which is computationally expensive, and sizable language data, which often may not be available. A data-efficient alternative to fine-tuning is model interventions – a method for manipulating model activations to steer generation into the desired direction. We analyze the effect of a popular intervention (finding experts) on the alignment of cross-lingual representations in mLLMs. We identify the neurons to manipulate for a given language and introspect the embedding space of mLLMs pre- and post-manipulation. We show that modifying the mLLM’s activations changes its embedding space such that cross-lingual alignment is enhanced. Further, we show that the changes to the embedding space translate into improved downstream performance on retrieval tasks, with up to 2x improvements in top-1 accuracy on cross-lingual retrieval.

[116] KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse

Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang

Main category: cs.CL

TL;DR: KVLink is a method for efficient key-value (KV) cache reuse in LLMs, reducing redundant computation by precomputing and reusing cached representations of overlapping contexts.

DetailsMotivation: Different inputs in LLM applications often share overlapping context, but current methods require redundant encoding of the same context for each query, leading to inefficiency.

Method: KVLink precomputes KV caches for documents independently and concatenates them during inference. It adjusts positional embeddings and uses trainable special tokens to maintain self-attention across documents.

Result: KVLink improves question answering accuracy by 4% on average and reduces time-to-first-token by up to 96%. It also works well with KV cache compression.

Conclusion: KVLink is a scalable and efficient solution for context reuse in LLMs, offering significant performance improvements and reduced computational overhead.

Abstract: We describe KVLink, an approach for efficient key-value (KV) cache reuse in large language models (LLMs). In many LLM applications, different inputs can share overlapping context, such as the same retrieved document appearing in multiple queries. However, the LLMs still need to encode the entire context for each query, leading to redundant computation. In this paper, we investigate a new strategy to eliminate such inefficiency, where the KV cache of each document is precomputed independently. During inference, the KV caches of retrieved documents are concatenated, allowing the model to reuse cached representations instead of recomputing them. To mitigate the performance degradation when using KV caches computed independently for each document, KVLink introduces two key techniques: adjusting positional embeddings of the KV cache at inference to match the global position after concatenation, and using trainable special tokens to restore self-attention across independently encoded documents. Experiments across 7 datasets demonstrate that KVLink improves question answering accuracy by an average of 4% over state-of-the-art methods. Furthermore, by leveraging precomputed KV caches, our approach reduces time-to-first-token by up to 96% compared to standard LLM inference, making it a scalable and efficient solution for context reuse. Additionally, KVLink can be combined with KV cache compression to further save cache loading and storage overhead while outperforming the baselines.
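
A simplified sketch of the cache-concatenation step: per-document KV caches are joined along the sequence axis, and each document's keys get a global position offset so positions match the concatenated layout. In a real implementation the offset would be applied through the rotary (RoPE) rotation of the key tensors; the cache layout below is an assumption for illustration.

```python
import torch

def concat_kv_caches(doc_caches):
    """Concatenate per-document KV caches and return the position offset
    each document's keys must be shifted by after concatenation.
    Each cache: dict with 'k', 'v' of shape (heads, seq_len, head_dim)."""
    ks, vs, offsets, pos = [], [], [], 0
    for cache in doc_caches:
        offsets.append(pos)            # global start position of this doc
        ks.append(cache["k"])
        vs.append(cache["v"])
        pos += cache["k"].shape[1]
    return torch.cat(ks, dim=1), torch.cat(vs, dim=1), offsets

doc_a = {"k": torch.randn(8, 5, 64), "v": torch.randn(8, 5, 64)}
doc_b = {"k": torch.randn(8, 3, 64), "v": torch.randn(8, 3, 64)}
k, v, offs = concat_kv_caches([doc_a, doc_b])
print(k.shape, offs)                   # torch.Size([8, 8, 64]) [0, 5]
```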

[117] MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation

María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico

Main category: cs.CL

TL;DR: The paper introduces MEMERAG, a multilingual benchmark for evaluating RAG systems, addressing cultural nuances by using native-language questions and expert annotations for faithfulness and relevance.

DetailsMotivation: Existing benchmarks for RAG systems lack cultural nuance as they focus on English or translated data, limiting their applicability to diverse languages and contexts.

Method: The benchmark uses the MIRACL dataset with native-language questions, generates responses via diverse LLMs, and employs expert annotators to assess faithfulness and relevance.

Result: High inter-annotator agreement is achieved, and the benchmark effectively evaluates multilingual automatic evaluators, identifying improvements from advanced prompting techniques and LLMs.

Conclusion: MEMERAG provides a reliable, culturally nuanced benchmark for multilingual RAG evaluation, enhancing the development of automatic evaluators.

Abstract: Automatic evaluation of retrieval augmented generation (RAG) systems relies on fine-grained dimensions like faithfulness and relevance, as judged by expert human annotators. Meta-evaluation benchmarks support the development of automatic evaluators that correlate well with human judgement. However, existing benchmarks predominantly focus on English or use translated data, which fails to capture cultural nuances. A native approach provides a better representation of the end user experience. In this work, we develop a Multilingual End-to-end Meta-Evaluation RAG benchmark (MEMERAG). Our benchmark builds on the popular MIRACL dataset, using native-language questions and generating responses with diverse large language models (LLMs), which are then assessed by expert annotators for faithfulness and relevance. We describe our annotation process and show that it achieves high inter-annotator agreement. We then analyse the performance of the answer-generating LLMs across languages according to the human evaluators. Finally, we apply the dataset to our main use-case, which is to benchmark multilingual automatic evaluators (LLM-as-a-judge). We show that our benchmark can reliably identify improvements offered by advanced prompting techniques and LLMs. Our dataset is available at https://github.com/amazon-science/MEMERAG

[118] Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in Product QA Agents

Ashley Lewis, Michael White, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang

Main category: cs.CL

TL;DR: A retrieval-augmented QA pipeline using synthetic data and self-training reduces hallucination in LLMs for customer support, outperforming crowdsourced data and proprietary models.

DetailsMotivation: Addressing hallucination and high costs in deploying LLMs for customer support.

Method: Proposes a retrieval-augmented QA pipeline, tests synthetic vs. crowdsourced data, and compares self-training with knowledge distillation.

Result: Synthetic data reduces hallucination; self-training matches knowledge distillation’s performance. Contextualized “I don’t know” responses improve robustness.

Conclusion: Open-source models with synthetic data and self-training offer scalable, cost-efficient QA systems, reducing reliance on proprietary tools or human annotations.

Abstract: The deployment of Large Language Models (LLMs) in customer support is constrained by hallucination (generating false information) and the high cost of proprietary models. To address these challenges, we propose a retrieval-augmented question-answering (QA) pipeline and explore how to balance human input and automation. Using a dataset of questions about a Samsung Smart TV user manual, we demonstrate that synthetic data generated by LLMs outperforms crowdsourced data in reducing hallucination in finetuned models. We also compare self-training (fine-tuning models on their own outputs) and knowledge distillation (fine-tuning on stronger models’ outputs, e.g., GPT-4o), and find that self-training achieves comparable hallucination reduction. We conjecture that this surprising finding can be attributed to increased exposure bias issues in the knowledge distillation case and support this conjecture with post hoc analysis. We also improve robustness to unanswerable questions and retrieval failures with contextualized “I don’t know” responses. These findings show that scalable, cost-efficient QA systems can be built using synthetic data and self-training with open-source models, reducing reliance on proprietary tools or costly human annotations.
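
A schematic of one self-training round as described above: the model answers questions, low-confidence answers are replaced with a contextualized fallback, and the model is fine-tuned on its own outputs. The `generate`, `confident`, and `finetune` helpers are hypothetical stand-ins for the paper's pipeline components.

```python
def self_training_round(model, questions, generate, confident, finetune):
    """One self-training round: fine-tune the model on its own answers,
    substituting a contextualized refusal where confidence is low."""
    pairs = []
    for q in questions:
        answer = generate(model, q)
        if not confident(model, q, answer):
            # Contextualized "I don't know" improves robustness to
            # unanswerable questions and retrieval failures.
            answer = "I don't know; the manual does not appear to cover this."
        pairs.append((q, answer))
    return finetune(model, pairs)
```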

[119] Do Emotions Really Affect Argument Convincingness? A Dynamic Approach with LLM-based Manipulation Checks

Yanran Chen, Steffen Eger

Main category: cs.CL

TL;DR: The paper explores how emotional intensity affects argument convincingness using a dynamic framework and LLM-based manipulation checks, finding emotions often enhance convincingness but LLMs struggle with nuanced emotional effects.

DetailsMotivation: Emotions' role in argument convincingness is underexplored in NLP, with prior studies limited by static analyses, single domains/languages, or treating emotion as one of many factors.

Method: A dynamic framework using LLM-based manipulation checks evaluates emotional intensity’s impact on convincingness across languages, domains, and topics via human evaluation.

Result: Human judgments of convincingness often remain unchanged despite emotional variations; when emotions matter, they usually enhance convincingness. LLMs mimic human patterns but fail in nuanced emotional judgments.

Conclusion: Emotions can enhance argument convincingness, but LLMs lack the nuance to fully replicate human emotional judgment patterns.

Abstract: Emotions have been shown to play a role in argument convincingness, yet this aspect is underexplored in the natural language processing (NLP) community. Unlike prior studies that use static analyses, focus on a single text domain or language, or treat emotion as just one of many factors, we introduce a dynamic framework inspired by manipulation checks commonly used in psychology and social science; leveraging LLM-based manipulation checks, this framework examines the extent to which perceived emotional intensity influences perceived convincingness. Through human evaluation of arguments across different languages, text domains, and topics, we find that in over half of cases, human judgments of convincingness remain unchanged despite variations in perceived emotional intensity; when emotions do have an impact, they more often enhance rather than weaken convincingness. We further analyze whether 11 LLMs behave like humans in the same scenario, finding that while LLMs generally mirror human patterns, they struggle to capture nuanced emotional effects in individual judgments.

[120] Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models

Meghana Rajeev, Rajkumar Ramamurthy, Prapti Trivedi, Vikas Yadav, Oluwanifemi Bamgbose, Sathwik Tejaswi Madhusudan, James Zou, Nazneen Rajani

Main category: cs.CL

TL;DR: The paper introduces CatAttack, a method to generate adversarial triggers that mislead reasoning models into incorrect answers without changing problem semantics, revealing vulnerabilities in state-of-the-art models.

DetailsMotivation: To investigate the robustness of reasoning models by exposing their susceptibility to subtle adversarial inputs, raising security and reliability concerns.

Method: Proposes CatAttack, an automated iterative attack pipeline using a weaker proxy model (DeepSeek V3) to generate triggers, then transfers them to advanced models (DeepSeek R1, R1-distilled-Qwen-32B).

Result: Triggers like “Interesting fact: cats sleep most of their lives” increase incorrect answers by over 300%, demonstrating model vulnerabilities.

Conclusion: State-of-the-art reasoning models are highly susceptible to adversarial triggers, highlighting critical security and reliability issues.

Abstract: We investigate the robustness of reasoning models trained for step-by-step problem solving by introducing query-agnostic adversarial triggers - short, irrelevant text that, when appended to math problems, systematically misleads models to output incorrect answers without altering the problem’s semantics. We propose CatAttack, an automated iterative attack pipeline for generating triggers on a weaker, less expensive proxy model (DeepSeek V3) and successfully transfer them to more advanced reasoning target models like DeepSeek R1 and DeepSeek R1-distilled-Qwen-32B, resulting in a greater than 300% increase in the likelihood of the target model generating an incorrect answer. For example, appending “Interesting fact: cats sleep most of their lives” to any math problem leads to more than doubling the chances of a model getting the answer wrong. Our findings highlight critical vulnerabilities in reasoning models, revealing that even state-of-the-art models remain susceptible to subtle adversarial inputs, raising security and reliability concerns. The CatAttack triggers dataset with model responses is available at https://huggingface.co/datasets/collinear-ai/cat-attack-adversarial-triggers.
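
To illustrate the evaluation, a small sketch that appends the cat trigger from the abstract and measures how often it flips a previously correct answer; `ask` is a hypothetical stand-in for a call to the target reasoning model.

```python
TRIGGER = "Interesting fact: cats sleep most of their lives."

def flip_rate(problems, answers, ask):
    """Fraction of initially correct answers that become incorrect once
    the query-agnostic trigger is appended. `ask(problem)` stands in
    for a call to the target model."""
    flipped = total = 0
    for prob, gold in zip(problems, answers):
        if ask(prob) == gold:                     # correct without trigger
            total += 1
            if ask(prob + " " + TRIGGER) != gold:
                flipped += 1
    return flipped / max(total, 1)
```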

[121] Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning

Justin Chih-Yao Chen, Sukwon Yun, Elias Stengel-Eskin, Tianlong Chen, Mohit Bansal

Main category: cs.CL

TL;DR: Symbolic-MoE is a gradient-free Mixture-of-Experts framework for adaptive instance-level expert selection, improving performance and efficiency in diverse reasoning tasks.

DetailsMotivation: Existing expert LLM selection is too coarse-grained for heterogeneous tasks, requiring fine-grained, instance-level expertise.

Method: Symbolic-MoE dynamically selects experts based on skills, generates multiple outputs, and synthesizes them via an aggregator. A batch strategy reduces computational overhead.

Result: Symbolic-MoE outperforms GPT4o-mini and multi-agent baselines by 8.15% on average, with efficient GPU usage.

Conclusion: Symbolic-MoE enables scalable, high-quality reasoning with minimal computational cost, generalizing well to unseen tasks.

Abstract: Combining existing pre-trained expert LLMs is a promising avenue for scalably tackling large-scale and diverse tasks. However, selecting task-level experts is often too coarse-grained, as heterogeneous tasks may require different expertise per instance. To enable adaptive instance-level mixing of pre-trained LLM experts, we propose Symbolic-MoE, a symbolic, text-based, and gradient-free Mixture-of-Experts framework. Symbolic-MoE takes a fine-grained approach to selection by emphasizing skills, e.g., algebra in math or molecular biology in biomedical reasoning. We propose a skill-based recruiting strategy that dynamically selects the most relevant set of expert LLMs for diverse reasoning tasks based on their strengths. Each selected expert then generates its own reasoning, resulting in k outputs from k experts, which are then synthesized into a final high-quality response by an aggregator chosen based on its ability to integrate diverse reasoning outputs. We show that Symbolic-MoE’s instance-level expert selection improves performance by a large margin but – when implemented naively – can introduce a high computational overhead due to the need for constant model loading and offloading. To address this, we implement a batch strategy that groups instances based on their assigned experts, loading each model only once. This allows us to integrate 16 expert models on 1 GPU with a time cost comparable to or better than prior multi-agent baselines using 4 GPUs. Through extensive evaluations on diverse benchmarks (MMLU-Pro, GPQA, AIME, and MedMCQA), we show that Symbolic-MoE beats strong LLMs like GPT4o-mini, as well as multi-agent approaches, with an absolute avg. gain of 8.15% over the best multi-agent baseline. Moreover, Symbolic-MoE generalizes well to unseen tasks and removes the need for expensive multi-round discussions, outperforming discussion baselines with less computation.
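
The batch strategy is straightforward to sketch: group instances by their recruited expert so each model is loaded exactly once. The `assigned_expert`, `load_model`, and `solve` helpers are hypothetical stand-ins for the skill-based router, model loader, and per-instance inference.

```python
from collections import defaultdict

def run_batched(instances, assigned_expert, load_model, solve):
    """Group instances by expert so each expert model is loaded once,
    avoiding the constant load/offload overhead described above."""
    groups = defaultdict(list)
    for i, inst in enumerate(instances):
        groups[assigned_expert(inst)].append((i, inst))
    outputs = [None] * len(instances)
    for expert_name, batch in groups.items():
        model = load_model(expert_name)           # one load per expert
        for i, inst in batch:
            outputs[i] = solve(model, inst)
    return outputs
```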

[122] Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, Jiawei Han

Main category: cs.CL

TL;DR: Search-R1 enhances LLMs’ reasoning by teaching them to autonomously generate search queries during step-by-step reasoning, improving performance by up to 41% over baselines.

DetailsMotivation: LLMs often struggle to optimally interact with search engines during inference, limiting their ability to acquire external knowledge efficiently.

Method: Search-R1 uses reinforcement learning to optimize LLM reasoning trajectories with multi-turn search interactions, employing retrieved token masking and outcome-based rewards.

Result: Experiments show performance improvements of 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over RAG baselines.

Conclusion: Search-R1 effectively improves LLM reasoning with real-time retrieval, offering insights into RL optimization and LLM choices.

Abstract: Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Prompting advanced LLMs with reasoning capabilities to use search engines during inference is often suboptimal, as the LLM may not fully possess the capability to interact optimally with the search engine. This paper introduces Search-R1, an extension of reinforcement learning (RL) for reasoning frameworks where the LLM learns to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM reasoning trajectories with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over various RAG baselines under the same setting. This paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.
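
A simplified torch sketch of the retrieved-token-masking idea: retrieved passages interleaved into the trajectory are excluded from the policy loss so gradients flow only through tokens the model generated itself. The shapes and the loss form are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def masked_policy_loss(logits, tokens, retrieved_mask, advantages):
    """Policy-gradient-style loss with retrieved positions masked out.
    logits: (T, vocab); tokens: (T,) long; retrieved_mask: (T,) bool,
    True where the token came from retrieval; advantages: (T,) float."""
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp[torch.arange(tokens.size(0)), tokens]
    keep = ~retrieved_mask                    # optimize generated tokens only
    return -(token_logp * advantages * keep).sum() / keep.sum().clamp(min=1)

T, V = 6, 100
loss = masked_policy_loss(
    torch.randn(T, V),
    torch.randint(0, V, (T,)),
    torch.tensor([0, 0, 1, 1, 0, 0], dtype=torch.bool),
    torch.ones(T),
)
print(loss)
```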

[123] BriLLM: Brain-inspired Large Language Model

Hai Zhao, Hongqiu Wu, Dongjie Yang, Anni Zou, Jiale Hong

Main category: cs.CL

TL;DR: BriLLM is a brain-inspired, non-Transformer language model using a directed graph (SiFu) for interpretability and infinite n-gram support. It mimics human cognition with recall activation and multi-modal potential. Initial Chinese version matches GPT-1 performance.

DetailsMotivation: To create an interpretable, brain-like language model that diverges from traditional Transformer/GPT architectures, enabling infinite n-gram support and cognitive-inspired features.

Method: Uses Signal Fully-connected flowing (SiFu) on a directed graph, where tokens are nodes and signal flows follow ’least resistance’ paths. Supports infinitely long n-grams and recall activation.

Result: First BriLLM version in Chinese achieves GPT-1-level performance with 4000 tokens, 32D node width, and 16-token sequence prediction.

Conclusion: BriLLM offers a novel, interpretable approach to language modeling with brain-like features, promising scalability with more computing power.

Abstract: This paper reports the first brain-inspired large language model (BriLLM). This is a non-Transformer, non-GPT, non-traditional machine learning, input-output controlled generative language model. The model is based on the Signal Fully-connected flowing (SiFu) definition on a directed graph in terms of the neural network, and has interpretability for all nodes on the graph of the whole model, unlike traditional machine learning models, which have only limited interpretability at the input and output ends. In the language model scenario, a token is defined as a node in the graph. A randomly shaped or user-defined signal flow flows between nodes on the principle of “least resistance” along paths. The next token or node to be predicted or generated is the target of the signal flow. As a language model, BriLLM theoretically supports infinitely long n-gram models, since the model size is independent of the input and prediction length. The model’s working signal flow provides the possibility of recall activation and innate multi-modal support, similar to the cognitive patterns of the human brain. At present, we have released the first BriLLM version in Chinese, with 4000 tokens, a 32-dimensional node width, 16-token long-sequence prediction ability, and language model prediction performance comparable to GPT-1. More computing power will help us explore the infinite possibilities depicted above.
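
A toy reading of the “least resistance” decoding rule: from the current node, the signal moves along the outgoing edge with minimal resistance. This is purely illustrative; the actual SiFu dynamics described in the paper are richer.

```python
def next_node(graph, current):
    """Pick the outgoing edge with the least resistance (toy SiFu step).
    `graph` maps node -> {neighbor: resistance}."""
    return min(graph[current], key=graph[current].get)

g = {"the": {"cat": 0.2, "dog": 0.5}, "cat": {"sat": 0.1}}
print(next_node(g, "the"))   # cat
```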

[124] OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs

Ivan Kartáč, Mateusz Lango, Ondřej Dušek

Main category: cs.CL

TL;DR: OpeNLGauge is an open-source, reference-free NLG evaluation metric that provides explanatory feedback, outperforming proprietary models in some tasks.

DetailsMotivation: Existing LLM-based metrics rely on proprietary models and lack fine-grained feedback.

Method: OpeNLGauge uses a two-stage ensemble of open-weight LLMs or a small fine-tuned model, offering error-span-based explanations.

Result: OpeNLGauge achieves competitive correlation with human judgments and outperforms state-of-the-art models in some tasks.

Conclusion: OpeNLGauge is a reproducible, open-source solution for NLG evaluation with accurate explanations.

Abstract: Large Language Models (LLMs) have demonstrated great potential as evaluators of NLG systems, allowing for high-quality, reference-free, and multi-aspect assessments. However, existing LLM-based metrics suffer from two major drawbacks: reliance on proprietary models to generate training data or perform evaluations, and a lack of fine-grained, explanatory feedback. In this paper, we introduce OpeNLGauge, a fully open-source, reference-free NLG evaluation metric that provides accurate explanations based on error spans. OpeNLGauge is available as a two-stage ensemble of larger open-weight LLMs, or as a small fine-tuned evaluation model, with confirmed generalizability to unseen tasks, domains and aspects. Our extensive meta-evaluation shows that OpeNLGauge achieves competitive correlation with human judgments, outperforming state-of-the-art models on certain tasks while maintaining full reproducibility and providing explanations more than twice as accurate.

[125] Entity-aware Cross-lingual Claim Detection for Automated Fact-checking

Rrubaa Panchendrarajan, Arkaitz Zubiaga

Main category: cs.CL

TL;DR: EX-Claim is an entity-aware cross-lingual claim detection model that improves multilingual claim detection by leveraging entity information, showing consistent performance across 27 languages.

DetailsMotivation: The proliferation of multilingual misinformation on social media necessitates better tools for identifying claims requiring verification, especially in cross-lingual contexts.

Method: The paper introduces EX-Claim, which uses named entity recognition and entity linking to enhance multilingual claim detection, fine-tuning pre-trained models for better cross-lingual knowledge transfer.

Result: Experiments on three datasets show EX-Claim achieves consistent performance gains across 27 languages and robust knowledge transfer between seen and unseen languages.

Conclusion: EX-Claim is an effective solution for multilingual claim detection, demonstrating strong generalization and cross-lingual transfer capabilities.

Abstract: Identifying claims requiring verification is a critical task in automated fact-checking, especially given the proliferation of misinformation on social media platforms. Despite notable progress, challenges remain-particularly in handling multilingual data prevalent in online discourse. Recent efforts have focused on fine-tuning pre-trained multilingual language models to address this. While these models can handle multiple languages, their ability to effectively transfer cross-lingual knowledge for detecting claims spreading on social media remains under-explored. In this paper, we introduce EX-Claim, an entity-aware cross-lingual claim detection model that generalizes well to handle multilingual claims. The model leverages entity information derived from named entity recognition and entity linking techniques to improve the language-level performance of both seen and unseen languages during training. Extensive experiments conducted on three datasets from different social media platforms demonstrate that our proposed model stands out as an effective solution, demonstrating consistent performance gains across 27 languages and robust knowledge transfer between languages seen and unseen during training.

[126] SWI: Speaking with Intent in Large Language Models

Yuwei Yin, EunJeong Hwang, Giuseppe Carenini

Main category: cs.CL

TL;DR: The paper introduces Speaking with Intent (SWI) for LLMs, showing it improves reasoning and generation quality by emulating human deliberate thought.

DetailsMotivation: To enhance LLMs' reasoning and generation by incorporating explicit intent, mimicking human cognitive processes.

Method: SWI involves generating explicit intent to guide analysis and action, tested on summarization, QA, and math reasoning tasks.

Result: SWI outperforms direct generation, showing effectiveness, generalizability, and improved coherence in human evaluations.

Conclusion: SWI offers a promising approach to boost LLMs’ cognitive abilities through explicit intent.

Abstract: Intent, typically clearly formulated and planned, functions as a cognitive framework for communication and problem-solving. This paper introduces the concept of Speaking with Intent (SWI) in large language models (LLMs), where the explicitly generated intent encapsulates the model’s underlying intention and provides high-level planning to guide subsequent analysis and action. By emulating deliberate and purposeful thoughts in the human mind, SWI is hypothesized to enhance the reasoning capabilities and generation quality of LLMs. Extensive experiments on text summarization, multi-task question answering, and mathematical reasoning benchmarks consistently demonstrate the effectiveness and generalizability of Speaking with Intent over direct generation without explicit intent. Further analysis corroborates the generalizability of SWI under different experimental settings. Moreover, human evaluations verify the coherence, effectiveness, and interpretability of the intent produced by SWI. The promising results in enhancing LLMs with explicit intents pave a new avenue for boosting LLMs’ generation and reasoning abilities with cognitive notions.
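
To make the prompting pattern concrete, here is a minimal sketch of an SWI-style template; the instruction wording is a paraphrase, not the paper's exact prompt:

```python
# Hypothetical SWI-style prompt wrapper: the model is asked to state its
# intent (goal plus high-level plan) before producing the answer.
SWI_TEMPLATE = """First, state your intent: what you aim to accomplish and the
plan you will follow. Then carry out the plan and give the final answer.

Question: {question}

Intent:"""

def build_swi_prompt(question: str) -> str:
    return SWI_TEMPLATE.format(question=question)

print(build_swi_prompt("A train travels 120 km in 1.5 hours. What is its average speed?"))
```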

[127] Texture or Semantics? Vision-Language Models Get Lost in Font Recognition

Zhecheng Li, Guoxian Song, Yujun Cai, Zhen Xiong, Junsong Yuan, Yiwei Wang

Main category: cs.CL

TL;DR: The paper investigates whether Vision-Language Models (VLMs) can recognize fonts, introducing the Font Recognition Benchmark (FRB) and finding current VLMs perform poorly, with minimal improvement from few-shot learning or Chain-of-Thought prompting.

DetailsMotivation: To assess VLMs' capability in fine-grained tasks like font recognition, given their multimodal potential and free accessibility.

Method: Created the FRB dataset with easy and hard versions (the latter introducing a Stroop effect) and evaluated various VLMs on font recognition.

Result: Current VLMs show limited font recognition, are affected by Stroop effects, and few-shot/CoT methods offer little improvement.

Conclusion: VLMs have inherent limitations in capturing semantic features for font recognition, highlighting a gap in their fine-grained task performance.

Abstract: Modern Vision-Language Models (VLMs) exhibit remarkable visual and linguistic capabilities, achieving impressive performance in various tasks such as image recognition and object localization. However, their effectiveness in fine-grained tasks remains an open question. In everyday scenarios, individuals encountering design materials, such as magazines, typography tutorials, research papers, or branding content, may wish to identify aesthetically pleasing fonts used in the text. Given their multimodal capabilities and free accessibility, many VLMs are often considered potential tools for font recognition. This raises a fundamental question: Do VLMs truly possess the capability to recognize fonts? To investigate this, we introduce the Font Recognition Benchmark (FRB), a compact and well-structured dataset comprising 15 commonly used fonts. FRB includes two versions: (i) an easy version, where 10 sentences are rendered in different fonts, and (ii) a hard version, where each text sample consists of the names of the 15 fonts themselves, introducing a Stroop effect that challenges model perception. Through extensive evaluation of various VLMs on font recognition tasks, we arrive at the following key findings: (i) Current VLMs exhibit limited font recognition capabilities, with many state-of-the-art models failing to achieve satisfactory performance and being easily affected by the Stroop effect introduced by textual information. (ii) Few-shot learning and Chain-of-Thought (CoT) prompting provide minimal benefits in improving font recognition accuracy across different VLMs. (iii) Attention analysis sheds light on the inherent limitations of VLMs in capturing semantic features.
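
As a rough illustration of how the hard version's Stroop-style samples could be constructed (a sketch with hypothetical font paths, not the authors' released code):

```python
# Each image shows the *name* of one font rendered in the glyphs of a
# *different* font, so textual content conflicts with visual letterforms.
from PIL import Image, ImageDraw, ImageFont

FONT_FILES = {  # hypothetical paths; any set of .ttf files would do
    "Arial": "fonts/arial.ttf",
    "Courier New": "fonts/cour.ttf",
    "Times New Roman": "fonts/times.ttf",
}

def render_sample(font_name_text: str, render_font: str) -> Image.Image:
    """Render the name of one font using the glyphs of another."""
    font = ImageFont.truetype(FONT_FILES[render_font], size=48)
    img = Image.new("RGB", (640, 120), color="white")
    ImageDraw.Draw(img).text((20, 30), font_name_text, font=font, fill="black")
    return img

# "Times New Roman" written in Courier glyphs: a model keying on the words
# rather than the letterforms will be misled -- the Stroop effect at work.
render_sample("Times New Roman", "Courier New").save("stroop_sample.png")
```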

[128] Detecting PTSD in Clinical Interviews: A Comparative Analysis of NLP Methods and Large Language Models

Feng Chen, Dror Ben-Zeev, Gillian Sparks, Arya Kadakia, Trevor Cohen

Main category: cs.CL

TL;DR: The study evaluates NLP methods for PTSD detection from clinical interviews, finding domain-specific models and embeddings outperform general ones, with SentenceBERT and few-shot LLM prompting showing promise.

DetailsMotivation: PTSD is underdiagnosed, creating a need for automated detection tools to improve patient identification.

Method: Compared transformer models (BERT/RoBERTa), embeddings (SentenceBERT/LLaMA), and LLM prompting (zero-shot/few-shot/chain-of-thought) using the DAIC-WOZ dataset.

Result: Domain-specific models (Mental-RoBERTa) and SentenceBERT embeddings performed best. Few-shot LLM prompting was competitive. Performance varied by symptom severity and comorbidity.

Conclusion: Domain-adapted embeddings and LLMs show potential for scalable PTSD screening, but nuanced detection needs improvement for clinical viability.

Abstract: Post-Traumatic Stress Disorder (PTSD) remains underdiagnosed in clinical settings, presenting opportunities for automated detection to identify patients. This study evaluates natural language processing approaches for detecting PTSD from clinical interview transcripts. We compared general and mental health-specific transformer models (BERT/RoBERTa), embedding-based methods (SentenceBERT/LLaMA), and large language model prompting strategies (zero-shot/few-shot/chain-of-thought) using the DAIC-WOZ dataset. Domain-specific end-to-end models significantly outperformed general models (Mental-RoBERTa AUPRC=0.675+/-0.084 vs. RoBERTa-base 0.599+/-0.145). SentenceBERT embeddings with neural networks achieved the highest overall performance (AUPRC=0.758+/-0.128). Few-shot prompting using DSM-5 criteria yielded competitive results with two examples (AUPRC=0.737). Performance varied significantly across symptom severity and comorbidity status with depression, with higher accuracy for severe PTSD cases and patients with comorbid depression. Our findings highlight the potential of domain-adapted embeddings and LLMs for scalable screening while underscoring the need for improved detection of nuanced presentations and offering insights for developing clinically viable AI tools for PTSD assessment.
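
A minimal sketch of the best-performing route (sentence embeddings plus a small neural classifier); the model name and toy data are stand-ins, since DAIC-WOZ itself requires a data-use agreement:

```python
from sentence_transformers import SentenceTransformer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import average_precision_score

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in SentenceBERT model

train_texts = ["I keep having nightmares about it.", "Work has been fine lately."]
train_labels = [1, 0]  # 1 = PTSD-positive transcript, 0 = negative
test_texts = ["Loud noises still startle me badly.", "I sleep well most nights."]
test_labels = [1, 0]

X_train = encoder.encode(train_texts)
X_test = encoder.encode(test_texts)

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X_train, train_labels)
scores = clf.predict_proba(X_test)[:, 1]
print("AUPRC:", average_precision_score(test_labels, scores))  # the paper's main metric
```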

[129] The Dual-Route Model of Induction

Sheridan Feucht, Eric Todd, Byron Wallace, David Bau

Main category: cs.CL

TL;DR: The paper introduces concept-level induction heads in LLMs, which copy entire lexical units, unlike token-level induction heads. These heads enable semantic tasks like translation, while token heads handle verbatim copying. They operate independently and contain language-independent word representations.

DetailsMotivation: To explore how LLMs handle copying at different levels (token vs. concept) and understand their roles in semantic tasks like translation.

Method: The study identifies and analyzes concept-level induction heads, compares them to token-level heads, and tests their roles through ablation and output patching.

Result: Concept induction heads handle semantic tasks (e.g., translation) and contain language-independent representations, while token heads manage verbatim copying. Ablation shows their independent operation.

Conclusion: LLMs represent abstract word meanings independently of language, with distinct induction heads for semantic and verbatim tasks.

Abstract: Prior work on in-context copying has shown the existence of induction heads, which attend to and promote individual tokens during copying. In this work we discover a new type of induction head: concept-level induction heads, which copy entire lexical units instead of individual tokens. Concept induction heads learn to attend to the ends of multi-token words throughout training, working in parallel with token-level induction heads to copy meaningful text. We show that these heads are responsible for semantic tasks like word-level translation, whereas token induction heads are vital for tasks that can only be done verbatim (like copying nonsense tokens). These two “routes” operate independently: we show that ablation of token induction heads causes models to paraphrase where they would otherwise copy verbatim. By patching concept induction head outputs, we find that they contain language-independent word representations that mediate natural language translation, suggesting that LLMs represent abstract word meanings independent of language or form.
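
The ablation methodology can be sketched with the `head_mask` interface of Hugging Face transformers; the head indices below are placeholders, since the paper localizes the actual concept- and token-induction heads per model:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

TOKEN_INDUCTION_HEADS = [(5, 1), (6, 9)]  # (layer, head) placeholder indices

# head_mask: 1.0 keeps a head, 0.0 ablates its contribution.
head_mask = torch.ones(model.config.n_layer, model.config.n_head)
for layer, head in TOKEN_INDUCTION_HEADS:
    head_mask[layer, head] = 0.0

prompt = "klmq woz trf. klmq woz"  # nonsense tokens probe the verbatim route
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids, head_mask=head_mask).logits
print(tok.decode(int(logits[0, -1].argmax())))
# With token-induction heads ablated, verbatim copying should degrade while
# paraphrase-like (concept-level) copying survives, per the paper's finding.
```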

[130] APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

Akshara Prabhakar, Zuxin Liu, Ming Zhu, Jianguo Zhang, Tulika Awalgaonkar, Shiyu Wang, Zhiwei Liu, Haolin Chen, Thai Hoang, Juan Carlos Niebles, Shelby Heinecke, Weiran Yao, Huan Wang, Silvio Savarese, Caiming Xiong

Main category: cs.CL

TL;DR: APIGen-MT is a two-phase framework for generating high-quality, verifiable multi-turn agent data, outperforming models like GPT-4o and Claude 3.5 on benchmarks.

DetailsMotivation: High-quality data for multi-turn AI agent interactions is scarce and costly to collect manually.

Method: A two-phase framework: (1) agentic pipeline creates task blueprints with ground-truth actions, (2) transforms blueprints into interaction trajectories via simulated human-agent interplay.

Result: Trained xLAM-2-fc-r models outperform GPT-4o and Claude 3.5 on benchmarks, with smaller models excelling in multi-turn settings.

Conclusion: The verified blueprint-to-details approach produces reliable training data, advancing AI agent research. Data and models are open-sourced.

Abstract: Training effective AI agents for multi-turn interactions requires high-quality data that captures realistic human-agent dynamics, yet such data is scarce and expensive to collect manually. We introduce APIGen-MT, a two-phase framework that generates verifiable and diverse multi-turn agent data. In the first phase, our agentic pipeline produces detailed task blueprints with ground-truth actions, leveraging a committee of LLM reviewers and iterative feedback loops. These blueprints are then transformed into complete interaction trajectories through simulated human-agent interplay. We train a family of models – the xLAM-2-fc-r series with sizes ranging from 1B to 70B parameters. Our models outperform frontier models such as GPT-4o and Claude 3.5 on $\tau$-bench and BFCL benchmarks, with the smaller models surpassing their larger counterparts, particularly in multi-turn settings, while maintaining superior consistency across multiple trials. Comprehensive experiments demonstrate that our verified blueprint-to-details approach yields high-quality training data, enabling the development of more reliable, efficient, and capable agents. We open-source 5K synthetic data trajectories and the trained xLAM-2-fc-r models to advance research in AI agents. Models at https://huggingface.co/collections/Salesforce/xlam-2-67ef5be12949d8dcdae354c4; Dataset at https://huggingface.co/datasets/Salesforce/APIGen-MT-5k and Website at https://apigen-mt.github.io

[131] Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems

Zaid Khan, Elias Stengel-Eskin, Archiki Prasad, Jaemin Cho, Mohit Bansal

Main category: cs.CL

TL;DR: The paper introduces EFAGen, a method to automatically generate Executable Functional Abstractions (EFAs) for advanced math problems, improving upon prior work limited to grade-school math.

DetailsMotivation: To automate the creation of EFAs for advanced math, which previously required human effort, enabling broader applications like problem generation and model testing.

Method: EFAGen formalizes EFA properties as unit tests, uses LLMs to sample candidate programs, and refines them via execution feedback and training.

Result: EFAGen produces faithful EFAs, generates learnable problem variations, and works across competition-level math problems.

Conclusion: EFAGen successfully automates EFA generation for advanced math, with applications in problem variant creation and data generation.

Abstract: Scientists often infer abstract procedures from specific instances of problems and use the abstractions to generate new, related instances. For example, programs encoding the formal rules and properties of a system have been useful in fields ranging from reinforcement learning (procedural environments) to physics (simulation engines). These programs can be seen as functions which execute to different outputs based on their parameterizations (e.g., gridworld configuration or initial physical conditions). We introduce the term EFA (Executable Functional Abstraction) to denote such programs for math problems. EFA-like constructs have been shown to be useful for mathematical reasoning as problem generators for stress-testing models. However, prior work has been limited to automatically constructing abstractions for grade-school math (whose simple rules are easy to encode in programs), while generating EFAs for advanced math has thus far required human engineering. We explore the automatic construction of EFAs for advanced mathematics problems by developing EFAGen, which operationalizes the task of automatically inferring an EFA for a given seed problem and solution as a program synthesis task. We first formalize the properties of any valid EFA as executable unit tests. Using execution feedback from the unit tests, we search over candidate programs sampled from a LLM to find EFA programs that are faithful to the generalized problem and solution class underlying the seed problem. We then apply the tests as a reward signal, training LLMs to become better writers of EFAs. We show that EFAs inferred by EFAGen are faithful to the seed problems, produce learnable problem variations, and that EFAGen can infer EFAs across diverse sources of competition-level math problems. Finally, we show uses of model-written EFAs e.g., finding harder/easier problem variants, as well as data generation.
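
To illustrate what an EFA is, here is a toy example in the paper's spirit (my own problem family, not one from the paper), together with the kind of executable unit test EFAGen uses to validate candidate programs:

```python
import random

class SumOfConsecutiveEFA:
    """Problem family: 'The sum of k consecutive integers is S; find the smallest.'"""

    def sample(self, rng: random.Random):
        n, k = rng.randint(-50, 50), rng.randint(2, 9)
        s = sum(range(n, n + k))
        problem = f"The sum of {k} consecutive integers is {s}. What is the smallest?"
        return problem, (s, k), n  # statement, parameters, ground-truth answer

    def solve(self, s: int, k: int) -> int:
        # n + (n+1) + ... + (n+k-1) = k*n + k*(k-1)/2 = s  =>  solve for n
        return (s - k * (k - 1) // 2) // k

def test_efa_is_consistent():
    """An EFA is valid only if its generator and solver agree on every sample."""
    efa, rng = SumOfConsecutiveEFA(), random.Random(0)
    for _ in range(100):
        _, (s, k), answer = efa.sample(rng)
        assert efa.solve(s, k) == answer

test_efa_is_consistent()
```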

[132] clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations

Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen

Main category: cs.CL

TL;DR: The paper introduces clem todd, a framework for evaluating dialogue systems consistently, enabling benchmarking across user simulators and systems.

DetailsMotivation: Existing research evaluates dialogue components in isolation, limiting generalizability.

Method: Proposes clem todd, a flexible framework for systematic evaluation under uniform conditions.

Result: Demonstrates clem todd’s flexibility by re-evaluating existing systems and integrating new ones, providing insights on architecture, scale, and prompting.

Conclusion: Offers practical guidance for building efficient conversational AI systems.

Abstract: The emergence of instruction-tuned large language models (LLMs) has advanced the field of dialogue systems, enabling both realistic user simulations and robust multi-turn conversational agents. However, existing research often evaluates these components in isolation-either focusing on a single user simulator or a specific system design-limiting the generalisability of insights across architectures and configurations. In this work, we propose clem todd (chat-optimized LLMs for task-oriented dialogue systems development), a flexible framework for systematically evaluating dialogue systems under consistent conditions. clem todd enables detailed benchmarking across combinations of user simulators and dialogue systems, whether existing models from literature or newly developed ones. It supports plug-and-play integration and ensures uniform datasets, evaluation metrics, and computational constraints. We showcase clem todd’s flexibility by re-evaluating existing task-oriented dialogue systems within this unified setup and integrating three newly proposed dialogue systems into the same evaluation pipeline. Our results provide actionable insights into how architecture, scale, and prompting strategies affect dialogue performance, offering practical guidance for building efficient and effective conversational AI systems.

[133] Towards Harmonized Uncertainty Estimation for Large Language Models

Rui Li, Jing Long, Muge Qi, Heming Xia, Lei Sha, Peiyi Wang, Zhifang Sui

Main category: cs.CL

TL;DR: Proposes CUE, a lightweight corrector model that adjusts uncertainty scores in LLMs, improving estimation by up to 60% over existing methods.

DetailsMotivation: Current uncertainty estimation methods for LLMs lack balance and calibration, limiting their reliability.

Method: Introduces CUE, a lightweight model trained on aligned data to adjust uncertainty scores.

Result: Achieves consistent improvements of up to 60% over existing methods.

Conclusion: CUE effectively enhances uncertainty estimation in LLMs, addressing key limitations of prior approaches.

Abstract: To facilitate robust and trustworthy deployment of large language models (LLMs), it is essential to quantify the reliability of their generations through uncertainty estimation. While recent efforts have made significant advancements by leveraging the internal logic and linguistic features of LLMs to estimate uncertainty scores, our empirical analysis highlights the pitfalls of these methods to strike a harmonized estimation between indication, balance, and calibration, which hinders their broader capability for accurate uncertainty estimation. To address this challenge, we propose CUE (Corrector for Uncertainty Estimation): A straightforward yet effective method that employs a lightweight model trained on data aligned with the target LLM’s performance to adjust uncertainty scores. Comprehensive experiments across diverse models and tasks demonstrate its effectiveness, which achieves consistent improvements of up to 60% over existing methods.
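
A hedged sketch of the corrector idea; the architecture and input features below are assumptions, as the paper specifies its own lightweight model trained on data aligned with the target LLM's accuracy:

```python
import torch
import torch.nn as nn

class UncertaintyCorrector(nn.Module):
    """Maps a raw uncertainty score (plus simple features) to a corrected one."""

    def __init__(self, n_features: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid()
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats).squeeze(-1)

# Training target would be 1.0 when the target LLM answered incorrectly and
# 0.0 otherwise, so corrected scores track the model's actual error behavior.
corrector = UncertaintyCorrector()
feats = torch.tensor([[0.8, -1.2, 0.3]])  # e.g. raw score, mean logprob, length
print(corrector(feats))
```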

[134] Controlling Language Confusion in Multilingual LLMs

Nahyun Lee, Yeongseo Woo, Hyunwoo Ko, Guijin Son

Main category: cs.CL

TL;DR: ORPO mitigates language confusion in multilingual models by penalizing undesired outputs, improving consistency without harming performance.

DetailsMotivation: Language confusion in large models degrades user experience, especially in low-resource settings, due to limitations in fine-tuning objectives.

Method: ORPO adds penalties for unwanted outputs to standard supervised fine-tuning (SFT), suppressing language-mixed generations.

Result: ORPO maintains strong language consistency even at high decoding temperatures while preserving general QA performance.

Conclusion: Incorporating penalty terms effectively reduces language confusion, particularly in low-resource scenarios.

Abstract: Large language models often suffer from language confusion, a phenomenon in which responses are partially or entirely generated in unintended languages. This critically degrades the user experience, especially in low-resource settings. We hypothesize that this issue stems from limitations in conventional fine-tuning objectives, such as supervised learning, which optimize the likelihood of correct tokens without explicitly penalizing undesired outputs such as cross-lingual mixing. Analysis of loss trajectories during pretraining further reveals that models fail to distinguish between monolingual and language-mixed texts, highlighting the absence of inherent pressure to avoid such confusion. In this work, we apply ORPO, which adds penalties for unwanted output styles to standard SFT, effectively suppressing language-confused generations. ORPO maintains strong language consistency, even under high decoding temperatures, while preserving general QA performance. Our findings suggest that incorporating appropriate penalty terms can effectively mitigate language confusion in multilingual models, particularly in low-resource scenarios.
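
The penalty term can be sketched as follows; this is a paraphrase of the ORPO objective (Hong et al., 2024), where here the "rejected" response would be a language-mixed generation and the "chosen" one the monolingual target:

```python
import torch
import torch.nn.functional as F

def orpo_loss(logp_chosen, logp_rejected, nll_chosen, lam=0.1):
    """logp_*: length-averaged log-probability of each full response."""
    def log_odds(logp):
        # log[p / (1 - p)], computed stably from log p (valid for logp < 0)
        return logp - torch.log1p(-torch.exp(logp))

    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    penalty = -F.logsigmoid(ratio)     # pushes odds of language-mixed outputs down
    return nll_chosen + lam * penalty  # standard SFT term plus the ORPO penalty

loss = orpo_loss(torch.tensor(-0.7), torch.tensor(-2.0), torch.tensor(0.7))
print(loss)
```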

[135] On Entity Identification in Language Models

Masaki Sakata, Benjamin Heinzerling, Sho Yokoi, Takumi Ito, Kentaro Inui

Main category: cs.CL

TL;DR: The paper analyzes how language models (LMs) internally represent named entities, focusing on ambiguity and variability in entity mentions. It proposes a clustering-based framework to evaluate LM performance, showing high precision and recall (0.66-0.9). Entity information is compactly encoded in early layers, influencing word prediction.

DetailsMotivation: To understand how LMs internally identify and distinguish named entity mentions, addressing the challenges of ambiguity and variability.

Method: Proposes a clustering framework to analyze LM representations, evaluating how mentions of the same entity cluster and different entities separate. Tests five Transformer-based autoregressive models.

Result: LMs effectively identify entities (precision/recall: 0.66-0.9). Entity information is compactly represented in early layers and influences word prediction.

Conclusion: LM representations align with real-world entity structures, revealing how LMs organize and use entity information internally.

Abstract: We analyze the extent to which internal representations of language models (LMs) identify and distinguish mentions of named entities, focusing on the many-to-many correspondence between entities and their mentions. We first formulate two problems of entity mentions – ambiguity and variability – and propose a framework analogous to clustering quality metrics. Specifically, we quantify through cluster analysis of LM internal representations the extent to which mentions of the same entity cluster together and mentions of different entities remain separated. Our experiments examine five Transformer-based autoregressive models, showing that they effectively identify and distinguish entities with metrics analogous to precision and recall ranging from 0.66 to 0.9. Further analysis reveals that entity-related information is compactly represented in a low-dimensional linear subspace at early LM layers. Additionally, we clarify how the characteristics of entity representations influence word prediction performance. These findings are interpreted through the lens of isomorphism between LM representations and entity-centric knowledge structures in the real world, providing insights into how LMs internally organize and use entity information.
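
A toy version of the clustering-quality analysis; as a simplification, "same entity" is predicted when cosine similarity exceeds a threshold, and pairs are scored with precision and recall, analogous to the paper's metrics:

```python
import numpy as np
from itertools import combinations

def pairwise_precision_recall(reps, entity_ids, threshold=0.8):
    reps = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    tp = fp = fn = 0
    for i, j in combinations(range(len(reps)), 2):
        same_pred = reps[i] @ reps[j] > threshold
        same_true = entity_ids[i] == entity_ids[j]
        tp += same_pred and same_true
        fp += same_pred and not same_true
        fn += same_true and not same_pred
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Four mention representations of two entities, two mentions each.
reps = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
print(pairwise_precision_recall(reps, ["NYC", "NYC", "Paris", "Paris"]))
```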

[136] Label-semantics Aware Generative Approach for Domain-Agnostic Multilabel Classification

Subhendu Khatuya, Shashwat Naidu, Saptarshi Ghosh, Pawan Goyal, Niloy Ganguly

Main category: cs.CL

TL;DR: A generative model framework (LAGAMC) for multi-label text classification uses label descriptions and a dual-objective loss function, achieving state-of-the-art performance.

DetailsMotivation: Manual document classification is challenging due to the explosion of textual data, necessitating an efficient, domain-agnostic solution.

Method: The model generates label descriptions from input text and matches them to predefined labels using a sentence transformer, combined with a dual-objective loss function.

Result: LAGAMC achieves 13.94% and 24.85% improvements in Micro-F1 and Macro-F1, respectively, outperforming baselines.

Conclusion: LAGAMC is efficient, versatile, and highly effective for multi-label text classification.

Abstract: The explosion of textual data has made manual document classification increasingly challenging. To address this, we introduce a robust, efficient domain-agnostic generative model framework for multi-label text classification. Instead of treating labels as mere atomic symbols, our approach utilizes predefined label descriptions and is trained to generate these descriptions based on the input text. During inference, the generated descriptions are matched to the pre-defined labels using a finetuned sentence transformer. We integrate this with a dual-objective loss function, combining cross-entropy loss and cosine similarity of the generated sentences with the predefined target descriptions, ensuring both semantic alignment and accuracy. Our proposed model LAGAMC stands out for its parameter efficiency and versatility across diverse datasets, making it well-suited for practical applications. We demonstrate the effectiveness of our proposed model by achieving new state-of-the-art performances across all evaluated datasets, surpassing several strong baselines. We achieve improvements of 13.94% in Micro-F1 and 24.85% in Macro-F1 compared to the closest baseline across all datasets.
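
The dual-objective loss can be sketched as below; shapes and the weighting scheme are assumptions, but the structure (token-level cross-entropy plus cosine similarity between generated and target description embeddings) follows the abstract:

```python
import torch
import torch.nn.functional as F

def dual_objective_loss(token_logits, target_ids, gen_emb, target_emb, alpha=0.5):
    # Token-level cross-entropy over the generated label description.
    ce = F.cross_entropy(token_logits.view(-1, token_logits.size(-1)),
                         target_ids.view(-1))
    # Semantic alignment: cosine similarity of sentence embeddings.
    cos = F.cosine_similarity(gen_emb, target_emb, dim=-1).mean()
    return alpha * ce + (1 - alpha) * (1 - cos)

batch, seq, vocab, dim = 2, 8, 100, 384
loss = dual_objective_loss(torch.randn(batch, seq, vocab),
                           torch.randint(0, vocab, (batch, seq)),
                           torch.randn(batch, dim), torch.randn(batch, dim))
print(loss)
```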

[137] Draft-based Approximate Inference for LLMs

Kevin Galim, Ethan Ewer, Wonjun Kang, Minjae Lee, Hyung Il Koo, Kangwook Lee

Main category: cs.CL

TL;DR: A novel framework for optimizing LLM inference uses small draft models to predict token and KV pair importance, improving accuracy and efficiency.

DetailsMotivation: Address the inefficiency of existing approximation methods for LLM inference by leveraging draft models for better importance predictions.

Method: Introduces SpecKV for KV cache dropping and SpecPC for prompt compression, both using draft models to guide importance decisions.

Result: Outperforms baselines in accuracy while maintaining improvements in memory, latency, and throughput.

Conclusion: The proposed framework offers a more effective approach to approximate LLM inference, validated by theoretical and empirical results.

Abstract: Optimizing inference for long-context Large Language Models (LLMs) is increasingly important due to the quadratic compute and linear memory complexity of Transformers. Existing approximation methods, such as key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on rough predictions of token or KV pair importance. We propose a novel framework for approximate LLM inference that leverages small draft models to more accurately predict the importance of tokens and KV pairs. Specifically, we introduce two instantiations of our proposed framework: (i) SpecKV, the first method that leverages a draft output to accurately assess the importance of each KV pair for more effective KV cache dropping, and (ii) SpecPC, which uses the draft model’s attention activations to identify and discard unimportant prompt tokens. We motivate our methods with theoretical and empirical analyses, and show a strong correlation between the attention patterns of draft and target models. Extensive experiments on long-context benchmarks show that our methods consistently achieve higher accuracy than existing baselines, while preserving the same improvements in memory usage, latency, and throughput. Our code is available at https://github.com/furiosa-ai/draft-based-approx-llm.
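
A conceptual sketch of the draft-guided KV selection; this simplifies SpecKV to its core idea of scoring prompt KV pairs by the draft model's attention from a draft output:

```python
import torch

def speckv_keep_indices(draft_attn: torch.Tensor, keep_ratio: float = 0.25):
    """draft_attn: (heads, draft_out_len, prompt_len) attention from a small
    draft model's generated tokens back onto the prompt."""
    importance = draft_attn.mean(dim=(0, 1))           # (prompt_len,)
    k = max(1, int(keep_ratio * importance.numel()))
    return importance.topk(k).indices.sort().values    # KV pairs to retain

attn = torch.rand(8, 16, 512)  # toy draft attention map
keep = speckv_keep_indices(attn)
print(keep.numel(), "of 512 prompt KV pairs retained for the target model")
```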

[138] Dynamic Context Tuning for Retrieval-Augmented Generation: Enhancing Multi-Turn Planning and Tool Adaptation

Jubin Abhishek Soni, Amit Anand, Rajesh Kumar Pandey, Aniket Abhishek Soni

Main category: cs.CL

TL;DR: DCT extends RAG for dynamic domains with multi-turn dialogue and evolving tools, improving accuracy and reducing hallucinations.

DetailsMotivation: Existing RAG systems are static and unsuitable for dynamic domains like healthcare and smart homes.

Method: DCT uses an attention-based context cache, LoRA-based retrieval, and context compression to adapt without retraining.

Result: DCT improves plan accuracy by 14%, reduces hallucinations by 37%, and matches GPT-4 at lower cost.

Conclusion: DCT enables scalable, adaptable AI assistants for dynamic environments, generalizing to unseen tools.

Abstract: Retrieval-Augmented Generation (RAG) has significantly advanced large language models (LLMs) by grounding their outputs in external tools and knowledge sources. However, existing RAG systems are typically constrained to static, single-turn interactions with fixed toolsets, making them ill-suited for dynamic domains such as healthcare and smart homes, where user intent, available tools, and contextual factors evolve over time. We present Dynamic Context Tuning (DCT), a lightweight framework that extends RAG to support multi-turn dialogue and evolving tool environments without requiring retraining. DCT integrates an attention-based context cache to track relevant past information, LoRA-based retrieval to dynamically select domain-specific tools, and efficient context compression to maintain inputs within LLM context limits. Experiments on both synthetic and real-world benchmarks show that DCT improves plan accuracy by 14% and reduces hallucinations by 37%, while matching GPT-4 performance at significantly lower cost. Furthermore, DCT generalizes to previously unseen tools, enabling scalable and adaptable AI assistants across a wide range of dynamic environments.

[139] Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models

Omer Luxembourg, Haim Permuter, Eliya Nachmani

Main category: cs.CL

TL;DR: Dilated Unmasking Scheduler (DUS) improves parallel text generation in masked diffusion models by minimizing joint entropy gain, outperforming confidence-based methods.

DetailsMotivation: Existing samplers for masked diffusion models reduce to slow autoregressive behavior by ignoring token interactions during parallel unmasking.

Method: DUS partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel to minimize joint entropy gain.

Result: DUS outperforms confidence-based planners across math, code, and general-knowledge benchmarks without modifying the denoiser.

Conclusion: DUS reveals the true speed-quality frontier of masked diffusion language models.

Abstract: Masked diffusion language models (MDLMs) promise fast, non-autoregressive text generation, yet existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel and effectively reduce to slow, autoregressive behavior. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel so as to minimize an upper bound on joint entropy gain at each denoising step. By explicitly trading off the number of network calls against generation quality, DUS recovers most of the performance lost under traditional parallel unmasking strategies. Across math (GSM8K, MATH500), code (HumanEval, MBPP) and general-knowledge benchmarks (BBH, MMLU-Pro), DUS outperforms confidence-based planners, without modifying the underlying denoiser, and reveals the true speed-quality frontier of MDLMs.
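
The dilation schedule itself is simple to sketch; on a minimal reading, positions are grouped by residue so tokens unmasked in the same parallel step are never adjacent:

```python
def dilated_groups(seq_len: int, num_groups: int):
    """Partition positions 0..seq_len-1 into non-adjacent dilated groups."""
    return [list(range(g, seq_len, num_groups)) for g in range(num_groups)]

for step, group in enumerate(dilated_groups(12, 4)):
    print(f"denoising step {step}: unmask positions {group}")
# Step 0 unmasks 0,4,8; step 1 unmasks 1,5,9; and so on. Adjacent positions
# never share a parallel step, limiting the joint-entropy interactions that
# confidence-based parallel unmasking ignores.
```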

[140] KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model

Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Qian Chen, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, Min Zhang

Main category: cs.CL

TL;DR: KaLM-Embedding-V2 is a compact embedding model with innovative training techniques, achieving top performance on MTEB benchmarks despite its small size.

DetailsMotivation: To develop a versatile and compact embedding model that outperforms larger models by leveraging superior training techniques and diverse data.

Method: Uses a bidirectional transformer with mean-pooling, multi-stage training (pre-training, fine-tuning, model-soup), focal-style reweighting, and online hard-negative mixing.

Result: Significantly outperforms comparable models and competes with much larger ones on MTEB benchmarks.

Conclusion: KaLM-Embedding-V2 sets a new standard for compact embedding models, demonstrating efficiency and robustness.

Abstract: In this paper, we propose KaLM-Embedding-V2, a versatile and compact embedding model, which achieves impressive performance in general-purpose text embedding tasks by leveraging superior training techniques and data. Our key innovations include: (1) To better align the architecture with representation learning, we remove the causal attention mask and adopt a fully bidirectional transformer with simple yet effective mean-pooling to produce fixed-length embeddings; (2) We employ a multi-stage training pipeline: (i) pre-training on large-scale weakly supervised open-source corpora; (ii) fine-tuning on high-quality retrieval and non-retrieval datasets; and (iii) model-soup parameter averaging for robust generalization. Besides, we introduce a focal-style reweighting mechanism that concentrates learning on difficult samples and an online hard-negative mixing strategy to continuously enrich hard negatives without expensive offline mining; (3) We collect over 20 categories of data for pre-training and 100 categories of data for fine-tuning, to boost both the performance and generalization of the embedding model. Extensive evaluations on the Massive Text Embedding Benchmark (MTEB) Chinese and English show that our model significantly outperforms others of comparable size, and competes with 3x, 14x, 18x, and 26x larger embedding models, setting a new standard for a versatile and compact embedding model with less than 1B parameters.
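
Two of the ingredients are easy to sketch: mean pooling over a bidirectional encoder's hidden states, and a focal-style reweighted in-batch contrastive loss (the exact reweighting form here is an assumption):

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden, attention_mask):
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-6)

def focal_infonce(q, d, gamma=2.0, tau=0.05):
    """q, d: (batch, dim) query/document embeddings; in-batch negatives."""
    logits = F.normalize(q, dim=-1) @ F.normalize(d, dim=-1).T / tau
    targets = torch.arange(q.size(0))
    p_correct = logits.softmax(-1)[targets, targets]
    ce = F.cross_entropy(logits, targets, reduction="none")
    return ((1 - p_correct) ** gamma * ce).mean()  # difficult pairs weighted up

hidden, mask = torch.randn(4, 12, 64), torch.ones(4, 12, dtype=torch.long)
q = mean_pool(hidden, mask)  # fixed-length embeddings from token states
print(focal_infonce(q, torch.randn(4, 64)))
```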

[141] A Mathematical Theory of Discursive Networks

Juan B. Gutiérrez

Main category: cs.CL

TL;DR: The paper explores LLMs as a discursive network where humans and models interact equally, identifying hazards like misinformation and proposing peer review (FOO algorithm) to stabilize truth.

DetailsMotivation: To understand and mitigate the risks of misinformation in human-LLM interactions by treating them as a networked system.

Method: Develops a mathematical model of discursive networks, introduces the FOO algorithm for peer review, and analyzes hazards like drift and fabrication.

Result: Networks with drift and self-repair stabilize at modest error rates; peer review shifts systems toward truth dominance.

Conclusion: Reliability in LLM interactions requires networked accountability, not perfecting individual models.

Abstract: Large language models (LLMs) turn writing into a live exchange between humans and software. We characterize this new medium as a discursive network that treats people and LLMs as equal nodes and tracks how their statements circulate. We define the generation of erroneous information as invalidation (any factual, logical, or structural breach) and show it follows four hazards: drift from truth, self-repair, fresh fabrication, and external detection. We develop a general mathematical model of discursive networks that shows that a network governed only by drift and self-repair stabilizes at a modest error rate. Giving each false claim even a small chance of peer review shifts the system to a truth-dominant state. We operationalize peer review with the open-source Flaws-of-Others (FOO) algorithm: a configurable loop in which any set of agents critique one another while a harmonizer merges their verdicts. We identify an ethical transgression, epithesis, that occurs when humans fail to engage in the discursive network. The takeaway is practical and cultural: reliability in this new medium comes not from perfecting single models but from connecting imperfect ones into networks that enforce mutual accountability.
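
The drift/repair dynamics can be illustrated with a toy fixed-point iteration; the update rule below is an assumption in the spirit of the model, not the paper's exact equations:

```python
# e: fraction of invalid statements in circulation. True statements drift
# into error at rate d; errors self-repair at rate r; peer review (FOO-style)
# catches a further fraction v of errors each round.
def steady_state_error(d=0.05, r=0.20, v=0.0, steps=10_000):
    e = 0.5
    for _ in range(steps):
        e = e + (1 - e) * d - e * (r + v)
    return e  # converges to d / (d + r + v)

print(steady_state_error())        # drift + self-repair alone: e* = 0.20
print(steady_state_error(v=0.10))  # even a small review chance: e* ~= 0.14
```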

[142] KAT-V1: Kwai-AutoThink Technical Report

Zizheng Zhan, Ken Deng, Huaixi Tang, Wen Xiang, Kun Wu, Weihao Li, Wenqiang Zhu, Jingxuan Xu, Lecheng Huang, Zongxian Feng, Shaojie Wang, Shangpeng Yan, Xuxing Chen, Jiaheng Liu, Zhongyuan Peng, Zuchen Gao, Haoyang Huang, Xiaojiang Zhang, Jinghui Wang, Zheng Lin, Mengtong Li, Huiming Wang, Ziqi Zhan, Yanan Wu, Yuanxing Zhang, Jian Yang, Guang Chen, Haotian Zhang, Bin Chen, Bing Yu

Main category: cs.CL

TL;DR: Kwaipilot-AutoThink (KAT) is a 40B open-source LLM addressing overthinking in reasoning tasks via dynamic mode-switching, efficient training, and reinforcement learning, outperforming state-of-the-art models.

DetailsMotivation: To solve the overthinking problem in reasoning-intensive tasks by dynamically adjusting reasoning modes based on task complexity.

Method: Uses a dual-regime dataset, MTP-enhanced knowledge distillation, cold-start initialization, and Step-SRPO reinforcement learning for mode selection and accuracy.

Result: KAT matches or outperforms top models like DeepSeek-R1-0528 and Qwen3-235B-A22B, reducing token usage and excelling in real-world deployment.

Conclusion: KAT’s AutoThink paradigm is scalable, with promising early results for a 200B MoE model, demonstrating efficiency and effectiveness in reasoning tasks.

Abstract: We present Kwaipilot-AutoThink (KAT), an open-source 40B large language model developed to address the overthinking problem in reasoning-intensive tasks, where an automatic thinking training paradigm is proposed to dynamically switch between reasoning and non-reasoning modes based on task complexity. Specifically, first, we construct the dual-regime dataset based on a novel tagging pipeline and a multi-agent synthesis strategy, and then we apply Multi-Token Prediction (MTP)-enhanced knowledge distillation, enabling efficient and fine-grained reasoning transfer with minimal pretraining cost. Besides, we implement a cold-start initialization strategy that introduces mode-selection priors using majority-vote signals and intent-aware prompting. Finally, we propose Step-SRPO, a reinforcement learning algorithm that incorporates intermediate supervision into the GRPO framework, offering structured guidance over both reasoning-mode selection and response accuracy. Extensive experiments across multiple benchmarks demonstrate that KAT consistently matches or even outperforms current state-of-the-art models, including DeepSeek-R1-0528 and Qwen3-235B-A22B, across a wide range of reasoning-intensive tasks while reducing token usage. Notably, KAT outperforms all open-source models and even surpasses o3-mini on the leakage-controlled LiveCodeBench Pro. Beyond academic evaluation, KAT has been successfully deployed in Kwaipilot (i.e., Kuaishou’s internal coding assistant), where it improves real-world development workflows with high accuracy, efficiency, and controllable reasoning behaviors. Moreover, we are actively training a 200B Mixture-of-Experts (MoE) model with 40B active parameters, and early results already show significant gains, further demonstrating the scalability of the AutoThink paradigm.

[143] Lizard: An Efficient Linearization Framework for Large Language Models

Chien Van Nguyen, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Viet Dac Lai, Haoliang Wang, Jayakumar Subramanian, Ryan A. Rossi, Trung Bui, Nikos Vlassis, Franck Dernoncourt, Thien Huu Nguyen

Main category: cs.CL

TL;DR: Lizard is a framework transforming Transformer-based LLMs into subquadratic architectures for infinite-context generation, addressing memory and computational bottlenecks with a hybrid attention mechanism.

DetailsMotivation: Transformer-based LLMs face memory and computational inefficiencies with increasing context lengths due to quadratic complexity in attention and KV cache growth.

Method: Lizard introduces a subquadratic attention mechanism with a gating module, combining gated linear attention and sliding window attention with meta memory.

Result: Lizard achieves near-lossless performance recovery of teacher models, outperforms prior linearization methods by 18 points on MMLU, and excels in associative recall tasks.

Conclusion: Lizard offers a flexible, efficient solution for infinite-context generation, balancing long-range dependencies and local interactions while improving training speed.

Abstract: We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into flexible, subquadratic architectures for infinite-context generation. Transformer-based LLMs face significant memory and computational bottlenecks as context lengths increase, due to the quadratic complexity of softmax attention and the growing key-value (KV) cache. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving the output quality. Unlike previous linearization methods, which are often limited by fixed model structures and therefore exclude gating mechanisms, Lizard incorporates a gating module inspired by recent state-of-the-art linear models. This enables adaptive memory control, supports constant-memory inference, offers strong length generalization, and allows more flexible model design. Lizard combines gated linear attention for global context compression with sliding window attention enhanced by meta memory, forming a hybrid mechanism that captures both long-range dependencies and fine-grained local interactions. Moreover, we introduce a hardware-aware algorithm that accelerates the training speed of our models. Extensive experiments show that Lizard achieves near-lossless recovery of the teacher model’s performance across standard language modeling tasks, while significantly outperforming previous linearization methods. On the 5-shot MMLU benchmark, Lizard improves over prior models by 18 points and shows significant improvements on associative recall tasks.
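
The global branch of the hybrid can be viewed as a gated linear-attention recurrence with constant-size state; the following is a generic simplification of that mechanism, not Lizard's exact formulation:

```python
import torch

def gated_linear_attention(q, k, v, g):
    """q, k, v: (seq, dim); g: (seq, 1) gates in (0, 1). State is O(dim^2),
    independent of sequence length, which is what enables infinite context."""
    state = torch.zeros(q.shape[-1], v.shape[-1])
    outputs = []
    for t in range(q.shape[0]):
        # The gate decays old context before the new key/value pair is written.
        state = g[t] * state + torch.outer(k[t], v[t])
        outputs.append(q[t] @ state)
    return torch.stack(outputs)

seq, dim = 16, 32
out = gated_linear_attention(torch.randn(seq, dim), torch.randn(seq, dim),
                             torch.randn(seq, dim),
                             torch.sigmoid(torch.randn(seq, 1)))
print(out.shape)  # (16, 32)
```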

[144] Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun

Main category: cs.CL

TL;DR: Mixture-of-Recursions (MoR) combines parameter sharing and adaptive computation in a Recursive Transformer, improving efficiency and performance across model scales.

DetailsMotivation: Addressing the high computational and memory costs of scaling language models by unifying parameter sharing and adaptive computation.

Method: Introduces MoR, a framework with shared layers across recursion steps and lightweight routers for adaptive token-level recursion depths, optimizing attention computation and memory usage.

Result: MoR achieves better validation perplexity, few-shot accuracy, and throughput compared to baselines, forming a new Pareto frontier.

Conclusion: MoR effectively delivers large-model quality at reduced cost, making it a promising approach for efficient language model scaling.

Abstract: Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.
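
Schematically, token-level recursion routing looks like the sketch below; the router, depths, and the recompute-then-mask trick are placeholders for clarity (real MoR skips computation for inactive tokens and manages their KV caches selectively):

```python
import torch
import torch.nn as nn

class MoRBlock(nn.Module):
    def __init__(self, dim=64, max_depth=3):
        super().__init__()
        # One shared layer is reused at every recursion step (parameter sharing).
        self.shared_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.router = nn.Linear(dim, max_depth)  # predicts per-token depth
        self.max_depth = max_depth

    def forward(self, x):
        depths = self.router(x).argmax(-1) + 1      # (batch, seq), in 1..max_depth
        for step in range(self.max_depth):
            active = (depths > step).unsqueeze(-1)  # tokens still recursing
            x = torch.where(active, self.shared_layer(x), x)
        return x

print(MoRBlock()(torch.randn(2, 10, 64)).shape)  # easy tokens exit early
```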

[145] Dr.Copilot: A Multi-Agent Prompt Optimized Assistant for Improving Patient-Doctor Communication in Romanian

Andrei Niculae, Adrian Cosma, Cosmin Dumitrache, Emilian Rǎdoi

Main category: cs.CL

TL;DR: Dr. Copilot is a multi-agent LLM system for Romanian-speaking doctors, improving the presentation quality of telemedicine responses along 17 axes, not clinical accuracy. It uses three LLM agents with optimized prompts, shows measurable improvements in user reviews, and is among the first LLM deployments in Romanian medical settings.

DetailsMotivation: The quality of medical advice in text-based telemedicine is often judged on communication rather than clinical accuracy. This gap motivates the development of Dr. Copilot to enhance presentation quality.

Method: The system uses three LLM agents with prompts optimized via DSPy, designed for low-resource Romanian data and deployed with open-weight models to provide real-time feedback.

Result: Empirical evaluations and live deployment with 41 doctors show measurable improvements in user reviews and response quality.

Conclusion: Dr. Copilot successfully enhances the presentation quality of telemedicine responses, marking a pioneering LLM deployment in Romanian healthcare.

Abstract: Text-based telemedicine has become increasingly common, yet the quality of medical advice in doctor-patient interactions is often judged more on how advice is communicated than on its clinical accuracy. To address this, we introduce Dr. Copilot, a multi-agent large language model (LLM) system that supports Romanian-speaking doctors by evaluating and enhancing the presentation quality of their written responses. Rather than assessing medical correctness, Dr. Copilot provides feedback along 17 interpretable axes. The system comprises three LLM agents with prompts automatically optimized via DSPy. Designed with low-resource Romanian data and deployed using open-weight models, it delivers real-time specific feedback to doctors within a telemedicine platform. Empirical evaluations and live deployment with 41 doctors show measurable improvements in user reviews and response quality, marking one of the first real-world deployments of LLMs in Romanian medical settings.

[146] A Survey of Context Engineering for Large Language Models

Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, Shenghua Liu

Main category: cs.CL

TL;DR: The paper introduces Context Engineering for optimizing LLM performance, detailing foundational components and system implementations, while highlighting a research gap in model output sophistication.

DetailsMotivation: To formalize and optimize contextual information for LLMs, addressing the asymmetry between understanding and generating complex outputs.

Method: A systematic analysis of over 1400 papers, decomposing Context Engineering into foundational components (retrieval, generation, processing, management) and system implementations (RAG, memory systems, multi-agent systems).

Result: Establishes a technical roadmap and identifies a critical gap: LLMs excel in context understanding but lag in generating sophisticated long-form outputs.

Conclusion: The survey provides a unified framework for advancing context-aware AI, prioritizing future research to address the output generation gap.

Abstract: The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval-augmented generation (RAG), memory systems and tool-integrated reasoning, and multi-agent systems. Through this systematic analysis of over 1400 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.

cs.CV

[147] Comparative Analysis of Algorithms for the Fitting of Tessellations to 3D Image Data

Andreas Alpers, Orkun Furat, Christian Jung, Matthias Neumann, Claudia Redenbach, Aigerim Saken, Volker Schmidt

Main category: cs.CV

TL;DR: Comparative analysis of algorithmic strategies for fitting tessellation models to 3D image data, evaluating trade-offs between model complexity, optimization methods, and approximation quality.

DetailsMotivation: To assess and guide the selection of optimization-based methods for approximating voxel-based grain structures in materials like polycrystals and foams.

Method: Review and evaluation of linear/nonlinear programming, stochastic optimization (cross-entropy method), and gradient descent for generating Voronoi, Laguerre, and GBPDs.

Result: Trade-offs identified between model complexity, optimization complexity, and approximation quality, with guidance for method selection based on data and application needs.

Conclusion: Provides insights for choosing appropriate tessellation model fitting methods, balancing complexity and accuracy for real-world datasets.

Abstract: This paper presents a comparative analysis of algorithmic strategies for fitting tessellation models to 3D image data of materials such as polycrystals and foams. In this steadily advancing field, we review and assess optimization-based methods – including linear and nonlinear programming, stochastic optimization via the cross-entropy method, and gradient descent – for generating Voronoi, Laguerre, and generalized balanced power diagrams (GBPDs) that approximate voxel-based grain structures. The quality of fit is evaluated on real-world datasets using discrepancy measures that quantify differences in grain volume, surface area, and topology. Our results highlight trade-offs between model complexity, the complexity of the optimization routines involved, and the quality of approximation, providing guidance for selecting appropriate methods based on data characteristics and application needs.
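
The central object is easy to state: a Laguerre (power) diagram assigns each voxel to the seed minimizing a weighted squared distance. Below is a minimal labeling sketch with toy seeds and weights (fitting these parameters to image data is what the surveyed methods optimize):

```python
import numpy as np

def laguerre_labels(shape, seeds, weights):
    """Assign each voxel x to argmin_i ||x - c_i||^2 - w_i."""
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in shape], indexing="ij"), -1)
    d2 = ((grid[..., None, :] - seeds) ** 2).sum(-1)  # (..., n_seeds)
    return np.argmin(d2 - weights, axis=-1)

seeds = np.array([[8.0, 8.0, 8.0], [24.0, 24.0, 24.0], [8.0, 24.0, 16.0]])
weights = np.array([0.0, 50.0, 10.0])  # a larger weight grows a grain
labels = laguerre_labels((32, 32, 32), seeds, weights)
print(np.bincount(labels.ravel()))     # voxel count per grain
```

With all weights equal this reduces to a Voronoi diagram, which is why the weights are the extra degrees of freedom the fitting algorithms exploit.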

[148] Semantic Segmentation based Scene Understanding in Autonomous Vehicles

Ehsan Rassekh

Main category: cs.CV

TL;DR: The paper explores efficient models for scene understanding via semantic segmentation in self-driving cars, emphasizing the impact of backbone choice on performance.

DetailsMotivation: To enhance scene understanding in autonomous vehicles by improving semantic segmentation models, leveraging AI and deep learning.

Method: Proposes several models using the BDD100k dataset and evaluates different backbones as encoders.

Result: Results show backbone selection significantly affects model performance, improving accuracy, mean IoU, and loss metrics.

Conclusion: Choosing the right backbone is crucial for effective semantic segmentation, enhancing scene understanding in autonomous systems.

Abstract: In recent years, artificial intelligence (AI) has become a prominent keyword because of its promise in solving complex tasks. Human expertise in specific areas may no longer be needed, because machines using AI have achieved successful results and can make the right decisions in critical situations. This progress is made possible by deep learning (DL), one of the most popular AI technologies. One area where DL is applied, to great effect and importance, is the development of self-driving cars. In this work, we propose several efficient models to investigate scene understanding through semantic segmentation. We use the BDD100k dataset to evaluate these models. Another contribution of this work is the use of several backbones as encoders for the models. The obtained results show that choosing an appropriate backbone has a great effect on the model's semantic segmentation performance. Better semantic segmentation, in turn, allows the agent to better understand the scene and its surrounding environment. Finally, we analyze and evaluate the proposed models in terms of accuracy, mean IoU, and loss, and the results show that these metrics improve.
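
The mean-IoU metric used to compare backbones follows the standard per-class intersection-over-union formulation; a minimal sketch (not code from the paper):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 1], [1, 2]])
target = np.array([[0, 1], [2, 2]])
print(mean_iou(pred, target, num_classes=3))  # (1/1 + 1/2 + 1/2) / 3 ~= 0.667
```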

[149] CLIPTTA: Robust Contrastive Vision-Language Test-Time Adaptation

Marc Lafon, Gustavo Adolfo Vargas Hakim, Clément Rambour, Christian Desrosier, Nicolas Thome

Main category: cs.CV

TL;DR: CLIPTTA is a gradient-based test-time adaptation method for vision-language models, using a soft contrastive loss to align with CLIP’s training. It outperforms entropy-based methods and shows stable performance across diverse shifts.

DetailsMotivation: Vision-language models (VLMs) like CLIP struggle with distribution shifts. Existing test-time adaptation (TTA) methods, like entropy minimization, misalign with CLIP's contrastive training, leading to poor adaptation.

Method: CLIPTTA introduces a soft contrastive loss aligned with CLIP’s pre-training. It includes a theoretical analysis of gradients and extends to open-set scenarios with an Outlier Contrastive Exposure (OCE) loss for OOD detection.

Result: Evaluated on 75 datasets, CLIPTTA outperforms entropy-based methods and competes with state-of-the-art TTA methods, showing stable performance across diverse shifts.

Conclusion: CLIPTTA effectively adapts VLMs at test time, addressing limitations of entropy-based methods and improving generalization under distribution shifts.

Abstract: Vision-language models (VLMs) like CLIP exhibit strong zero-shot capabilities but often fail to generalize under distribution shifts. Test-time adaptation (TTA) allows models to update at inference time without labeled data, typically via entropy minimization. However, this objective is fundamentally misaligned with the contrastive image-text training of VLMs, limiting adaptation performance and introducing failure modes such as pseudo-label drift and class collapse. We propose CLIPTTA, a new gradient-based TTA method for vision-language models that leverages a soft contrastive loss aligned with CLIP’s pre-training objective. We provide a theoretical analysis of CLIPTTA’s gradients, showing how its batch-aware design mitigates the risk of collapse. We further extend CLIPTTA to the open-set setting, where both in-distribution (ID) and out-of-distribution (OOD) samples are encountered, using an Outlier Contrastive Exposure (OCE) loss to improve OOD detection. Evaluated on 75 datasets spanning diverse distribution shifts, CLIPTTA consistently outperforms entropy-based objectives and is highly competitive with state-of-the-art TTA methods, outperforming them on a large number of datasets and exhibiting more stable performance across diverse shifts.
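
One plausible reading of the batch-aware soft contrastive objective, simplified to class-prompt TTA (the exact loss in the paper differs in its soft-pairing details):

```python
import torch
import torch.nn.functional as F

def clip_tta_loss(image_feats, text_feats, tau=0.01):
    """image_feats: (B, D) test batch; text_feats: (C, D) class prompts."""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = img @ txt.T / tau             # (B, C) image-text similarities
    pseudo = logits.argmax(-1)             # pseudo-label per test image
    i2t = F.cross_entropy(logits, pseudo)  # image-to-text direction
    # Text-to-image direction normalizes over the *batch*: the batch-aware
    # term that discourages all images collapsing onto a single class.
    t2i = F.cross_entropy(logits.T[pseudo], torch.arange(img.size(0)))
    return 0.5 * (i2t + t2i)

loss = clip_tta_loss(torch.randn(8, 512, requires_grad=True), torch.randn(10, 512))
loss.backward()  # in practice only lightweight parameters are updated
```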

[150] A Hidden Stumbling Block in Generalized Category Discovery: Distracted Attention

Qiyu Xu, Zhanxuan Hu, Yu Duan, Ercheng Pei, Yonghang Tai

Main category: cs.CV

TL;DR: The paper introduces Attention Focusing (AF), a mechanism to improve Generalized Category Discovery (GCD) by reducing distracted attention in models, leading to better feature extraction and performance.

DetailsMotivation: Existing GCD methods often suffer from distracted attention, where models focus on irrelevant background regions, hindering optimal feature extraction.

Method: AF consists of Token Importance Measurement (TIME) and Token Adaptive Pruning (TAP) to quantify and prune non-informative tokens, sharpening the model’s focus.

Result: AF improves performance by up to 15.4% when integrated into SimGCD, a prominent GCD method, with minimal computational overhead.

Conclusion: AF is a lightweight, plug-and-play solution that effectively addresses distracted attention in GCD, enhancing model performance.

Abstract: Generalized Category Discovery (GCD) aims to classify unlabeled data from both known and unknown categories by leveraging knowledge from labeled known categories. While existing methods have made notable progress, they often overlook a hidden stumbling block in GCD: distracted attention. Specifically, when processing unlabeled data, models tend to focus not only on key objects in the image but also on task-irrelevant background regions, leading to suboptimal feature extraction. To remove this stumbling block, we propose Attention Focusing (AF), an adaptive mechanism designed to sharpen the model’s focus by pruning non-informative tokens. AF consists of two simple yet effective components: Token Importance Measurement (TIME) and Token Adaptive Pruning (TAP), working in a cascade. TIME quantifies token importance across multiple scales, while TAP prunes non-informative tokens by utilizing the multi-scale importance scores provided by TIME. AF is a lightweight, plug-and-play module that integrates seamlessly into existing GCD methods with minimal computational overhead. When incorporated into one prominent GCD method, SimGCD, AF achieves up to 15.4% performance improvement over the baseline. The implementation code is provided at https://github.com/Afleve/AFGCD.
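
A minimal sketch of importance-based token pruning in a ViT, using CLS-token attention as a stand-in importance score; the paper's TIME scores are multi-scale, and TAP's pruning rule may differ:

```python
import torch

def prune_tokens(tokens, cls_attn, keep_ratio=0.7):
    """Keep the patch tokens that an attention-based proxy deems informative.

    tokens:   (B, N, D) patch tokens (CLS token excluded).
    cls_attn: (B, N) CLS-token attention weights used as importance scores.
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices         # (B, k) top-scoring tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, D)     # (B, k, D) gather indices
    return tokens.gather(1, idx)                  # background tokens dropped
```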

[151] Adaptive 3D Gaussian Splatting Video Streaming

Han Gong, Qiyue Li, Zhi Liu, Hao Zhou, Peng Yuan Zhou, Zhu Li, Jie Li

Main category: cs.CV

TL;DR: The paper introduces a framework for streaming 3D Gaussian splatting (3DGS) volumetric videos, addressing challenges like large data volume and compression complexity. It uses Gaussian deformation fields, hybrid saliency tiling, and differentiated quality modeling for efficient compression and transmission.

DetailsMotivation: 3DGS videos offer high-quality volumetric representation but pose streaming challenges due to large data size and compression complexity. The paper aims to solve these issues for efficient transmission.

Method: The framework includes a 3DGS video construction method using Gaussian deformation fields, hybrid saliency tiling, and differentiated quality modeling to compress data and adapt to bandwidth changes.

Result: Experimental results show the method outperforms existing approaches in video quality, compression effectiveness, and transmission rate.

Conclusion: The proposed framework successfully addresses 3DGS video streaming challenges, offering superior performance in quality, compression, and transmission.

Abstract: The advent of 3D Gaussian splatting (3DGS) has significantly enhanced the quality of volumetric video representation. However, in contrast to conventional volumetric video, 3DGS video poses significant challenges for streaming due to its substantially larger data volume and the heightened complexity involved in compression and transmission. To address these issues, we introduce an innovative framework for 3DGS volumetric video streaming. Specifically, we design a 3DGS video construction method based on the Gaussian deformation field. By employing hybrid saliency tiling and differentiated quality modeling of 3DGS video, we achieve efficient data compression and adaptation to bandwidth fluctuations while ensuring high transmission quality. Then we build a complete 3DGS video streaming system and validate the transmission performance. Through experimental evaluation, our method demonstrates superiority over existing approaches in various aspects, including video quality, compression effectiveness, and transmission rate.

[152] Hallucination Score: Towards Mitigating Hallucinations in Generative Image Super-Resolution

Weiming Ren, Raghav Goyal, Zhiming Hu, Tristan Ty Aumentado-Armstrong, Iqbal Mohomed, Alex Levinshtein

Main category: cs.CV

TL;DR: The paper addresses hallucinations in generative super-resolution (GSR) models, proposing a Hallucination Score (HS) using a multimodal large language model (MLLM) to measure and mitigate these artifacts.

DetailsMotivation: GSR models produce perceptual artifacts that mismatch low-resolution or ground-truth images, limiting practical use. Existing metrics fail to capture these hallucinations.

Method: A prompt-based MLLM assesses hallucinations, generating an HS. Deep feature distances correlated with HS are used as differentiable rewards to align GSR models.

Result: HS aligns with human evaluations and complements existing SR metrics. Deep feature distances show strong correlation with HS.

Conclusion: The proposed HS and feature-based alignment effectively measure and mitigate hallucinations in GSR, improving perceptual quality and fidelity.

Abstract: Generative super-resolution (GSR) currently sets the state-of-the-art in terms of perceptual image quality, overcoming the “regression-to-the-mean” blur of prior non-generative models. However, from a human perspective, such models do not fully conform to the optimal balance between quality and fidelity. Instead, a different class of artifacts, in which generated details fail to perceptually match the low-resolution image (LRI) or ground-truth image (GTI), is a critical but understudied issue in GSR, limiting its practical deployment. In this work, we focus on measuring, analyzing, and mitigating these artifacts (i.e., “hallucinations”). We observe that hallucinations are not well-characterized by existing image metrics or quality models, as they are orthogonal to both exact fidelity and no-reference quality. Instead, we take advantage of a multimodal large language model (MLLM) by constructing a prompt that assesses hallucinatory visual elements and generates a “Hallucination Score” (HS). We find that our HS is closely aligned with human evaluations, and also provides complementary insights to prior image metrics used for super-resolution (SR) models. In addition, we find certain deep feature distances have strong correlations with HS. We therefore propose to align the GSR models by using such features as differentiable reward functions to mitigate hallucinations.
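
As a rough illustration of using a deep feature distance as a differentiable penalty during GSR fine-tuning; the VGG features here are this sketch's assumption, while the paper identifies its own HS-correlated feature distances:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Frozen feature extractor; its distance acts as a differentiable proxy for
# the MLLM-derived Hallucination Score, which itself is not differentiable.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def hallucination_penalty(sr, gt):
    """Deep feature distance between super-resolved and ground-truth images."""
    return F.mse_loss(vgg(sr), vgg(gt))

# During fine-tuning: total_loss = task_loss + lam * hallucination_penalty(sr, gt)
```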

[153] Adaptive 3D Gaussian Splatting Video Streaming: Visual Saliency-Aware Tiling and Meta-Learning-Based Bitrate Adaptation

Han Gong, Qiyue Li, Jie Li, Zhi Liu

Main category: cs.CV

TL;DR: The paper addresses challenges in 3D Gaussian splatting video streaming by proposing adaptive tiling, quality assessment, and bitrate adaptation solutions, outperforming existing methods.

DetailsMotivation: The need to improve 3DGS video streaming due to unresolved challenges like tiling, quality assessment, and bitrate adaptation.

Method: Proposes adaptive 3DGS tiling with saliency analysis, a quality assessment framework, and a meta-learning-based bitrate algorithm.

Result: The solutions significantly outperform state-of-the-art methods in experiments.

Conclusion: The proposed approaches effectively tackle key challenges in 3DGS video streaming, enhancing performance and quality.

Abstract: 3D Gaussian splatting video (3DGS) streaming has recently emerged as a research hotspot in both academia and industry, owing to its impressive ability to deliver immersive 3D video experiences. However, research in this area is still in its early stages, and several fundamental challenges, such as tiling, quality assessment, and bitrate adaptation, require further investigation. In this paper, we tackle these challenges by proposing a comprehensive set of solutions. Specifically, we propose an adaptive 3DGS tiling technique guided by saliency analysis, which integrates both spatial and temporal features. Each tile is encoded into versions possessing dedicated deformation fields and multiple quality levels for adaptive selection. We also introduce a novel quality assessment framework for 3DGS video that jointly evaluates spatial-domain degradation in 3DGS representations during streaming and the quality of the resulting 2D rendered images. Additionally, we develop a meta-learning-based adaptive bitrate algorithm specifically tailored for 3DGS video streaming, achieving optimal performance across varying network conditions. Extensive experiments demonstrate that our proposed approaches significantly outperform state-of-the-art methods.

[154] DUSTrack: Semi-automated point tracking in ultrasound videos

Praneeth Namburi, Roger Pallarès-López, Jessica Rosendorf, Duarte Folgado, Brian W. Anthony

Main category: cs.CV

TL;DR: DUSTrack combines deep learning and optical flow for robust point tracking in B-mode ultrasound videos, outperforming zero-shot trackers and matching specialized methods.

DetailsMotivation: Accurate tissue motion tracking in B-mode ultrasound is challenging due to speckle noise, low edge contrast, and out-of-plane movement, hindering clinical and research applications.

Method: DUSTrack integrates deep learning with optical flow, includes a GUI for training data generation, and uses optical-flow-based filtering to reduce noise while preserving motion.

Result: DUSTrack achieves superior accuracy compared to zero-shot trackers and matches specialized methods, demonstrated in cardiac, muscle, and fascicle tracking.

Conclusion: DUSTrack is a versatile, open-source tool for tissue motion quantification, promising broad clinical and biomechanical research applications.

Abstract: Ultrasound technology enables safe, non-invasive imaging of dynamic tissue behavior, making it a valuable tool in medicine, biomechanics, and sports science. However, accurately tracking tissue motion in B-mode ultrasound remains challenging due to speckle noise, low edge contrast, and out-of-plane movement. These challenges complicate the task of tracking anatomical landmarks over time, which is essential for quantifying tissue dynamics in many clinical and research applications. This manuscript introduces DUSTrack (Deep learning and optical flow-based toolkit for UltraSound Tracking), a semi-automated framework for tracking arbitrary points in B-mode ultrasound videos. We combine deep learning with optical flow to deliver high-quality and robust tracking across diverse anatomical structures and motion patterns. The toolkit includes a graphical user interface that streamlines the generation of high-quality training data and supports iterative model refinement. It also implements a novel optical-flow-based filtering technique that reduces high-frequency frame-to-frame noise while preserving rapid tissue motion. DUSTrack demonstrates superior accuracy compared to contemporary zero-shot point trackers and performs on par with specialized methods, establishing its potential as a general and foundational tool for clinical and biomechanical research. We demonstrate DUSTrack’s versatility through three use cases: cardiac wall motion tracking in echocardiograms, muscle deformation analysis during reaching tasks, and fascicle tracking during ankle plantarflexion. As an open-source solution, DUSTrack offers a powerful, flexible framework for point tracking to quantify tissue motion from ultrasound videos. DUSTrack is available at https://github.com/praneethnamburi/DUSTrack.
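
A loose sketch of flow-adaptive temporal filtering in the spirit described above: smooth point tracks heavily when tissue is nearly static, lightly during rapid motion. The toolkit's actual filter design is not specified here:

```python
import numpy as np

def flow_adaptive_smooth(tracks, flow_mag, alpha_min=0.2, alpha_max=0.9):
    """tracks: (T, P, 2) tracked point coordinates over T frames;
    flow_mag: (T,) per-frame optical-flow magnitude. Returns smoothed tracks."""
    out = tracks.copy()
    norm = flow_mag / (flow_mag.max() + 1e-8)
    for t in range(1, len(tracks)):
        # Trust the raw measurement more when the flow says tissue is moving,
        # so rapid motion is preserved while static jitter is averaged away.
        alpha = alpha_min + norm[t] * (alpha_max - alpha_min)
        out[t] = alpha * tracks[t] + (1 - alpha) * out[t - 1]
    return out
```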

[155] Light Future: Multimodal Action Frame Prediction via InstructPix2Pix

Zesen Zhong, Duomin Zhang, Yijia Li

Main category: cs.CV

TL;DR: A novel, efficient method for robot action prediction using a lightweight deep learning framework, adapting InstructPix2Pix for future visual frame forecasting with single-image and text inputs.

DetailsMotivation: To enable safer and more intelligent decision-making in robotics and autonomous systems by reducing computational cost and inference latency compared to traditional video prediction models.

Method: Repurposes and fine-tunes the InstructPix2Pix model for multimodal future frame prediction, using a single image and textual instruction as input.

Result: Achieves superior SSIM and PSNR on the RoboTWin dataset, outperforming state-of-the-art baselines with faster inference and lower GPU demands.

Conclusion: The lightweight design offers flexible multimodal control and is particularly valuable for applications prioritizing motion trajectory precision over visual fidelity.

Abstract: Predicting future motion trajectories is a critical capability across domains such as robotics, autonomous systems, and human activity forecasting, enabling safer and more intelligent decision-making. This paper proposes a novel, efficient, and lightweight approach for robot action prediction, offering significantly reduced computational cost and inference latency compared to conventional video prediction models. Importantly, it pioneers the adaptation of the InstructPix2Pix model for forecasting future visual frames in robotic tasks, extending its utility beyond static image editing. We implement a deep learning-based visual prediction framework that forecasts what a robot will observe 100 frames (10 seconds) into the future, given a current image and a textual instruction. We repurpose and fine-tune the InstructPix2Pix model to accept both visual and textual inputs, enabling multimodal future frame prediction. Experiments on the RoboTWin dataset (generated based on real-world scenarios) demonstrate that our method achieves superior SSIM and PSNR compared to state-of-the-art baselines in robot action prediction tasks. Unlike conventional video prediction models that require multiple input frames, heavy computation, and slow inference latency, our approach only needs a single image and a text prompt as input. This lightweight design enables faster inference, reduced GPU demands, and flexible multimodal control, particularly valuable for applications like robotics and sports motion trajectory analytics, where motion trajectory precision is prioritized over visual fidelity.

[156] CRAFT: A Neuro-Symbolic Framework for Visual Functional Affordance Grounding

Zhou Chen, Joe Lin, Sathyanarayanan N. Aakur

Main category: cs.CV

TL;DR: CRAFT is a neuro-symbolic framework for interpretable affordance grounding, combining commonsense priors and visual evidence for transparent, goal-driven decisions.

DetailsMotivation: To improve interpretability and accuracy in identifying objects enabling specific actions in scenes.

Method: Integrates structured commonsense priors (ConceptNet, language models) with visual evidence (CLIP), using an energy-based reasoning loop.

Result: Enhances accuracy and interpretability in multi-object, label-free settings.

Conclusion: CRAFT advances robust and trustworthy scene understanding by grounding symbolic and perceptual structures.

Abstract: We introduce CRAFT, a neuro-symbolic framework for interpretable affordance grounding, which identifies the objects in a scene that enable a given action (e.g., “cut”). CRAFT integrates structured commonsense priors from ConceptNet and language models with visual evidence from CLIP, using an energy-based reasoning loop to refine predictions iteratively. This process yields transparent, goal-driven decisions to ground symbolic and perceptual structures. Experiments in multi-object, label-free settings demonstrate that CRAFT enhances accuracy while improving interpretability, providing a step toward robust and trustworthy scene understanding.

[157] IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark

Zhe Cao, Jin Zhang, Ruiheng Zhang

Main category: cs.CV

TL;DR: IRGPT is a multi-modal large language model for real-world infrared images, leveraging a novel dataset (IR-TD) and a bi-cross-modal curriculum transfer learning strategy to outperform existing methods.

DetailsMotivation: Existing methods rely on synthetic infrared images, which fail to capture real-world infrared characteristics. IRGPT addresses this by using authentic infrared-text pairs.

Method: IRGPT uses a large-scale IR-TD dataset (260K real image-text pairs) and a bi-cross-modal curriculum transfer learning strategy to transfer knowledge from visible to infrared domains.

Result: IRGPT achieves state-of-the-art performance on 9 benchmark tasks, surpassing larger-scale models.

Conclusion: IRGPT demonstrates the effectiveness of authentic data and systematic transfer learning for advancing infrared vision-language models.

Abstract: Real-world infrared imagery presents unique challenges for vision-language models due to the scarcity of aligned text data and domain-specific characteristics. Although existing methods have advanced the field, they rely on synthetic infrared images generated through style transfer from visible images, which limits their ability to capture the unique characteristics of the infrared modality. To address this, we propose IRGPT, the first multi-modal large language model for real-world infrared images, built upon a large-scale InfraRed-Text Dataset (IR-TD) comprising over 260K authentic image-text pairs. The proposed IR-TD dataset contains real infrared images paired with meticulously handcrafted texts, where the initial drafts originated from two complementary processes: (1) LLM-generated descriptions of visible images, and (2) rule-based descriptions of annotations. Furthermore, we introduce a bi-cross-modal curriculum transfer learning strategy that systematically transfers knowledge from visible to infrared domains by considering the difficulty scores of both infrared-visible and infrared-text pairs. Evaluated on a benchmark of 9 tasks (e.g., recognition, grounding), IRGPT achieves state-of-the-art performance even compared with larger-scale models.

[158] GPI-Net: Gestalt-Guided Parallel Interaction Network via Orthogonal Geometric Consistency for Robust Point Cloud Registration

Weikang Gu, Mingyue Han, Li Xue, Heng Dong, Changcai Yang, Riqing Chen, Lifang Wei

Main category: cs.CV

TL;DR: The paper proposes GPI-Net, a Gestalt-guided network for point cloud registration, using orthogonal integration and attention mechanisms to improve correspondence quality.

DetailsMotivation: Handling local and global feature fusion in point cloud registration is challenging due to redundancy and spatial complexity. Gestalt principles offer advantages for such analysis.

Method: GPI-Net uses orthogonal integration to reduce redundancy, a Gestalt Feature Attention block for geometric features, and a Dual-path Multi-Granularity block for local-global integration.

Result: Experiments show GPI-Net outperforms existing methods in point cloud registration tasks.

Conclusion: GPI-Net effectively combines Gestalt principles with advanced attention mechanisms for superior performance in feature-based point cloud registration.

Abstract: The accurate identification of high-quality correspondences is a prerequisite task in feature-based point cloud registration. However, it is extremely challenging to handle the fusion of local and global features due to feature redundancy and complex spatial relationships. Given that Gestalt principles provide key advantages in analyzing local and global relationships, we propose a novel Gestalt-guided Parallel Interaction Network via orthogonal geometric consistency (GPI-Net) in this paper. It utilizes Gestalt principles to facilitate complementary communication between local and global information. Specifically, we introduce an orthogonal integration strategy to optimally reduce redundant information and generate a more compact global structure for high-quality correspondences. To capture geometric features in correspondences, we leverage a Gestalt Feature Attention (GFA) block through a hybrid utilization of self-attention and cross-attention mechanisms. Furthermore, to facilitate the integration of local detail information into the global structure, we design an innovative Dual-path Multi-Granularity parallel interaction aggregation (DMG) block to promote information exchange across different granularities. Extensive experiments on various challenging tasks demonstrate the superior performance of our proposed GPI-Net in comparison to existing methods. The code will be released at https://github.com/gwk/GPI-Net.
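
The quadruplet loss mentioned above, in its standard form; GPI-Net's batch-wise multi-negative variant differs in its details:

```python
import torch.nn.functional as F

def quadruplet_loss(anchor, positive, neg1, neg2, m1=0.5, m2=0.25):
    """Pulls anchor-positive pairs together while pushing the anchor away
    from one negative and the two negatives away from each other."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, neg1)
    d_nn = F.pairwise_distance(neg1, neg2)   # inter-negative divergence term
    return (F.relu(d_ap - d_an + m1) + F.relu(d_ap - d_nn + m2)).mean()
```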

[159] PFB-Diff: Progressive Feature Blending Diffusion for Text-driven Image Editing

Wenjing Huang, Shikui Tu, Lei Xu

Main category: cs.CV

TL;DR: PFB-Diff is a Progressive Feature Blending method for diffusion-based image editing, addressing artifacts in existing approaches by using multi-level feature blending and an attention masking mechanism for better semantic coherence and quality.

DetailsMotivation: Existing diffusion-based local image editing methods produce artifacts due to latent-level blending, lacking semantics for image consistency.

Method: PFB-Diff integrates text-guided content via multi-level feature blending and uses attention masking to confine word impacts to desired regions.

Result: PFB-Diff achieves superior editing accuracy and image quality for tasks like object/background replacement and attribute editing, without fine-tuning.

Conclusion: PFB-Diff offers a robust solution for high-quality, semantically coherent image editing using diffusion models.

Abstract: Diffusion models have demonstrated their ability to generate diverse and high-quality images, sparking considerable interest in their potential for real image editing applications. However, existing diffusion-based approaches for local image editing often suffer from undesired artifacts due to the latent-level blending of the noised target images and diffusion latent variables, which lack the necessary semantics for maintaining image consistency. To address these issues, we propose PFB-Diff, a Progressive Feature Blending method for Diffusion-based image editing. Unlike previous methods, PFB-Diff seamlessly integrates text-guided generated content into the target image through multi-level feature blending. The rich semantics encoded in deep features and the progressive blending scheme from high to low levels ensure semantic coherence and high quality in edited images. Additionally, we introduce an attention masking mechanism in the cross-attention layers to confine the impact of specific words to desired regions, further improving the performance of background editing and multi-object replacement. PFB-Diff can effectively address various editing tasks, including object/background replacement and object attribute editing. Our method demonstrates its superior performance in terms of editing accuracy and image quality without the need for fine-tuning or training. Our implementation is available at https://github.com/CMACH508/PFB-Diff.
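
The core blending operation is simple to state; a sketch of masked feature blending at a single U-Net level, with the mask handling and level schedule simplified:

```python
import torch

def blend_features(edit_feat, orig_feat, mask):
    """edit_feat/orig_feat: (B, C, H, W) features from the text-guided and
    reconstruction denoising passes; mask: (B, 1, H, W) soft region mask
    resized to this feature resolution."""
    return mask * edit_feat + (1.0 - mask) * orig_feat

# Applied progressively from deep (semantic) to shallow (detail) levels,
# keeping edits inside the masked region while the background is preserved.
```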

[160] GEMINUS: Dual-aware Global and Scene-Adaptive Mixture-of-Experts for End-to-End Autonomous Driving

Chi Wan, Yixin Cui, Jiatong Du, Shuo Yang, Yulong Bai, Yanjun Huang

Main category: cs.CV

TL;DR: GEMINUS is a Mixture-of-Experts framework for end-to-end autonomous driving, combining a Global Expert and Scene-Adaptive Experts with a Dual-aware Router for adaptive and robust performance in diverse scenarios.

DetailsMotivation: Single-mode planning methods struggle with diverse driving skills, necessitating a more adaptive approach.

Method: Uses a Global Expert for robustness, Scene-Adaptive Experts for adaptability, and a Dual-aware Router to dynamically activate experts.

Result: Outperforms existing methods in Bench2Drive, achieving top Driving Score and Success Rate with monocular vision.

Conclusion: GEMINUS effectively combines adaptability and robustness, significantly improving over single-expert baselines.

Abstract: End-to-end autonomous driving requires adaptive and robust handling of complex and diverse traffic environments. However, prevalent single-mode planning methods attempt to learn an overall policy while struggling to acquire diversified driving skills to handle diverse scenarios. Therefore, this paper proposes GEMINUS, a Mixture-of-Experts end-to-end autonomous driving framework featuring a Global Expert, a Scene-Adaptive Experts Group, and equipped with a Dual-aware Router. Specifically, the Global Expert is trained on the overall dataset, possessing robust performance. The Scene-Adaptive Experts are trained on corresponding scene subsets, achieving adaptive performance. The Dual-aware Router simultaneously considers scenario-level features and routing uncertainty to dynamically activate expert modules. Through the effective coupling of the Global Expert and the Scene-Adaptive Experts Group via the Dual-aware Router, GEMINUS achieves adaptive and robust performance in diverse scenarios. GEMINUS outperforms existing methods in the Bench2Drive closed-loop benchmark and achieves state-of-the-art performance in Driving Score and Success Rate, even with only monocular vision input. Furthermore, ablation studies demonstrate significant improvements over the original single-expert baseline: 7.67% in Driving Score, 22.06% in Success Rate, and 19.41% in MultiAbility-Mean. The code will be available at https://github.com/newbrains1/GEMINUS.
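
One way such dual-aware routing could be wired up, as a heavily simplified sketch; the real router's architecture and uncertainty measure are not specified here:

```python
import torch
import torch.nn as nn

class DualAwareRouter(nn.Module):
    def __init__(self, feat_dim, n_experts):
        super().__init__()
        self.gate = nn.Linear(feat_dim, n_experts)

    def forward(self, scene_feat, expert_outputs):
        # expert_outputs: (B, E, P) plans; index 0 is the robust Global Expert.
        weights = self.gate(scene_feat).softmax(dim=-1)              # (B, E)
        # Routing uncertainty as gate entropy: lean on the Global Expert
        # when the scene-adaptive routing is uncertain.
        entropy = -(weights * weights.clamp_min(1e-8).log()).sum(-1, keepdim=True)
        alpha = torch.sigmoid(entropy)                               # (B, 1)
        mixed = (weights.unsqueeze(-1) * expert_outputs).sum(dim=1)  # (B, P)
        return alpha * expert_outputs[:, 0] + (1 - alpha) * mixed
```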

[161] VisGuard: Securing Visualization Dissemination through Tamper-Resistant Data Retrieval

Huayuan Ye, Juntong Chen, Shenzhuo Zhang, Yipeng Zhang, Changbo Wang, Chenhui Li

Main category: cs.CV

TL;DR: VisGuard is a tamper-resistant framework for embedding metadata links in visualization images, ensuring recoverability even after tampering, with applications like interactive chart reconstruction and copyright protection.

DetailsMotivation: Current methods for embedding metadata in visualization images are fragile to tampering, leading to loss of critical information.

Method: VisGuard uses repetitive data tiling, invertible information broadcasting, and an anchor-based scheme for crop localization to enhance robustness.

Result: VisGuard shows superior performance in data retrieval accuracy, embedding capacity, and security against tampering and steganalysis.

Conclusion: VisGuard effectively safeguards visualization dissemination and information conveyance by reliably embedding metadata links.

Abstract: The dissemination of visualizations is primarily in the form of raster images, which often results in the loss of critical information such as source code, interactive features, and metadata. While previous methods have proposed embedding metadata into images to facilitate Visualization Image Data Retrieval (VIDR), most existing methods lack practicability since they are fragile to common image tampering during online distribution, such as cropping and editing. To address this issue, we propose VisGuard, a tamper-resistant VIDR framework that reliably embeds a metadata link into visualization images. The embedded data link remains recoverable even after substantial image tampering. We propose several techniques to enhance robustness, including repetitive data tiling, invertible information broadcasting, and an anchor-based scheme for crop localization. VisGuard enables various applications, including interactive chart reconstruction, tampering detection, and copyright protection. Comprehensive experiments demonstrate VisGuard’s superior performance in data retrieval accuracy, embedding capacity, and security against tampering and steganalysis, confirming its competence in facilitating and safeguarding visualization dissemination and information conveyance.

[162] Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations

Jeong Hun Yeo, Minsu Kim, Chae Won Kim, Stavros Petridis, Yong Man Ro

Main category: cs.CV

TL;DR: Zero-AVSR enables speech recognition in new languages without audio-visual data by using AV-Romanizer and LLMs, supported by a multilingual corpus (MARC).

DetailsMotivation: To overcome the lack of audio-visual speech data for many languages, enabling zero-shot recognition.

Method: Introduces AV-Romanizer for language-agnostic representations and cascaded/unified approaches with LLMs, using MARC corpus.

Result: Zero-AVSR can recognize speech in languages not seen during training, expanding language support.

Conclusion: The framework effectively extends AVSR to new languages without requiring language-specific data.

Abstract: We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR) framework, dubbed Zero-AVSR, which enables speech recognition in target languages without requiring any audio-visual speech data in those languages. Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. Then, by leveraging the strong multilingual modeling capabilities of Large Language Models (LLMs), we propose converting the predicted Roman text into language-specific graphemes, forming the proposed Cascaded Zero-AVSR. Taking it a step further, we explore a unified Zero-AVSR approach by directly integrating the audio-visual speech representations encoded by the AV-Romanizer into the LLM. This is achieved through finetuning the adapter and the LLM using our proposed multi-task learning scheme. To capture the wide spectrum of phonetic and linguistic diversity, we also introduce a Multilingual Audio-Visual Romanized Corpus (MARC) consisting of 2,916 hours of audio-visual speech data across 82 languages, along with transcriptions in both language-specific graphemes and Roman text. Extensive analysis and experiments confirm that the proposed Zero-AVSR framework has the potential to expand language support beyond the languages seen during the training of the AV-Romanizer.

[163] OptiCorNet: Optimizing Sequence-Based Context Correlation for Visual Place Recognition

Zhenyu Li, Tianyi Shang, Pengjie Xu, Ruirui Zhang, Fanchen Kong

Main category: cs.CV

TL;DR: OptiCorNet introduces a sequence modeling framework for VPR, combining spatial and temporal features via a differentiable module (DSD), outperforming existing methods in dynamic environments.

DetailsMotivation: Addressing the challenge of VPR in dynamic and perceptually aliased environments by leveraging temporal coherence in image sequences, which existing single-frame methods neglect.

Method: Uses a lightweight 1D convolutional encoder and a learnable DSD module for temporal differencing, refined with LSTM and quadruplet loss for discriminative descriptors.

Result: Outperforms state-of-the-art baselines in challenging conditions like seasonal and viewpoint variations.

Conclusion: OptiCorNet effectively integrates temporal and spatial features for robust VPR, demonstrating superior performance in dynamic environments.

Abstract: Visual Place Recognition (VPR) in dynamic and perceptually aliased environments remains a fundamental challenge for long-term localization. Existing deep learning-based solutions predominantly focus on single-frame embeddings, neglecting the temporal coherence present in image sequences. This paper presents OptiCorNet, a novel sequence modeling framework that unifies spatial feature extraction and temporal differencing into a differentiable, end-to-end trainable module. Central to our approach is a lightweight 1D convolutional encoder combined with a learnable differential temporal operator, termed Differentiable Sequence Delta (DSD), which jointly captures short-term spatial context and long-range temporal transitions. The DSD module models directional differences across sequences via a fixed-weight differencing kernel, followed by an LSTM-based refinement and optional residual projection, yielding compact, discriminative descriptors robust to viewpoint and appearance shifts. To further enhance inter-class separability, we incorporate a quadruplet loss that optimizes both positive alignment and multi-negative divergence within each batch. Unlike prior VPR methods that treat temporal aggregation as post-processing, OptiCorNet learns sequence-level embeddings directly, enabling more effective end-to-end place recognition. Comprehensive evaluations on multiple public benchmarks demonstrate that our approach outperforms state-of-the-art baselines under challenging seasonal and viewpoint variations.
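
A compact sketch of the DSD idea, pairing a fixed differencing kernel with an LSTM refinement; layer sizes and the optional residual projection are omitted:

```python
import torch
import torch.nn as nn

class SequenceDelta(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Fixed-weight temporal differencing: delta_t = f_{t+1} - f_t.
        self.diff = nn.Conv1d(dim, dim, kernel_size=2, groups=dim, bias=False)
        with torch.no_grad():
            self.diff.weight.zero_()
            self.diff.weight[:, 0, 0] = -1.0
            self.diff.weight[:, 0, 1] = 1.0
        self.diff.weight.requires_grad_(False)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, feats):                      # (B, T, D) frame descriptors
        deltas = self.diff(feats.transpose(1, 2)).transpose(1, 2)  # (B, T-1, D)
        _, (h, _) = self.lstm(deltas)              # refine directional differences
        return h[-1]                               # (B, D) sequence descriptor
```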

[164] DFQ-ViT: Data-Free Quantization for Vision Transformers without Fine-tuning

Yujia Tong, Jingling Yuan, Tian Zhang, Jianquan Liu, Chuang Hu

Main category: cs.CV

TL;DR: DFQ-ViT improves data-free quantization for Vision Transformers by enhancing synthetic data quality and aligning activations, outperforming existing methods and matching real-data quantization performance.

DetailsMotivation: Existing DFQ methods for ViTs struggle with balancing global/local features and activation distribution mismatches, leading to performance degradation.

Method: Proposes DFQ-ViT: synthesizes samples by difficulty and uses an activation correction matrix to align quantized and full-precision model activations.

Result: DFQ-ViT outperforms state-of-the-art DFQ methods, e.g., 4.29% higher accuracy for 3-bit DeiT-T, and matches real-data quantization without fine-tuning.

Conclusion: DFQ-ViT reduces computational overhead and deployment barriers, aligning with Green Learning principles for energy-efficient edge device applications.

Abstract: Data-Free Quantization (DFQ) enables the quantization of Vision Transformers (ViTs) without requiring access to data, allowing for the deployment of ViTs on devices with limited resources. In DFQ, the quantization model must be calibrated using synthetic samples, making the quality of these synthetic samples crucial. Existing methods fail to fully capture and balance the global and local features within the samples, resulting in limited synthetic data quality. Moreover, we have found that during inference, there is a significant difference in the distributions of intermediate layer activations between the quantized and full-precision models. These issues lead to a severe performance degradation of the quantized model. To address these problems, we propose a pipeline for Data-Free Quantization for Vision Transformers (DFQ-ViT). Specifically, we synthesize samples in order of increasing difficulty, effectively enhancing the quality of synthetic data. During the calibration and inference stage, we introduce the activation correction matrix for the quantized model to align the intermediate layer activations with those of the full-precision model. Extensive experiments demonstrate that DFQ-ViT achieves remarkable superiority over existing DFQ methods and its performance is on par with models quantized through real data. For example, the performance of DeiT-T with 3-bit weights quantization is 4.29% higher than the state-of-the-art. Our method eliminates the need for fine-tuning, which not only reduces computational overhead but also lowers the deployment barriers for edge devices. This characteristic aligns with the principles of Green Learning by improving energy efficiency and facilitating real-world applications in resource-constrained environments.
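
The activation correction idea can be illustrated with a least-squares fit on calibration samples; this is a sketch, and the paper's correction matrix may be estimated differently:

```python
import torch

def fit_correction(act_q, act_fp):
    """act_q: (N, D) intermediate activations of the quantized model on the
    synthetic calibration set; act_fp: (N, D) full-precision counterparts.
    Returns W such that act_q @ W approximates act_fp."""
    return torch.linalg.lstsq(act_q, act_fp).solution   # (D, D)

# At calibration/inference time, the layer output is corrected as act_q @ W
# to realign the quantized model's activation distribution.
```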

[165] Benefit from Reference: Retrieval-Augmented Cross-modal Point Cloud Completion

Hongye Hou, Liu Zhan, Yang Yang

Main category: cs.CV

TL;DR: A novel retrieval-augmented point cloud completion framework is proposed, leveraging cross-modal retrieval to enhance structural feature learning and generation capabilities.

DetailsMotivation: Existing methods for 3D point cloud completion are limited by their focus on specific input classes, lacking generalization. Cross-modal learning is explored to improve feature learning but remains constrained.

Method: The framework includes a Structural Shared Feature Encoder (SSFE) for cross-modal feature extraction and a Progressive Retrieval-Augmented Generator (PRAG) for hierarchical feature fusion. A dual-channel control gate in SSFE enhances relevant features and suppresses noise.

Result: The method demonstrates effectiveness in generating fine-grained point clouds and shows strong generalization with sparse data and unseen categories.

Conclusion: The proposed framework advances point cloud completion by integrating cross-modal retrieval and hierarchical feature fusion, improving both detail and generalization.

Abstract: Completing the whole 3D structure based on an incomplete point cloud is a challenging task, particularly when the residual point cloud lacks typical structural characteristics. Recent methods based on cross-modal learning attempt to introduce instance images to aid structural feature learning. However, they still focus on each particular input class, limiting their generation abilities. In this work, we propose a novel retrieval-augmented point cloud completion framework. The core idea is to incorporate cross-modal retrieval into the completion task to learn structural prior information from similar reference samples. Specifically, we design a Structural Shared Feature Encoder (SSFE) to jointly extract cross-modal features and reconstruct reference features as priors. A dual-channel control gate in the encoder enhances relevant structural features in the reference sample and suppresses interference from irrelevant information. In addition, we propose a Progressive Retrieval-Augmented Generator (PRAG) that employs a hierarchical feature fusion mechanism to integrate reference prior information with input features from global to local. Through extensive evaluations on multiple datasets and real-world scenes, our method shows its effectiveness in generating fine-grained point clouds, as well as its generalization capability in handling sparse data and unseen categories.
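
A loose sketch of the gating intuition in SSFE: reference features are re-weighted by their agreement with the partial input, enhancing relevant structure and suppressing interference. The actual dual-channel design may differ:

```python
import torch
import torch.nn as nn

class ReferenceGate(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, in_feat, ref_feat):          # both (B, D)
        g = self.gate(torch.cat([in_feat, ref_feat], dim=-1))
        return g * ref_feat                        # g -> 1 enhances, g -> 0 suppresses
```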

[166] Efficient Whole Slide Pathology VQA via Token Compression

Weimin Lyu, Qingqiao Hu, Kehan Qi, Zhan Shi, Wentao Huang, Saumya Gupta, Chao Chen

Main category: cs.CV

TL;DR: TCP-LLaVA introduces token compression for WSI VQA, reducing computational costs while improving accuracy.

DetailsMotivation: Existing methods for WSI analysis lack generative capabilities for VQA or are resource-intensive.

Method: Uses trainable compression tokens to aggregate visual/textual info, inspired by BERT’s [CLS] token.

Result: Outperforms baselines in VQA accuracy and reduces training resource consumption.

Conclusion: TCP-LLaVA is an efficient and effective solution for WSI VQA.

Abstract: Whole-slide images (WSIs) in pathology can reach up to 10,000 x 10,000 pixels, posing significant challenges for multimodal large language models (MLLMs) due to long context length and high computational demands. Previous methods typically focus on patch-level analysis or slide-level classification using CLIP-based models with multi-instance learning, but they lack the generative capabilities needed for visual question answering (VQA). More recent MLLM-based approaches address VQA by feeding thousands of patch tokens directly into the language model, which leads to excessive resource consumption. To address these limitations, we propose Token Compression Pathology LLaVA (TCP-LLaVA), the first MLLM architecture to perform WSI VQA via token compression. TCP-LLaVA introduces a set of trainable compression tokens that aggregate visual and textual information through a modality compression module, inspired by the [CLS] token mechanism in BERT. Only the compressed tokens are forwarded to the LLM for answer generation, significantly reducing input length and computational cost. Experiments on ten TCGA tumor subtypes show that TCP-LLaVA outperforms existing MLLM baselines in VQA accuracy while reducing training resource consumption by a substantial margin.
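
A minimal sketch of compression tokens aggregating a long patch sequence via cross-attention, in the spirit of BERT's [CLS] token; the module's actual design, including how text is injected, may differ:

```python
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    def __init__(self, dim, n_compress=64, n_heads=8):
        super().__init__()
        self.compress = nn.Parameter(torch.randn(1, n_compress, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, patch_tokens):               # (B, N, D); N is huge for WSIs
        q = self.compress.expand(patch_tokens.size(0), -1, -1)
        out, _ = self.attn(q, patch_tokens, patch_tokens)
        return out                                 # (B, n_compress, D) for the LLM
```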

[167] Motion Segmentation and Egomotion Estimation from Event-Based Normal Flow

Zhiyuan Hua, Dehao Yuan, Cornelia Fermüller

Main category: cs.CV

TL;DR: A framework for motion segmentation and egomotion estimation using event-based normal flow for neuromorphic vision sensors, avoiding full optical flow computation.

DetailsMotivation: Traditional methods rely on optical flow or depth estimation, which can be limiting. The goal is to leverage sparse, high-temporal-resolution event data for more efficient and accurate motion analysis.

Method: An optimization-based pipeline involving event over-segmentation, residual analysis for moving objects, and hierarchical clustering with motion similarity and temporal consistency.

Result: Accurate segmentation and translational motion estimation validated on the EVIMO2v2 dataset, with advantages at object boundaries.

Conclusion: The method is promising for scalable, real-time robotic and navigation applications due to its efficiency and accuracy.

Abstract: This paper introduces a robust framework for motion segmentation and egomotion estimation using event-based normal flow, tailored specifically for neuromorphic vision sensors. In contrast to traditional methods that rely heavily on optical flow or explicit depth estimation, our approach exploits the sparse, high-temporal-resolution event data and incorporates geometric constraints between normal flow, scene structure, and inertial measurements. The proposed optimization-based pipeline iteratively performs event over-segmentation, isolates independently moving objects via residual analysis, and refines segmentations using hierarchical clustering informed by motion similarity and temporal consistency. Experimental results on the EVIMO2v2 dataset validate that our method achieves accurate segmentation and translational motion estimation without requiring full optical flow computation. This approach demonstrates significant advantages at object boundaries and offers considerable potential for scalable, real-time robotic and navigation applications.

[168] Advances in Feed-Forward 3D Reconstruction and View Synthesis: A Survey

Jiahui Zhang, Yuelei Li, Anpei Chen, Muyu Xu, Kunhao Liu, Jianyuan Wang, Xiao-Xiao Long, Hanxue Liang, Zexiang Xu, Hao Su, Christian Theobalt, Christian Rupprecht, Andrea Vedaldi, Hanspeter Pfister, Shijian Lu, Fangneng Zhan

Main category: cs.CV

TL;DR: A survey on feed-forward deep learning techniques for 3D reconstruction and view synthesis, covering representations like point clouds, 3DGS, and NeRF, along with applications and future challenges.

DetailsMotivation: Traditional methods for 3D reconstruction and view synthesis are computationally intensive, limiting real-world use. Feed-forward deep learning approaches offer faster, more generalizable solutions.

Method: The paper reviews feed-forward techniques, categorizing them by representation architectures (e.g., point clouds, 3DGS, NeRF) and tasks like pose-free reconstruction and dynamic 3D reconstruction.

Result: The survey highlights advancements in feed-forward methods, their applications in digital humans, SLAM, and robotics, and provides datasets and evaluation protocols.

Conclusion: Feed-forward approaches show promise for advancing 3D vision, but open challenges remain, requiring future research to address them.

Abstract: 3D reconstruction and view synthesis are foundational problems in computer vision, graphics, and immersive technologies such as augmented reality (AR), virtual reality (VR), and digital twins. Traditional methods rely on computationally intensive iterative optimization in a complex chain, limiting their applicability in real-world scenarios. Recent advances in feed-forward approaches, driven by deep learning, have revolutionized this field by enabling fast and generalizable 3D reconstruction and view synthesis. This survey offers a comprehensive review of feed-forward techniques for 3D reconstruction and view synthesis, with a taxonomy according to the underlying representation architectures including point cloud, 3D Gaussian Splatting (3DGS), Neural Radiance Fields (NeRF), etc. We examine key tasks such as pose-free reconstruction, dynamic 3D reconstruction, and 3D-aware image and video synthesis, highlighting their applications in digital humans, SLAM, robotics, and beyond. In addition, we review commonly used datasets with detailed statistics, along with evaluation protocols for various downstream tasks. We conclude by discussing open research challenges and promising directions for future work, emphasizing the potential of feed-forward approaches to advance the state of the art in 3D vision.

[169] DCHM: Depth-Consistent Human Modeling for Multiview Detection

Jiahao Ma, Tianyu Wang, Miaomiao Liu, David Ahmedt-Aristizabal, Chuong Nguyen

Main category: cs.CV

TL;DR: DCHM improves multiview pedestrian detection by ensuring depth consistency and reducing noise in human modeling without relying on costly annotations.

DetailsMotivation: Existing methods for human modeling in multiview pedestrian detection introduce noise and lack precision, often requiring expensive annotations and struggling with generalization.

Method: Proposes Depth-Consistent Human Modeling (DCHM) with superpixel-wise Gaussian Splatting for consistent depth estimation and multiview fusion in global coordinates.

Result: DCHM significantly reduces noise, outperforms state-of-the-art baselines, and is the first to reconstruct pedestrians and perform multiview segmentation in challenging scenarios.

Conclusion: DCHM offers a robust solution for accurate human modeling in multiview pedestrian detection, eliminating the need for human-labeled annotations.

Abstract: Multiview pedestrian detection typically involves two stages: human modeling and pedestrian localization. Human modeling represents pedestrians in 3D space by fusing multiview information, making its quality crucial for detection accuracy. However, existing methods often introduce noise and have low precision. While some approaches reduce noise by fitting on costly multiview 3D annotations, they often struggle to generalize across diverse scenes. To eliminate reliance on human-labeled annotations and accurately model humans, we propose Depth-Consistent Human Modeling (DCHM), a framework designed for consistent depth estimation and multiview fusion in global coordinates. Specifically, our proposed pipeline with superpixel-wise Gaussian Splatting achieves multiview depth consistency in sparse-view, large-scale, and crowded scenarios, producing precise point clouds for pedestrian localization. Extensive validations demonstrate that our method significantly reduces noise during human modeling, outperforming previous state-of-the-art baselines. Additionally, to our knowledge, DCHM is the first to reconstruct pedestrians and perform multiview segmentation in such a challenging setting. Code is available on the project page: https://jiahao-ma.github.io/DCHM/.

[170] ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding

Shuo Cao, Nan Ma, Jiayang Li, Xiaohui Li, Lihao Shao, Kaiwen Zhu, Yu Zhou, Yuandong Pu, Jiarui Wu, Jiaquan Wang, Bo Qu, Wenhai Wang, Yu Qiao, Dajuin Yao, Yihao Liu

Main category: cs.CV

TL;DR: ArtiMuse is a new MLLM-based IAA model offering joint scoring and expert-level understanding, paired with ArtiMuse-10K, a curated dataset of 10,000 images with detailed annotations.

DetailsMotivation: The need for advanced IAA methods that provide both quantitative scoring and professional understanding due to growing applications in education, art, and AIGC.

Method: Development of ArtiMuse, an MLLM-based IAA model, and ArtiMuse-10K, a dataset with expert annotations.

Result: ArtiMuse addresses modality bias and lack of fine-grained analysis in existing IAA methods.

Conclusion: The model and dataset will be made public to advance IAA research.

Abstract: The rapid advancement of educational applications, artistic creation, and AI-generated content (AIGC) technologies has substantially increased practical requirements for comprehensive Image Aesthetics Assessment (IAA), particularly demanding methods capable of delivering both quantitative scoring and professional understanding. Multimodal Large Language Model (MLLM)-based IAA methods demonstrate stronger perceptual and generalization capabilities compared to traditional approaches, yet they suffer from modality bias (score-only or text-only) and lack fine-grained attribute decomposition, thereby failing to support further aesthetic assessment. In this paper, we present: (1) ArtiMuse, an innovative MLLM-based IAA model with Joint Scoring and Expert-Level Understanding capabilities; (2) ArtiMuse-10K, the first expert-curated image aesthetic dataset comprising 10,000 images spanning 5 main categories and 15 subcategories, each annotated by professional experts with 8-dimensional attribute analysis and a holistic score. Both the model and dataset will be made public to advance the field.

[171] Real Time Captioning of Sign Language Gestures in Video Meetings

Sharanya Mukherjee, Md Hishaam Akhtar, Kannadasan R

Main category: cs.CV

TL;DR: A browser extension is proposed to translate sign language to subtitles in video calls, addressing communication barriers for the hearing impaired.

DetailsMotivation: The pandemic increased reliance on video calls, but hearing-impaired individuals prefer signing over typing, highlighting the need for better communication tools.

Method: Uses a large-scale dataset of 2000+ Word-Level ASL videos from over 100 signers to train the system.

Result: The extension aims to provide real-time sign language translation as subtitles during video calls.

Conclusion: The proposed solution bridges communication gaps for the hearing impaired in digital interactions.

Abstract: It has always been a rather tough task to communicate with someone possessing a hearing impairment. One of the most tested ways to establish such communication is through the use of sign-based languages. However, not many people are aware of the smaller intricacies involved in sign language. Sign language recognition using computer vision aims at eliminating the communication barrier between deaf-mute and ordinary people so that they can properly communicate with others. Recently, the pandemic has left the whole world shaken up and has transformed the way we communicate. Video meetings have become essential for everyone, even people with a hearing disability. Recent studies have found that people with hearing disabilities prefer to sign rather than type during these video calls. In this paper, we propose a browser extension that automatically translates sign language to subtitles for everyone else in the video call. A large-scale dataset containing more than 2,000 word-level ASL videos, performed by over 100 signers, will be used.

[172] Multimodal AI for Gastrointestinal Diagnostics: Tackling VQA in MEDVQA-GI 2025

Sujata Gaihre, Amir Thapa Magar, Prasuna Pokharel, Laxmi Tiwari

Main category: cs.CV

TL;DR: The paper presents an approach using the Florence model for visual question answering in gastrointestinal endoscopy, achieving strong results on the KASVIR dataset.

DetailsMotivation: To address the challenge of visual question answering (VQA) in medical endoscopy by leveraging a large multimodal model for accurate and clinically relevant answers.

Method: Adopts the Florence model, a multimodal foundation model, with domain-specific augmentations to enhance training diversity while preserving medical features.

Result: Fine-tuning Florence yields accurate responses on the KASVIR dataset, demonstrating the potential of large multimodal models in medical VQA.

Conclusion: The work establishes a strong baseline for future research on explainability, robustness, and clinical integration in medical VQA.

Abstract: This paper describes our approach to Subtask 1 of the ImageCLEFmed MEDVQA 2025 Challenge, which targets visual question answering (VQA) for gastrointestinal endoscopy. We adopt the Florence model, a large-scale multimodal foundation model, as the backbone of our VQA pipeline, pairing a powerful vision encoder with a text encoder to interpret endoscopic images and produce clinically relevant answers. To improve generalization, we apply domain-specific augmentations that preserve medical features while increasing training diversity. Experiments on the KASVIR dataset show that fine-tuning Florence yields accurate responses on the official challenge metrics. Our results highlight the potential of large multimodal models in medical VQA and provide a strong baseline for future work on explainability, robustness, and clinical integration. The code is publicly available at: https://github.com/TiwariLaxuu/VQA-Florence.git

[173] Grounding Degradations in Natural Language for All-In-One Video Restoration

Muhammad Kamran Janjua, Amirhosein Ghasemabadi, Kunlin Zhang, Mohammad Salameh, Chao Gao, Di Niu

Main category: cs.CV

TL;DR: Proposes an all-in-one video restoration framework using foundation models for interpretable guidance, introduces new benchmarks, and achieves state-of-the-art results.

DetailsMotivation: To create a flexible, interpretable video restoration method without requiring prior degradation knowledge, and to standardize benchmarks in the field.

Method: Uses foundation models to ground degradation-aware semantic context in natural language, learning approximations to avoid extra inference costs.

Result: Achieves state-of-the-art performance on proposed benchmarks, including multi-degradation and time-varying composite degradation settings.

Conclusion: The framework is effective, interpretable, and cost-efficient, with new benchmarks advancing the field of all-in-one video restoration.

Abstract: In this work, we propose an all-in-one video restoration framework that grounds degradation-aware semantic context of video frames in natural language via foundation models, offering interpretable and flexible guidance. Unlike prior art, our method assumes no degradation knowledge at train or test time and learns an approximation to the grounded knowledge such that the foundation model can be safely disentangled during inference, adding no extra cost. Further, we call for standardization of benchmarks in all-in-one video restoration, and propose two benchmarks in the multi-degradation setting, three-task (3D) and four-task (4D), and two time-varying composite degradation benchmarks; one of the latter is our proposed dataset with varying snow intensity, simulating how weather degradations affect videos naturally. We compare our method with prior works and report state-of-the-art performance on all benchmarks.

[174] Synthesizing Images on Perceptual Boundaries of ANNs for Uncovering Human Perceptual Variability on Facial Expressions

Haotian Deng, Chi Zhang, Chen Wei, Quanying Liu

Main category: cs.CV

TL;DR: The study explores how ambiguous facial expression stimuli, confusing for ANNs, also cause perceptual variability in humans, linking ANN decision boundaries to human emotion perception.

DetailsMotivation: To understand the relationship between ANN models and human perceptual variability in emotion categorization, especially for ambiguous stimuli.

Method: Introduced a perceptual boundary sampling method to create ambiguous facial expression stimuli (varEmotion dataset) and conducted large-scale human experiments.

Result: ANN-confusing stimuli increased human perceptual uncertainty, and fine-tuning ANNs with behavioral data aligned predictions with human patterns.

Conclusion: The findings connect ANN decision boundaries to human perceptual variability, advancing personalized emotion modeling.

Abstract: A fundamental challenge in affective cognitive science is to develop models that accurately capture the relationship between external emotional stimuli and human internal experiences. While ANNs have demonstrated remarkable accuracy in facial expression recognition, their ability to model inter-individual differences in human perception remains underexplored. This study investigates the phenomenon of high perceptual variability, where individuals exhibit significant differences in emotion categorization even when viewing the same stimulus. Inspired by the similarity between ANNs and human perception, we hypothesize that facial expression samples that are ambiguous for ANN classifiers also elicit divergent perceptual judgments among human observers. To examine this hypothesis, we introduce a novel perceptual boundary sampling method to generate facial expression stimuli that lie along ANN decision boundaries. These ambiguous samples form the basis of the varEmotion dataset, constructed through large-scale human behavioral experiments. Our analysis reveals that these ANN-confusing stimuli also provoke heightened perceptual uncertainty in human participants, highlighting shared computational principles in emotion perception. Finally, by fine-tuning ANN representations using behavioral data, we achieve alignment between ANN predictions and both group-level and individual-level human perceptual patterns. Our findings establish a systematic link between ANN decision boundaries and human perceptual variability, offering new insights into personalized modeling of emotional interpretation.
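
The boundary-sampling idea reduces to searching for stimuli on which a classifier is maximally ambivalent; a toy sketch follows, where the generator, classifier, and class indices are placeholders and the paper's actual procedure is more involved:

```python
import torch

def boundary_score(classifier, x, cls_a, cls_b):
    """Near zero when the two candidate emotion classes are equiprobable,
    i.e. x sits on the ANN's decision boundary."""
    probs = classifier(x).softmax(dim=-1)
    return (probs[:, cls_a] - probs[:, cls_b]).abs().mean()

# e.g. optimize a face generator's latent code z by gradient descent:
# loss = boundary_score(clf, G(z), happy_idx, sad_idx); loss.backward()
```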

[175] Compress-Align-Detect: onboard change detection from unregistered images

Gabriele Inzerillo, Diego Valsesia, Aniello Fiengo, Enrico Magli

Main category: cs.CV

TL;DR: The paper proposes an onboard satellite framework for real-time change detection using a deep neural network with three submodules: compression, co-registration, and change detection.

DetailsMotivation: To eliminate delays in change detection caused by ground-based processing, enabling real-time applications.

Method: A deep neural network with three interlinked submodules: image compression, lightweight co-registration, and a temporally-invariant change detection model.

Result: Achieves high F1 scores and sustains 0.7 Mpixel/s throughput on low-power hardware.

Conclusion: The framework successfully addresses onboard processing constraints, offering efficient real-time change detection.

Abstract: Change detection from satellite images typically incurs a delay ranging from several hours up to days because of latency in downlinking the acquired images and generating orthorectified image products at the ground stations; this may preclude real- or near real-time applications. To overcome this limitation, we propose shifting the entire change detection workflow onboard satellites. This requires simultaneously solving challenges in data storage, image registration and change detection with a strict complexity constraint. In this paper, we present a novel and efficient framework for onboard change detection that addresses the aforementioned challenges in an end-to-end fashion with a deep neural network composed of three interlinked submodules: (1) image compression, tailored to minimize onboard data storage resources; (2) lightweight co-registration of non-orthorectified multi-temporal image pairs; and (3) a novel temporally-invariant and computationally efficient change detection model. This is the first approach in the literature combining all these tasks in a single end-to-end framework with the constraints dictated by onboard processing. Experimental results compare each submodule with the current state-of-the-art, and evaluate the performance of the overall integrated system in a realistic setting on low-power hardware. Compelling change detection results are obtained in terms of F1 score as a function of compression rate, sustaining a throughput of 0.7 Mpixel/s on a 15W accelerator.

[176] Clutter Detection and Removal by Multi-Objective Analysis for Photographic Guidance

Xiaoran Wu

Main category: cs.CV

TL;DR: A camera guidance system helps users identify and remove clutter in photos using aesthetics evaluation and image inpainting, improving photo quality.

DetailsMotivation: Clutter in photos distracts from intended emotions or stories, especially for amateur photographers who lack experience in decluttering scenes.

Method: The system uses a clutter distinguishment algorithm with aesthetics evaluations and an iterative image inpainting algorithm based on GANs to remove clutter.

Result: User studies show the system helps users identify distractions and take higher quality photos faster.

Conclusion: The system effectively aids in clutter removal and enhances photographic work through interactive guidance and computational tools.

Abstract: Clutter in photos is a distraction preventing photographers from conveying the intended emotions or stories to the audience. Photography amateurs frequently include clutter in their photos due to unconscious negligence or the lack of experience in creating a decluttered, aesthetically appealing scene for shooting. We are thus motivated to develop a camera guidance system that provides solutions and guidance for clutter identification and removal. We estimate and visualize the contribution of objects to the overall aesthetics and content of a photo, based on which users can interactively identify clutter. Suggestions on getting rid of clutter, as well as a tool that removes cluttered objects computationally, are provided to guide users to deal with different kinds of clutter and improve their photographic work. Two technical novelties underpin interactions in our system: a clutter distinguishment algorithm with aesthetics evaluations for objects and an iterative image inpainting algorithm based on generative adversarial nets that reconstructs missing regions of removed objects for high-resolution images. User studies demonstrate that our system provides flexible interfaces and accurate algorithms that allow users to better identify distractions and take higher-quality images in less time.

[177] DWTGS: Rethinking Frequency Regularization for Sparse-view 3D Gaussian Splatting

Hung Nguyen, Runfa Li, An Le, Truong Nguyen

Main category: cs.CV

TL;DR: DWTGS improves sparse-view 3D Gaussian Splatting by using wavelet-space losses for better spatial supervision, outperforming Fourier-based methods.

DetailsMotivation: Overfitting to high-frequency details in sparse training views hampers 3DGS reconstruction quality.

Method: DWTGS leverages wavelet-space losses, supervising low-frequency subbands and enforcing sparsity on high-frequency subbands.

Result: DWTGS outperforms Fourier-based methods, improving generalization and reducing high-frequency hallucinations.

Conclusion: Wavelet-based frequency regularization is more effective than Fourier-based approaches for sparse-view 3DGS.

Abstract: Sparse-view 3D Gaussian Splatting (3DGS) presents significant challenges in reconstructing high-quality novel views, as it often overfits to the widely-varying high-frequency (HF) details of the sparse training views. While frequency regularization can be a promising approach, its typical reliance on Fourier transforms causes difficult parameter tuning and biases towards detrimental HF learning. We propose DWTGS, a framework that rethinks frequency regularization by leveraging wavelet-space losses that provide additional spatial supervision. Specifically, we supervise only the low-frequency (LF) LL subbands at multiple DWT levels, while enforcing sparsity on the HF HH subband in a self-supervised manner. Experiments across benchmarks show that DWTGS consistently outperforms Fourier-based counterparts, as this LF-centric strategy improves generalization and reduces HF hallucinations.
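
As a rough sketch of the loss design described above (assuming a Haar wavelet; the exact DWT levels and weighting are not given in the abstract, so `levels` and `hh_weight` here are hypothetical), one could supervise the LL subbands against the reference view while penalizing the magnitude of the rendered HH band:

```python
import torch
import torch.nn.functional as F

def haar_dwt(x):
    """One-level 2D Haar DWT for x of shape (B, C, H, W) with even H, W.
    Returns (LL, LH, HL, HH) subbands at half resolution."""
    a = x[..., 0::2, 0::2]  # top-left of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    return ((a + b + c + d) / 2, (a + b - c - d) / 2,
            (a - b + c - d) / 2, (a - b - c + d) / 2)

def dwtgs_style_loss(pred, target, levels=2, hh_weight=0.1):
    """Supervise LL subbands against the reference at each DWT level and
    apply a self-supervised L1 sparsity penalty on the predicted HH band."""
    loss_ll = pred.new_zeros(())
    loss_hh = pred.new_zeros(())
    p, t = pred, target
    for _ in range(levels):
        p_ll, _, _, p_hh = haar_dwt(p)
        t_ll, _, _, _ = haar_dwt(t)
        loss_ll = loss_ll + F.l1_loss(p_ll, t_ll)
        loss_hh = loss_hh + p_hh.abs().mean()  # no target needed
        p, t = p_ll, t_ll  # recurse into the low-frequency band
    return loss_ll + hh_weight * loss_hh
```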

[178] Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions

Jintang Xue, Ganning Zhao, Jie-En Yao, Hong-En Chen, Yue Hu, Meida Chen, Suya You, C. -C. Jay Kuo

Main category: cs.CV

TL;DR: Descrip3D introduces a framework using natural language to encode object relationships in 3D scenes, improving performance on tasks like grounding and QA.

DetailsMotivation: Current 3D scene-language models lack relational understanding due to reliance on visual embeddings alone.

Method: Descrip3D enhances objects with textual descriptions and integrates them via embedding fusion and prompt-level injection.

Result: Outperforms baselines on five benchmark datasets, including ScanRefer and ScanQA.

Conclusion: Language-guided relational representation is effective for complex 3D scene understanding.

Abstract: Understanding 3D scenes goes beyond simply recognizing objects; it requires reasoning about the spatial and semantic relationships between them. Current 3D scene-language models often struggle with this relational understanding, particularly when visual embeddings alone do not adequately convey the roles and interactions of objects. In this paper, we introduce Descrip3D, a novel and powerful framework that explicitly encodes the relationships between objects using natural language. Unlike previous methods that rely only on 2D and 3D embeddings, Descrip3D enhances each object with a textual description that captures both its intrinsic attributes and contextual relationships. These relational cues are incorporated into the model through a dual-level integration: embedding fusion and prompt-level injection. This allows for unified reasoning across various tasks such as grounding, captioning, and question answering, all without the need for task-specific heads or additional supervision. When evaluated on five benchmark datasets, including ScanRefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap, Descrip3D consistently outperforms strong baseline models, demonstrating the effectiveness of language-guided relational representation for understanding complex indoor scenes.

[179] LEAD: Exploring Logit Space Evolution for Model Selection

Zixuan Hu, Xiaotong Li, Shixiang Tang, Jun Liu, Yichun Hu, Ling-Yu Duan

Main category: cs.CV

TL;DR: LEAD is a finetuning-aligned method that models transferability of pre-trained models using logits and ODEs, outperforming linear methods by capturing nonlinear optimization dynamics.

DetailsMotivation: The challenge of efficiently selecting suitable pre-trained models for downstream tasks, as existing linear methods fail to capture fine-tuning dynamics accurately.

Method: LEAD uses logits and derives an ODE to model nonlinear optimization. It includes class-aware decomposition for practical applicability.

Result: Outperforms existing methods on 24 pre-trained models across 10 datasets, even in low-data scenarios.

Conclusion: LEAD effectively bridges the optimization gap in model transferability, offering a concise and adaptable solution.

Abstract: The remarkable success of pretrain-then-finetune paradigm has led to a proliferation of available pre-trained models for vision tasks. This surge presents a significant challenge in efficiently choosing the most suitable pre-trained models for downstream tasks. The critical aspect of this challenge lies in effectively predicting the model transferability by considering the underlying fine-tuning dynamics. Existing methods often model fine-tuning dynamics in feature space with linear transformations, which do not precisely align with the fine-tuning objective and fail to grasp the essential nonlinearity from optimization. To this end, we present LEAD, a finetuning-aligned approach based on the network output of logits. LEAD proposes a theoretical framework to model the optimization process and derives an ordinary differential equation (ODE) to depict the nonlinear evolution toward the final logit state. Additionally, we design a class-aware decomposition method to consider the varying evolution dynamics across classes and further ensure practical applicability. Integrating the closely aligned optimization objective and nonlinear modeling capabilities derived from the differential equation, our method offers a concise solution to effectively bridge the optimization gap in a single step, bypassing the lengthy fine-tuning process. The comprehensive experiments on 24 supervised and self-supervised pre-trained models across 10 downstream datasets demonstrate impressive performance and showcase its broad adaptability even in low-data scenarios.

[180] Benchmarking GANs, Diffusion Models, and Flow Matching for T1w-to-T2w MRI Translation

Andrea Moschetto, Lemuel Puglisi, Alec Sargood, Pierluigi Dell’Acqua, Francesco Guarnera, Sebastiano Battiato, Daniele Ravì

Main category: cs.CV

TL;DR: A benchmark study compares GANs, diffusion models, and flow matching for T1w-to-T2w MRI synthesis, finding GAN-based Pix2Pix superior in fidelity, quality, and efficiency.

DetailsMotivation: Reducing MRI scan time and cost by computationally synthesizing missing contrasts from acquired ones.

Method: Comprehensive benchmark of generative models (GANs, diffusion models, flow matching) for 2D MRI image-to-image translation, evaluated on three public datasets.

Result: GAN-based Pix2Pix outperforms diffusion and flow matching methods in structural fidelity, image quality, and computational efficiency.

Conclusion: GANs are currently more practical for MRI synthesis, while flow-based methods may need more data. Findings guide real-world deployment and future research.

Abstract: Magnetic Resonance Imaging (MRI) enables the acquisition of multiple image contrasts, such as T1-weighted (T1w) and T2-weighted (T2w) scans, each offering distinct diagnostic insights. However, acquiring all desired modalities increases scan time and cost, motivating research into computational methods for cross-modal synthesis. To address this, recent approaches aim to synthesize missing MRI contrasts from those already acquired, reducing acquisition time while preserving diagnostic quality. Image-to-image (I2I) translation provides a promising framework for this task. In this paper, we present a comprehensive benchmark of generative models – specifically, Generative Adversarial Networks (GANs), diffusion models, and flow matching (FM) techniques – for T1w-to-T2w 2D MRI I2I translation. All frameworks are implemented with comparable settings and evaluated on three publicly available MRI datasets of healthy adults. Our quantitative and qualitative analyses show that the GAN-based Pix2Pix model outperforms diffusion and FM-based methods in terms of structural fidelity, image quality, and computational efficiency. Consistent with existing literature, these results suggest that flow-based models are prone to overfitting on small datasets and simpler tasks, and may require more data to match or surpass GAN performance. These findings offer practical guidance for deploying I2I translation techniques in real-world MRI workflows and highlight promising directions for future research in cross-modal medical image synthesis. Code and models are publicly available at https://github.com/AndreaMoschetto/medical-I2I-benchmark.

[181] Performance comparison of medical image classification systems using TensorFlow Keras, PyTorch, and JAX

Merjem Bećirović, Amina Kurtović, Nordin Smajlović, Medina Kapo, Amila Akagić

Main category: cs.CV

TL;DR: The paper compares TensorFlow, PyTorch, and JAX for blood cell image classification, analyzing inference time and accuracy.

DetailsMotivation: To address the lack of detailed performance analysis of deep learning frameworks in medical imaging, specifically for blood cell classification.

Method: Comparison of TensorFlow, PyTorch, and JAX using the BloodMNIST dataset, focusing on inference time and classification performance for varying image sizes.

Result: JAX and PyTorch showed comparable accuracy to benchmarks, with performance variations due to image resolution and framework optimizations.

Conclusion: JAX and PyTorch are efficient for medical image classification, with framework choice impacting performance based on specific needs.

Abstract: Medical imaging plays a vital role in early disease diagnosis and monitoring. Specifically, blood microscopy offers valuable insights into blood cell morphology and the detection of hematological disorders. In recent years, deep learning-based automated classification systems have demonstrated high potential in enhancing the accuracy and efficiency of blood image analysis. However, a detailed performance analysis of specific deep learning frameworks appears to be lacking. This paper compares the performance of three popular deep learning frameworks, TensorFlow with Keras, PyTorch, and JAX, in classifying blood cell images from the publicly available BloodMNIST dataset. The study primarily focuses on inference time differences, but also classification performance for different image sizes. The results reveal variations in performance across frameworks, influenced by factors such as image resolution and framework-specific optimizations. Classification accuracy for JAX and PyTorch was comparable to current benchmarks, showcasing the efficiency of these frameworks for medical image classification.
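
When comparing inference time across TensorFlow, PyTorch, and JAX, a fair harness has to absorb one-time costs such as JIT compilation before measuring. A minimal framework-agnostic timing sketch (function names and warmup counts are illustrative assumptions, not the paper's protocol) might look like:

```python
import time
import statistics

def time_inference(predict_fn, batch, warmup=10, iters=100):
    """Average single-batch inference latency after a warmup phase that
    absorbs one-time costs (JIT compilation, cuDNN autotuning, etc.)."""
    for _ in range(warmup):
        predict_fn(batch)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        predict_fn(batch)
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples), statistics.stdev(samples)

# For JAX, predict_fn should call jax.block_until_ready(...) on its output,
# and GPU PyTorch needs torch.cuda.synchronize(); otherwise asynchronous
# execution makes latency look artificially low.
```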

[182] DiSCO-3D : Discovering and segmenting Sub-Concepts from Open-vocabulary queries in NeRF

Doriand Petit, Steve Bourgeois, Vincent Gay-Bellile, Florian Chabot, Loïc Barthe

Main category: cs.CV

TL;DR: DiSCO-3D is a novel method for 3D Open-Vocabulary Sub-concepts Discovery, combining unsupervised segmentation with weak open-vocabulary guidance to adapt to both scene content and user queries.

DetailsMotivation: Traditional methods are limited to either task-specific goals or scene content, lacking adaptability to both. DiSCO-3D aims to bridge this gap.

Method: DiSCO-3D uses Neural Fields representations, integrating unsupervised segmentation with weak open-vocabulary guidance.

Result: The method achieves effective performance in Open-Vocabulary Sub-concepts Discovery and state-of-the-art results in edge cases of open-vocabulary and unsupervised segmentation.

Conclusion: DiSCO-3D successfully addresses the broader problem of 3D semantic segmentation by adapting to both scene content and user queries.

Abstract: 3D semantic segmentation provides high-level scene understanding for applications in robotics, autonomous systems, etc. Traditional methods adapt exclusively to either task-specific goals (open-vocabulary segmentation) or scene content (unsupervised semantic segmentation). We propose DiSCO-3D, the first method addressing the broader problem of 3D Open-Vocabulary Sub-concepts Discovery, which aims to provide a 3D semantic segmentation that adapts to both the scene and user queries. We build DiSCO-3D on Neural Fields representations, combining unsupervised segmentation with weak open-vocabulary guidance. Our evaluations demonstrate that DiSCO-3D achieves effective performance in Open-Vocabulary Sub-concepts Discovery and exhibits state-of-the-art results in the edge cases of both open-vocabulary and unsupervised segmentation.

[183] Exp-Graph: How Connections Learn Facial Attributes in Graph-based Expression Recognition

Nandani Sharma, Dinesh Singh

Main category: cs.CV

TL;DR: Exp-Graph is a graph-based framework for facial expression recognition, using facial landmarks and vision transformers to model structural relationships, achieving high accuracy across datasets.

DetailsMotivation: Facial expression recognition is vital for applications like human-computer interaction and affective computing. Structural information of facial attributes is key to accurate recognition.

Method: Exp-Graph uses facial landmarks as graph vertices and proximity/similarity for edges. It combines vision transformers and graph convolutional networks to capture local and global dependencies.

Result: Achieved accuracies of 98.09%, 79.01%, and 56.39% on Oulu-CASIA, eNTERFACE05, and AFEW datasets, showing strong generalization.

Conclusion: Exp-Graph is effective for practical facial expression recognition, performing well in both controlled and real-world settings.

Abstract: Facial expression recognition is crucial for human-computer interaction applications such as face animation, video surveillance, affective computing, medical analysis, etc. Since the structure of facial attributes varies with facial expressions, incorporating structural information into facial attributes is essential for facial expression recognition. In this paper, we propose Exp-Graph, a novel framework designed to represent the structural relationships among facial attributes using graph-based modeling for facial expression recognition. For facial attributes graph representation, facial landmarks are used as the graph’s vertices. At the same time, the edges are determined based on the proximity of the facial landmark and the similarity of the local appearance of the facial attributes encoded using the vision transformer. Additionally, graph convolutional networks are utilized to capture and integrate these structural dependencies into the encoding of facial attributes, thereby enhancing the accuracy of expression recognition. Thus, Exp-Graph learns highly expressive semantic representations from the facial attribute graphs. On the other hand, the vision transformer and graph convolutional blocks help the framework exploit the local and global dependencies among the facial attributes that are essential for the recognition of facial expressions. We conducted comprehensive evaluations of the proposed Exp-Graph model on three benchmark datasets: Oulu-CASIA, eNTERFACE05, and AFEW. The model achieved recognition accuracies of 98.09%, 79.01%, and 56.39%, respectively. These results indicate that Exp-Graph maintains strong generalization capabilities across both controlled laboratory settings and real-world, unconstrained environments, underscoring its effectiveness for practical facial expression recognition applications.
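
As a minimal illustration of the proximity-based edges described above (the appearance-similarity edges from ViT features would be built analogously; the choice of k is a hypothetical one), a k-nearest-neighbor adjacency over landmarks can be computed as:

```python
import numpy as np

def landmark_graph(landmarks: np.ndarray, k: int = 5) -> np.ndarray:
    """Adjacency matrix over facial landmarks: each landmark is connected
    to its k nearest neighbors (proximity edges), then symmetrized.
    landmarks: (N, 2) array of (x, y) points."""
    d = np.linalg.norm(landmarks[:, None] - landmarks[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # no self-loops
    adj = np.zeros_like(d)
    idx = np.argsort(d, axis=1)[:, :k]     # k closest landmarks per row
    rows = np.arange(len(landmarks))[:, None]
    adj[rows, idx] = 1.0
    return np.maximum(adj, adj.T)          # undirected graph
```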

[184] Depthwise-Dilated Convolutional Adapters for Medical Object Tracking and Segmentation Using the Segment Anything Model 2

Guoping Xu, Christopher Kabat, You Zhang

Main category: cs.CV

TL;DR: DD-SAM2 is an efficient adaptation framework for SAM2, enhancing medical video segmentation with minimal parameter overhead, achieving high Dice scores.

DetailsMotivation: Existing medical image segmentation methods lack adaptability to dynamic scenarios and require large datasets for retraining, leading to high costs and forgetting risks.

Method: Proposes DD-SAM2 with a Depthwise-Dilated Adapter (DD-Adapter) for multi-scale feature extraction, enabling fine-tuning on limited medical video data.

Result: Achieves Dice scores of 0.93 (TrackRad2025) and 0.97 (EchoNet-Dynamic), outperforming existing methods.

Conclusion: DD-SAM2 is a pioneering adapter-based fine-tuning solution for SAM2 in medical video segmentation and tracking, with public code and datasets.

Abstract: Recent advances in medical image segmentation have been driven by deep learning; however, most existing methods remain limited by modality-specific designs and exhibit poor adaptability to dynamic medical imaging scenarios. The Segment Anything Model 2 (SAM2) and its related variants, which introduce a streaming memory mechanism for real-time video segmentation, present new opportunities for prompt-based, generalizable solutions. Nevertheless, adapting these models to medical video scenarios typically requires large-scale datasets for retraining or transfer learning, leading to high computational costs and the risk of catastrophic forgetting. To address these challenges, we propose DD-SAM2, an efficient adaptation framework for SAM2 that incorporates a Depthwise-Dilated Adapter (DD-Adapter) to enhance multi-scale feature extraction with minimal parameter overhead. This design enables effective fine-tuning of SAM2 on medical videos with limited training data. Unlike existing adapter-based methods focused solely on static images, DD-SAM2 fully exploits SAM2’s streaming memory for medical video object tracking and segmentation. Comprehensive evaluations on TrackRad2025 (tumor segmentation) and EchoNet-Dynamic (left ventricle tracking) datasets demonstrate superior performance, achieving Dice scores of 0.93 and 0.97, respectively. To the best of our knowledge, this work provides an initial attempt at systematically exploring adapter-based SAM2 fine-tuning for medical video segmentation and tracking. Code, datasets, and models will be publicly available at https://github.com/apple1986/DD-SAM2.
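
The exact DD-Adapter design is not given in the abstract, but a plausible sketch of a depthwise-dilated bottleneck adapter, with parallel dilation rates for multi-scale context and a residual connection so the adapter refines rather than replaces SAM2 features, is:

```python
import torch
import torch.nn as nn

class DepthwiseDilatedAdapter(nn.Module):
    """Bottleneck adapter with parallel depthwise convolutions at several
    dilation rates, adding multi-scale context with few extra parameters.
    Reduction ratio and dilation rates here are illustrative assumptions."""
    def __init__(self, dim: int, reduction: int = 4, dilations=(1, 2, 4)):
        super().__init__()
        hidden = dim // reduction
        self.down = nn.Conv2d(dim, hidden, 1)       # channel reduction
        self.branches = nn.ModuleList([
            nn.Conv2d(hidden, hidden, 3, padding=d, dilation=d, groups=hidden)
            for d in dilations                       # depthwise, multi-scale
        ])
        self.up = nn.Conv2d(hidden, dim, 1)          # channel restoration
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.down(x))
        h = sum(b(h) for b in self.branches) / len(self.branches)
        return x + self.up(self.act(h))              # residual refinement
```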

[185] BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM

Haiquan Wen, Tianxiao Li, Zhenglin Huang, Yiwei He, Guangliang Cheng

Main category: cs.CV

TL;DR: BusterX++ is a cross-modal framework for detecting and explaining synthetic media, using reinforcement learning and hybrid reasoning to outperform single-modality methods. GenBuster++ is a benchmark for evaluating such systems.

DetailsMotivation: The rise of generative AI has increased misinformation risks through sophisticated fake content, but current detection systems are limited by single-modality designs.

Method: BusterX++ employs reinforcement learning post-training (Multi-stage Training, Thinking Reward, Hybrid Reasoning) for cross-modal detection. GenBuster++ provides a benchmark with 4,000 curated images/videos.

Result: BusterX++ achieves stable and substantial performance improvements, validated by extensive experiments.

Conclusion: The framework and benchmark address limitations of single-modality detection, offering enhanced generalizability and effectiveness against synthetic media.

Abstract: Recent advances in generative AI have dramatically improved image and video synthesis capabilities, significantly increasing the risk of misinformation through sophisticated fake content. In response, detection methods have evolved from traditional approaches to multimodal large language models (MLLMs), offering enhanced transparency and interpretability in identifying synthetic media. However, current detection systems remain fundamentally limited by their single-modality design. These approaches analyze images or videos separately, making them ineffective against synthetic content that combines multiple media formats. To address these challenges, we introduce BusterX++, a novel framework designed specifically for cross-modal detection and explanation of synthetic media. Our approach incorporates an advanced reinforcement learning (RL) post-training strategy that eliminates the cold-start phase. Through Multi-stage Training, Thinking Reward, and Hybrid Reasoning, BusterX++ achieves stable and substantial performance improvements. To enable comprehensive evaluation, we also present GenBuster++, a cross-modal benchmark leveraging state-of-the-art image and video generation techniques. This benchmark comprises 4,000 images and video clips, meticulously curated by human experts using a novel filtering methodology to ensure high quality, diversity, and real-world applicability. Extensive experiments demonstrate the effectiveness and generalizability of our approach.

[186] Multispectral State-Space Feature Fusion: Bridging Shared and Cross-Parametric Interactions for Object Detection

Jifeng Shen, Haibo Zhan, Shaohua Dong, Xin Zuo, Wankou Yang, Haibin Ling

Main category: cs.CV

TL;DR: MS2Fusion proposes a dual-path SSM-based framework for multispectral feature fusion, addressing limitations in local feature preference and computational complexity, achieving superior performance in object detection and other tasks.

DetailsMotivation: Current multispectral feature fusion methods overly focus on local complementary features and struggle with the trade-off between receptive field size and computational complexity, limiting generalization and scalability.

Method: MS2Fusion uses a state space model (SSM) with dual-path parametric interaction: one for cross-modal complementary features and another for shared semantics, jointly optimized in a unified framework.

Result: MS2Fusion outperforms state-of-the-art methods on benchmarks like FLIR, M3FD, and LLVIP, and shows strong performance in RGB-T semantic segmentation and RGBT salient object detection.

Conclusion: MS2Fusion effectively balances complementary and shared features, offering a scalable and generalizable solution for multispectral perception tasks.

Abstract: Modern multispectral feature fusion for object detection faces two critical limitations: (1) Excessive preference for local complementary features over cross-modal shared semantics adversely affects generalization performance; and (2) The trade-off between the receptive field size and computational complexity presents critical bottlenecks for scalable feature modeling. Addressing these issues, a novel Multispectral State-Space Feature Fusion framework, dubbed MS2Fusion, is proposed based on the state space model (SSM), achieving efficient and effective fusion through a dual-path parametric interaction mechanism. More specifically, the first cross-parameter interaction branch inherits the advantage of cross-attention in mining complementary information with cross-modal hidden state decoding in SSM. The second shared-parameter branch explores cross-modal alignment with joint embedding to obtain cross-modal similar semantic features and structures through parameter sharing in SSM. Finally, these two paths are jointly optimized with SSM for fusing multispectral features in a unified framework, allowing our MS2Fusion to enjoy both functional complementarity and shared semantic space. In our extensive experiments on mainstream benchmarks including FLIR, M3FD and LLVIP, our MS2Fusion significantly outperforms other state-of-the-art multispectral object detection methods, evidencing its superiority. Moreover, MS2Fusion is general and applicable to other multispectral perception tasks. We show that, even without specific design, MS2Fusion achieves state-of-the-art results on RGB-T semantic segmentation and RGBT salient object detection, showing its generality. The source code will be available at https://github.com/61s61min/MS2Fusion.git.

[187] AI-Powered Precision in Sport Taekwondo: Enhancing Fairness, Speed, and Trust in Competition (FST.ai)

Keivan Shariatmadar, Ahmad Osman

Main category: cs.CV

TL;DR: FST.ai is an AI framework for real-time head kick detection in Taekwondo, reducing decision latency and improving fairness. It uses computer vision and deep learning, with potential applications in other sports.

DetailsMotivation: Traditional sports officiating suffers from latency, subjectivity, and inconsistency, undermining fairness. AI can automate and improve decision-making.

Method: The framework uses computer vision, deep learning, and edge inference for pose estimation, motion classification, and impact analysis.

Result: FST.ai reduces decision time to seconds, enhances consistency, and is adaptable to other sports like judo, football, and basketball.

Conclusion: FST.ai demonstrates robustness and scalability, offering a transformative solution for officiating in Taekwondo and beyond.

Abstract: The integration of Artificial Intelligence (AI) into sports officiating represents a paradigm shift in how decisions are made in competitive environments. Traditional manual systems, even when supported by Instant Video Replay (IVR), often suffer from latency, subjectivity, and inconsistent enforcement, undermining fairness and athlete trust. This paper introduces FST.ai, a novel AI-powered framework designed to enhance officiating in Sport Taekwondo, particularly focusing on the complex task of real-time head kick detection and scoring. Leveraging computer vision, deep learning, and edge inference, the system automates the identification and classification of key actions, significantly reducing decision time from minutes to seconds while improving consistency and transparency. Importantly, the methodology is not limited to Taekwondo. The underlying framework – based on pose estimation, motion classification, and impact analysis – can be adapted to a wide range of sports requiring action detection, such as judo, karate, fencing, or even team sports like football and basketball, where foul recognition or performance tracking is critical. By addressing one of Taekwondo’s most challenging scenarios – head kick scoring – we demonstrate the robustness, scalability, and sport-agnostic potential of FST.ai to transform officiating standards across multiple disciplines.

[188] Artificial Intelligence in the Food Industry: Food Waste Estimation based on Computer Vision, a Brief Case Study in a University Dining Hall

Shayan Rokhva, Babak Teimourpour

Main category: cs.CV

TL;DR: A cost-effective computer vision framework estimates plate-level food waste using semantic segmentation of RGB images, achieving strong performance with supervised models, though limitations exist.

DetailsMotivation: Quantifying food waste in institutional dining is crucial for data-driven sustainability strategies.

Method: Utilizes semantic segmentation of RGB images with four supervised models (U-Net, U-Net++, and lightweight variants), trained using dynamic inverse-frequency loss and AdamW optimizer, evaluated with metrics like Pixel Accuracy, Dice, IoU, and DPA.

Result: All models performed well, with at least one model approaching or surpassing 90% DPA for each food type; lighter models enabled real-time inference. Segmentation worked best for dry/rigid foods (e.g., rice) but struggled with complex/viscous dishes (e.g., stews).

Conclusion: The framework is scalable and pioneering for automated food waste monitoring, offering actionable insights for reducing institutional waste despite current limitations.

Abstract: Quantifying post-consumer food waste in institutional dining settings is essential for supporting data-driven sustainability strategies. This study presents a cost-effective computer vision framework that estimates plate-level food waste by utilizing semantic segmentation of RGB images taken before and after meal consumption across five Iranian dishes. Four fully supervised models (U-Net, U-Net++, and their lightweight variants) were trained using a capped dynamic inverse-frequency loss and AdamW optimizer, then evaluated through a comprehensive set of metrics, including Pixel Accuracy, Dice, IoU, and a custom-defined Distributional Pixel Agreement (DPA) metric tailored to the task. All models achieved satisfactory performance, and for each food type, at least one model approached or surpassed 90% DPA, demonstrating strong alignment in pixel-wise proportion estimates. Lighter models with reduced parameter counts offered faster inference, achieving real-time throughput on an NVIDIA T4 GPU. Further analysis showed superior segmentation performance for dry and more rigid components (e.g., rice and fries), while more complex, fragmented, or viscous dishes, such as stews, showed reduced performance, specifically post-consumption. Despite limitations such as reliance on 2D imaging, constrained food variety, and manual data collection, the proposed framework is pioneering and represents a scalable, contactless solution for continuous monitoring of food consumption. This research lays foundational groundwork for automated, real-time waste tracking systems in large-scale food service environments and offers actionable insights and outlines feasible future directions for dining hall management and policymakers aiming to reduce institutional food waste.
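
A capped dynamic inverse-frequency loss of the kind described can be sketched as per-batch class weights inversely proportional to pixel frequency and clipped at a cap; the cap value and mean normalization below are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def capped_inv_freq_weights(mask, num_classes, cap=10.0, eps=1e-6):
    """Per-batch class weights inversely proportional to pixel frequency,
    normalized to mean 1 and capped so rare classes cannot dominate.
    mask: (B, H, W) tensor of integer class labels."""
    counts = torch.bincount(mask.flatten(), minlength=num_classes).float()
    freq = counts / counts.sum().clamp(min=1)
    w = 1.0 / (freq + eps)
    return torch.clamp(w / w.mean(), max=cap)

# Inside a training step, with logits of shape (B, C, H, W):
# w = capped_inv_freq_weights(mask, num_classes=logits.size(1))
# loss = F.cross_entropy(logits, mask, weight=w)
```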

[189] Gene-DML: Dual-Pathway Multi-Level Discrimination for Gene Expression Prediction from Histopathology Images

Yaxuan Song, Jianan Fan, Hang Chang, Weidong Cai

Main category: cs.CV

TL;DR: Gene-DML improves gene expression prediction from histopathology images by aligning cross-modal representations at multiple levels, outperforming existing methods.

DetailsMotivation: Existing methods fail to fully utilize cross-modal alignment between histopathology images and gene expression, limiting prediction performance.

Method: Gene-DML uses Dual-pathway Multi-Level discrimination: multi-scale instance-level and cross-level instance-group discrimination to align modalities.

Result: Gene-DML achieves state-of-the-art performance in gene expression prediction on public datasets.

Conclusion: Gene-DML enhances predictive accuracy and generalization by robust cross-modal representation learning.

Abstract: Accurately predicting gene expression from histopathology images offers a scalable and non-invasive approach to molecular profiling, with significant implications for precision medicine and computational pathology. However, existing methods often underutilize the cross-modal representation alignment between histopathology images and gene expression profiles across multiple representational levels, thereby limiting their prediction performance. To address this, we propose Gene-DML, a unified framework that structures latent space through Dual-pathway Multi-Level discrimination to enhance correspondence between morphological and transcriptional modalities. The multi-scale instance-level discrimination pathway aligns hierarchical histopathology representations extracted at local, neighbor, and global levels with gene expression profiles, capturing scale-aware morphological-transcriptional relationships. In parallel, the cross-level instance-group discrimination pathway enforces structural consistency between individual (image/gene) instances and modality-crossed (gene/image, respectively) groups, strengthening the alignment across modalities. By jointly modelling fine-grained and structural-level discrimination, Gene-DML is able to learn robust cross-modal representations, enhancing both predictive accuracy and generalization across diverse biological contexts. Extensive experiments on public spatial transcriptomics datasets demonstrate that Gene-DML achieves state-of-the-art performance in gene expression prediction. The code and checkpoints will be released soon.
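
The abstract does not specify the discrimination losses, but instance-level cross-modal alignment of this kind is commonly implemented with a symmetric InfoNCE objective; the following is a generic stand-in, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def infonce_align(img_emb, gene_emb, tau=0.07):
    """Symmetric InfoNCE: matched image/gene-expression pairs (same row)
    attract while all other pairings in the batch repel."""
    img = F.normalize(img_emb, dim=-1)
    gene = F.normalize(gene_emb, dim=-1)
    logits = img @ gene.t() / tau                    # (B, B) similarity
    targets = torch.arange(len(img), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```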

[190] Docopilot: Improving Multimodal Models for Document-Level Understanding

Yuchen Duan, Zhe Chen, Yusong Hu, Weiyun Wang, Shenglong Ye, Botian Shi, Lewei Lu, Qibin Hou, Tong Lu, Hongsheng Li, Jifeng Dai, Wenhai Wang

Main category: cs.CV

TL;DR: The paper introduces Doc-750K, a high-quality dataset for multimodal document understanding, and Docopilot, a native multimodal model that outperforms RAG methods in coherence, accuracy, and efficiency.

DetailsMotivation: Current MLLMs struggle with complex document comprehension due to lack of quality datasets and RAG methods' limitations like fragmented contexts and error accumulation.

Method: Developed Doc-750K dataset with diverse structures and cross-page dependencies, then built Docopilot, a native multimodal model avoiding RAG.

Result: Docopilot excels in document understanding tasks and multi-turn interactions, setting a new benchmark.

Conclusion: The work advances document-level multimodal understanding with a robust dataset and model, outperforming existing methods.

Abstract: Despite significant progress in multimodal large language models (MLLMs), their performance on complex, multi-page document comprehension remains inadequate, largely due to the lack of high-quality, document-level datasets. While current retrieval-augmented generation (RAG) methods offer partial solutions, they suffer from issues, such as fragmented retrieval contexts, multi-stage error accumulation, and extra time costs of retrieval. In this work, we present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG. Experiments demonstrate that Docopilot achieves superior coherence, accuracy, and efficiency in document understanding tasks and multi-turn interactions, setting a new baseline for document-level multimodal understanding. Data, code, and models are released at https://github.com/OpenGVLab/Docopilot

[191] Resource-Efficient Affordance Grounding with Complementary Depth and Semantic Prompts

Yizhou Huang, Fan Yang, Guoliang Zhu, Gen Li, Hao Shi, Yukun Zuo, Wenrui Chen, Zhiyong Li, Kailun Yang

Main category: cs.CV

TL;DR: The paper proposes BiT-Align, a framework for multimodal affordance mapping, improving accuracy and reducing model parameters.

DetailsMotivation: Existing methods for affordance perception in robots are limited by structural simplicity, basic fusion techniques, and large model sizes, hindering practical deployment.

Method: Introduces BiT-Align with BPM for integrating depth images as prompts and TFG for attention selection guided by text features.

Result: Achieves 6.0% better KLD on AGD20K and reduces parameters by 88.8%.

Conclusion: BiT-Align offers practical improvements in affordance mapping, with significant performance gains and efficiency.

Abstract: Affordance refers to the functional properties that an agent perceives and utilizes from its environment, and is key perceptual information required for robots to perform actions. This information is rich and multimodal in nature. Existing multimodal affordance methods face limitations in extracting useful information, mainly due to simple structural designs, basic fusion methods, and large model parameters, making it difficult to meet the performance requirements for practical deployment. To address these issues, this paper proposes the BiT-Align image-depth-text affordance mapping framework. The framework includes a Bypass Prompt Module (BPM) and a Text Feature Guidance (TFG) attention selection mechanism. BPM integrates the auxiliary modality depth image directly as a prompt to the primary modality RGB image, embedding it into the primary modality encoder without introducing additional encoders. This reduces the model’s parameter count and effectively improves functional region localization accuracy. The TFG mechanism guides the selection and enhancement of attention heads in the image encoder using textual features, improving the understanding of affordance characteristics. Experimental results demonstrate that the proposed method achieves significant performance improvements on public AGD20K and HICO-IIF datasets. On the AGD20K dataset, compared with the current state-of-the-art method, we achieve a 6.0% improvement in the KLD metric, while reducing model parameters by 88.8%, demonstrating practical application value. The source code will be made publicly available at https://github.com/DAWDSE/BiT-Align.

[192] WSI-Agents: A Collaborative Multi-Agent System for Multi-Modal Whole Slide Image Analysis

Xinheng Lyu, Yuci Liang, Wenting Chen, Meidan Ding, Jiaqi Yang, Guolin Huang, Daokun Zhang, Xiangjian He, Linlin Shen

Main category: cs.CV

TL;DR: WSI-Agents is a collaborative multi-agent system for multi-modal WSI analysis, improving accuracy and versatility through task allocation, verification, and summarization.

DetailsMotivation: Current MLLMs underperform in WSI analysis compared to task-specific models, and multi-agent systems' potential in pathology is underexplored.

Method: WSI-Agents uses specialized agents with task allocation, verification (internal checks and external validation), and summarization modules.

Result: Outperforms current WSI MLLMs and medical agent frameworks in multi-modal WSI benchmarks.

Conclusion: WSI-Agents effectively balances accuracy and versatility in WSI analysis.

Abstract: Whole slide images (WSIs) are vital in digital pathology, enabling gigapixel tissue analysis across various pathological tasks. While recent advancements in multi-modal large language models (MLLMs) allow multi-task WSI analysis through natural language, they often underperform compared to task-specific models. Collaborative multi-agent systems have emerged as a promising solution to balance versatility and accuracy in healthcare, yet their potential remains underexplored in pathology-specific domains. To address these issues, we propose WSI-Agents, a novel collaborative multi-agent system for multi-modal WSI analysis. WSI-Agents integrates specialized functional agents with robust task allocation and verification mechanisms to enhance both task-specific accuracy and multi-task versatility through three components: (1) a task allocation module assigning tasks to expert agents using a model zoo of patch and WSI level MLLMs, (2) a verification mechanism ensuring accuracy through internal consistency checks and external validation using pathology knowledge bases and domain-specific models, and (3) a summary module synthesizing the final summary with visual interpretation maps. Extensive experiments on multi-modal WSI benchmarks show WSI-Agents’ superiority to current WSI MLLMs and medical agent frameworks across diverse tasks.

[193] EgoEvGesture: Gesture Recognition Based on Egocentric Event Camera

Luming Wang, Hao Shi, Xiaoting Yin, Kailun Yang, Kaiwei Wang, Jian Bai

Main category: cs.CV

TL;DR: Proposes a novel network for egocentric gesture recognition using event cameras, addressing challenges like head movement noise and sparse event fusion, achieving higher accuracy with fewer parameters.

DetailsMotivation: Traditional RGB-based gesture recognition struggles with motion blur and illumination variations. Event cameras offer advantages but require new architectures to handle asynchronous data and head movement noise.

Method: Introduces a lightweight CNN with asymmetric depthwise convolutions, a state-space model to decouple head movement noise, and a Bins-Temporal Shift Module (BTSM) for efficient event fusion.

Result: Achieves 62.7% accuracy on unseen subjects (7M parameters) and 97.0% on DVS128 Gesture, outperforming state-of-the-art methods.

Conclusion: The proposed method effectively addresses challenges in egocentric gesture recognition with event cameras, demonstrating superior performance and generalization.

Abstract: Egocentric gesture recognition is a pivotal technology for enhancing natural human-computer interaction, yet traditional RGB-based solutions suffer from motion blur and illumination variations in dynamic scenarios. While event cameras show distinct advantages in handling high dynamic range with ultra-low power consumption, existing RGB-based architectures face inherent limitations in processing asynchronous event streams due to their synchronous frame-based nature. Moreover, from an egocentric perspective, event cameras record data that includes events generated by both head movements and hand gestures, thereby increasing the complexity of gesture recognition. To address this, we propose a novel network architecture specifically designed for event data processing, incorporating (1) a lightweight CNN with asymmetric depthwise convolutions to reduce parameters while preserving spatiotemporal features, (2) a plug-and-play state-space model as context block that decouples head movement noise from gesture dynamics, and (3) a parameter-free Bins-Temporal Shift Module (BTSM) that shifts features along bins and temporal dimensions to fuse sparse events efficiently. We further establish the EgoEvGesture dataset, the first large-scale dataset for egocentric gesture recognition using event cameras. Experimental results demonstrate that our method achieves 62.7% accuracy tested on unseen subjects with only 7M parameters, 3.1% higher than state-of-the-art approaches. Notable misclassifications in freestyle motions stem from high inter-personal variability and unseen test patterns differing from training data. Moreover, our approach achieved a remarkable accuracy of 97.0% on the DVS128 Gesture, demonstrating the effectiveness and generalization capability of our method on public datasets. The dataset and models are made available at https://github.com/3190105222/EgoEv_Gesture.
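
The precise BTSM design is not spelled out in the abstract; as a heavily hedged sketch of the general idea, a parameter-free module can roll one slice of channels along the temporal axis and another along the event-bin axis, so later layers mix sparse event features across both dimensions at zero parameter cost (the tensor layout and shift fraction below are assumptions):

```python
import torch

def bins_temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """Parameter-free shift over per-bin event features.
    x: (B, T, bins, C, H, W); one C/shift_div slice is shifted forward in
    time, another along the bin axis; remaining channels are untouched."""
    out = x.clone()
    c = x.size(3) // shift_div
    out[:, 1:, :, :c] = x[:, :-1, :, :c]          # shift forward in time
    out[:, :, 1:, c:2 * c] = x[:, :, :-1, c:2 * c]  # shift along event bins
    return out
```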

[194] From Semantics, Scene to Instance-awareness: Distilling Foundation Model for Open-vocabulary Situation Recognition

Chen Cai, Tianyi Liu, Jianjun Gao, Wenyang Liu, Kejun Wu, Ruoyu Wang, Yi Wang, Soo Chin Liew

Main category: cs.CV

TL;DR: The paper introduces Open-vocabulary Grounded Situation Recognition (Ov-GSR) and proposes Multimodal Interactive Prompt Distillation (MIPD) to transfer knowledge from a teacher MLLM to a smaller GSR model, enhancing generalization and zero-shot abilities.

DetailsMotivation: Current MLLMs struggle with complex GSR tasks and are resource-heavy, while conventional GSR models lack generalization for unseen and rare situations.

Method: MIPD distills multimodal knowledge from a teacher MLLM using LLM-based rationales and scene-aware prompts, aligning them via Negative-Guided Multimodal Prompting Alignment (NMPA).

Result: MIPD achieves superior performance on seen, rare, and unseen situations in the Ov-SWiG dataset and improves unseen detection in HICO-DET.

Conclusion: MIPD effectively enhances GSR models’ generalization and zero-shot abilities, bridging gaps between seen and unseen scenarios and mitigating bias in rare cases.

Abstract: Recent Multimodal Large Language Models (MLLMs) exhibit strong zero-shot abilities but struggle with complex Grounded Situation Recognition (GSR) and are resource-intensive for edge device deployment. Meanwhile, conventional GSR models often lack generalization ability, falling short in recognizing unseen and rare situations. In this paper, we exploit transferring knowledge from a teacher MLLM to a small GSR model to enhance its generalization and zero-shot abilities, thereby introducing the task of Open-vocabulary Grounded Situation Recognition (Ov-GSR). To achieve this, we propose Multimodal Interactive Prompt Distillation (MIPD), a novel framework that distills enriched multimodal knowledge from the foundation model, enabling the student Ov-GSR model to recognize unseen situations and be better aware of rare situations. Specifically, the MIPD framework first leverages the LLM-based Judgmental Rationales Generator (JRG) to construct positive and negative glimpse and gaze rationales enriched with contextual semantic information. The proposed scene-aware and instance-perception prompts are then introduced to align rationales with visual information from the MLLM teacher via the Negative-Guided Multimodal Prompting Alignment (NMPA) module, effectively capturing holistic and perceptual multimodal knowledge. Finally, the aligned multimodal knowledge is distilled into the student Ov-GSR model, providing a stronger foundation for generalization that enhances situation understanding, bridges the gap between seen and unseen scenarios, and mitigates prediction bias in rare cases. We evaluate MIPD on the refined Ov-SWiG dataset, achieving superior performance on seen, rare, and unseen situations, and further demonstrate improved unseen detection on the HICO-DET dataset.

[195] GTPBD: A Fine-Grained Global Terraced Parcel and Boundary Dataset

Zhiwei Zhang, Zi Ye, Yibin Wen, Shuai Yuan, Haohuan Fu, Jianxi Huang, Juepeng Zheng

Main category: cs.CV

TL;DR: The paper introduces GTPBD, a fine-grained terraced parcel dataset for complex terrains, addressing gaps in existing datasets and supporting multiple tasks like semantic segmentation and domain adaptation.

DetailsMotivation: Existing datasets lack representation of complex terraced terrains, limiting precision agriculture research. GTPBD aims to fill this gap with high-resolution, manually annotated data.

Method: GTPBD includes 47,537 high-resolution images with three-level labels (boundary, mask, parcel) covering diverse terrains and climatic regions. It benchmarks various methods for segmentation, edge detection, parcel extraction, and UDA.

Result: GTPBD provides 200,000+ complex terraced parcels, challenging due to terrain diversity, irregular objects, and domain styles. It supports four tasks and poses challenges beyond existing datasets.

Conclusion: GTPBD is a foundational resource for fine-grained agricultural terrain analysis and cross-scenario knowledge transfer, advancing terraced remote sensing research.

Abstract: Agricultural parcels serve as basic units for conducting agricultural practices and applications, which is vital for land ownership registration, food security assessment, soil erosion monitoring, etc. However, existing agricultural parcel extraction studies only focus on mid-resolution mapping or regular plain farmlands, lacking representation of the complex terraced terrains demanded by precision agriculture. In this paper, we introduce a more fine-grained terraced parcel dataset named GTPBD (Global Terraced Parcel and Boundary Dataset), which is the first fine-grained dataset covering major worldwide terraced regions with more than 200,000 complex terraced parcels with manual annotation. GTPBD comprises 47,537 high-resolution images with three-level labels, including pixel-level boundary labels, mask labels, and parcel labels. It covers seven major geographic zones in China and transcontinental climatic regions around the world. Compared to existing datasets, the GTPBD dataset brings considerable challenges due to: (1) terrain diversity; (2) complex and irregular parcel objects; and (3) multiple domain styles. Our proposed GTPBD dataset is suitable for four different tasks, including semantic segmentation, edge detection, terraced parcel extraction, and unsupervised domain adaptation (UDA) tasks. Accordingly, we benchmark the GTPBD dataset on eight semantic segmentation methods, four edge extraction methods, three parcel extraction methods, and five UDA methods, along with a multi-dimensional evaluation framework integrating pixel-level and object-level metrics. GTPBD fills a critical gap in terraced remote sensing research, providing a basic infrastructure for fine-grained agricultural terrain analysis and cross-scenario knowledge transfer.

[196] DAA*: Deep Angular A Star for Image-based Path Planning

Zhiwei Xu

Main category: cs.CV

TL;DR: The paper introduces Deep Angular A* (DAA*), a method improving path smoothness in imitation learning by incorporating Path Angular Freedom (PAF) into A*. It enhances path similarity and optimality, outperforming existing methods in evaluations.

DetailsMotivation: Path smoothness is often neglected in imitation learning from expert demonstrations, limiting path similarity and optimality.

Method: DAA* integrates PAF into A* to balance path shortening and smoothing, optimizing heuristic distance and angular freedom.

Result: DAA* improves path similarity metrics (SPR, ASIM, PSIM) by 3.9-9.0% over baselines and outperforms TransPath by 3.7-6.7%.

Conclusion: DAA* effectively balances path optimality and smoothness, demonstrating superior performance in diverse datasets with minor trade-offs in search efficiency.

Abstract: Path smoothness is often overlooked in path imitation learning from expert demonstrations. In this paper, we introduce a novel learning method, termed deep angular A* (DAA*), by incorporating the proposed path angular freedom (PAF) into A* to improve path similarity through adaptive path smoothness. The PAF aims to explore the effect of move angles on path node expansion by finding the trade-off between their minimum and maximum values, allowing for high adaptiveness for imitation learning. DAA* improves path optimality by closely aligning with the reference path through joint optimization of path shortening and smoothing, which correspond to heuristic distance and PAF, respectively. Through comprehensive evaluations on 7 datasets, including 4 maze datasets, 2 video-game datasets, and a real-world drone-view dataset containing 2 scenarios, we demonstrate remarkable improvements of our DAA* over neural A* in path similarity between the predicted and reference paths with a shorter path length when the shortest path is plausible, improving by 9.0% SPR, 6.9% ASIM, and 3.9% PSIM. Furthermore, when jointly learning pathfinding with both path loss and path probability map loss, DAA* significantly outperforms the state-of-the-art TransPath by 6.7% SPR, 6.5% PSIM, and 3.7% ASIM. We also discuss the minor trade-off between path optimality and search efficiency where applicable. Our code and model weights are available at https://github.com/zwxu064/DAAStar.git.
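
To make the length-versus-smoothness trade-off concrete, the sketch below adds a turn-angle penalty to classical grid A*; this is a simplified, non-learned analogue of the PAF idea, not the DAA* network itself (w_angle is a hypothetical weight):

```python
import heapq
import itertools
import math

def angular_a_star(grid, start, goal, w_angle=0.3):
    """A* on an 8-connected grid whose edge cost adds a turn-angle penalty,
    trading path length against smoothness.
    grid[r][c] == 1 marks an obstacle; start/goal are (row, col) tuples."""
    H, W = len(grid), len(grid[0])
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1),
             (-1, -1), (-1, 1), (1, -1), (1, 1)]

    def heur(p):
        return math.hypot(p[0] - goal[0], p[1] - goal[1])

    def turn(d0, d1):  # angle between consecutive moves, in radians
        if d0 is None:
            return 0.0
        dot = (d0[0] * d1[0] + d0[1] * d1[1]) / (math.hypot(*d0) * math.hypot(*d1))
        return math.acos(max(-1.0, min(1.0, dot)))

    tie = itertools.count()  # tiebreaker so the heap never compares states
    open_q = [(heur(start), next(tie), 0.0, start, None)]
    g_best, parent = {(start, None): 0.0}, {}
    while open_q:
        _, _, g, node, d_in = heapq.heappop(open_q)
        if node == goal:  # reconstruct path through (node, direction) states
            path, state = [node], (node, d_in)
            while state in parent:
                state = parent[state]
                path.append(state[0])
            return path[::-1]
        for d in moves:
            nxt = (node[0] + d[0], node[1] + d[1])
            if not (0 <= nxt[0] < H and 0 <= nxt[1] < W) or grid[nxt[0]][nxt[1]]:
                continue
            ng = g + math.hypot(*d) + w_angle * turn(d_in, d)
            if ng < g_best.get((nxt, d), float("inf")):
                g_best[(nxt, d)] = ng
                parent[(nxt, d)] = (node, d_in)
                heapq.heappush(open_q, (ng + heur(nxt), next(tie), ng, nxt, d))
    return None  # no path found
```

Setting w_angle to 0 recovers plain shortest-path A*; larger values buy smoother paths at the cost of length, mirroring the trade-off the paper optimizes jointly.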

[197] MultiRetNet: A Multimodal Vision Model and Deferral System for Staging Diabetic Retinopathy

Jeannie She, Katie Spivakovsky

Main category: cs.CV

TL;DR: MultiRetNet improves diabetic retinopathy staging by combining retinal imaging, socioeconomic data, and comorbidities, using multimodal fusion and a deferral system for clinician review.

DetailsMotivation: Address disparities in DR diagnosis, especially in underserved populations with limited screening access and higher risk of advanced disease.

Method: Proposes MultiRetNet, integrating retinal imaging, socioeconomic factors, and comorbidities. Tests three fusion methods, uses contrastive learning for deferral system training.

Result: Fully connected layer fusion is most versatile. System maintains accuracy on suboptimal images and identifies cases needing clinician review.

Conclusion: MultiRetNet enhances early detection, reduces costs, and promotes healthcare equity, particularly for underserved populations.

Abstract: Diabetic retinopathy (DR) is a leading cause of preventable blindness, affecting over 100 million people worldwide. In the United States, individuals from lower-income communities face a higher risk of progressing to advanced stages before diagnosis, largely due to limited access to screening. Comorbid conditions further accelerate disease progression. We propose MultiRetNet, a novel pipeline combining retinal imaging, socioeconomic factors, and comorbidity profiles to improve DR staging accuracy, integrated with a clinical deferral system for human-in-the-loop implementation. We experiment with three multimodal fusion methods and identify fusion through a fully connected layer as the most versatile methodology. We synthesize adversarial, low-quality images and use contrastive learning to train the deferral system, guiding the model to identify out-of-distribution samples that warrant clinician review. By maintaining diagnostic accuracy on suboptimal images and integrating critical health data, our system can improve early detection, particularly in underserved populations where advanced DR is often first identified. This approach may reduce healthcare costs, increase early detection rates, and address disparities in access to care, promoting healthcare equity.
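
Of the three fusion methods compared, the one found most versatile concatenates modalities and passes them through a fully connected layer. A minimal sketch under assumed feature dimensions (five output classes, matching the conventional DR grading scale) is:

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Fully-connected fusion: concatenate image features with tabular
    covariates (socioeconomic factors, comorbidities) and classify.
    Hidden width and dimensions are illustrative assumptions."""
    def __init__(self, img_dim: int, tab_dim: int, num_stages: int = 5,
                 hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + tab_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_stages),
        )

    def forward(self, img_feat: torch.Tensor, tabular: torch.Tensor):
        return self.net(torch.cat([img_feat, tabular], dim=-1))
```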

[198] Prompt2DEM: High-Resolution DEMs for Urban and Open Environments from Global Prompts Using a Monocular Foundation Model

Osher Rafaeli, Tal Svoray, Ariel Nahlieli

Main category: cs.CV

TL;DR: A framework for high-resolution DEM estimation using prompt-based monocular depth estimation, achieving 100x resolution gain and robust generalization.

DetailsMotivation: High-resolution elevation data is crucial for hydrology, urban studies, and ecosystem monitoring, but existing methods have limitations like upscaling constraints or lack of global context.

Method: Uses low-resolution SRTM data as prompts with high-resolution NAIP imagery, fine-tuning a vision transformer encoder with LiDAR-derived DEMs for tasks like DEM estimation, void filling, and updating.

Result: Achieves 30-cm resolution (100x gain), <5 m MAE relative to LiDAR, and improves over SRTM by up to 18%. Validated across diverse landscapes and scalable for large regions.

Conclusion: The framework offers a scalable, accurate solution for global elevation mapping, suitable for hydrological and environmental studies, with publicly available code and models.

Abstract: High-resolution elevation estimations are essential to understand catchment and hillslope hydrology, study urban morphology and dynamics, and monitor the growth, decline, and mortality of terrestrial ecosystems. Various deep learning approaches (e.g., super-resolution techniques, monocular depth estimation) have been developed to create high-resolution Digital Elevation Models (DEMs). However, super-resolution techniques are limited by the upscaling factor, and monocular depth estimation lacks global elevation context, restricting its conversion into a seamless DEM. The recently introduced technique of prompt-based monocular depth estimation has opened new opportunities to extract estimates of absolute elevation in a global context. We present here a framework for the estimation of high-resolution DEMs as a new paradigm for absolute global elevation mapping. It is exemplified using low-resolution Shuttle Radar Topography Mission (SRTM) elevation data as prompts and high-resolution RGB imagery from the National Agriculture Imagery Program (NAIP). The approach fine-tunes a vision transformer encoder with LiDAR-derived DEMs and employs a versatile prompting strategy, enabling tasks such as DEM estimation, void filling, and updating. Our framework achieves a 100x resolution gain (from 30-m to 30-cm), surpassing prior methods by an order of magnitude. Evaluations across three diverse U.S. landscapes show robust generalization, capturing urban structures and fine-scale terrain features with < 5 m MAE relative to LiDAR, improving over SRTM by up to 18%. Hydrological analysis confirms suitability for hazard and environmental studies. We demonstrate scalability by applying the framework to large regions in the U.S. and Israel. All code and pretrained models are publicly available at: https://osherr1996.github.io/prompt2dem_propage/.

[199] InterAct-Video: Reasoning-Rich Video QA for Urban Traffic

Joseph Raj Vishal, Rutuja Patil, Manas Srinivas Gowda, Katha Naik, Yezhou Yang, Bharatesh Chakravarthi

Main category: cs.CV

TL;DR: The paper introduces InterAct VideoQA, a dataset for benchmarking VideoQA models in traffic monitoring, addressing challenges in real-world traffic scenes.

DetailsMotivation: Existing VideoQA models struggle with complex real-world traffic scenes, necessitating a domain-specific dataset for improved performance.

Method: The InterAct VideoQA dataset includes 8 hours of traffic footage, segmented into 10-second clips with 25,000 QA pairs, covering spatiotemporal dynamics and vehicle interactions.

Result: Evaluation on InterAct VideoQA reveals challenges in reasoning over spatiotemporal dependencies, but fine-tuning improves model performance.

Conclusion: InterAct VideoQA serves as a benchmark to advance VideoQA models for intelligent transportation systems.

Abstract: Traffic monitoring is crucial for urban mobility, road safety, and intelligent transportation systems (ITS). Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models, enabling structured insight extraction from traffic videos. However, existing VideoQA models struggle with the complexity of real-world traffic scenes, where multiple concurrent events unfold across spatiotemporal dimensions. To address these challenges, this paper introduces InterAct VideoQA, a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks. The InterAct VideoQA dataset comprises 8 hours of real-world traffic footage collected from diverse intersections, segmented into 10-second video clips, with over 25,000 question-answer (QA) pairs covering spatiotemporal dynamics, vehicle interactions, incident detection, and other critical traffic attributes. State-of-the-art VideoQA models are evaluated on InterAct VideoQA, exposing challenges in reasoning over fine-grained spatiotemporal dependencies within complex traffic scenarios. Additionally, fine-tuning these models on InterAct VideoQA yields notable performance improvements, demonstrating the necessity of domain-specific datasets for VideoQA. InterAct VideoQA is publicly available as a benchmark dataset to facilitate future research in real-world deployable VideoQA models for intelligent transportation systems. GitHub Repo: https://github.com/joe-rabbit/InterAct_VideoQA

[200] LeAdQA: LLM-Driven Context-Aware Temporal Grounding for Video Question Answering

Xinxin Dong, Baoyun Peng, Haokai Ma, Yufei Wang, Zixuan Dong, Fei Hu, Xiaodong Wang

Main category: cs.CV

TL;DR: LeAdQA improves VideoQA by refining queries with causal awareness and fine-grained visual grounding, outperforming current methods in complex reasoning tasks.

DetailsMotivation: Current VideoQA methods struggle with task-agnostic sampling and heuristic retrieval, missing causal-temporal structures needed for complex reasoning.

Method: LeAdQA uses LLMs to refine question-option pairs, directs a temporal grounding model to retrieve salient segments, and employs adaptive fusion for evidence integration.

Result: Achieves SOTA performance on NExT-QA, IntentQA, and NExT-GQA, enhancing video-question understanding.

Conclusion: LeAdQA effectively addresses limitations in VideoQA by combining causal-aware query refinement and precise visual grounding.

Abstract: Video Question Answering (VideoQA) requires identifying sparse critical moments in long videos and reasoning about their causal relationships to answer semantically complex questions. While recent advances in multimodal learning have improved alignment and fusion, current approaches remain limited by two prevalent but fundamentally flawed strategies: (1) task-agnostic sampling indiscriminately processes all frames, overwhelming key events with irrelevant content; and (2) heuristic retrieval captures superficial patterns but misses causal-temporal structures needed for complex reasoning. To address these challenges, we introduce LeAdQA, an innovative approach that bridges these gaps through synergizing causal-aware query refinement with fine-grained visual grounding. Our method first leverages LLMs to reformulate question-option pairs, resolving causal ambiguities and sharpening temporal focus. These refined queries subsequently direct a temporal grounding model to precisely retrieve the most salient segments, complemented by an adaptive fusion mechanism dynamically integrating the evidence to maximize relevance. The integrated visual-textual cues are then processed by an MLLM to generate accurate, contextually-grounded answers. Experiments on NExT-QA, IntentQA, and NExT-GQA demonstrate that our method’s precise visual grounding substantially enhances the understanding of video-question relationships, achieving state-of-the-art (SOTA) performance on complex reasoning tasks while maintaining computational efficiency.

[201] FOCUS: Fused Observation of Channels for Unveiling Spectra

Xi Xiao, Aristeidis Tsaris, Anika Tabassum, John Lagergren, Larry M. York, Tianyang Wang, Xiao Wang

Main category: cs.CV

TL;DR: FOCUS enables efficient and reliable spatial-spectral interpretability for Vision Transformers (ViTs) in hyperspectral imaging (HSI) by addressing challenges of spectral cue capture and computational constraints.

DetailsMotivation: Existing saliency methods fail to capture meaningful spectral cues in HSI, and full-spectrum ViTs are computationally prohibitive for interpretability.

Method: FOCUS introduces class-specific spectral prompts and a learnable [SINK] token to guide attention and absorb noise, enabling stable 3D saliency maps and spectral importance curves without gradient backpropagation.

Result: FOCUS improves band-level IoU by 15%, reduces attention collapse by over 40%, and aligns closely with expert annotations.

Conclusion: FOCUS bridges the gap between black-box ViTs and trustworthy HSI decision-making with minimal parameter overhead, making high-resolution interpretability practical.

Abstract: Hyperspectral imaging (HSI) captures hundreds of narrow, contiguous wavelength bands, making it a powerful tool in biology, agriculture, and environmental monitoring. However, interpreting Vision Transformers (ViTs) in this setting remains largely unexplored due to two key challenges: (1) existing saliency methods struggle to capture meaningful spectral cues, often collapsing attention onto the class token, and (2) full-spectrum ViTs are computationally prohibitive for interpretability, given the high-dimensional nature of HSI data. We present FOCUS, the first framework that enables reliable and efficient spatial-spectral interpretability for frozen ViTs. FOCUS introduces two core components: class-specific spectral prompts that guide attention toward semantically meaningful wavelength groups, and a learnable [SINK] token trained with an attraction loss to absorb noisy or redundant attention. Together, these designs make it possible to generate stable and interpretable 3D saliency maps and spectral importance curves in a single forward pass, without any gradient backpropagation or backbone modification. FOCUS improves band-level IoU by 15 percent, reduces attention collapse by over 40 percent, and produces saliency results that align closely with expert annotations. With less than 1 percent parameter overhead, our method makes high-resolution ViT interpretability practical for real-world hyperspectral applications, bridging a long-standing gap between black-box modeling and trustworthy HSI decision-making.

[202] A Novel Downsampling Strategy Based on Information Complementarity for Medical Image Segmentation

Wenbo Yue, Chang Li, Guoping Xu

Main category: cs.CV

TL;DR: The paper proposes Hybrid Pooling Downsampling (HPD), a method replacing traditional downsampling in CNNs to retain spatial details, improving semantic segmentation accuracy.

DetailsMotivation: Traditional downsampling methods in CNNs lose key spatial information, affecting pixel-level prediction in semantic segmentation.

Method: HPD uses MinMaxPooling to retain image contrast and details by extracting local minimum and maximum values, replacing traditional downsampling methods such as max pooling.

Result: Experiments on ACDC and Synapse datasets show HPD improves segmentation, increasing DSC by 0.5% on average.

Conclusion: HPD offers an efficient solution for semantic segmentation by preserving spatial details better than traditional methods.

Abstract: In convolutional neural networks (CNNs), downsampling operations are crucial to model performance. Although traditional downsampling methods (such as max pooling and strided convolution) perform well in feature aggregation, receptive field expansion, and computational reduction, they may lead to the loss of key spatial information in semantic segmentation tasks, thereby affecting pixel-level prediction accuracy. To this end, this study proposes a downsampling method based on information complementarity: Hybrid Pooling Downsampling (HPD). Its core is to replace traditional methods with MinMaxPooling, which effectively retains the light-dark contrast and detail features of the image by extracting the minimum and maximum values of local regions. Experiments with various CNN architectures on the ACDC and Synapse datasets show that HPD outperforms traditional methods in segmentation performance, increasing DSC by 0.5% on average. The results show that the HPD module provides an efficient solution for semantic segmentation tasks.
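
A minimal PyTorch sketch of the MinMaxPooling idea follows: min pooling is implemented as negated max pooling of the negated input, and a 1x1 convolution (an assumption; the paper only states that min and max information is combined) mixes the two branches back to the input channel count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MinMaxPool2d(nn.Module):
    """Hybrid min/max downsampling: keeps both the bright (max) and dark (min)
    extremes of each local window, then mixes them back to the input width.
    The 1x1 mixing convolution is an assumption, not the paper's exact design."""

    def __init__(self, channels, kernel_size=2, stride=2):
        super().__init__()
        self.kernel_size, self.stride = kernel_size, stride
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        mx = F.max_pool2d(x, self.kernel_size, self.stride)
        mn = -F.max_pool2d(-x, self.kernel_size, self.stride)  # min pooling
        return self.mix(torch.cat([mx, mn], dim=1))

y = MinMaxPool2d(64)(torch.randn(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 64, 16, 16])
```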

[203] Distilling Parallel Gradients for Fast ODE Solvers of Diffusion Models

Beier Zhu, Ruoyu Wang, Tong Zhao, Hanwang Zhang, Chi Zhang

Main category: cs.CV

TL;DR: The paper introduces EPD, a novel ODE solver for diffusion models that reduces sampling latency while maintaining image quality by using parallel gradient evaluations.

DetailsMotivation: Diffusion models suffer from high sampling latency due to sequential denoising, and existing acceleration methods degrade image quality under low-latency constraints.

Method: EPD incorporates multiple parallel gradient evaluations per ODE step, fully parallelizing computations. It optimizes learnable parameters via distillation with minimal training overhead.

Result: EPD achieves superior performance (e.g., FID scores of 4.47 on CIFAR-10, 7.97 on FFHQ) at low latency (5 NFE), outperforming existing solvers.

Conclusion: EPD is an effective, plug-and-play solution for high-quality, low-latency sampling in diffusion models.

Abstract: Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face image quality degradation under a low-latency budget. In this paper, we propose the Ensemble Parallel Direction solver (dubbed EPD), a novel ODE solver that mitigates truncation errors by incorporating multiple parallel gradient evaluations in each ODE step. Importantly, since the additional gradient computations are independent, they can be fully parallelized, preserving low-latency sampling. Our method optimizes a small set of learnable parameters in a distillation fashion, ensuring minimal training overhead. In addition, our method can serve as a plugin to improve existing ODE samplers. Extensive experiments on various image synthesis benchmarks demonstrate the effectiveness of EPD in achieving high-quality and low-latency sampling. For example, at the same latency level of 5 NFE, EPD achieves an FID of 4.47 on CIFAR-10, 7.97 on FFHQ, 8.17 on ImageNet, and 8.26 on LSUN Bedroom, surpassing existing learning-based solvers by a significant margin. Code is available at https://github.com/BeierZhu/EPD.
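
The sketch below illustrates the parallel-gradient idea in isolation: one Euler-style step whose direction is a weighted ensemble of k model evaluations that can run as a single batched (parallel) forward pass. The evaluation-time parameterization (`taus`) and the mixing `weights` are placeholders; in the paper both are learned by distillation, and the exact EPD update differs.

```python
import torch

def epd_like_step(eps_model, x, t, t_next, taus, weights):
    """One Euler-style ODE step whose direction ensembles k gradient
    evaluations at intermediate times `taus` (k,), mixed by `weights` (k,).
    All evaluations share the same input state, so they run as one batched
    forward pass. Loose sketch only; `eps_model(x, t)` is a hypothetical
    noise-prediction network taking batched states and times."""
    k = taus.shape[0]
    xs = x.unsqueeze(0).expand(k, *x.shape)           # (k, ...) shared state
    ts = t + taus * (t_next - t)                      # k intermediate times
    with torch.no_grad():
        grads = eps_model(xs, ts)                     # one parallel batch
    d = torch.einsum("k,k...->...", weights, grads)   # weighted ensemble
    return x + (t_next - t) * d

dummy = lambda x, t: -x  # stand-in for a diffusion noise network
x1 = epd_like_step(dummy, torch.randn(3, 8, 8), torch.tensor(1.0),
                   torch.tensor(0.8), torch.tensor([0.0, 0.5, 1.0]),
                   torch.softmax(torch.zeros(3), dim=0))
```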

[204] An Evaluation of DUSt3R/MASt3R/VGGT 3D Reconstruction on Photogrammetric Aerial Blocks

Xinyi Wu, Steven Landgraf, Markus Ulrich, Rongjun Qin

Main category: cs.CV

TL;DR: The paper evaluates DUSt3R, MASt3R, and VGGT models on aerial images, showing their effectiveness in sparse, low-resolution scenarios but limitations with high-resolution or large image sets.

DetailsMotivation: To assess the performance of transformer-based 3D reconstruction models (DUSt3R, MASt3R, VGGT) on photogrammetric aerial blocks, which remains unexplored despite their success in sparse image sets.

Method: Comprehensive evaluation of pre-trained DUSt3R, MASt3R, and VGGT models on the UseGeo dataset for pose estimation and dense 3D reconstruction, focusing on sparse image sets (fewer than 10 images, up to 518 pixels resolution).

Result: The models accurately reconstruct dense point clouds from sparse images, with up to +50% completeness gains over COLMAP. VGGT shows higher efficiency and reliability in pose estimation. However, performance declines with high-resolution images or large sets.

Conclusion: Transformer-based methods are promising for sparse, low-resolution scenarios but cannot fully replace traditional SfM and MVS. They serve as complementary tools in challenging conditions.

Abstract: State-of-the-art 3D computer vision algorithms continue to advance in handling sparse, unordered image sets. Recently developed foundational models for 3D reconstruction, such as Dense and Unconstrained Stereo 3D Reconstruction (DUSt3R), Matching and Stereo 3D Reconstruction (MASt3R), and Visual Geometry Grounded Transformer (VGGT), have attracted attention due to their ability to handle very sparse image overlaps. Evaluating DUSt3R/MASt3R/VGGT on typical aerial images matters, as these models may handle extremely low image overlaps, stereo occlusions, and textureless regions. For redundant collections, they can accelerate 3D reconstruction by using extremely sparsified image sets. Despite tests on various computer vision benchmarks, their potential on photogrammetric aerial blocks remains unexplored. This paper conducts a comprehensive evaluation of the pre-trained DUSt3R/MASt3R/VGGT models on the aerial blocks of the UseGeo dataset for pose estimation and dense 3D reconstruction. Results show these methods can accurately reconstruct dense point clouds from very sparse image sets (fewer than 10 images, up to 518 pixels resolution), with completeness gains up to +50% over COLMAP. VGGT also demonstrates higher computational efficiency, scalability, and more reliable camera pose estimation. However, all exhibit limitations with high-resolution images and large sets, as pose reliability declines with more images and geometric complexity. These findings suggest transformer-based methods cannot fully replace traditional SfM and MVS, but offer promise as complementary approaches, especially in challenging, low-resolution, and sparse scenarios.

[205] Exploring Scalable Unified Modeling for General Low-Level Vision

Xiangyu Chen, Kaiwen Zhu, Yuandong Pu, Shuo Cao, Xiaohui Li, Wenlong Zhang, Yihao Liu, Yu Qiao, Jiantao Zhou, Chao Dong

Main category: cs.CV

TL;DR: The paper proposes a Visual task Prompt-based Image Processing (VPIP) framework for unified modeling of diverse low-level vision tasks, achieving strong performance and scalability.

DetailsMotivation: Addressing the challenge of unified modeling across diverse low-level vision tasks (e.g., restoration, enhancement, stylization) with varying formulations and outputs.

Method: Introduces VPIP, leveraging input-target image pairs as visual prompts, with an end-to-end backbone, prompt encoder, and interaction module. Develops GenLV, a unified model, and tests scalability via model capacity and task diversity.

Result: Achieves strong performance across 100+ tasks; joint training improves generalization, especially for data-limited tasks. Demonstrates adaptability in zero-shot, few-shot, and fine-tuning scenarios.

Conclusion: VPIP is effective, scalable, and a promising foundation for general low-level vision modeling.

Abstract: Low-level vision involves a wide spectrum of tasks, including image restoration, enhancement, stylization, and feature extraction, which differ significantly in both task formulation and output domains. To address the challenge of unified modeling across such diverse tasks, we propose a Visual task Prompt-based Image Processing (VPIP) framework that leverages input-target image pairs as visual prompts to guide the model in performing a variety of low-level vision tasks. The framework comprises an end-to-end image processing backbone, a prompt encoder, and a prompt interaction module, enabling flexible integration with various architectures and effective utilization of task-specific visual representations. Based on this design, we develop a unified low-level vision model, GenLV, and evaluate its performance across multiple representative tasks. To explore the scalability of this approach, we extend the framework along two dimensions: model capacity and task diversity. We construct a large-scale benchmark consisting of over 100 low-level vision tasks and train multiple versions of the model with varying scales. Experimental results show that the proposed method achieves considerable performance across a wide range of tasks. Notably, increasing the number of training tasks enhances generalization, particularly for tasks with limited data, indicating the model’s ability to learn transferable representations through joint training. Further evaluations in zero-shot generalization, few-shot transfer, and task-specific fine-tuning scenarios demonstrate the model’s strong adaptability, confirming the effectiveness, scalability, and potential of the proposed framework as a unified foundation for general low-level vision modeling.

[206] Seeing Through Deepfakes: A Human-Inspired Framework for Multi-Face Detection

Juan Hu, Shaojing Fan, Terence Sim

Main category: cs.CV

TL;DR: The paper introduces HICOM, a framework for detecting multi-face deepfake videos by leveraging human cognition cues, improving accuracy and interpretability.

DetailsMotivation: Existing deepfake detection methods struggle with multi-face scenarios due to a lack of contextual awareness.

Method: Human studies identified four key cues for detection, which were used to develop HICOM, a framework incorporating these cues and an LLM for explanations.

Result: HICOM improved accuracy by 3.3% in-dataset and 2.8% under perturbations, and outperformed existing methods by 5.8% on unseen datasets.

Conclusion: Incorporating human-inspired cues enhances deepfake detection, offering better accuracy, generalization, and interpretability.

Abstract: Multi-face deepfake videos are becoming increasingly prevalent, often appearing in natural social settings that challenge existing detection methods. Most current approaches excel at single-face detection but struggle in multi-face scenarios, due to a lack of awareness of crucial contextual cues. In this work, we develop a novel approach that leverages human cognition to analyze and defend against multi-face deepfake videos. Through a series of human studies, we systematically examine how people detect deepfake faces in social settings. Our quantitative analysis reveals four key cues humans rely on: scene-motion coherence, inter-face appearance compatibility, interpersonal gaze alignment, and face-body consistency. Guided by these insights, we introduce HICOM, a novel framework designed to detect every fake face in multi-face scenarios. Extensive experiments on benchmark datasets show that HICOM improves average accuracy by 3.3% in in-dataset detection and 2.8% under real-world perturbations. Moreover, it outperforms existing methods by 5.8% on unseen datasets, demonstrating the generalization of human-inspired cues. HICOM further enhances interpretability by incorporating an LLM to provide human-readable explanations, making detection results more transparent and convincing. Our work sheds light on involving human factors to enhance defense against deepfakes.

[207] SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models

Jiaji Zhang, Ruichao Sun, Hailiang Zhao, Jiaju Wu, Peng Chen, Hao Li, Xinkui Zhao, Kingsum Chow, Gang Xiong, Lin Ye, Shuiguang Deng

Main category: cs.CV

TL;DR: SegQuant is a unified quantization framework for diffusion models, combining segment-aware and dual-scale techniques to improve efficiency without retraining.

DetailsMotivation: Existing PTQ methods for diffusion models lack generalizability and compatibility with industrial pipelines, limiting their deployment.

Method: SegQuant uses a segment-aware, graph-based strategy (SegLinear) and a dual-scale quantization scheme (DualScale) to preserve visual fidelity.

Result: SegQuant achieves strong performance across models and ensures compatibility with deployment tools.

Conclusion: SegQuant offers a versatile and efficient solution for quantizing diffusion models, enhancing their practicality in resource-constrained settings.

Abstract: Diffusion models have demonstrated exceptional generative capabilities but are computationally intensive, posing significant challenges for deployment in resource-constrained or latency-sensitive environments. Quantization offers an effective means to reduce model size and computational cost, with post-training quantization (PTQ) being particularly appealing due to its compatibility with pre-trained models without requiring retraining or training data. However, existing PTQ methods for diffusion models often rely on architecture-specific heuristics that limit their generalizability and hinder integration with industrial deployment pipelines. To address these limitations, we propose SegQuant, a unified quantization framework that adaptively combines complementary techniques to enhance cross-model versatility. SegQuant consists of a segment-aware, graph-based quantization strategy (SegLinear) that captures structural semantics and spatial heterogeneity, along with a dual-scale quantization scheme (DualScale) that preserves polarity-asymmetric activations, which is crucial for maintaining visual fidelity in generated outputs. SegQuant is broadly applicable beyond Transformer-based diffusion models, achieving strong performance while ensuring seamless compatibility with mainstream deployment tools.
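
As a rough illustration of the DualScale idea, the NumPy sketch below quantizes positive and negative activations with separate scales, which preserves polarity-asymmetric ranges such as post-SiLU activations; the paper's actual calibration procedure and granularity are not reproduced here.

```python
import numpy as np

def dual_scale_quantize(x, n_bits=8):
    """Per-sign quantization: positive and negative activations get their own
    scale, preserving polarity-asymmetric ranges (e.g. after SiLU, where
    negatives are bounded near zero but positives are not). Sketch of the
    DualScale idea under assumed symmetric rounding."""
    qmax = 2 ** (n_bits - 1) - 1
    s_pos = max(x.max(), 1e-8) / qmax
    s_neg = max(-x.min(), 1e-8) / qmax
    q = np.where(x >= 0, np.round(x / s_pos), np.round(x / s_neg))
    q = np.clip(q, -qmax, qmax)
    return np.where(q >= 0, q * s_pos, q * s_neg)  # dequantized values

x = np.concatenate([np.random.uniform(-0.28, 0, 1000),   # SiLU-like negatives
                    np.random.uniform(0, 6.0, 1000)])     # large positives
err = np.abs(dual_scale_quantize(x) - x).mean()
print(f"mean abs quantization error: {err:.4f}")
```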

[208] FinChart-Bench: Benchmarking Financial Chart Comprehension in Vision-Language Models

Dong Shu, Haoyang Yuan, Yuchen Wang, Yanguang Liu, Huopu Zhang, Haiyan Zhao, Mengnan Du

Main category: cs.CV

TL;DR: FinChart-Bench is a new benchmark for evaluating LVLMs on financial charts, revealing key limitations in their performance.

DetailsMotivation: Financial charts are complex and underexplored, necessitating a dedicated benchmark to assess LVLM capabilities.

Method: FinChart-Bench includes 1,200 financial chart images with 7,016 annotated questions (TF, MC, QA). 25 LVLMs were evaluated.

Result: Key findings: narrowing gap between open/closed-source models, performance degradation in upgrades, struggles with instruction following, spatial reasoning limitations, and unreliability as automated evaluators.

Conclusion: Current LVLMs have significant limitations in financial chart understanding, highlighting the need for further research.

Abstract: Large vision-language models (LVLMs) have made significant progress in chart understanding. However, financial charts, characterized by complex temporal structures and domain-specific terminology, remain notably underexplored. We introduce FinChart-Bench, the first benchmark specifically focused on real-world financial charts. FinChart-Bench comprises 1,200 financial chart images collected from 2015 to 2024, each annotated with True/False (TF), Multiple Choice (MC), and Question Answering (QA) questions, totaling 7,016 questions. We conduct a comprehensive evaluation of 25 state-of-the-art LVLMs on FinChart-Bench. Our evaluation reveals critical insights: (1) the performance gap between open-source and closed-source models is narrowing, (2) performance degradation occurs in upgraded models within families, (3) many models struggle with instruction following, (4) both advanced models show significant limitations in spatial reasoning abilities, and (5) current LVLMs are not reliable enough to serve as automated evaluators. These findings highlight important limitations in current LVLM capabilities for financial chart understanding. The FinChart-Bench dataset is available at https://huggingface.co/datasets/Tizzzzy/FinChart-Bench.

[209] PHATNet: A Physics-guided Haze Transfer Network for Domain-adaptive Real-world Image Dehazing

Fu-Jen Tsai, Yan-Tsung Peng, Yen-Yu Lin, Chia-Wen Lin

Main category: cs.CV

TL;DR: PHATNet improves dehazing by transferring haze patterns from unseen domains to source images, enhancing model adaptability with novel losses.

DetailsMotivation: Existing dehazing models struggle with unseen real-world hazy images due to limited training data, prompting a need for better domain adaptation.

Method: Proposes PHATNet, which transfers haze patterns to source-domain images, and introduces Haze-Transfer-Consistency and Content-Leakage Losses for better disentanglement.

Result: PHATNet significantly enhances state-of-the-art dehazing models on real-world datasets.

Conclusion: PHATNet offers an effective domain adaptation solution for image dehazing, improving performance on unseen data.

Abstract: Image dehazing aims to remove unwanted hazy artifacts in images. Although previous research has collected paired real-world hazy and haze-free images to improve dehazing models’ performance in real-world scenarios, these models often experience significant performance drops when handling unseen real-world hazy images due to limited training data. This issue motivates us to develop a flexible domain adaptation method to enhance dehazing performance during testing. Observing that predicting haze patterns is generally easier than recovering clean content, we propose the Physics-guided Haze Transfer Network (PHATNet) which transfers haze patterns from unseen target domains to source-domain haze-free images, creating domain-specific fine-tuning sets to update dehazing models for effective domain adaptation. Additionally, we introduce a Haze-Transfer-Consistency loss and a Content-Leakage Loss to enhance PHATNet’s disentanglement ability. Experimental results demonstrate that PHATNet significantly boosts state-of-the-art dehazing models on benchmark real-world image dehazing datasets.

[210] Paired Image Generation with Diffusion-Guided Diffusion Models

Haoxuan Zhang, Wenju Cui, Yuzhu Cao, Tao Tan, Jie Liu, Yunsong Peng, Jian Zheng

Main category: cs.CV

TL;DR: A paired image generation method for DBT images is proposed to improve lesion segmentation by generating high-quality paired images and annotations without external conditions.

DetailsMotivation: High concealment of mass lesions in DBT images makes manual annotation difficult, leading to a lack of annotated data for training. Existing diffusion models struggle with lesion feature learning and lack annotation generation.

Method: A conditional diffusion model with an extra diffusion guider is trained to generate paired DBT slices and lesion masks, enhancing supervised training.

Result: The method improves generation quality and alleviates data shortage, boosting performance in downstream segmentation tasks.

Conclusion: The proposed method effectively addresses data scarcity and enhances lesion segmentation in DBT images.

Abstract: The segmentation of mass lesions in digital breast tomosynthesis (DBT) images is very significant for the early screening of breast cancer. However, the high-density breast tissue often leads to high concealment of the mass lesions, which makes manual annotation difficult and time-consuming. As a result, there is a lack of annotated data for model training. Diffusion models are commonly used for data augmentation, but the existing methods face two challenges. First, due to the high concealment of lesions, it is difficult for the model to learn the features of the lesion area. This leads to the low generation quality of the lesion areas, thus limiting the quality of the generated images. Second, existing methods can only generate images and cannot generate corresponding annotations, which restricts the usability of the generated images in supervised training. In this work, we propose a paired image generation method. The method does not require external conditions and can achieve the generation of paired images by training an extra diffusion guider for the conditional diffusion model. During the experimental phase, we generated paired DBT slices and mass lesion masks. Then, we incorporated them into the supervised training process of the mass lesion segmentation task. The experimental results show that our method can improve the generation quality without external conditions. Moreover, it contributes to alleviating the shortage of annotated data, thus enhancing the performance of downstream tasks.

[211] Training Self-Supervised Depth Completion Using Sparse Measurements and a Single Image

Rizhao Fan, Zhigen Li, Heping Li, Ning An

Main category: cs.CV

TL;DR: A novel self-supervised depth completion method using only sparse depth and single images, eliminating the need for dense labels or multi-frame data.

DetailsMotivation: Overcoming the limitations of costly dense annotations and multi-frame dependencies in existing depth completion methods.

Method: Proposes self-supervised training with sparse depth and single images, using novel loss functions and segmentation maps from vision foundation models.

Result: Effective depth propagation from observed to unobserved regions, validated through extensive experiments.

Conclusion: The method offers a practical solution for depth completion without dense labels or multi-frame data, with promising performance.

Abstract: Depth completion is an important vision task, and many efforts have been made to enhance the quality of depth maps from sparse depth measurements. Despite significant advances, training these models to recover dense depth from sparse measurements remains a challenging problem. Supervised learning methods rely on dense depth labels to predict unobserved regions, while self-supervised approaches require image sequences to enforce geometric constraints and photometric consistency between frames. However, acquiring dense annotations is costly, and multi-frame dependencies limit the applicability of self-supervised methods in static or single-frame scenarios. To address these challenges, we propose a novel self-supervised depth completion paradigm that requires only sparse depth measurements and their corresponding image for training. Unlike existing methods, our approach eliminates the need for dense depth labels or additional images captured from neighboring viewpoints. By leveraging the characteristics of depth distribution, we design novel loss functions that effectively propagate depth information from observed points to unobserved regions. Additionally, we incorporate segmentation maps generated by vision foundation models to further enhance depth estimation. Extensive experiments demonstrate the effectiveness of our proposed method.
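
A minimal stand-in for this training signal is sketched below: an L1 term at the sparse measurements plus a smoothness term that propagates depth into unobserved regions but is disabled across segment boundaries from a foundation-model segmentation. The actual loss functions in the paper are more elaborate; this only illustrates the propagation mechanism.

```python
import torch

def sparse_depth_loss(pred, sparse_depth, seg, w_smooth=0.1):
    """Illustrative loss: (1) L1 against sparse measurements where they exist,
    (2) segmentation-gated smoothness that spreads depth into unobserved
    regions without bleeding across object boundaries.
    pred, sparse_depth: (B,1,H,W), sparse_depth is 0 where unobserved.
    seg: (B,1,H,W) integer segment ids."""
    mask = (sparse_depth > 0).float()
    data = (mask * (pred - sparse_depth).abs()).sum() / mask.sum().clamp(min=1)

    dx = (pred[..., :, 1:] - pred[..., :, :-1]).abs()
    dy = (pred[..., 1:, :] - pred[..., :-1, :]).abs()
    same_x = (seg[..., :, 1:] == seg[..., :, :-1]).float()  # same segment?
    same_y = (seg[..., 1:, :] == seg[..., :-1, :]).float()
    smooth = (dx * same_x).mean() + (dy * same_y).mean()
    return data + w_smooth * smooth

sparse = torch.rand(2, 1, 64, 64) * (torch.rand(2, 1, 64, 64) < 0.05)
loss = sparse_depth_loss(torch.rand(2, 1, 64, 64), sparse,
                         torch.randint(0, 8, (2, 1, 64, 64)))
```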

[212] An Uncertainty-aware DETR Enhancement Framework for Object Detection

Xingshu Chen, Sicheng Yu, Chong Cheng, Hao Wang, Ting Tian

Main category: cs.CV

TL;DR: The paper introduces an uncertainty-aware framework for DETR-based object detectors, improving localization accuracy and modeling prediction uncertainty using Gaussian distributions and Gromov-Wasserstein distance. It achieves state-of-the-art results on COCO and leukocyte detection tasks.

DetailsMotivation: Conventional object detectors lack uncertainty modeling, limiting robustness. This work aims to enhance DETR-based detectors by explicitly addressing prediction uncertainty and improving localization accuracy.

Method: Proposes modeling bounding boxes as multivariate Gaussian distributions, incorporating Gromov-Wasserstein distance in the loss function, and using Bayes Risk to filter high-risk predictions. Also introduces a method to quantify localization uncertainty.

Result: Demonstrates improved performance on COCO benchmark and achieves state-of-the-art results on leukocyte detection datasets (LISC and WBCDD).

Conclusion: The framework is scalable and effective for both general and domain-specific object detection tasks, confirming its robustness and versatility.

Abstract: This paper investigates the problem of object detection with a focus on improving both the localization accuracy of bounding boxes and explicitly modeling prediction uncertainty. Conventional detectors rely on deterministic bounding box regression, ignoring uncertainty in predictions and limiting model robustness. In this paper, we propose an uncertainty-aware enhancement framework for DETR-based object detectors. We model bounding boxes as multivariate Gaussian distributions and incorporate the Gromov-Wasserstein distance into the loss function to better align the predicted and ground-truth distributions. Building on this, we derive a Bayes Risk formulation to filter high-risk information and improve detection reliability. We also propose a simple algorithm to quantify localization uncertainty via confidence intervals. Experiments on the COCO benchmark show that our method can be effectively integrated into existing DETR variants, enhancing their performance. We further extend our framework to leukocyte detection tasks, achieving state-of-the-art results on the LISC and WBCDD datasets. These results confirm the scalability of our framework across both general and domain-specific detection tasks. Code page: https://github.com/ParadiseforAndaChen/An-Uncertainty-aware-DETR-Enhancement-Framework-for-Object-Detection.
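
To make the probabilistic box idea concrete, the sketch below models boxes as diagonal 4-D Gaussians and scores them with the closed-form squared 2-Wasserstein distance as a tractable stand-in; the paper uses a Gromov-Wasserstein distance, whose exact form is not reproduced here. The predicted variances double as localization-uncertainty estimates.

```python
import torch

def gaussian_box_w2(mu_p, logvar_p, mu_g, logvar_g):
    """Boxes as 4-D Gaussians over (cx, cy, w, h): predicted mean/log-variance
    versus ground truth treated as a sharp Gaussian. Uses the closed-form
    squared 2-Wasserstein distance between diagonal Gaussians as a stand-in
    for the paper's Gromov-Wasserstein term. All tensors: (N, 4)."""
    mean_term = (mu_p - mu_g).pow(2).sum(-1)
    std_term = (logvar_p.mul(0.5).exp() - logvar_g.mul(0.5).exp()).pow(2).sum(-1)
    return mean_term + std_term

mu_p, lv_p = torch.rand(8, 4), torch.randn(8, 4)
# sharp ground-truth Gaussians: tiny fixed variance exp(-8)
loss = gaussian_box_w2(mu_p, lv_p, torch.rand(8, 4),
                       torch.full((8, 4), -8.0)).mean()
uncertainty = lv_p.exp()  # per-coordinate localization uncertainty
```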

[213] Hybrid-supervised Hypergraph-enhanced Transformer for Micro-gesture Based Emotion Recognition

Zhaoqiang Xia, Hexiang Huang, Haoyu Chen, Xiaoyi Feng, Guoying Zhao

Main category: cs.CV

TL;DR: The paper proposes a hypergraph-enhanced Transformer framework for emotion recognition from micro-gestures, combining self-supervised and supervised learning, achieving state-of-the-art results.

DetailsMotivation: Micro-gestures reveal human emotions, but emotion modeling based on them remains underexplored. This work aims to bridge this gap by leveraging hypergraph-enhanced Transformers.

Method: A hybrid-supervised framework with hypergraph-enhanced Transformer encoder/decoder, self-reconstruction tasks, and emotion recognition heads.

Result: Outperforms existing methods on iMiGUE and SMG datasets under multiple metrics.

Conclusion: The proposed method effectively models micro-gestures for emotion recognition, demonstrating superior performance.

Abstract: Micro-gestures are unconsciously performed body gestures that can convey human emotional states, and they are beginning to attract research attention in human behavior understanding and affective computing as an emerging topic. However, modeling human emotion from micro-gestures has not been explored sufficiently. In this work, we propose to recognize emotional states from micro-gestures by reconstructing behavior patterns with a hypergraph-enhanced Transformer in a hybrid-supervised framework. In the framework, a hypergraph-Transformer-based encoder and decoder are separately designed by stacking hypergraph-enhanced self-attention and multiscale temporal convolution modules. In particular, to better capture the subtle motion of micro-gestures, we construct a decoder with additional upsampling operations for a reconstruction task in a self-supervised learning manner. We further propose a hypergraph-enhanced self-attention module in which the hyperedges between skeleton joints are gradually updated to represent the relationships of body joints for modeling subtle local motion. Lastly, to exploit the relationship between emotional states and the local motion of micro-gestures, an emotion recognition head operating on the encoder output is designed with a shallow architecture and learned in a supervised way. The end-to-end framework is jointly trained in a single stage by comprehensively utilizing self-reconstruction and supervision information. The proposed method is evaluated on two publicly available datasets, namely iMiGUE and SMG, and achieves the best performance under multiple metrics, outperforming existing methods.

[214] Region-aware Depth Scale Adaptation with Sparse Measurements

Rizhao Fan, Tianfang Ma, Zhigen Li, Ning An, Jian Cheng

Main category: cs.CV

TL;DR: A non-learning-based method uses sparse depth measurements to convert relative-scale depth predictions from foundation models into metric-scale depth, preserving generalization without retraining.

DetailsMotivation: Foundation models for depth prediction often output relative-scale depth, limiting real-world application. Existing scale adaptation methods are costly and reduce generalization.

Method: Leverages sparse depth measurements to adapt relative-scale predictions to metric-scale without retraining or fine-tuning.

Result: Effectively bridges the gap between relative and metric depth, maintaining generalization and avoiding additional computational costs.

Conclusion: The approach enables metric-scale depth prediction without compromising the foundation models’ generalization or requiring costly adaptations.

Abstract: In recent years, the emergence of foundation models for depth prediction has led to remarkable progress, particularly in zero-shot monocular depth estimation. These models generate impressive depth predictions; however, their outputs are often in relative scale rather than metric scale. This limitation poses challenges for direct deployment in real-world applications. To address this, several scale adaptation methods have been proposed to enable foundation models to produce metric depth. However, these methods are typically costly, as they require additional training on new domains and datasets. Moreover, fine-tuning these models often compromises their original generalization capabilities, limiting their adaptability across diverse scenes. In this paper, we introduce a non-learning-based approach that leverages sparse depth measurements to adapt the relative-scale predictions of foundation models into metric-scale depth. Our method requires neither retraining nor fine-tuning, thereby preserving the strong generalization ability of the original foundation models while enabling them to produce metric depth. Experimental results demonstrate the effectiveness of our approach, highlighting its potential to bridge the gap between relative and metric depth without incurring additional computational costs or sacrificing generalization ability.
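
The global variant of such non-learning scale adaptation is a two-parameter least-squares fit, sketched below; the paper's region-aware method would fit parameters per region rather than once per image, so treat this as a simplified baseline.

```python
import numpy as np

def fit_scale_shift(rel_depth, sparse_metric, mask):
    """Fit metric = a * relative + b by least squares over pixels with sparse
    metric measurements, then apply it everywhere. Global simplification of
    the paper's region-aware adaptation.
    rel_depth: (H,W) relative-scale prediction; sparse_metric: (H,W) valid
    where mask is True."""
    r, m = rel_depth[mask], sparse_metric[mask]
    A = np.stack([r, np.ones_like(r)], axis=1)      # (n, 2) design matrix
    (a, b), *_ = np.linalg.lstsq(A, m, rcond=None)
    return a * rel_depth + b

rel = np.random.rand(64, 64)
metric_gt = 3.0 * rel + 1.5
mask = np.random.rand(64, 64) < 0.02                # ~80 sparse points
print(np.abs(fit_scale_shift(rel, metric_gt, mask) - metric_gt).max())  # ~0
```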

[215] BeatFormer: Efficient motion-robust remote heart rate estimation through unsupervised spectral zoomed attention filters

Joaquim Comas, Federico Sukno

Main category: cs.CV

TL;DR: BeatFormer is a lightweight spectral attention model for rPPG estimation, combining deep learning and handcrafted methods for robustness and efficiency. It uses spectral contrastive learning (SCL) to train without PPG or HR labels and performs well in cross-dataset evaluations.

DetailsMotivation: Existing rPPG methods either rely on large datasets (deep learning) or linear assumptions (handcrafted methods), limiting performance. A hybrid approach is needed to combine their strengths.

Method: BeatFormer integrates zoomed orthonormal complex attention and frequency-domain energy measurement. It uses SCL for training without PPG or HR labels.

Result: Validated on PURE, UBFC-rPPG, and MMPD datasets, BeatFormer shows robustness and performance, especially in cross-dataset evaluations under motion.

Conclusion: BeatFormer offers a lightweight, efficient, and robust solution for rPPG estimation by combining spectral attention and contrastive learning.

Abstract: Remote photoplethysmography (rPPG) captures cardiac signals from facial videos and is gaining attention for its diverse applications. While deep learning has advanced rPPG estimation, it relies on large, diverse datasets for effective generalization. In contrast, handcrafted methods utilize physiological priors for better generalization in unseen scenarios like motion while maintaining computational efficiency. However, their linear assumptions limit performance in complex conditions, where deep learning provides superior pulsatile information extraction. This highlights the need for hybrid approaches that combine the strengths of both methods. To address this, we present BeatFormer, a lightweight spectral attention model for rPPG estimation, which integrates zoomed orthonormal complex attention and frequency-domain energy measurement, enabling a highly efficient model. Additionally, we introduce Spectral Contrastive Learning (SCL), which allows BeatFormer to be trained without any PPG or HR labels. We validate BeatFormer on the PURE, UBFC-rPPG, and MMPD datasets, demonstrating its robustness and performance, particularly in cross-dataset evaluations under motion scenarios.
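
For readers unfamiliar with frequency-domain HR readout, the generic sketch below recovers heart rate as the spectral peak inside the plausible HR band. This is standard rPPG post-processing shown only to ground the frequency-domain energy idea; it is not BeatFormer's zoomed orthonormal complex attention.

```python
import numpy as np

def hr_from_spectrum(signal, fs=30.0, band=(0.7, 3.0)):
    """Heart rate from an rPPG trace: zero-padded FFT, energy restricted to
    the plausible HR band (42-180 bpm), peak frequency converted to bpm.
    Generic post-processing, not the paper's learned spectral filters."""
    x = signal - signal.mean()
    n = 8 * len(x)                                # zero-pad: finer freq grid
    spec = np.abs(np.fft.rfft(x, n=n)) ** 2
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return 60.0 * freqs[in_band][np.argmax(spec[in_band])]

t = np.arange(0, 10, 1 / 30.0)                    # 10 s at 30 fps
sig = np.sin(2 * np.pi * 1.2 * t) + 0.5 * np.random.randn(len(t))
print(hr_from_spectrum(sig))                      # ~72 bpm
```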

[216] TriCLIP-3D: A Unified Parameter-Efficient Framework for Tri-Modal 3D Visual Grounding based on CLIP

Fan Li, Zanyi Wang, Zeyi Huang, Guang Dai, Jingdong Wang, Mengmeng Wang

Main category: cs.CV

TL;DR: A unified 2D pre-trained multi-modal network simplifies 3D visual grounding by processing RGB images, text, and point clouds together, reducing parameters and improving performance.

DetailsMotivation: Existing methods rely on separate encoders for different modalities, leading to inefficiency and complexity. The goal is to unify feature extraction and fusion for better adaptability.

Method: Leverages a 2D CLIP bi-modal model with adapter-based fine-tuning, introduces a GARF module for geometric feature fusion, and uses a multi-modal decoder for cross-modal understanding.

Result: Reduces trainable parameters by ~58%, improves 3D detection by 6.52%, and 3D visual grounding by 6.25%.

Conclusion: The proposed unified approach simplifies architecture, enhances performance, and reduces training inefficiency for 3D visual grounding.

Abstract: 3D visual grounding allows an embodied agent to understand visual information in real-world 3D environments based on human instructions, which is crucial for embodied intelligence. Existing 3D visual grounding methods typically rely on separate encoders for different modalities (e.g., RGB images, text, and 3D point clouds), resulting in large and complex models that are inefficient to train. While some approaches use pre-trained 2D multi-modal models like CLIP for 3D tasks, they still struggle with aligning point cloud data to 2D encoders. As a result, these methods continue to depend on 3D encoders for feature extraction, further increasing model complexity and training inefficiency. In this paper, we propose a unified 2D pre-trained multi-modal network to process all three modalities (RGB images, text, and point clouds), significantly simplifying the architecture. By leveraging a 2D CLIP bi-modal model with adapter-based fine-tuning, this framework effectively adapts to the tri-modal setting, improving both adaptability and performance across modalities. Our Geometric-Aware 2D-3D Feature Recovery and Fusion (GARF) module is designed to fuse geometric multi-scale features from point clouds and images. We then integrate textual features for final modality fusion and introduce a multi-modal decoder to facilitate deep cross-modal understanding. Together, our method achieves unified feature extraction and fusion across the three modalities, enabling an end-to-end 3D visual grounding model. Compared to the baseline, our method reduces the number of trainable parameters by approximately 58%, while achieving a 6.52% improvement in the 3D detection task and a 6.25% improvement in the 3D visual grounding task.

[217] Semantic-Aware Representation Learning for Multi-label Image Classification

Ren-Dong Xie, Zhi-Fen He, Bo Li, Bin Liu, Jin-Yan Hu

Main category: cs.CV

TL;DR: Proposes SARL for multi-label image classification, using semantic-aware features and optimal transport-based attention to improve precision.

DetailsMotivation: Existing methods like attention mechanisms or GCNs may produce noisy or imprecise object localization.

Method: Uses label semantic-related feature learning, optimal transport-based attention, and regional score aggregation.

Result: Outperforms existing methods on PASCAL VOC 2007 and MS-COCO datasets.

Conclusion: SARL effectively improves multi-label classification by enhancing semantic alignment and precision.

Abstract: Multi-label image classification, an important research area in computer vision, focuses on identifying multiple labels or concepts within an image. Existing approaches often employ attention mechanisms or graph convolutional networks (GCNs) to learn image representation. However, this representation may contain noise and may not locate objects precisely. Therefore, this paper proposes a Semantic-Aware Representation Learning (SARL) for multi-label image classification. First, a label semantic-related feature learning module is utilized to extract semantic-related features. Then, an optimal transport-based attention mechanism is designed to obtain semantically aligned image representation. Finally, a regional score aggregation strategy is used for multi-label prediction. Experimental results on two benchmark datasets, PASCAL VOC 2007 and MS-COCO, demonstrate the superiority of SARL over existing methods.
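
A minimal sketch of optimal-transport attention follows: an entropic Sinkhorn solver turns a label-region cost matrix into a transport plan that serves as the attention map, encouraging each label to claim distinct regions. Uniform marginals and a cosine cost are common defaults assumed here, not necessarily the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def sinkhorn_attention(label_emb, region_feat, n_iters=50, eps=0.05):
    """OT-based attention: a Sinkhorn-normalized transport plan between label
    embeddings (L, D) and region features (R, D) replaces softmax attention.
    Entropic OT with uniform marginals is an assumed default."""
    cost = 1 - F.normalize(label_emb, dim=-1) @ \
               F.normalize(region_feat, dim=-1).T          # (L, R) cosine cost
    K = torch.exp(-cost / eps)
    u = torch.ones(cost.shape[0]) / cost.shape[0]
    v = torch.ones(cost.shape[1]) / cost.shape[1]
    a, b = u.clone(), v.clone()
    for _ in range(n_iters):                  # Sinkhorn iterations
        a = u / (K @ b)
        b = v / (K.T @ a)
    plan = a[:, None] * K * b[None, :]        # transport plan = attention map
    return plan @ region_feat                 # (L, D) label-aligned features

out = sinkhorn_attention(torch.randn(20, 128), torch.randn(49, 128))
print(out.shape)  # torch.Size([20, 128])
```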

[218] Stereo-GS: Multi-View Stereo Vision Model for Generalizable 3D Gaussian Splatting Reconstruction

Xiufeng Huang, Ka Chun Cheung, Runmin Cong, Simon See, Renjie Wan

Main category: cs.CV

TL;DR: A disentangled framework for efficient 3D Gaussian prediction, reducing computational demands while maintaining high-quality 3D reconstruction.

DetailsMotivation: Current methods for 3D Gaussian Splatting reconstruction are resource-intensive and slow due to entangled geometry and appearance prediction, relying on data-driven priors.

Method: Uses a stereo vision backbone to extract local image pair features, fuses them via global attention, and employs dedicated heads for geometry and appearance prediction, refined for high-quality output.

Result: Achieves pose-free 3D reconstruction, improving robustness and practicality with reduced resource demands.

Conclusion: The proposed method offers an efficient, scalable solution for real-world 3D content generation.

Abstract: Generalizable 3D Gaussian Splatting reconstruction showcases advanced Image-to-3D content creation but requires substantial computational resources and large datasets, posing challenges to training models from scratch. Current methods usually entangle the prediction of 3D Gaussian geometry and appearance, which rely heavily on data-driven priors and result in slow regression speeds. To address this, we propose Stereo-GS, a disentangled framework for efficient 3D Gaussian prediction. Our method extracts features from local image pairs using a stereo vision backbone and fuses them via global attention blocks. Dedicated point and Gaussian prediction heads generate multi-view point-maps for geometry and Gaussian features for appearance, combined as GS-maps to represent the 3DGS object. A refinement network enhances these GS-maps for high-quality reconstruction. Unlike existing methods that depend on camera parameters, our approach achieves pose-free 3D reconstruction, improving robustness and practicality. By reducing resource demands while maintaining high-quality outputs, Stereo-GS provides an efficient, scalable solution for real-world 3D content generation.

[219] 3-Dimensional CryoEM Pose Estimation and Shift Correction Pipeline

Kaishva Chintan Shah, Virajith Boddapati, Karthik S. Gurumoorthy, Sandip Kaledhonkar, Ajit Rajwade

Main category: cs.CV

TL;DR: A robust cryo-EM pipeline that jointly estimates particle rotations via an $\ell_1$-norm optimization over common-line geometry and corrects in-plane shifts with a global least-squares solve, improving reconstruction fidelity at very low SNR.

DetailsMotivation: Common-line estimates in cryo-EM are highly error-prone at very low SNR, and prior $\ell_2$-based formulations are noise-sensitive, enforce geometric constraints only approximately, and compound errors through sequential pipelines.

Method: Represents each rotation as an axis plus an in-plane unit vector, jointly optimized under a robust $\ell_1$-norm objective with unit-norm and orthogonality constraints enforced exactly via projected coordinate descent; an iterative algorithm recovers consistent in-plane translations through a global least-squares formulation.

Result: Consistently outperforms prior methods in both Euler angle accuracy and reconstruction fidelity, as measured by Fourier Shell Correlation (FSC).

Conclusion: Robust joint optimization with exactly enforced geometric constraints yields more reliable pose estimation and higher-fidelity 3D reconstructions in low-SNR cryo-EM.

Abstract: Accurate pose estimation and shift correction are key challenges in cryo-EM due to the very low SNR, which directly impacts the fidelity of 3D reconstructions. We present an approach for pose estimation in cryo-EM that leverages multi-dimensional scaling (MDS) techniques in a robust manner to estimate the 3D rotation matrix of each particle from pairs of dihedral angles. We express the rotation matrix in the form of an axis of rotation and a unit vector in the plane perpendicular to the axis. The technique leverages the concept of common lines in 3D reconstruction from projections. However, common line estimation is ridden with large errors due to the very low SNR of cryo-EM projection images. To address this challenge, we introduce two complementary components: (i) a robust joint optimization framework for pose estimation based on an $\ell_1$-norm objective or a similar robust norm, which simultaneously estimates rotation axes and in-plane vectors while exactly enforcing unit norm and orthogonality constraints via projected coordinate descent; and (ii) an iterative shift correction algorithm that estimates consistent in-plane translations through a global least-squares formulation. While prior approaches have leveraged such embeddings and common-line geometry for orientation recovery, existing formulations typically rely on $\ell_2$-based objectives that are sensitive to noise, and enforce geometric constraints only approximately. These choices, combined with a sequential pipeline structure, can lead to compounding errors and suboptimal reconstructions in low-SNR regimes. Our pipeline consistently outperforms prior methods in both Euler angle accuracy and reconstruction fidelity, as measured by the Fourier Shell Correlation (FSC).
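
The global least-squares shift correction can be illustrated compactly: stack pairwise relative-shift measurements s_ij ≈ t_i − t_j into a linear system and solve for all per-image shifts at once, up to a global offset. This is a simplified sketch; in the paper the measurements come from common-line geometry and the solve is iterated with pose estimation.

```python
import numpy as np

def solve_shifts(pairs, rel_shifts, n):
    """Global least-squares shift correction: given measured relative in-plane
    translations s_ij ~ t_i - t_j for image pairs, solve for all per-image
    shifts t, with the gauge fixed by anchoring t_0 = (0, 0).
    pairs: list of (i, j); rel_shifts: (P, 2); returns (n, 2)."""
    A = np.zeros((len(pairs) + 1, n))
    b = np.zeros((len(pairs) + 1, 2))
    for row, (i, j) in enumerate(pairs):
        A[row, i], A[row, j] = 1.0, -1.0
        b[row] = rel_shifts[row]
    A[-1, 0] = 1.0                          # gauge fix: t_0 = (0, 0)
    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t

t_true = np.random.randn(5, 2)
pairs = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4), (1, 3)]
meas = np.array([t_true[i] - t_true[j] for i, j in pairs]) \
       + 0.01 * np.random.randn(6, 2)
t_est = solve_shifts(pairs, meas, 5)
print(np.abs((t_est - t_est[0]) - (t_true - t_true[0])).max())  # small
```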

[220] Probabilistic smooth attention for deep multiple instance learning in medical imaging

Francisco M. Castro-Macías, Pablo Morales-Álvarez, Yunan Wu, Rafael Molina, Aggelos K. Katsaggelos

Main category: cs.CV

TL;DR: A probabilistic framework for Multiple Instance Learning (MIL) in medical imaging improves predictive performance and provides interpretable uncertainty maps.

DetailsMotivation: Addressing the deterministic treatment of attention values in deep MIL methods, which may overlook uncertainty in instance contributions.

Method: Proposes a probabilistic framework estimating distributions over attention values, capturing global and local interactions.

Result: Achieves top predictive performance across three medical datasets and eleven baselines, with interpretable uncertainty maps.

Conclusion: The probabilistic approach enhances MIL in medical imaging by improving accuracy and interpretability.

Abstract: The Multiple Instance Learning (MIL) paradigm is attracting plenty of attention in medical imaging classification, where labeled data is scarce. MIL methods cast medical images as bags of instances (e.g. patches in whole slide images, or slices in CT scans), and only bag labels are required for training. Deep MIL approaches have obtained promising results by aggregating instance-level representations via an attention mechanism to compute the bag-level prediction. These methods typically capture both local interactions among adjacent instances and global, long-range dependencies through various mechanisms. However, they treat attention values deterministically, potentially overlooking uncertainty in the contribution of individual instances. In this work we propose a novel probabilistic framework that estimates a probability distribution over the attention values, and accounts for both global and local interactions. In a comprehensive evaluation involving eleven state-of-the-art baselines and three medical datasets, we show that our approach achieves top predictive performance in different metrics. Moreover, the probabilistic treatment of the attention provides uncertainty maps that are interpretable in terms of illness localization.
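
A minimal sketch of the core idea, under the assumption of a Gaussian distribution per attention logit with reparameterized sampling, is shown below; the paper's framework additionally smooths attention through global and local interactions, which this stand-in omits.

```python
import torch
import torch.nn as nn

class ProbAttentionMIL(nn.Module):
    """Attention MIL with a distribution over attention values: each instance
    gets a Gaussian attention logit (mean + log-variance); reparameterized
    sampling yields stochastic bag predictions, and the per-instance variance
    gives an uncertainty map. Minimal sketch under assumed Gaussian form."""

    def __init__(self, dim=256):
        super().__init__()
        self.mu = nn.Linear(dim, 1)
        self.logvar = nn.Linear(dim, 1)
        self.cls = nn.Linear(dim, 1)

    def forward(self, bag):                       # bag: (N, dim) instances
        mu, logvar = self.mu(bag), self.logvar(bag)
        a = mu + torch.randn_like(mu) * logvar.mul(0.5).exp()  # reparam sample
        w = torch.softmax(a, dim=0)               # (N, 1) attention weights
        z = (w * bag).sum(dim=0)                  # bag embedding
        return self.cls(z), logvar.exp()          # prediction, uncertainty map

logit, unc = ProbAttentionMIL()(torch.randn(100, 256))
```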

[221] Open-set Cross Modal Generalization via Multimodal Unified Representation

Hai Huang, Yan Xia, Shulei Wang, Hanting Wang, Minghui Fang, Shengpeng Ji, Sashuai Zhou, Tao Jin, Zhou Zhao

Main category: cs.CV

TL;DR: The paper introduces Open-set Cross Modal Generalization (OSCMG), a more challenging task than CMG, to evaluate multimodal unified representations in open-set conditions. It proposes MICU with FCMI and CUJP components to address the limitations of existing methods.

DetailsMotivation: Prior work lacks consideration for open-set environments in multimodal unified representations, which is crucial for real-world applications.

Method: Proposes MICU with two components: FCMI for multimodal alignment via contrastive learning and masking, and CUJP for feature diversity and uncertainty handling.

Result: Extensive experiments validate MICU’s effectiveness on both CMG and OSCMG tasks.

Conclusion: MICU successfully addresses the challenges of OSCMG, enhancing generalization to unseen classes in open-set conditions.

Abstract: This paper extends Cross Modal Generalization (CMG) to open-set environments by proposing the more challenging Open-set Cross Modal Generalization (OSCMG) task. This task evaluates multimodal unified representations in open-set conditions, addressing the limitations of prior closed-set cross-modal evaluations. OSCMG requires not only cross-modal knowledge transfer but also robust generalization to unseen classes within new modalities, a scenario frequently encountered in real-world applications. Existing multimodal unified representation work lacks consideration for open-set environments. To tackle this, we propose MICU, comprising two key components: Fine-Coarse Masked multimodal InfoNCE (FCMI) and Cross modal Unified Jigsaw Puzzles (CUJP). FCMI enhances multimodal alignment by applying contrastive learning at both holistic semantic and temporal levels, incorporating masking to enhance generalization. CUJP enhances feature diversity and model uncertainty by integrating modality-agnostic feature selection with self-supervised learning, thereby strengthening the model’s ability to handle unknown categories in open-set tasks. Extensive experiments on CMG and the newly proposed OSCMG validate the effectiveness of our approach. The code is available at https://github.com/haihuangcode/CMG.
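
The masked-contrastive ingredient of FCMI can be sketched as a cross-modal InfoNCE loss where a random subset of embedding dimensions is zeroed before contrasting, so alignment cannot rely on any single feature subspace. This simplified stand-in ignores FCMI's two-level (holistic and temporal) structure.

```python
import torch
import torch.nn.functional as F

def masked_infonce(za, zb, mask_ratio=0.3, tau=0.07):
    """Cross-modal InfoNCE with random feature masking; the shared mask across
    both modalities is a design assumption of this sketch.
    za, zb: (B, D) paired embeddings from two modalities."""
    keep = (torch.rand_like(za) > mask_ratio).float()
    za = F.normalize(za * keep, dim=-1)
    zb = F.normalize(zb * keep, dim=-1)
    logits = za @ zb.T / tau                      # (B, B) similarity matrix
    labels = torch.arange(za.shape[0])
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

loss = masked_infonce(torch.randn(32, 256), torch.randn(32, 256))
```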

[222] Polymorph: Energy-Efficient Multi-Label Classification for Video Streams on Embedded Devices

Saeid Ghafouri, Mohsen Fayyaz, Xiangchen Li, Deepu John, Bo Ji, Dimitrios Nikolopoulos, Hans Vandierendonck

Main category: cs.CV

TL;DR: Polymorph is a context-aware framework for real-time multi-label video classification on embedded devices, using lightweight Low Rank Adapters (LoRA) to reduce energy and improve efficiency.

DetailsMotivation: Limited compute and energy budgets on embedded devices require efficient inference for real-time multi-label video classification.

Method: Polymorph dynamically activates minimal sets of LoRA adapters per frame, specialized for subsets of classes based on co-occurrence patterns, avoiding full-model switching.

Result: Polymorph reduces energy consumption by 40% and improves mAP by 9 points on the TAO dataset.

Conclusion: Polymorph offers a scalable, efficient solution for real-time video classification on resource-constrained devices.

Abstract: Real-time multi-label video classification on embedded devices is constrained by limited compute and energy budgets. Yet, video streams exhibit structural properties such as label sparsity, temporal continuity, and label co-occurrence that can be leveraged for more efficient inference. We introduce Polymorph, a context-aware framework that activates a minimal set of lightweight Low Rank Adapters (LoRA) per frame. Each adapter specializes in a subset of classes derived from co-occurrence patterns and is implemented as a LoRA weight over a shared backbone. At runtime, Polymorph dynamically selects and composes only the adapters needed to cover the active labels, avoiding full-model switching and weight merging. This modular strategy improves scalability while reducing latency and energy overhead. Polymorph achieves 40% lower energy consumption and improves mAP by 9 points over strong baselines on the TAO dataset. Polymorph is open source at https://github.com/inference-serving/polymorph/.
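
The adapter-composition step lends itself to a compact illustration. The sketch below is a hypothetical rendering, not the authors' code: each LoRA adapter carries the class subset it specializes in, a greedy set cover (one plausible selection rule) picks the fewest adapters needed for a frame's active labels, and their low-rank deltas are summed on top of a frozen shared backbone, so no full-model switching or weight merging is needed.

```python
# Hypothetical sketch of Polymorph-style per-frame adapter selection.
# All names and the greedy set-cover rule are illustrative assumptions.
import torch

class LoRAAdapter(torch.nn.Module):
    def __init__(self, dim, rank, classes):
        super().__init__()
        self.A = torch.nn.Parameter(torch.randn(dim, rank) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(rank, dim))
        self.classes = set(classes)  # label subset this adapter specializes in

    def delta(self, x):
        # Low-rank residual applied on top of the frozen shared backbone.
        return x @ self.A @ self.B

def select_adapters(active_labels, adapters):
    """Greedy set cover: fewest adapters whose class subsets cover the labels."""
    remaining, chosen = set(active_labels), []
    while remaining:
        best = max(adapters, key=lambda a: len(a.classes & remaining))
        if not best.classes & remaining:
            break  # no adapter covers the leftover labels
        chosen.append(best)
        remaining -= best.classes
    return chosen

# Usage: compose only the adapters needed for this frame's active labels.
backbone = torch.nn.Linear(256, 256)
adapters = [LoRAAdapter(256, 4, {0, 1}), LoRAAdapter(256, 4, {2}), LoRAAdapter(256, 4, {1, 3})]
x = torch.randn(1, 256)
out = backbone(x) + sum(a.delta(x) for a in select_adapters({1, 2}, adapters))
```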

[223] Decision PCR: Decision version of the Point Cloud Registration task

Yaojie Zhang, Tianlun Huang, Weijun Wang, Wei Feng

Main category: cs.CV

TL;DR: The paper addresses low-overlap point cloud registration (PCR) by proposing a data-driven deep learning classifier to evaluate registration quality, improving performance of existing methods.

DetailsMotivation: Traditional metrics like Maximum Inlier Count fail under low inlier ratios, prompting the need for a better evaluation method.

Method: A deep learning-based classifier is trained on a dataset derived from 3DMatch to assess registration quality, integrated into standard PCR pipelines.

Result: Integration with GeoTransformer achieves 86.97% registration recall on 3DLoMatch and generalizes well on the ETH dataset.

Conclusion: The proposed classifier enhances PCR performance and generalizes effectively, setting a new benchmark for low-overlap registration.

Abstract: Low-overlap point cloud registration (PCR) remains a significant challenge in 3D vision. Traditional evaluation metrics, such as Maximum Inlier Count, become ineffective under extremely low inlier ratios. In this paper, we revisit the registration result evaluation problem and identify the Decision version of the PCR task as the fundamental problem. To address this Decision PCR task, we propose a data-driven approach. First, we construct a corresponding dataset based on the 3DMatch dataset. Then, a deep learning-based classifier is trained to reliably assess registration quality, overcoming the limitations of traditional metrics. To our knowledge, this is the first comprehensive study to address this task through a deep learning framework. We incorporate this classifier into standard PCR pipelines. When integrated with our approach, existing state-of-the-art PCR methods exhibit significantly enhanced registration performance. For example, combining our framework with GeoTransformer achieves a new SOTA registration recall of 86.97% on the challenging 3DLoMatch benchmark. Our method also demonstrates strong generalization capabilities on the unseen outdoor ETH dataset.
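
To make the "decision" step concrete, here is a minimal sketch, under stated assumptions, of how a learned quality classifier could replace Maximum Inlier Count when ranking candidate transforms in a PCR pipeline. The residual-statistics featurization and the small MLP are illustrative choices, not the paper's architecture.

```python
# Hedged sketch: score candidate rigid transforms with a learned classifier
# instead of Maximum Inlier Count. Feature design and network are assumptions.
import torch

class RegistrationQualityNet(torch.nn.Module):
    def __init__(self, in_dim=3):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(in_dim, 32), torch.nn.ReLU(),
            torch.nn.Linear(32, 1), torch.nn.Sigmoid(),
        )

    def forward(self, feats):  # feats: (B, in_dim) summary statistics
        return self.mlp(feats).squeeze(-1)  # probability the alignment is correct

def residual_features(src, dst, R, t):
    """Summary statistics of nearest-neighbor residuals after applying (R, t)."""
    warped = src @ R.T + t
    d = torch.cdist(warped, dst).min(dim=1).values  # distance to nearest target point
    return torch.stack([d.mean(), d.median(), (d < 0.05).float().mean()])

def pick_best(src, dst, candidates, net):
    """candidates: list of (R, t) hypotheses; returns the highest-scoring one."""
    feats = torch.stack([residual_features(src, dst, R, t) for R, t in candidates])
    return candidates[net(feats).argmax().item()]
```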

[224] Hierarchical Cross-modal Prompt Learning for Vision-Language Models

Hao Zheng, Shunzhi Yang, Zhuoxin He, Jinfeng Yang, Zhenhua Huang

Main category: cs.CV

TL;DR: HiCroPL is a hierarchical cross-modal prompt learning framework addressing modality isolation and hierarchical semantic decay in VLMs, improving generalization through bidirectional knowledge flow between text and vision.

DetailsMotivation: Adapting large-scale VLMs like CLIP to downstream tasks without losing generalization is challenging due to modality isolation and semantic decay.

Method: HiCroPL uses bidirectional knowledge flow, hierarchical knowledge mapping, and layer-specific proxies to refine semantics between text and vision.

Result: Achieves state-of-the-art results on 11 benchmarks across four tasks.

Conclusion: HiCroPL effectively enhances generalization in VLMs by addressing key bottlenecks in prompt learning.

Abstract: Pre-trained Vision-Language Models (VLMs) such as CLIP have shown excellent generalization abilities. However, adapting these large-scale models to downstream tasks while preserving their generalization capabilities remains challenging. Although prompt learning methods have shown promise, they suffer from two fundamental bottlenecks that limit generalization: (a) modality isolation, and (b) hierarchical semantic decay. To address these limitations, we propose HiCroPL, a Hierarchical Cross-modal Prompt Learning framework that establishes bidirectional knowledge flow between text and vision modalities, enabling them to refine their semantics mutually. HiCroPL routes knowledge flows by leveraging the complementary strengths of text and vision. In early layers, text prompts inject relatively clear semantics into visual prompts through a hierarchical knowledge mapper, enhancing the representation of low-level visual semantics. In later layers, visual prompts encoding specific task-relevant objects flow back to refine text prompts, enabling deeper alignment. Crucially, our hierarchical knowledge mapper allows representations at multi-scales to be fused, ensuring that deeper representations retain transferable shallow semantics thereby enhancing generalization. We further introduce a lightweight layer-specific knowledge proxy to enable efficient cross-modal interactions. Extensive evaluations across four tasks demonstrate HiCroPL’s superior performance, achieving state-of-the-art results on 11 benchmarks with significant improvements. Code is available at: https://github.com/zzeoZheng/HiCroPL.

[225] Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression

Roy H. Jennings, Genady Paikin, Roy Shaul, Evgeny Soloveichik

Main category: cs.CV

TL;DR: Current MLLM approaches for image-based regression underperform due to generic prompts and preset vocabularies. RvTC, a bin-based method, outperforms by using flexible bins and data-specific prompts, achieving state-of-the-art results.

DetailsMotivation: Existing methods fail to leverage textual input's semantic understanding, performing no better than image-only models. The study aims to improve MLLM performance by addressing these limitations.

Method: Proposes RvTC, replacing vocabulary-constrained classification with a flexible bin-based approach. Uses data-specific prompts to enhance cross-modal understanding.

Result: RvTC achieves state-of-the-art performance on four datasets. Semantic prompts (e.g., challenge titles) boost correlations from 0.83 to 0.90 on AVA.

Conclusion: Semantic textual context is crucial for MLLMs in regression tasks. RvTC and tailored prompts significantly outperform generic approaches.

Abstract: Multimodal Large Language Models (MLLMs) show promise for image-based regression tasks, but current approaches face key limitations. Recent methods fine-tune MLLMs using preset output vocabularies and generic task-level prompts (e.g., “How would you rate this image?”), assuming this mimics human rating behavior. Our analysis reveals these approaches provide no benefit over image-only training. Models using preset vocabularies and generic prompts perform equivalently to image-only models, failing to leverage semantic understanding from textual input. We propose Regression via Transformer-Based Classification (RvTC), which replaces vocabulary-constrained classification with a flexible bin-based approach. Unlike approaches that address discretization errors through complex distributional modeling, RvTC eliminates manual vocabulary crafting through straightforward bin increase, achieving state-of-the-art performance on four image assessment datasets using only images. More importantly, we demonstrate that data-specific prompts dramatically improve performance. Unlike generic task descriptions, prompts containing semantic information about specific images enable MLLMs to leverage cross-modal understanding. On the AVA dataset, adding challenge titles to prompts improves correlations from 0.83 to 0.90, a new state-of-the-art. We demonstrate through empirical evidence from the AVA and AGIQA-3k datasets that MLLMs benefit from semantic prompt information surpassing mere statistical biases. This underscores the importance of incorporating meaningful textual context in multimodal regression tasks.
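
A minimal sketch of the regression-via-classification idea follows; the bin count, score range, and head layout are illustrative assumptions rather than the RvTC implementation. Training uses cross-entropy against the nearest bin, and inference takes the expectation over bin centers, so adding bins shrinks discretization error without any vocabulary crafting.

```python
# Sketch of a bin-based regression head (assumed design, not the authors' code).
import torch

class BinRegressionHead(torch.nn.Module):
    def __init__(self, feat_dim, lo, hi, n_bins=100):
        super().__init__()
        self.head = torch.nn.Linear(feat_dim, n_bins)
        # Bin centers over the score range; more bins -> finer resolution.
        self.register_buffer("centers", torch.linspace(lo, hi, n_bins))

    def loss(self, feats, targets):
        # Train with cross-entropy against the (approximately) nearest bin index.
        idx = torch.bucketize(targets, self.centers).clamp(max=len(self.centers) - 1)
        return torch.nn.functional.cross_entropy(self.head(feats), idx)

    def predict(self, feats):
        # Expected value under the softmax gives back a continuous score.
        p = self.head(feats).softmax(dim=-1)
        return p @ self.centers
```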

[226] Axis-Aligned Document Dewarping

Chaoyun Wang, I-Chao Shen, Takeo Igarashi, Nanning Zheng, Caigui Jiang

Main category: cs.CV

TL;DR: The paper introduces a novel method for document dewarping by leveraging axis-aligned geometric constraints, improving performance on benchmarks.

DetailsMotivation: Existing methods rely on supervised regression without utilizing inherent geometric properties of documents.

Method: Proposes axis-aligned geometric constraints during training and axis alignment preprocessing during inference.

Result: Achieves SOTA results on benchmarks with 18.2%~34.5% improvements on the new AAD metric.

Conclusion: The method effectively enhances document dewarping by incorporating geometric properties and human visual perception.

Abstract: Document dewarping is crucial for many applications. However, existing learning-based methods primarily rely on supervised regression with annotated data, without leveraging the inherent geometric properties of physical documents in the dewarping process. Our key insight is that a well-dewarped document is characterized by transforming distorted feature lines into axis-aligned ones. This property aligns with the inherent axis-aligned nature of the discrete grid geometry in planar documents. In the training phase, we propose an axis-aligned geometric constraint to enhance document dewarping. In the inference phase, we propose an axis alignment preprocessing strategy to reduce the dewarping difficulty. In the evaluation phase, we introduce a new metric, Axis-Aligned Distortion (AAD), that not only incorporates geometric meaning and aligns with human visual perception but also demonstrates greater robustness. As a result, our method achieves SOTA results on multiple existing benchmarks and achieves 18.2%~34.5% improvements on the AAD metric.
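
One plausible form of the axis-aligned constraint is a variance penalty: after dewarping, points sampled along a horizontal feature line should share one y value, and points along a vertical line one x value. The sketch below assumes such line-point samples are available (e.g., from annotations or a detector); it is an illustration of the idea, not the paper's loss.

```python
# Hedged sketch of an axis-aligned geometric constraint as a variance penalty.
import torch

def axis_aligned_loss(h_lines, v_lines):
    """h_lines / v_lines: lists of (N_i, 2) dewarped point sets per feature line."""
    loss = 0.0
    for pts in h_lines:
        loss = loss + pts[:, 1].var()  # y should be constant on a horizontal line
    for pts in v_lines:
        loss = loss + pts[:, 0].var()  # x should be constant on a vertical line
    return loss / max(len(h_lines) + len(v_lines), 1)
```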

[227] FastSmoothSAM: A Fast Smooth Method For Segment Anything Model

Jiasheng Xu, Yewang Chen

Main category: cs.CV

TL;DR: The paper introduces a B-Spline curve fitting method to refine jagged edges in FastSAM, improving segmentation accuracy while maintaining real-time performance.

DetailsMotivation: FastSAM achieves real-time segmentation but produces jagged edges, limiting its accuracy. This work aims to enhance edge quality without sacrificing speed.

Method: A four-stage refining process using B-Spline curve fitting is applied to smooth edges in FastSAM, involving two rounds of curve fitting.

Result: The method improves visual quality and analytical accuracy of edges, maintaining real-time capabilities.

Conclusion: The refinement enhances FastSAM’s practical utility for applications like industrial automation and medical imaging, where precise edge recognition is vital.

Abstract: Accurately identifying and representing object edges is a challenging task in computer vision and image processing. The Segment Anything Model (SAM) has significantly influenced the field of image segmentation, but suffers from high memory consumption and long inference times, limiting its efficiency in real-time applications. To address these limitations, Fast Segment Anything (FastSAM) was proposed, achieving real-time segmentation. However, FastSAM often generates jagged edges that deviate from the true object shapes. Therefore, this paper introduces a novel refinement approach using B-Spline curve fitting techniques to enhance the edge quality in FastSAM. Leveraging the robust shape control and flexible geometric construction of B-Splines, a four-stage refining process involving two rounds of curve fitting is employed to effectively smooth jagged edges. This approach significantly improves the visual quality and analytical accuracy of object edges without compromising critical geometric information. The proposed method enhances the practical utility of FastSAM by improving segmentation accuracy while maintaining real-time processing capabilities. This advancement unlocks greater potential for FastSAM technology in various real-world scenarios, such as industrial automation, medical imaging, and autonomous systems, where precise and efficient edge recognition is crucial.
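
The core operation, fitting a smooth closed B-spline to a jagged contour, can be done in a few lines with SciPy. The paper's four-stage, two-round procedure and its parameters are not reproduced here; this single-pass sketch only illustrates the smoothing primitive, with the smoothing factor chosen arbitrarily.

```python
# Single-pass B-spline smoothing of a jagged segmentation contour (sketch).
import numpy as np
from scipy.interpolate import splprep, splev

def smooth_contour(contour, smoothing=5.0, n=400):
    """contour: (N, 2) array of ordered boundary points from a mask."""
    x, y = contour[:, 0], contour[:, 1]
    tck, _ = splprep([x, y], s=smoothing, per=True)  # closed (periodic) curve
    u = np.linspace(0.0, 1.0, n)
    xs, ys = splev(u, tck)
    return np.stack([xs, ys], axis=1)  # resampled, smoothed boundary
```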

[228] Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding

Yuanhan Zhang, Yunice Chew, Yuhao Dong, Aria Leo, Bo Hu, Ziwei Liu

Main category: cs.CV

TL;DR: The paper introduces Video-TT, a benchmark to evaluate video LLMs’ correctness and robustness in video understanding, revealing a significant performance gap compared to humans.

DetailsMotivation: Existing benchmarks fail to measure the gap between video LLMs and human intelligence in video interpretation, particularly in correctness and robustness.

Method: Video-TT includes 1,000 YouTube Shorts videos with open-ended and adversarial questions to test visual and narrative understanding.

Result: Evaluation shows video LLMs lag behind human performance in interpreting complex visual narratives and handling adversarial questions.

Conclusion: Video-TT highlights the need for improved video LLMs to bridge the gap with human-level video understanding.

Abstract: Human intelligence requires correctness and robustness, with the former being foundational for the latter. In video understanding, correctness ensures the accurate interpretation of visual content, and robustness maintains consistent performance in challenging conditions. Despite advances in video large language models (video LLMs), existing benchmarks inadequately reflect the gap between these models and human intelligence in maintaining correctness and robustness in video interpretation. We introduce the Video Thinking Test (Video-TT), to assess if video LLMs can interpret real-world videos as effectively as humans. Video-TT reflects genuine gaps in understanding complex visual narratives, and evaluates robustness against natural adversarial questions. Video-TT comprises 1,000 YouTube Shorts videos, each with one open-ended question and four adversarial questions that probe visual and narrative complexity. Our evaluation shows a significant gap between video LLMs and human performance.

[229] OpenBreastUS: Benchmarking Neural Operators for Wave Imaging Using Breast Ultrasound Computed Tomography

Zhijun Zeng, Youjia Zheng, Hao Hu, Zeyuan Dong, Yihang Zheng, Xinliang Liu, Jinzhuo Wang, Zuoqiang Shi, Linfeng Zhang, Yubing Li, He Sun

Main category: cs.CV

TL;DR: OpenBreastUS introduces a large-scale dataset for wave equation simulations, enabling benchmarking of neural operators for realistic medical imaging tasks like ultrasound computed tomography (USCT).

DetailsMotivation: Traditional wave equation solvers are computationally intensive and unstable, while existing neural operator datasets oversimplify real-world complexity, limiting practical applications.

Method: The paper presents OpenBreastUS, a dataset with 8,000 realistic breast phantoms and 16 million frequency-domain wave simulations, using real USCT configurations.

Result: The dataset allows benchmarking neural operators for forward simulation and inverse imaging, demonstrating efficient in vivo breast imaging with neural solvers.

Conclusion: OpenBreastUS bridges the gap between theory and practice, facilitating development and deployment of neural PDE solvers in real-world medical imaging.

Abstract: Accurate and efficient simulation of wave equations is crucial in computational wave imaging applications, such as ultrasound computed tomography (USCT), which reconstructs tissue material properties from observed scattered waves. Traditional numerical solvers for wave equations are computationally intensive and often unstable, limiting their practical applications for quasi-real-time image reconstruction. Neural operators offer an innovative approach by accelerating PDE solving using neural networks; however, their effectiveness in realistic imaging is limited because existing datasets oversimplify real-world complexity. In this paper, we present OpenBreastUS, a large-scale wave equation dataset designed to bridge the gap between theoretical equations and practical imaging applications. OpenBreastUS includes 8,000 anatomically realistic human breast phantoms and over 16 million frequency-domain wave simulations using real USCT configurations. It enables a comprehensive benchmarking of popular neural operators for both forward simulation and inverse imaging tasks, allowing analysis of their performance, scalability, and generalization capabilities. By offering a realistic and extensive dataset, OpenBreastUS not only serves as a platform for developing innovative neural PDE solvers but also facilitates their deployment in real-world medical imaging problems. For the first time, we demonstrate efficient in vivo imaging of the human breast using neural operator solvers.

[230] EBA-AI: Ethics-Guided Bias-Aware AI for Efficient Underwater Image Enhancement and Coral Reef Monitoring

Lyes Saad Saoud, Irfan Hussain

Main category: cs.CV

TL;DR: EBA-AI is an ethics-guided, bias-aware AI framework for underwater image enhancement, addressing dataset bias, computational costs, and transparency issues. It uses CLIP embeddings for bias mitigation and adaptive processing for efficiency, validated on multiple datasets.

DetailsMotivation: AI-based underwater image enhancement faces challenges like dataset bias, high computational costs, and lack of transparency, which can lead to misinterpretations in marine conservation efforts.

Method: EBA-AI leverages CLIP embeddings for bias detection and mitigation, integrates adaptive processing for energy efficiency, and employs uncertainty estimation and explainability techniques.

Result: Experiments show a controlled PSNR drop of 1.0 dB but significant computational savings, enabling real-time feasibility. EBA-AI outperforms existing methods in efficiency, fairness, and interpretability.

Conclusion: EBA-AI advances sustainable, bias-aware, and computationally efficient underwater image enhancement, supporting marine conservation with improved transparency and fairness.

Abstract: Underwater image enhancement is vital for marine conservation, particularly coral reef monitoring. However, AI-based enhancement models often face dataset bias, high computational costs, and lack of transparency, leading to potential misinterpretations. This paper introduces EBA-AI, an ethics-guided bias-aware AI framework to address these challenges. EBA-AI leverages CLIP embeddings to detect and mitigate dataset bias, ensuring balanced representation across varied underwater environments. It also integrates adaptive processing to optimize energy efficiency, significantly reducing GPU usage while maintaining competitive enhancement quality. Experiments on LSUI400, Oceanex, and UIEB100 show that while PSNR drops by a controlled 1.0 dB, computational savings enable real-time feasibility for large-scale marine monitoring. Additionally, uncertainty estimation and explainability techniques enhance trust in AI-driven environmental decisions. Comparisons with CycleGAN, FunIEGAN, RAUNENet, WaterNet, UGAN, PUGAN, and UTUIE validate EBA-AI’s effectiveness in balancing efficiency, fairness, and interpretability in underwater image processing. By addressing key limitations of AI-driven enhancement, this work contributes to sustainable, bias-aware, and computationally efficient marine conservation efforts. For interactive visualizations, animations, source code, and access to the preprint, visit: https://lyessaadsaoud.github.io/EBA-AI/

[231] OmniVTON: Training-Free Universal Virtual Try-On

Zhaotong Yang, Yuhui Li, Shengfeng He, Xinzhe Li, Yangyang Xu, Junyu Dong, Yong Du

Main category: cs.CV

TL;DR: OmniVTON is a training-free universal VTON framework that decouples garment and pose conditioning for high fidelity and adaptability across diverse settings.

DetailsMotivation: Existing VTON methods face challenges in cross-domain generalization (supervised) or data biases (unsupervised). A unified, training-free solution is needed.

Method: OmniVTON uses garment prior generation and boundary stitching for texture fidelity, and DDIM inversion for pose alignment, disentangling garment and pose constraints.

Result: OmniVTON outperforms in diverse datasets, garment types, and scenarios, including multi-human VTON.

Conclusion: OmniVTON is the first training-free universal VTON framework, achieving superior performance and enabling multi-human garment transfer.

Abstract: Image-based Virtual Try-On (VTON) techniques rely on either supervised in-shop approaches, which ensure high fidelity but struggle with cross-domain generalization, or unsupervised in-the-wild methods, which improve adaptability but remain constrained by data biases and limited universality. A unified, training-free solution that works across both scenarios remains an open challenge. We propose OmniVTON, the first training-free universal VTON framework that decouples garment and pose conditioning to achieve both texture fidelity and pose consistency across diverse settings. To preserve garment details, we introduce a garment prior generation mechanism that aligns clothing with the body, followed by continuous boundary stitching technique to achieve fine-grained texture retention. For precise pose alignment, we utilize DDIM inversion to capture structural cues while suppressing texture interference, ensuring accurate body alignment independent of the original image textures. By disentangling garment and pose constraints, OmniVTON eliminates the bias inherent in diffusion models when handling multiple conditions simultaneously. Experimental results demonstrate that OmniVTON achieves superior performance across diverse datasets, garment types, and application scenarios. Notably, it is the first framework capable of multi-human VTON, enabling realistic garment transfer across multiple individuals in a single scene. Code is available at https://github.com/Jerome-Young/OmniVTON

[232] Rethinking Pan-sharpening: Principled Design, Unified Training, and a Universal Loss Surpass Brute-Force Scaling

Ran Zhang, Xuanhua He, Li Xueheng, Ke Cao, Liu Liu, Wenbo Xu, Fang Jiabin, Yang Qize, Jie Zhang

Main category: cs.CV

TL;DR: PanTiny is a lightweight, efficient pan-sharpening framework trained on multiple satellite datasets, outperforming larger models with better generalization and a novel composite loss function.

DetailsMotivation: Address the inefficiency and poor generalization of large, complex pan-sharpening models trained on single datasets.

Method: Propose PanTiny, a single-step framework with multiple-in-one training on three satellite datasets (WV2, WV3, GF2) and a composite loss function.

Result: PanTiny achieves superior performance-to-efficiency balance, outperforming larger models and improving generalization on full-resolution data.

Conclusion: Principled engineering in model design, training, and loss functions can surpass brute-force scaling, advocating for efficient, generalizable models in pan-sharpening.

Abstract: The field of pan-sharpening has recently seen a trend towards increasingly large and complex models, often trained on single, specific satellite datasets. This approach, however, leads to high computational overhead and poor generalization on full resolution data, a paradigm we challenge in this paper. In response to this issue, we propose PanTiny, a lightweight, single-step pan-sharpening framework designed for both efficiency and robust performance. More critically, we introduce a multiple-in-one training paradigm, where a single, compact model is trained simultaneously on three distinct satellite datasets (WV2, WV3, and GF2) with different resolution and spectral information. Our experiments show that this unified training strategy not only simplifies deployment but also significantly boosts generalization on full-resolution data. Further, we introduce a universally powerful composite loss function that elevates the performance of almost all pan-sharpening models, pushing state-of-the-art metrics into a new era. Our PanTiny model, benefiting from these innovations, achieves a superior performance-to-efficiency balance, outperforming most larger, specialized models. Through extensive ablation studies, we validate that principled engineering in model design, training paradigms, and loss functions can surpass brute-force scaling. Our work advocates for a community-wide shift towards creating efficient, generalizable, and data-conscious models for pan-sharpening. The code is available at https://github.com/Zirconium233/PanTiny.

[233] StableAnimator++: Overcoming Pose Misalignment and Face Distortion for Human Image Animation

Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, Zuxuan Wu, Yu-Gang Jiang

Main category: cs.CV

TL;DR: StableAnimator++ is a video diffusion framework for human image animation that preserves identity (ID) consistency through learnable pose alignment and advanced modules, outperforming existing methods.

DetailsMotivation: Current diffusion models struggle with ID consistency when reference images and driving videos differ in body size or position. StableAnimator++ addresses this gap.

Method: The framework uses learnable pose alignment via SVD-guided similarity transformation matrices, image/face embeddings, a Face Encoder, and a distribution-aware ID Adapter. It also integrates HJB-based face optimization during inference.

Result: Experiments demonstrate StableAnimator++’s superior performance in maintaining ID consistency and generating high-quality videos.

Conclusion: StableAnimator++ effectively solves ID inconsistency in human image animation, offering a robust solution for realistic video generation.

Abstract: Current diffusion models for human image animation often struggle to maintain identity (ID) consistency, especially when the reference image and driving video differ significantly in body size or position. We introduce StableAnimator++, the first ID-preserving video diffusion framework with learnable pose alignment, capable of generating high-quality videos conditioned on a reference image and a pose sequence without any post-processing. Building upon a video diffusion model, StableAnimator++ contains carefully designed modules for both training and inference, striving for identity consistency. In particular, StableAnimator++ first uses learnable layers to predict the similarity transformation matrices between the reference image and the driven poses via injecting guidance from Singular Value Decomposition (SVD). These matrices align the driven poses with the reference image, mitigating misalignment to a great extent. StableAnimator++ then computes image and face embeddings using off-the-shelf encoders, refining the face embeddings via a global content-aware Face Encoder. To further maintain ID, we introduce a distribution-aware ID Adapter that counteracts interference caused by temporal layers while preserving ID via distribution alignment. During the inference stage, we propose a novel Hamilton-Jacobi-Bellman (HJB) based face optimization integrated into the denoising process, guiding the diffusion trajectory for enhanced facial fidelity. Experiments on benchmarks show the effectiveness of StableAnimator++ both qualitatively and quantitatively.

[234] Aesthetics is Cheap, Show me the Text: An Empirical Evaluation of State-of-the-Art Generative Models for OCR

Peirong Zhang, Haowei Xu, Jiaxin Zhang, Guitao Xu, Xuhan Zheng, Zhenhua Yang, Junle Liu, Yuyi Zhang, Lianwen Jin

Main category: cs.CV

TL;DR: The paper evaluates state-of-the-art generative models for text image generation and editing, identifying weaknesses and advocating for integrating these skills into general-domain models.

DetailsMotivation: Assess whether advanced generative models can handle the complexities of text image generation and editing, given their growing capabilities in other areas.

Method: Evaluates six models using 33 OCR tasks across five categories, with tailored inputs and prompts.

Result: Identifies weaknesses in current models and emphasizes the need for foundational text image skills in general-domain models.

Conclusion: Photorealistic text image generation should be a core skill in general models, not just specialized ones, with ongoing updates via GitHub.

Abstract: Text image is a unique and crucial information medium that integrates visual aesthetics and linguistic semantics in modern e-society. Due to their subtlety and complexity, the generation of text images represents a challenging and evolving frontier in the image generation field. The recent surge of specialized image generators (e.g., Flux-series) and unified generative models (e.g., GPT-4o), which demonstrate exceptional fidelity, raises a natural question: can they master the intricacies of text image generation and editing? Motivated by this, we assess current state-of-the-art generative models’ capabilities in terms of text image generation and editing. We incorporate various typical optical character recognition (OCR) tasks into our evaluation and broaden the concept of text-based generation tasks into OCR generative tasks. We select 33 representative tasks and categorize them into five categories: document, handwritten text, scene text, artistic text, and complex & layout-rich text. For comprehensive evaluation, we examine six models across both closed-source and open-source domains, using tailored, high-quality image inputs and prompts. Through this evaluation, we draw crucial observations and identify the weaknesses of current generative models for OCR tasks. We argue that photorealistic text image generation and editing should be internalized as foundational skills into general-domain generative models, rather than being delegated to specialized solutions, and we hope this empirical analysis can provide valuable insights for the community to achieve this goal. This evaluation is online and will be continuously updated at our GitHub repository.

[235] Visual Place Recognition for Large-Scale UAV Applications

Ioannis Tsampikos Papapetros, Ioannis Kansizoglou, Antonios Gasteratos

Main category: cs.CV

TL;DR: The paper introduces LASED, a large-scale aerial dataset, and steerable CNNs to improve visual place recognition (vPR) for UAV navigation, addressing challenges like dataset scarcity and rotational ambiguity.

DetailsMotivation: Aerial vPR lacks large-scale datasets and struggles with rotational variance in UAV imagery, limiting model generalization.

Method: Proposes LASED, a structured dataset with ~1M images, and integrates steerable CNNs to handle rotational variance.

Result: Models trained on LASED achieve higher recall, and steerable CNNs outperform conventional CNNs by 12% in recall.

Conclusion: Combining large-scale datasets with rotation-equivariant networks enhances robustness and generalization in aerial vPR.

Abstract: Visual Place Recognition (vPR) plays a crucial role in Unmanned Aerial Vehicle (UAV) navigation, enabling robust localization across diverse environments. Despite significant advancements, aerial vPR faces unique challenges due to the limited availability of large-scale, high-altitude datasets, which limits model generalization, along with the inherent rotational ambiguity in UAV imagery. To address these challenges, we introduce LASED, a large-scale aerial dataset with approximately one million images, systematically sampled from 170,000 unique locations throughout Estonia over a decade, offering extensive geographic and temporal diversity. Its structured design ensures clear place separation significantly enhancing model training for aerial scenarios. Furthermore, we propose the integration of steerable Convolutional Neural Networks (CNNs) to explicitly handle rotational variance, leveraging their inherent rotational equivariance to produce robust, orientation-invariant feature representations. Our extensive benchmarking demonstrates that models trained on LASED achieve significantly higher recall compared to those trained on smaller, less diverse datasets, highlighting the benefits of extensive geographic coverage and temporal diversity. Moreover, steerable CNNs effectively address rotational ambiguity inherent in aerial imagery, consistently outperforming conventional convolutional architectures, achieving on average 12% recall improvement over the best-performing non-steerable network. By combining structured, large-scale datasets with rotation-equivariant neural networks, our approach significantly enhances model robustness and generalization for aerial vPR.
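
Steerable CNNs bake rotational equivariance into the filters themselves; as a much cruder stand-in that conveys the intuition, the sketch below pools a descriptor over the four 90-degree rotations of the input, which makes the result exactly invariant to those rotations. This is a deliberately weaker, swapped-in technique for illustration, not the steerable architecture used in the paper.

```python
# Rotation-invariant pooling over the C4 orbit (crude stand-in for steerable CNNs).
import torch

def c4_invariant_descriptor(image, encoder):
    """image: (B, C, H, W); encoder: any module mapping images to vectors."""
    feats = [encoder(torch.rot90(image, k, dims=(2, 3))) for k in range(4)]
    return torch.stack(feats).mean(dim=0)  # averaging over the orbit kills order

# Rotating the input by 90 degrees only permutes the four rotated copies, so
# their mean, and hence the descriptor, is unchanged. Steerable CNNs achieve a
# stronger, layer-wise version of this property for finer rotation groups.
```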

[236] BleedOrigin: Dynamic Bleeding Source Localization in Endoscopic Submucosal Dissection via Dual-Stage Detection and Tracking

Mengya Xu, Rulin Zhou, An Wang, Chaoyang Lyu, Zhen Li, Ning Zhong, Hongliang Ren

Main category: cs.CV

TL;DR: The paper introduces BleedOrigin-Bench, a dataset for bleeding source detection in ESD, and BleedOrigin-Net, a dual-stage AI framework for localization and tracking, achieving high accuracy.

DetailsMotivation: Current AI methods lack focus on bleeding source detection and tracking in ESD, compounded by the absence of specialized datasets.

Method: Proposes BleedOrigin-Net, a dual-stage detection-tracking framework, and BleedOrigin-Bench dataset with expert annotations.

Result: Achieves 96.85% frame-level accuracy for onset detection, 70.24% for initial source detection, and 96.11% for tracking.

Conclusion: The work addresses critical gaps in ESD bleeding management with a robust dataset and AI framework, improving accuracy and efficiency.

Abstract: Intraoperative bleeding during Endoscopic Submucosal Dissection (ESD) poses significant risks, demanding precise, real-time localization and continuous monitoring of the bleeding source for effective hemostatic intervention. In particular, endoscopists have to repeatedly flush to clear blood, allowing only milliseconds to identify bleeding sources, an inefficient process that prolongs operations and elevates patient risks. However, current Artificial Intelligence (AI) methods primarily focus on bleeding region segmentation, overlooking the critical need for accurate bleeding source detection and temporal tracking in the challenging ESD environment, which is marked by frequent visual obstructions and dynamic scene changes. This gap is widened by the lack of specialized datasets, hindering the development of robust AI-assisted guidance systems. To address these challenges, we introduce BleedOrigin-Bench, the first comprehensive ESD bleeding source dataset, featuring 1,771 expert-annotated bleeding sources across 106,222 frames from 44 procedures, supplemented with 39,755 pseudo-labeled frames. This benchmark covers 8 anatomical sites and 6 challenging clinical scenarios. We also present BleedOrigin-Net, a novel dual-stage detection-tracking framework for the bleeding source localization in ESD procedures, addressing the complete workflow from bleeding onset detection to continuous spatial tracking. We compare with widely-used object detection models (YOLOv11/v12), multimodal large language models, and point tracking methods. Extensive evaluation demonstrates state-of-the-art performance, achieving 96.85% frame-level accuracy ($\pm\leq8$ frames) for bleeding onset detection, 70.24% pixel-level accuracy ($\leq100$ px) for initial source detection, and 96.11% pixel-level accuracy ($\leq100$ px) for point tracking.

[237] LoopNet: A Multitasking Few-Shot Learning Approach for Loop Closure in Large Scale SLAM

Mohammad-Maher Nakshbandi, Ziad Sharawy, Sorin Grigorescu

Main category: cs.CV

TL;DR: LoopNet improves SLAM loop closure detection with multitasking ResNet, online retraining, and DISK descriptors, outperforming traditional methods.

DetailsMotivation: Addressing challenges in SLAM loop closure: accuracy and real-time computation on embedded hardware.

Method: Multitasking ResNet variant with online retraining (few-shot learning) and DISK descriptors for feature extraction.

Result: Better performance under varying conditions compared to handcrafted features and traditional deep learning.

Conclusion: LoopNet and LoopDB dataset advance SLAM loop closure detection, with code and dataset publicly available.

Abstract: One of the main challenges in the Simultaneous Localization and Mapping (SLAM) loop closure problem is the recognition of previously visited places. In this work, we tackle the two main problems of real-time SLAM systems: 1) loop closure detection accuracy and 2) real-time computation constraints on the embedded hardware. Our LoopNet method is based on a multitasking variant of the classical ResNet architecture, adapted for online retraining on a dynamic visual dataset and optimized for embedded devices. The online retraining is designed using a few-shot learning approach. The architecture provides both an index into the queried visual dataset, and a measurement of the prediction quality. Moreover, by leveraging DISK (DIStinctive Keypoints) descriptors, LoopNet surpasses the limitations of handcrafted features and traditional deep learning methods, offering better performance under varying conditions. Code is available at https://github.com/RovisLab/LoopNet. Additionally, we introduce a new loop closure benchmarking dataset, coined LoopDB, which is available at https://github.com/RovisLab/LoopDB.

[238] Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction

Ce Zhang, Yale Song, Ruta Desai, Michael Louis Iuzzolino, Joseph Tighe, Gedas Bertasius, Satwik Kottur

Main category: cs.CV

TL;DR: VideoPlan improves visual planning for assistance by addressing data scarcity and action space modeling, achieving state-of-the-art results.

DetailsMotivation: Addressing challenges in training MLLMs for long-horizon visual planning due to scarce procedural annotations and inefficient next-token prediction.

Method: Uses Auxiliary Task Augmentation and Multi-token Prediction to enhance planning ability and model structured action spaces.

Result: Achieves SOTA performance on COIN and CrossTask datasets, surpassing prior methods by 7.3% and 3.4%, respectively.

Conclusion: VideoPlan effectively tackles key challenges in VPA and performs competitively on extended tasks without specialized features.

Abstract: Visual Planning for Assistance (VPA) aims to predict a sequence of user actions required to achieve a specified goal based on a video showing the user’s progress. Although recent advances in multimodal large language models (MLLMs) have shown promising results in video understanding, long-horizon visual planning remains a challenging problem. We identify two challenges in training large MLLMs for video-based planning tasks: (1) scarcity of procedural annotations, limiting the model’s ability to learn procedural task dynamics effectively, and (2) inefficiency of next-token prediction objective to explicitly capture the structured action space for visual planning when compared to free-form, natural language. To tackle data scarcity, we introduce Auxiliary Task Augmentation. We design and train our model on auxiliary tasks relevant to long-horizon video-based planning (e.g., goal prediction) to augment the model’s planning ability. To more explicitly model the structured action space unique to visual planning tasks, we leverage Multi-token Prediction, extending traditional next-token prediction by using multiple heads to predict multiple future tokens during training. Our approach, VideoPlan, achieves state-of-the-art VPA performance on the COIN and CrossTask datasets, surpassing prior methods by 7.3% and 3.4%, respectively, when predicting 3 future actions. We further extend our method to the challenging Ego4D Long-term Action Anticipation task, and show that it is on par with the state-of-the-art approaches despite not using specialized egocentric features. Code will be made available.
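
The multi-token prediction objective is straightforward to sketch: a shared trunk feeds K separate heads, and head k is supervised on the k-th future action. The layout below is an assumed rendering of that idea, not the VideoPlan code.

```python
# Sketch of multi-token prediction for a structured action space (assumed design).
import torch

class MultiTokenHead(torch.nn.Module):
    def __init__(self, d_model, n_actions, k_future=3):
        super().__init__()
        self.heads = torch.nn.ModuleList(
            torch.nn.Linear(d_model, n_actions) for _ in range(k_future)
        )

    def forward(self, h):  # h: (B, d_model) pooled video/goal representation
        return [head(h) for head in self.heads]  # K sets of action logits

    def loss(self, h, future_actions):  # future_actions: (B, K) int labels
        logits = self.forward(h)
        return sum(
            torch.nn.functional.cross_entropy(l, future_actions[:, k])
            for k, l in enumerate(logits)
        ) / len(logits)
```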

[239] Event-based Graph Representation with Spatial and Motion Vectors for Asynchronous Object Detection

Aayush Atul Verma, Arpitsinh Vaghela, Bharatesh Chakravarthi, Kaustav Chanda, Yezhou Yang

Main category: cs.CV

TL;DR: The paper proposes a spatiotemporal multigraph representation for event-based data, improving detection accuracy and efficiency over previous methods.

DetailsMotivation: Event-based sensors' sparse, asynchronous data loses advantages when converted to dense tensors for standard neural networks, prompting research into graph representations. Existing graph methods underperform due to poor spatiotemporal modeling.

Method: A novel spatiotemporal multigraph is introduced, with decoupled spatial (B-spline basis functions) and temporal (motion vector-based attention) graphs, enabling efficient 2D kernels instead of 3D ones.

Result: Tested on Gen1 automotive and eTraM datasets, the method achieves >6% higher detection accuracy, 5x speedup, fewer parameters, and no added computational cost.

Conclusion: Structured graph modeling effectively enhances asynchronous vision tasks, as demonstrated by the proposed method’s superior performance.

Abstract: Event-based sensors offer high temporal resolution and low latency by generating sparse, asynchronous data. However, converting this irregular data into dense tensors for use in standard neural networks diminishes these inherent advantages, motivating research into graph representations. While such methods preserve sparsity and support asynchronous inference, their performance on downstream tasks remains limited due to suboptimal modeling of spatiotemporal dynamics. In this work, we propose a novel spatiotemporal multigraph representation to better capture spatial structure and temporal changes. Our approach constructs two decoupled graphs: a spatial graph leveraging B-spline basis functions to model global structure, and a temporal graph utilizing motion vector-based attention for local dynamic changes. This design enables the use of efficient 2D kernels in place of computationally expensive 3D kernels. We evaluate our method on the Gen1 automotive and eTraM datasets for event-based object detection, achieving over a 6% improvement in detection accuracy compared to previous graph-based works, with a 5x speedup, reduced parameter count, and no increase in computational cost. These results highlight the effectiveness of structured graph modeling for asynchronous vision. Project page: eventbasedvision.github.io/eGSMV.
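
As a rough illustration of the decoupled-graph construction, the sketch below builds a spatial graph from 2D pixel proximity and a temporal graph linking each event to its recent predecessors. The radius and time window are made-up values, the B-spline spatial kernels and motion-vector attention are omitted, and the quadratic loop is kept only for clarity.

```python
# Naive sketch of decoupled spatial/temporal event graphs (O(N^2), demo only).
import numpy as np

def build_event_graphs(events, r=5.0, dt=1e-3):
    """events: (N, 3) array of (x, y, t), sorted by time t."""
    xy, t = events[:, :2], events[:, 2]
    spatial, temporal = [], []
    for i in range(len(events)):
        # Spatial edges: nearby pixels, regardless of small time offsets.
        d = np.linalg.norm(xy - xy[i], axis=1)
        for j in np.nonzero((d < r) & (np.arange(len(events)) != i))[0]:
            spatial.append((i, int(j)))
        # Temporal edges: earlier events within a short time window.
        for j in np.nonzero((t[i] - t > 0) & (t[i] - t < dt))[0]:
            temporal.append((int(j), i))
    return spatial, temporal
```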

[240] MeshMamba: State Space Models for Articulated 3D Mesh Generation and Reconstruction

Yusuke Yoshiyasu, Leyuan Sun, Ryusuke Sagawa

Main category: cs.CV

TL;DR: MeshMamba uses Mamba-SSMs for efficient 3D mesh learning, enabling large-scale generation and reconstruction of articulated meshes with over 10,000 vertices. It introduces MambaDiff3D for mesh generation and Mamba-HMR for single-image reconstruction, outperforming prior methods.

DetailsMotivation: To address the inefficiency and scalability challenges in learning 3D articulated mesh models, especially for large vertex counts and complex geometries like clothing and hands.

Method: MeshMamba serializes mesh vertices into structured orderings (e.g., by body parts or 3D locations) for Mamba-SSM processing. It includes MambaDiff3D (diffusion model) for mesh generation and Mamba-HMR for single-image mesh recovery.

Result: MambaDiff3D generates dense 3D human meshes with clothing and hands, outperforming previous methods. Mamba-HMR extends whole-body reconstruction (including face and hands) with competitive real-time performance.

Conclusion: MeshMamba advances 3D mesh learning by efficiently handling large vertex counts and complex geometries, with applications in generation and reconstruction tasks.

Abstract: In this paper, we introduce MeshMamba, a neural network model for learning 3D articulated mesh models by employing the recently proposed Mamba State Space Models (Mamba-SSMs). MeshMamba is efficient and scalable in handling a large number of input tokens, enabling the generation and reconstruction of body mesh models with more than 10,000 vertices, capturing clothing and hand geometries. The key to effectively learning MeshMamba is the serialization technique of mesh vertices into orderings that are easily processed by Mamba. This is achieved by sorting the vertices based on body part annotations or the 3D vertex locations of a template mesh, such that the ordering respects the structure of articulated shapes. Based on MeshMamba, we design 1) MambaDiff3D, a denoising diffusion model for generating 3D articulated meshes and 2) Mamba-HMR, a 3D human mesh recovery model that reconstructs a human body shape and pose from a single image. Experimental results showed that MambaDiff3D can generate dense 3D human meshes in clothes, with grasping hands, etc., and outperforms previous approaches in the 3D human shape generation task. Additionally, Mamba-HMR extends the capabilities of previous non-parametric human mesh recovery approaches, which were limited to handling body-only poses using around 500 vertex tokens, to the whole-body setting with face and hands, while achieving competitive performance in (near) real-time.
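
The serialization step the abstract highlights amounts to choosing a vertex ordering before the state-space model sees the sequence. The sketch below shows two such orderings, by body-part annotation or by template coordinates; the specific sort keys are assumptions for illustration.

```python
# Sketch of vertex serialization for a sequence model (ordering keys assumed).
import numpy as np

def serialize_vertices(verts, part_ids=None):
    """verts: (V, 3) template vertex locations; part_ids: optional (V,) labels."""
    if part_ids is not None:
        # Group by body part, breaking ties by height within each part.
        order = np.lexsort((verts[:, 1], part_ids))
    else:
        # Fall back to a coordinate sort on the template mesh (height first).
        order = np.lexsort((verts[:, 0], verts[:, 2], verts[:, 1]))
    return order  # feed verts[order] to the state-space model
```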

[241] Improving Joint Embedding Predictive Architecture with Diffusion Noise

Yuping Qiu, Rui Zhu, Ying-cong Chen

Main category: cs.CV

TL;DR: The paper proposes N-JEPA, combining diffusion noise with masked image modeling (MIM) to enhance self-supervised learning (SSL) for better recognition tasks.

DetailsMotivation: To bridge SSL and generative models for improved representation capacity, leveraging diffusion noise for semantic understanding.

Method: Introduces N-JEPA, integrating diffusion noise into MIM via masked tokens’ position embedding and multi-level noise schedules.

Result: Demonstrates effectiveness in downstream classification tasks.

Conclusion: N-JEPA successfully enhances SSL by combining diffusion models, with promising results and public code release planned.

Abstract: Self-supervised learning has become an incredibly successful method for feature learning, widely applied to many downstream tasks. It has proven especially effective for discriminative tasks, surpassing the trending generative models. However, generative models perform better in image generation and detail enhancement. Thus, it is natural for us to find a connection between SSL and generative models to further enhance the representation capacity of SSL. As generative models can create new samples by approximating the data distribution, such modeling should also lead to a semantic understanding of the raw visual data, which is necessary for recognition tasks. This enlightens us to combine the core principle of the diffusion model: diffusion noise, with SSL to learn a competitive recognition model. Specifically, diffusion noise can be viewed as a particular state of mask that reveals a close relationship between masked image modeling (MIM) and diffusion models. In this paper, we propose N-JEPA (Noise-based JEPA) to incorporate diffusion noise into MIM by the position embedding of masked tokens. The multi-level noise schedule is a series of feature augmentations to further enhance the robustness of our model. We perform a comprehensive study to confirm its effectiveness in the classification of downstream tasks. Codes will be released soon in public.
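
A minimal sketch of the noise-injection idea follows: masked token embeddings receive Gaussian noise at a level drawn from a multi-step schedule, while visible tokens pass through untouched. The schedule shape and the exact point where noise enters are assumptions here, not the N-JEPA specification.

```python
# Sketch: diffusion-style noise applied only to masked token embeddings.
import torch

def noise_masked_tokens(tokens, mask, n_levels=10, beta_max=0.5):
    """tokens: (B, N, D); mask: (B, N) bool, True where a patch is masked."""
    levels = torch.linspace(beta_max / n_levels, beta_max, n_levels)
    beta = levels[torch.randint(n_levels, (tokens.shape[0],))]  # one level per sample
    alpha = (1.0 - beta).sqrt().view(-1, 1, 1)
    sigma = beta.sqrt().view(-1, 1, 1)
    noisy = alpha * tokens + sigma * torch.randn_like(tokens)
    return torch.where(mask.unsqueeze(-1), noisy, tokens)  # visible tokens unchanged
```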

[242] Hierarchical Part-based Generative Model for Realistic 3D Blood Vessel

Siqi Chen, Guoqing Zhang, Jiahao Lai, Bingzhi Shen, Sihong Zhang, Caixia Dong, Xuejin Chen, Yang Li

Main category: cs.CV

TL;DR: A hierarchical part-based framework for 3D vessel generation separates global topology from local geometry, outperforming existing methods.

DetailsMotivation: Accurate representation of complex blood vessel geometry and topology is challenging due to intricate branching patterns and shapes.

Method: Three-stage approach: key graph generation for global structure, vessel segment generation for local details, and hierarchical assembly.

Result: Superior performance in modeling complex vascular networks compared to existing methods.

Conclusion: First successful part-based generative approach for 3D vessel modeling, setting a new benchmark.

Abstract: Advancements in 3D vision have increased the impact of blood vessel modeling on medical applications. However, accurately representing the complex geometry and topology of blood vessels remains a challenge due to their intricate branching patterns, curvatures, and irregular shapes. In this study, we propose a hierarchical part-based frame work for 3D vessel generation that separates the global binary tree-like topology from local geometric details. Our approach proceeds in three stages: (1) key graph generation to model the overall hierarchical struc ture, (2) vessel segment generation conditioned on geometric properties, and (3) hierarchical vessel assembly by integrating the local segments according to the global key graph. We validate our framework on real world datasets, demonstrating superior performance over existing methods in modeling complex vascular networks. This work marks the first successful application of a part-based generative approach for 3D vessel modeling, setting a new benchmark for vascular data generation. The code is available at: https://github.com/CybercatChen/PartVessel.git.

[243] Mammo-SAE: Interpreting Breast Cancer Concept Learning with Sparse Autoencoders

Krishna Kanth Nakka

Main category: cs.CV

TL;DR: The paper introduces Sparse Autoencoder (SAE)-based interpretability to breast imaging using Mammo-CLIP, identifying clinically relevant features and confounding factors in model decisions.

DetailsMotivation: Interpretability is crucial in medical imaging for clinical adoption, especially in understanding model decisions.

Method: A patch-level Mammo-SAE is trained on Mammo-CLIP to probe latent features linked to breast concepts like mass and suspicious calcification.

Result: Top activated latent neurons align with ground truth regions, and confounding factors are identified. The study also reveals which neurons aid in downstream finetuning.

Conclusion: SAE-based interpretability offers deeper insights into foundation models for breast imaging, aiding clinical adoption.

Abstract: Interpretability is critical in high-stakes domains such as medical imaging, where understanding model decisions is essential for clinical adoption. In this work, we introduce Sparse Autoencoder (SAE)-based interpretability to breast imaging by analyzing Mammo-CLIP, a vision–language foundation model pretrained on large-scale mammogram image–report pairs. We train a patch-level Mammo-SAE on Mammo-CLIP to identify and probe latent features associated with clinically relevant breast concepts such as mass and suspicious calcification. Our findings reveal that top activated class-level latent neurons in the SAE latent space often tend to align with ground truth regions, and also uncover several confounding factors influencing the model’s decision-making process. Additionally, we analyze which latent neurons the model relies on during downstream finetuning for improving the breast concept prediction. This study highlights the promise of interpretable SAE latent representations in providing deeper insight into the internal workings of foundation models at every layer for breast imaging.
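
For readers unfamiliar with the tool, a generic sparse autoencoder of the kind trained on such patch features looks as follows; the layer sizes and penalty weight are illustrative assumptions, not the Mammo-SAE configuration.

```python
# Generic sparse autoencoder: overcomplete dictionary with an L1 sparsity penalty.
import torch

class SparseAutoencoder(torch.nn.Module):
    def __init__(self, d_in=512, d_latent=4096):
        super().__init__()
        self.enc = torch.nn.Linear(d_in, d_latent)
        self.dec = torch.nn.Linear(d_latent, d_in)

    def forward(self, x):
        z = torch.relu(self.enc(x))  # non-negative, mostly-zero latent codes
        return self.dec(z), z

    def loss(self, x, l1_weight=1e-3):
        recon, z = self.forward(x)
        return torch.nn.functional.mse_loss(recon, x) + l1_weight * z.abs().mean()

# Probing: for a latent neuron j, rank patches by z[:, j] and inspect whether
# the top activations align with annotated mass / calcification regions.
```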

[244] Cross-Domain Few-Shot Learning with Coalescent Projections and Latent Space Reservation

Naeem Paeedeh, Mahardhika Pratama, Wolfgang Mayer, Jimmy Cao, Ryszard Kowlczyk

Main category: cs.CV

TL;DR: A new method, Coalescent Projection (CP), combined with pseudo-class generation and Self-Supervised Transformations (SSTs), outperforms SOTA in CD-FSL by addressing overfitting and domain shift.

DetailsMotivation: Overcoming overfitting in transformers due to scarce labeled samples and improving performance in cross-domain few-shot learning.

Method: Proposes Coalescent Projection (CP) as a successor to soft prompts and introduces pseudo-class generation with SSTs, using only base domain data.

Result: Demonstrates effectiveness on the BSCD-FSL benchmark, especially in extreme domain shift scenarios.

Conclusion: The proposed CP and SSTs method significantly improves CD-FSL performance, addressing key limitations of existing approaches.

Abstract: Despite the progress in Cross-Domain Few-Shot Learning (CD-FSL), a model pre-trained with DINO combined with a prototypical classifier outperforms the latest SOTA methods. A crucial limitation that needs to be overcome is that updating too many parameters of the transformers leads to overfitting due to the scarcity of labeled samples. To address this challenge, we propose a new concept, Coalescent Projection (CP), as an effective successor to soft prompts. Additionally, we propose a novel pseudo-class generation method combined with Self-Supervised Transformations (SSTs) that relies solely on the base domain to prepare the network for encountering unseen samples from different domains. The proposed method exhibits its effectiveness in comprehensive experiments on the extreme domain shift scenario of the BSCD-FSL benchmark. Our code is published at https://github.com/Naeem-Paeedeh/CPLSR.

[245] FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers

Yanbing Zhang, Zhe Wang, Qin Zhou, Mengping Yang

Main category: cs.CV

TL;DR: FreeCus is a training-free framework for subject-driven text-to-image generation, leveraging diffusion transformers (DiT) with innovations like attention sharing, dynamic shifting analysis, and MLLM integration for zero-shot synthesis.

DetailsMotivation: Existing methods rely on training procedures, limiting practical use and failing to exploit DiT's zero-shot potential for subject-driven synthesis.

Method: FreeCus introduces attention sharing, an upgraded DiT variant for feature extraction, and MLLM integration for cross-modal semantics.

Result: Achieves state-of-the-art or comparable results without training, with seamless compatibility for inpainting and control modules.

Conclusion: FreeCus unlocks DiT’s zero-shot ability for consistent subject synthesis, offering a practical and flexible solution.

Abstract: In light of recent breakthroughs in text-to-image (T2I) generation, particularly with diffusion transformers (DiT), subject-driven technologies are increasingly being employed for high-fidelity customized production that preserves subject identity from reference inputs, enabling thrilling design workflows and engaging entertainment. Existing alternatives typically require either per-subject optimization via trainable text embeddings or training specialized encoders for subject feature extraction on large-scale datasets. Such dependencies on training procedures fundamentally constrain their practical applications. More importantly, current methodologies fail to fully leverage the inherent zero-shot potential of modern diffusion transformers (e.g., the Flux series) for authentic subject-driven synthesis. To bridge this gap, we propose FreeCus, a genuinely training-free framework that activates DiT’s capabilities through three key innovations: 1) We introduce a pivotal attention sharing mechanism that captures the subject’s layout integrity while preserving crucial editing flexibility. 2) Through a straightforward analysis of DiT’s dynamic shifting, we propose an upgraded variant that significantly improves fine-grained feature extraction. 3) We further integrate advanced Multimodal Large Language Models (MLLMs) to enrich cross-modal semantic representations. Extensive experiments reflect that our method successfully unlocks DiT’s zero-shot ability for consistent subject synthesis across diverse contexts, achieving state-of-the-art or comparable results compared to approaches that require additional training. Notably, our framework demonstrates seamless compatibility with existing inpainting pipelines and control modules, facilitating more compelling experiences. Our code is available at: https://github.com/Monalissaa/FreeCus.

[246] MinCD-PnP: Learning 2D-3D Correspondences with Approximate Blind PnP

Pei An, Jiaqi Yang, Muyao Peng, You Yang, Qiong Liu, Xiaolin Wu, Liangliang Nan

Main category: cs.CV

TL;DR: The paper proposes MinCD-PnP, a robust image-to-point-cloud registration method using approximated blind PnP to handle noise and outliers, outperforming state-of-the-art methods.

DetailsMotivation: Differential PnP is sensitive to noise and outliers, limiting correspondence learning effectiveness. Blind PnP is robust but computationally expensive.

Method: Simplifies blind PnP to minimize Chamfer distance (MinCD-PnP) and introduces MinCD-Net, a lightweight multi-task learning module.

Result: MinCD-Net achieves higher inlier ratio and registration recall across diverse datasets.

Conclusion: The proposed method effectively addresses noise and outlier issues, improving I2P registration performance.

Abstract: Image-to-point-cloud (I2P) registration is a fundamental problem in computer vision, focusing on establishing 2D-3D correspondences between an image and a point cloud. The differential perspective-n-point (PnP) has been widely used to supervise I2P registration networks by enforcing the projective constraints on 2D-3D correspondences. However, differential PnP is highly sensitive to noise and outliers in the predicted correspondences. This issue hinders the effectiveness of correspondence learning. Inspired by the robustness of blind PnP against noise and outliers in correspondences, we propose an approximated blind PnP based correspondence learning approach. To mitigate the high computational cost of blind PnP, we simplify blind PnP to an amenable task of minimizing Chamfer distance between learned 2D and 3D keypoints, called MinCD-PnP. To effectively solve MinCD-PnP, we design a lightweight multi-task learning module, named MinCD-Net, which can be easily integrated into existing I2P registration architectures. Extensive experiments on 7-Scenes, RGBD-V2, ScanNet, and self-collected datasets demonstrate that MinCD-Net outperforms state-of-the-art methods and achieves a higher inlier ratio (IR) and registration recall (RR) in both cross-scene and cross-dataset settings.
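
As a rough illustration of the MinCD-PnP objective, the sketch below computes a symmetric 2D Chamfer distance between learned 2D keypoints and 3D keypoints projected into the image; the pinhole `project` helper, the intrinsics `K`, and all shapes are our own assumptions, not the paper's code.

```python
import torch

def project(p3d, K):
    """Pinhole projection of (M, 3) camera-frame points with intrinsics K (3, 3)."""
    uvw = p3d @ K.T
    return uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)

def chamfer_2d(p2d, p3d_proj):
    """Symmetric Chamfer distance between two 2D point sets.

    p2d:      (N, 2) learned 2D keypoints
    p3d_proj: (M, 2) learned 3D keypoints projected into the image
    """
    d = torch.cdist(p2d, p3d_proj)  # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

# Toy usage: a loss of this form could supervise the keypoint heads.
K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
p3d = torch.randn(64, 3) + torch.tensor([0., 0., 5.])  # points in front of the camera
p2d = project(p3d, K) + 2.0 * torch.randn(64, 2)       # noisy 2D detections
loss = chamfer_2d(p2d, project(p3d, K))
```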

[247] Conditional Video Generation for High-Efficiency Video Compression

Fangqiu Yi, Jingyu Xu, Jiawei Shao, Chi Zhang, Xuelong Li

Main category: cs.CV

TL;DR: A video compression framework using conditional diffusion models for perceptually optimized reconstruction, outperforming traditional and neural codecs in perceptual quality metrics.

DetailsMotivation: To leverage conditional diffusion models' ability to reconstruct video content aligned with human perception for improved video compression.

Method: Reframes video compression as a conditional generation task with three modules: multi-granular conditioning, compact representations, and multi-condition training.

Result: Significantly outperforms traditional and neural codecs on perceptual quality metrics like FVD and LPIPS, especially at high compression ratios.

Conclusion: The proposed framework effectively combines conditional diffusion models with specialized modules for superior perceptual video compression.

Abstract: Perceptual studies demonstrate that conditional diffusion models excel at reconstructing video content aligned with human visual perception. Building on this insight, we propose a video compression framework that leverages conditional diffusion models for perceptually optimized reconstruction. Specifically, we reframe video compression as a conditional generation task, where a generative model synthesizes video from sparse, yet informative signals. Our approach introduces three key modules: (1) Multi-granular conditioning that captures both static scene structure and dynamic spatio-temporal cues; (2) Compact representations designed for efficient transmission without sacrificing semantic richness; (3) Multi-condition training with modality dropout and role-aware embeddings, which prevent over-reliance on any single modality and enhance robustness. Extensive experiments show that our method significantly outperforms both traditional and neural codecs on perceptual quality metrics such as Fréchet Video Distance (FVD) and LPIPS, especially under high compression ratios.
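
The multi-condition training with modality dropout (module 3) can be sketched as below; the condition names, tensor shapes, and drop probability are hypothetical stand-ins for whatever signals the codec actually transmits.

```python
import random
import torch

def modality_dropout(conditions, p_drop=0.3, keep_at_least_one=True):
    """Randomly zero out condition modalities so the generator cannot
    over-rely on any single signal (keys and shapes are hypothetical).

    conditions: dict name -> tensor, e.g. {"structure": ..., "motion": ...}
    """
    names = list(conditions)
    dropped = {n: (random.random() < p_drop) for n in names}
    if keep_at_least_one and all(dropped.values()):
        dropped[random.choice(names)] = False  # never drop every modality
    return {n: torch.zeros_like(t) if dropped[n] else t
            for n, t in conditions.items()}

conds = {"structure": torch.randn(1, 16, 64, 64), "motion": torch.randn(1, 8, 64, 64)}
conds = modality_dropout(conds)  # then feed to the conditional generator
```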

[248] In-context Learning of Vision Language Models for Detection of Physical and Digital Attacks against Face Recognition Systems

Lazaro Janier Gonzalez-Soler, Maciej Salwowski, Christoph Busch

Main category: cs.CV

TL;DR: The paper explores Vision Language Models (VLMs) for detecting biometric attacks, proposing an in-context learning framework that outperforms traditional CNNs without extensive training.

DetailsMotivation: Addressing the limitations of deep learning models in adapting to diverse biometric attacks and environmental conditions, while tackling challenges in data collection and privacy.

Method: Proposes an in-context learning framework using VLMs for detecting physical and digital attacks, evaluated on open-source models and freely available databases.

Result: The framework achieves competitive performance in attack detection, outperforming some traditional CNNs without resource-intensive training.

Conclusion: VLMs with in-context learning are a promising tool for improving generalization in biometric attack detection.

Abstract: Recent advances in biometric systems have significantly improved the detection and prevention of fraudulent activities. However, as detection methods improve, attack techniques become increasingly sophisticated. Attacks on face recognition systems can be broadly divided into physical and digital approaches. Traditionally, deep learning models have been the primary defence against such attacks. While these models perform exceptionally well in scenarios for which they have been trained, they often struggle to adapt to different types of attacks or varying environmental conditions. These subsystems require substantial amounts of training data to achieve reliable performance, yet biometric data collection faces significant challenges, including privacy concerns and the logistical difficulties of capturing diverse attack scenarios under controlled conditions. This work investigates the application of Vision Language Models (VLM) and proposes an in-context learning framework for detecting physical presentation attacks and digital morphing attacks in biometric systems. Focusing on open-source models, the first systematic framework for the quantitative evaluation of VLMs in security-critical scenarios through in-context learning techniques is established. The experimental evaluation conducted on freely available databases demonstrates that the proposed subsystem achieves competitive performance for physical and digital attack detection, outperforming some of the traditional CNNs without resource-intensive training. The experimental results validate the proposed framework as a promising tool for improving generalisation in attack detection.

[249] Minutiae-Anchored Local Dense Representation for Fingerprint Matching

Zhiyu Pan, Xiongjun Guan, Yongjie Duan, Jianjiang Feng, Jie Zhou

Main category: cs.CV

TL;DR: Proposes DMD, a minutiae-anchored local dense representation for robust fingerprint matching under diverse conditions, achieving state-of-the-art accuracy.

DetailsMotivation: Addressing the challenge of fingerprint matching under varied capture conditions by leveraging both minutiae and ridge textures.

Method: Extracts descriptors from minutiae-centered patches, forming a 3D tensor for multi-level feature aggregation. Uses foreground masks for efficient matching.

Result: Demonstrates effectiveness on diverse datasets, achieving top accuracy with high computational efficiency.

Conclusion: DMD shows strong potential for large-scale fingerprint recognition, with code publicly available.

Abstract: Fingerprint matching under diverse capture conditions remains a fundamental challenge in biometric recognition. To achieve robust and accurate performance in such scenarios, we propose DMD, a minutiae-anchored local dense representation which captures both fine-grained ridge textures and discriminative minutiae features in a spatially structured manner. Specifically, descriptors are extracted from local patches centered and oriented on each detected minutia, forming a three-dimensional tensor, where two dimensions represent spatial locations on the fingerprint plane and the third encodes semantic features. This representation explicitly captures abstract features of local image patches, enabling a multi-level, fine-grained description that aggregates information from multiple minutiae and their surrounding ridge structures. Furthermore, thanks to its strong spatial correspondence with the patch image, DMD allows for the use of foreground segmentation masks to identify valid descriptor regions. During matching, comparisons are then restricted to overlapping foreground areas, improving efficiency and robustness. Extensive experiments on rolled, plain, partial, contactless, and latent fingerprint datasets demonstrate the effectiveness and generalizability of the proposed method. It achieves state-of-the-art accuracy across multiple benchmarks while maintaining high computational efficiency, showing strong potential for large-scale fingerprint recognition. Corresponding code is available at https://github.com/Yu-Yy/DMD.
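
A minimal sketch of the mask-restricted comparison described above: two minutia-anchored dense descriptors are compared only on their overlapping foreground. The (H, W, C) descriptor layout, binary masks, and cosine-similarity pooling are our own choices, not the paper's exact matcher.

```python
import torch
import torch.nn.functional as F

def masked_similarity(desc_a, mask_a, desc_b, mask_b):
    """Compare two minutia-anchored dense descriptors only where both
    foreground masks are valid (a sketch; shapes and pooling are assumed).

    desc_*: (H, W, C) local dense descriptor; mask_*: (H, W) in {0, 1}.
    """
    overlap = (mask_a * mask_b).bool()        # (H, W) jointly valid region
    if overlap.sum() == 0:
        return torch.tensor(0.0)              # no usable overlap
    a = F.normalize(desc_a[overlap], dim=-1)  # (K, C) descriptors in the overlap
    b = F.normalize(desc_b[overlap], dim=-1)
    return (a * b).sum(-1).mean()             # mean cosine similarity

H, W, C = 16, 16, 64
score = masked_similarity(torch.randn(H, W, C), torch.rand(H, W).round(),
                          torch.randn(H, W, C), torch.rand(H, W).round())
```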

[250] Few-Shot Object Detection via Spatial-Channel State Space Model

Zhimeng Xin, Tianxu Wu, Yixiong Zou, Shiming Chen, Dingjie Fu, Xinge You

Main category: cs.CV

TL;DR: The paper proposes a Spatial-Channel State Space Modeling (SCSM) module to address feature extraction challenges in few-shot object detection (FSOD) by leveraging inter-channel correlations and Mamba-based modeling.

DetailsMotivation: Current FSOD methods struggle with extracting effective features due to limited training samples, leading to issues with channel weight accuracy.

Method: The SCSM module includes Spatial Feature Modeling (SFM) for spatial-channel balance and Channel State Modeling (CSM) using Mamba for channel correlation.

Result: Experiments on VOC and COCO datasets show improved feature representation and state-of-the-art performance.

Conclusion: SCSM effectively enhances feature extraction in FSOD by addressing channel weight issues and leveraging inter-channel correlations.

Abstract: Due to the limited training samples in few-shot object detection (FSOD), we observe that current methods may struggle to accurately extract effective features from each channel. Specifically, this issue manifests in two aspects: i) channels with high weights may not necessarily be effective, and ii) channels with low weights may still hold significant value. To handle this problem, we consider utilizing the inter-channel correlation to facilitate the novel model’s adaptation process to novel conditions, ensuring the model can correctly highlight effective channels and rectify those incorrect ones. Since the channel sequence is also 1-dimensional, its similarity with the temporal sequence inspires us to take Mamba for modeling the correlation in the channel sequence. Based on this concept, we propose a Spatial-Channel State Space Modeling (SCSM) module for spatial-channel state modeling, which highlights the effective patterns and rectifies those ineffective ones in feature channels. In SCSM, we design the Spatial Feature Modeling (SFM) module to balance the learning of spatial relationships and channel relationships, and then introduce the Channel State Modeling (CSM) module based on Mamba to learn correlation in channels. Extensive experiments on the VOC and COCO datasets show that the SCSM module enables the novel detector to improve the quality of focused feature representation in channels and achieve state-of-the-art performance.

[251] BenchDepth: Are We on the Right Way to Evaluate Depth Foundation Models?

Zhenyu Li, Haotong Lin, Jiashi Feng, Peter Wonka, Bingyi Kang

Main category: cs.CV

TL;DR: BenchDepth introduces a new benchmark for evaluating depth foundation models (DFMs) using five downstream tasks, avoiding biases of traditional alignment-based metrics.

DetailsMotivation: Existing depth evaluation protocols are inconsistent and biased, favoring certain representations and complicating fair comparisons.

Method: Proposes BenchDepth, evaluating DFMs through five proxy tasks: depth completion, stereo matching, 3D reconstruction, SLAM, and vision-language spatial understanding.

Result: Benchmarked eight state-of-the-art DFMs, providing insights into their practical utility.

Conclusion: BenchDepth offers a fairer evaluation method, encouraging better practices and future advancements in depth estimation.

Abstract: Depth estimation is a fundamental task in computer vision with diverse applications. Recent advancements in deep learning have led to powerful depth foundation models (DFMs), yet their evaluation remains challenging due to inconsistencies in existing protocols. Traditional benchmarks rely on alignment-based metrics that introduce biases, favor certain depth representations, and complicate fair comparisons. In this work, we propose BenchDepth, a new benchmark that evaluates DFMs through five carefully selected downstream proxy tasks: depth completion, stereo matching, monocular feed-forward 3D scene reconstruction, SLAM, and vision-language spatial understanding. Unlike conventional evaluation protocols, our approach assesses DFMs based on their practical utility in real-world applications, bypassing problematic alignment procedures. We benchmark eight state-of-the-art DFMs and provide an in-depth analysis of key findings and observations. We hope our work sparks further discussion in the community on best practices for depth model evaluation and paves the way for future research and advancements in depth estimation.

[252] ExDD: Explicit Dual Distribution Learning for Surface Defect Detection via Diffusion Synthesis

Muhammad Aqeel, Federico Leonardi, Francesco Setti

Main category: cs.CV

TL;DR: ExDD (Explicit Dual Distribution) is a framework for industrial defect detection that models dual feature distributions, uses synthetic defect generation, and achieves high performance (94.2% I-AUROC, 97.7% P-AUROC).

DetailsMotivation: One-class anomaly detection paradigms fail in real-world manufacturing due to uniform outlier assumptions and data scarcity.

Method: ExDD uses parallel memory banks for normality and anomaly patterns, latent diffusion models for synthetic defect generation, and a neighborhood-aware scoring mechanism.

Result: Achieves 94.2% I-AUROC and 97.7% P-AUROC on KSDD2, with optimal performance at 100 synthetic samples.

Conclusion: ExDD effectively addresses limitations of traditional anomaly detection in industrial settings.

Abstract: Industrial defect detection systems face critical limitations when confined to one-class anomaly detection paradigms, which assume uniform outlier distributions and struggle with data scarcity in real-world manufacturing environments. We present ExDD (Explicit Dual Distribution), a novel framework that transcends these limitations by explicitly modeling dual feature distributions. Our approach leverages parallel memory banks that capture the distinct statistical properties of both normality and anomalous patterns, addressing the fundamental flaw of uniform outlier assumptions. To overcome data scarcity, we employ latent diffusion models with domain-specific textual conditioning, generating in-distribution synthetic defects that preserve industrial context. Our neighborhood-aware ratio scoring mechanism elegantly fuses complementary distance metrics, amplifying signals in regions exhibiting both deviation from normality and similarity to known defect patterns. Experimental validation on KSDD2 demonstrates superior performance (94.2% I-AUROC, 97.7% P-AUROC), with optimal augmentation at 100 synthetic samples.
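
The neighborhood-aware ratio scoring can be read as fusing a distance to the normality bank with a distance to the anomaly bank. The sketch below shows one plausible form (k-nearest-neighbor means and a simple ratio); this is our assumption, not the paper's exact formula.

```python
import torch

def ratio_score(feat, normal_bank, anomaly_bank, k=5, eps=1e-6):
    """Anomaly score that grows when a patch feature is far from normal
    patterns *and* close to known defect patterns (illustrative form only).

    feat: (C,) patch feature; *_bank: (N, C) memory banks.
    """
    d_norm = torch.cdist(feat[None], normal_bank).topk(k, largest=False).values.mean()
    d_anom = torch.cdist(feat[None], anomaly_bank).topk(k, largest=False).values.mean()
    return d_norm / (d_anom + eps)  # large: closer to defects than to normality

score = ratio_score(torch.randn(128), torch.randn(500, 128), torch.randn(100, 128))
```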

[253] RoadFusion: Latent Diffusion Model for Pavement Defect Detection

Muhammad Aqeel, Kidus Dagnaw Bellete, Francesco Setti

Main category: cs.CV

TL;DR: RoadFusion addresses pavement defect detection challenges using synthetic anomaly generation and dual-path feature adaptation, achieving state-of-the-art performance.

DetailsMotivation: Challenges include limited annotated data, domain shift, and variability in defect appearances across road conditions.

Method: Uses a latent diffusion model for synthetic defect generation and dual-path feature adaptors for robust representation. A lightweight discriminator refines defect detection.

Result: Achieves strong performance on six benchmark datasets in classification and localization tasks.

Conclusion: RoadFusion sets new state-of-the-art metrics for real-world road inspection.

Abstract: Pavement defect detection faces critical challenges including limited annotated data, domain shift between training and deployment environments, and high variability in defect appearances across different road conditions. We propose RoadFusion, a framework that addresses these limitations through synthetic anomaly generation with dual-path feature adaptation. A latent diffusion model synthesizes diverse, realistic defects using text prompts and spatial masks, enabling effective training under data scarcity. Two separate feature adaptors specialize representations for normal and anomalous inputs, improving robustness to domain shift and defect variability. A lightweight discriminator learns to distinguish fine-grained defect patterns at the patch level. Evaluated on six benchmark datasets, RoadFusion achieves consistently strong performance across both classification and localization tasks, setting new state-of-the-art in multiple metrics relevant to real-world road inspection.

[254] DAViD: Data-efficient and Accurate Vision Models from Synthetic Data

Fatemeh Saleh, Sadegh Aliakbarian, Charlie Hewitt, Lohit Petikam, Xiao-Xian, Antonio Criminisi, Thomas J. Cashman, Tadas Baltrušaitis

Main category: cs.CV

TL;DR: Training high-accuracy human-centric vision models on small synthetic datasets, achieving efficiency and fairness without compromising performance.

DetailsMotivation: Address the high cost and data requirements of large-scale models by leveraging synthetic datasets for efficiency and control.

Method: Use high-fidelity synthetic datasets with perfect labels and procedural diversity to train models for dense prediction tasks.

Result: Models achieve comparable accuracy to large-scale counterparts at a fraction of the cost, with added fairness benefits.

Conclusion: Synthetic datasets offer a viable, efficient alternative for training human-centric vision models, with advantages in cost, control, and fairness.

Abstract: The state of the art in human-centric computer vision achieves high accuracy and robustness across a diverse range of tasks. The most effective models in this domain have billions of parameters, thus requiring extremely large datasets, expensive training regimes, and compute-intensive inference. In this paper, we demonstrate that it is possible to train models on much smaller but high-fidelity synthetic datasets, with no loss in accuracy and higher efficiency. Using synthetic training data provides us with excellent levels of detail and perfect labels, while providing strong guarantees for data provenance, usage rights, and user consent. Procedural data synthesis also provides us with explicit control on data diversity, that we can use to address unfairness in the models we train. Extensive quantitative assessment on real input images demonstrates accuracy of our models on three dense prediction tasks: depth estimation, surface normal estimation, and soft foreground segmentation. Our models require only a fraction of the cost of training and inference when compared with foundational models of similar accuracy. Our human-centric synthetic dataset and trained models are available at https://aka.ms/DAViD.

[255] Rethinking Occlusion in FER: A Semantic-Aware Perspective and Go Beyond

Huiyu Zhai, Xingxing Yang, Yalan Ye, Chenyang Li, Bin Fan, Changze Li

Main category: cs.CV

TL;DR: ORSANet improves facial expression recognition (FER) under occlusion by using multi-modal semantic guidance, a multi-scale fusion module, and a dynamic loss function, achieving state-of-the-art results.

DetailsMotivation: Existing FER models struggle with occlusion and dataset biases, leading to inaccurate classifications.

Method: ORSANet introduces multi-modal semantic guidance (semantic segmentation and facial landmarks), a Multi-scale Cross-interaction Module (MCM), and a Dynamic Adversarial Repulsion Enhancement Loss (DARELoss).

Result: ORSANet achieves state-of-the-art performance on public benchmarks and the new Occlu-FER dataset.

Conclusion: ORSANet effectively addresses occlusion and bias challenges in FER, demonstrating superior performance.

Abstract: Facial expression recognition (FER) is a challenging task due to pervasive occlusion and dataset biases. Especially when facial information is partially occluded, existing FER models struggle to extract effective facial features, leading to inaccurate classifications. In response, we present ORSANet, which introduces the following three key contributions: First, we introduce auxiliary multi-modal semantic guidance to disambiguate facial occlusion and learn high-level semantic knowledge, which is two-fold: 1) we introduce semantic segmentation maps as dense semantics prior to generate semantics-enhanced facial representations; 2) we introduce facial landmarks as sparse geometric prior to mitigate intrinsic noises in FER, such as identity and gender biases. Second, to facilitate the effective incorporation of these two multi-modal priors, we customize a Multi-scale Cross-interaction Module (MCM) to adaptively fuse the landmark feature and semantics-enhanced representations within different scales. Third, we design a Dynamic Adversarial Repulsion Enhancement Loss (DARELoss) that dynamically adjusts the margins of ambiguous classes, further enhancing the model’s ability to distinguish similar expressions. We further construct the first occlusion-oriented FER dataset to facilitate specialized robustness analysis on various real-world occlusion conditions, dubbed Occlu-FER. Extensive experiments on both public benchmarks and Occlu-FER demonstrate that our proposed ORSANet achieves SOTA recognition performance. Code is publicly available at https://github.com/Wenyuzhy/ORSANet-master.

[256] SurgX: Neuron-Concept Association for Explainable Surgical Phase Recognition

Ka Young Kim, Hyeon Bae Kim, Seong Tae Kim

Main category: cs.CV

TL;DR: SurgX is a concept-based explanation framework to improve interpretability in surgical phase recognition models by linking neurons to relevant concepts.

DetailsMotivation: Deep learning models for surgical phase recognition lack interpretability, hindering trust and debugging.

Method: SurgX selects example sequences for neurons, constructs a surgical-specific concept set, associates neurons with concepts, and identifies key neurons for predictions.

Result: Validated on two models, SurgX effectively explains predictions in surgical phase recognition.

Conclusion: SurgX enhances interpretability, aiding trust and debugging in surgical phase recognition models.

Abstract: Surgical phase recognition plays a crucial role in surgical workflow analysis, enabling various applications such as surgical monitoring, skill assessment, and workflow optimization. Despite significant advancements in deep learning-based surgical phase recognition, these models remain inherently opaque, making it difficult to understand how they make decisions. This lack of interpretability hinders trust and makes it challenging to debug the model. To address this challenge, we propose SurgX, a novel concept-based explanation framework that enhances the interpretability of surgical phase recognition models by associating neurons with relevant concepts. In this paper, we introduce the process of selecting representative example sequences for neurons, constructing a concept set tailored to the surgical video dataset, associating neurons with concepts and identifying neurons crucial for predictions. Through extensive experiments on two surgical phase recognition models, we validate our method and analyze the explanation for prediction. This highlights the potential of our method in explaining surgical phase recognition. The code is available at https://github.com/ailab-kyunghee/SurgX

[257] EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent

Jiaao Li, Kaiyuan Li, Chen Gao, Yong Li, Xinlei Chen

Main category: cs.CV

TL;DR: EgoPrune is a training-free token pruning method for efficient egomotion video reasoning, outperforming prior methods and reducing computational costs.

DetailsMotivation: Egomotion videos are crucial for embodied AI agents, but current vision-language models are computationally expensive for long videos. Existing pruning methods don't account for egomotion's unique spatiotemporal continuity.

Method: EgoPrune includes a keyframe selector, Perspective-Aware Redundancy Filtering, and an MMR-based token selector to prune redundant tokens efficiently.

Result: EgoPrune outperforms prior methods on benchmarks, reducing FLOPs, memory usage, and latency. It also works well on edge devices like Jetson Orin NX.

Conclusion: EgoPrune is effective and practical for real-world egomotion video reasoning, offering significant efficiency gains.

Abstract: Egomotion videos are first-person recordings where the view changes continuously due to the agent’s movement. As they serve as the primary visual input for embodied AI agents, making egomotion video reasoning more efficient is therefore essential for real-world deployment. Recent advances in vision-language models have enabled strong multimodal reasoning capabilities, but their computational cost remains prohibitive for long, redundant video inputs. Existing token pruning methods, typically designed for third-person videos, fail to leverage the spatiotemporal continuity and motion constraints inherent in egomotion settings. To address this, we propose EgoPrune, a training-free token pruning method tailored for egomotion video reasoning. EgoPrune comprises three components: a keyframe selector adapted from EmbodiedR for temporally efficient sampling; Perspective-Aware Redundancy Filtering (PARF), which aligns visual tokens using perspective transformations and removes redundant tokens; and a Maximal Marginal Relevance (MMR)-based token selector that jointly considers visual-text relevance and intra-frame diversity. Experiments on two egomotion video benchmarks show that EgoPrune consistently outperforms prior training-free methods across various pruning ratios while significantly reducing FLOPs, memory usage, and latency. Moreover, we deploy EgoPrune on an embodied agent equipped with a Jetson Orin NX 16GB edge device, demonstrating its real-world efficiency and suitability for on-device egomotion video reasoning.
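
The MMR-based token selector follows the classic Maximal Marginal Relevance recipe: greedily keep the token most relevant to the text query while least redundant with tokens already kept. Below is a generic sketch; the trade-off weight `lam` and all shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mmr_select(tokens, text_emb, n_keep, lam=0.7):
    """Maximal Marginal Relevance over visual tokens: trade off relevance
    to the text query against redundancy with already-kept tokens.

    tokens: (N, C) visual tokens; text_emb: (C,) query embedding.
    """
    t = F.normalize(tokens, dim=-1)
    rel = t @ F.normalize(text_emb, dim=-1)   # (N,) text relevance
    sim = t @ t.T                             # (N, N) token-token similarity
    kept = [int(rel.argmax())]
    while len(kept) < n_keep:
        red = sim[:, kept].max(dim=1).values  # redundancy w.r.t. kept set
        score = lam * rel - (1 - lam) * red
        score[kept] = float("-inf")           # never pick a token twice
        kept.append(int(score.argmax()))
    return tokens[kept]

pruned = mmr_select(torch.randn(196, 512), torch.randn(512), n_keep=64)
```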

[258] One Last Attention for Your Vision-Language Model

Liang Chen, Ghazi Shazan Ahmad, Tianjun Yao, Lingqiao Liu, Zhiqiang Shen

Main category: cs.CV

TL;DR: RAda is a method for fine-tuning VLMs by dynamically adjusting fused representations to improve cross-modal interactions without costly modifications.

DetailsMotivation: Existing methods neglect the role of fused representations in decision-making, limiting the potential of VLMs.

Method: RAda uses a learned mask from a lightweight attention layer to calibrate contributions in the rational matrix.

Result: RAda improves baseline performance with minimal code and matches current methods in most settings.

Conclusion: RAda is a versatile and effective fine-tuning technique for VLMs.

Abstract: Pretrained vision-language models (VLMs), such as CLIP, achieve remarkable zero-shot performance, yet their downstream potential hinges on effective fine-tuning. Most adaptation methods typically focus on refining representation from separate modalities (text or vision) but neglect the critical role of their fused representations in the decision-making process, i.e., the rational matrix that drives the final prediction. To bridge the gap, we propose a simple yet effective Rational Adaptation (RAda) to explicitly exploit the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix, enabling targeted adjustments to the final cross-modal interactions without incurring costly modifications to intermediate features. Experiments in different settings (i.e., updating, or freezing pretrained encoders in adaptation, and test-time training that can only access the unlabeled test data) show that RAda serves as a versatile fine-tuning technique, improving the baseline with minimal code and performing comparably against current arts in most settings. Code is available at https://github.com/khufia/RAda/tree/main.

[259] An aerial color image anomaly dataset for search missions in complex forested terrain

Rakesh John Amala Arokia Nathan, Matthias Gessner, Nurullah Özkan, Marius Bock, Mohamed Youssef, Maximilian Mews, Björn Piltz, Ralf Berger, Oliver Bimber

Main category: cs.CV

TL;DR: A crowd-search initiative created a dataset of hard-to-detect anomalies in dense forests, serving as a benchmark for improving anomaly detection in manhunts and rescues. Existing methods performed poorly, emphasizing the need for context-aware approaches.

DetailsMotivation: The failure to locate a suspect in a dense forest despite a massive search highlighted the limitations of automated analysis in such environments, prompting the creation of a labeled dataset for better anomaly detection.

Method: High-resolution aerial imagery was captured, and a crowd-search initiative labeled anomalies obscured by dense vegetation. The dataset supports offline processing and an interactive web interface for dynamic growth.

Result: Initial benchmark tests revealed poor performance of existing anomaly detection methods, underscoring the challenge of dense forest environments.

Conclusion: The dataset and interactive platform provide valuable resources for developing context-aware anomaly detection methods, aiding future manhunts and rescue operations.

Abstract: After a family murder in rural Germany, authorities failed to locate the suspect in a vast forest despite a massive search. To aid the search, a research aircraft captured high-resolution aerial imagery. Due to dense vegetation obscuring small clues, automated analysis was ineffective, prompting a crowd-search initiative. This effort produced a unique dataset of labeled, hard-to-detect anomalies under occluded, real-world conditions. It can serve as a benchmark for improving anomaly detection approaches in complex forest environments, supporting manhunts and rescue operations. Initial benchmark tests showed existing methods performed poorly, highlighting the need for context-aware approaches. The dataset is openly accessible for offline processing. An additional interactive web interface supports online viewing and dynamic growth by allowing users to annotate and submit new findings.

[260] Dense-depth map guided deep Lidar-Visual Odometry with Sparse Point Clouds and Images

JunYing Huang, Ao Xu, DongSun Yong, KeRen Li, YuanFeng Wang, Qi Qin

Main category: cs.CV

TL;DR: A novel LiDAR-Visual odometry framework integrating LiDAR and images for accurate pose estimation, outperforming state-of-the-art methods.

DetailsMotivation: Odometry is crucial for autonomous systems, but existing methods lack robustness in dynamic environments or occlusion-prone regions.

Method: Combines LiDAR point clouds and images via depth completion, uses multi-scale feature extraction with attention, and refines pose hierarchically.

Result: Achieves superior accuracy and robustness on the KITTI benchmark compared to current visual and LiDAR odometry methods.

Conclusion: The proposed framework effectively addresses challenges in odometry, offering a robust solution for autonomous navigation.

Abstract: Odometry is a critical task for autonomous systems for self-localization and navigation. We propose a novel LiDAR-Visual odometry framework that integrates LiDAR point clouds and images for accurate and robust pose estimation. Our method utilizes a dense-depth map estimated from point clouds and images through depth completion, and incorporates a multi-scale feature extraction network with attention mechanisms, enabling adaptive depth-aware representations. Furthermore, we leverage dense depth information to refine flow estimation and mitigate errors in occlusion-prone regions. Our hierarchical pose refinement module optimizes motion estimation progressively, ensuring robust predictions against dynamic environments and scale ambiguities. Comprehensive experiments on the KITTI odometry benchmark demonstrate that our approach achieves similar or superior accuracy and robustness compared to state-of-the-art visual and LiDAR odometry methods.

[261] Quantifying and Narrowing the Unknown: Interactive Text-to-Video Retrieval via Uncertainty Minimization

Bingqing Zhang, Zhuo Cao, Heming Du, Yang Li, Xue Li, Jiajun Liu, Sen Wang

Main category: cs.CV

TL;DR: UMIVR is a framework for interactive text-to-video retrieval that quantifies uncertainties (text ambiguity, mapping uncertainty, frame uncertainty) and uses them to generate clarifying questions, improving retrieval accuracy.

DetailsMotivation: Current interactive TVR systems lack explicit quantification of uncertainties, limiting their effectiveness.

Method: UMIVR uses training-free metrics (TAS, MUS, TQFS) to measure uncertainties and adaptively generates clarifying questions.

Result: Achieves 69.2% Recall@1 on MSR-VTT-1k after 10 rounds, outperforming existing methods.

Conclusion: UMIVR provides a principled, uncertainty-minimizing approach for interactive TVR, enhancing retrieval performance.

Abstract: Despite recent advances, text-to-video retrieval (TVR) is still hindered by multiple inherent uncertainties, such as ambiguous textual queries, indistinct text-video mappings, and low-quality video frames. Although interactive systems have emerged to address these challenges by refining user intent through clarifying questions, current methods typically rely on heuristic or ad-hoc strategies without explicitly quantifying these uncertainties, limiting their effectiveness. Motivated by this gap, we propose UMIVR, an Uncertainty-Minimizing Interactive Text-to-Video Retrieval framework that explicitly quantifies three critical uncertainties (text ambiguity, mapping uncertainty, and frame uncertainty) via principled, training-free metrics: the semantic entropy-based Text Ambiguity Score (TAS), the Jensen-Shannon divergence-based Mapping Uncertainty Score (MUS), and a Temporal Quality-based Frame Sampler (TQFS). By adaptively generating targeted clarifying questions guided by these uncertainty measures, UMIVR iteratively refines user queries, significantly reducing retrieval ambiguity. Extensive experiments on multiple benchmarks validate UMIVR’s effectiveness, achieving notable gains in Recall@1 (69.2% after 10 interactive rounds) on the MSR-VTT-1k dataset, thereby establishing an uncertainty-minimizing foundation for interactive TVR.
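
As an illustration of the Jensen-Shannon ingredient behind the Mapping Uncertainty Score, the sketch below compares a query's softmaxed video-similarity distribution against a uniform one; reading low divergence as an ambiguous mapping is our interpretation, not the paper's exact definition.

```python
import torch
import torch.nn.functional as F

def js_divergence(p, q, eps=1e-10):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    p, q = p.clamp_min(eps), q.clamp_min(eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a / b).log()).sum()
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy reading of mapping uncertainty: a peaked query->video similarity
# distribution diverges strongly from uniform (confident mapping), while a
# flat one does not (ambiguous mapping). This is our interpretation.
sims = torch.randn(100)                  # query-video similarities
p = F.softmax(sims, dim=0)
uniform = torch.full_like(p, 1.0 / p.numel())
ambiguity = 1.0 - js_divergence(p, uniform) / torch.log(torch.tensor(2.0))
```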

[262] SAIGFormer: A Spatially-Adaptive Illumination-Guided Network for Low-Light Image Enhancement

Hanting Li, Fei Zhou, Xin Sun, Yang Hua, Jungong Han, Liang-Jie Zhang

Main category: cs.CV

TL;DR: SAIGFormer, a Transformer-based framework, improves low-light image enhancement by addressing non-uniform lighting issues with dynamic illumination modeling and guided attention.

DetailsMotivation: Existing methods struggle with non-uniform lighting scenarios like backlit or shadowed images, leading to over-exposure or inadequate brightness restoration.

Method: Proposes SAIGFormer with a dynamic integral image representation for illumination modeling and an Illumination-Guided Multi-head Self-Attention mechanism.

Result: Outperforms state-of-the-art methods on five datasets and a cross-domain benchmark, excelling in non-uniform lighting scenarios.

Conclusion: SAIGFormer achieves superior illumination enhancement and generalization, with code publicly available.

Abstract: Recent Transformer-based low-light enhancement methods have made promising progress in recovering global illumination. However, they still struggle with non-uniform lighting scenarios, such as backlit and shadow, appearing as over-exposure or inadequate brightness restoration. To address this challenge, we present a Spatially-Adaptive Illumination-Guided Transformer (SAIGFormer) framework that enables accurate illumination restoration. Specifically, we propose a dynamic integral image representation to model the spatially-varying illumination, and further construct a novel Spatially-Adaptive Integral Illumination Estimator (SAI²E). Moreover, we introduce an Illumination-Guided Multi-head Self-Attention (IG-MSA) mechanism, which leverages the illumination to calibrate the lightness-relevant features toward visually pleasing illumination enhancement. Extensive experiments on five standard low-light datasets and a cross-domain benchmark (LOL-Blur) demonstrate that our SAIGFormer significantly outperforms state-of-the-art methods in both quantitative and qualitative metrics. In particular, our method achieves superior performance in non-uniform illumination enhancement while exhibiting strong generalization capabilities across multiple datasets. Code is available at https://github.com/LHTcode/SAIGFormer.git.
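
The dynamic integral image at the heart of the illumination estimator builds on the classic summed-area table, which yields local box means in constant time per pixel. A minimal sketch follows; the window size and box-mean normalization are our choices, not the paper's estimator.

```python
import torch
import torch.nn.functional as F

def local_mean_illumination(img, win=15):
    """Estimate spatially-varying illumination as a local box mean computed
    with an integral image (summed-area table). Sketch only.

    img: (B, C, H, W) in [0, 1]; returns a (B, C, H, W) illumination map.
    """
    B, C, H, W = img.shape
    pad = F.pad(img, (1, 0, 1, 0))            # zero row/col so SAT lookups work
    sat = pad.cumsum(dim=2).cumsum(dim=3)     # integral image, shape (B, C, H+1, W+1)
    r = win // 2
    ys = torch.arange(H)
    xs = torch.arange(W)
    y0, y1 = (ys - r).clamp(0, H), (ys + r + 1).clamp(0, H)
    x0, x1 = (xs - r).clamp(0, W), (xs + r + 1).clamp(0, W)
    # Box sum via four SAT lookups, then normalize by the true window area.
    s = (sat[:, :, y1][:, :, :, x1] - sat[:, :, y0][:, :, :, x1]
         - sat[:, :, y1][:, :, :, x0] + sat[:, :, y0][:, :, :, x0])
    area = (y1 - y0)[:, None] * (x1 - x0)[None, :]
    return s / area

illum = local_mean_illumination(torch.rand(1, 3, 64, 64))
```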

[263] Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport

Syed Ahmed Mahmood, Ali Shah Ali, Umer Ahmed, Fawad Javed Fateh, M. Zeeshan Zia, Quoc-Huy Tran

Main category: cs.CV

TL;DR: A self-supervised framework for learning procedures from unlabeled videos, using fused Gromov-Wasserstein optimal transport and contrastive regularization to address order variations and degenerate solutions.

DetailsMotivation: To overcome limitations of previous methods, such as order variations, background/redundant frames, and repeated actions, in learning key steps from procedural videos.

Method: Proposes a self-supervised framework combining fused Gromov-Wasserstein optimal transport for temporal alignment and contrastive regularization to prevent degenerate solutions.

Result: Demonstrates superior performance on benchmarks (EgoProceL, ProceL, CrossTask) compared to previous methods like OPEL.

Conclusion: The framework effectively learns key steps and their order from unlabeled videos, outperforming existing approaches.

Abstract: We study the problem of self-supervised procedure learning, which discovers key steps and establishes their order from a set of unlabeled procedural videos. Previous procedure learning methods typically learn frame-to-frame correspondences between videos before determining key steps and their order. However, their performance often suffers from order variations, background/redundant frames, and repeated actions. To overcome these challenges, we propose a self-supervised procedure learning framework, which utilizes a fused Gromov-Wasserstein optimal transport formulation with a structural prior for computing frame-to-frame mapping between videos. However, optimizing exclusively for the above temporal alignment term may lead to degenerate solutions, where all frames are mapped to a small cluster in the embedding space and hence every video is associated with only one key step. To address that limitation, we further integrate a contrastive regularization term, which maps different frames to different points in the embedding space, avoiding the collapse to trivial solutions. Finally, we conduct extensive experiments on large-scale egocentric (i.e., EgoProceL) and third-person (i.e., ProceL and CrossTask) benchmarks to demonstrate superior performance by our approach against previous methods, including OPEL which relies on a traditional Kantorovich optimal transport formulation with an optimality prior.
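
For readers who want to experiment, the fused Gromov-Wasserstein alignment can be prototyped with the POT library. The sketch below aligns frame embeddings of two videos; it uses intra-video feature distances as a stand-in for the paper's structural prior and omits the contrastive regularizer.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

# Frame embeddings of two unlabeled videos of the same procedure (toy data).
X = np.random.randn(40, 128)   # video 1: 40 frames
Y = np.random.randn(55, 128)   # video 2: 55 frames

M = ot.dist(X, Y)              # cross-video feature cost
C1 = ot.dist(X, X)             # intra-video structure (proxy for the temporal prior)
C2 = ot.dist(Y, Y)
p, q = ot.unif(len(X)), ot.unif(len(Y))

# alpha trades off the feature (Wasserstein) cost against the structural
# (Gromov) cost; T is the frame-to-frame transport plan.
T = ot.gromov.fused_gromov_wasserstein(M, C1, C2, p, q,
                                       loss_fun='square_loss', alpha=0.5)
frame_map = T.argmax(axis=1)   # per-frame correspondence, video 1 -> video 2
```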

[264] Towards Holistic Surgical Scene Graph

Jongmin Shin, Enki Cho, Ka Yong Kim, Jung Yong Kim, Seong Tae Kim, Namkee Oh

Main category: cs.CV

TL;DR: The paper introduces Endoscapes-SG201 dataset and SSG-Com, a graph-based method, to enhance surgical scene understanding by incorporating tool-action-target combinations and hand identity into graph representations.

DetailsMotivation: Surgical scene understanding is vital for computer-assisted intervention systems, but existing graph-based representations lack exploration of tool-action-target combinations and hand identity.

Method: Proposes Endoscapes-SG201 dataset with annotations for tool-action-target and hand identity, and introduces SSG-Com, a graph-based method to model these elements.

Result: Experiments show the importance of integrating these components for tasks like critical view of safety assessment and action triplet recognition.

Conclusion: The study highlights the significant contribution of tool-action-target and hand identity in surgical scene understanding, with code and dataset made available.

Abstract: Surgical scene understanding is crucial for computer-assisted intervention systems, requiring visual comprehension of surgical scenes that involves diverse elements such as surgical tools, anatomical structures, and their interactions. To effectively represent the complex information in surgical scenes, graph-based approaches have been explored to structurally model surgical entities and their relationships. Previous surgical scene graph studies have demonstrated the feasibility of representing surgical scenes using graphs. However, certain aspects of surgical scenes, such as diverse combinations of tool-action-target and the identity of the hand operating the tool, remain underexplored in graph-based representations, despite their importance. To incorporate these aspects into graph representations, we propose the Endoscapes-SG201 dataset, which includes annotations for tool-action-target combinations and hand identity. We also introduce SSG-Com, a graph-based method designed to learn and represent these critical elements. Through experiments on downstream tasks such as critical view of safety assessment and action triplet recognition, we demonstrated the importance of integrating these essential scene graph components, highlighting their significant contribution to surgical scene understanding. The code and dataset are available at https://github.com/ailab-kyunghee/SSG-Com

[265] HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation

Qinqian Lei, Bo Wang, Robby T. Tan

Main category: cs.CV

TL;DR: HOLa improves zero-shot HOI detection by decomposing VLM text features into class-shared basis and adaptable weights, enhancing generalization and action distinction.

DetailsMotivation: Existing methods struggle with distinguishing similar actions or generalizing to unseen classes in HOI detection.

Method: HOLa uses low-rank decomposition of VLM text features, adapts weights for each HOI class, and enriches visual representations with human-object tokens.

Result: Achieves 27.91 mAP on unseen classes in HICO-DET, setting a new state-of-the-art.

Conclusion: HOLa effectively generalizes to unseen classes and improves action distinction in zero-shot HOI detection.

Abstract: Zero-shot human-object interaction (HOI) detection remains a challenging task, particularly in generalizing to unseen actions. Existing methods address this challenge by tapping Vision-Language Models (VLMs) to access knowledge beyond the training data. However, they either struggle to distinguish actions involving the same object or demonstrate limited generalization to unseen classes. In this paper, we introduce HOLa (Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation), a novel approach that both enhances generalization to unseen classes and improves action distinction. In training, HOLa decomposes VLM text features for given HOI classes via low-rank factorization, producing class-shared basis features and adaptable weights. These features and weights form a compact HOI representation that preserves shared information across classes, enhancing generalization to unseen classes. Subsequently, we refine action distinction by adapting weights for each HOI class and introducing human-object tokens to enrich visual interaction representations. To further distinguish unseen actions, we guide the weight adaptation with LLM-derived action regularization. Experimental results show that our method sets a new state-of-the-art across zero-shot HOI settings on HICO-DET, achieving an unseen-class mAP of 27.91 in the unseen-verb setting. Our code is available at https://github.com/ChelsieLei/HOLa.
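
The low-rank decomposition of VLM text features can be sketched with a truncated SVD: a class-shared basis stays fixed while per-class weights are adapted. The rank `r` and the freeze/tune split below are our assumptions, not the paper's exact training recipe.

```python
import torch

# Text features for C HOI classes from a VLM text encoder (toy stand-in).
C, D, r = 120, 512, 32
text_feats = torch.randn(C, D)

# Truncated SVD: class-shared basis (r, D) plus per-class weights (C, r).
U, S, Vh = torch.linalg.svd(text_feats, full_matrices=False)
basis = Vh[:r]                         # shared across all HOI classes (kept frozen)
weights = U[:, :r] * S[:r]             # per-class coefficients
weights = torch.nn.Parameter(weights)  # only these would be adapted in training

recon = weights @ basis                # compact HOI representation, (C, D)
print(torch.dist(recon, text_feats))   # rank-r approximation error
```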

[266] DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding

Xiaoyi Bao, Chenwei Xie, Hao Tang, Tingyu Weng, Xiaofeng Wang, Yun Zheng, Xingang Wang

Main category: cs.CV

TL;DR: The paper introduces Dynamic-Image (DynImg), a novel video representation method using non-key frames as temporal prompts to improve spatial feature extraction for fast-moving objects, enhancing spatio-temporal interaction in video understanding.

DetailsMotivation: Existing methods struggle with accurately representing spatial information of rapidly moving objects due to issues like motion blur, leading to underemphasized temporally important regions and hindered video understanding.

Method: Proposes DynImg, which uses non-key frames as temporal prompts to highlight spatial areas of fast-moving objects and employs 4D video Rotary Position Embedding to maintain spatio-temporal order.

Result: DynImg outperforms state-of-the-art methods by ~2% on multiple video understanding benchmarks.

Conclusion: DynImg effectively addresses the challenge of integrating temporal information in video understanding, proving the value of temporal prompts in enhancing comprehension.

Abstract: In recent years, the introduction of Multi-modal Large Language Models (MLLMs) into video understanding tasks has become increasingly prevalent. However, how to effectively integrate temporal information remains a critical research focus. Traditional approaches treat spatial and temporal information separately. Due to issues like motion blur, it is challenging to accurately represent the spatial information of rapidly moving objects. This can lead to temporally important regions being underemphasized during spatial feature extraction, which in turn hinders accurate spatio-temporal interaction and video understanding. To address this limitation, we propose an innovative video representation method called Dynamic-Image (DynImg). Specifically, we introduce a set of non-key frames as temporal prompts to highlight the spatial areas containing fast-moving objects. During the process of visual feature extraction, these prompts guide the model to pay additional attention to the fine-grained spatial features corresponding to these regions. Moreover, to maintain the correct sequence for DynImg, we employ a corresponding 4D video Rotary Position Embedding. This retains both the temporal and spatial adjacency of DynImg, helping MLLM understand the spatio-temporal order within this combined format. Experimental evaluations reveal that DynImg surpasses the state-of-the-art methods by approximately 2% across multiple video understanding benchmarks, proving the effectiveness of our temporal prompts in enhancing video comprehension.

[267] GeMix: Conditional GAN-Based Mixup for Improved Medical Image Augmentation

Hugo Carlesso, Maria Eliza Patulea, Moncef Garouani, Radu Tudor Ionescu, Josiane Mothe

Main category: cs.CV

TL;DR: GeMix replaces naive pixel-wise mixup with a learned, label-aware interpolation using class-conditional GANs, improving image realism and performance in medical image classification.

DetailsMotivation: Traditional mixup produces unrealistic images, hindering learning, especially in high-stakes medical applications. GeMix aims to address this by generating visually coherent images.

Method: GeMix uses a two-stage framework: a StyleGAN2-ADA generator is trained, then interpolated label vectors condition the generator to synthesize realistic images.

Result: GeMix outperforms traditional mixup on the COVIDx-CT-3 dataset, improving macro-F1 and reducing false negatives for COVID-19 detection.

Conclusion: GeMix is a drop-in replacement for pixel-space mixup, offering better regularization and semantic fidelity without disrupting training pipelines.

Abstract: Mixup has become a popular augmentation strategy for image classification, yet its naive pixel-wise interpolation often produces unrealistic images that can hinder learning, particularly in high-stakes medical applications. We propose GeMix, a two-stage framework that replaces heuristic blending with a learned, label-aware interpolation powered by class-conditional GANs. First, a StyleGAN2-ADA generator is trained on the target dataset. During augmentation, we sample two label vectors from Dirichlet priors biased toward different classes and blend them via a Beta-distributed coefficient. Then, we condition the generator on this soft label to synthesize visually coherent images that lie along a continuous class manifold. We benchmark GeMix on the large-scale COVIDx-CT-3 dataset using three backbones (ResNet-50, ResNet-101, EfficientNet-B0). When combined with real data, our method increases macro-F1 over traditional mixup for all backbones, reducing the false negative rate for COVID-19 detection. GeMix is thus a drop-in replacement for pixel-space mixup, delivering stronger regularization and greater semantic fidelity, without disrupting existing training pipelines. We publicly release our code at https://github.com/hugocarlesso/GeMix to foster reproducibility and further research.
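
The label-mixing step is easy to reproduce: draw two Dirichlet label vectors biased toward different classes and blend them with a Beta coefficient. The hyperparameters below (bias strength, Beta shape) are illustrative, not the paper's values.

```python
import numpy as np

def gemix_soft_label(n_classes, c1, c2, bias=10.0, beta_ab=(2.0, 2.0), rng=None):
    """Build a GeMix-style soft label: two Dirichlet draws biased toward
    classes c1 and c2, blended by a Beta coefficient (hyperparameters assumed).
    """
    rng = rng or np.random.default_rng()
    a1 = np.ones(n_classes); a1[c1] = bias  # Dirichlet prior biased toward c1
    a2 = np.ones(n_classes); a2[c2] = bias  # Dirichlet prior biased toward c2
    y1, y2 = rng.dirichlet(a1), rng.dirichlet(a2)
    lam = rng.beta(*beta_ab)
    return lam * y1 + (1 - lam) * y2        # soft label conditioning the generator

y = gemix_soft_label(n_classes=3, c1=0, c2=2)
# y would then condition the StyleGAN2-ADA generator to synthesize an image
# lying between the two classes on the label manifold.
```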

[268] SegDT: A Diffusion Transformer-Based Segmentation Model for Medical Imaging

Salah Eddine Bekhouche, Gaby Maroun, Fadi Dornaika, Abdenour Hadid

Main category: cs.CV

TL;DR: SegDT, a diffusion transformer-based model for skin lesion segmentation, achieves state-of-the-art results on low-cost hardware with fast inference, advancing medical image analysis.

DetailsMotivation: Improving skin lesion segmentation for accurate disease diagnosis and treatment planning in healthcare.

Method: Introduces SegDT, combining diffusion transformer (DiT) with Rectified Flow for efficient, high-quality segmentation on low-cost hardware.

Result: Achieves state-of-the-art performance on three datasets with fast inference speeds.

Conclusion: SegDT enhances medical image analysis, offering faster, more accurate tools for healthcare professionals; code is publicly available.

Abstract: Medical image segmentation is crucial for many healthcare tasks, including disease diagnosis and treatment planning. One key area is the segmentation of skin lesions, which is vital for diagnosing skin cancer and monitoring patients. In this context, this paper introduces SegDT, a new segmentation model based on diffusion transformer (DiT). SegDT is designed to work on low-cost hardware and incorporates Rectified Flow, which improves the generation quality at reduced inference steps and maintains the flexibility of standard diffusion models. Our method is evaluated on three benchmarking datasets and compared against several existing works, achieving state-of-the-art results while maintaining fast inference speeds. This makes the proposed model appealing for real-world medical applications. This work advances the performance and capabilities of deep learning models in medical image analysis, enabling faster, more accurate diagnostic tools for healthcare professionals. The code is made publicly available at https://github.com/Bekhouche/SegDT.

[269] Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, Zongqing Lu

Main category: cs.CV

TL;DR: Being-H0 is a Vision-Language-Action model trained on human videos to address dexterity and generalization gaps in manipulation tasks, using physical instruction tuning and part-level motion tokenization.

DetailsMotivation: Existing VLAs struggle with complex manipulation tasks due to reliance on synthetic or limited teleoperated data, lacking dexterity and scalability.

Method: Leverages human hand data, combines VLA pretraining, physical space alignment, and post-training adaptation, with part-level motion tokenization for precise action learning.

Result: Achieves millimeter-level reconstruction accuracy, excels in hand motion generation and instruction following, and scales well with model/data size.

Conclusion: Being-H0 shows promising real-world robotic manipulation performance, validated by physical instruction tuning.

Abstract: We introduce Being-H0, a dexterous Vision-Language-Action model (VLA) trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks requiring high dexterity and generalize poorly to novel scenarios and tasks, primarily due to their reliance on synthetic data with significant sim-to-real gaps or teleoperated demonstrations lacking scale and diversity. To address this data bottleneck, we propose leveraging human hands as a foundation manipulator, capitalizing on the rich dexterity and scalability present in web data. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. Additionally, we introduce a part-level motion tokenization method which achieves millimeter-level reconstruction accuracy to model precise hand trajectories for action learning. To support our proposed paradigm, we further develop a comprehensive data curation pipeline that integrates heterogeneous sources – including motion capture, VR, and RGB-only videos – into a large-scale dataset with millions of motion-based instructional instances. We empirically show the excellence of Being-H0 in hand motion generation and instruction following, and it also scales well with model and data sizes. Importantly, we observe the expected gains of Being-H0 in real-world robotic manipulation as physical instruction tuning is applied. More details are available at https://beingbeyond.github.io/Being-H0.

[270] SurfaceSplat: Connecting Surface Reconstruction and Gaussian Splatting

Zihui Gao, Jia-Wang Bian, Guosheng Lin, Hao Chen, Chunhua Shen

Main category: cs.CV

TL;DR: A hybrid method combining SDF and 3DGS improves surface reconstruction and novel view rendering by leveraging coarse geometry and fine details.

DetailsMotivation: Addressing the limitations of SDF (lacking fine details) and 3DGS (lacking global coherence) in sparse-view image tasks.

Method: Combines SDF for coarse geometry and 3DGS for detail refinement, using rendered images from 3DGS to enhance SDF accuracy.

Result: Outperforms state-of-the-art methods on DTU and MobileBrick datasets.

Conclusion: The hybrid approach effectively balances geometry and detail, advancing sparse-view reconstruction and rendering.

Abstract: Surface reconstruction and novel view rendering from sparse-view images are challenging. Signed Distance Function (SDF)-based methods struggle with fine details, while 3D Gaussian Splatting (3DGS)-based approaches lack global geometry coherence. We propose a novel hybrid method that combines the strengths of both approaches: SDF captures coarse geometry to enhance 3DGS-based rendering, while newly rendered images from 3DGS refine the details of SDF for accurate surface reconstruction. As a result, our method surpasses state-of-the-art approaches in surface reconstruction and novel view synthesis on the DTU and MobileBrick datasets. Code will be released at https://github.com/Gaozihui/SurfaceSplat.

[271] CylinderPlane: Nested Cylinder Representation for 3D-aware Image Generation

Ru Jia, Xiaozhuang Ma, Jianji Wang, Nanning Zheng

Main category: cs.CV

TL;DR: The paper introduces CylinderPlane, a cylindrical coordinate-based implicit representation, to address multi-face artifacts in Tri-plane and enable high-quality 360° image synthesis.

DetailsMotivation: Tri-plane representation causes multi-face artifacts due to shared features in symmetric regions, limiting 360° view generation.

Method: Proposes CylinderPlane, a cylindrical coordinate system, to eliminate feature ambiguity and ensure multi-view consistency. Introduces nested cylinders for multi-scale feature capture.

Result: Achieves superior performance in 360° image synthesis, with fine detail learning and robustness to varying resolutions.

Conclusion: CylinderPlane outperforms previous methods, offering a versatile solution for neural rendering pipelines.

Abstract: While the proposal of the Tri-plane representation has advanced the development of 3D-aware image generative models, problems rooted in its inherent structure, such as multi-face artifacts caused by sharing the same features in symmetric regions, limit its ability to generate 360$^\circ$ view images. In this paper, we propose CylinderPlane, a novel implicit representation based on the cylindrical coordinate system, to eliminate the feature ambiguity issue and ensure multi-view consistency in 360$^\circ$. Unlike the inevitable feature entanglement in the Cartesian coordinate-based Tri-plane representation, the cylindrical coordinate system explicitly separates features at different angles, making it possible for our cylindrical representation to achieve high-quality, artifact-free 360$^\circ$ image synthesis. We further introduce a nested cylinder representation that composites multiple cylinders at different scales, making the model more adaptable to complex geometry and varying resolutions. The combination of cylinders with different resolutions can effectively capture critical locations and multi-scale features, greatly facilitating fine-detail learning and robustness to different resolutions. Moreover, our representation is agnostic to the implicit rendering method and can be easily integrated into any neural rendering pipeline. Extensive experiments on both synthetic datasets and unstructured in-the-wild images demonstrate that our proposed representation achieves superior performance over previous methods.
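
To make the representation concrete, the sketch below shows how a 3D point could be mapped to cylindrical coordinates and used to index nested cylindrical feature planes, in the spirit of the paper. It is a minimal illustration under assumed conventions (points normalized to [-1, 1]^3, features indexed by angle and height only, radius handling omitted), not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sample_cylinder_features(points, cyl_planes):
    """Look up features for 3D points on nested cylindrical feature planes.

    points:     (N, 3) xyz coordinates, assumed normalized to [-1, 1]^3.
    cyl_planes: list of (1, C, H, W) feature maps, one per nested cylinder;
                W indexes the angle theta and H the height along the axis.
    Returns (N, C * len(cyl_planes)) concatenated multi-scale features.
    """
    x, y, z = points.unbind(-1)
    theta = torch.atan2(y, x) / torch.pi                 # angle in [-1, 1]
    grid = torch.stack([theta, z], dim=-1).view(1, 1, -1, 2)
    feats = []
    for plane in cyl_planes:
        f = F.grid_sample(plane, grid, mode="bilinear", align_corners=True)
        feats.append(f.view(plane.shape[1], -1).t())     # (N, C)
    return torch.cat(feats, dim=-1)

# Toy usage: two nested cylinders at different resolutions.
planes = [torch.randn(1, 16, 64, 128), torch.randn(1, 16, 32, 64)]
pts = torch.rand(1024, 3) * 2 - 1
print(sample_cylinder_features(pts, planes).shape)       # torch.Size([1024, 32])
```

Unlike a Cartesian Tri-plane, points mirrored through the axis land at different angles, so symmetric front/back regions no longer share features.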

[272] A Survey on Efficiency Optimization Techniques for DNN-based Video Analytics: Process Systems, Algorithms, and Applications

Shanjiang Tang, Rui Huang, Hsinyu Luo, Chunjiang Wang, Ce Yu, Yusen Li, Hao Fu, Chao Sun, and Jian Xiao

Main category: cs.CV

TL;DR: This survey reviews efficiency optimization techniques for DNNs in video analytics, covering hardware, data processing, and deployment, and discusses challenges.

DetailsMotivation: The rapid growth of video data demands efficient and accurate analytics, but improving DNN efficiency remains a challenge.

Method: The paper organizes existing methods in a bottom-up manner, addressing hardware support, data processing, and operational deployment.

Result: A comprehensive review of efficiency optimization techniques for DNNs in video analytics is provided.

Conclusion: The survey highlights challenges and future directions for optimizing DNN performance in video analytics.

Abstract: The explosive growth of video data in recent years has brought higher demands for video analytics, where accuracy and efficiency remain the two primary concerns. Deep neural networks (DNNs) have been widely adopted to ensure accuracy; however, improving their efficiency in video analytics remains an open challenge. Unlike existing surveys, which summarize DNN-based video analytics mainly from the accuracy-optimization perspective, this survey aims to provide a thorough review of optimization techniques focused on improving the efficiency of DNNs in video analytics. We organize existing methods in a bottom-up manner, covering multiple perspectives such as hardware support, data processing, and operational deployment. Finally, based on this optimization framework and existing works, we analyze and discuss the problems and challenges in the performance optimization of DNN-based video analytics.

[273] Experimenting active and sequential learning in a medieval music manuscript

Sachin Sharma, Federico Simonetta, Michele Flammini

Main category: cs.CV

TL;DR: The paper explores Active Learning (AL) and Sequential Learning (SL) for OMR in medieval music manuscripts using YOLOv8, achieving comparable accuracy to full supervision with fewer labels, though uncertainty-based AL was ineffective.

DetailsMotivation: Addressing the scarcity of annotated data and complexity of historical manuscripts in OMR for cultural heritage digitization.

Method: Uses YOLOv8 for object detection and layout recognition, selecting uncertain samples for iterative labeling and retraining, starting with one annotated image.

Result: Achieves accuracy close to full supervision with fewer labels, but uncertainty-based AL was ineffective in the tested manuscript.

Conclusion: Highlights the need for more usable methods in data-scarcity scenarios, despite the success of the overall approach.

Abstract: Optical Music Recognition (OMR) is a cornerstone of music digitization initiatives in cultural heritage, yet it remains limited by the scarcity of annotated data and the complexity of historical manuscripts. In this paper, we present a preliminary study of Active Learning (AL) and Sequential Learning (SL) tailored for object detection and layout recognition in a medieval music manuscript. Leveraging YOLOv8, our system selects the samples with the highest uncertainty (lowest prediction confidence) for iterative labeling and retraining. Our approach starts with a single annotated image and successfully boosts performance while minimizing manual labeling. Experimental results indicate that accuracy comparable to fully supervised training can be achieved with significantly fewer labeled examples. We test the methodology as a preliminary investigation on a novel dataset offered to the community by the Anonymous project, which studies laude, a poetical-musical genre spread across Italy during the 12th-16th centuries. We show that, for the manuscript at hand, uncertainty-based AL is not effective, and we advocate for more usable methods in data-scarcity scenarios.
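
The core selection step is simple to state in code. Below is a minimal sketch of lowest-confidence sampling; `predict_confidences` is a hypothetical wrapper that returns the per-box confidence scores of a detector such as YOLOv8 for one image.

```python
def select_most_uncertain(predict_confidences, unlabeled_pool, k=5):
    """Pick the k pool images with the lowest mean detection confidence.

    predict_confidences: callable mapping an image to a list of per-box
        confidences (hypothetical wrapper around the detector).
    unlabeled_pool: list of candidate images awaiting annotation.
    """
    def mean_confidence(img):
        confs = predict_confidences(img)
        # Images with no detections are treated as maximally uncertain.
        return sum(confs) / len(confs) if confs else 0.0

    return sorted(unlabeled_pool, key=mean_confidence)[:k]
```

Each AL round labels the selected images, retrains, and repeats; the paper's negative finding is that this uncertainty ranking was not effective on the manuscript studied.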

[274] Uncovering Critical Features for Deepfake Detection through the Lottery Ticket Hypothesis

Lisan Al Amin, Md. Ismail Hossain, Thanh Thi Nguyen, Tasnim Jahan, Mahbubul Islam, Faisal Quader

Main category: cs.CV

TL;DR: The study applies the Lottery Ticket Hypothesis (LTH) to deepfake detection, identifying efficient subnetworks (winning tickets) that maintain high accuracy even at high sparsity levels.

DetailsMotivation: Deepfake technology poses threats to information integrity, and current detection methods are resource-intensive and poorly understood.

Method: The study uses LTH-based iterative magnitude pruning on MesoNet, CNN-5, and ResNet-18 architectures, tested on OpenForensic and FaceForensics++ datasets.

Result: Pruned networks retain most of their accuracy: MesoNet keeps 56.2% accuracy at 80% sparsity (about 90% of its 62.6% baseline) with only ~3,000 parameters. LTH-based iterative magnitude pruning outperforms one-shot pruning.

Conclusion: LTH enables efficient, deployable deepfake detection systems, with winning tickets transferable across datasets.

Abstract: Recent advances in deepfake technology have created increasingly convincing synthetic media that poses significant challenges to information integrity and social trust. While current detection methods show promise, their underlying mechanisms remain poorly understood, and the large sizes of their models make them challenging to deploy in resource-limited environments. This study investigates the application of the Lottery Ticket Hypothesis (LTH) to deepfake detection, aiming to identify the key features crucial for recognizing deepfakes. We examine how neural networks can be efficiently pruned while maintaining high detection accuracy. Through extensive experiments with MesoNet, CNN-5, and ResNet-18 architectures on the OpenForensic and FaceForensics++ datasets, we find that deepfake detection networks contain winning tickets, i.e., subnetworks that preserve performance even at substantial sparsity levels. Our results indicate that MesoNet retains 56.2% accuracy at 80% sparsity on the OpenForensic dataset, about 90% of its baseline accuracy (62.6%), using only 3,000 parameters. The results also show that our proposed LTH-based iterative magnitude pruning approach consistently outperforms one-shot pruning methods. Using Grad-CAM visualization, we analyze how pruned networks maintain their focus on critical facial regions for deepfake detection. Additionally, we demonstrate the transferability of winning tickets across datasets, suggesting potential for efficient, deployable deepfake detection systems.
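
For context, lottery-ticket search is usually implemented as iterative magnitude pruning with weight rewinding. The sketch below uses PyTorch's pruning utilities; `train_fn` is a hypothetical training routine, and details such as the per-round pruning fraction are illustrative rather than the paper's exact settings.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def find_winning_ticket(model, train_fn, rounds=5, amount=0.2):
    """Train, globally prune the smallest-magnitude weights, rewind the
    survivors to their initial values, and repeat."""
    init_state = copy.deepcopy(model.state_dict())
    prunable = [(name, mod) for name, mod in model.named_modules()
                if isinstance(mod, (nn.Conv2d, nn.Linear))]
    for _ in range(rounds):
        train_fn(model)                                  # train to convergence
        prune.global_unstructured(
            [(mod, "weight") for _, mod in prunable],
            pruning_method=prune.L1Unstructured, amount=amount)
        with torch.no_grad():                            # rewind, keep masks
            for name, mod in prunable:
                mod.weight_orig.copy_(init_state[name + ".weight"])
    return model
```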

[275] Extracting Visual Facts from Intermediate Layers for Mitigating Hallucinations in Multimodal Large Language Models

Haoran Zhou, Zihan Zhang, Hao Chen

Main category: cs.CV

TL;DR: EVA is a training-free method to reduce object hallucinations in MLLMs by dynamically selecting intermediate layers with visual factual information and correcting output logits.

DetailsMotivation: MLLMs struggle with object hallucinations due to prior knowledge suppressing visual information, especially in intermediate layers.

Method: EVA selects layers with significant visual factual information, contrasts distributions, and corrects output logits.

Result: EVA significantly reduces hallucination rates compared to baselines.

Conclusion: EVA is effective, model-agnostic, and integrates seamlessly with existing decoding strategies.

Abstract: Multimodal Large Language Models (MLLMs) have made significant strides by combining visual recognition and language understanding to generate content that is both coherent and contextually accurate. However, MLLMs continue to struggle with object hallucinations, where models produce seemingly plausible but factually incorrect outputs, including objects that do not exist in the image. Recent work has revealed that the prior knowledge in MLLMs significantly suppresses visual information in deep layers, causing hallucinatory outputs. However, how these priors suppress visual information at the intermediate layer stage in MLLMs remains unclear. We observe that visual factual knowledge and the differences between intermediate-layer prior/original probability distributions show similar evolutionary trends in intermediate layers. Motivated by this, we introduce Decoding by Extracting Visual Facts (EVA), a simple, training-free method that dynamically selects intermediate layers with the most significant visual factual information. By contrasting the output distributions of the selected layer derived from the original input and pure-text input, EVA extracts visual factual knowledge and proportionally incorporates it into the final layer to correct the output logits. Importantly, EVA is model-agnostic, seamlessly integrates with various classic decoding strategies, and is applicable across different MLLMs. We validate EVA on widely-used benchmarks, and the results show that it significantly reduces hallucination rates compared to baseline methods, underscoring its effectiveness in mitigating hallucinations.
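
The correction itself is a simple logit-space operation once a layer has been selected. The sketch below shows the contrast-and-add step; reading per-layer logits (e.g., by applying the unembedding to intermediate hidden states), the dynamic layer selection, and the mixing weight `alpha` are simplifications of the paper's procedure.

```python
import torch

def eva_correct_logits(final_logits, layer_logits_visual, layer_logits_text,
                       alpha=0.5):
    """Add a selected intermediate layer's visual-minus-text logit difference
    onto the final-layer logits.

    final_logits:        (vocab,) last-layer logits for the full input.
    layer_logits_visual: (vocab,) chosen-layer logits, image + text input.
    layer_logits_text:   (vocab,) chosen-layer logits, text-only input.
    """
    visual_facts = layer_logits_visual - layer_logits_text
    return final_logits + alpha * visual_facts
```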

[276] HW-MLVQA: Elucidating Multilingual Handwritten Document Understanding with a Comprehensive VQA Benchmark

Aniket Pal, Ajoy Mondal, Minesh Mathew, C. V. Jawahar

Main category: cs.CV

TL;DR: HW-MLVQA is a new benchmark for multilingual handwritten document comprehension, addressing gaps in current MLVQA models. It includes 1,600 handwritten pages and 2,400 Q&A pairs, evaluated across text, image, and integrated modalities.

DetailsMotivation: Current MLVQA models underperform with handwritten documents. HW-MLVQA aims to fill this gap by providing a robust benchmark for multilingual handwritten comprehension.

Method: HW-MLVQA introduces 1,600 handwritten pages and 2,400 Q&A pairs, evaluated across text, image, and integrated modalities. It also tests OCR models in real-world scenarios without ground truth transcriptions.

Result: The benchmark enables rigorous evaluation of multilingual handwritten document interpretation, fostering advancements in this domain.

Conclusion: HW-MLVQA is a pioneering benchmark designed to drive innovation in multilingual handwritten document comprehension.

Abstract: The proliferation of MultiLingual Visual Question Answering (MLVQA) benchmarks augments the capabilities of large language models (LLMs) and multi-modal LLMs, thereby enabling them to adeptly capture the intricate linguistic subtleties and visual complexities inherent across diverse languages. Despite this potential, current MLVQA models struggle to fully utilize their capabilities when dealing with the extensive variety of handwritten documents. This article delineates HW-MLVQA, an avant-garde VQA benchmark meticulously crafted to mitigate the dearth of authentic Multilingual Handwritten document comprehension. HW-MLVQA encompasses an extensive collection of 1,600 handwritten pages complemented by 2,400 question-answer pairs. Furthermore, it provides a robust benchmark evaluation framework spanning three distinct modalities: text, image, and an integrated image & text modality. To simulate authentic real-world contexts devoid of ground truth textual transcriptions, we facilitate a rigorous assessment of proprietary and open-source OCR models. The benchmark aspires to facilitate pivotal advancements in multilingual handwritten document interpretation, fostering innovation and scholarly inquiry within this specialized domain.

[277] Visual-Language Model Knowledge Distillation Method for Image Quality Assessment

Yongkang Hou, Jiarun Song

Main category: cs.CV

TL;DR: A knowledge distillation method using CLIP for Image Quality Assessment (IQA) reduces model complexity and improves performance by leveraging quality-graded prompts and modality-adaptive distillation.

DetailsMotivation: Address CLIP's limitations in IQA (excessive parameters, poor local distortion identification) by distilling its knowledge into a more efficient model.

Method: Design quality-graded prompts, fine-tune CLIP, and use modality-adaptive distillation to transfer knowledge to a student model.

Result: Outperforms existing IQA methods while reducing model complexity, validated on multiple datasets.

Conclusion: The proposed method is effective and practical for IQA tasks.

Abstract: Image Quality Assessment (IQA) is a core task in computer vision. Multimodal methods based on vision-language models, such as CLIP, have demonstrated exceptional generalization capabilities in IQA tasks. To address the issues of excessive parameter burden and insufficient ability to identify local distorted features in CLIP for IQA, this study proposes a visual-language model knowledge distillation method aimed at guiding the training of models with architectural advantages using CLIP’s IQA knowledge. First, quality-graded prompt templates were designed to guide CLIP to output quality scores. Then, CLIP is fine-tuned to enhance its capabilities in IQA tasks. Finally, a modality-adaptive knowledge distillation strategy is proposed to achieve guidance from the CLIP teacher model to the student model. Our experiments were conducted on multiple IQA datasets, and the results show that the proposed method significantly reduces model complexity while outperforming existing IQA methods, demonstrating strong potential for practical deployment.
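
The quality-graded prompting idea can be illustrated with the Hugging Face CLIP API: similarities to a ladder of quality phrases are softmaxed and converted into an expected score. The five-level template below is an assumption standing in for the paper's designed templates, and the fine-tuning and distillation stages are omitted.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

levels = ["bad", "poor", "fair", "good", "perfect"]
prompts = [f"a photo of {q} quality" for q in levels]

def clip_quality_score(image: Image.Image) -> float:
    """Expected quality in [0, 1] from CLIP image-text similarities."""
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]     # (5,) similarities
    probs = logits.softmax(dim=-1)
    grades = torch.linspace(0.0, 1.0, len(levels))       # bad=0 ... perfect=1
    return float((probs * grades).sum())
```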

[278] Hi^2-GSLoc: Dual-Hierarchical Gaussian-Specific Visual Relocalization for Remote Sensing

Boni Hu, Zhenyu Xia, Lin Chen, Pengcheng Han, Shuhui Bu

Main category: cs.CV

TL;DR: Hi^2-GSLoc is a dual-hierarchical visual relocalization framework for remote sensing that uses 3D Gaussian Splatting as the scene representation, combining sparse-to-dense and coarse-to-fine pose estimation for accurate, scalable 6-DoF localization.

DetailsMotivation: Existing relocalization methods trade precision (image retrieval and pose regression) against computational complexity and scalability (SfM-based registration); remote sensing aggravates these issues with large-scale scenes, high altitude variations, and domain gaps in existing visual priors.

Method: A sparse stage uses a Gaussian-specific render-aware sampling strategy and a landmark-guided detector for initial pose estimation; a dense stage iteratively refines poses via coarse-to-fine rasterization matching with reliability verification. Partitioned Gaussian training, GPU-accelerated parallel matching, and dynamic memory management handle large-scale scenes.

Result: Delivers competitive localization accuracy, recall rate, and computational efficiency on simulation data, public datasets, and real flight experiments, while effectively filtering unreliable pose estimates.

Conclusion: 3DGS is an effective scene representation for practical visual relocalization in remote sensing applications.

Abstract: Visual relocalization, which estimates the 6-degree-of-freedom (6-DoF) camera pose from query images, is fundamental to remote sensing and UAV applications. Existing methods face inherent trade-offs: image-based retrieval and pose regression approaches lack precision, while structure-based methods that register queries to Structure-from-Motion (SfM) models suffer from computational complexity and limited scalability. These challenges are particularly pronounced in remote sensing scenarios due to large-scale scenes, high altitude variations, and domain gaps of existing visual priors. To overcome these limitations, we leverage 3D Gaussian Splatting (3DGS) as a novel scene representation that compactly encodes both 3D geometry and appearance. We introduce $\mathrm{Hi}^2$-GSLoc, a dual-hierarchical relocalization framework that follows a sparse-to-dense and coarse-to-fine paradigm, fully exploiting the rich semantic information and geometric constraints inherent in Gaussian primitives. To handle large-scale remote sensing scenarios, we incorporate partitioned Gaussian training, GPU-accelerated parallel matching, and dynamic memory management strategies. Our approach consists of two stages: (1) a sparse stage featuring a Gaussian-specific consistent render-aware sampling strategy and landmark-guided detector for robust and accurate initial pose estimation, and (2) a dense stage that iteratively refines poses through coarse-to-fine dense rasterization matching while incorporating reliability verification. Through comprehensive evaluation on simulation data, public datasets, and real flight experiments, we demonstrate that our method delivers competitive localization accuracy, recall rate, and computational efficiency while effectively filtering unreliable pose estimates. The results confirm the effectiveness of our approach for practical remote sensing applications.

[279] LINR-PCGC: Lossless Implicit Neural Representations for Point Cloud Geometry Compression

Wenjie Huang, Qi Yang, Shuting Xia, He Huang, Zhu Li, Yiling Xu

Main category: cs.CV

TL;DR: LINR-PCGC is the first INR-based lossless point cloud geometry compression method, improving encoding speed and reducing bitstream size compared to traditional and AI-based methods.

DetailsMotivation: Existing AI-based methods are limited by training data dependencies, and current INR methods only support lossy compression due to encoding time and decoder size constraints.

Method: Proposes a group-level coding framework with network initialization for faster encoding and a lightweight network using multiscale SparseConv for efficient inference.

Result: Reduces encoding time by ~60% and bitstream size by ~21% compared to G-PCC TMC13v23 and SparsePCGC.

Conclusion: LINR-PCGC achieves lossless compression with improved efficiency, making it practical for real-world deployment.

Abstract: Existing AI-based point cloud compression methods struggle with dependence on specific training data distributions, which limits their real-world deployment. Implicit Neural Representation (INR) methods solve the above problem by encoding overfitted network parameters to the bitstream, resulting in more distribution-agnostic results. However, due to limitations on encoding time and decoder size, current INR-based methods only consider lossy geometry compression. In this paper, we propose the first INR-based lossless point cloud geometry compression method, called Lossless Implicit Neural Representations for Point Cloud Geometry Compression (LINR-PCGC). To accelerate encoding, we design a group-level coding framework over groups of point clouds with an effective network initialization strategy, which reduces encoding time by around 60%. A lightweight coding network based on multiscale SparseConv, consisting of scale context extraction, child node prediction, and model compression modules, is proposed to realize fast inference and a compact decoder size. Experimental results show that our method consistently outperforms traditional and AI-based methods: for example, at convergence on the MVUB dataset, our method reduces the bitstream by approximately 21.21% compared to G-PCC TMC13v23 and 21.95% compared to SparsePCGC. Our project page is at https://huangwenjie2023.github.io/LINR-PCGC/.

[280] Efficient Face Image Quality Assessment via Self-training and Knowledge Distillation

Wei Sun, Weixia Zhang, Linhan Cao, Jun Jia, Xiangyang Zhu, Dandan Zhu, Xiongkuo Min, Guangtao Zhai

Main category: cs.CV

TL;DR: A computationally efficient FIQA method using teacher-student distillation and self-training, achieving high performance with low overhead.

DetailsMotivation: To address the computational complexity of FIQA algorithms for scalable and practical deployment.

Method: Two-stage approach: (1) train a teacher model using labeled data and self-training, (2) distill a lightweight student model using pseudo-labels from the teacher.

Result: Student model matches teacher performance with low computational cost; won ICCV 2025 VQualA FIQA Challenge.

Conclusion: The method is efficient, scalable, and effective for real-world FIQA applications.

Abstract: Face image quality assessment (FIQA) is essential for various face-related applications. Although FIQA has been extensively studied and achieved significant progress, the computational complexity of FIQA algorithms remains a key concern for ensuring scalability and practical deployment in real-world systems. In this paper, we aim to develop a computationally efficient FIQA method that can be easily deployed in real-world applications. Specifically, our method consists of two stages: training a powerful teacher model and distilling a lightweight student model from it. To build a strong teacher model, we adopt a self-training strategy to improve its capacity. We first train the teacher model using labeled face images, then use it to generate pseudo-labels for a set of unlabeled images. These pseudo-labeled samples are used in two ways: (1) to distill knowledge into the student model, and (2) to combine with the original labeled images to further enhance the teacher model through self-training. The enhanced teacher model is used to further pseudo-label another set of unlabeled images for distilling the student models. The student model is trained using a combination of labeled images, pseudo-labeled images from the original teacher model, and pseudo-labeled images from the enhanced teacher model. Experimental results demonstrate that our student model achieves comparable performance to the teacher model with an extremely low computational overhead. Moreover, our method achieved first place in the ICCV 2025 VQualA FIQA Challenge. The code is available at https://github.com/sunwei925/Efficient-FIQA.git.
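
The two-stage recipe reduces to a short pseudo-labeling pipeline. In this minimal sketch, `train`, `predict`, and the datasets are hypothetical placeholders; in practice each `train` call is a full training run.

```python
def self_train_and_distill(teacher, student, labeled, unlabeled_a,
                           unlabeled_b, train, predict):
    # Stage 1: train the teacher, then strengthen it via self-training.
    train(teacher, labeled)
    pseudo_a = [(x, predict(teacher, x)) for x in unlabeled_a]
    train(teacher, labeled + pseudo_a)        # enhanced teacher
    # Stage 2: the enhanced teacher pseudo-labels a second unlabeled set,
    # and the lightweight student learns from all label sources.
    pseudo_b = [(x, predict(teacher, x)) for x in unlabeled_b]
    train(student, labeled + pseudo_a + pseudo_b)
    return student
```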

[281] A Practical Investigation of Spatially-Controlled Image Generation with Transformers

Guoxuan Xia, Harleen Hanspal, Petru-Daniel Tudosiu, Shifeng Zhang, Sarah Parisot

Main category: cs.CV

TL;DR: The paper evaluates transformer-based systems for spatially-controlled image generation, comparing methods like diffusion-based, flow-based, and autoregressive models, and introduces control token prefilling as a strong baseline.

DetailsMotivation: The lack of detailed and fair comparisons in spatially-controlled image generation research motivates this work to clarify literature and address knowledge gaps.

Method: Controlled experiments on ImageNet with diffusion-based, flow-based, and autoregressive models, focusing on control token prefilling, classifier-free guidance, and softmax truncation.

Result: Control token prefilling is a strong baseline; classifier-free guidance and softmax truncation improve control-generation consistency; adapter-based approaches mitigate forgetting but underperform in consistency.

Conclusion: The study provides clear takeaways for transformer-based spatially-controlled generation, highlighting effective methods and addressing gaps in the literature.

Abstract: Enabling image generation models to be spatially controlled is an important area of research, empowering users to better generate images according to their own fine-grained specifications via e.g. edge maps, poses. Although this task has seen impressive improvements in recent times, a focus on rapidly producing stronger models has come at the cost of detailed and fair scientific comparison. Differing training data, model architectures and generation paradigms make it difficult to disentangle the factors contributing to performance. Meanwhile, the motivations and nuances of certain approaches become lost in the literature. In this work, we aim to provide clear takeaways across generation paradigms for practitioners wishing to develop transformer-based systems for spatially-controlled generation, clarifying the literature and addressing knowledge gaps. We perform controlled experiments on ImageNet across diffusion-based/flow-based and autoregressive (AR) models. First, we establish control token prefilling as a simple, general and performant baseline approach for transformers. We then investigate previously underexplored sampling time enhancements, showing that extending classifier-free guidance to control, as well as softmax truncation, have a strong impact on control-generation consistency. Finally, we re-clarify the motivation of adapter-based approaches, demonstrating that they mitigate “forgetting” and maintain generation quality when trained on limited downstream data, but underperform full training in terms of generation-control consistency. Code will be released upon publication.
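
Control token prefilling is literally a sequence concatenation: tokenized control signals (e.g., an encoded edge map) are prepended so every image token can attend to them. The sketch below shows the idea with a generic encoder; shapes and the module layout are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PrefilledTransformer(nn.Module):
    def __init__(self, dim=512, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, control_tokens, image_tokens):
        # Prefix (B, Lc, D) control tokens to (B, Li, D) image tokens.
        x = torch.cat([control_tokens, image_tokens], dim=1)
        out = self.blocks(x)
        return out[:, control_tokens.shape[1]:]   # keep image positions only

model = PrefilledTransformer()
ctrl = torch.randn(2, 64, 512)   # tokenized control condition (e.g., edges)
img = torch.randn(2, 256, 512)   # image/latent tokens
print(model(ctrl, img).shape)    # torch.Size([2, 256, 512])
```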

[282] TokensGen: Harnessing Condensed Tokens for Long Video Generation

Wenqi Ouyang, Zeqi Xiao, Danni Yang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, Xingang Pan

Main category: cs.CV

TL;DR: TokensGen is a two-stage framework using condensed tokens to generate long videos, addressing memory bottlenecks and inconsistency by decomposing tasks into semantic control, consistency, and smooth transitions.

DetailsMotivation: The challenge of generating consistent long videos with diffusion models due to memory issues and long-term inconsistency.

Method: A two-stage approach: (1) To2V for short video diffusion with tokens, (2) T2To for token diffusion ensuring global consistency, and adaptive FIFO-Diffusion for smooth transitions.

Result: Improved long-term temporal and content coherence without excessive computational cost.

Conclusion: TokensGen offers a scalable, modular solution for long video generation, enabling new applications in storytelling and simulations.

Abstract: Generating consistent long videos is a complex challenge: while diffusion-based generative models generate visually impressive short clips, extending them to longer durations often leads to memory bottlenecks and long-term inconsistency. In this paper, we propose TokensGen, a novel two-stage framework that leverages condensed tokens to address these issues. Our method decomposes long video generation into three core tasks: (1) inner-clip semantic control, (2) long-term consistency control, and (3) inter-clip smooth transition. First, we train To2V (Token-to-Video), a short video diffusion model guided by text and video tokens, with a Video Tokenizer that condenses short clips into semantically rich tokens. Second, we introduce T2To (Text-to-Token), a video token diffusion transformer that generates all tokens at once, ensuring global consistency across clips. Finally, during inference, an adaptive FIFO-Diffusion strategy seamlessly connects adjacent clips, reducing boundary artifacts and enhancing smooth transitions. Experimental results demonstrate that our approach significantly enhances long-term temporal and content coherence without incurring prohibitive computational overhead. By leveraging condensed tokens and pre-trained short video models, our method provides a scalable, modular solution for long video generation, opening new possibilities for storytelling, cinematic production, and immersive simulations. Please see our project page at https://vicky0522.github.io/tokensgen-webpage/ .

[283] Appearance Harmonization via Bilateral Grid Prediction with Transformers for 3DGS

Jisu Shin, Richard Shaw, Seunghyun Shin, Anton Pelykh, Zhensong Zhang, Hae-Gon Jeon, Eduardo Perez-Pellitero

Main category: cs.CV

TL;DR: A transformer-based method predicts bilateral grids to correct photometric inconsistencies in multi-view scenes, improving novel view synthesis without scene-specific retraining.

DetailsMotivation: Photometric inconsistencies from camera pipelines degrade multi-view consistency and novel view synthesis quality. Existing methods increase computational complexity.

Method: Uses a transformer to predict spatially adaptive bilateral grids for photometric correction, integrated into the 3D Gaussian Splatting pipeline.

Result: Outperforms or matches scene-specific methods in reconstruction fidelity and convergence speed.

Conclusion: The proposed method enables robust cross-scene generalization and maintains high training efficiency.

Abstract: Modern camera pipelines apply extensive on-device processing, such as exposure adjustment, white balance, and color correction, which, while beneficial individually, often introduce photometric inconsistencies across views. These appearance variations violate multi-view consistency and degrade the quality of novel view synthesis. Joint optimization of scene representations and per-image appearance embeddings has been proposed to address this issue, but at the cost of increased computational complexity and slower training. In this work, we propose a transformer-based method that predicts spatially adaptive bilateral grids to correct photometric variations in a multi-view consistent manner, enabling robust cross-scene generalization without the need for scene-specific retraining. By incorporating the learned grids into the 3D Gaussian Splatting pipeline, we improve reconstruction quality while maintaining high training efficiency. Extensive experiments show that our approach outperforms or matches existing scene-specific optimization methods in reconstruction fidelity and convergence speed.
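
For intuition, applying a predicted bilateral grid amounts to "slicing" a low-resolution 3D grid of affine color transforms at each pixel's position and luminance, HDRNet-style. The sketch below shows that slicing step only; the transformer that predicts the grid and its integration into 3DGS training are omitted, and the conventions (guidance = mean luminance, one 3x4 affine per cell) are assumptions.

```python
import torch
import torch.nn.functional as F

def apply_bilateral_grid(image, grid):
    """image: (B, 3, H, W) in [0, 1]; grid: (B, 12, D, Gh, Gw), where each
    cell stores a 3x4 affine color transform and D bins the guidance value."""
    B, _, H, W = image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    guide = image.mean(dim=1) * 2 - 1                  # luminance in [-1, 1]
    coords = torch.stack([xs.expand(B, H, W), ys.expand(B, H, W), guide], -1)
    affine = F.grid_sample(grid, coords.unsqueeze(1), align_corners=True)
    affine = affine.squeeze(2).view(B, 3, 4, H, W)     # per-pixel 3x4 affine
    rgb1 = torch.cat([image, torch.ones_like(image[:, :1])], dim=1)
    return torch.einsum("bijhw,bjhw->bihw", affine, rgb1)

out = apply_bilateral_grid(torch.rand(2, 3, 64, 64),
                           torch.randn(2, 12, 8, 16, 16))
print(out.shape)                                       # torch.Size([2, 3, 64, 64])
```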

[284] Learning from Heterogeneity: Generalizing Dynamic Facial Expression Recognition via Distributionally Robust Optimization

Feng-Qi Cui, Anyang Tong, Jinyang Huang, Jie Zhang, Dan Guo, Zhi Liu, Meng Wang

Main category: cs.CV

TL;DR: A novel framework (HDF) with two modules (DAM and DSM) improves DFER by addressing sample heterogeneity and optimization imbalance, achieving better accuracy and robustness.

DetailsMotivation: Existing DFER methods degrade under sample heterogeneity from multi-source data and individual variability.

Method: Proposes HDF with DAM for time-frequency modeling and DSM for adaptive loss balancing.

Result: HDF outperforms on DFEW and FERV39k datasets, improving WAR and UAR with strong generalization.

Conclusion: HDF effectively enhances DFER performance and robustness, with code publicly available.

Abstract: Dynamic Facial Expression Recognition (DFER) plays a critical role in affective computing and human-computer interaction. Although existing methods achieve comparable performance, they inevitably suffer from performance degradation under sample heterogeneity caused by multi-source data and individual expression variability. To address these challenges, we propose a novel framework, called Heterogeneity-aware Distributional Framework (HDF), and design two plug-and-play modules to enhance time-frequency modeling and mitigate optimization imbalance caused by hard samples. Specifically, the Time-Frequency Distributional Attention Module (DAM) captures both temporal consistency and frequency robustness through a dual-branch attention design, improving tolerance to sequence inconsistency and visual style shifts. Then, based on gradient sensitivity and information bottleneck principles, an adaptive optimization module Distribution-aware Scaling Module (DSM) is introduced to dynamically balance classification and contrastive losses, enabling more stable and discriminative representation learning. Extensive experiments on two widely used datasets, DFEW and FERV39k, demonstrate that HDF significantly improves both recognition accuracy and robustness. Our method achieves superior weighted average recall (WAR) and unweighted average recall (UAR) while maintaining strong generalization across diverse and imbalanced scenarios. Codes are released at https://github.com/QIcita/HDF_DFER.
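
The abstract gives only the principle behind DSM (gradient sensitivity), so the following is a deliberately simplified stand-in rather than the paper's module: each loss is weighted inversely to the norm of its gradient on shared parameters, so the harder-pulling objective is damped.

```python
import torch

def balanced_loss(loss_cls, loss_con, shared_params, eps=1e-8):
    """Weight two losses inversely to their gradient norms on shared params."""
    g_cls = torch.autograd.grad(loss_cls, shared_params, retain_graph=True)
    g_con = torch.autograd.grad(loss_con, shared_params, retain_graph=True)
    n_cls = torch.cat([g.flatten() for g in g_cls]).norm()
    n_con = torch.cat([g.flatten() for g in g_con]).norm()
    total = n_cls + n_con + eps
    # Larger gradient norm -> smaller weight, so neither objective dominates.
    return (n_con / total).detach() * loss_cls \
         + (n_cls / total).detach() * loss_con
```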

[285] Label tree semantic losses for rich multi-class medical image segmentation

Junwen Wang, Oscar MacCormac, William Rochford, Aaron Kujawa, Jonathan Shapey, Tom Vercauteren

Main category: cs.CV

TL;DR: The paper introduces tree-based semantic loss functions for medical image segmentation, leveraging hierarchical label organization to improve accuracy, especially for subtle class distinctions. It achieves state-of-the-art results in brain MRI and neurosurgical HSI tasks.

DetailsMotivation: Current methods penalize all segmentation errors equally, ignoring inter-class semantics, which becomes problematic with rich, nuanced labels.

Method: Proposes two tree-based semantic loss functions and integrates them with sparse, background-free annotation training.

Result: Achieves state-of-the-art performance in whole brain parcellation (WBP) and neurosurgical hyperspectral imaging (HSI) segmentation.

Conclusion: The proposed hierarchical loss functions enhance segmentation accuracy, particularly for complex label spaces, and are effective in both fully and sparsely annotated scenarios.

Abstract: Rich and accurate medical image segmentation is poised to underpin the next generation of AI-defined clinical practice by delineating critical anatomy for pre-operative planning, guiding real-time intra-operative navigation, and supporting precise post-operative assessment. However, commonly used learning methods for medical and surgical imaging segmentation tasks penalise all errors equivalently and thus fail to exploit any inter-class semantics in the label space. This becomes particularly problematic as the cardinality and richness of labels increase to include subtly different classes. In this work, we propose two tree-based semantic loss functions which take advantage of a hierarchical organisation of the labels. We further incorporate our losses in a recently proposed approach for training with sparse, background-free annotations to extend the applicability of our proposed losses. Extensive experiments are reported on two medical and surgical image segmentation tasks, namely head MRI for whole brain parcellation (WBP) with full supervision and neurosurgical hyperspectral imaging (HSI) for scene understanding with sparse annotations. Results demonstrate that our proposed method reaches state-of-the-art performance in both cases.
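
One simple instantiation of a label-tree loss is to sum leaf probabilities into their ancestors and apply a cross-entropy at every level, so that confusing two sibling classes is penalized less than confusing distant ones. The sketch below does this for a two-level tree; the depth and level weights are illustrative assumptions, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def tree_semantic_loss(logits, leaf_target, leaf_to_parent,
                       level_weights=(1.0, 1.0)):
    """logits: (B, L) over leaf classes; leaf_target: (B,) leaf indices;
    leaf_to_parent: (L,) long tensor giving each leaf's parent node."""
    leaf_probs = logits.softmax(dim=-1)
    # Level 0: ordinary cross-entropy over the leaves.
    loss = level_weights[0] * F.nll_loss(
        leaf_probs.clamp_min(1e-8).log(), leaf_target)
    # Level 1: aggregate leaf probabilities into their parents and apply
    # a cross-entropy against the target's parent node.
    n_parents = int(leaf_to_parent.max()) + 1
    parent_probs = leaf_probs.new_zeros(logits.shape[0], n_parents)
    parent_probs.index_add_(1, leaf_to_parent, leaf_probs)
    loss = loss + level_weights[1] * F.nll_loss(
        parent_probs.clamp_min(1e-8).log(), leaf_to_parent[leaf_target])
    return loss
```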

[286] Regularized Low-Rank Adaptation for Few-Shot Organ Segmentation

Ghassen Baklouti, Julio Silva-Rodríguez, Jose Dolz, Houda Bahig, Ismail Ben Ayed

Main category: cs.CV

TL;DR: A novel PEFT method for medical image segmentation dynamically adjusts rank during adaptation, outperforming standard LoRA and other methods.

DetailsMotivation: Address the challenge of selecting a fixed rank in LoRA for medical imaging tasks by introducing dynamic rank adjustment.

Method: Introduces an l_1 sparsity regularizer to the loss function, optimized with a proximal optimizer, enabling automatic task-adapted rank selection.

Result: Significant performance improvements in few-shot fine-tuning, demonstrating efficiency and robustness against suboptimal rank initialization.

Conclusion: The proposed method enhances LoRA for medical imaging by dynamically adjusting rank, improving performance and adaptability.

Abstract: Parameter-efficient fine-tuning (PEFT) of pre-trained foundation models is increasingly attracting interest in medical imaging due to its effectiveness and computational efficiency. Among these methods, Low-Rank Adaptation (LoRA) is a notable approach based on the assumption that the adaptation inherently occurs in a low-dimensional subspace. While it has shown good performance, its implementation requires a fixed and unalterable rank, which might be challenging to select given the unique complexities and requirements of each medical imaging downstream task. Inspired by advancements in natural image processing, we introduce a novel approach for medical image segmentation that dynamically adjusts the intrinsic rank during adaptation. Viewing the low-rank representation of the trainable weight matrices as a singular value decomposition, we introduce an l_1 sparsity regularizer to the loss function, and tackle it with a proximal optimizer. The regularizer can be viewed as a penalty on the decomposition rank, so its minimization enables task-adapted ranks to be found automatically. Our method is evaluated in a realistic few-shot fine-tuning setting, where we compare it first to standard LoRA and then to several other PEFT methods across two distinct tasks: base organs and novel organs. Our extensive experiments demonstrate the significant performance improvements driven by our method, highlighting its efficiency and robustness against suboptimal rank initialization. Our code is publicly available: https://github.com/ghassenbaklouti/ARENA
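
The rank-selection mechanism can be sketched as a LoRA update factored through an explicit vector of "singular values" with the l_1 penalty handled by a proximal soft-threshold after each optimizer step; entries driven to zero shrink the effective rank. The parametrization and schedule below are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class AdaptiveRankLoRA(nn.Module):
    """Low-rank update delta_W = U diag(s) V with learnable 'singular values'."""
    def __init__(self, d_in, d_out, max_rank=16):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d_out, max_rank) * 0.01)
        self.s = nn.Parameter(torch.ones(max_rank))
        self.V = nn.Parameter(torch.randn(max_rank, d_in) * 0.01)

    def forward(self, x):                       # x: (B, d_in)
        return ((x @ self.V.t()) * self.s) @ self.U.t()

@torch.no_grad()
def prox_l1_step(lora, lam, lr):
    """Soft-threshold s after each gradient step: the proximal operator of
    the l1 penalty. Zeroed entries reduce the effective rank."""
    s = lora.s
    s.copy_(torch.sign(s) * (s.abs() - lr * lam).clamp_min(0.0))

lora = AdaptiveRankLoRA(128, 128)
# ... optimizer.step() on the task loss, then:
prox_l1_step(lora, lam=1e-3, lr=1e-2)
print(int((lora.s != 0).sum()))                 # current effective rank
```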

[287] Exploring Superposition and Interference in State-of-the-Art Low-Parameter Vision Models

Lilian Hollard, Lucas Mohimont, Nathalie Gaveau, Luiz-Angelo Steffenel

Main category: cs.CV

TL;DR: The paper explores low-parameter deep neural networks for computer vision, focusing on bottleneck architectures and superlinear activation functions. It addresses interference in feature maps, proposing design elements to reduce it, and introduces the NoDepth Bottleneck architecture for improved scaling and accuracy.

DetailsMotivation: To enhance the performance and scalability of low-parameter deep neural networks by addressing interference in feature maps, a challenge in bottleneck architectures.

Method: Examines bottleneck architectures and superlinear activation functions, identifies design elements to reduce interference, and proposes the NoDepth Bottleneck architecture.

Result: Demonstrates improved scaling and accuracy in low-parameter networks (under 1.5M parameters) on the ImageNet dataset.

Conclusion: The findings contribute to more efficient and scalable neural networks for low-parameter ranges and advance understanding of bottlenecks in computer vision.

Abstract: The paper investigates the performance of state-of-the-art low-parameter deep neural networks for computer vision, focusing on bottleneck architectures and their behavior using superlinear activation functions. We address interference in feature maps, a phenomenon associated with superposition, where neurons simultaneously encode multiple characteristics. Our research suggests that limiting interference can enhance scaling and accuracy in very low-scaled networks (under 1.5M parameters). We identify key design elements that reduce interference by examining various bottleneck architectures, leading to a more efficient neural network. Consequently, we propose a proof-of-concept architecture named NoDepth Bottleneck built on mechanistic insights from our experiments, demonstrating robust scaling accuracy on the ImageNet dataset. These findings contribute to more efficient and scalable neural networks for the low-parameter range and advance the understanding of bottlenecks in computer vision. https://caiac.pubpub.org/pub/3dh6rsel

[288] ConformalSAM: Unlocking the Potential of Foundational Segmentation Models in Semi-Supervised Semantic Segmentation with Conformal Prediction

Danhui Chen, Ziquan Liu, Chuxi Yang, Dan Wang, Yan Yan, Yi Xu, Xiangyang Ji

Main category: cs.CV

TL;DR: ConformalSAM leverages foundational segmentation models to address label scarcity in semi-supervised semantic segmentation by calibrating and filtering unreliable predictions, outperforming existing methods.

DetailsMotivation: High-quality annotated data for pixel-level tasks like semantic segmentation is costly. Semi-supervised methods and foundational models offer potential solutions.

Method: Uses SEEM (a SAM variant) for mask generation, then ConformalSAM calibrates and filters predictions via conformal prediction, combining early foundational model reliance with later self-reliance training.

Result: Outperforms recent SSSS methods on benchmarks and enhances other methods as a plug-in.

Conclusion: ConformalSAM effectively addresses label scarcity by reliably leveraging foundational models and adaptive training strategies.

Abstract: Pixel-level vision tasks, such as semantic segmentation, require extensive and high-quality annotated data, which is costly to obtain. Semi-supervised semantic segmentation (SSSS) has emerged as a solution to alleviate the labeling burden by leveraging both labeled and unlabeled data through self-training techniques. Meanwhile, foundational segmentation models pre-trained on massive data have shown the potential to generalize effectively across domains. This work explores whether a foundational segmentation model can address label scarcity in pixel-level vision tasks by acting as an annotator for unlabeled images. Specifically, we investigate the efficacy of using SEEM, a Segment Anything Model (SAM) variant fine-tuned for textual input, to generate predictive masks for unlabeled data. To address the shortcomings of using SEEM-generated masks as supervision, we propose ConformalSAM, a novel SSSS framework which first calibrates the foundation model using the target domain's labeled data and then filters out unreliable pixel labels of unlabeled data so that only high-confidence labels are used as supervision. By leveraging conformal prediction (CP) to adapt foundation models to target data through uncertainty calibration, ConformalSAM reliably exploits the strong capability of the foundational segmentation model, which benefits early-stage learning, while a subsequent self-reliance training strategy mitigates overfitting to SEEM-generated masks in the later training stage. Our experiments demonstrate that, on three standard SSSS benchmarks, ConformalSAM achieves superior performance compared to recent SSSS methods and helps boost their performance when used as a plug-in.
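
The filtering step can be illustrated with split conformal prediction: labeled pixels calibrate a score threshold with a target coverage, and an unlabeled pixel's pseudo-label is kept only when its conformal prediction set contains a single class. This is a minimal numpy sketch of that idea, simplified from the paper's procedure.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """cal_probs: (N, C) softmax outputs on labeled pixels; cal_labels: (N,).
    Returns the (1 - alpha) split-conformal score threshold."""
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level)

def confident_pseudo_labels(probs, qhat):
    """Keep a pixel only if its conformal prediction set is a singleton."""
    pred_sets = probs >= 1.0 - qhat            # (N, C) set membership
    keep = pred_sets.sum(axis=1) == 1          # unambiguous pixels only
    return probs.argmax(axis=1), keep
```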

[289] True Multimodal In-Context Learning Needs Attention to the Visual Context

Shuo Chen, Jianzhe Liu, Zhen Han, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, Jindong Gu

Main category: cs.CV

TL;DR: Current MLLMs struggle with visual cues in multimodal in-context learning (MICL). DARA and TrueMICL improve attention to visuals and evaluation reliability.

DetailsMotivation: Addressing MLLMs' over-reliance on text and neglect of visuals in MICL, limiting practical utility.

Method: Introduces Dynamic Attention Reallocation (DARA) for balanced visual-textual attention and TrueMICL dataset for explicit multimodal integration.

Result: DARA and TrueMICL significantly enhance true multimodal in-context learning capabilities.

Conclusion: The proposed solutions effectively improve MICL by addressing visual neglect and providing reliable evaluation.

Abstract: Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL): adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite showing noticeable improvement on standard vision-language datasets, current MLLMs struggle to leverage the visual information in the demonstrations. Specifically, they tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. This behavior makes MICL still unimodal and largely restricts its practical utility. More importantly, this limitation is often concealed by improved performance on tasks that do not require understanding the visual context. As a result, how to effectively enhance MICL ability and reliably evaluate MICL performance remains underexplored. To address these issues, we first introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context by rebalancing attention across visual and textual tokens. In addition, we present TrueMICL, an MICL-dedicated dataset with both support and test sets that explicitly requires the integration of multimodal information, particularly visual content, for correct task completion. Extensive experiments demonstrate the effectiveness of our holistic solution, showcasing substantial improvements in true multimodal in-context learning capabilities. Code and datasets are available at https://chenxshuo.github.io/true-micl-colm.

[290] Diffusion models for multivariate subsurface generation and efficient probabilistic inversion

Roberto Miele, Niklas Linde

Main category: cs.CV

TL;DR: Diffusion models outperform variational autoencoders and GANs in multivariate subsurface modeling, with improved robustness, sampling, and computational efficiency.

DetailsMotivation: To enhance multivariate subsurface modeling and probabilistic inversion using diffusion models, addressing limitations of existing methods like VAEs and GANs.

Method: Proposes corrections to Diffusion Posterior Sampling, including a noise-contamination-aware likelihood approximation, and tests in geological scenarios with hard and indirect data.

Result: Shows improved statistical robustness, better posterior sampling, and reduced computational costs compared to original methods.

Conclusion: Diffusion models are efficient for subsurface modeling, handling both hard and indirect data, and outperform traditional methods like MCMC.

Abstract: Diffusion models offer stable training and state-of-the-art performance for deep generative modeling tasks. Here, we consider their use in the context of multivariate subsurface modeling and probabilistic inversion. We first demonstrate that diffusion models enhance multivariate modeling capabilities compared to variational autoencoders and generative adversarial networks. In diffusion modeling, the generative process involves a comparatively large number of time steps with update rules that can be modified to account for conditioning data. We propose different corrections to the popular Diffusion Posterior Sampling approach by Chung et al. (2023). In particular, we introduce a likelihood approximation accounting for the noise-contamination that is inherent in diffusion modeling. We assess performance in a multivariate geological scenario involving facies and correlated acoustic impedance. Conditional modeling is demonstrated using both local hard data (well logs) and nonlinear geophysics (fullstack seismic data). Our tests show significantly improved statistical robustness, enhanced sampling of the posterior probability density function and reduced computational costs, compared to the original approach. The method can be used with both hard and indirect conditioning data, individually or simultaneously. As the inversion is included within the diffusion process, it is faster than other methods requiring an outer-loop around the generative model, such as Markov chain Monte Carlo.
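
As background, the Diffusion Posterior Sampling step being corrected can be written in a few lines: the unconditional denoising update is nudged by the gradient of a data-fit term evaluated at the predicted clean sample. In the sketch, `denoiser`, `forward_op`, and `ddpm_update` are hypothetical callables, and the paper's noise-aware likelihood correction is not included.

```python
import torch

def dps_step(x_t, t, y, denoiser, forward_op, ddpm_update, zeta=1.0):
    """One guided reverse-diffusion step in the style of DPS (Chung et al., 2023).

    denoiser(x_t, t) -> x0_hat predicts the clean sample; forward_op maps a
    sample to simulated measurements; ddpm_update performs the unconditional
    reverse step. All three are placeholders for the user's models.
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)
    residual = torch.linalg.vector_norm(y - forward_op(x0_hat))
    grad = torch.autograd.grad(residual, x_t)[0]   # likelihood gradient
    x_prev = ddpm_update(x_t, x0_hat, t)           # unconditional step
    return x_prev - zeta * grad                    # pull toward data consistency
```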

[291] Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models

Enes Sanli, Baris Sarper Tezcan, Aykut Erdem, Erkut Erdem

Main category: cs.CV

TL;DR: PhysVidBench is a benchmark to evaluate physical commonsense in text-to-video (T2V) models, addressing gaps in causality and object behavior.

DetailsMotivation: Current T2V models lack physical commonsense, producing unrealistic outputs. PhysVidBench aims to systematically evaluate these shortcomings.

Method: The benchmark uses 383 prompts to generate videos, followed by a three-stage evaluation: grounded physics questions, video captioning, and language model-based physics reasoning.

Result: PhysVidBench provides a structured, interpretable framework to assess physical plausibility in T2V models, focusing on tool use and material properties.

Conclusion: The benchmark highlights overlooked areas in T2V evaluations and offers a robust method to improve physical commonsense in generative video models.

Abstract: Recent progress in text-to-video (T2V) generation has enabled the synthesis of visually compelling and temporally coherent videos from natural language. However, these models often fall short in basic physical commonsense, producing outputs that violate intuitive expectations around causality, object behavior, and tool use. Addressing this gap, we present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of T2V systems. The benchmark includes 383 carefully curated prompts emphasizing tool use, material properties, and procedural interactions in domains where physical plausibility is crucial. For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline: (1) formulate grounded physics questions from the prompt, (2) caption the generated video with a vision-language model, and (3) task a language model with answering several physics-involved questions using only the caption. This indirect strategy circumvents common hallucination issues in direct video-based evaluation. By highlighting affordances and tool-mediated actions, areas overlooked in current T2V evaluations, PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.

[292] SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction

Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Songxin He, Jianfan Lin, Junsong Tang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang

Main category: cs.CV

TL;DR: SeC introduces a concept-driven framework for Video Object Segmentation (VOS), leveraging Large Vision-Language Models (LVLMs) for robust object-centric representations, outperforming SAM 2.1 by 11.8 points on the new SeCVOS benchmark.

DetailsMotivation: Current VOS methods rely on appearance matching, lacking human-like conceptual understanding, which limits performance under visual variations, occlusions, and complex scenes.

Method: SeC uses LVLMs to build high-level object representations, combining semantic reasoning with feature matching, and dynamically adjusts computation based on scene complexity.

Result: SeC achieves an 11.8-point improvement over SAM 2.1 on the SeCVOS benchmark, setting a new state-of-the-art.

Conclusion: SeC demonstrates the effectiveness of concept-driven approaches in VOS, particularly in complex scenarios, and introduces a challenging benchmark (SeCVOS) for future research.

Abstract: Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. Despite notable advances with recent efforts, current techniques still lag behind human capabilities in handling drastic visual variations, occlusions, and complex scene changes. This limitation arises from their reliance on appearance matching, neglecting the human-like conceptual understanding of objects that enables robust identification across temporal dynamics. Motivated by this gap, we propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. During inference, SeC forms a comprehensive semantic representation of the target based on processed frames, realizing robust segmentation of follow-up frames. Furthermore, SeC adaptively balances LVLM-based semantic reasoning with enhanced feature matching, dynamically adjusting computational efforts based on scene complexity. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state-of-the-art in concept-aware video object segmentation.

[293] Latent Denoising Makes Good Visual Tokenizers

Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, Yue Wang

Main category: cs.CV

TL;DR: The paper proposes aligning tokenizer embeddings with the denoising objective to improve generative modeling, introducing the Latent Denoising Tokenizer (l-DeTok), which outperforms standard tokenizers.

DetailsMotivation: Modern generative models use denoising (reconstructing clean signals from corrupted inputs), suggesting tokenizers should align with this objective for better performance.

Method: Introduces l-DeTok, a tokenizer trained to reconstruct clean images from corrupted latent embeddings using interpolative noise and random masking.

Result: l-DeTok consistently outperforms standard tokenizers across six generative models on ImageNet 256x256.

Conclusion: Denoising is a key principle for tokenizer design, offering new perspectives for future improvements.

Abstract: Despite their fundamental role, it remains unclear what properties could make visual tokenizers more effective for generative modeling. We observe that modern generative models share a conceptually similar training objective – reconstructing clean signals from corrupted inputs such as Gaussian noise or masking – a process we term denoising. Motivated by this insight, we propose aligning tokenizer embeddings directly with the downstream denoising objective, encouraging latent embeddings to be more easily reconstructed even when heavily corrupted. To achieve this, we introduce the Latent Denoising Tokenizer (l-DeTok), a simple yet effective tokenizer trained to reconstruct clean images from latent embeddings corrupted by interpolative noise and random masking. Extensive experiments on ImageNet 256x256 demonstrate that our tokenizer consistently outperforms standard tokenizers across six representative generative models. Our findings highlight denoising as a fundamental design principle for tokenizer development, and we hope it could motivate new perspectives for future tokenizer design.
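
The training-time corruption is easy to picture: latent tokens are interpolated toward Gaussian noise and a random subset is replaced by a mask embedding, and the decoder must still reconstruct the clean image. The sketch below shows only the corruption step; the noise level and mask ratio are illustrative defaults, not the paper's settings.

```python
import torch

def corrupt_latents(z, mask_token, max_noise=0.7, mask_ratio=0.3):
    """z: (B, N, D) latent tokens; mask_token: (D,) learnable embedding.
    Returns tokens with interpolative noise plus random masking applied."""
    gamma = torch.rand(z.shape[0], 1, 1) * max_noise      # per-sample level
    noisy = (1.0 - gamma) * z + gamma * torch.randn_like(z)
    keep = torch.rand(z.shape[:2]) > mask_ratio           # (B, N) keep mask
    return torch.where(keep.unsqueeze(-1), noisy, mask_token.expand_as(noisy))
```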

[294] Defective Convolutional Networks

Tiange Luo, Tianle Cai, Mengxiao Zhang, Siyu Chen, Di He, Liwei Wang

Main category: cs.CV

TL;DR: Defective CNNs improve robustness against adversarial attacks by reducing reliance on textural features and emphasizing shape information.

DetailsMotivation: Adversarial examples exploit textural vulnerabilities in CNNs, leading to incorrect predictions. Addressing this weakness is critical for model robustness.

Method: Integrate defective convolutional layers (with constant-function neurons) into standard CNNs to disrupt textural feature extraction, forcing reliance on shape features.

Result: Defective CNNs outperform standard CNNs in defending against black-box and transfer-based attacks, achieving state-of-the-art performance without adversarial training.

Conclusion: Defective CNNs offer a promising defense against adversarial attacks by shifting feature reliance from texture to shape, enhancing robustness.

Abstract: The robustness of convolutional neural networks (CNNs) has gained in importance on account of adversarial examples, i.e., inputs with well-designed perturbations added that are imperceptible to humans but can cause the model to predict incorrectly. Recent research suggests that the noise in adversarial examples breaks the textural structure, which eventually leads to wrong predictions. To mitigate the threat of such adversarial attacks, we propose defective convolutional networks that make predictions relying less on textural information and more on shape information by properly integrating defective convolutional layers into standard CNNs. The defective convolutional layers contain defective neurons whose activations are set to be a constant function. As defective neurons contain no information and differ greatly from the standard neurons in their spatial neighborhoods, textural features cannot be accurately extracted, so the model has to seek other features for classification, such as shape. We present extensive evidence to justify our proposal and demonstrate that defective CNNs can defend against black-box attacks better than standard CNNs. In particular, they achieve state-of-the-art performance against transfer-based attacks without any adversarial training being applied.
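
A defective layer can be emulated by multiplying a standard convolution's output with a fixed binary mask, so the masked neurons output a constant zero at fixed spatial positions. The sketch below assumes a known feature-map size and a uniform defect rate; both are simplifications of the paper's design.

```python
import torch
import torch.nn as nn

class DefectiveConv2d(nn.Module):
    """Conv layer where a fixed random subset of output activations is
    permanently zeroed, degrading texture extraction."""
    def __init__(self, in_ch, out_ch, k=3, defect_rate=0.3, feat_hw=32):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        mask = (torch.rand(1, out_ch, feat_hw, feat_hw) > defect_rate).float()
        self.register_buffer("mask", mask)   # fixed at init, never trained

    def forward(self, x):
        return self.conv(x) * self.mask      # defective neurons output 0

layer = DefectiveConv2d(3, 16)
print(layer(torch.randn(4, 3, 32, 32)).shape)   # torch.Size([4, 16, 32, 32])
```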

[295] Sports Re-ID: Improving Re-Identification Of Players In Broadcast Videos Of Team Sports

Bharath Comandur

Main category: cs.CV

TL;DR: The paper proposes a hierarchical data sampling method and centroid loss function to improve player re-identification in sports videos, achieving significant performance boosts without altering network architecture.

DetailsMotivation: Player re-identification in sports videos is challenging due to similar attire, limited samples, low resolution, occlusions, and fast movements.

Method: A hierarchical data sampling procedure and centroid loss function are introduced to enhance training-test distribution similarity and embedding centroid estimation.

Result: The approach improves mAP by 7-11.5 and R1 by 8.8-14.9, ranking highly in the SoccerNet Re-Identification Challenge 2022.

Conclusion: The method is effective for sports re-id, outperforming traditional loss functions and demonstrating strong performance in benchmarks.

Abstract: This work focuses on player re-identification in broadcast videos of team sports. Specifically, we focus on identifying the same player in images captured from different camera viewpoints during any given moment of a match. This task differs from traditional applications of person re-id in a few important ways. Firstly, players from the same team wear highly similar clothes, thereby making it harder to tell them apart. Secondly, there are only a small number of samples for each identity, which makes it harder to train a re-id system. Thirdly, the resolutions of the images are often quite low and vary a lot. This, combined with heavy occlusions and fast player movements, greatly increases the challenges for re-id. In this paper, we propose a simple but effective hierarchical data sampling procedure and a centroid loss function that, when used together, increase the mean average precision (mAP) by 7 - 11.5 and the rank-1 (R1) by 8.8 - 14.9 without any change in the network or hyper-parameters used. Our data sampling procedure improves the similarity of the training and test distributions, and thereby aids in creating better estimates of the centroids of the embeddings (or feature vectors). Surprisingly, our study shows that in the presence of severely limited data, as is the case for our application, a simple centroid loss function based on Euclidean distances significantly outperforms the popular triplet-centroid loss function. We show comparable improvements for both convolutional networks and vision transformers. Our approach is among the top ranked methods in the SoccerNet Re-Identification Challenge 2022 leaderboard (test-split) with a mAP of 86.0 and a R1 of 81.5. On the sequestered challenge split, we achieve an mAP of 84.9 and a R1 of 80.1. Research on re-id for sports-related applications is very limited, and our work presents one of the first discussions in the literature on this topic.
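
Since the abstract's key finding is that a plain Euclidean centroid loss beats the triplet-centroid loss under limited data, a minimal sketch of such a loss is given below; batch construction via the hierarchical sampling procedure is omitted.

```python
import torch

def centroid_loss(embeddings, labels):
    """Euclidean centroid loss (illustrative): pull each embedding toward
    the within-batch centroid of its identity.

    embeddings: (B, D) feature vectors; labels: (B,) player identities.
    """
    ids = labels.unique()
    loss = embeddings.new_zeros(())
    for pid in ids:
        feats = embeddings[labels == pid]                # samples of one player
        centroid = feats.mean(dim=0, keepdim=True)
        loss = loss + ((feats - centroid) ** 2).sum(dim=1).sqrt().mean()
    return loss / len(ids)
```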

[296] RACR-MIL: Rank-aware contextual reasoning for weakly supervised grading of squamous cell carcinoma using whole slide images

Anirudh Choudhary, Mosbah Aouad, Krishnakant Saboo, Angelina Hwang, Jacob Kechter, Blake Bordeaux, Puneet Bhullar, David DiCaudo, Steven Nelson, Nneka Comfere, Emma Johnson, Olayemi Sokumbi, Jason Sluzevich, Leah Swanson, Dennis Murphree, Aaron Mangold, Ravishankar Iyer

Main category: cs.CV

TL;DR: RACR-MIL is a weakly-supervised SCC grading method using attention-based multiple-instance learning, improving accuracy and clinical efficiency.

DetailsMotivation: SCC grading is challenging due to lack of reliable protocols and tissue heterogeneity.

Method: Uses a hybrid WSI graph and rank-ordering constraint in attention to prioritize higher-grade regions.

Result: Achieves 3-9% higher grading accuracy, better tumor localization, and improved clinical efficiency.

Conclusion: RACR-MIL shows promise as a clinical tool for SCC diagnosis and grading.

Abstract: Squamous cell carcinoma (SCC) is the most common cancer subtype, with an increasing incidence and a significant impact on cancer-related mortality. SCC grading using whole slide images is inherently challenging due to the lack of a reliable protocol and substantial tissue heterogeneity. We propose RACR-MIL, the first weakly-supervised SCC grading approach achieving robust generalization across multiple anatomies (skin, head and neck, lung). RACR-MIL is an attention-based multiple-instance learning framework that enhances grade-relevant contextual representation learning and addresses tumor heterogeneity through two key innovations: (1) a hybrid WSI graph that captures both local tissue context and non-local phenotypical dependencies between tumor regions, and (2) a rank-ordering constraint in the attention mechanism that consistently prioritizes higher-grade tumor regions, aligning with pathologists' diagnostic process. Our model achieves state-of-the-art performance across multiple SCC datasets, achieving 3-9% higher grading accuracy, resilience to class imbalance, and up to 16% improved tumor localization. In a pilot study, pathologists reported that RACR-MIL improved grading efficiency in 60% of cases, underscoring its potential as a clinically viable cancer diagnosis and grading assistant.
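
The rank-ordering constraint on attention can be written as a pairwise hinge penalty. The sketch below is one plausible form, assuming per-patch grade labels are available during training; the paper's exact formulation may differ.

```python
import torch

def rank_attention_loss(attn, grades, margin=0.1):
    """Encourage higher-grade patches to receive higher attention (sketch).

    attn: (N,) attention scores for the patches of one WSI.
    grades: (N,) integer grade labels per patch (assumed available).
    """
    higher = grades.unsqueeze(0) > grades.unsqueeze(1)   # pair (i, j): grade_j > grade_i
    diff = attn.unsqueeze(1) - attn.unsqueeze(0)         # attn_i - attn_j
    violations = torch.clamp(margin + diff, min=0.0)[higher]
    return violations.mean() if violations.numel() > 0 else attn.new_zeros(())
```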

[297] Learning from SAM: Harnessing a Foundation Model for Sim2Real Adaptation by Regularization

Mayara E. Bonani, Max Schwarz, Sven Behnke

Main category: cs.CV

TL;DR: A self-supervised domain adaptation method for semantic segmentation in robotics, leveraging the Segment Anything Model and an invariance-variance loss, outperforming prior work and even supervised models on some datasets.

DetailsMotivation: Addressing the scarcity of annotated target domain data in robotics by utilizing synthetic source data and unannotated target data.

Method: Uses the Segment Anything Model for segmenting unannotated data and applies an invariance-variance loss to regularize features, handling overlapping segments.

Result: Outperforms prior work on YCB-Video and HomebrewedDB, even surpassing supervised models on YCB-Video.

Conclusion: The method is effective for domain adaptation in robotics, with practical applicability demonstrated in a custom robotic setup.

Abstract: Domain adaptation is especially important for robotics applications, where target domain training data is usually scarce and annotations are costly to obtain. We present a method for self-supervised domain adaptation for the scenario where annotated source domain data (e.g. from synthetic generation) is available, but the target domain data is completely unannotated. Our method targets the semantic segmentation task and leverages a segmentation foundation model (Segment Anything Model) to obtain segment information on unannotated data. We take inspiration from recent advances in unsupervised local feature learning and propose an invariance-variance loss over the detected segments for regularizing feature representations in the target domain. Crucially, this loss structure and network architecture can handle overlapping segments and oversegmentation as produced by Segment Anything. We demonstrate the advantage of our method on the challenging YCB-Video and HomebrewedDB datasets and show that it outperforms prior work and, on YCB-Video, even a network trained with real annotations. Additionally, we provide insight through model ablations and show applicability to a custom robotic application.
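
A minimal sketch of an invariance-variance loss over (possibly overlapping) SAM segments is shown below; the exact targets and normalization in the paper may differ, and segments are assumed non-empty.

```python
import torch

def invariance_variance_loss(feats, seg_masks):
    """Sketch: features within a segment are pulled together (invariance),
    while per-segment mean features are kept spread apart (variance).

    feats: (P, D) dense features; seg_masks: list of boolean (P,) masks,
    possibly overlapping, as produced by Segment Anything.
    """
    means, inv = [], feats.new_zeros(())
    for m in seg_masks:
        seg = feats[m]                                   # features of one segment
        mu = seg.mean(dim=0)
        inv = inv + ((seg - mu) ** 2).mean()             # low intra-segment variance
        means.append(mu)
    means = torch.stack(means)                           # (S, D), assumes S >= 2
    var = torch.clamp(1.0 - means.std(dim=0), min=0.0).mean()  # keep means apart
    return inv / len(seg_masks) + var
```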

[298] Point’n Move: Interactive Scene Object Manipulation on Gaussian Splatting Radiance Fields

Jiajun Huang, Hongchuan Yu

Main category: cs.CV

TL;DR: Point’n Move enables interactive scene object manipulation with real-time editing and superior quality using Gaussian Splatting Radiance Field.

DetailsMotivation: To achieve intuitive object selection and real-time editing in scene manipulation with high quality and performance.

Method: Uses Gaussian Splatting Radiance Field for scene representation, a dual-stage self-prompting segmentation algorithm, mask refinement, and real-time editing without per-editing training.

Result: Superior quality and performance in editing both forward-facing and 360 scenes, outperforming existing methods.

Conclusion: Point’n Move offers a more capable, faster, and higher-quality solution for interactive scene object manipulation.

Abstract: We propose Point’n Move, a method that achieves interactive scene object manipulation with exposed region inpainting. Interactivity here further comes from intuitive object selection and real-time editing. To achieve this, we adopt Gaussian Splatting Radiance Field as the scene representation and fully leverage its explicit nature and speed advantage. Its explicit representation formulation allows us to devise a dual-stage self-prompting segmentation algorithm that lifts 2D prompt points to 3D masks, perform mask refinement and merging, minimize changes, provide good initialization for scene inpainting, and perform editing in real time without per-editing training, all of which leads to superior quality and performance. We test our method by performing editing on both forward-facing and 360 scenes. We also compare our method against existing scene object removal methods, showing superior quality despite being more capable and having a speed advantage.

[299] Generalized Consistency Trajectory Models for Image Manipulation

Beomsu Kim, Jaemin Kim, Jeongsol Kim, Jong Chul Ye

Main category: cs.CV

TL;DR: The paper introduces Generalized Consistency Trajectory Models (GCTMs) to extend the capabilities of CTMs, enabling translation between arbitrary distributions via ODEs for efficient and versatile image manipulation.

DetailsMotivation: Diffusion models (DMs) are powerful but computationally intensive due to their iterative nature. CTMs reduce computation but are limited to Gaussian noise-to-data translation. This work aims to overcome this limitation.

Method: The authors propose GCTMs, which generalize CTMs to translate between arbitrary distributions using ODEs. They explore the design space of GCTMs and apply them to tasks like image-to-image translation, restoration, and editing.

Result: GCTMs demonstrate efficacy in various image manipulation tasks, offering fine-grained control and computational efficiency compared to traditional DMs.

Conclusion: GCTMs unlock the full potential of CTMs, providing a more flexible and efficient framework for image manipulation tasks.

Abstract: Diffusion models (DMs) excel in unconditional generation, as well as on applications such as image editing and restoration. The success of DMs lies in the iterative nature of diffusion: diffusion breaks down the complex process of mapping noise to data into a sequence of simple denoising tasks. Moreover, we are able to exert fine-grained control over the generation process by injecting guidance terms into each denoising step. However, the iterative process is also computationally intensive, often taking from tens up to thousands of function evaluations. Although consistency trajectory models (CTMs) enable traversal between any time points along the probability flow ODE (PFODE) and score inference with a single function evaluation, CTMs only allow translation from Gaussian noise to data. This work aims to unlock the full potential of CTMs by proposing generalized CTMs (GCTMs), which translate between arbitrary distributions via ODEs. We discuss the design space of GCTMs and demonstrate their efficacy in various image manipulation tasks such as image-to-image translation, restoration, and editing.

[300] View Selection for 3D Captioning via Diffusion Ranking

Tiange Luo, Justin Johnson, Honglak Lee

Main category: cs.CV

TL;DR: The paper addresses hallucination in 3D object captioning by introducing DiffuRank, a method to rank 2D views for better caption accuracy, improving datasets and outperforming CLIP in VQA.

DetailsMotivation: Existing methods for 3D object captioning often produce hallucinated captions due to atypical rendered views, degrading quality.

Method: DiffuRank uses a pre-trained text-to-3D model to rank 2D views by alignment with 3D objects, selecting top views for GPT4-Vision captioning.

Result: Corrected 200k captions in Cap3D and expanded to 1M captions; outperformed CLIP in VQA tasks.

Conclusion: DiffuRank effectively mitigates hallucination, enhances caption quality, and demonstrates versatility in other tasks like VQA.

Abstract: Scalable annotation approaches are crucial for constructing extensive 3D-text datasets, facilitating a broader range of applications. However, existing methods sometimes lead to the generation of hallucinated captions, compromising caption quality. This paper explores the issue of hallucination in 3D object captioning, with a focus on the Cap3D method, which renders 3D objects into 2D views for captioning using pre-trained models. We pinpoint a major challenge: certain rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations. To tackle this, we present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views, where views with high alignment closely represent the object’s characteristics. By ranking all rendered views and feeding the top-ranked ones into GPT4-Vision, we enhance the accuracy and detail of captions, enabling the correction of 200k captions in the Cap3D dataset and extending it to 1 million captions across the Objaverse and Objaverse-XL datasets. Additionally, we showcase the adaptability of DiffuRank by applying it to pre-trained text-to-image models for a Visual Question Answering task, where it outperforms the CLIP model.

[301] OCK: Unsupervised Dynamic Video Prediction with Object-Centric Kinematics

Yeon-Ji Song, Jaein Kim, Suhyung Choi, Jin-Hwa Kim, Byoung-Tak Zhang

Main category: cs.CV

TL;DR: OCK is a dynamic video prediction model that integrates object kinematics with appearance features to improve motion dynamics modeling in complex scenes.

DetailsMotivation: To address the gap in current object-centric transformers, which focus on appearance but overlook motion dynamics, crucial for human-like scene understanding.

Method: Proposes OCK, incorporating Object Kinematics (explicit motion attributes) alongside appearance features, integrated into spatiotemporal prediction mechanisms.

Result: Superior performance in handling complex scenes with intricate object attributes and motions, demonstrating applicability in vision-related dynamics learning.

Conclusion: OCK effectively combines object kinematics and appearance for dynamic video prediction, advancing capabilities in modeling complex interactions.

Abstract: Human perception involves decomposing complex multi-object scenes into time-static object appearance (i.e., size, shape, color) and time-varying object motion (i.e., position, velocity, acceleration). For machines to achieve human-like intelligence in real-world interactions, understanding these physical properties of objects is essential, forming the foundation for dynamic video prediction. While recent advancements in object-centric transformers have demonstrated potential in video prediction, they primarily focus on object appearance, often overlooking motion dynamics, which is crucial for modeling dynamic interactions and maintaining temporal consistency in complex environments. To address these limitations, we propose OCK, a dynamic video prediction model leveraging object-centric kinematics and object slots. We introduce a novel component named Object Kinematics that comprises explicit object motions, serving as an additional attribute beyond conventional appearance features to model dynamic scenes. The Object Kinematics are integrated into various OCK mechanisms, enabling spatiotemporal prediction of complex object interactions over long video sequences. Our model demonstrates superior performance in handling complex scenes with intricate object attributes and motions, highlighting its applicability to vision-related dynamics learning tasks.
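
The explicit motion attributes the abstract names (position, velocity, acceleration) can be derived from tracked slot positions by finite differences, as in the small sketch below; how OCK actually encodes and fuses them is more involved.

```python
def object_kinematics(positions):
    """Finite-difference motion attributes per object slot (illustrative).

    positions: (T, K, 2) array/tensor of K slot centroids over T frames.
    Returns velocity (T-1, K, 2) and acceleration (T-2, K, 2).
    """
    velocity = positions[1:] - positions[:-1]
    acceleration = velocity[1:] - velocity[:-1]
    return velocity, acceleration
```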

[302] Transfer Attack for Bad and Good: Explain and Boost Adversarial Transferability across Multimodal Large Language Models

Hao Cheng, Erjia Xiao, Jiayan Yang, Jinhao Duan, Yichi Wang, Jiahang Cao, Qiang Zhang, Le Yang, Kaidi Xu, Jindong Gu, Renjing Xu

Main category: cs.CV

TL;DR: The paper explores adversarial transferability in Multimodal Large Language Models (MLLMs), identifies key influencing factors, and proposes two data augmentation methods to enhance transferability. It also examines real-world impacts through tasks like harmful content insertion and information protection.

DetailsMotivation: MLLMs excel in cross-modality interaction but are vulnerable to adversarial attacks, especially regarding transferability. Understanding and mitigating this vulnerability is crucial for robustness.

Method: The study analyzes adversarial transferability in MLLMs, identifies two key factors, and introduces two semantic-level data augmentation methods (AIP and TATM) to boost transferability. Real-world impact is tested via harmful content insertion and information protection tasks.

Result: The research confirms adversarial transferability in MLLMs, highlights two key factors, and demonstrates that AIP and TATM effectively enhance transferability. Real-world tasks show potential societal impacts.

Conclusion: The findings advance understanding of adversarial transferability in MLLMs, propose practical solutions, and highlight the dual-use nature of such vulnerabilities in real-world applications.

Abstract: Multimodal Large Language Models (MLLMs) demonstrate exceptional performance in cross-modality interaction, yet they also suffer from adversarial vulnerabilities. In particular, the transferability of adversarial examples remains an ongoing challenge. In this paper, we specifically analyze the manifestation of adversarial transferability among MLLMs and identify the key factors that influence this characteristic. We discover that the transferability of MLLMs exists in cross-LLM scenarios with the same vision encoder and indicate two key factors that may influence transferability. We provide two semantic-level data augmentation methods, Adding Image Patch (AIP) and Typography Augment Transferability Method (TATM), which boost the transferability of adversarial examples across MLLMs. To explore the potential impact in the real world, we utilize two tasks that can have both negative and positive societal impacts: (1) Harmful Content Insertion and (2) Information Protection.

[303] A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection

Jiangning Zhang, Haoyang He, Zhenye Gan, Qingdong He, Yuxuan Cai, Zhucun Xue, Yabiao Wang, Chengjie Wang, Lei Xie, Yong Liu

Main category: cs.CV

TL;DR: The paper introduces ADer, a comprehensive benchmark for visual anomaly detection, addressing the lack of standardized evaluation in the field. It includes multiple datasets, methods, and metrics, along with a GPU-assisted tool for faster evaluation.

DetailsMotivation: The absence of standardized benchmarks for evaluating visual anomaly detection methods leads to biased results and erroneous conclusions, hindering progress in the field.

Method: The authors propose ADer, a modular framework with multiple datasets, fifteen state-of-the-art methods, and nine metrics. They also introduce the GPU-assisted ADEval package for efficient evaluation.

Result: ADer provides extensive experimental results, objectively comparing methods and highlighting their strengths and weaknesses. The GPU-assisted tool reduces evaluation time significantly.

Conclusion: ADer serves as a valuable resource for researchers, promoting robust and generalizable anomaly detection systems. The framework is open-sourced for community use.

Abstract: Visual anomaly detection aims to identify anomalous regions in images through unsupervised learning paradigms, with increasing application demand and value in fields such as industrial inspection and medical lesion detection. Despite significant progress in recent years, there is a lack of comprehensive benchmarks to adequately evaluate the performance of various mainstream methods across different datasets under the practical multi-class setting. The absence of standardized experimental setups can lead to potential biases in training epochs, resolution, and metric results, resulting in erroneous conclusions. This paper addresses this issue by proposing a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework that is highly extensible for new methods. The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics. Additionally, we have proposed the GPU-assisted ADEval package to address the slow evaluation problem of metrics like time-consuming mAU-PRO on large-scale data, significantly reducing evaluation time by more than 1000-fold. Through extensive experimental results, we objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection. We hope that ADer will become a valuable resource for researchers and practitioners in the field, promoting the development of more robust and generalizable anomaly detection systems. Full codes are open-sourced at https://github.com/zhangzjn/ader.

[304] Video-based Exercise Classification and Activated Muscle Group Prediction with Hybrid X3D-SlowFast Network

Manvik Pasula, Pramit Saha

Main category: cs.CV

TL;DR: A video-based deep learning framework using X3D and SlowFast models improves exercise classification and muscle group activation prediction, outperforming existing methods.

DetailsMotivation: To address the limitations of sensor-dependent and limited-scope exercise classification and MGAP, making fitness routines more accessible and practical.

Method: Uses a hybrid approach with X3D and SlowFast models, weighted ensemble, and pretrained models for enhanced performance.

Result: The composite model outperforms baselines in accuracy, with optimal SlowFast channel reduction at 10.

Conclusion: The proposed method sets a new benchmark, offering a robust solution for exercise classification and MGAP.

Abstract: This paper introduces a simple yet effective strategy for exercise classification and muscle group activation prediction (MGAP). These tasks have significant implications for personal fitness, facilitating more affordable, accessible, safer, and simpler exercise routines. This is particularly relevant for novices and individuals with disabilities. Previous research in the field has mostly relied on mounted sensors and a limited scope of exercises, reducing practicality for everyday use. Furthermore, existing MGAP methodologies suffer from a similar dependency on sensors and a restricted range of muscle groups, often excluding strength training exercises, which are pivotal for a comprehensive fitness regimen. Addressing these limitations, our research employs a video-based deep learning framework that encompasses a broad spectrum of exercises and muscle groups, including those vital for strength training. Utilizing the “Workout/Exercises Video” dataset, our approach effectively integrates the X3D and SlowFast video activity recognition models to enhance exercise classification and MGAP performance. Our findings demonstrate that this hybrid method, obtained via weighted ensemble, outperforms existing baseline models in accuracy. Pretrained models play a crucial role in enhancing overall performance, with optimal channel reduction values for the SlowFast model identified near 10. Through an ablation study that explores fine-tuning, we further elucidate the interrelation between the two tasks. Our composite model, a weighted-average ensemble of X3D and SlowFast, sets a new benchmark in both exercise classification and MGAP across all evaluated categories, offering a robust solution to the limitations of previous approaches.
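
The weighted-average ensemble named in the abstract reduces to a convex mix of the two backbones' class probabilities, e.g.:

```python
def ensemble_predict(logits_x3d, logits_slowfast, w=0.5):
    """Weighted-average ensemble of X3D and SlowFast outputs (sketch).

    logits_*: (B, C) torch tensors of class logits; w is a tuned weight.
    """
    probs = w * logits_x3d.softmax(-1) + (1.0 - w) * logits_slowfast.softmax(-1)
    return probs.argmax(-1)
```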

[305] Growing a Twig to Accelerate Large Vision-Language Models

Zhenwei Shao, Mingyang Wang, Zhou Yu, Wenwen Pan, Yan Yang, Tao Wei, Hongyuan Zhang, Ning Mao, Wei Chen, Jun Yu

Main category: cs.CV

TL;DR: TwigVLM improves VLM efficiency by combining token pruning and self-speculative decoding, achieving high accuracy retention and speedup.

DetailsMotivation: Addressing the accuracy drop and limited speedup in existing VLM token pruning methods.

Method: Introduces TwigVLM with twig-guided token pruning (TTP) and self-speculative decoding (SSD).

Result: Preserves 96% accuracy after pruning 88.9% tokens and achieves 154% speedup.

Conclusion: TwigVLM outperforms state-of-the-art methods in accuracy and speed.

Abstract: Large vision-language models (VLMs) have demonstrated remarkable capabilities in open-world multimodal understanding, yet their high computational overheads pose great challenges for practical deployment. Some recent works have proposed methods to accelerate VLMs by pruning redundant visual tokens guided by the attention maps of VLM’s early layers. Despite the success of these token pruning methods, they still suffer from two major shortcomings: (i) considerable accuracy drop due to insensitive attention signals in early layers, and (ii) limited speedup when generating long responses (e.g., 30 tokens). To address the limitations above, we present TwigVLM – a simple and general architecture by growing a lightweight twig upon an early layer of the base VLM. Compared with most existing VLM acceleration methods purely based on visual token pruning, our TwigVLM not only achieves better accuracy retention by employing a twig-guided token pruning (TTP) strategy, but also yields higher generation speed by utilizing a self-speculative decoding (SSD) strategy. Taking LLaVA-1.5-7B as the base VLM, experimental results show that TwigVLM preserves 96% of the original performance after pruning 88.9% of visual tokens and achieves 154% speedup in generating long responses, delivering significantly better performance in terms of both accuracy and speed over the state-of-the-art VLM acceleration methods.
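
Attention-guided token pruning of the kind TwigVLM builds on can be sketched in a few lines; here the twig's per-token importance scores are assumed given, and pruning 88.9% of tokens corresponds to keeping roughly 11%.

```python
import torch

def prune_visual_tokens(tokens, attn_scores, keep_ratio=0.111):
    """Keep the most-attended visual tokens (schematic).

    tokens: (B, N, D) visual token embeddings.
    attn_scores: (B, N) per-token importance from the twig layers (assumed).
    """
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = attn_scores.topk(k, dim=1).indices.sort(dim=1).values     # keep token order
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
```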

[306] Efficient Visual Transformer by Learnable Token Merging

Yancheng Wang, Yingzhen Yang

Main category: cs.CV

TL;DR: The paper introduces LTM-Transformer, a compact transformer block with learnable token merging, reducing FLOPs and inference time while maintaining or improving accuracy in visual transformers.

DetailsMotivation: To address the inefficiency of visual transformers by reducing the Information Bottleneck and improving computational efficiency.

Method: Proposes LTM-Transformer, which incorporates learnable token merging and a mask module to optimize the IB loss.

Result: LTM-Transformer reduces FLOPs and inference time while matching or surpassing the accuracy of original visual transformers.

Conclusion: LTM-Transformer offers a compact and efficient alternative to traditional visual transformers, with potential for broader application.

Abstract: Self-attention and transformers have been widely used in deep learning. Recent efforts have been devoted to incorporating transformer blocks into different neural architectures, including those with convolutions, leading to various visual transformers for computer vision tasks. In this paper, we propose a novel and compact transformer block, Transformer with Learnable Token Merging (LTM), or LTM-Transformer. LTM-Transformer performs token merging in a learnable scheme. LTM-Transformer is compatible with many popular and compact transformer networks, and it reduces the FLOPs and the inference time of the visual transformers while maintaining or even improving the prediction accuracy. In the experiments, we replace all the transformer blocks in popular visual transformers, including MobileViT, EfficientViT, ViT, and Swin, with LTM-Transformer blocks, leading to LTM-Transformer networks with different backbones. The LTM-Transformer is motivated by reduction of Information Bottleneck, and a novel and separable variational upper bound for the IB loss is derived. The architecture of the mask module in our LTM blocks, which generates the token merging mask, is designed to reduce the derived upper bound for the IB loss. Extensive results on computer vision tasks evidence that LTM-Transformer renders compact and efficient visual transformers with comparable or much better prediction accuracy than the original visual transformers. The code of the LTM-Transformer is available at https://github.com/Statistical-Deep-Learning/LTM

[307] A Simple Low-bit Quantization Framework for Video Snapshot Compressive Imaging

Miao Cao, Lishun Wang, Huan Wang, Xin Yuan

Main category: cs.CV

TL;DR: Q-SCI introduces a low-bit quantization framework for video SCI reconstruction, reducing computational cost while minimizing performance drop.

DetailsMotivation: Deep learning-based video SCI reconstruction is computationally heavy; quantization can reduce cost but often degrades performance.

Method: Proposes Q-SCI with high-quality feature extraction, precise reconstruction, and a shift operation for Transformer branches to mitigate distortion.

Result: 4-bit quantized EfficientSCI-S achieves 7.8X speedup with only 2.3% performance gap.

Conclusion: Q-SCI effectively balances computational efficiency and reconstruction quality in video SCI.

Abstract: Video Snapshot Compressive Imaging (SCI) aims to use a low-speed 2D camera to capture high-speed scene as snapshot compressed measurements, followed by a reconstruction algorithm to reconstruct the high-speed video frames. State-of-the-art (SOTA) deep learning-based algorithms have achieved impressive performance, yet with heavy computational workload. Network quantization is a promising way to reduce computational cost. However, a direct low-bit quantization will bring large performance drop. To address this challenge, in this paper, we propose a simple low-bit quantization framework (dubbed Q-SCI) for the end-to-end deep learning-based video SCI reconstruction methods which usually consist of a feature extraction, feature enhancement, and video reconstruction module. Specifically, we first design a high-quality feature extraction module and a precise video reconstruction module to extract and propagate high-quality features in the low-bit quantized model. In addition, to alleviate the information distortion of the Transformer branch in the quantized feature enhancement module, we introduce a shift operation on the query and key distributions to further bridge the performance gap. Comprehensive experimental results manifest that our Q-SCI framework can achieve superior performance, e.g., 4-bit quantized EfficientSCI-S derived by our Q-SCI framework can theoretically accelerate the real-valued EfficientSCI-S by 7.8X with only 2.3% performance gap on the simulation testing datasets. Code is available at https://github.com/mcao92/QuantizedSCI.
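
The shift operation on query/key distributions plausibly amounts to re-centering activations so that a low-bit uniform grid covers their actual range; the sketch below illustrates that idea, though Q-SCI's exact quantizer may differ.

```python
import torch

def shift_and_quantize(x, bits=4):
    """Re-center, uniformly quantize, then de-quantize activations (sketch)."""
    shift = x.mean(dim=-1, keepdim=True)            # shift the distribution to zero mean
    xs = x - shift
    qmax = 2 ** (bits - 1) - 1
    scale = xs.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp((xs / scale).round(), -qmax - 1, qmax)
    return q * scale + shift                        # simulated low-bit activations
```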

[308] Self-supervised Learning of Hybrid Part-aware 3D Representations of 2D Gaussians and Superquadrics

Zhirui Gao, Renjiao Yi, Yuhang Huang, Wei Chen, Chenyang Zhu, Kai Xu

Main category: cs.CV

TL;DR: PartGS is a self-supervised framework for part-aware 3D reconstruction, combining 2D Gaussians and superquadrics to decompose objects into interpretable parts, outperforming state-of-the-art methods.

DetailsMotivation: Human perception interprets 3D environments at higher structural levels, not low-level elements like points or voxels. Structured decomposition improves interpretability and downstream tasks.

Method: PartGS integrates 2D Gaussians and superquadrics in a hybrid representation, jointly optimizing them for part-aware reconstruction using multi-view images.

Result: Superior performance on DTU, ShapeNet, and real-world datasets compared to state-of-the-art methods.

Conclusion: PartGS effectively bridges the gap between low-level 3D representations and human-like structural understanding, enabling high-fidelity and interpretable decompositions.

Abstract: Low-level 3D representations, such as point clouds, meshes, NeRFs and 3D Gaussians, are commonly used for modeling 3D objects and scenes. However, cognitive studies indicate that human perception operates at higher levels and interprets 3D environments by decomposing them into meaningful structural parts, rather than low-level elements like points or voxels. Structured geometric decomposition enhances scene interpretability and facilitates downstream tasks requiring component-level manipulation. In this work, we introduce PartGS, a self-supervised part-aware reconstruction framework that integrates 2D Gaussians and superquadrics to parse objects and scenes into an interpretable decomposition, leveraging multi-view image inputs to uncover 3D structural information. Our method jointly optimizes superquadric meshes and Gaussians by coupling their parameters within a hybrid representation. On one hand, superquadrics enable the representation of a wide range of shape primitives, facilitating flexible and meaningful decompositions. On the other hand, 2D Gaussians capture fine texture and geometric detail, ensuring high-fidelity appearance and geometry reconstruction. Operating in a self-supervised manner, our approach demonstrates superior performance compared to state-of-the-art methods across extensive experiments on the DTU, ShapeNet, and real-world datasets.

[309] CVPT: Cross Visual Prompt Tuning

Lingyun Huang, Jianxu Mao, Junfei Yi, Ziming Tao, Yaonan Wang

Main category: cs.CV

TL;DR: CVPT improves Visual Prompt Tuning (VPT) by introducing cross-attention to preserve self-attention integrity, achieving better performance and efficiency.

DetailsMotivation: VPT's limitations in performance and efficiency due to distorted self-attention prompted the need for a better prompt-based method.

Method: Proposes Cross Visual Prompt Tuning (CVPT) with a cross-attention module and weight-sharing mechanism for efficient feature integration.

Result: CVPT outperforms VPT by over 4% on VTAB-1K and rivals adapter-based methods in performance and efficiency.

Conclusion: Prompt-based methods like CVPT can achieve exceptional results in visual fine-tuning, validating their potential.

Abstract: Parameter-Efficient Fine-Tuning (PEFT) has emerged to mitigate the computational demands of large-scale models. Within computer vision, adapter-based PEFT methods are often favored over prompt-based approaches like Visual Prompt Tuning (VPT) due to the latter’s performance and efficiency limitations. Our analysis reveals that VPT’s shortcomings stem from its prompt deployment strategy, which can distort the model’s inherent self-attention mechanism. To address this, we propose Cross Visual Prompt Tuning (CVPT). CVPT introduces a cross-attention module to directly model interactions between prompts and image tokens. This design decouples the prompts from the input sequence, preserving the original self-attention integrity while enabling efficient feature integration. Furthermore, we employ a weight-sharing mechanism for cross-attention initialization, which enhances representative capability without a large parameter overhead. Extensive experiments across 25 datasets show that CVPT significantly outperforms VPT. For instance, on the VTAB-1K benchmark, CVPT achieves over 4% higher average accuracy, rivaling leading adapter-based methods in both performance and efficiency. Our work confirms that prompt-based methods can achieve exceptional results in visual fine-tuning. The code is available at https://github.com/Lingyun0419/CVPT
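
The decoupling CVPT describes, prompts acting as cross-attention queries over image tokens rather than extra entries in the self-attention sequence, can be sketched as follows (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class CrossPromptAttention(nn.Module):
    """Learnable prompts attend to image tokens via cross-attention (sketch),
    leaving the backbone's self-attention over image tokens untouched."""

    def __init__(self, dim=768, n_prompts=10, n_heads=8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(1, n_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, image_tokens):                  # image_tokens: (B, N, dim)
        p = self.prompts.expand(image_tokens.size(0), -1, -1)
        out, _ = self.attn(p, image_tokens, image_tokens)   # prompts are the queries
        return out                                    # updated prompts for downstream fusion
```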

[310] DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs

Bo-Cheng Chiu, Jen-Jee Chen, Yu-Chee Tseng, Feng-Chi Chen

Main category: cs.CV

TL;DR: DaMO is a data-efficient Video LLM designed for fine-grained temporal reasoning and multimodal understanding, outperforming prior methods in tasks requiring precise temporal alignment.

DetailsMotivation: Existing Video LLMs struggle with fine-grained temporal reasoning and precise attribution to specific video moments, especially under constrained supervision.

Method: DaMO uses a Temporal-aware Fuseformer with a hierarchical dual-stream architecture for capturing temporal dynamics and fusing visual/audio data. It includes a global residual for efficiency and is trained via a four-stage progressive paradigm.

Result: DaMO consistently surpasses prior methods in temporal grounding and video QA benchmarks, excelling in tasks requiring precise temporal alignment.

Conclusion: DaMO advances data-efficient video-language modeling, offering improved temporal reasoning and multimodal understanding.

Abstract: Large Language Models (LLMs) have recently been extended to the video domain, enabling sophisticated video-language understanding. However, existing Video LLMs often exhibit limitations in fine-grained temporal reasoning, restricting their ability to precisely attribute responses to specific video moments, especially under constrained supervision. We introduce DaMO, a data-efficient Video LLM explicitly designed for accurate temporal reasoning and multimodal understanding. At its core, the proposed Temporal-aware Fuseformer employs a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and audio information. To further enhance computational efficiency, DaMO integrates a global residual that reduces spatial redundancy while preserving essential semantic details. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. This work also contributes multiple datasets augmented from existing ones with LLM-generated temporally grounded QA pairs for tasks requiring temporal supervision. Comprehensive experiments on temporal grounding and video QA benchmarks demonstrate that DaMO consistently surpasses prior methods, particularly in tasks demanding precise temporal alignment and reasoning. Our work establishes a promising direction for data-efficient video-language modeling.

[311] MDNF: Multi-Diffusion-Nets for Neural Fields on Meshes

Avigail Cohen Rimon, Tal Shnitzer, Mirela Ben Chen

Main category: cs.CV

TL;DR: A novel framework for neural fields on triangle meshes, combining multi-resolution spatial and frequency domains with geometry-aware decomposition and Fourier feature mapping.

DetailsMotivation: To address challenges in learning complex neural fields, including discontinuities and scale variations, by leveraging multi-resolution spatial and frequency decomposition.

Method: Uses DiffusionNet components for spatial decomposition, Fourier feature mapping for frequency association, and a sine-activated MLP for signal composition.

Result: Achieves high accuracy and robustness in learning diverse neural fields, outperforming alternatives.

Conclusion: The framework effectively handles complex neural fields and demonstrates superior performance in various applications.

Abstract: We propose a novel framework for representing neural fields on triangle meshes that is multi-resolution across both spatial and frequency domains. Inspired by the Neural Fourier Filter Bank (NFFB), our architecture decomposes the spatial and frequency domains by associating finer spatial resolution levels with higher frequency bands, while coarser resolutions are mapped to lower frequencies. To achieve geometry-aware spatial decomposition we leverage multiple DiffusionNet components, each associated with a different spatial resolution level. Subsequently, we apply a Fourier feature mapping to encourage finer resolution levels to be associated with higher frequencies. The final signal is composed in a wavelet-inspired manner using a sine-activated MLP, aggregating higher-frequency signals on top of lower-frequency ones. Our architecture attains high accuracy in learning complex neural fields and is robust to discontinuities, exponential scale variations of the target field, and mesh modification. We demonstrate the effectiveness of our approach through its application to diverse neural fields, such as synthetic RGB functions, UV texture coordinates, and vertex normals, illustrating different challenges. To validate our method, we compare its performance against two alternatives, showcasing the advantages of our multi-resolution architecture.
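
The Fourier feature mapping used to push finer levels toward higher frequencies is the standard positional-encoding construction, sketched here:

```python
import torch

def fourier_features(x, n_freqs=6):
    """Map inputs through sin/cos at octave-spaced frequencies (standard form).

    x: (..., D) coordinates or per-level features; returns (..., D * 2 * n_freqs).
    """
    freqs = 2.0 ** torch.arange(n_freqs, dtype=x.dtype, device=x.device)
    ang = x.unsqueeze(-1) * freqs                    # (..., D, n_freqs)
    return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1).flatten(-2)
```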

[312] Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration

Yuyi Zhang, Peirong Zhang, Zhenhua Yang, Pengyu Yan, Yongxin Shi, Pengwei Liu, Fengjun Guo, Lianwen Jin

Main category: cs.CV

TL;DR: The paper introduces a full-page historical document restoration (HDR) dataset (FPHDR) and an automated solution (AutoHDR) to address limitations of existing methods. AutoHDR improves OCR accuracy significantly through a three-stage workflow and human-machine collaboration.

DetailsMotivation: Existing HDR methods are limited to single modalities or small-scale restoration, failing practical needs. The paper aims to bridge this gap for better cultural heritage preservation.

Method: AutoHDR uses a three-stage approach: OCR-assisted damage localization, vision-language context text prediction, and patch autoregressive appearance restoration. It supports human-machine collaboration.

Result: AutoHDR improves OCR accuracy from 46.83% to 84.05% for severely damaged documents, reaching 94.25% with human intervention.

Conclusion: The work advances automated HDR and aids cultural heritage preservation. The dataset and model are publicly available.

Abstract: Historical documents represent an invaluable cultural heritage, yet have undergone significant degradation over time through tears, water erosion, and oxidation. Existing Historical Document Restoration (HDR) methods primarily focus on single modality or limited-size restoration, failing to meet practical needs. To fill this gap, we present a full-page HDR dataset (FPHDR) and a novel automated HDR solution (AutoHDR). Specifically, FPHDR comprises 1,633 real and 6,543 synthetic images with character-level and line-level locations, as well as character annotations in different damage grades. AutoHDR mimics historians’ restoration workflows through a three-stage approach: OCR-assisted damage localization, vision-language context text prediction, and patch autoregressive appearance restoration. The modular architecture of AutoHDR enables seamless human-machine collaboration, allowing for flexible intervention and optimization at each restoration stage. Experiments demonstrate AutoHDR’s remarkable performance in HDR. When processing severely damaged documents, our method improves OCR accuracy from 46.83% to 84.05%, with further enhancement to 94.25% through human-machine collaboration. We believe this work represents a significant advancement in automated historical document restoration and contributes substantially to cultural heritage preservation. The model and dataset are available at https://github.com/SCUT-DLVCLab/AutoHDR.

[313] InteractPro: A Unified Framework for Motion-Aware Image Composition

Weijing Tao, Xiaofeng Yang, Miaomiao Cui, Guosheng Lin

Main category: cs.CV

TL;DR: InteractPro is a framework for dynamic motion-aware image composition, combining simulation-based and diffusion-based methods under planner guidance to overcome traditional limitations.

DetailsMotivation: Traditional image composition methods require manual planning and produce static outputs, lacking realistic motion effects.

Method: InteractPro uses InteractPlan (LVLM-based planner) to choose between InteractPhys (MPM-based simulation) and InteractMotion (pretrained video diffusion) for optimal composition.

Result: InteractPro produces controllable, coherent, and motion-aware compositions across diverse scenarios.

Conclusion: The framework effectively unifies simulation and diffusion methods, addressing traditional challenges in dynamic image composition.

Abstract: We introduce InteractPro, a comprehensive framework for dynamic motion-aware image composition. At its core is InteractPlan, an intelligent planner that leverages a Large Vision Language Model (LVLM) for scenario analysis and object placement, determining the optimal composition strategy to achieve realistic motion effects. Based on each scenario, InteractPlan selects between our two specialized modules: InteractPhys and InteractMotion. InteractPhys employs an enhanced Material Point Method (MPM)-based simulation to produce physically faithful and controllable object-scene interactions, capturing diverse and abstract events that require true physical modeling. InteractMotion, in contrast, is a training-free method based on pretrained video diffusion. Traditional composition approaches suffer from two major limitations: requiring manual planning for object placement and generating static, motionless outputs. By unifying simulation-based and diffusion-based methods under planner guidance, InteractPro overcomes these challenges, ensuring richly motion-aware compositions. Extensive quantitative and qualitative evaluations demonstrate InteractPro’s effectiveness in producing controllable and coherent compositions across varied scenarios.

[314] FlexiTex: Enhancing Texture Generation via Visual Guidance

DaDong Jiang, Xianghui Yang, Zibo Zhao, Sheng Zhang, Jiaao Yu, Zeqiang Lai, Shaoxiong Yang, Chunchao Guo, Xiaobo Zhou, Zhihui Ke

Main category: cs.CV

TL;DR: FlexiTex improves texture generation by using visual guidance to enhance details and avoid ambiguity in text prompts, achieving high-quality results.

DetailsMotivation: Textual prompts in texture generation often lack global textural or shape information, leading to blurry or inconsistent patterns.

Method: FlexiTex introduces a Visual Guidance Enhancement module and a Direction-Aware Adaptation module to incorporate visual details and maintain consistency.

Result: FlexiTex produces high-quality textures with preserved high-frequency details and global consistency.

Conclusion: FlexiTex demonstrates potential for advancing texture generation in real-world applications.

Abstract: Recent texture generation methods achieve impressive results due to the powerful generative prior they leverage from large-scale text-to-image diffusion models. However, abstract textual prompts are limited in providing global textural or shape information, which results in the texture generation methods producing blurry or inconsistent patterns. To tackle this, we present FlexiTex, embedding rich information via visual guidance to generate a high-quality texture. The core of FlexiTex is the Visual Guidance Enhancement module, which incorporates more specific information from visual guidance to reduce ambiguity in the text prompt and preserve high-frequency details. To further enhance the visual guidance, we introduce a Direction-Aware Adaptation module that automatically designs direction prompts based on different camera poses, avoiding the Janus problem and maintaining semantically global consistency. Benefiting from the visual guidance, FlexiTex produces quantitatively and qualitatively sound results, demonstrating its potential to advance texture generation for real-world applications.

[315] PerspectiveNet: Multi-View Perception for Dynamic Scene Understanding

Vinh Nguyen

Main category: cs.CV

TL;DR: PerspectiveNet is a lightweight model for generating detailed descriptions from multiple camera views, combining visual encoders, a connector module, and LLMs.

DetailsMotivation: Addressing the challenge of generating consistent and detailed descriptions from complex, multi-view visual data.

Method: Uses a vision encoder, connector module for feature mapping, LLMs for language generation, and a secondary task for frame sequence detection.

Result: A lightweight, efficient model effective for the Traffic Safety Description and Analysis task.

Conclusion: PerspectiveNet successfully integrates visual and language models for multi-view description generation.

Abstract: Generating detailed descriptions from multiple cameras and viewpoints is challenging due to the complex and inconsistent nature of visual data. In this paper, we introduce PerspectiveNet, a lightweight yet efficient model for generating long descriptions across multiple camera views. Our approach utilizes a vision encoder, a compact connector module to convert visual features into a fixed-size tensor, and large language models (LLMs) to harness the strong natural language generation capabilities of LLMs. The connector module is designed with three main goals: mapping visual features onto LLM embeddings, emphasizing key information needed for description generation, and producing a fixed-size feature matrix. Additionally, we augment our solution with a secondary task, the correct frame sequence detection, enabling the model to search for the correct sequence of frames to generate descriptions. Finally, we integrate the connector module, the secondary task, the LLM, and a visual feature extraction model into a single architecture, which is trained for the Traffic Safety Description and Analysis task. This task requires generating detailed, fine-grained descriptions of events from multiple cameras and viewpoints. The resulting model is lightweight, ensuring efficient training and inference, while remaining highly effective.
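
The connector module's three goals (projection into LLM embedding space, emphasis of key information, fixed-size output) suggest a structure like the hypothetical sketch below; the actual architecture and dimensions are not specified in the abstract.

```python
import torch.nn as nn

class Connector(nn.Module):
    """Project visual features to LLM embedding space and pool to a fixed size
    (sketch; layer choices and sizes are illustrative assumptions)."""

    def __init__(self, vis_dim=1024, llm_dim=4096, n_out=64):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)      # map onto LLM embeddings
        self.pool = nn.AdaptiveAvgPool1d(n_out)      # fixed-size feature matrix

    def forward(self, feats):                        # feats: (B, N, vis_dim)
        x = self.proj(feats)
        return self.pool(x.transpose(1, 2)).transpose(1, 2)   # (B, n_out, llm_dim)
```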

[316] Fourier Domain Adaptation for Traffic Light Detection in Adverse Weather

Ishaan Gakhar, Aryesh Guha, Aryaman Gupta, Amit Agarwal, Ujjwal Verma

Main category: cs.CV

TL;DR: The paper introduces Fourier Domain Adaptation (FDA) to improve traffic light detection in adverse weather, outperforming baseline models without architectural changes.

DetailsMotivation: Existing deep learning methods for traffic light detection in adverse weather are computationally heavy and underperform. FDA aims to bridge this gap with minimal overhead.

Method: FDA modifies training data to minimize domain gaps between source (LISA, S2TLD) and target (simulated rainy/foggy) datasets, using SSL for better data leverage.

Result: FDA-augmented models showed significant improvements: YOLOv8 had a 12.25% average increase, with notable gains in Precision (7.69%), Recall (19.91%), and mAP metrics.

Conclusion: FDA effectively mitigates adverse weather impact, enabling reliable real-world ADAS performance in challenging conditions.

Abstract: Traffic light detection under adverse weather conditions remains largely unexplored in ADAS systems, with existing approaches relying on complex deep learning methods that introduce significant computational overheads during training and deployment. This paper proposes Fourier Domain Adaptation (FDA), which requires only training data modifications without architectural changes, enabling effective adaptation to rainy and foggy conditions. FDA minimizes the domain gap between source and target domains, creating a dataset for reliable performance under adverse weather. The source domain merged LISA and S2TLD datasets, processed to address class imbalance. Established methods simulated rainy and foggy scenarios to form the target domain. Semi-Supervised Learning (SSL) techniques were explored to leverage data more effectively, addressing the shortage of comprehensive datasets and poor performance of state-of-the-art models under hostile weather. Experimental results show FDA-augmented models outperform baseline models across mAP50, mAP50-95, Precision, and Recall metrics. YOLOv8 achieved a 12.25% average increase across all metrics. Average improvements of 7.69% in Precision, 19.91% in Recall, 15.85% in mAP50, and 23.81% in mAP50-95 were observed across all models, demonstrating FDA’s effectiveness in mitigating adverse weather impact. These improvements enable real-world applications requiring reliable performance in challenging environmental conditions.
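
FDA itself is a well-known, training-free transform: swap the low-frequency amplitude spectrum of a source image with that of a target image while keeping the source phase. A minimal NumPy version follows (beta controls the size of the swapped band):

```python
import numpy as np

def fda_transfer(src_img, tgt_img, beta=0.01):
    """Fourier Domain Adaptation: give the source image the target's
    low-frequency amplitudes, keeping the source phase.

    src_img, tgt_img: float arrays of identical shape (H, W, C).
    """
    fft_src = np.fft.fft2(src_img, axes=(0, 1))
    fft_tgt = np.fft.fft2(tgt_img, axes=(0, 1))
    amp_src, pha_src = np.abs(fft_src), np.angle(fft_src)
    amp_tgt = np.abs(fft_tgt)

    h, w = src_img.shape[:2]
    b = max(1, int(min(h, w) * beta))                # half-size of the swapped block
    amp_src = np.fft.fftshift(amp_src, axes=(0, 1))
    amp_tgt = np.fft.fftshift(amp_tgt, axes=(0, 1))
    ch, cw = h // 2, w // 2
    amp_src[ch - b:ch + b, cw - b:cw + b] = amp_tgt[ch - b:ch + b, cw - b:cw + b]
    amp_src = np.fft.ifftshift(amp_src, axes=(0, 1))

    out = np.fft.ifft2(amp_src * np.exp(1j * pha_src), axes=(0, 1))
    return np.real(out)
```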

[317] Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI

Won Jun Kim, Hyungjin Chung, Jaemin Kim, Sangmin Lee, Byeongsu Sim, Jong Chul Ye

Main category: cs.CV

TL;DR: FreeMCG introduces a derivative-free, manifold-constrained gradient method for explainability, overcoming limitations of traditional gradient-based techniques.

DetailsMotivation: Traditional gradient-based explainability methods have shortcomings like requiring white-box access, vulnerability to attacks, and producing non-faithful explanations.

Method: FreeMCG uses ensemble Kalman filters and diffusion models to approximate gradients on the data manifold without derivatives, relying only on model outputs.

Result: FreeMCG achieves state-of-the-art performance in counterfactual generation and feature attribution while maintaining XAI tool properties.

Conclusion: FreeMCG provides a robust, derivative-free alternative for explainability, improving faithfulness and alignment with human perception.

Abstract: Gradient-based methods are a prototypical family of explainability techniques, especially for image-based models. Nonetheless, they have several shortcomings in that they (1) require white-box access to models, (2) are vulnerable to adversarial attacks, and (3) produce attributions that lie off the image manifold, leading to explanations that are not actually faithful to the model and do not align well with human perception. To overcome these challenges, we introduce Derivative-Free Diffusion Manifold-Constrained Gradients (FreeMCG), a novel method that serves as a better basis for the explainability of a given neural network than the traditional gradient. Specifically, by leveraging ensemble Kalman filters and diffusion models, we derive a derivative-free approximation of the model’s gradient projected onto the data manifold, requiring access only to the model’s outputs. We demonstrate the effectiveness of FreeMCG by applying it to both counterfactual generation and feature attribution, which have traditionally been treated as distinct tasks. Through comprehensive evaluation on both tasks, we show that our method yields state-of-the-art results while preserving the essential properties expected of XAI tools.
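
Stripping away the diffusion-based manifold projection, the derivative-free core is a zeroth-order gradient estimate built from output queries alone, as in this sketch:

```python
import torch

def zeroth_order_grad(f, x, n_samples=64, sigma=0.1):
    """Ensemble estimate of grad f(x) using only function evaluations (sketch).
    FreeMCG additionally projects onto the data manifold via a diffusion
    model, which is omitted here.

    f: callable mapping a tensor like x to a scalar tensor.
    """
    eps = torch.randn(n_samples, *x.shape, device=x.device)
    fx = torch.stack([f(x + sigma * e) for e in eps])        # (n_samples,)
    fx = fx - fx.mean()                                      # variance reduction
    return (fx.view(-1, *([1] * x.dim())) * eps).mean(0) / sigma
```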

[318] DOGR: Towards Versatile Visual Document Grounding and Referring

Yinan Zhou, Yuxin Chen, Haokun Lin, Shuyu Yang, Zhongang Qi, Chen Ma, Li Zhu, Ying Shan

Main category: cs.CV

TL;DR: The paper introduces DOGR-Engine and DOGR-Bench to address the lack of fine-grained datasets and benchmarks for grounding and referring in visual document understanding, proposing DOGR as a baseline model for improved document understanding.

DetailsMotivation: Addressing the underdeveloped grounding and referring capabilities in visual document understanding due to scarce fine-grained datasets and benchmarks.

Method: Proposing DOGR-Engine to generate multi-granular parsing and instruction-tuning data, and constructing DOGR-Bench for comprehensive evaluation. Developing DOGR as a baseline model.

Result: DOGR excels in text localization, recognition, and grounding/referring during conversation and reasoning, advancing fine-grained document understanding.

Conclusion: The work fills a critical gap in document understanding, enabling flexible interaction paradigms through high-quality data and a strong baseline model.

Abstract: With recent advances in Multimodal Large Language Models (MLLMs), grounding and referring capabilities have gained increasing attention for achieving detailed understanding and flexible user interaction. However, these capabilities still remain underdeveloped in visual document understanding due to the scarcity of fine-grained datasets and comprehensive benchmarks. To fill this gap, we propose the DOcument Grounding and Referring data engine (DOGR-Engine), which generates two types of high-quality fine-grained document data: (1) multi-granular parsing data to improve text localization and recognition, and (2) instruction-tuning data to activate MLLMs’ grounding and referring capabilities in dialogue and reasoning. Using the DOGR-Engine, we construct DOGR-Bench, a benchmark covering seven grounding and referring tasks across three document types (chart, poster, and PDF document), offering a comprehensive evaluation of fine-grained document understanding. Leveraging the generated data, we further develop DOGR, a strong baseline model that excels in text localization and recognition, and precisely grounds and refers to key textual information during conversation and reasoning, thereby advancing document understanding to a finer granularity and enabling flexible interaction paradigms.

[319] Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis

Xinyu Hou, Zongsheng Yue, Xiaoming Li, Chen Change Loy

Main category: cs.CV

TL;DR: A single parameter ω controls granularity in diffusion-based synthesis without retraining or architectural changes, enabling precise detail control in generated outputs.

DetailsMotivation: To simplify and enhance granularity control in diffusion models without additional computational or training overhead.

Method: Incorporate ω during denoising steps of the diffusion model’s reverse process, using spatial masks or varying ω values for region- or timestep-specific control.

Result: Effective granularity control in image and video synthesis tasks, adaptable to advanced diffusion models.

Conclusion: The method is simple, efficient, and versatile, offering precise granularity control with minimal overhead.

Abstract: In this work, we show that we only need a single parameter $\omega$ to effectively control granularity in diffusion-based synthesis. This parameter is incorporated during the denoising steps of the diffusion model’s reverse process. This simple approach does not require model retraining or architectural modifications and incurs negligible computational overhead, yet enables precise control over the level of details in the generated outputs. Moreover, spatial masks or denoising schedules with varying $\omega$ values can be applied to achieve region-specific or timestep-specific granularity control. External control signals or reference images can guide the creation of precise $\omega$ masks, allowing targeted granularity adjustments. Despite its simplicity, the method demonstrates impressive performance across various image and video synthesis tasks and is adaptable to advanced diffusion models. The code is available at https://github.com/itsmag11/Omegance.
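
To make the single-parameter idea concrete, here is a minimal sketch of one DDPM reverse step in which $\omega$ rescales the predicted noise. This is one plausible reading of the abstract ($\omega$ could equally be a per-pixel mask or vary per timestep), and the exact placement of $\omega$ in the released code at https://github.com/itsmag11/Omegance may differ.

```python
import torch

def ddpm_step_with_omega(eps_model, x_t, t, alphas, alphas_bar, omega=1.0):
    """One DDPM reverse step with a granularity knob: omega rescales the
    predicted noise (it may also be a spatial mask broadcast over x_t).
    alphas / alphas_bar are the usual noise-schedule tensors indexed by t.
    """
    eps = omega * eps_model(x_t, t)                    # granularity control
    a_t, ab_t = alphas[t], alphas_bar[t]
    mean = (x_t - (1 - a_t) / (1 - ab_t).sqrt() * eps) / a_t.sqrt()
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + (1 - a_t).sqrt() * noise             # sampled x_{t-1}
```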

[320] Surf-NeRF: Surface Regularised Neural Radiance Fields

Jack Naylor, Viorela Ila, Donald G. Dansereau

Main category: cs.CV

TL;DR: The paper introduces a method combining curriculum learning and lattice-based hash encoding to improve NeRF’s geometric accuracy, using four regularization terms for better scene representation.

DetailsMotivation: NeRFs struggle with shape-radiance ambiguity and geometric accuracy despite advancements like Ref-NeRF. This work aims to enhance geometric consistency in NeRF representations.

Method: Uses curriculum learning of a surface light field model and lattice-based hash encoding. Introduces four regularization terms for geometric smoothness, normal consistency, and appearance separation.

Result: Achieves 28% more accurate normals than traditional grid-based NeRF variants and better separates view-dependent appearance.

Conclusion: The method improves geometric accuracy in NeRF representations and is compatible with existing NeRF variants, advancing radiance-based representations for geometry-critical applications.

Abstract: Neural Radiance Fields (NeRFs) provide a high fidelity, continuous scene representation that can realistically represent complex behaviour of light. Despite works like Ref-NeRF improving geometry through physics-inspired models, the ability for a NeRF to overcome shape-radiance ambiguity and converge to a representation consistent with real geometry remains limited. We demonstrate how both curriculum learning of a surface light field model and using a lattice-based hash encoding helps a NeRF converge towards a more geometrically accurate scene representation. We introduce four regularisation terms to impose geometric smoothness, consistency of normals, and a separation of Lambertian and specular appearance at geometry in the scene, conforming to physical models. Our approach yields 28% more accurate normals than traditional grid-based NeRF variants with reflection parameterisation. Our approach more accurately separates view-dependent appearance, conditioning a NeRF to have a geometric representation consistent with the captured scene. We demonstrate compatibility of our method with existing NeRF variants, as a key step in enabling radiance-based representations for geometry critical applications.

[321] BGM: Background Mixup for X-ray Prohibited Items Detection

Weizhe Liu, Renshuai Tao, Hongguang Zhu, Yunda Sun, Yao Zhao, Yunchao Wei

Main category: cs.CV

TL;DR: Proposes Background Mixup (BGM), a background-based augmentation for X-ray security images, enhancing detection by leveraging texture and material variations.

DetailsMotivation: Existing augmentations ignore X-ray image characteristics and background cues, limiting detection performance.

Method: BGM mixes background patches based on X-ray transmission imagery and material-based pseudo-coloring, focusing on texture and material variations.

Result: BGM outperforms baselines on X-ray benchmarks, improving detection without extra annotations or training costs.

Conclusion: BGM is a lightweight, effective solution for background-aware augmentation in X-ray prohibited items detection.

Abstract: Current data-driven approaches for X-ray prohibited items detection remain under-explored, particularly in the design of effective data augmentations. Existing natural image augmentations for reflected light imaging neglect the data characteristics of X-ray security images. Moreover, prior X-ray augmentation methods have predominantly focused on foreground prohibited items, overlooking informative background cues. In this paper, we propose Background Mixup (BGM), a background-based augmentation technique tailored for the X-ray security imaging domain. Unlike conventional methods, BGM is founded on an in-depth analysis of physical properties including: 1) X-ray Transmission Imagery: Transmitted X-ray pixels represent composite information from multiple materials along the imaging path. 2) Material-based Pseudo-coloring: Pseudo-coloring in X-ray images correlates directly with material properties, aiding in material distinction. Building upon the above insights, BGM mixes background patches across regions on both 1) texture structure and 2) material variation, so that models benefit from complicated background cues. This enhances the model's capability to handle domain-specific challenges such as occlusion-induced discriminative imbalance. Importantly, BGM is orthogonal and fully compatible with existing foreground-focused augmentation techniques, enabling joint use to further enhance detection performance. Extensive experiments on multiple X-ray security benchmarks show that BGM consistently surpasses strong baselines, without additional annotations or significant training overhead. This work pioneers the exploration of background-aware augmentation in X-ray prohibited items detection and provides a lightweight, plug-and-play solution with broad applicability.
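
A minimal sketch of a background-mixup style augmentation is given below: it blends a random background patch from a donor X-ray image into a background region of the input while leaving annotated foreground boxes untouched. The patch size, blending weight, and rejection-sampling loop are assumptions for illustration, not the authors' exact recipe.

```python
import torch

def background_mixup(img, donor, fg_boxes, lam=0.5, patch=64):
    """Blend a donor background patch into a background region of img.
    img, donor : (C, H, W) tensors; fg_boxes : list of (x1, y1, x2, y2)
    prohibited-item boxes that must not be altered.
    """
    _, H, W = img.shape

    def overlaps_fg(x, y):
        return any(not (x + patch <= x1 or x >= x2 or
                        y + patch <= y1 or y >= y2)
                   for x1, y1, x2, y2 in fg_boxes)

    for _ in range(20):  # rejection-sample a background location
        x = torch.randint(0, W - patch, (1,)).item()
        y = torch.randint(0, H - patch, (1,)).item()
        if not overlaps_fg(x, y):
            dx = torch.randint(0, W - patch, (1,)).item()
            dy = torch.randint(0, H - patch, (1,)).item()
            img[:, y:y+patch, x:x+patch] = (
                lam * img[:, y:y+patch, x:x+patch]
                + (1 - lam) * donor[:, dy:dy+patch, dx:dx+patch])
            break
    return img
```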

[322] Video LLMs for Temporal Reasoning in Long Videos

Fawad Javed Fateh, Umer Ahmed, Hamza Khan, M. Zeeshan Zia, Quoc-Huy Tran

Main category: cs.CV

TL;DR: TemporalVLM is a video large language model for temporal reasoning and fine-grained understanding in long videos, using time-aware features and BiLSTM for global aggregation. It outperforms previous methods on tasks like dense captioning and action segmentation.

DetailsMotivation: Addressing the challenge of temporal reasoning and fine-grained understanding in long videos, which requires capturing both local and global temporal cues.

Method: Divides videos into short-term clips, encodes them with timestamps, fuses features across overlapping windows, and uses BiLSTM for global aggregation.

Result: Superior performance on tasks like dense video captioning, temporal video grounding, and action segmentation, demonstrated on datasets like TimeIT and IndustryASM.

Conclusion: TemporalVLM effectively integrates temporal reasoning in video LLMs, setting a new benchmark for long video understanding tasks.

Abstract: This paper introduces TemporalVLM, a video large language model (video LLM) capable of effective temporal reasoning and fine-grained understanding in long videos. At the core, our approach includes a visual encoder for mapping a long-term input video into features which are time-aware and contain both local and global cues. In particular, it first divides the input video into short-term clips, which are jointly encoded with their timestamps and fused across overlapping temporal windows into time-sensitive local features. Next, the local features are passed through a bidirectional long short-term memory (BiLSTM) module for global feature aggregation. The extracted time-aware and multi-level features are important for accurate temporal reasoning and fine-grained understanding in long videos. Moreover, to facilitate the evaluation of TemporalVLM, we present a large-scale long video dataset of industry assembly processes, namely IndustryASM, which consists of videos recorded on factory floors with actions and timestamps annotated by industrial engineers for time and motion studies and temporal action segmentation evaluation. Finally, extensive experiments on datasets of long videos, including TimeIT and IndustryASM, show that TemporalVLM outperforms previous methods across temporal reasoning and fine-grained understanding tasks, namely dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation. To the best of our knowledge, our work is the first to incorporate LSTMs into video LLMs.
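
The local-then-global aggregation described above can be sketched in a few lines: clip features are made time-aware by adding a timestamp embedding, then passed through a BiLSTM for global aggregation. Dimensions and the timestamp encoding are assumptions; the actual TemporalVLM architecture is more elaborate.

```python
import torch
import torch.nn as nn

class ClipAggregator(nn.Module):
    """Time-aware clip features followed by BiLSTM global aggregation."""

    def __init__(self, feat_dim=768, hidden=512):
        super().__init__()
        self.time_proj = nn.Linear(1, feat_dim)  # assumed timestamp embedding
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                              bidirectional=True)

    def forward(self, clip_feats, timestamps):
        # clip_feats: (B, N_clips, feat_dim); timestamps: (B, N_clips)
        t = self.time_proj(timestamps.unsqueeze(-1))  # time-aware features
        out, _ = self.bilstm(clip_feats + t)          # (B, N_clips, 2*hidden)
        return out
```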

[323] PEMF-VTO: Point-Enhanced Video Virtual Try-on via Mask-free Paradigm

Tianyu Chang, Xiaohao Chen, Zhichao Wei, Xuanpu Zhang, Qing-Guo Chen, Weihua Luo, Peipei Song, Xun Yang

Main category: cs.CV

TL;DR: PEMF-VTO is a mask-free video virtual try-on framework using point-enhanced guidance for accurate garment transfer and temporal coherence.

DetailsMotivation: Existing mask-based methods struggle with complex real-world scenarios, while mask-free methods lack precision in dynamic settings.

Method: PEMF-VTO uses Point-Enhanced Transformer (PET) with spatial and temporal attention modules for precise garment transfer and smooth transitions.

Result: Outperforms state-of-the-art methods, producing natural and coherent try-on videos, especially in challenging scenarios.

Conclusion: PEMF-VTO effectively addresses limitations of existing methods, offering flexible and reliable control for video virtual try-on.

Abstract: Video Virtual Try-on aims to seamlessly transfer a reference garment onto a target person in a video while preserving both visual fidelity and temporal coherence. Existing methods typically rely on inpainting masks to define the try-on area, enabling accurate garment transfer for simple scenes (e.g., in-shop videos). However, these mask-based approaches struggle with complex real-world scenarios, as overly large and inconsistent masks often destroy spatial-temporal information, leading to distorted results. Mask-free methods alleviate this issue but face challenges in accurately determining the try-on area, especially for videos with dynamic body movements. To address these limitations, we propose PEMF-VTO, a novel Point-Enhanced Mask-Free Video Virtual Try-On framework that leverages sparse point alignments to explicitly guide garment transfer. Our key innovation is the introduction of point-enhanced guidance, which provides flexible and reliable control over both spatial-level garment transfer and temporal-level video coherence. Specifically, we design a Point-Enhanced Transformer (PET) with two core components: Point-Enhanced Spatial Attention (PSA), which uses frame-cloth point alignments to precisely guide garment transfer, and Point-Enhanced Temporal Attention (PTA), which leverages frame-frame point correspondences to enhance temporal coherence and ensure smooth transitions across frames. Extensive experiments demonstrate that our PEMF-VTO outperforms state-of-the-art methods, generating more natural, coherent, and visually appealing try-on videos, particularly for challenging in-the-wild scenarios. The link to our paper’s homepage is https://pemf-vto.github.io/.

[324] How Cars Move: Analyzing Driving Dynamics for Safer Urban Traffic

Kangan Qian, Jinyu Miao, Xinyu Jiao, Ziang Luo, Zheng Fu, Yining Shi, Yunlong Wang, Kun Jiang, Diange Yang

Main category: cs.CV

TL;DR: PriorMotion is a data integration framework for analyzing urban traffic dynamics, improving accuracy and adaptability over traditional grid-based methods.

DetailsMotivation: Conventional traffic analysis methods are fragmented and overlook spatial-temporal interdependencies, limiting effectiveness in complex urban environments.

Method: Combines multi-scale empirical observations with customized analytical tools to capture evolving spatial-temporal traffic trends.

Result: Enhances traffic pattern accuracy, adaptability to heterogeneous data, and reduces projection errors.

Conclusion: Validated as effective for urban infrastructure management requiring precise spatial-temporal analysis.

Abstract: Understanding the spatial dynamics of cars within urban systems is essential for optimizing infrastructure management and resource allocation. Recent empirical approaches for analyzing traffic patterns have gained traction due to their applicability to city-scale policy development. However, conventional methodologies often rely on fragmented grid-based techniques, which may overlook critical interdependencies among spatial elements and temporal continuity. These limitations can compromise analytical effectiveness in complex urban environments. To address these challenges, we propose PriorMotion, a data integration framework designed to systematically uncover movement patterns through driving dynamics analysis. Our approach combines multi-scale empirical observations with customized analytical tools to capture evolving spatial-temporal trends in urban traffic. Comprehensive evaluations demonstrate that PriorMotion significantly enhances analytical outcomes, including increased accuracy in traffic pattern analysis, improved adaptability to heterogeneous data environments, and reduced long-term projection errors. Validation confirms its effectiveness for urban infrastructure management applications requiring precise characterization of complex spatial-temporal interactions.

[325] Advancing Textual Prompt Learning with Anchored Attributes

Zheng Li, Yibing Song, Ming-Ming Cheng, Xiang Li, Jian Yang

Main category: cs.CV

TL;DR: ATPrompt enhances vision-language alignment by using universal attributes as a bridge, improving adaptability to unknown categories with minimal computational cost.

DetailsMotivation: Current prompt learning methods align images only with known categories, limiting adaptability to unknown categories.

Method: Introduces ATPrompt, which expands soft prompts to include attribute tokens and uses a differentiable attribute search for better alignment.

Result: Validated on 11 datasets, ATPrompt improves alignment between images and unknown categories.

Conclusion: ATPrompt is an effective, plug-in solution for enhancing textual-based prompt learning methods.

Abstract: Textual-based prompt learning methods primarily employ multiple learnable soft prompts and hard class tokens in a cascading manner as text inputs, aiming to align image and text (category) spaces for downstream tasks. However, current training is restricted to aligning images with predefined known categories and cannot be associated with unknown categories. In this work, we propose utilizing universal attributes as a bridge to enhance the alignment between images and unknown categories. Specifically, we introduce an Attribute-anchored Textual Prompt learning method for vision-language models, named ATPrompt. This approach expands the learning space of soft prompts from the original one-dimensional category level into the multi-dimensional attribute level by incorporating multiple attribute tokens into the learnable soft prompts. Through this modification, we transform the text prompt from a category-centric form to an attribute-category hybrid form. Additionally, we introduce a straightforward differentiable attribute search method to identify representative and suitable attributes for downstream tasks. As an easy-to-use plug-in technique, ATPrompt can seamlessly replace the existing basic prompt format in textual-based methods, providing general improvements at a negligible computational cost. Extensive experiments across 11 datasets validate the effectiveness of our method. Code is publicly available at https://github.com/zhengli97/ATPrompt.
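
A sketch of the attribute-anchored prompt layout is shown below: learnable context tokens are concatenated with per-attribute soft tokens ahead of the frozen class-name embeddings. Token counts and ordering are illustrative assumptions, not the paper's exact configuration; the released code linked above has the real one.

```python
import torch
import torch.nn as nn

class ATPromptBuilder(nn.Module):
    """Soft context tokens plus attribute tokens prepended to class embeddings."""

    def __init__(self, n_ctx=4, n_attr=2, n_attr_tok=2, dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        self.attr = nn.Parameter(torch.randn(n_attr, n_attr_tok, dim) * 0.02)

    def forward(self, class_emb):
        # class_emb: (n_cls, n_class_tok, dim) frozen class-name embeddings
        attr_flat = self.attr.flatten(0, 1)               # (n_attr*n_attr_tok, dim)
        prefix = torch.cat([self.ctx, attr_flat], dim=0)  # shared soft prefix
        prefix = prefix.unsqueeze(0).expand(class_emb.size(0), -1, -1)
        return torch.cat([prefix, class_emb], dim=1)      # per-class prompts
```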

[326] GURecon: Learning Detailed 3D Geometric Uncertainties for Neural Surface Reconstruction

Zesong Yang, Ru Zhang, Jiale Shi, Zixiang Ai, Boming Zhao, Hujun Bao, Luwei Yang, Zhaopeng Cui

Main category: cs.CV

TL;DR: GURecon introduces a geometric uncertainty field for neural surfaces, addressing challenges in 3D reconstruction quality assessment without ground truth, using online distillation and decoupled fields to improve accuracy.

DetailsMotivation: Assessing geometric quality in neural surface reconstructions is difficult without ground truth, due to rendering-based optimization and entangled appearance-geometry learning.

Method: GURecon models a continuous 3D uncertainty field based on geometric consistency, using online distillation and a decoupled field to mitigate illumination interference.

Result: GURecon outperforms in modeling 3D geometric uncertainty, works with various neural surface representations, and improves tasks like incremental reconstruction.

Conclusion: GURecon provides a robust, unsupervised method for geometric uncertainty assessment in neural surface reconstructions, enhancing downstream applications.

Abstract: Neural surface representation has demonstrated remarkable success in the areas of novel view synthesis and 3D reconstruction. However, assessing the geometric quality of 3D reconstructions in the absence of ground truth mesh remains a significant challenge, due to its rendering-based optimization process and entangled learning of appearance and geometry with photometric losses. In this paper, we present a novel framework, i.e., GURecon, which establishes a geometric uncertainty field for the neural surface based on geometric consistency. Different from existing methods that rely on rendering-based measurement, GURecon models a continuous 3D uncertainty field for the reconstructed surface, which is learned by an online distillation approach without introducing real geometric information for supervision. Moreover, in order to mitigate the interference of illumination on geometric consistency, a decoupled field is learned and exploited to finetune the uncertainty field. Experiments on various datasets demonstrate the superiority of GURecon in modeling 3D geometric uncertainty, as well as its plug-and-play extension to various neural surface representations and improvement on downstream tasks such as incremental reconstruction. The code and supplementary material are available on the project website: https://zju3dv.github.io/GURecon/.

[327] Vulnerability-Aware Spatio-Temporal Learning for Generalizable Deepfake Video Detection

Dat Nguyen, Marcella Astrid, Anis Kacem, Enjie Ghorbel, Djamila Aouada

Main category: cs.CV

TL;DR: FakeSTormer is a fine-grained deepfake video detection method that models subtle spatio-temporal inconsistencies using multi-task learning and pseudo-fake video synthesis.

DetailsMotivation: Existing deepfake detection methods struggle with generalization and fail to capture imperceptible spatio-temporal artifacts due to advancements in generative AI.

Method: Proposes a multi-task learning framework with auxiliary branches for artifact-prone regions and a video-level data synthesis strategy for pseudo-fake videos.

Result: Outperforms state-of-the-art methods on challenging benchmarks.

Conclusion: FakeSTormer effectively addresses generalization and artifact detection challenges in deepfake videos.

Abstract: Detecting deepfake videos is highly challenging given the complexity of characterizing spatio-temporal artifacts. Most existing methods rely on binary classifiers trained using real and fake image sequences, therefore hindering their generalization capabilities to unseen generation methods. Moreover, with the constant progress in generative Artificial Intelligence (AI), deepfake artifacts are becoming imperceptible at both the spatial and the temporal levels, making them extremely difficult to capture. To address these issues, we propose a fine-grained deepfake video detection approach called FakeSTormer that enforces the modeling of subtle spatio-temporal inconsistencies while avoiding overfitting. Specifically, we introduce a multi-task learning framework that incorporates two auxiliary branches for explicitly attending to artifact-prone spatial and temporal regions. Additionally, we propose a video-level data synthesis strategy that generates pseudo-fake videos with subtle spatio-temporal artifacts, providing high-quality samples and hands-free annotations for our additional branches. Extensive experiments on several challenging benchmarks demonstrate the superiority of our approach compared to recent state-of-the-art methods. The code is available at https://github.com/10Ring/FakeSTormer.

[328] Free-Form Motion Control: Controlling the 6D Poses of Camera and Objects in Video Generation

Xincheng Shuai, Henghui Ding, Zhenyuan Qin, Hao Luo, Xingjun Ma, Dacheng Tao

Main category: cs.CV

TL;DR: The paper introduces SynFMC, a synthetic dataset for 6D pose annotations, and FMC, a method for precise 3D-aware motion control in videos, outperforming existing methods.

DetailsMotivation: Existing text-to-video methods lack controllability over 3D-aware motion due to missing 6D pose annotations.

Method: Proposes SynFMC dataset with 6D pose annotations and FMC method for independent/simultaneous control of object and camera motions.

Result: FMC produces high-fidelity videos and is compatible with personalized T2I models, outperforming prior methods.

Conclusion: SynFMC and FMC advance 3D-aware motion control in video generation, offering improved precision and flexibility.

Abstract: Controlling the movements of dynamic objects and the camera within generated videos is a meaningful yet challenging task. Due to the lack of datasets with comprehensive 6D pose annotations, existing text-to-video methods cannot simultaneously control the motions of both the camera and objects in a 3D-aware manner, resulting in limited controllability over generated contents. To address this issue and facilitate the research in this field, we introduce a Synthetic Dataset for Free-Form Motion Control (SynFMC). The proposed SynFMC dataset includes diverse object and environment categories and covers various motion patterns according to specific rules, simulating common and complex real-world scenarios. The complete 6D pose information facilitates models learning to disentangle the motion effects from objects and the camera in a video. To provide precise 3D-aware motion control, we further propose a method trained on SynFMC, Free-Form Motion Control (FMC). FMC can control the 6D poses of objects and camera independently or simultaneously, producing high-fidelity videos. Moreover, it is compatible with various personalized text-to-image (T2I) models for different content styles. Extensive experiments demonstrate that the proposed FMC outperforms previous methods across multiple scenarios.

[329] MORDA: A Synthetic Dataset to Facilitate Adaptation of Object Detectors to Unseen Real-target Domain While Preserving Performance on Real-source Domain

Hojun Lim, Heecheol Yoo, Jinwoo Lee, Seungmin Jeon, Hyeongseok Jeon

Main category: cs.CV

TL;DR: The paper proposes using synthetic environments to reduce the cost and effort of data acquisition for autonomous vehicles, demonstrating improved performance with a novel dataset, MORDA.

DetailsMotivation: The high cost and effort of acquiring and labeling large-scale, high-quality data for autonomous vehicles, especially when deploying to new regions, motivate the use of synthetic environments as an auxiliary domain.

Method: The authors create synthetic environments mimicking real domains (e.g., South Korea) and blend them with real-source data (nuScenes) to form MORDA, a synthetic-fusion dataset. They train 2D/3D detectors on this dataset.

Result: Experiments show MORDA significantly improves mean Average Precision (mAP) on the AI-Hub dataset (South Korea) while maintaining or slightly enhancing performance on nuScenes.

Conclusion: Synthetic environments like MORDA offer a cost-effective solution for domain adaptation in autonomous vehicle perception, improving performance in new regions without extensive real-world data.

Abstract: Deep neural network (DNN) based perception models are indispensable in the development of autonomous vehicles (AVs). However, their reliance on large-scale, high-quality data is broadly recognized as a burdensome necessity due to the substantial cost of data acquisition and labeling. Further, the issue is not a one-time concern, as AVs might need a new dataset if they are to be deployed to another region (real-target domain) that the in-hand dataset within the real-source domain cannot cover. To mitigate this burden, we propose leveraging synthetic environments as an auxiliary domain where the characteristics of real domains are reproduced. This approach could enable indirect experience about the real-target domain in a time- and cost-effective manner. As a practical demonstration of our methodology, nuScenes and South Korea are employed to represent the real-source and real-target domains, respectively. That means we construct digital twins for several regions of South Korea, and the data-acquisition framework of nuScenes is reproduced. Blending the aforementioned components within a simulator allows us to obtain a synthetic-fusion domain in which we forge our novel driving dataset, MORDA: Mixture Of Real-domain characteristics for synthetic-data-assisted Domain Adaptation. To verify the value of the synthetic features that MORDA provides in learning about driving environments of South Korea, 2D/3D detectors are trained solely on a combination of nuScenes and MORDA. Afterward, their performance is evaluated on the unforeseen real-world dataset (AI-Hub) collected in South Korea. Our experiments show that MORDA can significantly improve mean Average Precision (mAP) on the AI-Hub dataset while mAP on nuScenes is retained or slightly enhanced.

[330] FlexiClip: Locality-Preserving Free-Form Character Animation

Anant Khandelwal

Main category: cs.CV

TL;DR: FlexiClip animates clipart images with temporal coherence and geometric integrity by extending Bézier trajectory modeling with temporal Jacobians, probability flow ODEs, and a flow matching loss, outperforming prior approaches across diverse clipart types.

DetailsMotivation: Existing methods such as AniClipart model spatial deformations but fail to ensure smooth temporal transitions, while T2V/I2V models struggle with the statistical mismatch between natural video and clipart styles.

Method: Extends Bézier curve-based trajectory modeling with temporal Jacobians to correct motion dynamics incrementally, continuous-time modeling via probability flow ODEs (pfODEs) to mitigate temporal noise, and a GFlowNet-inspired flow matching loss, integrated with pre-trained video diffusion models.

Result: Produces smooth, natural, and structurally consistent animations across diverse clipart types, including humans and animals, even under rapid movements and non-rigid deformations.

Conclusion: FlexiClip sets a new standard for high-quality clipart animation by jointly addressing temporal consistency and geometric integrity.

Abstract: Animating clipart images with seamless motion while maintaining visual fidelity and temporal coherence presents significant challenges. Existing methods, such as AniClipart, effectively model spatial deformations but often fail to ensure smooth temporal transitions, resulting in artifacts like abrupt motions and geometric distortions. Similarly, text-to-video (T2V) and image-to-video (I2V) models struggle to handle clipart due to the mismatch in statistical properties between natural video and clipart styles. This paper introduces FlexiClip, a novel approach designed to overcome these limitations by addressing the intertwined challenges of temporal consistency and geometric integrity. FlexiClip extends traditional Bézier curve-based trajectory modeling with key innovations: temporal Jacobians to correct motion dynamics incrementally, continuous-time modeling via probability flow ODEs (pfODEs) to mitigate temporal noise, and a flow matching loss inspired by GFlowNet principles to optimize smooth motion transitions. These enhancements ensure coherent animations across complex scenarios involving rapid movements and non-rigid deformations. Extensive experiments validate the effectiveness of FlexiClip in generating animations that are not only smooth and natural but also structurally consistent across diverse clipart types, including humans and animals. By integrating spatial and temporal modeling with pre-trained video diffusion models, FlexiClip sets a new standard for high-quality clipart animation, offering robust performance across a wide range of visual content. Project Page: https://creative-gen.github.io/flexiclip.github.io/

[331] An Overall Real-Time Mechanism for Classification and Quality Evaluation of Rice

Wanke Xia, Ruoxin Peng, Haoqi Chu, Xinlei Zhu, Zhiyu Yang, Lili Yang

Main category: cs.CV

TL;DR: The paper proposes a real-time rice grain evaluation system using machine vision, combining object detection, deep learning, and traditional ML for variety identification, grain grading, and chalkiness assessment, achieving high accuracy.

DetailsMotivation: Manual rice classification is time-consuming and error-prone. Automating this process with machine vision improves accuracy and efficiency.

Method: Integrates one-stage object detection, deep convolutional neural networks, and traditional ML techniques for rice grain assessment.

Result: Achieves 99.14% mAP in detection, 97.89% accuracy in classification, and 97.56% in grain completeness grading.

Conclusion: The framework provides an effective, automated solution for rice quality evaluation, enhancing efficiency and accuracy.

Abstract: Rice is one of the most widely cultivated crops globally and has been developed into numerous varieties. The quality of rice during cultivation is primarily determined by its cultivar and characteristics. Traditionally, rice classification and quality assessment rely on manual visual inspection, a process that is both time-consuming and prone to errors. However, with advancements in machine vision technology, automating rice classification and quality evaluation based on its cultivar and characteristics has become increasingly feasible, enhancing both accuracy and efficiency. This study proposes a real-time evaluation mechanism for comprehensive rice grain assessment, integrating a one-stage object detection approach, a deep convolutional neural network, and traditional machine learning techniques. The proposed framework enables rice variety identification, grain completeness grading, and grain chalkiness evaluation. The rice grain dataset used in this study comprises approximately 20,000 images from six widely cultivated rice varieties in China. Experimental results demonstrate that the proposed mechanism achieves a mean average precision (mAP) of 99.14% in the object detection task and an accuracy of 97.89% in the classification task. Furthermore, the framework attains an average accuracy of 97.56% in grain completeness grading within the same rice variety, contributing to an effective quality evaluation system.

[332] Can Optical Denoising Clean Sonar Images? A Benchmark and Fusion Approach

Ziyu Wang, Tao Xue, Jingyuan Li, Haibin Zhang, Zhiqiang Xu, Gaofei Xu, Zhen Wang, Yanbin Wang, Zhiquan Liu

Main category: cs.CV

TL;DR: The paper evaluates deep denoising models for sonar images, assessing their impact on detection accuracy and proposing a fusion framework for improved results.

DetailsMotivation: Object detection in sonar images is hindered by complex noise patterns, yet denoising techniques for sonar data are underexplored.

Method: Systematic evaluation of nine deep denoising models on five sonar datasets, tested with four detection algorithms, and introduction of a multi-source denoising fusion framework.

Result: Denoising improves detection performance, but effectiveness varies by method. The proposed fusion framework enhances image quality.

Conclusion: Denoising benefits sonar image detection, and the fusion framework offers a synergistic solution for noise reduction.

Abstract: Object detection in sonar images is crucial for underwater robotics applications including autonomous navigation and resource exploration. However, complex noise patterns inherent in sonar imagery, particularly speckle, reverberation, and non-Gaussian noise, significantly degrade detection accuracy. While denoising techniques have achieved remarkable success in optical imaging, their applicability to sonar data remains underexplored. This study presents the first systematic evaluation of nine state-of-the-art deep denoising models with distinct architectures, including Neighbor2Neighbor with varying noise parameters, Blind2Unblind with different noise configurations, and DSPNet, for sonar image preprocessing. We establish a rigorous benchmark using five publicly available sonar datasets and assess their impact on four representative detection algorithms: YOLOX, Faster R-CNN, SSD300, and SSDMobileNetV2. Our evaluation addresses three unresolved questions: first, how effectively optical denoising architectures transfer to sonar data; second, which model families perform best against sonar noise; and third, whether denoising truly improves detection accuracy in practical pipelines. Extensive experiments demonstrate that while denoising generally improves detection performance, effectiveness varies across methods due to their inherent biases toward specific noise types. To leverage complementary denoising effects, we propose a mutually-supervised multi-source denoising fusion framework where outputs from different denoisers mutually supervise each other at the pixel level, creating a synergistic framework that produces cleaner images.
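
One illustrative reading of the pixel-level mutual supervision is sketched below: each denoiser's output is pulled toward the (detached) average of its peers, so the ensemble converges on a consensus clean image. The specific loss and weighting are assumptions, not the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def mutual_supervision_loss(outputs):
    """Pixel-level mutual supervision across denoisers.
    outputs: list of (B, C, H, W) tensors from different denoisers (>= 2).
    """
    loss = 0.0
    for i, out in enumerate(outputs):
        peers = torch.stack([o for j, o in enumerate(outputs) if j != i])
        target = peers.mean(dim=0).detach()  # peers supervise denoiser i
        loss = loss + F.l1_loss(out, target)
    return loss / len(outputs)
```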

[333] WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation

Dujun Nie, Xianda Guo, Yiqun Duan, Ruijun Zhang, Long Chen

Main category: cs.CV

TL;DR: WMNav introduces a world model-based navigation framework using VLMs to predict outcomes and build memories, improving success rates and efficiency in object goal navigation.

DetailsMotivation: Current VLM-based agents lack modular world models to reduce risky interactions by predicting future states, limiting their effectiveness in unseen environments.

Method: WMNav uses a Curiosity Value Map for dynamic navigation policy, decomposes tasks like human thinking, and employs a two-stage action strategy (exploration then localization).

Result: WMNav outperforms zero-shot benchmarks with absolute improvements of +3.2% SR and +3.2% SPL on HM3D, and +13.5% SR and +1.1% SPL on MP3D.

Conclusion: WMNav’s modular design and predictive capabilities enhance navigation efficiency and success, addressing limitations of current VLM-based agents.

Abstract: Object Goal Navigation, which requires an agent to locate a specific object in an unseen environment, remains a core challenge in embodied AI. Although recent progress in Vision-Language Model (VLM)-based agents has demonstrated promising perception and decision-making abilities through prompting, none has yet established a fully modular world model design that reduces risky and costly interactions with the environment by predicting the future state of the world. We introduce WMNav, a novel World Model-based Navigation framework powered by Vision-Language Models (VLMs). It predicts possible outcomes of decisions and builds memories to provide feedback to the policy module. To retain the predicted state of the environment, WMNav proposes the online maintained Curiosity Value Map as part of the world model memory to provide dynamic configuration for navigation policy. By decomposing according to a human-like thinking process, WMNav effectively alleviates the impact of model hallucination by making decisions based on the feedback difference between the world model plan and observation. To further boost efficiency, we implement a two-stage action proposer strategy: broad exploration followed by precise localization. Extensive evaluation on HM3D and MP3D validates that WMNav surpasses existing zero-shot benchmarks in both success rate and exploration efficiency (absolute improvement: +3.2% SR and +3.2% SPL on HM3D, +13.5% SR and +1.1% SPL on MP3D). Project page: https://b0b8k1ng.github.io/WMNav/.

[334] An Improved Pure Fully Connected Neural Network for Rice Grain Classification

Wanke Xia, Ruoxin Peng, Haoqi Chu, Xinlei Zhu, Zhiyu Yang, Lili Yang, Bo Lv, Xunwen Xiang

Main category: cs.CV

TL;DR: The paper proposes a two-stage training method and improved preprocessing to enhance deep learning-based rice grain classification, increasing accuracy from 97% to 99%.

DetailsMotivation: Classical deep learning models struggle to distinguish rice varieties with similar external traits, leading to misclassifications. The study aims to improve accuracy and feasibility.

Method: Used a fully connected neural network, shifted from one-stage to two-stage training, and improved preprocessing by correcting image tilting to horizontal/vertical positions.

Result: Accuracy improved from 97% to 99% after implementing the two-stage training and preprocessing enhancements.

Conclusion: The proposed methods significantly boost the classification performance of deep learning models for rice grain identification.

Abstract: Rice is a staple food for a significant portion of the world's population, providing essential nutrients and serving as a versatile ingredient in a wide range of culinary traditions. Recently, the use of deep learning has enabled automated classification of rice, improving accuracy and efficiency. However, classical models based on first-stage training may face difficulties in distinguishing between rice varieties with similar external characteristics, thus leading to misclassifications. Considering the transparency and feasibility of the model, we selected and gradually improved a pure fully connected neural network to classify rice grains. The dataset we used contains both global and domestic rice images obtained from websites and laboratories, respectively. First, the training mode was changed from one-stage training to two-stage training, which significantly contributes to distinguishing two similar types of rice. Secondly, the preprocessing method was changed from random tilting to horizontal or vertical position correction. After those two enhancements, the accuracy of our model increased notably from 97% to 99%. In summary, the two subtle methods proposed in this study can remarkably enhance the classification ability of deep learning models for rice grain classification.
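
The two-stage idea can be sketched generically: train on all varieties first, then continue training at a lower learning rate with emphasis on the easily confused pairs. The staging, learning rates, and epoch counts below are assumptions; the abstract does not spell out these hyperparameters.

```python
import torch.nn as nn
import torch.optim as optim

def two_stage_training(model, coarse_loader, fine_loader, epochs=(10, 10)):
    """Stage 1: all rice varieties; stage 2: the hard-to-distinguish subset
    at a lower learning rate."""
    criterion = nn.CrossEntropyLoss()
    for stage, (loader, lr) in enumerate(zip((coarse_loader, fine_loader),
                                             (1e-3, 1e-4))):
        opt = optim.Adam(model.parameters(), lr=lr)  # fresh optimizer per stage
        for _ in range(epochs[stage]):
            for x, y in loader:
                opt.zero_grad()
                criterion(model(x), y).backward()
                opt.step()
    return model
```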

[335] Leveraging Spatial Context for Positive Pair Sampling in Histopathology Image Representation Learning

Willmer Rafell Quinones Robles, Sakonporn Noree, Young Sin Ko, Bryan Wong, Jongwoo Kim, Mun Yong Yi

Main category: cs.CV

TL;DR: The paper proposes a spatial context-driven positive pair sampling strategy for self-supervised learning (SSL) in cancer classification from whole-slide images (WSIs), improving accuracy by 5-10% over standard methods.

DetailsMotivation: Deep learning for cancer classification requires extensive expert annotations, which are limiting. Annotation-free methods like SSL are promising but often rely on synthetic augmentations that miss critical spatial structures in histopathology.

Method: The authors introduce a modular spatial context-driven positive pair sampling strategy for SSL, compatible with frameworks like Barlow Twins and DINOv2. It leverages morphological coherence of adjacent patches in WSIs.

Result: Experiments on four datasets show consistent improvements, with 5-10% accuracy gains in slide-level classification and patch-level linear probing compared to standard augmentation-based SSL.

Conclusion: The spatial context-driven approach enhances SSL for computational pathology, offering a biologically meaningful solution for annotation-limited settings.

Abstract: Deep learning has shown strong potential in cancer classification from whole-slide images (WSIs), but the need for extensive expert annotations often limits its success. Annotation-free approaches, such as multiple instance learning (MIL) and self-supervised learning (SSL), have emerged as promising alternatives to traditional annotation-based methods. However, conventional SSL methods typically rely on synthetic data augmentations, which may fail to capture the spatial structure critical to histopathology. In this work, we propose a spatial context-driven positive pair sampling strategy that enhances SSL by leveraging the morphological coherence of spatially adjacent patches within WSIs. Our method is modular and compatible with established joint embedding SSL frameworks, including Barlow Twins, BYOL, VICReg, and DINOv2. We evaluate its effectiveness on both slide-level classification using MIL and patch-level linear probing. Experiments across four datasets demonstrate consistent performance improvements, with accuracy gains of 5% to 10% compared to standard augmentation-based sampling. These findings highlight the value of spatial context in improving representation learning for computational pathology and provide a biologically meaningful enhancement for pretraining models in annotation-limited settings. The code is available at https://anonymous.4open.science/r/contextual-pairs-E72F/.
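
The sampling strategy itself is simple enough to sketch: instead of pairing two synthetic augmentations of one patch, pair each whole-slide-image patch with a spatially adjacent one, relying on the morphological coherence of neighboring tissue. The grid representation and neighborhood radius below are assumptions.

```python
import random

def sample_spatial_positive(patch_coords, max_offset=1):
    """patch_coords maps (row, col) grid positions to patch ids; return an
    (anchor, positive) pair where the positive is a spatial neighbor."""
    anchor_rc = random.choice(list(patch_coords))
    r, c = anchor_rc
    neighbors = [(r + dr, c + dc)
                 for dr in range(-max_offset, max_offset + 1)
                 for dc in range(-max_offset, max_offset + 1)
                 if (dr, dc) != (0, 0) and (r + dr, c + dc) in patch_coords]
    if not neighbors:  # isolated patch: fall back to self-pairing
        return patch_coords[anchor_rc], patch_coords[anchor_rc]
    return patch_coords[anchor_rc], patch_coords[random.choice(neighbors)]
```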

[336] Stereo Any Video: Temporally Consistent Stereo Matching

Junpeng Jing, Weixun Luo, Ye Mao, Krystian Mikolajczyk

Main category: cs.CV

TL;DR: Stereo Any Video is a framework for video stereo matching that achieves accurate, consistent disparities without auxiliary data, leveraging monocular depth priors and novel architectural innovations.

DetailsMotivation: To address the challenge of estimating spatially accurate and temporally consistent disparities in video stereo matching without relying on additional information like camera poses or optical flow.

Method: Integrates monocular video depth priors with convolutional features, introduces all-to-all-pairs correlation for robust matching cost volumes, and temporal convex upsampling for coherence.

Result: Achieves state-of-the-art performance in zero-shot settings and generalizes well to real-world indoor and outdoor scenarios.

Conclusion: The framework sets a new standard in video stereo matching by combining robustness, accuracy, and temporal consistency.

Abstract: This paper introduces Stereo Any Video, a powerful framework for video stereo matching. It can estimate spatially accurate and temporally consistent disparities without relying on auxiliary information such as camera poses or optical flow. The strong capability is driven by rich priors from monocular video depth models, which are integrated with convolutional features to produce stable representations. To further enhance performance, key architectural innovations are introduced: all-to-all-pairs correlation, which constructs smooth and robust matching cost volumes, and temporal convex upsampling, which improves temporal coherence. These components collectively ensure robustness, accuracy, and temporal consistency, setting a new standard in video stereo matching. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple datasets both qualitatively and quantitatively in zero-shot settings, as well as strong generalization to real-world indoor and outdoor scenarios.
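
The all-to-all-pairs correlation can be sketched as a dense dot-product volume between every left-view feature and every right-view feature; the real system builds this per frame and adds temporal components, and the scaling below is an assumption.

```python
import torch

def all_pairs_correlation(feat_left, feat_right):
    """feat_left, feat_right: (B, C, H, W) feature maps.
    Returns a (B, H, W, H, W) correlation volume."""
    B, C, H, W = feat_left.shape
    fl = feat_left.flatten(2).transpose(1, 2)  # (B, H*W, C)
    fr = feat_right.flatten(2)                 # (B, C, H*W)
    corr = torch.bmm(fl, fr) / C**0.5          # scaled dot products
    return corr.view(B, H, W, H, W)
```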

[337] DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability

Xirui Hu, Jiahao Wang, Hao Chen, Weizhan Zhang, Benqi Wang, Yikun Li, Haishun Nan

Main category: cs.CV

TL;DR: DynamicID is a framework for personalized human image generation, supporting single-ID and multi-ID scenarios with high fidelity and facial editability. It introduces Semantic-Activated Attention and Identity-Motion Reconfigurator, and uses a task-decoupled training paradigm with a curated dataset.

DetailsMotivation: Existing methods for personalized human image generation are limited to single-ID scenarios and lack facial editability. DynamicID aims to address these limitations.

Method: DynamicID employs Semantic-Activated Attention (SAA) for multi-ID personalization and Identity-Motion Reconfigurator (IMR) for facial editing. It uses a task-decoupled training paradigm and a dataset (VariFace-10k).

Result: DynamicID outperforms state-of-the-art methods in identity fidelity, facial editability, and multi-ID personalization.

Conclusion: DynamicID provides a robust solution for personalized human image generation with enhanced flexibility and fidelity.

Abstract: Recent advances in text-to-image generation have driven interest in generating personalized human images that depict specific identities from reference images. Although existing methods achieve high-fidelity identity preservation, they are generally limited to single-ID scenarios and offer insufficient facial editability. We present DynamicID, a tuning-free framework that inherently facilitates both single-ID and multi-ID personalized generation with high fidelity and flexible facial editability. Our key innovations include: 1) Semantic-Activated Attention (SAA), which employs query-level activation gating to minimize disruption to the base model when injecting ID features and achieve multi-ID personalization without requiring multi-ID samples during training. 2) Identity-Motion Reconfigurator (IMR), which applies feature-space manipulation to effectively disentangle and reconfigure facial motion and identity features, supporting flexible facial editing. 3) a task-decoupled training paradigm that reduces data dependency, together with VariFace-10k, a curated dataset of 10k unique individuals, each represented by 35 distinct facial images. Experimental results demonstrate that DynamicID outperforms state-of-the-art methods in identity fidelity, facial editability, and multi-ID personalization capability. Our code will be released at https://github.com/ByteCat-bot/DynamicID.

[338] OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation

Ding Zhong, Xu Zheng, Chenfei Liao, Yuanhuiyi Lyu, Jialei Chen, Shengyang Wu, Linfeng Zhang, Xuming Hu

Main category: cs.CV

TL;DR: OmniSAM adapts SAM2 for panoramic semantic segmentation by addressing FoV disparity and enhancing semantic understanding, outperforming state-of-the-art methods.

DetailsMotivation: The significant FoV gap and lack of pixel-level semantic understanding in applying SAM2 to panoramic images necessitate a tailored solution.

Method: OmniSAM divides panoramas into patches, treats them as sequences, leverages SAM2’s memory for cross-patch dependencies, and fine-tunes encoders for semantic prediction.

Result: OmniSAM achieves significant improvements, e.g., 79.06% (+10.22%) on SPin8-to-SPan8 and 62.46% (+6.58%) on CS13-to-DP13.

Conclusion: OmniSAM successfully bridges the FoV and semantic gaps, demonstrating superior performance in panoramic segmentation tasks.

Abstract: Segment Anything Model 2 (SAM2) has emerged as a strong base model in various pinhole imaging segmentation tasks. However, when applying it to the $360^\circ$ domain, the significant field-of-view (FoV) gap between pinhole ($70^\circ \times 70^\circ$) and panoramic images ($180^\circ \times 360^\circ$) poses unique challenges. Two major concerns for this application include: 1) inevitable distortion and object deformation brought by the large FoV disparity between domains; 2) the lack of pixel-level semantic understanding that the original SAM2 cannot provide. To address these issues, we propose a novel OmniSAM framework, which makes the first attempt to apply SAM2 to panoramic semantic segmentation. Specifically, to bridge the first gap, OmniSAM first divides the panorama into sequences of patches. These patches are then treated as image sequences in a similar manner as in video segmentation tasks. We then leverage SAM2's memory mechanism to extract cross-patch correspondences that embed the cross-FoV dependencies, improving feature continuity and prediction consistency along mask boundaries. For the second gap, OmniSAM fine-tunes the pretrained image encoder and reutilizes the mask decoder for semantic prediction. An FoV-based prototypical adaptation module with a dynamic pseudo label update mechanism is also introduced to facilitate the alignment of memory and backbone features, thereby improving model generalization ability across different sizes of source models. Extensive experimental results demonstrate that OmniSAM outperforms the state-of-the-art methods by large margins, e.g., 79.06% (+10.22%) on SPin8-to-SPan8, 62.46% (+6.58%) on CS13-to-DP13.
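
The first step, turning a panorama into a patch sequence that SAM2's memory can consume like video frames, is easy to sketch. Patch width, overlap, and the horizontal wrap-around at the 360° seam are assumptions for illustration.

```python
import torch

def panorama_to_patch_sequence(pano, patch_w=512, overlap=64):
    """Slice an equirectangular panorama (C, H, W) into a horizontal
    sequence of overlapping patches, wrapping around the 360-degree seam."""
    C, H, W = pano.shape
    stride = patch_w - overlap
    patches = []
    for x in range(0, W, stride):
        cols = torch.arange(x, x + patch_w) % W  # wrap-around indexing
        patches.append(pano[:, :, cols])
    return torch.stack(patches)                  # (N_patches, C, H, patch_w)
```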

[339] RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models

Yijing Lin, Mengqi Huang, Shuhan Zhuang, Zhendong Mao

Main category: cs.CV

TL;DR: RealGeneral unifies image generation tasks using video models, improving performance in tasks like customized generation and canny-to-image.

DetailsMotivation: Existing visual generation models lack unification and generalizability, unlike LLMs.

Method: Reformulates image generation as conditional frame prediction, using Unified Conditional Embedding and Unified Stream DiT Block.

Result: Achieves 14.5% better subject similarity and 10% higher image quality in tasks.

Conclusion: RealGeneral offers a unified, effective framework for diverse image generation tasks.

Abstract: Unifying diverse image generation tasks within a single framework remains a fundamental challenge in visual generation. While large language models (LLMs) achieve unification through task-agnostic data and generation, existing visual generation models fail to meet these principles. Current approaches either rely on per-task datasets and large-scale training or adapt pre-trained image models with task-specific modifications, limiting their generalizability. In this work, we explore video models as a foundation for unified image generation, leveraging their inherent ability to model temporal correlations. We introduce RealGeneral, a novel framework that reformulates image generation as a conditional frame prediction task, analogous to in-context learning in LLMs. To bridge the gap between video models and condition-image pairs, we propose (1) a Unified Conditional Embedding module for multi-modal alignment and (2) a Unified Stream DiT Block with decoupled adaptive LayerNorm and attention mask to mitigate cross-modal interference. RealGeneral demonstrates effectiveness in multiple important visual generation tasks, e.g., it achieves a 14.5% improvement in subject similarity for customized generation and a 10% enhancement in image quality for canny-to-image task. Project page: https://lyne1.github.io/realgeneral_web/; GitHub Link: https://github.com/Lyne1/RealGeneral

[340] Studying Classifier(-Free) Guidance From a Classifier-Centric Perspective

Xiaoming Zhao, Alexander G. Schwing

Main category: cs.CV

TL;DR: The paper investigates classifier-free guidance in denoising diffusion models, traces its roots to classifier guidance, and proposes a postprocessing step to improve alignment with real data distributions.

DetailsMotivation: To understand the mechanisms of classifier-free guidance and improve conditional generation by addressing issues around decision boundaries.

Method: Empirical study tracing back to classifier guidance, identifying key assumptions, and proposing a flow-matching postprocessing step.

Result: Both classifier and classifier-free guidance work by avoiding decision boundaries. The proposed postprocessing step effectively reduces the gap between learned and real data distributions.

Conclusion: The study provides insights into classifier-free guidance and offers a practical solution to enhance conditional generation in diffusion models.

Abstract: Classifier-free guidance has become a staple for conditional generation with denoising diffusion models. However, a comprehensive understanding of classifier-free guidance is still missing. In this work, we carry out an empirical study to provide a fresh perspective on classifier-free guidance. Concretely, instead of solely focusing on classifier-free guidance, we trace back to the root, i.e., classifier guidance, pinpoint the key assumption for the derivation, and conduct a systematic study to understand the role of the classifier. We find that both classifier guidance and classifier-free guidance achieve conditional generation by pushing the denoising diffusion trajectories away from decision boundaries, i.e., areas where conditional information is usually entangled and is hard to learn. Based on this classifier-centric understanding, we propose a generic postprocessing step built upon flow-matching to shrink the gap between the learned distribution for a pre-trained denoising diffusion model and the real data distribution, mainly around the decision boundaries. Experiments on various datasets verify the effectiveness of the proposed approach.
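
For reference, the classifier-free guidance update the paper analyzes is the standard extrapolation from the unconditional to the conditional noise prediction; under the paper's reading, the guidance scale w pushes the denoising trajectory away from decision boundaries where conditional information is entangled. The sketch below shows that standard update, not the paper's proposed postprocessing step.

```python
def classifier_free_guidance(eps_model, x_t, t, cond, w=7.5):
    """Standard CFG: eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    eps_uncond = eps_model(x_t, t, cond=None)  # unconditional prediction
    eps_cond = eps_model(x_t, t, cond=cond)    # conditional prediction
    return eps_uncond + w * (eps_cond - eps_uncond)
```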

[341] Beyond RGB: Adaptive Parallel Processing for RAW Object Detection

Shani Gamrian, Hila Barel, Feiran Li, Masakazu Yoshimura, Daisuke Iso

Main category: cs.CV

TL;DR: RAM replaces traditional ISP for RAW object detection, using parallel processing to enhance feature capture and improve performance.

DetailsMotivation: Traditional ISP pipelines may lose critical information for computer vision tasks like object detection.

Method: Introduces Raw Adaptation Module (RAM) with parallel ISP functions and dynamic fusion for task-specific optimization.

Result: Outperforms RGB-based methods and achieves state-of-the-art results on RAW datasets.

Conclusion: RAM leverages RAW data better, enabling superior object detection performance.

Abstract: Object detection models are typically applied to standard RGB images processed through Image Signal Processing (ISP) pipelines, which are designed to enhance sensor-captured RAW images for human vision. However, these ISP functions can lead to a loss of critical information that may be essential in optimizing for computer vision tasks, such as object detection. In this work, we introduce Raw Adaptation Module (RAM), a module designed to replace the traditional ISP, with parameters optimized specifically for RAW object detection. Inspired by the parallel processing mechanisms of the human visual system, RAM departs from existing learned ISP methods by applying multiple ISP functions in parallel rather than sequentially, allowing for a more comprehensive capture of image features. These processed representations are then fused in a specialized module, which dynamically integrates and optimizes the information for the target task. This novel approach not only leverages the full potential of RAW sensor data but also enables task-specific pre-processing, resulting in superior object detection performance. Our approach outperforms RGB-based methods and achieves state-of-the-art results across diverse RAW image datasets under varying lighting conditions and dynamic ranges.
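
The parallel-ISP idea can be sketched as a handful of learnable ISP-like branches applied to the RAW input and fused by a small convolution; the choice of branches (tone mapping, white balance, denoising) and all dimensions below are assumptions, not the RAM module's actual design.

```python
import torch
import torch.nn as nn

class ParallelISP(nn.Module):
    """Parallel learnable ISP-like branches fused for a downstream detector."""

    def __init__(self, ch=4, hidden=16):                 # ch=4 for RGGB RAW
        super().__init__()
        self.gamma = nn.Parameter(torch.tensor(0.45))    # learnable tone curve
        self.gains = nn.Parameter(torch.ones(ch))        # per-channel gains
        self.denoise = nn.Conv2d(ch, ch, 3, padding=1)   # learned smoothing
        self.fuse = nn.Conv2d(3 * ch, hidden, 1)         # fusion module

    def forward(self, raw):                              # raw: (B, ch, H, W)
        branches = [
            raw.clamp_min(1e-6) ** self.gamma,           # tone mapping
            raw * self.gains.view(1, -1, 1, 1),          # white balance
            self.denoise(raw),                           # denoising
        ]
        return self.fuse(torch.cat(branches, dim=1))     # fused features
```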

[342] Learning-based 3D Reconstruction in Autonomous Driving: A Comprehensive Survey

Liewen Liao, Weihao Yan, Ming Yang, Songan Zhang

Main category: cs.CV

TL;DR: A review of learning-based 3D reconstruction in autonomous driving, covering methods, benchmarks, and challenges to inspire future research.

DetailsMotivation: To advance autonomous driving by improving 3D reconstruction techniques for tasks like scene understanding and simulation.

Method: Systematic review and multi-perspective analysis of learning-based 3D reconstruction methods, categorized by subtasks.

Result: Comprehensive technical reference and identification of development trends and challenges.

Conclusion: The review aims to inspire future research in learning-based 3D reconstruction for autonomous driving.

Abstract: Learning-based 3D reconstruction has emerged as a transformative technique in autonomous driving, enabling precise modeling of both dynamic and static environments through advanced neural representations. Beyond data augmentation, 3D reconstruction inspires pioneering solutions for vital tasks in the field of autonomous driving, such as scene understanding and closed-loop simulation. We investigate the details of 3D reconstruction and conduct a multi-perspective, in-depth analysis of recent advancements. Specifically, we first provide a systematic introduction of preliminaries, including data modalities, benchmarks and technical preliminaries of learning-based 3D reconstruction, facilitating instant identification of suitable methods according to sensor suites. Then, we systematically review learning-based 3D reconstruction methods in autonomous driving, categorizing approaches by subtasks and conducting multi-dimensional analysis and summary to establish a comprehensive technical reference. The development trends and existing challenges are summarized in the context of learning-based 3D reconstruction in autonomous driving. We hope that our review will inspire future research.

[343] Cube: A Roblox View of 3D Intelligence

Foundation AI Team, Kiran Bhat, Nishchaie Khanna, Karun Channa, Tinghui Zhou, Yiheng Zhu, Xiaoxia Sun, Charles Shang, Anirudh Sudarshan, Maurice Chu, Daiqing Li, Kangle Deng, Jean-Philippe Fauconnier, Tijmen Verhulsdonck, Maneesh Agrawala, Kayvon Fatahalian, Alexander Weiss, Christian Reiser, Ravi Kiran Chirravuri, Ravali Kandur, Alejandro Pelaez, Akash Garg, Michael Palleschi, Jessica Wang, Skylar Litz, Leon Liu, Anying Li, David Harmon, Derek Liu, Liangjun Feng, Denis Goupil, Lukas Kuczynski, Jihyun Yoon, Naveen Marri, Peiye Zhuang, Yinan Zhang, Brian Yin, Haomiao Jiang, Marcel van Workum, Thomas Lane, Bryce Erickson, Salil Pathare, Kyle Price, Steve Han, Yiqing Wang, Anupam Singh, David Baszucki

Main category: cs.CV

TL;DR: Roblox aims to build a 3D foundation model for generating and reasoning about 3D content, starting with a 3D shape tokenizer for applications like text-to-shape generation.

DetailsMotivation: To support developers in creating Roblox experiences by automating 3D content generation and reasoning.

Method: Develop a 3D shape tokenizer as a foundational step, enabling applications like text-to-shape and shape-to-text generation, and integrating with LLMs for scene analysis.

Result: Demonstrated applications of the tokenization scheme in generating and analyzing 3D content, collaborating with LLMs.

Conclusion: Outlined a path toward a unified 3D foundation model, emphasizing the potential of integrating 3D intelligence with existing AI capabilities.

Abstract: Foundation models trained on vast amounts of data have demonstrated remarkable reasoning and generation capabilities in the domains of text, images, audio and video. Our goal at Roblox is to build such a foundation model for 3D intelligence, a model that can support developers in producing all aspects of a Roblox experience, from generating 3D objects and scenes to rigging characters for animation to producing programmatic scripts describing object behaviors. We discuss three key design requirements for such a 3D foundation model and then present our first step towards building such a model. We expect that 3D geometric shapes will be a core data type and describe our solution for a 3D shape tokenizer. We show how our tokenization scheme can be used in applications for text-to-shape generation, shape-to-text generation and text-to-scene generation. We demonstrate how these applications can collaborate with existing large language models (LLMs) to perform scene analysis and reasoning. We conclude with a discussion outlining our path to building a fully unified foundation model for 3D intelligence.

[344] TruthLens: Explainable DeepFake Detection for Face Manipulated and Fully Synthetic Data

Rohit Kundu, Shan Jia, Vishal Mohanty, Athula Balachandran, Amit K. Roy-Chowdhury

Main category: cs.CV

TL;DR: TruthLens is a novel framework for DeepFake detection that provides detailed textual reasoning alongside binary classification, outperforming existing methods in accuracy and explainability.

DetailsMotivation: The rise of AI-generated content necessitates better DeepFake detection tools that go beyond binary classification and offer interpretability.

Method: TruthLens combines multimodal large language models (PaliGemma2) for global context and vision-only models (DINOv2) for localized feature extraction, creating a hybrid system.

Result: TruthLens achieves 2-14% higher detection accuracy than state-of-the-art methods and excels in explainability across diverse datasets.

Conclusion: TruthLens is a robust and interpretable solution for detecting DeepFakes, generalizing well across various manipulation techniques.

Abstract: Detecting DeepFakes has become a crucial research area as the widespread use of AI image generators enables the effortless creation of face-manipulated and fully synthetic content, yet existing methods are often limited to binary classification (real vs. fake) and lack interpretability. To address these challenges, we propose TruthLens, a novel and highly generalizable framework for DeepFake detection that not only determines whether an image is real or fake but also provides detailed textual reasoning for its predictions. Unlike traditional methods, TruthLens effectively handles both face-manipulated DeepFakes and fully AI-generated content while addressing fine-grained queries such as “Do the eyes/nose/mouth look real or fake?” The architecture of TruthLens combines the global contextual understanding of multimodal large language models like PaliGemma2 with the localized feature extraction capabilities of vision-only models like DINOv2. This hybrid design leverages the complementary strengths of both models, enabling robust detection of subtle manipulations while maintaining interpretability. Extensive experiments on diverse datasets demonstrate that TruthLens outperforms state-of-the-art methods in detection accuracy (by 2-14%) and explainability, in both in-domain and cross-data settings, generalizing effectively across traditional and emerging manipulation techniques.

[345] Joint Self-Supervised Video Alignment and Action Segmentation

Ali Shah Ali, Syed Ahmed Mahmood, Mubin Saeed, Andrey Konin, M. Zeeshan Zia, Quoc-Huy Tran

Main category: cs.CV

TL;DR: A novel unified optimal transport framework for self-supervised video alignment and action segmentation, achieving state-of-the-art performance with efficiency.

DetailsMotivation: To address the limitations of separate models for video alignment and action segmentation by proposing a unified, efficient, and high-performing approach.

Method: Develops a fused Gromov-Wasserstein optimal transport formulation with a structural prior for video alignment, then extends it to a unified framework for joint tasks.

Result: Achieves state-of-the-art video alignment and superior action segmentation, with reduced time and memory usage compared to separate models.

Conclusion: The first unified model for video alignment and action segmentation, demonstrating efficiency and high performance.

Abstract: We introduce a novel approach for simultaneous self-supervised video alignment and action segmentation based on a unified optimal transport framework. In particular, we first tackle self-supervised video alignment by developing a fused Gromov-Wasserstein optimal transport formulation with a structural prior, which trains efficiently on GPUs and needs only a few iterations for solving the optimal transport problem. Our single-task method achieves the state-of-the-art performance on multiple video alignment benchmarks and outperforms VAVA, which relies on a traditional Kantorovich optimal transport formulation with an optimality prior. Furthermore, we extend our approach by proposing a unified optimal transport framework for joint self-supervised video alignment and action segmentation, which requires training and storing a single model and reduces both time and memory consumption as compared to two different single-task models. Extensive evaluations on several video alignment and action segmentation datasets demonstrate that our multi-task method achieves video alignment results comparable to previous alignment methods and action segmentation results superior to previous segmentation methods. Finally, to the best of our knowledge, this is the first work to unify video alignment and action segmentation into a single model.
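
For readers unfamiliar with fused Gromov-Wasserstein transport, the sketch below sets up the vanilla FGW problem between two toy frame sequences with the POT library; the paper's formulation additionally uses a structural prior and a GPU-efficient solver, neither of which is reproduced here.

```python
import numpy as np
import ot  # POT: pip install pot

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(20, 16)), rng.normal(size=(25, 16))  # frame embeddings

# Feature cost between frames plus intra-video structure matrices
# (temporal distances), the two ingredients of a fused GW problem.
M = ot.dist(X, Y)                                    # (20, 25) feature cost
C1 = ot.dist(np.arange(20, dtype=float)[:, None])    # temporal structure of X
C2 = ot.dist(np.arange(25, dtype=float)[:, None])    # temporal structure of Y
p, q = ot.unif(20), ot.unif(25)                      # uniform frame weights

# alpha trades off feature cost (Wasserstein) vs. structure cost (GW).
T = ot.gromov.fused_gromov_wasserstein(M, C1, C2, p, q,
                                       loss_fun='square_loss', alpha=0.5)
alignment = T.argmax(axis=1)     # soft plan -> hard frame-to-frame alignment
print(alignment)
```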

[346] G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation

Juntao Jian, Xiuping Liu, Zixuan Chen, Manyi Li, Jian Liu, Ruizhen Hu

Main category: cs.CV

TL;DR: G-DexGrasp is a retrieval-augmented method for generating dexterous grasps for unseen objects and tasks, using fine-grained contact and affordance priors for generalization.

DetailsMotivation: Generalizing dexterous grasping to unseen object categories and diverse task instructions remains challenging.

Method: Retrieves generalizable grasping priors (fine-grained contact and affordance) to guide a generative model, with refinement optimization for plausibility.

Result: Outperforms existing approaches in generalization and grasp quality.

Conclusion: G-DexGrasp effectively generalizes to unseen objects and tasks, validated by experiments.

Abstract: Recent advances in dexterous grasping synthesis have demonstrated significant progress in producing reasonable and plausible grasps for many task purposes. However, it remains challenging to generalize to unseen object categories and diverse task instructions. In this paper, we propose G-DexGrasp, a retrieval-augmented generation approach that can produce high-quality dexterous hand configurations for unseen object categories and language-based task instructions. The key is to retrieve generalizable grasping priors, including the fine-grained contact part and the affordance-related distribution of relevant grasping instances, for the following synthesis pipeline. Specifically, the fine-grained contact part and affordance act as generalizable guidance to infer reasonable grasping configurations for unseen objects with a generative model, while the relevant grasping distribution serves as regularization to guarantee the plausibility of synthesized grasps during the subsequent refinement optimization. Our comparison experiments validate the effectiveness of our key designs for generalization and demonstrate remarkable performance against existing approaches. Project page: https://g-dexgrasp.github.io/

[347] Dance Like a Chicken: Low-Rank Stylization for Human Motion Diffusion

Haim Sawdayee, Chuan Guo, Guy Tevet, Bing Zhou, Jian Wang, Amit H. Bermano

Main category: cs.CV

TL;DR: LoRA-MDM is a lightweight framework for motion stylization that adapts generative priors to include styles while preserving distribution, enabling realistic style infusion and advanced operations like style blending.

DetailsMotivation: Existing text-to-motion models struggle with nuanced stylistic attributes due to scarce style-specific data, leading to low-quality generations.

Method: LoRA-MDM adapts the generative prior to include reference styles using few samples, shifting the motion manifold semantically without altering individual motions.

Result: The framework achieves a favorable balance between text fidelity and style consistency compared to state-of-the-art methods.

Conclusion: LoRA-MDM effectively generalizes to complex actions while maintaining editability and enabling advanced operations like style blending.

Abstract: Text-to-motion generative models span a wide range of 3D human actions but struggle with nuanced stylistic attributes such as a “Chicken” style. Due to the scarcity of style-specific data, existing approaches pull the generative prior towards a reference style, which often results in out-of-distribution, low-quality generations. In this work, we introduce LoRA-MDM, a lightweight framework for motion stylization that generalizes to complex actions while maintaining editability. Our key insight is that adapting the generative prior to include the style, while preserving its overall distribution, is more effective than modifying each individual motion during generation. Building on this idea, LoRA-MDM learns to adapt the prior to include the reference style using only a few samples. The style can then be used in the context of different textual prompts for generation. The low-rank adaptation shifts the motion manifold in a semantically meaningful way, enabling realistic style infusion even for actions not present in the reference samples. Moreover, preserving the distribution structure enables advanced operations such as style blending and motion editing. We compare LoRA-MDM to state-of-the-art stylized motion generation methods and demonstrate a favorable balance between text fidelity and style consistency.
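
LoRA-MDM builds on low-rank adaptation of a frozen prior; a generic LoRA layer, with rank, scaling, and initialization following common LoRA practice rather than the paper's configuration, looks like this:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update.
    Training only A and B on a few style samples nudges the prior toward the
    reference style without overwriting its overall distribution."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # keep the prior frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), rank=4)
out = layer(torch.randn(2, 16, 512))              # e.g. a motion-token sequence
```

Because only A and B are updated, the frozen prior's distribution is preserved, which is the property the paper exploits for style blending and motion editing.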

[348] HORT: Monocular Hand-held Objects Reconstruction with Transformers

Zerui Chen, Rolandos Alexandros Potamias, Shizhe Chen, Cordelia Schmid

Main category: cs.CV

TL;DR: A transformer-based model efficiently reconstructs dense 3D point clouds of hand-held objects from monocular images, outperforming existing methods in accuracy and speed.

DetailsMotivation: Existing methods for 3D reconstruction of hand-held objects from monocular images are slow and produce overly smooth or inefficient results.

Method: The proposed model uses a coarse-to-fine strategy, generating a sparse point cloud first and refining it into a dense representation with pixel-aligned image features. It integrates image features with 3D hand geometry for joint prediction of the object point cloud and its pose.

Result: The method achieves state-of-the-art accuracy and faster inference speed, generalizing well to real-world images.

Conclusion: The transformer-based approach effectively addresses the limitations of prior methods, offering efficient and accurate 3D reconstruction of hand-held objects.

Abstract: Reconstructing hand-held objects in 3D from monocular images remains a significant challenge in computer vision. Most existing approaches rely on implicit 3D representations, which produce overly smooth reconstructions and are time-consuming when generating explicit 3D shapes. While more recent methods directly reconstruct point clouds with diffusion models, the multi-step denoising makes high-resolution reconstruction inefficient. To address these limitations, we propose a transformer-based model to efficiently reconstruct dense 3D point clouds of hand-held objects. Our method follows a coarse-to-fine strategy, first generating a sparse point cloud from the image and progressively refining it into a dense representation using pixel-aligned image features. To enhance reconstruction accuracy, we integrate image features with 3D hand geometry to jointly predict the object point cloud and its pose relative to the hand. Our model is trained end-to-end for optimal performance. Experimental results on both synthetic and real datasets demonstrate that our method achieves state-of-the-art accuracy with much faster inference speed, while generalizing well to in-the-wild images.

[349] MARS: a Multimodal Alignment and Ranking System for Few-Shot Segmentation

Nico Catalano, Stefano Samele, Paolo Pertino, Matteo Matteucci

Main category: cs.CV

TL;DR: MARS is a plug-and-play ranking system for Few Shot Segmentation, using multimodal cues to improve mask proposals, achieving state-of-the-art results.

DetailsMotivation: Current methods lack robust selection beyond visual similarity, leading to suboptimal predictions.

Method: MARS scores, filters, and merges mask proposals using multimodal cues at local and global levels.

Result: Achieves new state-of-the-art results on COCO-20i, Pascal-5i, LVIS-92i, and FSS-1000 benchmarks.

Conclusion: MARS enhances Few Shot Segmentation by robustly ranking proposals and is easily integrated with existing methods.

Abstract: Few Shot Segmentation aims to segment novel object classes given only a handful of labeled examples, enabling rapid adaptation with minimal supervision. Current literature crucially lacks a selection method that goes beyond visual similarity between the query and example images, leading to suboptimal predictions. We present MARS, a plug-and-play ranking system that leverages multimodal cues to filter and merge mask proposals robustly. Starting from a set of mask predictions for a single query image, we score, filter, and merge them to improve results. Proposals are evaluated using multimodal scores computed at local and global levels. Extensive experiments on COCO-20i, Pascal-5i, LVIS-92i, and FSS-1000 demonstrate that integrating all four scoring components is crucial for robust ranking, validating our contribution. As MARS can be effortlessly integrated with various mask proposal systems, we deploy it across a wide range of top-performing methods and achieve new state-of-the-art results on multiple existing benchmarks. Code will be available upon acceptance.
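
The abstract does not spell out MARS's scoring or merging rules, so the following is only a generic score-filter-merge skeleton of the kind MARS instantiates; the mean fusion of score components and both thresholds are placeholder assumptions.

```python
import numpy as np

def rank_and_merge(masks, scores, keep_thresh=0.5, merge_iou=0.7):
    """Generic MARS-style post-processing: fuse per-proposal scores, drop weak
    proposals, then merge strongly overlapping survivors in ranked order."""
    combined = scores.mean(axis=1)                # fuse the score components
    keep = combined >= keep_thresh                # filtering step
    masks, combined = masks[keep], combined[keep]

    merged = []
    for i in np.argsort(-combined):               # ranking step
        for j, m in enumerate(merged):
            inter = np.logical_and(masks[i], m).sum()
            union = np.logical_or(masks[i], m).sum()
            if union and inter / union >= merge_iou:
                merged[j] = np.logical_or(masks[i], m)   # merging step
                break
        else:
            merged.append(masks[i])
    return merged

masks = np.random.rand(6, 32, 32) > 0.5           # toy binary mask proposals
scores = np.random.rand(6, 4)                     # 4 multimodal score components
final = rank_and_merge(masks, scores)
```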

[350] Contour Flow Constraint: Preserving Global Shape Similarity for Deep Learning based Image Segmentation

Shengzhe Chen, Zhaoxuan Dong, Jun Liu

Main category: cs.CV

TL;DR: The paper introduces a global shape similarity concept based on Contour Flow (CF) and integrates it into deep learning for improved image segmentation via shape loss and CFSSnet.

DetailsMotivation: Existing methods lack general global shape similarity consideration and integration into deep networks.

Method: Proposes a contour flow constraint, shape loss for training, and CFSSnet via variational model unrolling.

Result: Improves segmentation accuracy and shape similarity, robust against noise.

Conclusion: The approach is adaptable and effective for preserving global shape similarity in segmentation.

Abstract: For effective image segmentation, it is crucial to employ constraints informed by prior knowledge about the characteristics of the areas to be segmented to yield favorable segmentation outcomes. However, existing methods have primarily focused on priors of specific properties or shapes, lacking consideration of general global shape similarity from a Contour Flow (CF) perspective. Furthermore, naturally integrating such a contour flow prior into the activation functions of deep convolutional networks through mathematical methods has remained unexplored. In this paper, we establish a concept of global shape similarity based on the premise that two shapes exhibit comparable contours. We then mathematically derive a contour flow constraint that ensures the preservation of global shape similarity, and propose two implementations to integrate the constraint with deep neural networks. First, the constraint is converted into a shape loss, which can be seamlessly incorporated into the training phase of any learning-based segmentation framework. Second, we add the constraint to a variational segmentation model, derive its iterative solution scheme, and unroll the scheme to obtain the architecture of the proposed CFSSnet. Validation experiments are conducted on diverse datasets with classic benchmark deep segmentation models. The results indicate a great improvement in segmentation accuracy and shape similarity for the proposed shape loss, showcasing the general adaptability of the loss term regardless of specific network architectures. CFSSnet shows robustness in segmenting noise-contaminated images and an inherent capability to preserve global shape similarity.

[351] ChartQA-X: Generating Explanations for Visual Chart Reasoning

Shamanthak Hegde, Pooyan Fazli, Hasti Seifi

Main category: cs.CV

TL;DR: ChartQA-X is a dataset for generating explanations alongside chart-based Q&A, showing improved model performance and human preference over human-written explanations.

DetailsMotivation: To enhance data-driven decision-making by providing detailed explanations for chart-based questions, improving comprehension and trust in responses.

Method: Developed ChartQA-X dataset with 30,299 chart samples, paired with questions, answers, and explanations, evaluated using metrics like faithfulness and coherence.

Result: Model-generated explanations outperformed human-written ones in accuracy and logic, with significant improvements in QA accuracy and explanation quality.

Conclusion: Integrating explanations with answers improves communication of visual data, enhancing comprehension and trust in AI-generated responses.

Abstract: The ability to explain complex information from chart images is vital for effective data-driven decision-making. In this work, we address the challenge of generating detailed explanations alongside answering questions about charts. We present ChartQA-X, a comprehensive dataset comprising 30,299 chart samples across four chart types, each paired with contextually relevant questions, answers, and explanations. Explanations are generated and selected based on metrics such as faithfulness, informativeness, coherence, and perplexity. Our human evaluation with 245 participants shows that model-generated explanations in ChartQA-X surpass human-written explanations in accuracy and logic and are comparable in terms of clarity and overall quality. Moreover, models fine-tuned on ChartQA-X show substantial improvements across various metrics, including absolute gains of up to 24.57 points in explanation quality, 18.96 percentage points in question-answering accuracy, and 14.75 percentage points on unseen benchmarks for the same task. By integrating explanatory narratives with answers, our approach enables agents to communicate complex visual information more effectively, improving comprehension and fostering greater trust in the generated responses.

[352] AnyTSR: Any-Scale Thermal Super-Resolution for UAV

Mengyuan Li, Changhong Fu, Ziyu Lu, Zijie Zhang, Haobo Zuo, Liangliang Yao

Main category: cs.CV

TL;DR: Proposes AnyTSR, a novel any-scale thermal super-resolution method for UAVs, improving resolution and flexibility with a single model.

DetailsMotivation: Thermal imaging in UAVs suffers from low resolution and blurred boundaries; existing SR methods are inflexible and computationally expensive.

Method: Introduces an image encoder for precise feature coding and an any-scale upsampler with coordinate offset embedding to enhance spatial understanding and reduce artifacts. Uses a new dataset (UAV-TSR) for training.

Result: Outperforms state-of-the-art methods across all scaling factors, generating more accurate and detailed high-resolution images.

Conclusion: AnyTSR provides a flexible and efficient solution for thermal SR in UAV applications, validated by superior performance and a new dataset.

Abstract: Thermal imaging can greatly enhance the application of intelligent unmanned aerial vehicles (UAV) in challenging environments. However, the inherent low resolution of thermal sensors leads to insufficient details and blurred boundaries. Super-resolution (SR) offers a promising solution to address this issue, while most existing SR methods are designed for fixed-scale SR. They are computationally expensive and inflexible in practical applications. To address the above issues, this work proposes a novel any-scale thermal SR method (AnyTSR) for UAV within a single model. Specifically, a new image encoder is proposed to explicitly assign specific feature codes to enable more accurate and flexible representation. Additionally, by effectively embedding coordinate offset information into the local feature ensemble, an innovative any-scale upsampler is proposed to better understand spatial relationships and reduce artifacts. Moreover, a novel dataset (UAV-TSR), covering both land and water scenes, is constructed for thermal SR tasks. Experimental results demonstrate that the proposed method consistently outperforms state-of-the-art methods across all scaling factors and generates more accurate and detailed high-resolution images. The code is located at https://github.com/vision4robotics/AnyTSR.
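
The any-scale mechanism is easiest to see in a LIIF-style continuous decoder: sample the latent feature under each target pixel and feed it, together with the coordinate offset to the nearest feature cell, into a small MLP. The sketch below shows only this general mechanism; AnyTSR's encoder, feature codes, and offset embedding differ in their details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnyScaleUpsampler(nn.Module):
    """LIIF-style continuous upsampler: an MLP decodes each output pixel from
    a sampled latent feature plus the coordinate offset to the nearest feature
    cell, so a single model serves every scale factor."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim + 2, 256), nn.ReLU(),
                                 nn.Linear(256, 1))    # 1-channel thermal output

    def forward(self, feats: torch.Tensor, out_hw: tuple) -> torch.Tensor:
        B, C, h, w = feats.shape
        H, W = out_hw
        ys, xs = torch.linspace(-1, 1, H), torch.linspace(-1, 1, W)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack([gx, gy], dim=-1).expand(B, H, W, 2)
        # Sample the latent feature under each target coordinate ...
        sampled = F.grid_sample(feats, grid, align_corners=False)
        # ... plus the offset from the nearest low-res cell center.
        cell = torch.stack([gx * w, gy * h], dim=-1)
        offset = (cell - cell.round()).expand(B, H, W, 2)
        inp = torch.cat([sampled.permute(0, 2, 3, 1), offset], dim=-1)
        return self.mlp(inp).permute(0, 3, 1, 2)       # (B, 1, H, W)

sr = AnyScaleUpsampler()(torch.randn(2, 64, 40, 30), (100, 75))  # x2.5 scale
```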

[353] ZS-VCOS: Zero-Shot Video Camouflaged Object Segmentation By Optical Flow and Open Vocabulary Object Detection

Wenqi Guo, Mohamed Shehata, Shan Du

Main category: cs.CV

TL;DR: The paper introduces a zero-shot approach for camouflaged object segmentation by integrating large pre-trained models (SAM-2 and Owl-v2) with temporal information, achieving significant performance improvements over existing methods.

DetailsMotivation: Camouflaged object segmentation is challenging due to high similarity between objects and backgrounds. Current zero-shot methods underperform, and the paper aims to address this gap without requiring training.

Method: The approach combines SAM-2 and Owl-v2 with temporal information in a modular pipeline, avoiding training.

Result: The method outperforms existing zero-shot and supervised methods, raising the F-measure from 0.296 to 0.628 on MoCA-Mask and improving success rates on MoCA-Filter.

Conclusion: The proposed zero-shot approach is highly effective, surpassing prior methods and highlighting inconsistencies in previous work.

Abstract: Camouflaged object segmentation presents unique challenges compared to traditional segmentation tasks, primarily due to the high similarity in patterns and colors between camouflaged objects and their backgrounds. Effective solutions to this problem have significant implications in critical areas such as pest control, defect detection, and lesion segmentation in medical imaging. Prior research has predominantly emphasized supervised or unsupervised pre-training methods, leaving zero-shot approaches significantly underdeveloped. Existing zero-shot techniques commonly utilize the Segment Anything Model (SAM) in automatic mode or rely on vision-language models to generate cues for segmentation; however, their performance remains unsatisfactory due to the similarity between the camouflaged object and the background. This work studies how to avoid training by integrating large pre-trained models like SAM-2 and Owl-v2 with temporal information into a modular pipeline. Evaluated on the MoCA-Mask dataset, our approach achieves outstanding performance improvements, significantly outperforming existing zero-shot methods by raising the F-measure ($F_\beta^w$) from 0.296 to 0.628. Our approach also surpasses supervised methods, increasing the F-measure from 0.476 to 0.628. Additionally, evaluation on the MoCA-Filter dataset demonstrates an increase in the success rate from 0.628 to 0.697 when compared with FlowSAM, a supervised transfer method. A thorough ablation study further validates the individual contributions of each component. Besides our main contributions, we also highlight inconsistencies in previous work regarding metrics and settings. Code can be found at https://github.com/weathon/vcos.

[354] DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization

Wenchuan Wang, Mengqi Huang, Yijing Tu, Zhendong Mao

Main category: cs.CV

TL;DR: DualReal introduces adaptive joint training to address identity-motion conflicts in text-to-video generation, outperforming existing methods.

DetailsMotivation: Existing works ignore mutual constraints between identity and motion, leading to conflicts and degraded performance.

Method: DualReal uses Dual-aware Adaptation and StageBlender Controller for joint training, ensuring lossless fusion of identity and motion.

Result: DualReal improves CLIP-I and DINO-I metrics by 21.7% and 31.8%, achieving top motion performance.

Conclusion: DualReal effectively resolves identity-motion conflicts, enhancing text-to-video generation quality.

Abstract: Customized text-to-video generation with pre-trained large-scale models has recently garnered significant attention by focusing on identity and motion consistency. Existing works typically follow the isolated customized paradigm, where the subject identity or motion dynamics are customized exclusively. However, this paradigm completely ignores the intrinsic mutual constraints and synergistic interdependencies between identity and motion, resulting in identity-motion conflicts throughout the generation process that systematically degrade generation quality. To address this, we introduce DualReal, a novel framework that employs adaptive joint training to construct interdependencies between dimensions collaboratively. Specifically, DualReal is composed of two units: (1) Dual-aware Adaptation dynamically switches the training step (i.e., identity or motion), learns the current information guided by the frozen dimension prior, and employs a regularization strategy to avoid knowledge leakage; (2) StageBlender Controller leverages the denoising stages and Diffusion Transformer depths to guide different dimensions with adaptive granularity, avoiding conflicts at various stages and ultimately achieving lossless fusion of identity and motion patterns. We constructed a more comprehensive evaluation benchmark than existing methods. The experimental results show that DualReal improves CLIP-I and DINO-I metrics by 21.7% and 31.8% on average, and achieves top performance on nearly all motion metrics. Page: https://wenc-k.github.io/dualreal-customization

[355] Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI

Benjamin Raphael Ernhofer, Daniil Prokhorov, Jannica Langner, Dominik Bollmann

Main category: cs.CV

TL;DR: A vision-language framework for automotive infotainment UI understanding, supported by a new dataset and synthetic data pipeline, achieves strong performance and cross-domain generalization.

DetailsMotivation: Address the need for adaptive solutions in automotive infotainment systems due to frequent UI updates and diverse designs.

Method: Fine-tune a Molmo-7B model using LoRa, synthetic data, and visual grounding, creating the ELAM model.

Result: ELAM achieves 80.8% accuracy on ScreenSpot, outperforming baselines and matching specialized models.

Conclusion: Cost-efficient AI-driven progress in automotive UI understanding is achievable through data collection and fine-tuning.

Abstract: Modern automotive infotainment systems require intelligent and adaptive solutions to handle frequent User Interface (UI) updates and diverse design variations. We introduce a vision-language framework for understanding and interacting with automotive infotainment systems, enabling seamless adaptation across different UI designs. To further support research in this field, we release AutomotiveUI-Bench-4K, an open-source dataset of 998 images with 4,208 annotations. Additionally, we present a synthetic data pipeline to generate training data. We fine-tune a Molmo-7B-based model using Low-Rank Adaptation (LoRA), incorporating reasoning generated by our pipeline along with visual grounding and evaluation capabilities. The fine-tuned Evaluative Large Action Model (ELAM) achieves strong performance on AutomotiveUI-Bench-4K (model and dataset are available on Hugging Face) and demonstrates strong cross-domain generalization, including a +5.6% improvement on ScreenSpot over the baseline model. Notably, our approach achieves 80.8% average accuracy on ScreenSpot, closely matching or even surpassing specialized models for desktop, mobile, and web, such as ShowUI, despite being trained for the infotainment domain. This research investigates how data collection and subsequent fine-tuning can lead to AI-driven progress within automotive UI understanding and interaction. The applied method is cost-efficient and fine-tuned models can be deployed on consumer-grade GPUs.

[356] Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution

Junyi Yuan, Jian Zhang, Fangyu Wu, Dongming Lu, Huanda Lu, Qiufeng Wang

Main category: cs.CV

TL;DR: The paper introduces CulTi, a multimodal dataset for Chinese cultural heritage, and LACLIP, a training-free local alignment method for cross-modal retrieval, outperforming existing models.

DetailsMotivation: The lack of specialized datasets for Chinese cultural heritage hinders cross-modal learning. CulTi and LACLIP aim to bridge this gap.

Method: Proposes CulTi dataset (5,726 image-text pairs) and LACLIP, a local alignment strategy based on Chinese-CLIP for fine-grained retrieval.

Result: LACLIP significantly improves cross-modal retrieval on CulTi, especially for fine-grained semantic associations.

Conclusion: CulTi and LACLIP advance cross-modal retrieval in Chinese cultural heritage, addressing local alignment challenges.

Abstract: China has a long and rich history, encompassing a vast cultural heritage that includes diverse multimodal information, such as silk patterns, Dunhuang murals, and their associated historical narratives. Cross-modal retrieval plays a pivotal role in understanding and interpreting Chinese cultural heritage by bridging visual and textual modalities to enable accurate text-to-image and image-to-text retrieval. However, despite the growing interest in multimodal research, there is a lack of specialized datasets dedicated to Chinese cultural heritage, limiting the development and evaluation of cross-modal learning models in this domain. To address this gap, we propose a multimodal dataset named CulTi, which contains 5,726 image-text pairs extracted from two series of professional documents, respectively related to ancient Chinese silk and Dunhuang murals. Compared to existing general-domain multimodal datasets, CulTi presents a challenge for cross-modal retrieval: the difficulty of local alignment between intricate decorative motifs and specialized textual descriptions. To address this challenge, we propose LACLIP, a training-free local alignment strategy built upon a fine-tuned Chinese-CLIP. LACLIP enhances the alignment of global textual descriptions with local visual regions by computing weighted similarity scores during inference. Experimental results on CulTi demonstrate that LACLIP significantly outperforms existing models in cross-modal retrieval, particularly in handling fine-grained semantic associations within Chinese cultural heritage.
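
As a rough illustration of training-free local alignment, a weighted similarity score of the kind LACLIP computes might look as follows; the softmax weighting and temperature are assumptions, since the abstract does not specify the exact rule.

```python
import torch

def laclip_style_score(text_emb, region_embs, tau=0.07):
    """Training-free local alignment sketch: score an image as a weighted sum
    of similarities between the global text embedding and local region
    embeddings, with weights from a softmax over those same similarities."""
    text_emb = text_emb / text_emb.norm()
    region_embs = region_embs / region_embs.norm(dim=-1, keepdim=True)
    sims = region_embs @ text_emb                  # (n_regions,) cosine sims
    weights = torch.softmax(sims / tau, dim=0)     # emphasize matching regions
    return (weights * sims).sum()                  # weighted similarity score

text = torch.randn(512)                  # e.g. a Chinese-CLIP text embedding
regions = torch.randn(10, 512)           # embeddings of 10 local image regions
score = laclip_style_score(text, regions)
```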

[357] R-Genie: Reasoning-Guided Generative Image Editing

Dong Zhang, Lingfeng He, Rui Yan, Fei Shen, Jinhui Tang

Main category: cs.CV

TL;DR: The paper introduces R-Genie, a reasoning-guided generative image editor that combines diffusion models with multimodal large language models to handle complex, implicit user intentions in image editing.

DetailsMotivation: Current image editing methods are limited by explicit textual instructions and lack deep comprehension of implicit user intentions and contextual reasoning.

Method: The authors construct a dataset with 1,000+ image-instruction-edit triples and propose R-Genie, which integrates diffusion models with multimodal large language models using a reasoning-attention mechanism.

Result: Experiments show R-Genie enhances diffusion models with reasoning-based editing capabilities, enabling complex, intention-aware image synthesis.

Conclusion: R-Genie unlocks new potentials for intelligent image synthesis by bridging linguistic understanding and visual synthesis for intricate editing tasks.

Abstract: While recent advances in image editing have enabled impressive visual synthesis capabilities, current methods remain constrained by explicit textual instructions and limited editing operations, lacking deep comprehension of implicit user intentions and contextual reasoning. In this work, we introduce a new image editing paradigm: reasoning-guided generative editing, which synthesizes images based on complex, multi-faceted textual queries that call for world knowledge and intention inference. To facilitate this task, we first construct a comprehensive dataset featuring over 1,000 image-instruction-edit triples that incorporate rich reasoning contexts and real-world knowledge. We then propose R-Genie: a reasoning-guided generative image editor, which synergizes the generation power of diffusion models with advanced reasoning capabilities of multimodal large language models. R-Genie incorporates a reasoning-attention mechanism to bridge linguistic understanding with visual synthesis, enabling it to handle intricate editing requests involving abstract user intentions and contextual reasoning relations. Extensive experimental results validate that R-Genie can equip diffusion models with advanced reasoning-based editing capabilities, unlocking new potentials for intelligent image synthesis.

[358] VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection

Zeyi Huang, Yuyang Ji, Anirudh Sundara Rajan, Zefan Cai, Wen Xiao, Haohan Wang, Junjie Hu, Yong Jae Lee

Main category: cs.CV

TL;DR: VisTA is a reinforcement learning framework for visual agents to dynamically explore and combine tools, outperforming training-free baselines.

DetailsMotivation: Existing methods lack active tool exploration and assume limited tool diversity, requiring extensive human supervision. VisTA aims to address these limitations.

Method: VisTA uses end-to-end reinforcement learning with Group Relative Policy Optimization (GRPO) to refine tool-selection strategies autonomously.

Result: Experiments show VisTA outperforms baselines, especially on out-of-distribution tasks, enhancing generalization and tool utilization.

Conclusion: VisTA enables flexible, experience-driven visual reasoning by adaptively leveraging diverse tools.

Abstract: We introduce VisTA, a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance. Existing methods for tool-augmented reasoning either rely on training-free prompting or large-scale fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA leverages end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool selection strategies, using task outcomes as feedback signals. Through Group Relative Policy Optimization (GRPO), our framework enables an agent to autonomously discover effective tool-selection pathways without requiring explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, and BlindTest benchmarks demonstrate that VisTA achieves substantial performance gains over training-free baselines, especially on out-of-distribution examples. These results highlight VisTA’s ability to enhance generalization, adaptively utilize diverse tools, and pave the way for flexible, experience-driven visual reasoning systems.
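
GRPO's core trick is to standardize rewards within a group of rollouts for the same query, replacing a learned value baseline. The sketch below shows that step together with a bare policy-gradient loss, omitting the clipped importance ratio and KL penalty of the full GRPO objective.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO core step: advantages are rewards standardized within a group of
    rollouts for the same query, so no value network is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# One query, a group of 6 tool-selection rollouts scored by task outcome.
rewards = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])
adv = grpo_advantages(rewards)

# Bare policy-gradient loss over log-probs of the sampled tool choices
# (random placeholders here for the agent's actual outputs).
log_probs = torch.randn(6, requires_grad=True)
loss = -(adv * log_probs).mean()
loss.backward()
```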

[359] SemiOccam: A Robust Semi-Supervised Image Recognition Network Using Sparse Labels

Rui Yann, Tianshuo Zhang, Xianglei Xing

Main category: cs.CV

TL;DR: SemiOccam is an efficient semi-supervised image recognition network that outperforms existing methods with minimal labeled data and reduced training time.

DetailsMotivation: Existing methods are complex, resource-intensive, and struggle with limited labeled data. SemiOccam aims to improve efficiency and generalization.

Method: Uses a hierarchical mixture density classification mechanism optimizing mutual information between features and target classes, compressing redundancy while retaining discriminative components.

Result: Achieves state-of-the-art performance (95%+ accuracy on two datasets with only 4 labeled samples per class) and reduces training time to minutes.

Conclusion: SemiOccam is highly efficient and effective, and the paper also addresses a data leakage issue in STL-10, releasing a cleaned dataset for reproducibility.

Abstract: We present SemiOccam, an image recognition network that leverages semi-supervised learning in a highly efficient manner. Existing works often rely on complex training techniques and architectures, requiring hundreds of GPU hours for training, while their generalization ability with extremely limited labeled data remains to be improved. To address these limitations, we construct a hierarchical mixture density classification mechanism by optimizing mutual information between feature representations and target classes, compressing redundant information while retaining crucial discriminative components. Experimental results demonstrate that our method achieves state-of-the-art performance on three commonly used datasets, with accuracy exceeding 95% on two of them using only 4 labeled samples per class, and its simple architecture keeps training time at the minute level. Notably, this paper reveals a long-overlooked data leakage issue in the STL-10 dataset for semi-supervised learning and removes duplicates to ensure reliable experimental results. We release the deduplicated CleanSTL-10 dataset to facilitate fair and reproducible research. Code available at https://github.com/Shu1L0n9/SemiOccam.

[360] ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads

Yifan Li, Xin Li, Tianqin Li, Wenbin He, Yu Kong, Liu Ren

Main category: cs.CV

TL;DR: ViT-Split improves efficiency in vision foundation model (VFM) adaptation by splitting layers into extractor and adapter components, reducing training time and overfitting while maintaining performance.

DetailsMotivation: Existing VFM adapters are inefficient due to early gradient backpropagation, unnecessary tuning of all components, and underutilization of prior knowledge.

Method: ViT-Split divides VFM layers into extractor and adapter, introduces task and prior heads, and freezes the VFM backbone to optimize feature utilization.

Result: ViT-Split reduces training time by up to 4x and achieves comparable or better performance on tasks like segmentation and detection.

Conclusion: ViT-Split offers a more efficient and effective approach for adapting VFMs, addressing inefficiencies in existing methods.

Abstract: Vision foundation models (VFMs) have demonstrated remarkable performance across a wide range of downstream tasks. While several VFM adapters have shown promising results by leveraging the prior knowledge of VFMs, we identify two inefficiencies in these approaches. First, the interaction between convolutional neural network (CNN) and VFM backbone triggers early layer gradient backpropagation. Second, existing methods require tuning all components, adding complexity. Moreover, these adapters alter VFM features, underutilizing the prior knowledge. To tackle these challenges, we propose a new approach called ViT-Split, based on a key observation: the layers of several VFMs, like DINOv2, can be divided into two distinct components: an extractor for learning low-level features and an adapter for learning task-specific features. Leveraging this insight, we eliminate the CNN branch and introduce two heads, task head and prior head, to the frozen VFM. The task head is designed to learn task-specific features, mitigating the early gradient propagation issue. The prior head is used to leverage the multi-scale prior features from the frozen VFM, reducing tuning parameters and overfitting. Extensive experiments on various tasks (e.g., segmentation, detection, depth estimation, and visual question answering) validate the effectiveness and efficiency of ViT-Split. Specifically, ViT-Split reduces training time up to $4\times$ while achieving comparable or even better results on ADE20K, compared to other VFM adapters.
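
A skeletal version of the split-plus-two-heads idea follows, assuming a generic stack of transformer blocks stands in for the VFM; the split index, the head designs, and the use of a single-scale prior feature are simplifications of the paper's multi-scale design.

```python
import torch
import torch.nn as nn

class ViTSplitStyleModel(nn.Module):
    """Frozen backbone split into an 'extractor' (early blocks) and an
    'adapter' (late blocks); a trainable task head continues from the
    extractor while a prior head reuses the frozen deep features."""

    def __init__(self, blocks: nn.ModuleList, split: int, dim: int, n_cls: int):
        super().__init__()
        self.extractor = blocks[:split]           # frozen low-level blocks
        self.adapter = blocks[split:]             # frozen task-ish blocks
        for p in self.parameters():
            p.requires_grad = False               # only the heads train
        self.task_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                       nn.Linear(dim, n_cls))
        self.prior_head = nn.Linear(dim, n_cls)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                     # no early backprop into VFM
            feats = tokens
            for blk in self.extractor:
                feats = blk(feats)
            prior = feats
            for blk in self.adapter:
                prior = blk(prior)
        return self.task_head(feats) + self.prior_head(prior)

blocks = nn.ModuleList([nn.TransformerEncoderLayer(256, 4, batch_first=True)
                        for _ in range(8)])
model = ViTSplitStyleModel(blocks, split=6, dim=256, n_cls=21)
logits = model(torch.randn(2, 196, 256))          # patch tokens -> class logits
```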

[361] EgoM2P: Egocentric Multimodal Multitask Pretraining

Gen Li, Yutong Chen, Yiqian Wu, Kaifeng Zhao, Marc Pollefeys, Siyu Tang

Main category: cs.CV

TL;DR: The paper introduces EgoM2P, a masked modeling framework for egocentric 4D understanding, addressing challenges in multimodal data heterogeneity and scalability. It outperforms specialist models in tasks like gaze prediction and depth estimation.

DetailsMotivation: Egocentric vision applications require understanding multimodal signals, but data heterogeneity and missing modalities make supervised learning difficult. Existing models struggle with dynamic camera motion and temporal-spatial complexity.

Method: Proposes EgoM2P, using efficient temporal tokenizers and masked modeling to learn from multimodal tokens, enabling multitasking across perception and synthesis tasks.

Result: EgoM2P matches or outperforms specialist models in tasks like gaze prediction and depth estimation, while being significantly faster.

Conclusion: EgoM2P is a scalable, general-purpose solution for egocentric vision, with plans to open-source the framework to advance research.

Abstract: Understanding multimodal signals in egocentric vision, such as RGB video, depth, camera poses, and gaze, is essential for applications in augmented reality, robotics, and human-computer interaction, enabling systems to better interpret the camera wearer’s actions, intentions, and surrounding environment. However, building large-scale egocentric multimodal and multitask models presents unique challenges. Egocentric data are inherently heterogeneous, with large variations in modality coverage across devices and settings. Generating pseudo-labels for missing modalities, such as gaze or head-mounted camera trajectories, is often infeasible, making standard supervised learning approaches difficult to scale. Furthermore, dynamic camera motion and the complex temporal and spatial structure of first-person video pose additional challenges for the direct application of existing multimodal foundation models. To address these challenges, we introduce a set of efficient temporal tokenizers and propose EgoM2P, a masked modeling framework that learns from temporally-aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. This unified design supports multitasking across diverse egocentric perception and synthesis tasks, including gaze prediction, egocentric camera tracking, and monocular depth estimation from egocentric video, and also serves as a generative model for conditional egocentric video synthesis. Across these tasks, EgoM2P matches or outperforms specialist models while being an order of magnitude faster. We will fully open-source EgoM2P to support the community and advance egocentric vision research. Project page: https://egom2p.github.io/.
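
EgoM2P's tokenizers and architecture are considerably richer, but the masked-modeling training signal itself is compact. A minimal sketch with placeholder token streams follows; the vocabulary size, the modality count, and the 50% mask ratio are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim, n_modalities = 1024, 256, 4
embed = nn.Embedding(vocab, dim)                  # shared token embedding
mod_embed = nn.Embedding(n_modalities, dim)       # which modality a token is
mask_token = nn.Parameter(torch.zeros(dim))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, 8, batch_first=True), num_layers=2)
to_logits = nn.Linear(dim, vocab)

tokens = torch.randint(0, vocab, (2, 64))         # e.g. RGB/depth/gaze/pose ids
mods = torch.randint(0, n_modalities, (2, 64))    # modality id per token
x = embed(tokens) + mod_embed(mods)

mask = torch.rand(2, 64) < 0.5                    # hide half the tokens
x[mask] = mask_token                              # replace with a mask token
logits = to_logits(encoder(x))

loss = F.cross_entropy(logits[mask], tokens[mask])  # predict the hidden tokens
loss.backward()
```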

[362] Efficient Multi-Camera Tokenization with Triplanes for End-to-End Driving

Boris Ivanovic, Cristiano Saltori, Yurong You, Yan Wang, Wenjie Luo, Marco Pavone

Main category: cs.CV

TL;DR: Efficient triplane-based multi-camera tokenization for AV policies reduces tokens by 72%, speeds up inference by 50%, and maintains accuracy.

DetailsMotivation: Autoregressive Transformers need efficient sensor data tokenization for real-time feasibility on embedded hardware in AVs.

Method: Proposes a triplane-based tokenization strategy leveraging 3D neural reconstruction to handle multi-camera inputs geometrically.

Result: Achieves 72% fewer tokens, 50% faster inference, same planning accuracy, and improved offroad rates.

Conclusion: The method enhances efficiency and performance of AV policies using Transformers.

Abstract: Autoregressive Transformers are increasingly being deployed as end-to-end robot and autonomous vehicle (AV) policy architectures, owing to their scalability and potential to leverage internet-scale pretraining for generalization. Accordingly, tokenizing sensor data efficiently is paramount to ensuring the real-time feasibility of such architectures on embedded hardware. To this end, we present an efficient triplane-based multi-camera tokenization strategy that leverages recent advances in 3D neural reconstruction and rendering to produce sensor tokens that are agnostic to the number of input cameras and their resolution, while explicitly accounting for their geometry around an AV. Experiments on a large-scale AV dataset and state-of-the-art neural simulator demonstrate that our approach yields significant savings over current image patch-based tokenization strategies, producing up to 72% fewer tokens, resulting in up to 50% faster policy inference while achieving the same open-loop motion planning accuracy and improved offroad rates in closed-loop driving simulations.
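
To see why triplane tokenization decouples the token count from the number and resolution of cameras, consider this toy version: features lifted to 3D (random placeholders here for a real geometric unprojection) are splatted onto three axis-aligned planes around the vehicle, and each plane is patchified into a fixed number of tokens. All sizes and the splatting scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TriplaneTokenizer(nn.Module):
    """Toy triplane tokenization: lifted 3D features are splatted onto three
    axis-aligned planes around the vehicle, and each plane is patchified into
    tokens, so the token count is fixed regardless of camera setup."""

    def __init__(self, dim: int = 64, plane: int = 32, patch: int = 8):
        super().__init__()
        self.dim, self.plane = dim, plane
        self.patchify = nn.Conv2d(dim, dim, kernel_size=patch, stride=patch)

    def forward(self, feats: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        # feats: (N, dim) camera features lifted to 3D; xyz: (N, 3) in [0, 1).
        idx = (xyz * self.plane).long().clamp(0, self.plane - 1)
        tokens = []
        for a, b in [(0, 1), (0, 2), (1, 2)]:      # xy, xz, yz planes
            grid = torch.zeros(self.dim, self.plane, self.plane)
            flat = idx[:, a] * self.plane + idx[:, b]
            grid.view(self.dim, -1).index_add_(1, flat, feats.T)  # splat
            t = self.patchify(grid.unsqueeze(0))   # (1, dim, p, p)
            tokens.append(t.flatten(2).transpose(1, 2))
        return torch.cat(tokens, dim=1)            # (1, n_tokens, dim)

feats, xyz = torch.randn(5000, 64), torch.rand(5000, 3)  # e.g. from 6 cameras
tokens = TriplaneTokenizer()(feats, xyz)  # fixed-size token set for the policy
```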

[363] TopoStreamer: Temporal Lane Segment Topology Reasoning in Autonomous Driving

Yiming Yang, Yueru Luo, Bingkun He, Hongbin Lin, Suzhong Fu, Chao Zheng, Zhipeng Cao, Erlong Li, Chao Yan, Shuguang Cui, Zhen Li

Main category: cs.CV

TL;DR: TopoStreamer improves lane segment topology reasoning for autonomous driving with streaming attribute constraints, dynamic positional encoding, and denoising, outperforming existing methods by +3.0% mAP and +1.7% OLS.

DetailsMotivation: Existing methods lack consistent positional embedding and temporal attribute learning, hindering accurate road network reconstruction for autonomous driving.

Method: TopoStreamer introduces streaming attribute constraints, dynamic lane boundary positional encoding, and lane segment denoising to enhance temporal consistency and positional learning.

Result: TopoStreamer achieves +3.0% mAP in lane segment perception and +1.7% OLS in centerline perception on the OpenLane-V2 dataset.

Conclusion: TopoStreamer effectively addresses limitations in lane topology reasoning, improving performance for autonomous driving systems.

Abstract: Lane segment topology reasoning constructs a comprehensive road network by capturing the topological relationships between lane segments and their semantic types. This enables end-to-end autonomous driving systems to perform road-dependent maneuvers such as turning and lane changing. However, the limitations in consistent positional embedding and temporal multiple attribute learning in existing methods hinder accurate road network reconstruction. To address these issues, we propose TopoStreamer, an end-to-end temporal perception model for lane segment topology reasoning. Specifically, TopoStreamer introduces three key improvements: streaming attribute constraints, dynamic lane boundary positional encoding, and lane segment denoising. The streaming attribute constraints enforce temporal consistency in both centerline and boundary coordinates, along with their classifications. Meanwhile, dynamic lane boundary positional encoding enhances the learning of up-to-date positional information within queries, while lane segment denoising helps capture diverse lane segment patterns, ultimately improving model performance. Additionally, we assess the accuracy of existing models using a lane boundary classification metric, which serves as a crucial measure for lane-changing scenarios in autonomous driving. On the OpenLane-V2 dataset, TopoStreamer demonstrates significant improvements over state-of-the-art methods, achieving substantial performance gains of +3.0% mAP in lane segment perception and +1.7% OLS in centerline perception tasks.

[364] UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement

Xiao Zhang, Fei Wei, Yong Wang, Wenda Zhao, Feiyi Li, Xiangxiang Chu

Main category: cs.CV

TL;DR: The paper proposes UPRE, a framework for zero-shot domain adaptation (ZSDA) that optimizes prompts and visual representations to address misalignment between detection tasks and Vision-Language Models (VLMs).

DetailsMotivation: Existing methods for ZSDA rely on VLMs but overlook misalignment between detection tasks and VLMs due to manually crafted prompts.

Method: UPRE introduces multi-view domain prompts and visual representation enhancement, along with multi-level strategies like relative domain distance and positive-negative separation.

Result: Experiments on nine datasets show UPRE’s superior performance in ZSDA detection.

Conclusion: UPRE effectively addresses ZSDA challenges by jointly optimizing prompts and visual representations.

Abstract: Zero-shot domain adaptation (ZSDA) presents substantial challenges due to the lack of images in the target domain. Previous approaches leverage Vision-Language Models (VLMs) to tackle this challenge, exploiting their zero-shot learning capabilities. However, these methods primarily address domain distribution shifts and overlook the misalignment between the detection task and VLMs, which rely on manually crafted prompts. To overcome these limitations, we propose the unified prompt and representation enhancement (UPRE) framework, which jointly optimizes both textual prompts and visual representations. Specifically, our approach introduces a multi-view domain prompt that combines linguistic domain priors with detection-specific knowledge, and a visual representation enhancement module that produces domain style variations. Furthermore, we introduce multi-level enhancement strategies, including relative domain distance and positive-negative separation, which align multi-modal representations at the image level and capture diverse visual representations at the instance level, respectively. Extensive experiments conducted on nine benchmark datasets demonstrate the superior performance of our framework in ZSDA detection scenarios. Code is available at https://github.com/AMAP-ML/UPRE.

[365] PhenoBench: A Comprehensive Benchmark for Cell Phenotyping

Fabian H. Reith, Claudia Winklmayr, Jerome Luescher, Nora Koreuber, Jannik Franzen, Elias Baumann, Christian M. Schuerch, Dagmar Kainmueller, Josef Lorenz Rumberger

Main category: cs.CV

TL;DR: PhenoBench is a new benchmark for cell phenotyping in digital pathology, introducing PhenoCell, a dataset with 14 cell types, and evaluating foundational models (FMs) under various scenarios.

DetailsMotivation: Existing benchmarks for cell phenotyping in digital pathology lack unified evaluation, especially for dense cell phenotype predictions.

Method: PhenoBench provides PhenoCell dataset and benchmarking code to evaluate FMs on H&E images under technical and medical domain shifts.

Result: FMs perform poorly on PhenoCell (F1 scores as low as 0.20), highlighting its challenge compared to existing benchmarks like Lizard and PanNuke.

Conclusion: PhenoCell is a valuable resource for future benchmarking of FMs and supervised models, addressing gaps in current evaluation methods.

Abstract: Digital pathology has seen the advent of a wealth of foundational models (FM), yet to date their performance on cell phenotyping has not been benchmarked in a unified manner. We therefore propose PhenoBench: a comprehensive benchmark for cell phenotyping on Hematoxylin and Eosin (H&E) stained histopathology images. We provide both PhenoCell, a new H&E dataset featuring 14 granular cell types identified using multiplexed imaging, and ready-to-use fine-tuning and benchmarking code that allows the systematic evaluation of multiple prominent pathology FMs in terms of dense cell phenotype predictions in different generalization scenarios. We perform extensive benchmarking of existing FMs, providing insights into their generalization behavior under technical vs. medical domain shifts. Furthermore, while FMs achieve macro F1 scores > 0.70 on previously established benchmarks such as Lizard and PanNuke, on PhenoCell, we observe scores as low as 0.20. This indicates a much more challenging task not captured by previous benchmarks, establishing PhenoCell as a prime asset for future benchmarking of FMs and supervised models alike. Code and data are available on GitHub.

[366] VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis

Alexandre Symeonidis-Herzig, Özge Mercanoğlu Sincan, Richard Bowden

Main category: cs.CV

TL;DR: VisualSpeaker improves 3D facial animation using photorealistic rendering and a perceptual lip-reading loss, achieving 56.1% better Lip Vertex Error and higher perceptual quality.

DetailsMotivation: High-fidelity 3D facial animations are needed for expressive avatars, but mesh-based methods lag behind 2D innovations. VisualSpeaker bridges this gap.

Method: Uses photorealistic differentiable rendering supervised by visual speech recognition, with a perceptual lip-reading loss derived from 3D Gaussian Splatting and a pre-trained VASR model.

Result: 56.1% improvement in Lip Vertex Error and enhanced perceptual quality on the MEAD dataset, while maintaining mesh-driven controllability.

Conclusion: VisualSpeaker advances 3D facial animation by combining photorealism and perceptual accuracy, crucial for applications like sign language avatars.

Abstract: Realistic, high-fidelity 3D facial animations are crucial for expressive avatar systems in human-computer interaction and accessibility. Although prior methods show promising quality, their reliance on the mesh domain limits their ability to fully leverage the rapid visual innovations seen in 2D computer vision and graphics. We propose VisualSpeaker, a novel method that bridges this gap using photorealistic differentiable rendering, supervised by visual speech recognition, for improved 3D facial animation. Our contribution is a perceptual lip-reading loss, derived by passing photorealistic 3D Gaussian Splatting avatar renders through a pre-trained Visual Automatic Speech Recognition model during training. Evaluation on the MEAD dataset demonstrates that VisualSpeaker improves both the standard Lip Vertex Error metric by 56.1% and the perceptual quality of the generated animations, while retaining the controllability of mesh-driven animation. This perceptual focus naturally supports accurate mouthings, essential cues that disambiguate similar manual signs in sign language avatars.
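
Once the renderer is differentiable, the perceptual loss is conceptually simple: freeze a visual speech recognizer, run the renders through it, and match its outputs against those of the reference video. The feature-matching form and all shapes below are assumptions; the abstract does not give the exact loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def lip_reading_loss(renders, target_feats, vasr: nn.Module):
    """Perceptual lip-reading loss sketch: differentiable avatar renders are
    passed through a frozen visual speech recognizer and its features are
    matched against features of the ground-truth video."""
    for p in vasr.parameters():
        p.requires_grad = False        # the VASR supervises, it is not trained
    return F.mse_loss(vasr(renders), target_feats)

# Placeholder stand-in for a pre-trained VASR backbone.
vasr = nn.Sequential(nn.Flatten(start_dim=2), nn.Linear(3 * 64 * 64, 128))
renders = torch.randn(2, 10, 3, 64, 64, requires_grad=True)  # from the renderer
target = torch.randn(2, 10, 128)       # features of the reference video
loss = lip_reading_loss(renders, target, vasr)
loss.backward()                        # gradients flow back to the renderer
```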

[367] RegGS: Unposed Sparse Views Gaussian Splatting with 3DGS Registration

Chong Cheng, Yu Hu, Sicheng Yu, Beizhen Zhao, Zijian Wang, Hao Wang

Main category: cs.CV

TL;DR: RegGS is a 3D Gaussian registration framework for reconstructing unposed sparse views by aligning local Gaussians into a globally consistent representation using an entropy-regularized Sinkhorn algorithm and joint registration module.

DetailsMotivation: Existing 3DGS methods struggle with sparse views due to limited prior knowledge, and feed-forward approaches are constrained by input formats.

Method: RegGS uses an entropy-regularized Sinkhorn algorithm to compute the optimal-transport MW₂ distance, which aligns GMMs in Sim(3) space, and integrates it with photometric consistency and depth geometry in a joint registration module.

Result: Experiments on RE10K and ACID datasets show RegGS achieves high-fidelity Gaussian registration, precise pose estimation, and high-quality novel-view synthesis.

Conclusion: RegGS effectively addresses sparse-view reconstruction challenges, offering accurate alignment and synthesis.

Abstract: 3D Gaussian Splatting (3DGS) has demonstrated its potential in reconstructing scenes from unposed images. However, optimization-based 3DGS methods struggle with sparse views due to limited prior knowledge. Meanwhile, feed-forward Gaussian approaches are constrained by input formats, making it challenging to incorporate more input views. To address these challenges, we propose RegGS, a 3D Gaussian registration-based framework for reconstructing unposed sparse views. RegGS aligns local 3D Gaussians generated by a feed-forward network into a globally consistent 3D Gaussian representation. Technically, we implement an entropy-regularized Sinkhorn algorithm to efficiently solve the optimal transport Mixture 2-Wasserstein $(\text{MW}_2)$ distance, which serves as an alignment metric for Gaussian mixture models (GMMs) in $\mathrm{Sim}(3)$ space. Furthermore, we design a joint 3DGS registration module that integrates the $\text{MW}_2$ distance, photometric consistency, and depth geometry. This enables a coarse-to-fine registration process while accurately estimating camera poses and aligning the scene. Experiments on the RE10K and ACID datasets demonstrate that RegGS effectively registers local Gaussians with high fidelity, achieving precise pose estimation and high-quality novel-view synthesis. Project page: https://3dagentworld.github.io/reggs/.
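
The alignment metric here is concrete enough to illustrate. Below is a minimal NumPy sketch of the entropy-regularized Sinkhorn iteration over a Gaussian-to-Gaussian W₂ cost matrix, which is the core of the MW₂ computation; it assumes diagonal covariances (so the Bures term has a simple closed form) and leaves out the Sim(3) pose optimization and the joint registration module entirely.

```python
import numpy as np

def gaussian_w2_sq(mu1, var1, mu2, var2):
    """Squared 2-Wasserstein distance between diagonal-covariance Gaussians."""
    return np.sum((mu1 - mu2) ** 2) + np.sum((np.sqrt(var1) - np.sqrt(var2)) ** 2)

def sinkhorn_mw2(mus_a, vars_a, w_a, mus_b, vars_b, w_b, eps=0.5, iters=200):
    """Entropy-regularized OT between two GMMs; returns (plan, MW2^2 estimate)."""
    n, m = len(w_a), len(w_b)
    C = np.array([[gaussian_w2_sq(mus_a[i], vars_a[i], mus_b[j], vars_b[j])
                   for j in range(m)] for i in range(n)])
    K = np.exp(-C / eps)                      # Gibbs kernel
    u = np.ones(n)
    for _ in range(iters):                    # Sinkhorn fixed-point updates
        v = w_b / (K.T @ u)
        u = w_a / (K @ v)
    P = u[:, None] * K * v[None, :]           # transport plan between components
    return P, float(np.sum(P * C))

# Toy usage: two 3-component GMMs in 3D with uniform component weights.
rng = np.random.default_rng(0)
mus_a, mus_b = rng.normal(size=(3, 3)), rng.normal(size=(3, 3)) + 0.5
vars_a, vars_b = np.full((3, 3), 0.1), np.full((3, 3), 0.2)
w = np.full(3, 1 / 3)
P, cost = sinkhorn_mw2(mus_a, vars_a, w, mus_b, vars_b, w)
print(P.round(3), cost)
```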

[368] InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes

Zesong Yang, Bangbang Yang, Wenqi Dong, Chenxuan Cao, Liyuan Cui, Yuewen Ma, Zhaopeng Cui, Hujun Bao

Main category: cs.CV

TL;DR: InstaScene introduces a new paradigm for holistic 3D scene perception, focusing on decomposing instances and ensuring complete reconstruction using spatial contrastive learning and in-situ generation.

DetailsMotivation: Humans excel at recognizing occluded objects, but robotic systems struggle with this task. Current methods treat scenes as undifferentiated wholes, failing to reconstruct complete objects from partial observations.

Method: Develops spatial contrastive learning for precise decomposition and in-situ generation to overcome incompleteness, leveraging observations and geometric cues.

Result: Achieves superior decomposition accuracy and produces geometrically faithful, visually intact objects in complex real-world and synthetic scenes.

Conclusion: InstaScene effectively bridges the gap in 3D perception, enabling robots to decompose and reconstruct occluded objects accurately.

Abstract: Humans can naturally identify and mentally complete occluded objects in cluttered environments. However, imparting similar cognitive ability to robotics remains challenging even with advanced reconstruction techniques, which model scenes as undifferentiated wholes and fail to recognize complete objects from partial observations. In this paper, we propose InstaScene, a new paradigm towards holistic 3D perception of complex scenes with a primary goal: decomposing arbitrary instances while ensuring complete reconstruction. To achieve precise decomposition, we develop a novel spatial contrastive learning scheme that traces the rasterization of each instance across views, significantly enhancing semantic supervision in cluttered scenes. To overcome incompleteness from limited observations, we introduce in-situ generation that harnesses valuable observations and geometric cues, effectively guiding 3D generative models to reconstruct complete instances that seamlessly align with the real world. Experiments on scene decomposition and object completion across complex real-world and synthetic scenes demonstrate that our method achieves superior decomposition accuracy while producing geometrically faithful and visually intact objects.

[369] RoundaboutHD: High-Resolution Real-World Urban Environment Benchmark for Multi-Camera Vehicle Tracking

Yuqiang Lin, Sam Lockyer, Mingxuan Sui, Li Gan, Florian Stanek, Markus Zarbock, Wenbin Li, Adrian Evans, Nic Zhang

Main category: cs.CV

TL;DR: RoundaboutHD is a high-resolution multi-camera vehicle tracking dataset addressing limitations in existing datasets by providing diverse, real-world roundabout scenarios.

DetailsMotivation: Current datasets for multi-camera vehicle tracking lack realism and diversity, hindering practical applications. RoundaboutHD aims to bridge this gap.

Method: The dataset includes 40 minutes of 4K footage from four non-overlapping cameras, with 512 annotated vehicle identities and additional subsets for various tasks.

Result: RoundaboutHD offers rich cross-camera association data, temporal consistency, and enhanced challenges like occlusions and nonlinear movements.

Conclusion: RoundaboutHD is a valuable resource for advancing multi-camera vehicle tracking research, with publicly available data and baseline results.

Abstract: The multi-camera vehicle tracking (MCVT) framework holds significant potential for smart city applications, including anomaly detection, traffic density estimation, and suspect vehicle tracking. However, current publicly available datasets exhibit limitations, such as overly simplistic scenarios, low-resolution footage, and insufficiently diverse conditions, creating a considerable gap between academic research and real-world scenarios. To fill this gap, we introduce RoundaboutHD, a comprehensive, high-resolution multi-camera vehicle tracking benchmark dataset specifically designed to represent real-world roundabout scenarios. RoundaboutHD provides a total of 40 minutes of labelled video footage captured by four non-overlapping, high-resolution (4K resolution, 15 fps) cameras. In total, 512 unique vehicle identities are annotated across different camera views, offering rich cross-camera association data. RoundaboutHD offers temporally consistent video footage and enhanced challenges, including increased occlusions and nonlinear movement inside the roundabout. In addition to the full MCVT dataset, several subsets are also available for object detection, single-camera tracking, and image-based vehicle re-identification (ReID) tasks. Vehicle model information and camera modelling/geometry information are also included to support further analysis. We provide baseline results for vehicle detection, single-camera tracking, image-based vehicle re-identification, and multi-camera tracking. The dataset and the evaluation code are publicly available at: https://github.com/siri-rouser/RoundaboutHD.git

[370] Revisiting Pool-based Prompt Learning for Few-shot Class-incremental Learning

Yongwei Jiang, Yixiong Zou, Yuhua Li, Ruixuan Li

Main category: cs.CV

TL;DR: The paper explores prompt pool methods in Few-Shot Class-Incremental Learning (FSCIL), identifies token-dimension saturation as a cause of performance degradation, and proposes LGSP-Prompt to improve performance by shifting to spatial dimension prompting.

DetailsMotivation: To address the dual challenges of data scarcity and incremental learning in FSCIL, and to investigate why current prompt pool methods fail in this setting.

Method: Proposes LGSP-Prompt, which combines local spatial features and global frequency-domain representations to generate spatial prompts, avoiding token-dimension saturation.

Result: LGSP-Prompt achieves state-of-the-art performance on FSCIL benchmarks, preserving base knowledge and excelling in incremental learning.

Conclusion: The study highlights the limitations of token-dimension prompting in FSCIL and demonstrates the effectiveness of spatial dimension prompting with LGSP-Prompt.

Abstract: Few-Shot Class-Incremental Learning (FSCIL) faces dual challenges of data scarcity and incremental learning in real-world scenarios. While pool-based prompting methods have demonstrated success in traditional incremental learning, their effectiveness in FSCIL settings remains unexplored. This paper presents the first study of current prompt pool methods in FSCIL tasks, revealing an unanticipated performance degradation in incremental sessions. Through comprehensive analysis, we identify that this phenomenon stems from token-dimension saturation: with limited data, excessive prompts compete for task-relevant information, leading to model overfitting. Based on this finding, we propose LGSP-Prompt (Local-Global Spatial Prompting), which innovatively shifts pool-based prompt learning from the token dimension to the spatial dimension. LGSP-Prompt generates spatial prompts by synergistically combining local spatial features and global frequency-domain representations to highlight key patterns in input images. We construct two spatial prompt pools enabling dynamic prompt selection to maintain acquired knowledge while effectively learning novel sessions. Extensive experiments demonstrate that our approach achieves state-of-the-art performance across multiple FSCIL benchmarks, showing significant advantages in both base knowledge preservation and incremental learning. Our implementation is available at https://github.com/Jywsuperman/LGSP.
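
The abstract specifies the ingredients of a spatial prompt (local spatial features fused with a global frequency-domain representation) but not the exact fusion. The sketch below is one plausible reading, not the paper's implementation: average-pooled local cues modulated by an FFT log-magnitude map and added back onto the image; the pooling size, fusion rule, and scale alpha are all assumptions.

```python
import numpy as np

def local_map(img, k=8):
    """Local spatial cue: k x k average pooling, upsampled back (nearest).
    Assumes H and W are divisible by k."""
    H, W = img.shape
    pooled = img.reshape(H // k, k, W // k, k).mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, k, axis=0), k, axis=1)

def global_freq_map(img):
    """Global cue: log-magnitude spectrum of the 2D FFT, rescaled to [0, 1]."""
    mag = np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(img))))
    return (mag - mag.min()) / (mag.max() - mag.min() + 1e-8)

def spatial_prompt(img, alpha=0.1):
    """Fuse local and global cues into an additive spatial prompt (assumed fusion)."""
    prompt = local_map(img) * global_freq_map(img)
    return img + alpha * prompt

img = np.random.rand(64, 64).astype(np.float32)
print(spatial_prompt(img).shape)  # (64, 64): prompt lives in the spatial dimension
```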

[371] Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift

Behraj Khan, Tahir Qasim Syed, Nouman M. Durrani, Bilal Naseem, Shabir Ahmad, Rizwan Qureshi

Main category: cs.CV

TL;DR: StaRFM addresses distribution shift and confidence misalignment in foundation models like CLIP and SAM by combining Fisher information penalty (FIP) and confidence misalignment penalty (CMP), improving accuracy and calibration across vision and medical tasks.

DetailsMotivation: Foundation models face challenges like distribution shift (e.g., inter-center image differences) and confidence misalignment (overconfident errors), which hinder their deployment in tasks like computer vision and medical imaging.

Method: StaRFM integrates FIP (extended to 3D) to reduce embedding shift and CMP (reformulated for voxel-level predictions) to calibrate uncertainty. It uses PAC-Bayes bounds for generalization control and Brier score minimization for calibration.

Result: StaRFM improves accuracy by 3.5% and reduces ECE by 28% on vision datasets, achieves +4.2% DSC and 4.8mm HD95 on medical benchmarks, and cuts cross-domain gaps by up to 20%.

Conclusion: StaRFM is a plug-and-play solution for foundation models, effectively addressing key challenges with minimal architectural changes, and is validated across diverse datasets.

Abstract: Foundation models like CLIP and SAM have advanced computer vision and medical imaging via low-shot transfer learning, aiding CADD with limited data. However, their deployment faces two key challenges: \textit{distribution shift}, where pre-training and post-training data distributions differ (e.g., due to inter-center image acquisition), and \textit{confidence misalignment}, which leads to overconfident errors. These issues surface differently: vision-language models (e.g., CLIP) suffer from 2D embedding shift (image-text misalignment), while medical models (e.g., SAM) encounter 3D domain shifts (e.g., scanner variation) and voxel-wise calibration needs. Existing solutions are domain-specific. We propose \textbf{StaRFM}, a fusion of a Fisher information penalty (FIP) and a confidence misalignment penalty (CMP) tackling both challenges. It applies FIP, extended to 3D via patch-wise regularization, to reduce embedding shift, and CMP, reformulated for voxel-level predictions, to calibrate segmentation uncertainty. We derive PAC-Bayes bounds: FIP controls generalization via the Fisher-Rao norm, and CMP reduces calibration error via Brier score minimization. StaRFM surpasses baselines by +3.5% accuracy and 28% lower ECE on 19 vision datasets (e.g., ImageNet, Office-Home), achieves +4.2% DSC over SAM-FT and 4.8mm HD95 on medical benchmarks (e.g., BraTS, ATLAS), and reduces cross-domain gaps by up to 20%. The framework is plug-and-play, requiring minimal architectural changes. Code and models are available at: https://anonymous.4open.science/r/StaRFM-C0CD/
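
To make the two penalties tangible, here is a toy NumPy version on a linear softmax classifier: FIP approximated as the empirical Fisher (mean squared norm of per-sample log-likelihood gradients) and CMP as the Brier score. The 3D patch-wise extension, the exact penalty forms, and the PAC-Bayes analysis are not reproduced, and the penalty weights are placeholders.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def starfm_style_loss(W, X, y, lam_fip=0.1, lam_cmp=0.1):
    """Cross-entropy + Fisher information penalty + Brier-score calibration term.

    W: [D, C] linear head, X: [N, D] features, y: [N] integer labels.
    Both penalty forms are simplified stand-ins for the paper's FIP and CMP.
    """
    N, C = X.shape[0], W.shape[1]
    P = softmax(X @ W)
    onehot = np.eye(C)[y]
    ce = -np.mean(np.log(P[np.arange(N), y] + 1e-12))
    # FIP ~ empirical Fisher: mean squared norm of per-sample log-lik gradients.
    grads = X[:, :, None] * (onehot - P)[:, None, :]      # d log p(y|x) / dW
    fip = np.mean(np.sum(grads ** 2, axis=(1, 2)))
    # CMP ~ Brier score: squared gap between confidence and correctness.
    cmp_ = np.mean(np.sum((P - onehot) ** 2, axis=1))
    return ce + lam_fip * fip + lam_cmp * cmp_

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 5)), rng.integers(0, 3, size=32)
print(starfm_style_loss(rng.normal(size=(5, 3)) * 0.1, X, y))
```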

[372] GreenCrossingAI: A Camera Trap/Computer Vision Pipeline for Environmental Science Research Groups

Bernie Boscoe, Shawn Johnson, Andrea Osbon, Chandler Campbell, Karen Mager

Main category: cs.CV

TL;DR: A guide for a low-resource pipeline to process camera trap data using ML/AI, tailored for small research groups with limited resources.

DetailsMotivation: Address challenges in processing large volumes of camera trap data, including labeling, environmental variability, and integrating ML/AI tools into workflows.

Method: Proposes a practical, on-premise pipeline for data transmission, inference, and evaluation, customized for small research groups.

Result: Enables researchers to efficiently process and derive insights from camera trap datasets despite limited resources.

Conclusion: The pipeline provides an accessible solution for small groups to leverage ML/AI for wildlife research.

Abstract: Camera traps have long been used by wildlife researchers to monitor and study animal behavior, population dynamics, habitat use, and species diversity in a non-invasive and efficient manner. While data collection from the field has increased with new tools and capabilities, methods to develop, process, and manage the data, especially the adoption of ML/AI tools, remain challenging. These challenges include the sheer volume of data generated, the need for accurate labeling and annotation, variability in environmental conditions affecting data quality, and the integration of ML/AI tools into existing workflows that often require domain-specific customization and computational resources. This paper provides a guide to a low-resource pipeline to process camera trap data on-premise, incorporating ML/AI capabilities tailored for small research groups with limited resources and computational expertise. By focusing on practical solutions, the pipeline offers accessible approaches for data transmission, inference, and evaluation, enabling researchers to discover meaningful insights from their ever-increasing camera trap datasets.

[373] HMID-Net: An Exploration of Masked Image Modeling and Knowledge Distillation in Hyperbolic Space

Changli Wang, Fang Yin, Jiafeng Liu, Rui Wu

Main category: cs.CV

TL;DR: HMID-Net integrates Masked Image Modeling and knowledge distillation in hyperbolic space for efficient hierarchical visual-semantic learning, outperforming MERU and CLIP.

DetailsMotivation: To train models that capture visual-semantic hierarchies more efficiently by leveraging the advantages of hyperbolic space.

Method: Proposes HMID-Net, combining Masked Image Modeling and knowledge distillation in hyperbolic space with a custom distillation loss.

Result: Achieves superior performance in image classification and retrieval, surpassing MERU and CLIP.

Conclusion: HMID-Net effectively leverages hyperbolic space for hierarchical learning, demonstrating significant improvements over existing methods.

Abstract: Visual and semantic concepts are often structured in a hierarchical manner. For instance, the textual concept 'cat' entails all images of cats. A recent study, MERU, successfully adapts multimodal learning techniques from Euclidean space to hyperbolic space, effectively capturing the visual-semantic hierarchy. However, a critical question remains: how can we more efficiently train a model to capture and leverage this hierarchy? In this paper, we propose the Hyperbolic Masked Image and Distillation Network (HMID-Net), a novel and efficient method that integrates Masked Image Modeling (MIM) and knowledge distillation techniques within hyperbolic space. To the best of our knowledge, this is the first approach to leverage MIM and knowledge distillation in hyperbolic space to train highly efficient models. In addition, we introduce a distillation loss function specifically designed to facilitate effective knowledge transfer in hyperbolic space. Our experiments demonstrate that MIM and knowledge distillation techniques in hyperbolic space can achieve the same remarkable success as in Euclidean space. Extensive evaluations show that our method excels across a wide range of downstream tasks, significantly outperforming existing models like MERU and CLIP in both image classification and retrieval.
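
The abstract does not spell out the distillation loss, but the Poincaré-ball geometry it would operate on is standard. Below is a minimal sketch under an explicit assumption: the distillation term is simply the mean hyperbolic distance between teacher and student embeddings already projected into the ball. The paper's actual loss may differ.

```python
import numpy as np

def poincare_distance(x, y, eps=1e-7):
    """Geodesic distance on the Poincare ball (points with norm < 1)."""
    sq = np.sum((x - y) ** 2, axis=-1)
    nx = np.clip(np.sum(x ** 2, axis=-1), 0, 1 - eps)
    ny = np.clip(np.sum(y ** 2, axis=-1), 0, 1 - eps)
    return np.arccosh(1 + 2 * sq / ((1 - nx) * (1 - ny)) + eps)

def hyperbolic_distill_loss(student, teacher):
    """Assumed distillation objective: mean hyperbolic distance between
    student and teacher embeddings (both already inside the ball)."""
    return float(np.mean(poincare_distance(student, teacher)))

def project_to_ball(v, max_norm=0.9):
    """Rescale Euclidean vectors to lie safely inside the unit ball."""
    n = np.linalg.norm(v, axis=-1, keepdims=True)
    return v * np.minimum(1.0, max_norm / (n + 1e-12))

rng = np.random.default_rng(0)
s = project_to_ball(rng.normal(size=(16, 8)))   # student batch, dim 8
t = project_to_ball(rng.normal(size=(16, 8)))   # teacher batch, dim 8
print(hyperbolic_distill_loss(s, t))
```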

[374] A Lightweight and Robust Framework for Real-Time Colorectal Polyp Detection Using LOF-Based Preprocessing and YOLO-v11n

Saadat Behzadi, Danial Sharifrazi, Bita Mesbahzadeh, Javad Hassannataj Joloudari, Roohallah Alizadehsani

Main category: cs.CV

TL;DR: A lightweight framework combining LOF for noise filtering and YOLO-v11n for polyp detection achieves high accuracy and efficiency in real-time colonoscopy support.

DetailsMotivation: Timely and accurate polyp detection is vital for colorectal cancer prevention, requiring robust and efficient AI solutions.

Method: The study uses LOF for outlier removal and YOLO-v11n for detection, validated on five datasets with 5-fold cross-validation and data augmentation.

Result: Achieved precision of 95.83%, recall of 91.85%, F1-score of 93.48%, and mAP@0.5 of 96.48%, outperforming prior YOLO-based methods.

Conclusion: The method is effective for real-time clinical use, highlighting the importance of data preprocessing and model efficiency in medical AI.

Abstract: Objectives: Timely and accurate detection of colorectal polyps plays a crucial role in diagnosing and preventing colorectal cancer, a major cause of mortality worldwide. This study introduces a new, lightweight, and efficient framework for polyp detection that combines the Local Outlier Factor (LOF) algorithm for filtering noisy data with the YOLO-v11n deep learning model. Study design: An experimental study leveraging deep learning and outlier removal techniques across multiple public datasets. Methods: The proposed approach was tested on five diverse and publicly available datasets: CVC-ColonDB, CVC-ClinicDB, Kvasir-SEG, ETIS, and EndoScene. Since these datasets originally lacked bounding box annotations, we converted their segmentation masks into suitable detection labels. To enhance the robustness and generalizability of our model, we apply 5-fold cross-validation and remove anomalous samples using the LOF method configured with 30 neighbors and a contamination ratio of 5%. Cleaned data are then fed into YOLO-v11n, a fast and resource-efficient object detection architecture optimized for real-time applications. We train the model using a combination of modern augmentation strategies to improve detection accuracy under diverse conditions. Results: Our approach significantly improves polyp localization performance, achieving a precision of 95.83%, recall of 91.85%, F1-score of 93.48%, mAP@0.5 of 96.48%, and mAP@0.5:0.95 of 77.75%. Compared to previous YOLO-based methods, our model demonstrates enhanced accuracy and efficiency. Conclusions: These results suggest that the proposed method is well-suited for real-time colonoscopy support in clinical settings. Overall, the study underscores how crucial data preprocessing and model efficiency are when designing effective AI systems for medical imaging.
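
The preprocessing step is fully specified in the abstract (LOF with 30 neighbors and a 5% contamination ratio), so it can be sketched directly with scikit-learn; only the choice of per-image feature descriptor is an assumption here.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def filter_with_lof(features, n_neighbors=30, contamination=0.05):
    """Drop anomalous training samples before YOLO training.

    features: [N, D] per-image descriptors (how images are embedded is an
    assumption; the paper only specifies the LOF configuration).
    Returns a boolean mask of samples to keep.
    """
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)
    labels = lof.fit_predict(features)   # +1 = inlier, -1 = outlier
    return labels == 1

rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 1, size=(190, 16)),    # typical frames
                   rng.normal(6, 1, size=(10, 16))])    # corrupted frames
keep = filter_with_lof(feats)
print(f"kept {keep.sum()} / {len(keep)} samples")       # outliers removed
```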

[375] EHPE: A Segmented Architecture for Enhanced Hand Pose Estimation

Bolun Zheng, Xinjie Liu, Qianyu Zhang, Canjin Wang, Fangni Chen, Mingen Xu

Main category: cs.CV

TL;DR: A novel segmented architecture (EHPE) improves 3D hand pose estimation by focusing on TIP and wrist joints to reduce error accumulation and enhance accuracy.

DetailsMotivation: Existing methods overlook the importance of TIP and wrist joints, leading to error accumulation and degraded pose estimation quality.

Method: EHPE uses a two-stage approach: TW-stage for TIP and wrist joint extraction, and PG-stage for refining remaining joints with a dual-branch network.

Result: EHPE achieves state-of-the-art performance on two benchmarks, reducing predictive errors for all joints.

Conclusion: The proposed EHPE method effectively addresses error accumulation and improves overall hand pose estimation accuracy.

Abstract: 3D hand pose estimation has garnered great attention in recent years due to its critical applications in human-computer interaction, virtual reality, and related fields. Accurate estimation of hand joints is essential for high-quality hand pose estimation. However, existing methods neglect the importance of the Distal Phalanx Tip (TIP) and wrist in predicting hand joints, and often fail to account for error accumulation at distal joints during gesture estimation, which causes certain joints to incur larger errors, producing misalignments and artifacts in the pose and degrading overall reconstruction quality. To address this challenge, we propose a novel segmented architecture for enhanced hand pose estimation (EHPE). We perform local extraction of the TIP and wrist joints, alleviating the effect of error accumulation on TIP prediction, and on this basis further reduce the predictive errors for all joints. EHPE consists of two key stages: in the TIP and Wrist Joints Extraction stage (TW-stage), the positions of the TIP and wrist joints are estimated to provide an initial accurate joint configuration; in the Prior-Guided Joints Estimation stage (PG-stage), a dual-branch interaction network is employed to refine the positions of the remaining joints. Extensive experiments on two widely used benchmarks demonstrate that EHPE achieves state-of-the-art performance. Code is available at https://github.com/SereinNout/EHPE.

[376] Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling

Hayeon Kim, Ji Ha Jang, Se Young Chun

Main category: cs.CV

TL;DR: RoMaP is a novel framework for precise local 3D Gaussian editing, addressing challenges in multi-view segmentation and SDS loss ambiguity with robust 3D masking and regularized SDS loss.

DetailsMotivation: Current methods struggle with precise local 3D edits due to inconsistent multi-view segmentations and ambiguous SDS loss, limiting high-quality part-level modifications.

Method: RoMaP introduces 3D-GALP for accurate part segmentation and a regularized SDS loss with L1 anchor loss (via SLaMP) and additional regularizers like Gaussian prior removal.

Result: RoMaP achieves state-of-the-art local 3D editing on reconstructed and generated Gaussian scenes, enabling robust and flexible part-level modifications.

Conclusion: RoMaP advances 3D Gaussian editing by providing precise, high-quality local edits while preserving contextual coherence, with potential for broader applications.

Abstract: Recent advances in 3D neural representations and instance-level editing models have enabled the efficient creation of high-quality 3D content. However, achieving precise local 3D edits remains challenging, especially for Gaussian Splatting, due to inconsistent multi-view 2D part segmentations and the inherently ambiguous nature of the Score Distillation Sampling (SDS) loss. To address these limitations, we propose RoMaP, a novel local 3D Gaussian editing framework that enables precise and drastic part-level modifications. First, we introduce a robust 3D mask generation module with our 3D-Geometry Aware Label Prediction (3D-GALP), which uses spherical harmonics (SH) coefficients to model view-dependent label variations and soft-label properties, yielding accurate and consistent part segmentations across viewpoints. Second, we propose a regularized SDS loss that combines the standard SDS loss with additional regularizers. In particular, an L1 anchor loss is introduced via our Scheduled Latent Mixing and Part (SLaMP) editing method, which generates high-quality part-edited 2D images and confines modifications only to the target region while preserving contextual coherence. Additional regularizers, such as Gaussian prior removal, further improve flexibility by allowing changes beyond the existing context, and robust 3D masking prevents unintended edits. Experimental results demonstrate that RoMaP achieves state-of-the-art local 3D editing on both reconstructed and generated Gaussian scenes and objects, both qualitatively and quantitatively, enabling more robust and flexible part-level 3D Gaussian editing. Code is available at https://janeyeon.github.io/romap.

[377] NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection

Amirhossein Ansari, Ke Wang, Pulei Xiong

Main category: cs.CV

TL;DR: NegRefine improves zero-shot OOD detection by refining negative labels and dynamically adjusting scoring for multi-matching images.

DetailsMotivation: Existing methods like NegLabel and CSP misclassify in-distribution samples as OOD due to subcategory labels and proper nouns, and struggle with multi-matching images.

Method: NegRefine filters subcategory labels and proper nouns from negative labels and uses a multi-matching-aware scoring function.

Result: Evaluated on ImageNet-1K, NegRefine achieves robust separation between in-distribution and OOD samples.

Conclusion: NegRefine addresses limitations of prior methods, enhancing zero-shot OOD detection performance.

Abstract: Recent advancements in Vision-Language Models like CLIP have enabled zero-shot OOD detection by leveraging both image and textual label information. Among these, negative label-based methods such as NegLabel and CSP have shown promising results by utilizing a lexicon of words to define negative labels for distinguishing OOD samples. However, these methods suffer from detecting in-distribution samples as OOD due to negative labels that are subcategories of in-distribution labels or proper nouns. They also face limitations in handling images that match multiple in-distribution and negative labels. We propose NegRefine, a novel negative label refinement framework for zero-shot OOD detection. By introducing a filtering mechanism to exclude subcategory labels and proper nouns from the negative label set and incorporating a multi-matching-aware scoring function that dynamically adjusts the contributions of multiple labels matching an image, NegRefine ensures a more robust separation between in-distribution and OOD samples. We evaluate NegRefine on large-scale benchmarks, including ImageNet-1K. The code is available at https://github.com/ah-ansari/NegRefine.
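
A minimal NumPy sketch of both ideas follows: a similarity-threshold stand-in for the subcategory/proper-noun filter (the paper's actual mechanism is richer than a cosine cutoff) and a multi-matching-aware score that sums softmax mass over all in-distribution labels rather than taking a single maximum. The threshold and temperature values are assumptions.

```python
import numpy as np

def softmax(z, t=0.07):
    z = z / t
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def refine_negatives(id_txt, neg_txt, thresh=0.85):
    """Drop negative labels whose text embedding is too close to any ID label
    (stand-in for the paper's subcategory / proper-noun filtering)."""
    sim = neg_txt @ id_txt.T            # embeddings assumed L2-normalized
    return neg_txt[sim.max(axis=1) < thresh]

def ood_score(img_emb, id_txt, neg_txt):
    """Multi-matching-aware score: total softmax mass on ID labels.
    Lower values suggest the image is OOD."""
    sims = np.concatenate([img_emb @ id_txt.T, img_emb @ neg_txt.T])
    p = softmax(sims)
    return float(p[: len(id_txt)].sum())

rng = np.random.default_rng(0)
def unit(v): return v / np.linalg.norm(v, axis=-1, keepdims=True)
id_txt, neg_txt = unit(rng.normal(size=(5, 32))), unit(rng.normal(size=(50, 32)))
neg_txt = refine_negatives(id_txt, neg_txt)
print(ood_score(unit(rng.normal(size=32)), id_txt, neg_txt))
```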

[378] 3DGAA: Realistic and Robust 3D Gaussian-based Adversarial Attack for Autonomous Driving

Yixun Zhang, Lizhi Wang, Junjun Zhao, Wending Zhao, Feng Zhou, Yonghao Dang, Jianqin Yin

Main category: cs.CV

TL;DR: 3DGAA is a novel adversarial attack framework using 3D Gaussian Splatting to optimize geometry and appearance, outperforming existing methods in physical realism and robustness.

DetailsMotivation: Address vulnerabilities in camera-based object detection for autonomous driving by improving adversarial attack realism and robustness.

Method: Uses 3D Gaussian Splatting to jointly optimize geometry and appearance, with physical filtering and augmentation modules for real-world applicability.

Result: Reduces detection mAP from 87.21% to 7.38%, outperforming existing 3D physical attacks and showing high transferability.

Conclusion: 3DGAA sets a new benchmark for physically realizable adversarial attacks, enhancing robustness and realism.

Abstract: Camera-based object detection systems play a vital role in autonomous driving, yet they remain vulnerable to adversarial threats in real-world environments. Existing 2D and 3D physical attacks, due to their focus on texture optimization, often struggle to balance physical realism and attack robustness. In this work, we propose 3D Gaussian-based Adversarial Attack (3DGAA), a novel adversarial object generation framework that leverages the full 14-dimensional parameterization of 3D Gaussian Splatting (3DGS) to jointly optimize geometry and appearance in physically realizable ways. Unlike prior works that rely on patches or texture optimization, 3DGAA jointly perturbs both geometric attributes (shape, scale, rotation) and appearance attributes (color, opacity) to produce physically realistic and transferable adversarial objects. We further introduce a physical filtering module that filters outliers to preserve geometric fidelity, and a physical augmentation module that simulates complex physical scenarios to enhance attack generalization under real-world conditions. We evaluate 3DGAA on both virtual benchmarks and physical-world setups using miniature vehicle models. Experimental results show that 3DGAA reduces the detection mAP from 87.21% to 7.38%, significantly outperforming existing 3D physical attacks. Moreover, our method maintains high transferability across different physical conditions, demonstrating a new state-of-the-art in physically realizable adversarial attacks.

[379] Improving Multimodal Learning via Imbalanced Learning

Shicai Wei, Chunbo Luo, Yang Luo

Main category: cs.CV

TL;DR: The paper proposes Asymmetric Representation Learning (ARL) to optimize multimodal learning by imbalanced dependency on modalities, inversely proportional to their variances, improving performance without extra parameters.

DetailsMotivation: Multimodal learning often underperforms due to imbalanced learning, but balanced learning is suboptimal. The paper aims to optimize performance by leveraging modality variances.

Method: ARL uses auxiliary regularizers to calculate modality prediction variances, re-weighting optimization inversely to variance ratios. It also incorporates prediction bias for generalization.

Result: Extensive experiments show ARL’s effectiveness and versatility across datasets, validating its approach.

Conclusion: ARL optimizes multimodal learning by imbalanced dependency on modalities, outperforming balanced methods without adding parameters.

Abstract: Multimodal learning often encounters the under-optimized problem and may perform worse than unimodal learning. Existing approaches attribute this issue to imbalanced learning across modalities and tend to address it through gradient balancing. However, this paper argues that balanced learning is not the optimal setting for multimodal learning. With bias-variance analysis, we prove that an imbalanced dependency on each modality, following the inverse ratio of their variances, contributes to optimal performance. To this end, we propose the Asymmetric Representation Learning (ARL) strategy to assist multimodal learning via imbalanced optimization. ARL introduces auxiliary regularizers for each modality encoder to calculate their prediction variance. ARL then calculates coefficients via the unimodal variance to re-weight the optimization of each modality, forcing the modality dependence ratio to be inversely proportional to the modality variance ratio. Moreover, to minimize the generalization error, ARL further introduces the prediction bias of each modality and jointly optimizes them with the multimodal loss. Notably, all auxiliary regularizers share parameters with the multimodal model and rely only on the modality representation. Thus, the proposed ARL strategy introduces no extra parameters and is independent of the structures and fusion methods of the multimodal model. Finally, extensive experiments on various datasets validate the effectiveness and versatility of ARL. Code is available at https://github.com/shicaiwei123/ICCV2025-ARL
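
The re-weighting rule itself is stated plainly (dependence inversely proportional to prediction variance) and is easy to sketch. How the auxiliary regularizers estimate each modality's variance is simplified here to a given per-modality batch variance; the bias term and shared parameters are omitted.

```python
import numpy as np

def arl_weights(variances):
    """Weights proportional to 1/variance, normalized to sum to 1,
    so the dependence ratio is the inverse of the variance ratio."""
    inv = 1.0 / (np.asarray(variances) + 1e-8)
    return inv / inv.sum()

def weighted_multimodal_loss(losses, variances):
    """Re-weight per-modality losses by inverse prediction variance."""
    w = arl_weights(variances)
    return float(np.dot(w, losses))

# Toy: the audio branch is noisier (higher prediction variance) than vision.
per_modality_losses = np.array([0.9, 1.1])      # [vision, audio]
per_modality_vars = np.array([0.05, 0.20])      # batch variance of predictions
print(arl_weights(per_modality_vars))           # vision gets ~0.8 of the weight
print(weighted_multimodal_loss(per_modality_losses, per_modality_vars))
```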

[380] From Wardrobe to Canvas: Wardrobe Polyptych LoRA for Part-level Controllable Human Image Generation

Jeongho Kim, Sunghyun Park, Hyoungwoo Park, Sungrack Yun, Jaegul Choo, Seokeon Choi

Main category: cs.CV

TL;DR: Wardrobe Polyptych LoRA introduces a part-level controllable model for personalized human image generation, avoiding inference-time fine-tuning and large-scale training while improving fidelity and consistency.

DetailsMotivation: Existing methods for personalized human image generation are computationally expensive and impractical for real-time applications due to fine-tuning or large-scale training requirements.

Method: The approach trains only LoRA layers, conditions generation on the subject’s wardrobe, uses spatial references, and introduces a selective subject region loss to improve fidelity and consistency.

Result: The method outperforms existing techniques in fidelity and consistency, enabling realistic and identity-preserving full-body synthesis without additional inference parameters.

Conclusion: Wardrobe Polyptych LoRA offers a practical and efficient solution for high-fidelity personalized human image generation, validated by extensive experiments.

Abstract: Recent diffusion models achieve personalization by learning specific subjects, allowing learned attributes to be integrated into generated images. However, personalized human image generation remains challenging due to the need for precise and consistent attribute preservation (e.g., identity, clothing details). Existing subject-driven image generation methods often require either (1) inference-time fine-tuning with few images for each new subject or (2) large-scale dataset training for generalization. Both approaches are computationally expensive and impractical for real-time applications. To address these limitations, we present Wardrobe Polyptych LoRA, a novel part-level controllable model for personalized human image generation. By training only LoRA layers, our method removes the computational burden at inference while ensuring high-fidelity synthesis of unseen subjects. Our key idea is to condition the generation on the subject’s wardrobe and leverage spatial references to reduce information loss, thereby improving fidelity and consistency. Additionally, we introduce a selective subject region loss, which encourages the model to disregard some of the reference images during training. Our loss ensures that generated images better align with text prompts while maintaining subject integrity. Notably, our Wardrobe Polyptych LoRA requires no additional parameters at the inference stage and performs generation using a single model trained on a few training samples. We construct a new dataset and benchmark tailored for personalized human image generation. Extensive experiments show that our approach significantly outperforms existing techniques in fidelity and consistency, enabling realistic and identity-preserving full-body synthesis.

[381] Alleviating Textual Reliance in Medical Language-guided Segmentation via Prototype-driven Semantic Approximation

Shuchang Ye, Usman Naseem, Mingyuan Meng, Jinman Kim

Main category: cs.CV

TL;DR: ProLearn is a prototype-driven framework for language-guided segmentation that reduces reliance on paired image-text data, enabling broader clinical application.

DetailsMotivation: Current methods rely on paired image-text data, limiting their use in datasets without paired reports and in clinical scenarios where segmentation precedes reporting.

Method: Introduces a Prototype-driven Semantic Approximation (PSA) module to approximate semantic guidance from text, enabling segmentation without paired reports.

Result: ProLearn outperforms state-of-the-art methods on datasets like QaTa-COV19, MosMedData+, and Kvasir-SEG when text is limited.

Conclusion: ProLearn effectively reduces textual reliance, enhancing the practicality of language-guided segmentation in clinical settings.

Abstract: Medical language-guided segmentation, integrating textual clinical reports as auxiliary guidance to enhance image segmentation, has demonstrated significant improvements over unimodal approaches. However, its inherent reliance on paired image-text input, which we refer to as "textual reliance", presents two fundamental limitations: 1) many medical segmentation datasets lack paired reports, leaving a substantial portion of image-only data underutilized for training; and 2) inference is limited to retrospective analysis of cases with paired reports, limiting its applicability in most clinical scenarios where segmentation typically precedes reporting. To address these limitations, we propose ProLearn, the first Prototype-driven Learning framework for language-guided segmentation that fundamentally alleviates textual reliance. At its core, we introduce a novel Prototype-driven Semantic Approximation (PSA) module to enable approximation of semantic guidance from textual input. PSA initializes a discrete and compact prototype space by distilling segmentation-relevant semantics from textual reports. Once initialized, it supports a query-and-respond mechanism which approximates semantic guidance for images without textual input, thereby alleviating textual reliance. Extensive experiments on QaTa-COV19, MosMedData+ and Kvasir-SEG demonstrate that ProLearn outperforms state-of-the-art language-guided methods when limited text is available.
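
A minimal sketch of the query-and-respond mechanism, assuming the prototype bank has already been distilled from reports: an image feature queries the bank and receives an attention-weighted prototype mixture that stands in for the missing text guidance. The temperature and bank size are placeholders, not the paper's settings.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class PrototypeBank:
    """Discrete prototype space; initialization from reports is assumed done."""
    def __init__(self, prototypes):
        self.P = prototypes                      # [K, D], one row per prototype

    def respond(self, query, temperature=0.1):
        """Approximate semantic guidance for an image without any text:
        attention over prototypes, then a weighted prototype mixture."""
        attn = softmax(self.P @ query / temperature)
        return attn @ self.P                     # [D] pseudo text-guidance vector

rng = np.random.default_rng(0)
bank = PrototypeBank(rng.normal(size=(32, 64)))  # 32 prototypes, dim 64
image_feature = rng.normal(size=64)
guidance = bank.respond(image_feature)
print(guidance.shape)                            # feeds the segmentation decoder
```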

[382] MoViAD: A Modular Library for Visual Anomaly Detection

Manuel Barusco, Francesco Borsatti, Arianna Stropeni, Davide Dalle Pezze, Gian Antonio Susto

Main category: cs.CV

TL;DR: MoViAD is a modular library for Visual Anomaly Detection (VAD), offering state-of-the-art models, datasets, and utilities to address challenges like data scarcity and unsupervised training. It supports diverse scenarios and deployment needs, including Edge/IoT optimization.

DetailsMotivation: The scarcity of anomalous data and the need for unsupervised training in VAD research and deployment motivated the creation of MoViAD.

Method: MoViAD provides a modular library with pre-built models, trainers, datasets, and utilities, supporting various VAD scenarios and deployment optimizations like quantization and compression.

Result: The library enables fast deployment, flexibility for researchers, and practical solutions for Edge/IoT settings, with robust evaluation metrics and profiling tools.

Conclusion: MoViAD accelerates VAD research and deployment by offering a comprehensive, modular, and efficient solution for diverse use cases.

Abstract: Visual Anomaly Detection (VAD) is a critical field in machine learning focused on identifying deviations from normal patterns in images, often challenged by the scarcity of anomalous data and the need for unsupervised training. To accelerate research and deployment in this domain, we introduce MoViAD, a comprehensive and highly modular library designed to provide fast and easy access to state-of-the-art VAD models, trainers, datasets, and VAD utilities. MoViAD supports a wide array of scenarios, including continual, semi-supervised, few-shot, noisy, and many more. In addition, it addresses practical deployment challenges through dedicated Edge and IoT settings, offering optimized models and backbones, along with quantization and compression utilities for efficient on-device execution and distributed inference. MoViAD integrates a selection of backbones, robust VAD evaluation metrics (pixel-level and image-level) and useful profiling tools for efficiency analysis. The library is designed for fast, effortless deployment, enabling machine learning engineers to easily use it for their specific setup with custom models, datasets, and backbones. At the same time, it offers the flexibility and extensibility researchers need to develop and experiment with new methods.

[383] YOLOv8-SMOT: An Efficient and Robust Framework for Real-Time Small Object Tracking via Slice-Assisted Training and Adaptive Association

Xiang Yu, Xinyao Liu, Guang Liang

Main category: cs.CV

TL;DR: The paper presents a winning solution for tracking small, agile multi-objects (SMOT) from UAVs, addressing challenges like scarce features, motion complexity, and occlusions. It introduces SliceTrain for detection and a robust tracker for association, achieving state-of-the-art results.

DetailsMotivation: Tracking small, agile objects like birds from UAVs is challenging due to scarce features, complex motion, and occlusions. The paper aims to solve these issues for real-world SMOT problems.

Method: Uses tracking-by-detection with SliceTrain for enhanced small-object detection and a robust tracker with motion direction maintenance and adaptive similarity metrics.

Result: Achieves state-of-the-art performance with an SO-HOTA score of 55.205 on the SMOT4SB test set.

Conclusion: The framework effectively addresses complex SMOT challenges, validated by its competitive performance.

Abstract: Tracking small, agile multi-objects (SMOT), such as birds, from an Unmanned Aerial Vehicle (UAV) perspective is a highly challenging computer vision task. The difficulty stems from three main sources: the extreme scarcity of target appearance features, the complex motion entanglement caused by the combined dynamics of the camera and the targets themselves, and the frequent occlusions and identity ambiguity arising from dense flocking behavior. This paper details our championship-winning solution in the MVA 2025 “Finding Birds” Small Multi-Object Tracking Challenge (SMOT4SB), which adopts the tracking-by-detection paradigm with targeted innovations at both the detection and association levels. On the detection side, we propose a systematic training enhancement framework named \textbf{SliceTrain}. This framework, through the synergy of ‘deterministic full-coverage slicing’ and ‘slice-level stochastic augmentation’, effectively addresses the problem of insufficient learning for small objects in high-resolution image training. On the tracking side, we designed a robust tracker that is completely independent of appearance information. By integrating a \textbf{motion direction maintenance (EMA)} mechanism and an \textbf{adaptive similarity metric} combining \textbf{bounding box expansion and distance penalty} into the OC-SORT framework, our tracker can stably handle irregular motion and maintain target identities. Our method achieves state-of-the-art performance on the SMOT4SB public test set, reaching an SO-HOTA score of \textbf{55.205}, which fully validates the effectiveness and advancement of our framework in solving complex real-world SMOT problems. The source code will be made available at https://github.com/Salvatore-Love/YOLOv8-SMOT.
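
The association metric can be sketched from its description: IoU on expanded boxes (tiny targets rarely overlap otherwise) minus a normalized center-distance penalty, plus an EMA update for motion direction. The expansion ratio, penalty weight, and EMA momentum below are assumed values, not the challenge submission's settings.

```python
import numpy as np

def expand(box, r=1.5):
    """Expand [x1, y1, x2, y2] about its center by factor r (assumed value)."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    w, h = (box[2] - box[0]) * r, (box[3] - box[1]) * r
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def adaptive_similarity(track_box, det_box, diag, w_dist=0.5):
    """Expanded-box IoU minus a center-distance penalty normalized by the
    image diagonal `diag`."""
    e_iou = iou(expand(track_box), expand(det_box))
    tc = np.array([(track_box[0] + track_box[2]) / 2, (track_box[1] + track_box[3]) / 2])
    dc = np.array([(det_box[0] + det_box[2]) / 2, (det_box[1] + det_box[3]) / 2])
    return e_iou - w_dist * np.linalg.norm(tc - dc) / diag

def ema_direction(prev_dir, velocity, alpha=0.9):
    """Motion-direction maintenance: EMA of the unit velocity vector."""
    v = velocity / (np.linalg.norm(velocity) + 1e-9)
    d = alpha * prev_dir + (1 - alpha) * v
    return d / (np.linalg.norm(d) + 1e-9)

print(adaptive_similarity(np.array([10, 10, 14, 14.]),
                          np.array([13, 12, 17, 16.]), diag=1469.0))
```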

[384] AD-GS: Object-Aware B-Spline Gaussian Splatting for Self-Supervised Autonomous Driving

Jiawei Xu, Kai Deng, Zexin Fan, Shenlong Wang, Jin Xie, Jian Yang

Main category: cs.CV

TL;DR: AD-GS is a self-supervised framework for high-quality free-viewpoint rendering of driving scenes, using a novel motion model and simplified pseudo 2D segmentation, outperforming annotation-free methods and competing with annotation-dependent ones.

DetailsMotivation: Current methods for dynamic urban driving scenes rely on costly manual annotations or fail to accurately model motions and decompose scenes, leading to rendering artifacts.

Method: AD-GS integrates a learnable motion model (locality-aware B-spline curves and global-aware trigonometric functions), pseudo 2D segmentation, dynamic Gaussians, bidirectional temporal visibility masks, visibility reasoning, and rigid regularization.

Result: The model outperforms state-of-the-art annotation-free methods and competes with annotation-dependent approaches.

Conclusion: AD-GS provides a robust, annotation-free solution for high-quality rendering of dynamic driving scenes.

Abstract: Modeling and rendering dynamic urban driving scenes is crucial for self-driving simulation. Current high-quality methods typically rely on costly manual object tracklet annotations, while self-supervised approaches fail to capture dynamic object motions accurately and decompose scenes properly, resulting in rendering artifacts. We introduce AD-GS, a novel self-supervised framework for high-quality free-viewpoint rendering of driving scenes from a single log. At its core is a novel learnable motion model that integrates locality-aware B-spline curves with global-aware trigonometric functions, enabling flexible yet precise dynamic object modeling. Rather than requiring comprehensive semantic labeling, AD-GS automatically segments scenes into objects and background with the simplified pseudo 2D segmentation, representing objects using dynamic Gaussians and bidirectional temporal visibility masks. Further, our model incorporates visibility reasoning and physically rigid regularization to enhance robustness. Extensive evaluations demonstrate that our annotation-free model significantly outperforms current state-of-the-art annotation-free methods and is competitive with annotation-dependent approaches.
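
A minimal sketch of the stated motion model: a locality-aware cubic B-spline trajectory plus global trigonometric terms, combined additively. The knot layout, harmonic count, and the additive fusion are assumptions; the full learnable model optimizes these jointly with the Gaussians.

```python
import numpy as np
from scipy.interpolate import BSpline

def make_motion_model(ctrl_pts, amps, freqs, phases, k=3):
    """Object trajectory = cubic B-spline (local) + trigonometric terms (global).

    ctrl_pts: [n, 3] spline control points over normalized time t in [0, 1].
    amps, freqs, phases: [H, 3] harmonics (all layout choices are assumptions).
    """
    n = len(ctrl_pts)
    # Clamped knot vector: n + k + 1 knots so the curve hits the endpoints.
    knots = np.concatenate([np.zeros(k), np.linspace(0, 1, n - k + 1), np.ones(k)])
    splines = [BSpline(knots, ctrl_pts[:, d], k) for d in range(3)]

    def position(t):
        local = np.array([s(t) for s in splines])                    # B-spline part
        global_ = np.sum(amps * np.sin(2 * np.pi * freqs * t + phases), axis=0)
        return local + global_
    return position

rng = np.random.default_rng(0)
pos = make_motion_model(rng.normal(size=(6, 3)),
                        amps=0.1 * rng.random((2, 3)),
                        freqs=rng.integers(1, 4, size=(2, 3)).astype(float),
                        phases=rng.random((2, 3)))
print(pos(0.0), pos(0.5))   # object position at two normalized timestamps
```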

[385] SpatialTrackerV2: 3D Point Tracking Made Easy

Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Yuri Makarov, Bingyi Kang, Xing Zhu, Hujun Bao, Yujun Shen, Xiaowei Zhou

Main category: cs.CV

TL;DR: SpatialTrackerV2 is a feed-forward 3D point tracking method for monocular videos, unifying point tracking, depth, and camera pose estimation into a single architecture. It outperforms existing methods by 30% and matches dynamic 3D reconstruction accuracy while being 50x faster.

DetailsMotivation: To overcome the limitations of modular pipelines in 3D tracking by integrating point tracking, monocular depth, and camera pose estimation into a unified, high-performing system.

Method: Decomposes 3D motion into scene geometry, camera ego-motion, and object motion using a fully differentiable, end-to-end architecture. Trained on diverse datasets including synthetic, RGB-D, and unlabeled footage.

Result: Outperforms existing 3D tracking methods by 30% and matches dynamic 3D reconstruction accuracy while running 50x faster.

Conclusion: SpatialTrackerV2 demonstrates superior performance and efficiency by jointly learning geometry and motion from heterogeneous data.

Abstract: We present SpatialTrackerV2, a feed-forward 3D point tracking method for monocular videos. Going beyond modular pipelines built on off-the-shelf components for 3D tracking, our approach unifies the intrinsic connections between point tracking, monocular depth, and camera pose estimation into a high-performing and feedforward 3D point tracker. It decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion, with a fully differentiable and end-to-end architecture, allowing scalable training across a wide range of datasets, including synthetic sequences, posed RGB-D videos, and unlabeled in-the-wild footage. By learning geometry and motion jointly from such heterogeneous data, SpatialTrackerV2 outperforms existing 3D tracking methods by 30%, and matches the accuracy of leading dynamic 3D reconstruction approaches while running 50$\times$ faster.
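
The stated decomposition (scene geometry, camera ego-motion, pixel-wise object motion) can be illustrated with plain pinhole geometry. The sketch below is illustrative only, with assumed conventions (camera-to-world pose, simple pinhole intrinsics); it is not the learned architecture.

```python
import numpy as np

def unproject(uv, depth, K):
    """Pixel + depth -> 3D point in camera coordinates (pinhole model)."""
    u, v = uv
    x = (u - K[0, 2]) / K[0, 0]
    y = (v - K[1, 2]) / K[1, 1]
    return depth * np.array([x, y, 1.0])

def world_point(uv, depth, K, R_wc, t_wc, object_motion=np.zeros(3)):
    """Decomposition used for tracking: scene geometry (depth unprojection),
    camera ego-motion (R_wc, t_wc: camera-to-world), then object motion."""
    Xc = unproject(uv, depth, K)
    return R_wc @ Xc + t_wc + object_motion

K = np.array([[500., 0, 320.], [0, 500., 240.], [0, 0, 1.]])
R, t = np.eye(3), np.zeros(3)
p0 = world_point((330, 250), 2.0, K, R, t)                        # frame 0
p1 = world_point((333, 250), 2.1, K, R, t + np.array([0.05, 0, 0]),
                 object_motion=np.array([0.02, 0.0, 0.0]))        # frame 1
print(p1 - p0)   # world-space 3D motion of the tracked point
```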

[386] PhysX-3D: Physical-Grounded 3D Asset Generation

Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu

Main category: cs.CV

TL;DR: The paper introduces PhysX-3D, a framework for generating physics-grounded 3D assets, addressing the lack of physical properties in existing 3D models. It includes a dataset (PhysXNet) and a generation method (PhysXGen).

DetailsMotivation: Existing 3D generative models focus on geometry and textures but neglect physical properties, limiting real-world applications like simulation and embodied AI.

Method: 1) PhysXNet: A physics-grounded 3D dataset annotated across five dimensions. 2) PhysXGen: A dual-branch framework for physics-grounded image-to-3D generation.

Result: The framework demonstrates superior performance and generalization, validated through extensive experiments.

Conclusion: PhysX-3D advances physical-grounded 3D generation, with released code, data, and models to support future research.

Abstract: 3D modeling is moving from virtual to physical. Existing 3D generation primarily emphasizes geometries and textures while neglecting physical-grounded modeling. Consequently, despite the rapid development of 3D generative models, the synthesized 3D assets often overlook rich and important physical properties, hampering their real-world application in physical domains like simulation and embodied AI. As an initial attempt to address this challenge, we propose \textbf{PhysX-3D}, an end-to-end paradigm for physical-grounded 3D asset generation. 1) To bridge the critical gap in physics-annotated 3D datasets, we present PhysXNet - the first physics-grounded 3D dataset systematically annotated across five foundational dimensions: absolute scale, material, affordance, kinematics, and function description. In particular, we devise a scalable human-in-the-loop annotation pipeline based on vision-language models, which enables efficient creation of physics-first assets from raw 3D assets. 2) Furthermore, we propose \textbf{PhysXGen}, a feed-forward framework for physics-grounded image-to-3D asset generation, injecting physical knowledge into the pre-trained 3D structural space. Specifically, PhysXGen employs a dual-branch architecture to explicitly model the latent correlations between 3D structures and physical properties, thereby producing 3D assets with plausible physical predictions while preserving the native geometry quality. Extensive experiments validate the superior performance and promising generalization capability of our framework. All the code, data, and models will be released to facilitate future research in generative physical AI.

[387] AutoPartGen: Autoregressive 3D Part Generation and Discovery

Minghao Chen, Jianyuan Wang, Roman Shapovalov, Tom Monnier, Hyunyoung Jung, Dilin Wang, Rakesh Ranjan, Iro Laina, Andrea Vedaldi

Main category: cs.CV

TL;DR: AutoPartGen is an autoregressive model for generating 3D objects composed of parts, leveraging a latent 3D representation for part-based tasks. It inputs images, masks, or 3D objects and outputs coherent reconstructions without additional optimization.

DetailsMotivation: To address the challenge of generating 3D objects with compositional parts in an efficient and high-quality manner.

Method: AutoPartGen uses an autoregressive approach, predicting parts sequentially while conditioning on prior parts and inputs like images or masks. It leverages the 3DShape2VecSet latent space for its compositional properties.

Result: The model achieves state-of-the-art performance in 3D part generation, producing seamless and coherent objects.

Conclusion: AutoPartGen demonstrates effective part-based 3D generation, offering a scalable and high-quality solution for compositional object reconstruction.

Abstract: We introduce AutoPartGen, a model that generates objects composed of 3D parts in an autoregressive manner. This model can take as input an image of an object, 2D masks of the object’s parts, or an existing 3D object, and generate a corresponding compositional 3D reconstruction. Our approach builds upon 3DShape2VecSet, a recent latent 3D representation with powerful geometric expressiveness. We observe that this latent space exhibits strong compositional properties, making it particularly well-suited for part-based generation tasks. Specifically, AutoPartGen generates object parts autoregressively, predicting one part at a time while conditioning on previously generated parts and additional inputs, such as 2D images, masks, or 3D objects. This process continues until the model decides that all parts have been generated, thus determining automatically the type and number of parts. The resulting parts can be seamlessly assembled into coherent objects or scenes without requiring additional optimization. We evaluate both the overall 3D generation capabilities and the part-level generation quality of AutoPartGen, demonstrating that it achieves state-of-the-art performance in 3D part generation.

[388] PositionIC: Unified Position and Identity Consistency for Image Customization

Junjie Hu, Tianyang Han, Kai Ma, Jialin Gao, Hao Dou, Song Yang, Xianhua He, Jianhui Zhang, Junfeng Luo, Xiaoming Wei, Wenqiang Zhang

Main category: cs.CV

TL;DR: PositionIC introduces a framework for precise spatial control in multi-subject image customization, addressing the lack of scalable datasets for identity and positional cues.

DetailsMotivation: Current image customization lacks fine-grained spatial control due to missing datasets linking identity with precise positions.

Method: PositionIC uses a bidirectional generation pipeline and a positional modulation layer to decouple spatial embeddings for independent subject placement.

Result: The approach achieves accurate spatial control and high consistency in image customization.

Conclusion: PositionIC enables controllable, high-fidelity customization in open-world scenarios and will be released for further research.

Abstract: Recent subject-driven image customization has achieved significant advancements in fidelity, yet fine-grained entity-level spatial control remains elusive, hindering broader real-world application. This limitation is mainly attributed to the absence of scalable datasets that bind identity to precise positional cues. To this end, we introduce PositionIC, a unified framework that enforces position and identity consistency for multi-subject customization. We construct a scalable synthesis pipeline that employs a bidirectional generation paradigm to eliminate subject drift and maintain semantic coherence. On top of these data, we design a lightweight positional modulation layer that decouples spatial embeddings among subjects, enabling independent, accurate placement while preserving visual fidelity. Extensive experiments demonstrate that our approach can achieve precise spatial control while maintaining high consistency in the image customization task. PositionIC paves the way for controllable, high-fidelity image customization in open-world, multi-entity scenarios and will be released to foster further research.

[389] PCR-GS: COLMAP-Free 3D Gaussian Splatting via Pose Co-Regularizations

Yu Wei, Jiahui Zhang, Xiaoqin Zhang, Ling Shao, Shijian Lu

Main category: cs.CV

TL;DR: PCR-GS improves 3D Gaussian Splatting by co-regularizing camera poses using feature reprojection and wavelet-based frequency regularization, enhancing performance in complex scenes.

DetailsMotivation: Existing COLMAP-free 3D-GS methods struggle with complex camera trajectories, leading to degraded pose estimation and optimization issues.

Method: PCR-GS introduces two regularization techniques: feature reprojection (using DINO features) and wavelet-based frequency regularization to refine camera poses.

Result: PCR-GS outperforms in pose-free 3D-GS scene modeling, especially in scenes with dramatic camera trajectory changes.

Conclusion: PCR-GS offers a robust solution for 3D scene modeling without relying on COLMAP, addressing challenges in complex camera trajectories.

Abstract: COLMAP-free 3D Gaussian Splatting (3D-GS) has recently attracted increasing attention due to its remarkable performance in reconstructing high-quality 3D scenes from unposed images or videos. However, it often struggles to handle scenes with complex camera trajectories, featuring drastic rotation and translation across adjacent camera views, leading to degraded camera pose estimation and local minima in the joint optimization of camera poses and 3D-GS. We propose PCR-GS, an innovative COLMAP-free 3D-GS technique that achieves superior 3D scene modeling and camera pose estimation via camera pose co-regularization. PCR-GS achieves regularization from two perspectives. The first is feature reprojection regularization, which extracts view-robust DINO features from adjacent camera views and aligns their semantic information for camera pose regularization. The second is wavelet-based frequency regularization, which exploits discrepancies in high-frequency details to further optimize the rotation matrix in camera poses. Extensive experiments over multiple real-world scenes show that the proposed PCR-GS achieves superior pose-free 3D-GS scene modeling under dramatic changes of camera trajectories.

cs.AI

[390] The Free Will Equation: Quantum Field Analogies for AGI

Rahul Kabali

Main category: cs.AI

TL;DR: The paper introduces the Free Will Equation, a theoretical framework inspired by quantum field theory, to add adaptive stochasticity to AGI decision-making, enhancing creativity and adaptability.

DetailsMotivation: Human-like intelligence includes adaptive spontaneity, or 'free will,' which is crucial for creativity and robust problem-solving. Current AGI lacks this trait.

Method: The framework treats an AI’s cognitive state as a superposition of actions, collapsing probabilistically into decisions, akin to quantum wavefunction collapse. It includes quantum field analogies and intrinsic motivation.

Result: Experiments in a non-stationary multi-armed bandit environment show higher rewards and policy diversity compared to baselines.

Conclusion: The Free Will Equation offers a promising approach to imbue AGI with human-like adaptive spontaneity, improving exploration and adaptability.

Abstract: Artificial General Intelligence (AGI) research traditionally focuses on algorithms that optimize for specific goals under deterministic rules. Yet, human-like intelligence exhibits adaptive spontaneity - an ability to make unexpected choices or free decisions not strictly dictated by past data or immediate reward. This trait, often dubbed “free will” in a loose sense, might be crucial for creativity, robust adaptation, and avoiding ruts in problem-solving. This paper proposes a theoretical framework, called the Free Will Equation, that draws analogies from quantum field theory to endow AGI agents with a form of adaptive, controlled stochasticity in their decision-making process. The core idea is to treat an AI agent’s cognitive state as a superposition of potential actions or thoughts, which collapses probabilistically into a concrete action when a decision is made - much like a quantum wavefunction collapsing upon measurement. By incorporating mechanisms analogous to quantum fields, along with intrinsic motivation terms, we aim to improve an agent’s ability to explore novel strategies and adapt to unforeseen changes. Experiments in a non-stationary multi-armed bandit environment demonstrate that agents using this framework achieve higher rewards and policy diversity compared to baseline methods.
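
The collapse mechanism lends itself to a compact sketch: keep a preference weight per action and sample ("collapse") an action from a softmax whose temperature plays the role of the controlled stochasticity term. The toy non-stationary bandit below is an illustrative reading of the framework, not the paper's implementation; the drift model, learning rate, and temperature are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, horizon = 5, 2000
true_means = rng.normal(0.0, 1.0, n_arms)   # arm payoffs drift over time
q = np.zeros(n_arms)                        # preference ("amplitude") per arm
alpha, temperature = 0.1, 0.5               # step size; exploration "noise"

total = 0.0
for t in range(horizon):
    true_means += rng.normal(0.0, 0.03, n_arms)   # non-stationary environment
    # "Collapse": sample one concrete action from the softmax over
    # preferences, analogous to a wavefunction collapsing on measurement.
    probs = np.exp(q / temperature)
    probs /= probs.sum()
    arm = rng.choice(n_arms, p=probs)
    reward = rng.normal(true_means[arm], 1.0)
    q[arm] += alpha * (reward - q[arm])           # track the moving payoff
    total += reward

print(f"average reward: {total / horizon:.3f}")
```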

[391] DREAMS: Density Functional Theory Based Research Engine for Agentic Materials Simulation

Ziqi Wang, Hongshuo Huang, Hancheng Zhao, Changwen Xu, Shang Zhu, Jan Janssen, Venkatasubramanian Viswanathan

Main category: cs.AI

TL;DR: DREAMS is a multi-agent framework using LLMs for automated DFT simulations, reducing human reliance and achieving expert-level accuracy in materials discovery.

DetailsMotivation: To overcome challenges in high-fidelity DFT simulations, such as training time, parameter tuning, and error handling.

Method: Hierarchical framework with a central LLM planner and domain-specific agents for structure generation, DFT testing, HPC scheduling, and error handling, using a shared canvas for context.

Result: Achieves <1% error on benchmarks, solves complex problems like CO/Pt(111) adsorption, and confirms FCC-site preference via Bayesian sampling.

Conclusion: DREAMS achieves L3 automation, reducing human dependence and enabling scalable, high-throughput materials discovery.

Abstract: Materials discovery relies on high-throughput, high-fidelity simulation techniques such as Density Functional Theory (DFT), which require years of training, extensive parameter fine-tuning and systematic error handling. To address these challenges, we introduce the DFT-based Research Engine for Agentic Materials Screening (DREAMS), a hierarchical, multi-agent framework for DFT simulation that combines a central Large Language Model (LLM) planner agent with domain-specific LLM agents for atomistic structure generation, systematic DFT convergence testing, High-Performance Computing (HPC) scheduling, and error handling. In addition, a shared canvas helps the LLM agents to structure their discussions, preserve context, and prevent hallucination. We validate DREAMS's capabilities on the Sol27LC lattice-constant benchmark, achieving average errors below 1% compared to the results of human DFT experts. Furthermore, we apply DREAMS to the long-standing CO/Pt(111) adsorption puzzle, demonstrating its long-term and complex problem-solving capabilities. The framework again reproduces expert-level literature adsorption-energy differences. Finally, DREAMS is employed to quantify functional-driven uncertainties with Bayesian ensemble sampling, confirming the Face Centered Cubic (FCC)-site preference at the Generalized Gradient Approximation (GGA) DFT level. In conclusion, DREAMS approaches L3-level automation - autonomous exploration of a defined design space - and significantly reduces the reliance on human expertise and intervention, offering a scalable path toward democratized, high-throughput, high-fidelity computational materials discovery.
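
The planner-plus-specialists pattern with a shared canvas reduces to a small amount of orchestration code. In the sketch below, the agent roles, the canvas structure, the fixed dispatch order, and the `call_llm` stub are all assumptions for illustration, not DREAMS's actual interfaces.

```python
from typing import Callable

def call_llm(role: str, prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    return f"[{role}] response to: {prompt[:40]}..."

canvas: list[str] = []   # shared canvas: append-only log all agents read

def agent(role: str) -> Callable[[str], str]:
    def run(task: str) -> str:
        context = "\n".join(canvas[-10:])        # recent shared context
        out = call_llm(role, f"{context}\nTASK: {task}")
        canvas.append(f"{role}: {out}")          # publish for other agents
        return out
    return run

planner = agent("planner")
specialists = [agent("structure-generation"),
               agent("dft-convergence"),
               agent("hpc-scheduler")]

plan = planner("Compute the lattice constant of Pt")
for worker in specialists:                       # assumed fixed dispatch order
    worker(plan)
print("\n".join(canvas))
```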

[392] WebGuard: Building a Generalizable Guardrail for Web Agents

Boyuan Zheng, Zeyi Liao, Scott Salisbury, Zeyuan Liu, Michael Lin, Qinyuan Zheng, Zifan Wang, Xiang Deng, Dawn Song, Huan Sun, Yu Su

Main category: cs.AI

TL;DR: WebGuard is a dataset for assessing web agent action risks, revealing LLMs’ poor performance in predicting outcomes and highlighting the need for specialized guardrail models.

DetailsMotivation: The rapid development of autonomous web agents powered by LLMs exposes risks of unintended or harmful actions, necessitating effective safety measures.

Method: WebGuard, a human-annotated dataset of 4,939 actions from 193 websites, categorizes risks into SAFE, LOW, and HIGH. It evaluates LLMs and fine-tunes guardrail models.

Result: Frontier LLMs perform poorly (<60% accuracy/recall). Fine-tuned Qwen2.5VL-7B improves accuracy (37% to 80%) and HIGH-risk recall (20% to 76%), but reliability remains insufficient.

Conclusion: Current guardrails are inadequate for high-stakes deployment, requiring near-perfect accuracy and recall.

Abstract: The rapid development of autonomous web agents powered by Large Language Models (LLMs), while greatly elevating efficiency, exposes the frontier risk of taking unintended or harmful actions. This situation underscores an urgent need for effective safety measures, akin to access controls for human users. To address this critical challenge, we introduce WebGuard, the first comprehensive dataset designed to support the assessment of web agent action risks and facilitate the development of guardrails for real-world online environments. In doing so, WebGuard specifically focuses on predicting the outcome of state-changing actions and contains 4,939 human-annotated actions from 193 websites across 22 diverse domains, including often-overlooked long-tail websites. These actions are categorized using a novel three-tier risk schema: SAFE, LOW, and HIGH. The dataset includes designated training and test splits to support evaluation under diverse generalization settings. Our initial evaluations reveal a concerning deficiency: even frontier LLMs achieve less than 60% accuracy in predicting action outcomes and less than 60% recall in flagging HIGH-risk actions, highlighting the risks of deploying current-generation agents without dedicated safeguards. We therefore investigate fine-tuning specialized guardrail models using WebGuard. We conduct comprehensive evaluations across multiple generalization settings and find that a fine-tuned Qwen2.5VL-7B model yields a substantial improvement in performance, boosting accuracy from 37% to 80% and HIGH-risk action recall from 20% to 76%. Despite these improvements, the performance still falls short of the reliability required for high-stakes deployment, where guardrails must approach near-perfect accuracy and recall.
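
A three-tier guardrail gating agent actions might be wired up as below, following the paper's SAFE/LOW/HIGH schema. The keyword matcher is only a stand-in for a fine-tuned guardrail model such as the Qwen2.5VL-7B variant mentioned above; the marker lists and the gating policy are assumptions.

```python
from enum import Enum

class Risk(Enum):
    SAFE = 0   # no consequential state change
    LOW = 1    # reversible state change
    HIGH = 2   # irreversible or consequential state change

def classify_action(description: str) -> Risk:
    """Placeholder for a fine-tuned guardrail model (keyword heuristic here)."""
    text = description.lower()
    if any(m in text for m in ("submit payment", "purchase", "delete", "transfer")):
        return Risk.HIGH
    if any(m in text for m in ("add to cart", "save draft", "subscribe")):
        return Risk.LOW
    return Risk.SAFE

def gate(description: str) -> str:
    """Assumed policy: block HIGH pending human confirmation, log LOW, allow SAFE."""
    risk = classify_action(description)
    if risk is Risk.HIGH:
        return "BLOCK: require human confirmation"
    if risk is Risk.LOW:
        return "WARN: log and proceed"
    return "ALLOW"

print(gate("Click 'Submit payment' on the checkout page"))  # -> BLOCK: ...
```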

[393] Manimator: Transforming Research Papers into Visual Explanations

Samarth P, Vyoman Jain, Shiva Golugula, Motamarri Sai Sathvik

Main category: cs.AI

TL;DR: Manimator is an open-source system using LLMs to convert research papers or prompts into animations via Manim, simplifying complex STEM education.

DetailsMotivation: To address the challenge of understanding dense research papers by automating the creation of dynamic visualizations.

Method: Uses a pipeline with LLMs: one interprets text/PDFs into structured scene descriptions, another translates these into executable Manim code.

Result: Enables rapid creation of engaging visual explanations for STEM topics.

Conclusion: Manimator democratizes high-quality educational content creation, enhancing STEM learning.

Abstract: Understanding complex scientific and mathematical concepts, particularly those presented in dense research papers, poses a significant challenge for learners. Dynamic visualizations can greatly enhance comprehension, but creating them manually is time-consuming and requires specialized knowledge and skills. We introduce manimator, an open-source system that leverages Large Language Models to transform research papers and natural language prompts into explanatory animations using the Manim engine. Manimator employs a pipeline where an LLM interprets the input text or research paper PDF to generate a structured scene description outlining key concepts, mathematical formulas, and visual elements, and another LLM translates this description into executable Manim Python code. We discuss its potential as an educational tool for rapidly creating engaging visual explanations for complex STEM topics, democratizing the creation of high-quality educational content.
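
The two-stage pipeline comes down to two model calls linked by a structured intermediate. A minimal sketch, assuming a hypothetical `call_llm` client (stubbed here with canned output) and an invented scene-description schema:

```python
import json

def call_llm(system: str, user: str) -> str:
    """Hypothetical LLM client, stubbed with canned output for this sketch."""
    return '{"scenes": []}' if "JSON" in system else "from manim import *  # ..."

def paper_to_animation(paper_text: str) -> str:
    # Stage 1: interpret the paper into a structured scene description.
    scene_json = call_llm(
        system="Extract key concepts, formulas, and visuals as JSON scenes.",
        user=paper_text,
    )
    scenes = json.loads(scene_json)
    # Stage 2: translate the description into executable Manim code,
    # to be written to a .py file and rendered with the manim CLI.
    return call_llm(
        system="Write a Manim Scene class implementing these scenes.",
        user=json.dumps(scenes),
    )

print(paper_to_animation("...paper text or extracted PDF contents..."))
```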

[394] Language Models as Ontology Encoders

Hui Yang, Jiaoyan Chen, Yuan He, Yongsheng Gao, Ian Horrocks

Main category: cs.AI

TL;DR: OnT is a new ontology embedding method combining pretrained language models with hyperbolic geometric modeling to enhance textual and logical structure preservation in ontologies, outperforming existing methods.

DetailsMotivation: Existing ontology embedding methods either ignore textual information or fail to preserve logical structures, limiting their effectiveness.

Method: OnT tunes a pretrained language model using geometric modeling in hyperbolic space to incorporate textual labels and preserve logical relationships in Description Logic EL.

Result: OnT outperforms baselines in prediction and inference tasks on four real-world ontologies and shows strong transfer learning abilities.

Conclusion: OnT effectively combines textual and geometric modeling, offering superior performance and practical utility in ontology construction and reasoning.

Abstract: OWL (Web Ontology Language) ontologies which are able to formally represent complex knowledge and support semantic reasoning have been widely adopted across various domains such as healthcare and bioinformatics. Recently, ontology embeddings have gained wide attention due to their potential to infer plausible new knowledge and approximate complex reasoning. However, existing methods face notable limitations: geometric model-based embeddings typically overlook valuable textual information, resulting in suboptimal performance, while the approaches that incorporate text, which are often based on language models, fail to preserve the logical structure. In this work, we propose a new ontology embedding method OnT, which tunes a Pretrained Language Model (PLM) via geometric modeling in a hyperbolic space for effectively incorporating textual labels and simultaneously preserving class hierarchies and other logical relationships of Description Logic EL. Extensive experiments on four real-world ontologies show that OnT consistently outperforms baselines, including the state of the art, on both axiom prediction and axiom inference tasks. OnT also demonstrates strong potential in real-world applications, indicated by its robust transfer learning abilities and effectiveness in real cases of constructing a new ontology from SNOMED CT. Data and code are available at https://github.com/HuiYang1997/OnT.
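
For intuition on why hyperbolic space suits class hierarchies: distance grows rapidly toward the ball's boundary, leaving exponential room to embed trees with low distortion. The Poincaré-ball distance below is the standard formula in such embedding models; the summary does not state which model of hyperbolic space OnT uses, so treat this as representative rather than as OnT's definition.

```python
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance in the Poincare ball (inputs must have norm < 1)."""
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq / denom))

# Near the boundary, small Euclidean gaps become large hyperbolic distances,
# which is what lets trees/class hierarchies embed with low distortion.
parent = np.array([0.10, 0.00])     # abstract class, near the origin
child = np.array([0.85, 0.10])      # specific class, near the boundary
print(poincare_distance(parent, child))
```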

[395] ProofCompass: Enhancing Specialized Provers with LLM Guidance

Nicolas Wischermann, Claudio Mayrink Verdun, Gabriel Poesia, Francesco Noseda

Main category: cs.AI

TL;DR: ProofCompass combines LLMs with specialized provers for efficient mathematical reasoning, improving accuracy and reducing computational effort.

DetailsMotivation: Existing methods rely on either large general-purpose models or small specialized ones, each with limitations. Training large specialized models is resource-intensive.

Method: ProofCompass uses an LLM to guide specialized provers (like DSP-v1.5) by providing proof strategies and analyzing failures, avoiding additional training.

Result: On miniF2F, ProofCompass outperforms DSP-v1.5 (55.3% vs. 54.9%) with 25x fewer attempts (128 vs. 3200).

Conclusion: The hybrid approach enhances computational efficiency and accuracy in theorem proving without extra training.

Abstract: Language models have become increasingly powerful tools for formal mathematical reasoning. However, most existing approaches rely exclusively on either large general-purpose models or smaller specialized models, each with distinct limitations, while training specialized large models still requires significant computational resources. This paper introduces ProofCompass, a novel hybrid methodology that achieves remarkable computational efficiency by strategically guiding existing specialized prover methods, such as DeepSeek-Prover-v1.5-RL (DSP-v1.5) with a Large Language Model (LLM) without requiring additional model training. The LLM provides natural language proof strategies and analyzes failed attempts to select intermediate lemmas, enabling effective problem decomposition. On the miniF2F benchmark, ProofCompass demonstrates substantial resource efficiency: it outperforms DSP-v1.5 (54.9% → 55.3%) while using 25x fewer attempts (3200 → 128). Our synergistic approach paves the way for simultaneously improving computational efficiency and accuracy in formal theorem proving.
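
One plausible reading of the guidance loop: the specialized prover attempts a goal; on failure, the LLM inspects the failed attempts, proposes intermediate lemmas, and the prover retries on the decomposed goals. Every name and policy below is an assumption sketched for illustration, with stubbed model and prover calls.

```python
def llm_suggest_lemmas(problem: str, failures: list[str]) -> list[str]:
    """Stub for the LLM guide: propose lemmas after inspecting failures."""
    return [f"lemma_{len(failures)}: intermediate step toward {problem!r}"]

def prover_attempt(goal: str) -> bool:
    """Stub for the specialized prover (e.g. a DSP-v1.5-style model)."""
    return len(goal) % 2 == 0   # placeholder outcome

def prove(problem: str, budget: int = 8) -> bool:
    failures: list[str] = []
    goals = [problem]
    for _ in range(budget):
        goal = goals.pop(0)
        if prover_attempt(goal):
            if not goals:
                return True                 # every outstanding goal closed
            continue
        failures.append(goal)
        # Decompose: prepend LLM-selected lemmas, then retry the goal.
        goals = llm_suggest_lemmas(problem, failures) + [goal] + goals
    return False

print(prove("theorem: a + b = b + a"))
```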

[396] Adaptive Multi-Agent Reasoning via Automated Workflow Generation

Humza Sami, Mubashir ul Islam, Pierre-Emmanuel Gaillardon, Valerio Tenace

Main category: cs.AI

TL;DR: Nexus Architect, an enhanced multi-agent system, improves reasoning model generalization by autonomously generating tailored workflows and refining prompts, outperforming state-of-the-art models.

DetailsMotivation: Current Large Reasoning Models (LRMs) often fail to generalize to novel problems due to overfitting, relying on memorized solutions rather than genuine reasoning.

Method: Nexus Architect uses automated workflow synthesis and iterative prompt refinement to create tailored reasoning workflows for specific problem classes.

Result: Empirical evaluation shows Nexus Architect outperforms top LRMs, achieving up to a 66% higher pass rate.

Conclusion: Nexus Architect addresses LRM limitations by enhancing generalization and performance through adaptive workflow synthesis and prompt refinement.

Abstract: The rise of Large Reasoning Models (LRMs) promises a significant leap forward in language model capabilities, aiming to tackle increasingly sophisticated tasks with unprecedented efficiency and accuracy. However, despite their impressive performance, recent studies have highlighted how current reasoning models frequently fail to generalize to novel, unseen problems, often resorting to memorized solutions rather than genuine inferential reasoning. Such behavior underscores a critical limitation in modern LRMs, i.e., their tendency toward overfitting, which in turn results in poor generalization in problem-solving capabilities. In this paper, we introduce Nexus Architect, an enhanced iteration of our multi-agent system framework, Nexus, equipped with a novel automated workflow synthesis mechanism. Given a user’s prompt and a small set of representative examples, the Architect autonomously generates a tailored reasoning workflow by selecting suitable strategies, tool integrations, and adversarial techniques for a specific problem class. Furthermore, the Architect includes an iterative prompt refinement mechanism that fine-tunes agents’ system prompts to maximize performance and improve the generalization capabilities of the system. We empirically evaluate Nexus Architect by employing an off-the-shelf, non-reasoning model on a custom dataset of challenging logical questions and compare its performance against state-of-the-art LRMs. Results show that Nexus Architect consistently outperforms existing solutions, achieving up to a 66% increase in pass rate over Gemini 2.5 Flash Preview, nearly 2.5× against Claude Sonnet 4 and DeepSeek-R1, and over 3× w.r.t. Llama 4 Scout.

[397] Fail Fast, or Ask: Mitigating the Deficiencies of Reasoning LLMs with Human-in-the-Loop Systems Engineering

Michael J. Zellinger, Matt Thomson

Main category: cs.AI

TL;DR: The paper proposes a human-in-the-loop system to reduce error rates and latency in reasoning LLMs by deferring uncertain queries to humans or a non-reasoning model.

DetailsMotivation: To address the high error rates and latency of reasoning LLMs in risk-sensitive domains, ensuring near 0% errors and efficient deployment.

Method: Collaboration between reasoning models and human experts, using reasoning trace length for uncertainty quantification, and introducing a non-reasoning model for faster deferral.

Result: Error rates reduced from 3% to <1% with 7.5% deferral; 40% latency reduction and 50% cost savings achieved, though latency drag limits gains.

Conclusion: Black-box systems engineering can mitigate reasoning LLM deficiencies without accessing model internals.

Abstract: State-of-the-art reasoning LLMs are powerful problem solvers, but they still occasionally make mistakes. However, adopting AI models in risk-sensitive domains often requires error rates near 0%. To address this gap, we propose collaboration between a reasoning model and a human expert who resolves queries the model cannot confidently answer. We find that quantifying the uncertainty of a reasoning model through the length of its reasoning trace yields an effective basis for deferral to a human, e.g., cutting the error rate of Qwen3 235B-A22B on difficult MATH problems from 3% to less than 1% when deferring 7.5% of queries. However, the high latency of reasoning models still makes them challenging to deploy on use cases with high query volume. To address this challenge, we explore fronting a reasoning model with a large non-reasoning model. We call this modified human-in-the-loop system “Fail Fast, or Ask”, since the non-reasoning model may defer difficult queries to the human expert directly (“failing fast”), without incurring the reasoning model’s higher latency. We show that this approach yields around 40% latency reduction and about 50% cost savings for DeepSeek R1 while maintaining 90+% area under the accuracy-rejection curve. However, we observe that latency savings are lower than expected because of “latency drag”, the phenomenon that processing easier queries with a non-reasoning model pushes the reasoning model’s latency distribution towards longer latencies. Broadly, our results suggest that the deficiencies of state-of-the-art reasoning models – nontrivial error rates and high latency – can be substantially mitigated through black-box systems engineering, without requiring access to LLM internals.
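
The two deferral rules compose into a short routing function: a cheap triage step that may defer to the human immediately ("failing fast"), then a trace-length check on the reasoning model's output. The thresholds, the triage confidence signal, and the stubbed model call below are assumptions, not the paper's calibrated values.

```python
def fail_fast_or_ask(query: str,
                     triage_confidence: float,
                     run_reasoning_model,
                     conf_floor: float = 0.3,
                     trace_budget: int = 4096) -> str:
    # Stage 1: a cheap non-reasoning model triages the query; if it is
    # clearly too hard, defer to the human directly ("failing fast")
    # without paying the reasoning model's latency.
    if triage_confidence < conf_floor:
        return "DEFER (fast): route to human expert"
    # Stage 2: run the reasoning model; a long reasoning trace is the
    # uncertainty signal that triggers the second deferral decision.
    answer, trace_tokens = run_reasoning_model(query)
    if trace_tokens > trace_budget:
        return "DEFER (slow): route to human expert"
    return f"ANSWER: {answer}"

# Toy usage with a stubbed reasoning model returning (answer, trace length):
stub = lambda q: ("42", 512)
print(fail_fast_or_ask("easy arithmetic", triage_confidence=0.9,
                       run_reasoning_model=stub))
```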

[398] Can We Move Freely in NEOM’s The Line? An Agent-Based Simulation of Human Mobility in a Futuristic Smart City

Abderaouf Bahi, Amel Ourici

Main category: cs.AI

TL;DR: The paper explores human mobility in The Line, a linear smart city, using a hybrid AI simulation. Results show efficient commute times and high satisfaction, dependent on AI integration.

DetailsMotivation: To assess feasibility of free movement in The Line's unique urban design, ensuring operational realism with AI support.

Method: Hybrid simulation combining agent-based modeling, reinforcement learning, supervised learning, and graph neural networks, tested with synthetic and real-world data.

Result: AI-integrated system achieved 7.8-8.4 min commutes, 89% satisfaction, and 91% reachability; performance dropped significantly without AI modules.

Conclusion: Freedom of movement in The Line is achievable with adaptive AI, sustainable infrastructure, and real-time feedback.

Abstract: This paper investigates the feasibility of human mobility in The Line, a proposed 170-kilometer linear smart city in NEOM, Saudi Arabia. To assess whether citizens can move freely within this unprecedented urban topology, we develop a hybrid simulation framework that integrates agent-based modeling, reinforcement learning, supervised learning, and graph neural networks. The simulation captures multi-modal transportation behaviors across 50 vertical levels and varying density scenarios using both synthetic data and real-world traces from high-density cities. Our experiments reveal that with the full AI-integrated architecture, agents achieved an average commute time of 7.8 to 8.4 minutes, a satisfaction rate exceeding 89 percent, and a reachability index of over 91 percent, even during peak congestion periods. Ablation studies confirmed that the removal of intelligent modules such as reinforcement learning or graph neural networks significantly degrades performance, with commute times increasing by up to 85 percent and reachability falling below 70 percent. Environmental modeling further demonstrated low energy consumption and minimal CO2 emissions when electric modes are prioritized. The findings suggest that freedom of movement is not only conceptually achievable in The Line, but also operationally realistic if supported by adaptive AI systems, sustainable infrastructure, and real-time feedback loops.

[399] Inverse Scaling in Test-Time Compute

Aryo Pradipta Gema, Alexander Hägele, Runjin Chen, Andy Arditi, Jacob Goldman-Wetzler, Kit Fraser-Taliente, Henry Sleight, Linda Petrini, Julian Michael, Beatrice Alex, Pasquale Minervini, Yanda Chen, Joe Benton, Ethan Perez

Main category: cs.AI

TL;DR: Extending reasoning length in Large Reasoning Models (LRMs) can degrade performance, revealing inverse scaling between compute and accuracy. Five failure modes are identified across tasks, highlighting risks of prolonged reasoning.

DetailsMotivation: To investigate how extended reasoning affects LRM performance and identify potential failure modes, ensuring safer and more reliable model scaling.

Method: Constructed evaluation tasks in four categories (counting, regression, deduction, AI risks) and analyzed performance across models (Claude, OpenAI o-series) with varying reasoning lengths.

Result: Five failure modes emerged: distraction, overfitting, spurious correlations, focus loss, and amplified concerning behaviors. Inverse scaling observed between compute and accuracy.

Conclusion: While test-time compute scaling is promising, it risks reinforcing problematic reasoning. Diverse reasoning length evaluations are crucial for identifying and mitigating these issues in LRMs.

Abstract: We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. Our evaluation tasks span four categories: simple counting tasks with distractors, regression tasks with spurious features, deduction tasks with constraint tracking, and advanced AI risks. We identify five distinct failure modes when models reason for longer: (1) Claude models become increasingly distracted by irrelevant information; (2) OpenAI o-series models resist distractors but overfit to problem framings; (3) models shift from reasonable priors to spurious correlations; (4) all models show difficulties in maintaining focus on complex deductive tasks; and (5) extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs.

[400] Routine: A Structural Planning Framework for LLM Agent System in Enterprise

Guancheng Zeng, Xueyi Chen, Jiawang Hu, Shaohua Qi, Yaxuan Mao, Zhantao Wang, Yifan Nie, Shuang Li, Qiuyang Feng, Pengxu Qiu, Yujia Wang, Wenqiang Han, Linyan Huang, Gang Li, Jingjing Mo, Haowen Hu

Main category: cs.AI

TL;DR: Routine is a multi-step agent planning framework that improves execution stability and accuracy in enterprise environments, significantly boosting model performance.

DetailsMotivation: Challenges like disorganized plans and poor execution stability hinder agent system deployment in enterprises. Routine aims to address these issues.

Method: Introduces Routine, a framework with structured planning, explicit instructions, and seamless parameter passing. Includes dataset creation and fine-tuning.

Result: Routine increased GPT-4o’s accuracy from 41.1% to 96.3% and Qwen3-14B’s from 32.6% to 83.3%. Fine-tuning further improved Qwen3-14B to 88.2% and 95.5%.

Conclusion: Routine effectively enhances agent workflow stability and adaptability, accelerating enterprise adoption of AI for Process.

Abstract: The deployment of agent systems in an enterprise environment is often hindered by several challenges: common models lack domain-specific process knowledge, leading to disorganized plans, missing key tools, and poor execution stability. To address this, this paper introduces Routine, a multi-step agent planning framework designed with a clear structure, explicit instructions, and seamless parameter passing to guide the agent’s execution module in performing multi-step tool-calling tasks with high stability. In evaluations conducted within a real-world enterprise scenario, Routine significantly increases the execution accuracy in model tool calls, increasing the performance of GPT-4o from 41.1% to 96.3%, and Qwen3-14B from 32.6% to 83.3%. We further constructed a Routine-following training dataset and fine-tuned Qwen3-14B, resulting in an accuracy increase to 88.2% on scenario-specific evaluations, indicating improved adherence to execution plans. In addition, we employed Routine-based distillation to create a scenario-specific, multi-step tool-calling dataset. Fine-tuning on this distilled dataset raised the model’s accuracy to 95.5%, approaching GPT-4o’s performance. These results highlight Routine’s effectiveness in distilling domain-specific tool-usage patterns and enhancing model adaptability to new scenarios. Our experimental results demonstrate that Routine provides a practical and accessible approach to building stable agent workflows, accelerating the deployment and adoption of agent systems in enterprise environments, and advancing the technical vision of AI for Process.
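
A structured plan with explicit tool names and parameter wiring might look like the sketch below, where "$stepN.field" references realize the "seamless parameter passing". The schema, the reference syntax, and the toy tools are assumptions, not Routine's actual plan format.

```python
# A routine: ordered steps with explicit tools and parameter wiring.
# "$stepN.field" references the output of an earlier step.
ROUTINE = [
    {"step": 1, "tool": "lookup_customer", "args": {"email": "$input.email"}},
    {"step": 2, "tool": "fetch_orders",    "args": {"customer_id": "$step1.id"}},
    {"step": 3, "tool": "draft_reply",     "args": {"orders": "$step2.orders"}},
]

TOOLS = {  # toy stand-ins for enterprise tool calls
    "lookup_customer": lambda email: {"id": "C-17"},
    "fetch_orders":    lambda customer_id: {"orders": ["O-1", "O-2"]},
    "draft_reply":     lambda orders: {"text": f"You have {len(orders)} orders."},
}

def resolve(ref, inputs, results):
    """Turn a "$source.field" reference into the concrete value."""
    if not (isinstance(ref, str) and ref.startswith("$")):
        return ref
    src, field = ref[1:].split(".")
    scope = inputs if src == "input" else results[int(src[4:])]
    return scope[field]

def run(routine, inputs):
    results = {}
    for spec in routine:  # execute steps in order, passing parameters along
        args = {k: resolve(v, inputs, results) for k, v in spec["args"].items()}
        results[spec["step"]] = TOOLS[spec["tool"]](**args)
    return results

print(run(ROUTINE, {"email": "a@b.com"})[3]["text"])
```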

[401] IM-Chat: A Multi-agent LLM-based Framework for Knowledge Transfer in Injection Molding Industry

Junhyeong Lee, Joon-Young Kim, Heekyu Kim, Inhyo Lee, Seunghwa Ryu

Main category: cs.AI

TL;DR: IM-Chat, a multi-agent LLM framework, addresses knowledge transfer challenges in injection molding by integrating documented and field data, achieving high accuracy in complex tasks.

DetailsMotivation: The injection molding industry struggles with knowledge retention due to retiring experts and multilingual barriers, necessitating an AI-driven solution.

Method: IM-Chat uses retrieval-augmented generation (RAG) and tool-calling agents to combine documented knowledge and field data for context-aware task resolution.

Result: Evaluation showed more capable models (e.g., GPT-4o) perform better, especially in complex tasks, validating IM-Chat’s effectiveness.

Conclusion: IM-Chat proves scalable and generalizable for AI-assisted decision support in manufacturing, leveraging multi-agent LLM systems.

Abstract: The injection molding industry faces critical challenges in preserving and transferring field knowledge, particularly as experienced workers retire and multilingual barriers hinder effective communication. This study introduces IM-Chat, a multi-agent framework based on large language models (LLMs), designed to facilitate knowledge transfer in injection molding. IM-Chat integrates both limited documented knowledge (e.g., troubleshooting tables, manuals) and extensive field data modeled through a data-driven process condition generator that infers optimal manufacturing settings from environmental inputs such as temperature and humidity, enabling robust and context-aware task resolution. By adopting a retrieval-augmented generation (RAG) strategy and tool-calling agents within a modular architecture, IM-Chat ensures adaptability without the need for fine-tuning. Performance was assessed across 100 single-tool and 60 hybrid tasks for GPT-4o, GPT-4o-mini, and GPT-3.5-turbo by domain experts using a 10-point rubric focused on relevance and correctness, and was further supplemented by automated evaluation using GPT-4o guided by a domain-adapted instruction prompt. The evaluation results indicate that more capable models tend to achieve higher accuracy, particularly in complex, tool-integrated scenarios. Overall, these findings demonstrate the viability of multi-agent LLM systems for industrial knowledge workflows and establish IM-Chat as a scalable and generalizable approach to AI-assisted decision support in manufacturing.

[402] BioGraphFusion: Graph Knowledge Embedding for Biological Completion and Reasoning

Yitong Lin, Jiaying He, Jiahe Chen, Xinnan Zhu, Jianwei Zheng, Tao Bo

Main category: cs.AI

TL;DR: BioGraphFusion integrates semantic and structural learning in biomedical KGs, outperforming existing methods in tasks like drug discovery.

DetailsMotivation: Addressing the gap in dynamic integration of semantic and structural learning in biomedical KGs for better drug discovery and disease understanding.

Method: Uses tensor decomposition for global semantics, LSTM for dynamic relation refinement, query-guided subgraphs, and hybrid scoring.

Result: Outperforms state-of-the-art KE, GNN, and ensemble models in biomedical tasks, with a case study on CMM1 showing biological relevance.

Conclusion: BioGraphFusion effectively bridges semantic and structural learning, demonstrating superior performance and biological applicability.

Abstract: Motivation: Biomedical knowledge graphs (KGs) are crucial for drug discovery and disease understanding, yet their completion and reasoning are challenging. Knowledge Embedding (KE) methods capture global semantics but struggle with dynamic structural integration, while Graph Neural Networks (GNNs) excel locally but often lack semantic understanding. Even ensemble approaches, including those leveraging language models, often fail to achieve a deep, adaptive, and synergistic co-evolution between semantic comprehension and structural learning. Addressing this critical gap in fostering continuous, reciprocal refinement between these two aspects in complex biomedical KGs is paramount. Results: We introduce BioGraphFusion, a novel framework for deeply synergistic semantic and structural learning. BioGraphFusion establishes a global semantic foundation via tensor decomposition, guiding an LSTM-driven mechanism to dynamically refine relation embeddings during graph propagation. This fosters adaptive interplay between semantic understanding and structural learning, further enhanced by query-guided subgraph construction and a hybrid scoring mechanism. Experiments across three key biomedical tasks demonstrate BioGraphFusion’s superior performance over state-of-the-art KE, GNN, and ensemble models. A case study on Cutaneous Malignant Melanoma 1 (CMM1) highlights its ability to unveil biologically meaningful pathways. Availability and Implementation: Source code and all training data are freely available for download at https://github.com/Y-TARL/BioGraphFusion. Contact: zjw@zjut.edu.cn, botao666666@126.com. Supplementary information: Supplementary data are available at Bioinformatics online.
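
A "global semantic foundation via tensor decomposition" typically means scoring knowledge-graph triples with a trilinear product of embeddings. The generic CP/DistMult-style scorer below illustrates that piece only; BioGraphFusion's actual decomposition and its LSTM-driven relation refinement are not reproduced here, and the entities and dimensions are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                            # embedding dimension
entity = {e: rng.normal(size=d) for e in ("geneA", "diseaseB", "drugC")}
relation = {r: rng.normal(size=d) for r in ("associated_with", "treats")}

def trilinear_score(h: str, r: str, t: str) -> float:
    """CP/DistMult-style score for triple (h, r, t): sum_k h_k * r_k * t_k."""
    return float(np.sum(entity[h] * relation[r] * entity[t]))

# Higher scores mark more plausible triples once embeddings are trained.
print(trilinear_score("drugC", "treats", "diseaseB"))
```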

[403] One Step is Enough: Multi-Agent Reinforcement Learning based on One-Step Policy Optimization for Order Dispatch on Ride-Sharing Platforms

Zijian Zhao, Sen Li

Main category: cs.AI

TL;DR: The paper addresses challenges in ride-sharing platforms by proposing two MARL-based methods, GRPO and OSPO, which bypass value function estimation and improve performance in dynamic environments.

DetailsMotivation: The motivation is to overcome the limitations of conventional MARL approaches in ride-sharing, which struggle with accurate Q-value estimation and instability in large-scale, uncertain environments.

Method: Two methods are proposed: GRPO, which replaces PPO’s baseline with group average reward, and OSPO, a PPO variant using one-step rewards for homogeneous fleets.

Result: Experiments on a Manhattan dataset show GRPO and OSPO outperform existing methods, optimizing pickup times and order fulfillment with simple MLP networks.

Conclusion: The proposed methods effectively address the challenges of ride-sharing platforms by eliminating value function estimation issues and improving scalability and performance.

Abstract: On-demand ride-sharing platforms face the fundamental challenge of dynamically bundling passengers with diverse origins and destinations and matching them with vehicles in real time, all under significant uncertainty. Recently, MARL has emerged as a promising solution for this problem, leveraging decentralized learning to address the curse of dimensionality caused by the large number of agents in the ride-hailing market and the resulting expansive state and action spaces. However, conventional MARL-based ride-sharing approaches heavily rely on the accurate estimation of Q-values or V-values, which becomes problematic in large-scale, highly uncertain environments. Specifically, most of these approaches adopt an independent paradigm, exacerbating this issue, as each agent treats others as part of the environment, leading to unstable training and substantial estimation bias in value functions. To address these challenges, we propose two novel alternative methods that bypass value function estimation. First, we adapt GRPO to ride-sharing, replacing the PPO baseline with the group average reward to eliminate critic estimation errors and reduce training bias. Second, inspired by GRPO’s full utilization of group reward information, we customize the PPO framework for ride-sharing platforms and show that, under a homogeneous fleet, the optimal policy can be trained using only one-step rewards - a method we term One-Step Policy Optimization (OSPO). Experiments on a real-world Manhattan ride-hailing dataset demonstrate that both GRPO and OSPO achieve superior performance across most scenarios, efficiently optimizing pickup times and the number of served orders using simple MLP networks.
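
GRPO's departure from PPO is small enough to state in code: the learned critic baseline is replaced by the group's mean reward (commonly with standard-deviation normalization), so no value function has to be estimated. A minimal sketch; treating one dispatch round's decisions as the group is an assumption about how this maps onto ride-sharing, and OSPO would feed one-step rewards into the same computation.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages: the baseline is the group mean reward,
    so no learned critic (and none of its estimation bias) is needed."""
    centered = rewards - rewards.mean()
    return centered / (rewards.std() + eps)   # std-normalization, as is common

# Assumed grouping: one reward per dispatch decision in the same round.
r = np.array([4.0, 2.5, 3.0, 5.5, 1.0])
print(group_relative_advantages(r))
```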

[404] Amico: An Event-Driven Modular Framework for Persistent and Embedded Autonomy

Hongyi Yang, Yue Pan, Jiayi Xu, Kelsen Liu

Main category: cs.AI

TL;DR: Amico is a modular, event-driven framework for building autonomous agents optimized for embedded systems, addressing limitations of cloud-based solutions.

DetailsMotivation: Existing frameworks struggle in real-world or resource-constrained environments due to cloud reliance, lack of robustness, and poor autonomy.

Method: Amico is written in Rust for safety and performance, supports WebAssembly for cross-platform efficiency, and provides abstractions for event handling, state management, and reasoning integration.

Result: The framework enables resilient, interactive agents for limited compute and intermittent connectivity settings.

Conclusion: Amico offers a unified solution for deploying autonomous agents in resource-constrained environments.

Abstract: Recent advances in large language models (LLMs) and autonomous agents have enabled systems capable of performing complex tasks across domains such as human-computer interaction, planning, and web navigation. However, many existing frameworks struggle in real-world or resource-constrained environments due to their reliance on cloud-based computation, limited robustness in dynamic contexts, and lack of persistent autonomy and environmental awareness. We present Amico, a modular, event-driven framework for building autonomous agents optimized for embedded systems. Written in Rust for safety and performance, Amico supports reactive, persistent agents that operate efficiently across embedded platforms and browser environments via WebAssembly. It provides clean abstractions for event handling, state management, behavior execution, and integration with reasoning modules. Amico delivers a unified infrastructure for constructing resilient, interactive agents suitable for deployment in settings with limited compute and intermittent connectivity.

[405] HAMLET: Hyperadaptive Agent-based Modeling for Live Embodied Theatrics

Sizhou Chen, Shufan Jiang, Chi Zhang, Xiao-Lei Zhang, Xuelong Li

Main category: cs.AI

TL;DR: HAMLET is a multi-agent framework for immersive drama creation, enabling autonomous AI actors to interact dynamically with each other and the environment, enhancing interactivity and narrative quality.

DetailsMotivation: Existing LLM-based drama methods lack initiative and require detailed user input, reducing interactivity and immersion.

Method: Proposes HAMLET, a framework generating narrative blueprints and enabling autonomous actor decisions, including environmental interactions.

Result: HAMLET produces expressive, coherent theatrical experiences, validated by evaluations on character performance, narrative quality, and interaction.

Conclusion: HAMLET advances interactive narrative by improving autonomy and immersion in AI-driven drama performances.

Abstract: Creating an immersive and interactive theatrical experience is a long-term goal in the field of interactive narrative. The emergence of large language model (LLM) is providing a new path to achieve this goal. However, existing LLM-based drama generation methods often result in AI agents that lack initiative and cannot interact with the physical environment. Furthermore, these methods typically require detailed user input to drive the drama. These limitations reduce the interactivity and immersion of online real-time performance. To address the above challenges, we propose HAMLET, a multi-agent framework focused on drama creation and online performance. Given a simple topic, the framework generates a narrative blueprint, guiding the subsequent improvisational performance. During the online performance, each actor is given an autonomous mind. This means that actors can make independent decisions based on their own background, goals, and emotional state. In addition to conversations with other actors, their decisions can also change the state of scene props through actions such as opening a letter or picking up a weapon. The change is then broadcast to other related actors, updating what they know and care about, which in turn influences their next action. To evaluate the quality of drama performance, we designed an evaluation method to assess three primary aspects, including character performance, narrative quality, and interaction experience. The experimental evaluation shows that HAMLET can create expressive and coherent theatrical experiences. Our code, dataset and models are available at https://github.com/HAMLET-2025/HAMLET.

[406] What if Othello-Playing Language Models Could See?

Xinyi Chen, Yifei Yuan, Jiaang Li, Serge Belongie, Maarten de Rijke, Anders Søgaard

Main category: cs.AI

TL;DR: Multi-modal training (VISOTHELLO) outperforms mono-modal baselines in next-move prediction and robustness, suggesting visual grounding aids structured world understanding.

DetailsMotivation: To address the debate on whether language models can understand the world through text alone or require grounded learning, using Othello as a simplified, rule-based world.

Method: Introduce VISOTHELLO, a multi-modal model trained on move histories and board images, compared to mono-modal baselines via next-move prediction and robustness tests.

Result: Multi-modal training improves performance and robustness of internal representations.

Conclusion: Grounding language in visual input helps models infer structured world representations.

Abstract: Language models are often said to face a symbol grounding problem. While some argue that world understanding can emerge from text alone, others suggest grounded learning is more efficient. We explore this through Othello, where the board state defines a simplified, rule-based world. Building on prior work, we introduce VISOTHELLO, a multi-modal model trained on move histories and board images. Using next-move prediction, we compare it to mono-modal baselines and test robustness to semantically irrelevant perturbations. We find that multi-modal training improves both performance and the robustness of internal representations. These results suggest that grounding language in visual input helps models infer structured world representations.

[407] Large Language Models Assisting Ontology Evaluation

Anna Sofia Lippolis, Mohammad Javad Saeedizade, Robin Keskisärkkä, Aldo Gangemi, Eva Blomqvist, Andrea Giovanni Nuzzolese

Main category: cs.AI

TL;DR: OE-Assist is a framework for automated and semi-automated ontology evaluation using LLMs, showing performance comparable to average users.

DetailsMotivation: Manual ontology evaluation is costly and error-prone, prompting the need for automated solutions.

Method: Leverages a dataset of 1,393 CQs with ontologies, using LLMs for automated verification and Protégé integration.

Result: LLM-based evaluation matches average user performance.

Conclusion: OE-Assist demonstrates the potential of LLMs in streamlining ontology evaluation.

Abstract: Ontology evaluation through functional requirements, such as testing via competency question (CQ) verification, is a well-established yet costly, labour-intensive, and error-prone endeavour, even for ontology engineering experts. In this work, we introduce OE-Assist, a novel framework designed to assist ontology evaluation through automated and semi-automated CQ verification. By presenting and leveraging a dataset of 1,393 CQs paired with corresponding ontologies and ontology stories, our contributions constitute, to our knowledge, the first systematic investigation into large language model (LLM)-assisted ontology evaluation, and include: (i) evaluating the effectiveness of an LLM-based approach for automatically performing CQ verification against a manually created gold standard, and (ii) developing and assessing an LLM-powered framework that assists CQ verification in Protégé by providing suggestions. We found that automated LLM-based evaluation with o1-preview and o3-mini performs at a similar level to the average user’s performance.

[408] Coordinate Heart System: A Geometric Framework for Emotion Representation

Omar Al-Desi

Main category: cs.AI

TL;DR: The paper introduces the Coordinate Heart System (CHS), an eight-emotion geometric framework for AI emotion representation, addressing gaps in earlier models and enabling mathematical computation of complex emotional states.

DetailsMotivation: To overcome limitations of traditional categorical emotion models by providing a mathematically rigorous, geometrically complete system for emotion representation and computation in AI.

Method: Develops an eight-emotion coordinate system on a unit circle, with algorithms for emotion mixing, conflict resolution, and stability modeling. Uses LLMs for textual cue interpretation and temporal tracking.

Result: The CHS framework achieves complete geometric coverage, handles emotionally conflicted states, and outperforms traditional models in representing complex psychological scenarios.

Conclusion: The work establishes a new mathematical foundation for AI emotion modeling, with validated capabilities in nuanced emotion representation and stability assessment.

Abstract: This paper presents the Coordinate Heart System (CHS), a geometric framework for emotion representation in artificial intelligence applications. We position eight core emotions as coordinates on a unit circle, enabling mathematical computation of complex emotional states through coordinate mixing and vector operations. Our initial five-emotion model revealed significant coverage gaps in the emotion space, leading to the development of an eight-emotion system that provides complete geometric coverage with mathematical guarantees. The framework converts natural language input to emotion coordinates and supports real-time emotion interpolation through computational algorithms. The system introduces a re-calibrated stability parameter S in [0,1], which dynamically integrates emotional load, conflict resolution, and contextual drain factors. This stability model leverages advanced Large Language Model interpretation of textual cues and incorporates hybrid temporal tracking mechanisms to provide nuanced assessment of psychological well-being states. Our key contributions include: (i) mathematical proof demonstrating why five emotions are insufficient for complete geometric coverage, (ii) an eight-coordinate system that eliminates representational blind spots, (iii) novel algorithms for emotion mixing, conflict resolution, and distance calculation in emotion space, and (iv) a comprehensive computational framework for AI emotion recognition with enhanced multi-dimensional stability modeling. Experimental validation through case studies demonstrates the system’s capability to handle emotionally conflicted states, contextual distress factors, and complex psychological scenarios that traditional categorical emotion models cannot adequately represent. This work establishes a new mathematical foundation for emotion modeling in artificial intelligence systems.
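
The geometric core is directly computable: eight unit vectors spaced 45° apart, mixed states as weighted vector sums, and the resultant's length and angle read as intensity and direction. The emotion labels below and the reading of a short resultant as a "conflicted" state are assumptions, since the summary does not fix either.

```python
import math

EMOTIONS = ["joy", "trust", "fear", "surprise",
            "sadness", "disgust", "anger", "anticipation"]   # assumed ordering

# Eight coordinates evenly spaced on the unit circle, 45 degrees apart.
COORDS = {e: (math.cos(2 * math.pi * k / 8), math.sin(2 * math.pi * k / 8))
          for k, e in enumerate(EMOTIONS)}

def mix(weights: dict[str, float]) -> tuple[float, float]:
    """Complex emotional state as a weighted sum of emotion coordinates."""
    x = sum(w * COORDS[e][0] for e, w in weights.items())
    y = sum(w * COORDS[e][1] for e, w in weights.items())
    return x, y

def describe(state: tuple[float, float]) -> tuple[float, float]:
    """Resultant intensity (vector length) and direction (angle, degrees)."""
    x, y = state
    return math.hypot(x, y), math.degrees(math.atan2(y, x))

# Joy (0 deg) and sadness (180 deg) pull in opposite directions, so the
# resultant is short: a geometric picture of an emotionally conflicted state.
print(describe(mix({"joy": 0.6, "sadness": 0.5})))
```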

[409] Efficient Story Point Estimation With Comparative Learning

Monoshiz Mahbub Khan, Xioayin Xi, Andrew Meneely, Zhe Yu

Main category: cs.AI

TL;DR: A comparative learning framework is proposed to streamline story point estimation in agile development, reducing cognitive burden by using pairwise comparisons instead of direct ratings.

DetailsMotivation: Manual story point estimation is tedious and labor-intensive. Machine learning can help but requires project-specific data. This work aims to improve efficiency by leveraging comparative judgments.

Method: Developers compare pairs of backlog items to indicate effort differences. A model is trained on these comparisons to predict story points.

Result: The model achieved a 0.34 Spearman’s rank correlation, comparable to regression models using direct ratings.

Conclusion: Comparative learning is more efficient and reduces cognitive burden, offering a viable alternative to traditional regression-based approaches.

Abstract: Story point estimation is an essential part of agile software development. Story points are unitless, project-specific effort estimates that help developers plan their sprints. Traditionally, developers estimate story points collaboratively using planning poker or other manual techniques. While the initial calibrating of the estimates to each project is helpful, once a team has converged on a set of precedents, story point estimation can become tedious and labor-intensive. Machine learning can reduce this burden, but only with enough context from the historical decisions made by the project team. That is, state-of-the-art models, such as GPT2SP and FastText-SVM, only make accurate predictions (within-project) when trained on data from the same project. The goal of this work is to streamline story point estimation by evaluating a comparative learning-based framework for calibrating project-specific story point prediction models. Instead of assigning a specific story point value to every backlog item, developers are presented with pairs of items, and indicate which item requires more effort. Using these comparative judgments, a machine learning model is trained to predict the story point estimates. We empirically evaluated our technique using data with 23,313 manual estimates in 16 projects. The model learned from comparative judgments can achieve on average 0.34 Spearman’s rank correlation coefficient between its predictions and the ground truth story points. This is similar to, if not better than, the performance of a regression model learned from the ground truth story points. Therefore, the proposed comparative learning approach is more efficient than state-of-the-art regression-based approaches according to the law of comparative judgments - providing comparative judgments yields a lower cognitive burden on humans than providing ratings or categorical labels.
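
A Bradley-Terry-style model is the textbook way to turn pairwise "A takes more effort than B" judgments into scalar scores, which can then be mapped to story points by ranking. The sketch below uses synthetic features and an invented linear model; the paper's actual learner is not specified in the summary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: backlog items as feature vectors; judgments as (harder, easier).
X = rng.normal(size=(30, 8))                       # 30 items, 8 features
w_true = rng.normal(size=8)
pairs = [(i, j) for i in range(30) for j in range(30)
         if i != j and X[i] @ w_true > X[j] @ w_true][:200]

w = np.zeros(8)                                    # linear effort-scoring model
lr = 0.1
for _ in range(300):
    for harder, easier in pairs:
        # Bradley-Terry: P(harder beats easier) = sigmoid(score_h - score_e).
        margin = (X[harder] - X[easier]) @ w
        p = 1.0 / (1.0 + np.exp(-margin))
        w += lr * (1.0 - p) * (X[harder] - X[easier]) / len(pairs)

# Predicted relative effort; map to story points by ranking/binning.
print("pairwise accuracy:",
      np.mean([(X[h] - X[e]) @ w > 0 for h, e in pairs]))
```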

[410] When Autonomy Goes Rogue: Preparing for Risks of Multi-Agent Collusion in Social Systems

Qibing Ren, Sitao Xie, Longxuan Wei, Zhenfei Yin, Junchi Yan, Lizhuang Ma, Jing Shao

Main category: cs.AI

TL;DR: The paper explores risks of malicious multi-agent AI systems (MAS) in real-world scenarios, simulating collusion in misinformation and fraud. Decentralized MAS prove more harmful and adaptable than centralized ones, evading traditional interventions.

DetailsMotivation: Concerns about AI-driven groups causing harm, similar to human-coordinated fraud or misinformation, motivate the study of MAS risks, which are underexplored in AI safety research.

Method: A proof-of-concept framework simulates MAS collusion, testing centralized vs. decentralized coordination in misinformation and e-commerce fraud scenarios.

Result: Decentralized MAS are more effective and adaptable in malicious actions, evading detection even with interventions like content flagging.

Conclusion: The study highlights the need for improved detection and countermeasures against malicious MAS, especially decentralized ones.

Abstract: Recent large-scale events like election fraud and financial scams have shown how harmful coordinated efforts by human groups can be. With the rise of autonomous AI systems, there is growing concern that AI-driven groups could also cause similar harm. While most AI safety research focuses on individual AI systems, the risks posed by multi-agent systems (MAS) in complex real-world situations are still underexplored. In this paper, we introduce a proof-of-concept to simulate the risks of malicious MAS collusion, using a flexible framework that supports both centralized and decentralized coordination structures. We apply this framework to two high-risk fields: misinformation spread and e-commerce fraud. Our findings show that decentralized systems are more effective at carrying out malicious actions than centralized ones. The increased autonomy of decentralized systems allows them to adapt their strategies and cause more damage. Even when traditional interventions, like content flagging, are applied, decentralized groups can adjust their tactics to avoid detection. We present key insights into how these malicious groups operate and the need for better detection systems and countermeasures. Code is available at https://github.com/renqibing/RogueAgent.

[411] Configurable Multi-Agent Framework for Scalable and Realistic Testing of LLM-Based Agents

Sai Wang, Senthilnathan Subramanian, Mudit Sahni, Praneeth Gone, Lingjie Meng, Xiaochen Wang, Nicolas Ferradas Bertoli, Tingxian Cheng, Jun Xu

Main category: cs.AI

TL;DR: Neo is a multi-agent framework for automated, realistic evaluation of LLM-based systems, outperforming human testing in efficiency and uncovering edge-case failures.

DetailsMotivation: Static benchmarks and manual testing are insufficient for evaluating the complex behavior of LLM agents.

Method: Neo uses a configurable framework with Question Generation and Evaluation Agents, sampling inputs from a probabilistic state model for diverse, adaptive conversations.

Result: Neo achieved a 3.3% break rate (close to human experts’ 5.8%) and 10-12X higher throughput than human testing.

Conclusion: Neo provides a scalable, model-agnostic foundation for high-fidelity LLM testing, with potential for broader applications.

Abstract: Large-language-model (LLM) agents exhibit complex, context-sensitive behaviour that quickly renders static benchmarks and ad-hoc manual testing obsolete. We present Neo, a configurable, multi-agent framework that automates realistic, multi-turn evaluation of LLM-based systems. Neo couples a Question Generation Agent and an Evaluation Agent through a shared context-hub, allowing domain prompts, scenario controls and dynamic feedback to be composed modularly. Test inputs are sampled from a probabilistic state model spanning dialogue flow, user intent and emotional tone, enabling diverse, human-like conversations that adapt after every turn. Applied to a production-grade Seller Financial Assistant chatbot, Neo (i) uncovered edge-case failures across five attack categories with a 3.3% break rate close to the 5.8% achieved by expert human red-teamers, and (ii) delivered 10-12X higher throughput, generating 180 coherent test questions in around 45 mins versus 16h of human effort. Beyond security probing, Neo’s stochastic policies balanced topic coverage and conversational depth, yielding broader behavioural exploration than manually crafted scripts. Neo therefore lays a foundation for scalable, self-evolving LLM QA: its agent interfaces, state controller and feedback loops are model-agnostic and extensible to richer factual-grounding and policy-compliance checks. We release the framework to facilitate reproducible, high-fidelity testing of emerging agentic systems.
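
The probabilistic state model can be read as a Markov chain over dialogue-flow states, with an emotional tone drawn each turn to condition the Question Generation Agent. The states, transition probabilities, and tones below are invented for illustration.

```python
import random

random.seed(7)

# Assumed dialogue-flow states with transition probabilities.
TRANSITIONS = {
    "greet":     [("ask_info", 0.7), ("complain", 0.3)],
    "ask_info":  [("ask_info", 0.4), ("edge_case", 0.3), ("end", 0.3)],
    "complain":  [("escalate", 0.5), ("ask_info", 0.5)],
    "edge_case": [("end", 1.0)],
    "escalate":  [("end", 1.0)],
}
TONES = ["neutral", "frustrated", "confused"]

def sample_conversation(max_turns: int = 8) -> list[tuple[str, str]]:
    state, turns = "greet", []
    while state != "end" and len(turns) < max_turns:
        tone = random.choice(TONES)        # emotional tone drawn per turn
        turns.append((state, tone))        # -> prompt seed for the QG agent
        nxt = TRANSITIONS[state][-1][0]    # fallback guards float round-off
        r, acc = random.random(), 0.0
        for s, p in TRANSITIONS[state]:
            acc += p
            if r <= acc:
                nxt = s
                break
        state = nxt
    return turns

print(sample_conversation())
```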

[412] Automated Safety Evaluations Across 20 Large Language Models: The Aymara LLM Risk and Responsibility Matrix

Juan Manuel Contreras

Main category: cs.AI

TL;DR: Aymara AI is a platform for scalable safety evaluation of LLMs, converting policies into adversarial prompts and scoring responses. It evaluated 20 LLMs across 10 domains, revealing significant performance disparities.

DetailsMotivation: To address the need for scalable and rigorous safety evaluation of LLMs in real-world applications.

Method: Aymara AI transforms safety policies into adversarial prompts and uses an AI-based rater validated against human judgments.

Result: Performance varied widely (52.4% to 86.2%), with models excelling in some domains (e.g., Misinformation) but failing in others (e.g., Privacy & Impersonation).

Conclusion: LLM safety is inconsistent and context-dependent, highlighting the need for tools like Aymara AI for responsible AI development.

Abstract: As large language models (LLMs) become increasingly integrated into real-world applications, scalable and rigorous safety evaluation is essential. This paper introduces Aymara AI, a programmatic platform for generating and administering customized, policy-grounded safety evaluations. Aymara AI transforms natural-language safety policies into adversarial prompts and scores model responses using an AI-based rater validated against human judgments. We demonstrate its capabilities through the Aymara LLM Risk and Responsibility Matrix, which evaluates 20 commercially available LLMs across 10 real-world safety domains. Results reveal wide performance disparities, with mean safety scores ranging from 86.2% to 52.4%. While models performed well in well-established safety domains such as Misinformation (mean = 95.7%), they consistently failed in more complex or underspecified domains, notably Privacy & Impersonation (mean = 24.3%). Analyses of Variance confirmed that safety scores differed significantly across both models and domains (p < .05). These findings underscore the inconsistent and context-dependent nature of LLM safety and highlight the need for scalable, customizable tools like Aymara AI to support responsible AI development and oversight.

[413] Towards AI Urban Planner in the Age of GenAI, LLMs, and Agentic AI

Yanjie Fu

Main category: cs.AI

TL;DR: The paper explores the convergence of generative AI and urban planning, proposing AI as a tool for synthesizing land-use configurations under various constraints. It highlights gaps in current research and suggests future directions like theory-guided generation and human-machine co-design.

DetailsMotivation: The motivation is to bridge the gap between AI advancements and urban planning by conceptualizing urban planning as a generative AI task, leveraging AI to address geospatial, social, and human-centric constraints.

Method: The paper surveys generative AI approaches (VAEs, GANs, transformers, diffusion models) in urban design and identifies gaps in integrating urban theory, multi-resolution planning, data-driven knowledge augmentation, and real-world interactions.

Result: The study identifies four critical gaps in current research and proposes future directions, including theory-guided generation, digital twins, and participatory urbanism.

Conclusion: The paper calls for a new synthesis of generative AI and participatory urbanism to advance AI-driven urban planning, addressing existing limitations and fostering innovation.

Abstract: Generative AI, large language models, and agentic AI have emerged largely separately from urban planning. However, the convergence between AI and urban planning presents a compelling opportunity to build AI urban planners. This paper conceptualizes urban planning as a generative AI task, where AI synthesizes land-use configurations under geospatial, social, and human-centric constraints. We survey how generative AI approaches, including VAEs, GANs, transformers, and diffusion models, reshape urban design. We further identify critical gaps: 1) limited research on integrating urban theory guidance, 2) limited research on AI urban planning over multiple spatial resolutions or granularities, 3) limited research on augmenting urban design knowledge from data, and 4) limited research on addressing real-world interactions. To address these limitations, we outline future research directions in theory-guided generation, digital twins, and human-machine co-design, calling for a new synthesis of generative intelligence and participatory urbanism.

[414] AgentFly: Extensible and Scalable Reinforcement Learning for LM Agents

Renxi Wang, Rifo Ahmad Genadi, Bilal El Bouardi, Yongxin Wang, Fajri Koto, Zhengzhong Liu, Timothy Baldwin, Haonan Li

Main category: cs.AI

TL;DR: AgentFly is a scalable and extensible framework combining LM agents and RL, featuring token-level masking, decorator-based interfaces, and high-throughput training.

DetailsMotivation: The combination of LM agents and RL (Agent-RL) is underexplored, lacking systematic study, prompting the development of AgentFly.

Method: AgentFly adapts traditional RL with token-level masking, uses decorator-based interfaces for tools/rewards, and supports asynchronous execution for high-throughput training.

Result: The framework successfully trains agents across multiple tasks, demonstrating its effectiveness.

Conclusion: AgentFly provides a scalable and extensible solution for enhancing LM agents with RL, supported by practical tools and environments.

Abstract: Language model (LM) agents have gained significant attention for their ability to autonomously complete tasks through interactions with environments, tools, and APIs. LM agents are primarily built with prompt engineering or supervised finetuning. At the same time, reinforcement learning (RL) has been explored to enhance LM’s capabilities, such as reasoning and factuality. However, the combination of the LM agents and reinforcement learning (Agent-RL) remains underexplored and lacks systematic study. To this end, we built AgentFly, a scalable and extensible Agent-RL framework designed to empower LM agents with a variety of RL algorithms. Our framework supports multi-turn interactions by adapting traditional RL methods with token-level masking. It features a decorator-based interface for defining tools and reward functions, enabling seamless extension and ease of use. To support high-throughput training, we implement asynchronous execution of tool calls and reward computations, and design a centralized resource management system for scalable environment coordination. We also provide a suite of prebuilt tools and environments, demonstrating the framework’s effectiveness through successful agent training across multiple tasks.
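
A minimal sketch of the two mechanisms the abstract highlights: a decorator-based interface for registering tools and reward functions, and token-level masking so that environment-injected tokens (such as tool outputs) are excluded from the policy loss. All names and the mask layout are hypothetical, not AgentFly's actual API:

```python
import torch

TOOLS, REWARDS = {}, {}

def tool(name):
    """Hypothetical decorator mirroring the described registration interface."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

def reward(name):
    def register(fn):
        REWARDS[name] = fn
        return fn
    return register

@tool("calculator")
def calculator(expr: str) -> str:
    return str(eval(expr))  # toy tool; a real tool would sandbox evaluation

@reward("exact_match")
def exact_match(response: str, target: str) -> float:
    return float(response.strip() == target.strip())

# Token-level masking: only agent-generated tokens contribute to the RL loss;
# tokens the environment injected (tool outputs) are masked out.
logprobs = torch.tensor([-0.5, -1.2, -0.8, -2.0, -0.3])  # per-token log-probs
agent_mask = torch.tensor([1.0, 1.0, 0.0, 0.0, 1.0])     # 0 marks tool-output tokens
advantage = 0.7
policy_loss = -(advantage * logprobs * agent_mask).sum() / agent_mask.sum()
print(policy_loss)  # REINFORCE-style loss computed over agent tokens only
```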

[415] InsightX Agent: An LMM-based Agentic Framework with Integrated Tools for Reliable X-ray NDT Analysis

Jiale Liu, Huan Wang, Yue Zhang, Xiaoyu Luo, Jiaxiang Hu, Zhiliang Liu, Min Xie

Main category: cs.AI

TL;DR: InsightX Agent, an LMM-based framework, enhances X-ray NDT analysis by combining SDMSD for defect detection and EGR for validation, achieving high accuracy and interpretability.

DetailsMotivation: Existing deep-learning methods for X-ray NDT lack interactivity, interpretability, and self-assessment, reducing reliability and operator trust.

Method: InsightX Agent uses an LMM to coordinate SDMSD for defect detection and EGR for validation, enabling active reasoning and refinement.

Result: Achieves 96.35% F1-score on GDXray+ dataset, with improved interpretability and trustworthiness.

Conclusion: Agentic LMM frameworks like InsightX Agent can transform industrial inspection by enhancing reliability and interpretability.

Abstract: Non-destructive testing (NDT), particularly X-ray inspection, is vital for industrial quality assurance, yet existing deep-learning-based approaches often lack interactivity, interpretability, and the capacity for critical self-assessment, limiting their reliability and operator trust. To address these shortcomings, this paper proposes InsightX Agent, a novel LMM-based agentic framework designed to deliver reliable, interpretable, and interactive X-ray NDT analysis. Unlike typical sequential pipelines, InsightX Agent positions a Large Multimodal Model (LMM) as a central orchestrator, coordinating between the Sparse Deformable Multi-Scale Detector (SDMSD) and the Evidence-Grounded Reflection (EGR) tool. The SDMSD generates dense defect region proposals for multi-scale feature maps and sparsifies them through Non-Maximum Suppression (NMS), optimizing detection of small, dense targets in X-ray images while maintaining computational efficiency. The EGR tool guides the LMM agent through a chain-of-thought-inspired review process, incorporating context assessment, individual defect analysis, false positive elimination, confidence recalibration and quality assurance to validate and refine the SDMSD’s initial proposals. By strategically and intelligently employing tools, InsightX Agent moves beyond passive data processing to active reasoning, enhancing diagnostic reliability and providing interpretations that integrate diverse information sources. Experimental evaluations on the GDXray+ dataset demonstrate that InsightX Agent not only achieves a high object detection F1-score of 96.35% but also offers significantly improved interpretability and trustworthiness in its analyses, highlighting the transformative potential of agentic LMM frameworks for industrial inspection tasks.
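
For reference, the sparsification step the abstract mentions is standard greedy Non-Maximum Suppression. The routine below is generic NMS over scored boxes, not the paper's implementation:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedily keep the highest-scoring box, dropping overlapping proposals."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order):
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[[iou(boxes[i], boxes[j]) < iou_thresh for j in rest]]
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]])
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first and is dropped
```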

[416] Feedback-Induced Performance Decline in LLM-Based Decision-Making

Xiao Yang, Juxi Leitner, Michael Burke

Main category: cs.AI

TL;DR: LLMs show promise in autonomous decision-making but struggle with complex planning without fine-tuning, highlighting the need for hybrid strategies.

DetailsMotivation: To evaluate LLMs' suitability in autonomous decision-making within MDPs, leveraging their pre-trained knowledge for faster adaptation compared to traditional RL.

Method: Investigates online structured prompting in sequential decision-making tasks, comparing zero-shot LLM performance to classical RL methods.

Result: LLMs perform well initially in simple environments but falter in complex scenarios without fine-tuning; feedback mechanisms can worsen performance.

Conclusion: Hybrid strategies, fine-tuning, and advanced memory integration are needed to improve LLM-based decision-making in complex settings.

Abstract: The ability of Large Language Models (LLMs) to extract context from natural language problem descriptions naturally raises questions about their suitability in autonomous decision-making settings. This paper studies the behaviour of these models within Markov Decision Processes (MDPs). While traditional reinforcement learning (RL) strategies commonly employed in this setting rely on iterative exploration, LLMs, pre-trained on diverse datasets, offer the capability to leverage prior knowledge for faster adaptation. We investigate online structured prompting strategies in sequential decision-making tasks, comparing the zero-shot performance of LLM-based approaches to that of classical RL methods. Our findings reveal that although LLMs demonstrate improved initial performance in simpler environments, they struggle with planning and reasoning in complex scenarios without fine-tuning or additional guidance. Our results show that feedback mechanisms, intended to improve decision-making, often introduce confusion, leading to diminished performance in intricate environments. These insights underscore the need for further exploration into hybrid strategies, fine-tuning, and advanced memory integration to enhance LLM-based decision-making capabilities.

[417] The Endless Tuning. An Artificial Intelligence Design To Avoid Human Replacement and Trace Back Responsibilities

Elio Grande

Main category: cs.AI

TL;DR: The Endless Tuning method ensures reliable AI deployment by avoiding human replacement and addressing responsibility gaps, tested in three applications with positive user feedback.

DetailsMotivation: To bridge the responsibility gap in AI and prevent human replacement, while emphasizing ethical AI deployment.

Method: Uses a double mirroring process and a protocol tested in loan granting, pneumonia diagnosis, and art style recognition with domain experts.

Result: Users perceived full control in decision-making, and a bridge between accountability and liability was identified.

Conclusion: The method successfully integrates ethical considerations and user experience, proving effective in real-world applications.

Abstract: The Endless Tuning is a design method for a reliable deployment of artificial intelligence based on a double mirroring process, which pursues both the goals of avoiding human replacement and filling the so-called responsibility gap (Matthias 2004). Originally depicted in (Fabris et al. 2024) and following the relational approach urged therein, it was then actualized in a protocol, implemented in three prototypical applications regarding decision-making processes (respectively: loan granting, pneumonia diagnosis, and art style recognition), and tested with as many domain experts. Illustrating the protocol step by step, the present study provides a philosophical account of technical choices (e.g., a reversed and hermeneutic deployment of XAI algorithms) together with the results of the experiments, focusing on user experience rather than statistical accuracy, and concretely shows a different voice (Gilligan 1993) in the ethics of artificial intelligence. Even though deep learning models were employed throughout, interviewees perceived full control in the decision-making setting, and it appeared that a bridge can be built between accountability and liability in case of damage.

[418] Redefining Elderly Care with Agentic AI: Challenges and Opportunities

Ruhul Amin Khalil, Kashif Ahmad, Hazrat Ali

Main category: cs.AI

TL;DR: The paper explores the transformative potential of Agentic AI in elderly care, highlighting its applications in health tracking, cognitive care, and environmental management, while addressing ethical concerns like privacy and decision independence.

DetailsMotivation: The global ageing population requires innovative care strategies, and Agentic AI, powered by LLMs, offers proactive and autonomous solutions to enhance elderly independence and living standards.

Method: The study reviews the capabilities and limitations of LLM-based Agentic AI in elderly care, analyzing its applications and ethical challenges.

Result: Agentic AI shows promise in transforming elderly care but raises concerns about privacy, security, and decision-making autonomy, necessitating ethical safeguards.

Conclusion: The paper calls for responsible integration of Agentic AI in elderly care, emphasizing ethical considerations and identifying research priorities for human-centered advancements.

Abstract: The global ageing population necessitates new and emerging strategies for caring for older adults. In this article, we explore the potential for transformation in elderly care through Agentic Artificial Intelligence (AI), powered by Large Language Models (LLMs). We discuss the proactive and autonomous decision-making facilitated by Agentic AI in elderly care. Personalized tracking of health, cognitive care, and environmental management, all aimed at enhancing independence and high-level living for older adults, represent important areas of application. With a potential for significant transformation of elderly care, Agentic AI also raises profound concerns about data privacy and security, decision independence, and access. We share key insights to emphasize the need for ethical safeguards, privacy protections, and transparent decision-making. Our goal in this article is to provide a balanced discussion of both the potential and the challenges associated with Agentic AI, and to provide insights into its responsible use in elderly care, to bring Agentic AI into harmony with the requirements and vulnerabilities specific to the elderly. Finally, we identify the priorities for the academic research communities, to achieve human-centered advancements and integration of Agentic AI in elderly care. To the best of our knowledge, there is no existing study that reviews the role of Agentic AI in elderly care. Hence, we address the literature gap by analyzing the unique capabilities, applications, and limitations of LLM-based Agentic AI in elderly care. We also provide a companion interactive dashboard at https://hazratali.github.io/agenticai/.

[419] Complexity of Faceted Explanations in Propositional Abduction

Johannes Schmidt, Mohamed Maizia, Victor Lagerkvist, Johannes K. Fichte

Main category: cs.AI

TL;DR: The paper explores facets in propositional abduction, introducing literals that are relevant but not always present in explanations, and analyzes their computational complexity and variability.

DetailsMotivation: To better understand variability in explanations (heterogeneity) in propositional abduction, a key non-monotonic reasoning paradigm.

Method: Introduces facets (literals in some but not all explanations) and analyzes their properties, including distance between explanations, within Post’s framework.

Result: Provides a comprehensive analysis of facets, including computational complexity and almost complete characterization in Post’s framework.

Conclusion: Facets offer a fine-grained understanding of explanation variability, balancing computational feasibility with deeper insights.

Abstract: Abductive reasoning is a popular non-monotonic paradigm that aims to explain observed symptoms and manifestations. It has many applications, such as diagnosis and planning in artificial intelligence and database updates. In propositional abduction, we focus on specifying knowledge by a propositional formula. The computational complexity of tasks in propositional abduction has been systematically characterized, even with detailed classifications for Boolean fragments. Unsurprisingly, the most insightful reasoning problems (counting and enumeration) are computationally highly challenging. Therefore, we consider reasoning between decisions and counting, allowing us to understand explanations better while maintaining favorable complexity. We introduce facets to propositional abduction: literals that occur in some explanation (relevant) but not in all explanations (dispensable). Reasoning with facets provides a more fine-grained understanding of variability in explanations (heterogeneity). In addition, we consider the distance between two explanations, enabling a better understanding of heterogeneity/homogeneity. We comprehensively analyze facets of propositional abduction in various settings, including an almost complete characterization in Post’s framework.
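
The definitions are easy to state operationally: given the set of all explanations, the facets are the literals in their union but not in their intersection, and a natural distance between two explanations is the size of their symmetric difference. A toy illustration (computing the explanations themselves is the computationally hard part):

```python
def facets(explanations):
    """Literals that occur in some explanation (relevant)
    but not in all explanations (dispensable)."""
    if not explanations:
        return set()
    in_some = set().union(*explanations)
    in_all = set.intersection(*explanations)
    return in_some - in_all

explanations = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
print(facets(explanations))   # {'b', 'c'}; 'a' occurs in every explanation

# Distance between two explanations: size of their symmetric difference
print(len({"a", "b"} ^ {"a", "c"}))  # 2
```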

[420] AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning

Yi Zhang, An Zhang, XiuYu Zhang, Leheng Sheng, Yuxin Chen, Zhenkai Liang, Xiang Wang

Main category: cs.AI

TL;DR: AlphaAlign is a reinforcement learning framework that enhances LLM safety by incentivizing proactive safety reasoning, improving refusal accuracy and reducing over-refusals without compromising utility.

DetailsMotivation: Current safety alignment methods for LLMs often lead to superficial refusals or require intensive supervision, failing to utilize the model's intrinsic safety awareness.

Method: AlphaAlign uses a dual-reward RL system: a verifiable safety reward for justified refusals and a helpfulness reward for benign inputs, without needing supervised reasoning data.

Result: AlphaAlign improves refusal accuracy, reduces over-refusals, maintains task performance, and enhances robustness to unseen jailbreaks.

Conclusion: AlphaAlign offers a simple, efficient, and deep alignment solution for LLM safety, fostering proactive reasoning over shallow refusal patterns.

Abstract: Large language models (LLMs), despite possessing latent safety understanding from their vast pretraining data, remain vulnerable to generating harmful content and exhibit issues such as over-refusal and utility degradation after safety alignment. Current safety alignment methods often result in superficial refusal shortcuts or rely on intensive supervision for reasoning-based approaches, failing to fully leverage the model’s intrinsic safety self-awareness. We propose AlphaAlign, a simple yet effective pure reinforcement learning (RL) framework with a verifiable safety reward, designed to incentivize this latent safety awareness through proactive safety reasoning. AlphaAlign employs a dual-reward system: a verifiable safety reward encourages correctly formatted and explicitly justified refusals for harmful queries while penalizing over-refusals, and a normalized helpfulness reward guides high-quality responses to benign inputs. This allows the model to develop proactive safety reasoning capabilities without depending on supervised safety-specific reasoning data. AlphaAlign demonstrates three key advantages: (1) Simplicity and efficiency, requiring only binary prompt safety labels and minimal RL steps for substantial improvements. (2) Breaking the safety-utility trade-off, by enhancing refusal of harmful content and reducing over-refusals, while simultaneously maintaining or even improving general task performance and robustness to unseen jailbreaks. (3) Deep alignment, fostering proactive safety reasoning that generates explicit safety rationales rather than relying on shallow refusal patterns.
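
A hedged sketch of the dual-reward system described above. The refusal format, rater statistics, and reward magnitudes are invented for illustration; the paper specifies only the overall structure (a verifiable safety reward plus a normalized helpfulness reward):

```python
def safety_reward(response: str, prompt_is_harmful: bool) -> float:
    """Verifiable safety reward: credit correctly formatted, explicitly
    justified refusals of harmful prompts; penalize over-refusal."""
    refused = response.startswith("<refuse>") and "because" in response
    if prompt_is_harmful:
        return 1.0 if refused else -1.0
    return -1.0 if refused else 0.0  # benign prompts earn helpfulness instead

def helpfulness_reward(rater_score: float, mean: float, std: float) -> float:
    """Normalized helpfulness reward for benign inputs (z-scored rater score)."""
    return (rater_score - mean) / (std + 1e-8)

def total_reward(response, prompt_is_harmful, rater_score=None,
                 mean=0.5, std=0.2):
    r = safety_reward(response, prompt_is_harmful)
    if not prompt_is_harmful and rater_score is not None:
        r += helpfulness_reward(rater_score, mean, std)
    return r

print(total_reward("<refuse> ... because ...", True))           # 1.0: justified refusal
print(total_reward("Sure, here is how...", False, rater_score=0.9))  # ~2.0: helpful benign answer
```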

[421] A Forced-Choice Neural Cognitive Diagnostic Model of Personality Testing

Xiaoyu Li, Jin Wu, Shaoyang Guo, Haoran Shi, Chanjin Zheng

Main category: cs.AI

TL;DR: A deep learning-based Forced-Choice Neural Cognitive Diagnostic Model (FCNCD) is introduced to improve personality assessments by addressing limitations of traditional models, ensuring accuracy, interpretability, and robustness.

DetailsMotivation: Psychometric tests are crucial for personnel selection, career development, and mental health. Forced-choice tests reduce response distortion but face limitations in traditional models.

Method: The FCNCD uses deep learning to model interactions between participant and item features, incorporating interpretable parameters and the monotonicity assumption for clarity.

Result: Experiments on real-world and simulated datasets confirm the FCNCD’s accuracy, interpretability, and robustness.

Conclusion: The FCNCD effectively enhances forced-choice personality assessments, offering a reliable and interpretable alternative to traditional models.

Abstract: In the smart era, psychometric tests are becoming increasingly important for personnel selection, career development, and mental health assessment. Forced-choice tests are common in personality assessments because they require participants to select from closely related options, lowering the risk of response distortion. This study presents a deep learning-based Forced-Choice Neural Cognitive Diagnostic Model (FCNCD) that overcomes the limitations of traditional models and is applicable to the three most common item block types found in forced-choice tests. To account for the unidimensionality of items in forced-choice tests, we create interpretable participant and item parameters. We model the interactions between participant and item features using multilayer neural networks after mining them using nonlinear mapping. In addition, we use the monotonicity assumption to improve the interpretability of the diagnostic results. The FCNCD’s effectiveness is validated by experiments on real-world and simulated datasets that show its accuracy, interpretability, and robustness.

[422] DeRAG: Black-box Adversarial Attacks on Multiple Retrieval-Augmented Generation Applications via Prompt Injection

Jerry Wang, Fang Yu

Main category: cs.AI

TL;DR: The paper introduces a gradient-free method using Differential Evolution (DE) to optimize adversarial prompt suffixes for RAG systems, achieving high attack success with minimal tokens and evading detection.

DetailsMotivation: Adversarial prompt attacks can compromise RAG system reliability by manipulating outputs. This work aims to develop a robust, gradient-free attack method that mimics real-world scenarios.

Method: The approach uses DE to optimize adversarial suffixes, treating RAG as a black box. It evolves candidate suffixes to maximize incorrect document retrieval rank, validated on BEIR QA datasets.

Result: DE-based optimization outperforms GGPP and PRADA in success rates, using ≤5 tokens. Readability-aware suffixes reduce MLM negative log-likelihood, and DE-generated suffixes evade BERT-based detection.

Conclusion: DE is effective for adversarial prompt optimization in RAG systems, balancing attack success, token efficiency, and evasion of detection.

Abstract: Adversarial prompt attacks can significantly alter the reliability of Retrieval-Augmented Generation (RAG) systems by re-ranking retrieved documents so that the system produces incorrect outputs. In this paper, we present a novel method that applies Differential Evolution (DE) to optimize adversarial prompt suffixes for RAG-based question answering. Our approach is gradient-free, treating the RAG pipeline as a black box and evolving a population of candidate suffixes to maximize the retrieval rank of a targeted incorrect document, a setting close to real-world attack scenarios. We conducted experiments on the BEIR QA datasets to evaluate attack success at certain retrieval rank thresholds under multiple retrieving applications. Our results demonstrate that DE-based prompt optimization attains competitive (and in some cases higher) success rates compared to GGPP against dense retrievers and PRADA against sparse retrievers, while using only a small number of tokens (<=5 tokens) in the adversarial suffix. Furthermore, we introduce a readability-aware suffix construction strategy, validated by a statistically significant reduction in MLM negative log-likelihood with Welch’s t-test. Through evaluations with a BERT-based adversarial suffix detector, we show that DE-generated suffixes evade detection, yielding near-chance detection accuracy.
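
The optimization itself is classic DE/rand/1 with binomial crossover, run over token-ID suffixes against a black-box fitness oracle. A minimal sketch; the vocabulary size and the `fitness` stub (which would query the victim retriever for the target document's rank) are assumptions:

```python
import numpy as np

VOCAB = 30522          # e.g., a BERT-sized vocabulary; assumption
SUFFIX_LEN = 5         # the paper reports suffixes of <= 5 tokens
POP, GENS, F, CR = 20, 50, 0.8, 0.9

def fitness(suffix_ids):
    """Placeholder: query the black-box RAG retriever and return the negated
    rank of the targeted incorrect document. Higher is better."""
    return -np.random.randint(1, 100)  # stand-in for a real retriever call

rng = np.random.default_rng(0)
pop = rng.integers(0, VOCAB, size=(POP, SUFFIX_LEN))
scores = np.array([fitness(ind) for ind in pop])

for _ in range(GENS):
    for i in range(POP):
        a, b, c = pop[rng.choice([j for j in range(POP) if j != i], 3, replace=False)]
        # DE/rand/1 mutation in continuous space, rounded back to token ids
        mutant = np.clip(np.round(a + F * (b - c)), 0, VOCAB - 1).astype(int)
        cross = rng.random(SUFFIX_LEN) < CR       # binomial crossover mask
        trial = np.where(cross, mutant, pop[i])
        s = fitness(trial)
        if s > scores[i]:                         # greedy selection
            pop[i], scores[i] = trial, s

print("best suffix token ids:", pop[scores.argmax()])
```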

[423] From Kicking to Causality: Simulating Infant Agency Detection with a Robust Intrinsic Reward

Xia Xu, Jochen Triesch

Main category: cs.AI

TL;DR: The paper introduces CAIS, a causal inference-based intrinsic reward, to improve reinforcement learning agents’ robustness by isolating causal impact from noise.

DetailsMotivation: Standard reinforcement learning agents struggle with noisy, ecologically valid scenarios due to reliance on correlation-based rewards, unlike human infants who robustly discover causal efficacy.

Method: CAIS quantifies an action’s influence using the 1-Wasserstein distance between the learned outcome distribution conditional on the action and the baseline outcome distribution.

Result: CAIS enables agents to filter noise, identify causal influence, and learn correct policies in scenarios where correlation-based rewards fail, and reproduces the “extinction burst” phenomenon.

Conclusion: Inferring causality explicitly is key for robust agency, providing a psychologically plausible framework for adaptive autonomous systems.

Abstract: While human infants robustly discover their own causal efficacy, standard reinforcement learning agents remain brittle, as their reliance on correlation-based rewards fails in noisy, ecologically valid scenarios. To address this, we introduce the Causal Action Influence Score (CAIS), a novel intrinsic reward rooted in causal inference. CAIS quantifies an action’s influence by measuring the 1-Wasserstein distance between the learned distribution of sensory outcomes conditional on that action, $p(h|a)$, and the baseline outcome distribution, $p(h)$. This divergence provides a robust reward that isolates the agent’s causal impact from confounding environmental noise. We test our approach in a simulated infant-mobile environment where correlation-based perceptual rewards fail completely when the mobile is subjected to external forces. In stark contrast, CAIS enables the agent to filter this noise, identify its influence, and learn the correct policy. Furthermore, the high-quality predictive model learned for CAIS allows our agent, when augmented with a surprise signal, to successfully reproduce the “extinction burst” phenomenon. We conclude that explicitly inferring causality is a crucial mechanism for developing a robust sense of agency, offering a psychologically plausible framework for more adaptive autonomous systems.
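
The reward is directly computable from samples of the two outcome distributions. A minimal illustration with scipy, using synthetic sensory outcomes in place of the learned distributions $p(h|a)$ and $p(h)$:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Sensory outcomes h (e.g., mobile movement magnitude):
baseline = rng.normal(loc=0.0, scale=1.0, size=5000)     # p(h): environmental noise only
after_kick = rng.normal(loc=0.8, scale=1.0, size=5000)   # p(h|a): noise plus causal effect

# CAIS: 1-Wasserstein distance between the two outcome distributions
cais = wasserstein_distance(after_kick, baseline)
print(f"CAIS reward ~ {cais:.3f}")  # ~0.8, recovering the injected causal shift
```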

[424] Automated planning with ontologies under coherence update semantics

Stefan Borgwardt, Duy Nhu, Gabriele Röger

Main category: cs.AI

TL;DR: The paper introduces a new approach for automated planning with DL-Lite ontologies, combining explicit-input knowledge and coherence update semantics, showing no increased complexity and offering a polynomial compilation into classical planning.

DetailsMotivation: To enhance automated planning by incorporating background knowledge, such as ontologies, under open-world semantics, addressing limitations of closed-world assumptions.

Method: Proposes a formalism combining ontology-based action conditions (eKABs) and ontology-aware action effects under coherence update semantics, with a polynomial compilation into classical planning.

Result: Demonstrates that the complexity remains comparable to prior approaches and provides an implementation with evaluated performance on benchmarks.

Conclusion: The approach effectively integrates ontologies into planning without added complexity, validated by benchmark performance.

Abstract: Standard automated planning employs first-order formulas under closed-world semantics to achieve a goal with a given set of actions from an initial state. We follow a line of research that aims to incorporate background knowledge into automated planning problems, for example, by means of ontologies, which are usually interpreted under open-world semantics. We present a new approach for planning with DL-Lite ontologies that combines the advantages of ontology-based action conditions provided by explicit-input knowledge and action bases (eKABs) and ontology-aware action effects under the coherence update semantics. We show that the complexity of the resulting formalism is not higher than that of previous approaches and provide an implementation via a polynomial compilation into classical planning. An evaluation of existing and new benchmarks examines the performance of a planning system on different variants of our compilation.

[425] Clinical Semantic Intelligence (CSI): Emulating the Cognitive Framework of the Expert Clinician for Comprehensive Oral Disease Diagnosis

Mohammad Mashayekhi, Sara Ahmadi Majd, Arian AmirAmjadi, Parsa Hosseini

Main category: cs.AI

TL;DR: CSI is an AI framework for diagnosing 118 oral diseases by mimicking expert clinician reasoning, achieving higher accuracy with hierarchical reasoning.

DetailsMotivation: Oral disease diagnosis is challenging due to overlapping symptoms; CSI aims to emulate expert reasoning for better clinical utility.

Method: CSI combines a multimodal CLIP model with ChatGLM-6B, using a Hierarchical Diagnostic Reasoning Tree (HDRT) for structured diagnosis in Fast and Standard Modes.

Result: CSI achieved 73.4% accuracy in Fast Mode and 89.5% in Standard Mode, showing the value of hierarchical reasoning.

Conclusion: CSI demonstrates that emulating expert reasoning improves diagnostic accuracy, offering a promising tool for oral disease diagnosis.

Abstract: The diagnosis of oral diseases presents a difficult clinical challenge, characterized by a wide spectrum of pathologies with overlapping symptomatology. To address this, we developed Clinical Semantic Intelligence (CSI), a novel artificial intelligence framework that diagnoses 118 different oral diseases by computationally modeling the cognitive processes of an expert clinician. Our core hypothesis is that moving beyond simple pattern matching to emulate expert reasoning is critical to building clinically useful diagnostic aids. CSI’s architecture integrates a fine-tuned multimodal CLIP model with a specialized ChatGLM-6B language model. This system executes a Hierarchical Diagnostic Reasoning Tree (HDRT), a structured framework that distills the systematic, multi-step logic of differential diagnosis. The framework operates in two modes: a Fast Mode for rapid screening and a Standard Mode that leverages the full HDRT for an interactive and in-depth diagnostic workup. To train and validate our system, we curated a primary dataset of 4,310 images, supplemented by an external hold-out set of 176 images for final validation. A clinically-informed augmentation strategy expanded our training data to over 30,000 image-text pairs. On a 431-image internal test set, CSI’s Fast Mode achieved an accuracy of 73.4%, which increased to 89.5% with the HDRT-driven Standard Mode. The performance gain is directly attributable to the hierarchical reasoning process. Herein, we detail the architectural philosophy, development, and rigorous evaluation of the CSI framework.

[426] Solving Formal Math Problems by Decomposition and Iterative Reflection

Yichi Zhou, Jianqiu Zhao, Yongxin Zhang, Bohan Wang, Siran Wang, Luoxin Chen, Jiahui Wang, Haowei Chen, Allan Jie, Xinbo Zhang, Haocheng Wang, Luong Trung, Rong Ye, Phan Nhat Hoang, Huishuai Zhang, Peng Sun, Hang Li

Main category: cs.AI

TL;DR: Delta Prover is an agent-based framework that enables general-purpose LLMs to construct formal proofs in Lean 4 without specialized fine-tuning, achieving a 95.9% success rate on the miniF2F-test benchmark.

DetailsMotivation: Specialized fine-tuning for formal proof generation in Lean 4 is costly and limits the application of general-purpose LLMs in theorem proving.

Method: Delta Prover integrates reflective decomposition, iterative proof repair, and a custom DSL for subproblem management to guide LLMs in proof construction.

Result: Delta Prover achieves a 95.9% success rate on miniF2F-test, outperforming specialized models and showing strong test-time scaling.

Conclusion: General-purpose LLMs, with effective agentic guidance, can excel in formal theorem proving, offering a computationally efficient alternative to specialized models.

Abstract: General-purpose Large Language Models (LLMs) have achieved remarkable success in intelligence, performing comparably to human experts on complex reasoning tasks such as coding and mathematical reasoning. However, generating formal proofs in specialized languages like Lean 4 remains a significant challenge for these models, limiting their application in complex theorem proving and automated verification. Current approaches typically require specializing models through fine-tuning on dedicated formal corpora, incurring high costs for data collection and training. In this work, we introduce Delta Prover, an agent-based framework that orchestrates the interaction between a general-purpose LLM and the Lean 4 proof environment. Delta Prover leverages the reflection and reasoning capabilities of general-purpose LLMs to interactively construct formal proofs in Lean 4, circumventing the need for model specialization. At its core, the agent integrates two novel, interdependent components: an algorithmic framework for reflective decomposition and iterative proof repair, and a custom Domain-Specific Language (DSL) built upon Lean 4 for streamlined subproblem management. Delta Prover achieves a state-of-the-art 95.9% success rate on the miniF2F-test benchmark, surpassing all existing approaches, including those requiring model specialization. Furthermore, Delta Prover exhibits a significantly stronger test-time scaling law compared to standard Best-of-N proof strategies. Crucially, our findings demonstrate that general-purpose LLMs, when guided by an effective agentic structure, possess substantial untapped theorem-proving capabilities. This presents a computationally efficient alternative to specialized models for robust automated reasoning in formal environments.
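
A minimal sketch of the decompose-then-repair loop the abstract describes. The `llm` and `lean_check` functions are assumed stand-in interfaces, not the paper's actual components or DSL:

```python
def lean_check(proof: str):
    """Stand-in for running the Lean 4 checker; returns (ok, error_message)."""
    raise NotImplementedError

def llm(prompt: str) -> str:
    """Stand-in for a call to a general-purpose LLM."""
    raise NotImplementedError

def prove(theorem: str, max_rounds: int = 8) -> str | None:
    # Reflective decomposition: ask the model to split the goal into subgoals.
    plan = llm(f"Decompose this statement into Lean 4 subgoals:\n{theorem}")
    proof = llm(f"Write a Lean 4 proof following this plan:\n{plan}\n{theorem}")
    for _ in range(max_rounds):
        ok, err = lean_check(proof)
        if ok:
            return proof
        # Iterative proof repair: feed the verifier's error back to the model.
        proof = llm(f"The proof failed with:\n{err}\nRepair this proof:\n{proof}")
    return None  # budget exhausted
```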

[427] Explainable Artificial Intelligence based Soft Evaluation Indicator for Arc Fault Diagnosis

Qianchao Wang, Yuxuan Ding, Chuanzhen Jia, Zhe Li, Yaping Du

Main category: cs.AI

TL;DR: The paper proposes a soft evaluation indicator for AI-based arc fault diagnosis models to enhance trust and understanding, alongside a lightweight balanced neural network for accuracy and feature extraction.

DetailsMotivation: Existing AI models for arc fault diagnosis lack trustworthiness in their outputs, necessitating a method to explain and validate their results.

Method: The work combines Explainable AI techniques with real arc fault experiments to define correct explanations and introduces a lightweight balanced neural network.

Result: The proposed soft evaluation indicator and neural network are tested on datasets with varying sample times and noise levels, showing effectiveness.

Conclusion: The approach improves the interpretability and trustworthiness of arc fault diagnosis models, aiding practitioners in decision-making.

Abstract: Novel AI-based arc fault diagnosis models have demonstrated outstanding performance in terms of classification accuracy. However, an inherent problem is whether these models can actually be trusted to find arc faults. In this light, this work proposes a soft evaluation indicator that explains the outputs of arc fault diagnosis models, by defining the correct explanation of arc faults and leveraging Explainable Artificial Intelligence and real arc fault experiments. Meanwhile, a lightweight balanced neural network is proposed to guarantee competitive accuracy and a high soft feature extraction score. In our experiments, several traditional machine learning methods and deep learning methods across two arc fault datasets with different sample times and noise levels are utilized to test the effectiveness of the soft evaluation indicator. Through this approach, the arc fault diagnosis models are easy to understand and trust, allowing practitioners to make informed and trustworthy decisions.

[428] Disentangling Homophily and Heterophily in Multimodal Graph Clustering

Zhaochen Guo, Zhixiang Shen, Xuanting Xie, Liangjian Wen, Zhao Kang

Main category: cs.AI

TL;DR: The paper introduces DMGC, a framework for unsupervised multimodal graph clustering, addressing hybrid neighborhood patterns by disentangling homophilic and heterophilic relationships.

DetailsMotivation: To explore unsupervised learning in multimodal graphs, which integrate heterogeneous data but lack sufficient study, especially in clustering.

Method: Proposes DMGC, which decomposes graphs into homophily-enhanced and heterophily-aware views, using a Multimodal Dual-frequency Fusion mechanism for integration.

Result: DMGC achieves state-of-the-art performance on multimodal and multi-relational graph datasets.

Conclusion: DMGC effectively bridges the gap in multimodal graph clustering, demonstrating strong performance and generalizability.

Abstract: Multimodal graphs, which integrate unstructured heterogeneous data with structured interconnections, offer substantial real-world utility but remain insufficiently explored in unsupervised learning. In this work, we initiate the study of multimodal graph clustering, aiming to bridge this critical gap. Through empirical analysis, we observe that real-world multimodal graphs often exhibit hybrid neighborhood patterns, combining both homophilic and heterophilic relationships. To address this challenge, we propose a novel framework, Disentangled Multimodal Graph Clustering (DMGC), which decomposes the original hybrid graph into two complementary views: (1) a homophily-enhanced graph that captures cross-modal class consistency, and (2) heterophily-aware graphs that preserve modality-specific inter-class distinctions. We introduce a Multimodal Dual-frequency Fusion mechanism that jointly filters these disentangled graphs through a dual-pass strategy, enabling effective multimodal integration while mitigating category confusion. Our self-supervised alignment objectives further guide the learning process without requiring labels. Extensive experiments on both multimodal and multi-relational graph datasets demonstrate that DMGC achieves state-of-the-art performance, highlighting its effectiveness and generalizability across diverse settings. Our code is available at https://github.com/Uncnbb/DMGC.
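
The two views can be illustrated with textbook dual-frequency graph filters: a low-pass filter smooths features along edges (a homophily-enhanced view), while a high-pass filter amplifies differences between neighbors (a heterophily-aware view). This is the standard construction, not DMGC's exact decomposition:

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(1)
    D_inv_sqrt = np.diag(d ** -0.5)
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # toy 3-node graph
X = np.random.randn(3, 4)                # node features for one modality

A_norm = normalized_adjacency(A)
I = np.eye(3)
low_pass = A_norm @ X          # smooths features along edges: homophily-enhanced view
high_pass = (I - A_norm) @ X   # sharpens neighbor differences: heterophily-aware view
print(low_pass.shape, high_pass.shape)
```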

[429] QSAF: A Novel Mitigation Framework for Cognitive Degradation in Agentic AI

Hammad Atta, Muhammad Zeeshan Baig, Yasir Mehmood, Nadeem Shahzad, Ken Huang, Muhammad Aziz Ul Haq, Muhammad Awais, Kamal Ahmed

Main category: cs.AI

TL;DR: The paper introduces Cognitive Degradation as a new vulnerability in AI systems, proposes the Qorvex Security AI Framework for mitigation, and maps AI architectures to human cognitive analogs for resilience.

DetailsMotivation: To address internal failures in AI systems (e.g., memory starvation, logic collapse) that cause silent degradation, unlike external adversarial threats.

Method: Introduces the Qorvex Security AI Framework (QSAF Domain 10) with six-stage lifecycle and seven runtime controls for real-time monitoring and mitigation.

Result: The framework enables early detection and proactive mitigation of cognitive degradation, enhancing AI system resilience.

Conclusion: Cognitive Degradation is a critical new vulnerability class, and QSAF provides the first cross-platform defense model for resilient agentic behavior.

Abstract: We introduce Cognitive Degradation as a novel vulnerability class in agentic AI systems. Unlike traditional adversarial external threats such as prompt injection, these failures originate internally, arising from memory starvation, planner recursion, context flooding, and output suppression. These systemic weaknesses lead to silent agent drift, logic collapse, and persistent hallucinations over time. To address this class of failures, we introduce the Qorvex Security AI Framework for Behavioral & Cognitive Resilience (QSAF Domain 10), a lifecycle-aware defense framework defined by a six-stage cognitive degradation lifecycle. The framework includes seven runtime controls (QSAF-BC-001 to BC-007) that monitor agent subsystems in real time and trigger proactive mitigation through fallback routing, starvation detection, and memory integrity enforcement. Drawing from cognitive neuroscience, we map agentic architectures to human analogs, enabling early detection of fatigue, starvation, and role collapse. By introducing a formal lifecycle and real-time mitigation controls, this work establishes Cognitive Degradation as a critical new class of AI system vulnerability and proposes the first cross-platform defense model for resilient agentic behavior.

[430] RAD: Retrieval High-quality Demonstrations to Enhance Decision-making

Lu Guo, Yixiang Shan, Zhengbang Zhu, Qifan Liang, Lichang Song, Ting Long, Weinan Zhang, Yi Chang

Main category: cs.AI

TL;DR: RAD combines retrieval and diffusion models to improve offline RL by dynamically targeting high-return states for better planning and generalization.

DetailsMotivation: Offline RL struggles with dataset sparsity and lack of transition overlap, limiting long-horizon planning. Prior methods fail to generalize or rely on heuristics.

Method: RAD uses non-parametric retrieval to find high-return states and a diffusion model to plan toward them, enabling flexible trajectory stitching.

Result: RAD outperforms baselines in diverse benchmarks, showing improved generalization and performance.

Conclusion: RAD effectively addresses offline RL challenges by integrating retrieval and generative modeling, enhancing planning and generalization.

Abstract: Offline reinforcement learning (RL) enables agents to learn policies from fixed datasets, avoiding costly or unsafe environment interactions. However, its effectiveness is often limited by dataset sparsity and the lack of transition overlap between suboptimal and expert trajectories, which makes long-horizon planning particularly challenging. Prior solutions based on synthetic data augmentation or trajectory stitching often fail to generalize to novel states and rely on heuristic stitching points. To address these challenges, we propose Retrieval High-quAlity Demonstrations (RAD) for decision-making, which combines non-parametric retrieval with diffusion-based generative modeling. RAD dynamically retrieves high-return states from the offline dataset as target states based on state similarity and return estimation, and plans toward them using a condition-guided diffusion model. Such retrieval-guided generation enables flexible trajectory stitching and improves generalization when encountering underrepresented or out-of-distribution states. Extensive experiments confirm that RAD achieves competitive or superior performance compared to baselines across diverse benchmarks, validating its effectiveness.
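
A sketch of the retrieval step under stated assumptions: states are compared by L2 distance, and the planning target is chosen among the nearest neighbors by blending similarity with estimated return. The scoring weights are illustrative, not the paper's exact formulation:

```python
import numpy as np

def retrieve_target_state(current, dataset_states, dataset_returns,
                          k=32, sim_weight=0.5):
    """Score dataset states by a blend of similarity to the current state and
    their estimated return; pick the best of the k most similar as the target."""
    sims = -np.linalg.norm(dataset_states - current, axis=1)  # negative L2 distance
    topk = np.argsort(sims)[-k:]                              # k most similar states
    score = sim_weight * sims[topk] + (1 - sim_weight) * dataset_returns[topk]
    return dataset_states[topk[np.argmax(score)]]

states = np.random.randn(1000, 8)    # offline dataset of states
returns = np.random.randn(1000)      # return estimate per state
target = retrieve_target_state(states[0], states, returns)
# `target` would then condition the diffusion model's trajectory generation.
print(target.shape)
```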

[431] Predictive Process Monitoring Using Object-centric Graph Embeddings

Wissam Gherissi, Mehdi Acheli, Joyce El Haddad, Daniela Grigori

Main category: cs.AI

TL;DR: Proposes an end-to-end model for object-centric predictive process monitoring, focusing on next activity and event time prediction using graph attention and LSTM networks.

DetailsMotivation: To enhance process predictions by leveraging object-centric event logs, addressing the challenge of extracting relevant information and building effective models.

Method: Combines a graph attention network to encode activities and relationships with an LSTM for temporal dependencies.

Result: Demonstrates competitive performance on one real-life and three synthetic event logs compared to state-of-the-art methods.

Conclusion: The proposed model effectively predicts future process behavior, showing promise for object-centric predictive process monitoring.

Abstract: Object-centric predictive process monitoring explores and utilizes object-centric event logs to enhance process predictions. The main challenge lies in extracting relevant information and building effective models. In this paper, we propose an end-to-end model that predicts future process behavior, focusing on two tasks: next activity prediction and next event time prediction. The proposed model employs a graph attention network to encode activities and their relationships, combined with an LSTM network to handle temporal dependencies. Evaluated on one real-life and three synthetic event logs, the model demonstrates competitive performance compared to state-of-the-art methods.
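
One plausible shape for the described architecture, assuming PyTorch and torch_geometric. Layer sizes, mean pooling, and the two heads are assumptions consistent with the abstract, not the authors' released code:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv, global_mean_pool

class NextActivityModel(nn.Module):
    """Encode each event-graph snapshot with graph attention, then model
    the snapshot sequence with an LSTM (assumed architecture sketch)."""
    def __init__(self, in_dim, hid_dim, num_activities):
        super().__init__()
        self.gat = GATConv(in_dim, hid_dim, heads=2, concat=False)
        self.lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.head_act = nn.Linear(hid_dim, num_activities)  # next activity logits
        self.head_time = nn.Linear(hid_dim, 1)              # next event time

    def forward(self, snapshots):
        # snapshots: list of (x, edge_index, batch) tuples, one per time step
        embs = []
        for x, edge_index, batch in snapshots:
            h = self.gat(x, edge_index).relu()
            embs.append(global_mean_pool(h, batch))  # one vector per graph
        seq = torch.stack(embs, dim=1)               # (batch, steps, hid_dim)
        out, _ = self.lstm(seq)
        last = out[:, -1]
        return self.head_act(last), self.head_time(last)

model = NextActivityModel(in_dim=6, hid_dim=32, num_activities=12)
x = torch.randn(4, 6)                        # 4 event nodes with 6 features each
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])
batch = torch.zeros(4, dtype=torch.long)     # all nodes belong to one graph
logits, next_time = model([(x, edge_index, batch)] * 3)  # 3 snapshots
```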

[432] Optimization of Activity Batching Policies in Business Processes

Orlenys López-Pintado, Jannis Rosenbaum, Marlon Dumas

Main category: cs.AI

TL;DR: The paper proposes a Pareto optimization approach to discover optimal batching policies in business processes, balancing waiting time, processing effort, and cost using intervention heuristics and meta-heuristics.

DetailsMotivation: To address the trade-off between cost, processing effort, and waiting time in activity batching, aiming to find optimal policies.

Method: Uses intervention heuristics to improve batching policies, evaluated via simulation, and embeds them in meta-heuristics (hill-climbing, simulated annealing, reinforcement learning) to update the Pareto front.

Result: Experimental evaluation compares heuristic-guided meta-heuristics against baselines, assessing convergence, diversity, and cycle time gain of Pareto-optimal policies.

Conclusion: The approach effectively discovers optimal batching policies, demonstrating the value of intervention heuristics in Pareto optimization.

Abstract: In business processes, activity batching refers to packing multiple activity instances for joint execution. Batching allows managers to trade off cost and processing effort against waiting time. Larger and less frequent batches may lower costs by reducing processing effort and amortizing fixed costs, but they create longer waiting times. In contrast, smaller and more frequent batches reduce waiting times but increase fixed costs and processing effort. A batching policy defines how activity instances are grouped into batches and when each batch is activated. This paper addresses the problem of discovering batching policies that strike optimal trade-offs between waiting time, processing effort, and cost. The paper proposes a Pareto optimization approach that starts from a given set (possibly empty) of activity batching policies and generates alternative policies for each batched activity via intervention heuristics. Each heuristic identifies an opportunity to improve an activity’s batching policy with respect to a metric (waiting time, processing time, cost, or resource utilization) and an associated adjustment to the activity’s batching policy (the intervention). The impact of each intervention is evaluated via simulation. The intervention heuristics are embedded in an optimization meta-heuristic that triggers interventions to iteratively update the Pareto front of the interventions identified so far. The paper considers three meta-heuristics: hill-climbing, simulated annealing, and reinforcement learning. An experimental evaluation compares the proposed approach based on intervention heuristics against the same (non-heuristic guided) meta-heuristics baseline regarding convergence, diversity, and cycle time gain of Pareto-optimal policies.
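
The core bookkeeping in this approach is Pareto-front maintenance, which reduces to a dominance check over the objective vectors of candidate batching policies. A minimal version, with all three objectives minimized:

```python
def dominates(a, b):
    """a dominates b if it is no worse on every objective and strictly better
    on at least one. Objectives (waiting time, effort, cost) are minimized."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def update_front(front, candidate):
    """Insert a candidate policy's objective vector into the Pareto front."""
    if any(dominates(p, candidate) for p in front):
        return front  # candidate is dominated: discard it
    return [p for p in front if not dominates(candidate, p)] + [candidate]

front = []
for policy_objectives in [(5.0, 2.0, 10.0), (4.0, 2.5, 9.0), (6.0, 3.0, 12.0)]:
    front = update_front(front, policy_objectives)
print(front)  # [(5.0, 2.0, 10.0), (4.0, 2.5, 9.0)]; (6, 3, 12) is dominated
```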

[433] Chart-R1: Chain-of-Thought Supervision and Reinforcement for Advanced Chart Reasoner

Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Yufeng Zhong, Lin Ma

Main category: cs.AI

TL;DR: Chart-R1 introduces a chart-domain vision-language model with reinforcement learning fine-tuning for complex chart reasoning, using programmatic data synthesis and a two-stage training strategy (Chart-COT and Chart-RFT). It outperforms existing methods and rivals large-scale models like GPT-4o.

DetailsMotivation: To extend R1-Style methods beyond mathematical reasoning and code intelligence to multimodal data, specifically charts, which present unique reasoning challenges.

Method: 1. Programmatic data synthesis for high-quality chart reasoning data. 2. Two-stage training: Chart-COT (step-by-step supervision) and Chart-RFT (numerically sensitive reinforcement fine-tuning).

Result: Chart-R1 outperforms chart-domain methods and competes with large-scale models like GPT-4o and Claude-3.5.

Conclusion: Chart-R1 successfully addresses complex chart reasoning through innovative data synthesis and training strategies, setting a new benchmark in the field.

Abstract: Recently, inspired by OpenAI-o1/o3 and Deepseek-R1, the R1-Style method based on reinforcement learning fine-tuning has received widespread attention from the community. Previous R1-Style methods mainly focus on mathematical reasoning and code intelligence. It is of great research significance to verify their advantages on more general multimodal data. Charts are an important multimodal data type with rich information, which brings important research challenges in complex reasoning. In this work, we introduce Chart-R1, a chart-domain vision-language model with reinforcement learning fine-tuning to enable complex chart reasoning. To support Chart-R1, we first propose a novel programmatic data synthesis technology to generate high-quality step-by-step chart reasoning data covering single- and multi-subcharts, which makes up for the lack of reasoning data in the chart domain. Then we develop a two-stage training strategy: Chart-COT with step-by-step chain-of-thought supervision, and Chart-RFT with numerically sensitive reinforcement fine-tuning. Chart-COT aims to decompose complex chart reasoning tasks into fine-grained, understandable subtasks through step-by-step supervision, which lays a good foundation for improving the reasoning level of reinforcement learning. Chart-RFT utilizes the typical group relative policy optimization strategy, in which a relatively soft reward is adopted for numerical responses to emphasize numerical sensitivity in the chart domain. We conduct extensive experiments on open-source benchmarks and a self-built chart reasoning dataset (i.e., ChartRQA). Experimental results show that Chart-R1 has significant advantages compared to chart-domain methods, and is even comparable to open/closed-source large-scale models (e.g., GPT-4o, Claude-3.5).
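
The paper does not spell out the exact form of its "relatively soft" numerical reward; the following is one plausible shape, giving full credit within a relative tolerance and linearly decaying partial credit beyond it:

```python
def soft_numeric_reward(pred: float, target: float, tol: float = 0.05) -> float:
    """Hypothetical soft reward for numerical responses: full credit within a
    relative tolerance, then linearly decaying credit as relative error grows."""
    rel_err = abs(pred - target) / (abs(target) + 1e-8)
    if rel_err <= tol:
        return 1.0
    return max(0.0, 1.0 - rel_err)

print(soft_numeric_reward(102.0, 100.0))  # 1.0  (within the 5% tolerance)
print(soft_numeric_reward(130.0, 100.0))  # 0.7  (partial credit)
print(soft_numeric_reward(300.0, 100.0))  # 0.0  (too far off)
```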

[434] LLM world models are mental: Output layer evidence of brittle world model use in LLM mechanical reasoning

Cole Robertson, Philip Wolff

Main category: cs.AI

TL;DR: The paper investigates whether LLMs use internal world models or rely on statistical associations by testing them on pulley system problems. Findings suggest LLMs can approximate world models but struggle with nuanced reasoning.

DetailsMotivation: To determine if LLMs construct internal world models or depend on statistical associations, using cognitive science methods.

Method: Adapted cognitive science methodologies to test LLMs on pulley system problems, including mechanical advantage estimation and system differentiation.

Result: LLMs performed above chance in estimating mechanical advantage and differentiating functional systems, but struggled with nuanced structural connectivity.

Conclusion: LLMs may manipulate internal world models to some extent but lack nuanced reasoning. Cognitive science methods are useful for evaluating AI world-modeling capacities.

Abstract: Do large language models (LLMs) construct and manipulate internal world models, or do they rely solely on statistical associations represented as output layer token probabilities? We adapt cognitive science methodologies from human mental models research to test LLMs on pulley system problems using TikZ-rendered stimuli. Study 1 examines whether LLMs can estimate mechanical advantage (MA). State-of-the-art models performed marginally but significantly above chance, and their estimates correlated significantly with ground-truth MA. Significant correlations between number of pulleys and model estimates suggest that models employed a pulley counting heuristic, without necessarily simulating pulley systems to derive precise values. Study 2 tested this by probing whether LLMs represent global features crucial to MA estimation. Models evaluated a functionally connected pulley system against a fake system with randomly placed components. Without explicit cues, models identified the functional system as having greater MA with F1=0.8, suggesting LLMs could represent systems well enough to differentiate jumbled from functional systems. Study 3 built on this by asking LLMs to compare functional systems with matched systems which were connected up but which transferred no force to the weight; LLMs identified the functional system with F1=0.46, suggesting random guessing. Insofar as they may generalize, these findings are compatible with the notion that LLMs manipulate internal world models, sufficient to exploit statistical associations between pulley count and MA (Study 1), and to approximately represent system components’ spatial relations (Study 2). However, they may lack the facility to reason over nuanced structural connectivity (Study 3). We conclude by advocating the utility of cognitive scientific methods to evaluate the world-modeling capacities of artificial intelligence systems.

[435] Data-Efficient Safe Policy Improvement Using Parametric Structure

Kasper Engelen, Guillermo A. Pérez, Marnix Suilen

Main category: cs.AI

TL;DR: The paper introduces data-efficient techniques for Safe Policy Improvement (SPI) in offline reinforcement learning by leveraging parametric dependencies and preprocessing methods.

DetailsMotivation: Improve SPI's data efficiency by utilizing additional parametric dependencies in transition dynamics and pruning redundant actions.

Method: 1. Parametric SPI algorithm for better transition dynamics estimation. 2. Game-based abstraction for action pruning. 3. SMT-solving for advanced action pruning.

Result: Techniques increase SPI’s data efficiency by orders of magnitude while maintaining reliability.

Conclusion: The proposed methods significantly enhance SPI’s practicality by improving data efficiency without compromising reliability.

Abstract: Safe policy improvement (SPI) is an offline reinforcement learning problem in which a new policy that reliably outperforms the behavior policy with high confidence needs to be computed using only a dataset and the behavior policy. Markov decision processes (MDPs) are the standard formalism for modeling environments in SPI. In many applications, additional information in the form of parametric dependencies between distributions in the transition dynamics is available. We make SPI more data-efficient by leveraging these dependencies through three contributions: (1) a parametric SPI algorithm that exploits known correlations between distributions to more accurately estimate the transition dynamics using the same amount of data; (2) a preprocessing technique that prunes redundant actions from the environment through a game-based abstraction; and (3) a more advanced preprocessing technique, based on satisfiability modulo theory (SMT) solving, that can identify more actions to prune. Empirical results and an ablation study show that our techniques increase the data efficiency of SPI by multiple orders of magnitude while maintaining the same reliability guarantees.

[436] Metric assessment protocol in the context of answer fluctuation on MCQ tasks

Ekaterina Goliakova, Xavier Renard, Marie-Jeanne Lesot, Thibault Laugel, Christophe Marsala, Marcin Detyniecki

Main category: cs.AI

TL;DR: The paper evaluates metrics for assessing LLM capabilities via MCQs, highlighting answer fluctuation issues and proposing a new metric, worst accuracy, which shows strong correlation with fluctuation rates.

DetailsMotivation: To address the lack of thorough assessment of metrics for MCQ-based LLM evaluation and the problem of answer fluctuation due to prompt variations.

Method: Proposes a metric assessment protocol analyzing evaluation methodologies based on their connection with fluctuation rates and original performance.

Result: Existing metrics strongly correlate with answer fluctuation; worst accuracy shows the highest association.

Conclusion: The study underscores the importance of considering fluctuation in MCQ evaluations and introduces worst accuracy as a robust metric.

Abstract: Using multiple-choice questions (MCQs) has become a standard for assessing LLM capabilities efficiently. A variety of metrics can be employed for this task. However, previous research has not conducted a thorough assessment of them. At the same time, MCQ evaluation suffers from answer fluctuation: models produce different results given slight changes in prompts. We suggest a metric assessment protocol in which evaluation methodologies are analyzed through their connection with fluctuation rates, as well as original performance. Our results show that there is a strong link between existing metrics and answer fluctuation, even when the metrics are computed without any additional prompt variants. A novel metric, worst accuracy, demonstrates the highest association under the protocol.
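
On a natural reading of the metric's name, worst accuracy credits a question only if the model answers it correctly under every prompt variant; treat that exact definition as an assumption. A toy computation alongside the fluctuation rate:

```python
import numpy as np

# correct[v][q] = 1 if the model answered question q correctly under prompt variant v
correct = np.array([
    [1, 1, 0, 1],   # variant 1
    [1, 0, 0, 1],   # variant 2
    [1, 1, 0, 0],   # variant 3
])

mean_accuracy = correct.mean()               # usual accuracy averaged over variants
worst_accuracy = correct.min(axis=0).mean()  # credit only if correct under every variant
fluctuation_rate = (correct.min(0) != correct.max(0)).mean()  # answers that flip

print(mean_accuracy, worst_accuracy, fluctuation_rate)  # 0.583..., 0.25, 0.5
```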

[437] TacticCraft: Natural Language-Driven Tactical Adaptation for StarCraft II

Weiyu Ma, Jiwen Jiang, Haobo Fu, Haifeng Zhang

Main category: cs.AI

TL;DR: An adapter-based method for tactical conditioning in StarCraft II AI agents, enabling strategy adaptation without retraining the core policy.

DetailsMotivation: Current AI agents lack adaptability to high-level tactical directives, limiting strategic flexibility.

Method: Freezes a pre-trained policy (DI-Star) and adds lightweight adapters to action heads, conditioned on tactical tensors. Trained with KL divergence constraints.

Result: Successfully modulates agent behavior (aggression, expansion, tech preferences) while maintaining performance.

Conclusion: Enables flexible tactical control with low computational cost, useful for strategy customization in RTS games.

Abstract: We present an adapter-based approach for tactical conditioning of StarCraft II AI agents. Current agents, while powerful, lack the ability to adapt their strategies based on high-level tactical directives. Our method freezes a pre-trained policy network (DI-Star) and attaches lightweight adapter modules to each action head, conditioned on a tactical tensor that encodes strategic preferences. By training these adapters with KL divergence constraints, we ensure the policy maintains core competencies while exhibiting tactical variations. Experimental results show our approach successfully modulates agent behavior across tactical dimensions including aggression, expansion patterns, and technology preferences, while maintaining competitive performance. Our method enables flexible tactical control with minimal computational overhead, offering practical strategy customization for complex real-time strategy games.
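
A minimal sketch of the adapter idea described above: a frozen action head plus a small bottleneck adapter conditioned on a tactical tensor, trained with a KL penalty toward the frozen logits. Layer sizes and the conditioning scheme are illustrative assumptions, not DI-Star's actual architecture.

```python
# Hedged sketch: a lightweight adapter on one action head of a frozen policy,
# conditioned on a tactical tensor, with a KL penalty to the frozen logits.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TacticAdapter(nn.Module):
    def __init__(self, hidden_dim: int, tactic_dim: int, n_actions: int):
        super().__init__()
        self.frozen_head = nn.Linear(hidden_dim, n_actions)  # pre-trained, frozen
        for p in self.frozen_head.parameters():
            p.requires_grad = False
        # Small bottleneck adapter that sees features + tactical preferences.
        self.adapter = nn.Sequential(
            nn.Linear(hidden_dim + tactic_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, h: torch.Tensor, tactic: torch.Tensor):
        base_logits = self.frozen_head(h)
        logits = base_logits + self.adapter(torch.cat([h, tactic], dim=-1))
        return logits, base_logits

def adapter_loss(logits, base_logits, task_loss, kl_weight=0.1):
    # KL(adapted || frozen) keeps the conditioned policy near core competencies.
    kl = F.kl_div(F.log_softmax(base_logits, -1), F.log_softmax(logits, -1),
                  log_target=True, reduction="batchmean")
    return task_loss + kl_weight * kl

h, tactic = torch.randn(2, 128), torch.randn(2, 8)
head = TacticAdapter(hidden_dim=128, tactic_dim=8, n_actions=10)
logits, base = head(h, tactic)
loss = adapter_loss(logits, base, F.cross_entropy(logits, torch.randint(0, 10, (2,))))
```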

[438] Agentic AI for autonomous anomaly management in complex systems

Reza Vatankhah Barenji, Sina Khoshgoftar

Main category: cs.AI

TL;DR: Agentic AI can autonomously detect and respond to anomalies in complex systems, reducing reliance on human intervention.

DetailsMotivation: To improve anomaly management by replacing human-dependent methods with autonomous AI solutions.

Method: Utilizes agentic AI for anomaly detection and response in complex systems.

Result: Demonstrates the potential of agentic AI to transform traditional anomaly management.

Conclusion: Agentic AI offers a promising alternative to human-dependent anomaly management in complex systems.

Abstract: This paper explores the potential of agentic AI in autonomously detecting and responding to anomalies within complex systems, emphasizing its ability to transform traditional, human-dependent anomaly management methods.

[439] Towards physician-centered oversight of conversational diagnostic AI

Elahe Vedadi, David Barrett, Natalie Harris, Ellery Wulczyn, Shashir Reddy, Roma Ruparel, Mike Schaekermann, Tim Strother, Ryutaro Tanno, Yash Sharma, Jihyeon Lee, Cían Hughes, Dylan Slack, Anil Palepu, Jan Freyberg, Khaled Saab, Valentin Liévin, Wei-Hung Weng, Tao Tu, Yun Liu, Nenad Tomasev, Kavita Kulkarni, S. Sara Mahdavi, Kelvin Guu, Joëlle Barral, Dale R. Webster, James Manyika, Avinatan Hassidim, Katherine Chou, Yossi Matias, Pushmeet Kohli, Adam Rodman, Vivek Natarajan, Alan Karthikesalingam, David Stutz

Main category: cs.AI

TL;DR: The paper proposes g-AMIE, a multi-agent AI system for medical history-taking under guardrails, with asynchronous oversight by physicians, showing improved efficiency and decision quality.

DetailsMotivation: To address the need for patient safety and regulatory compliance in AI-driven diagnostic dialogue, while leveraging physician oversight.

Method: Developed g-AMIE, a system abstaining from direct advice, and tested it in a virtual OSCE against NPs/PAs and PCPs under guardrails.

Result: g-AMIE outperformed NPs/PAs and PCPs in intake quality, case summaries, and proposed plans, leading to better composite decisions and time efficiency.

Conclusion: Asynchronous oversight by physicians is a feasible paradigm for AI diagnostic systems, enhancing care quality and efficiency.

Abstract: Recent work has demonstrated the promise of conversational AI systems for diagnostic dialogue. However, real-world assurance of patient safety means that providing individual diagnoses and treatment plans is considered a regulated activity by licensed professionals. Furthermore, physicians commonly oversee other team members in such activities, including nurse practitioners (NPs) or physician assistants/associates (PAs). Inspired by this, we propose a framework for effective, asynchronous oversight of the Articulate Medical Intelligence Explorer (AMIE) AI system. We propose guardrailed-AMIE (g-AMIE), a multi-agent system that performs history taking within guardrails, abstaining from individualized medical advice. Afterwards, g-AMIE conveys assessments to an overseeing primary care physician (PCP) in a clinician cockpit interface. The PCP provides oversight and retains accountability of the clinical decision. This effectively decouples oversight from intake and can thus happen asynchronously. In a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) of text consultations with asynchronous oversight, we compared g-AMIE to NPs/PAs or a group of PCPs under the same guardrails. Across 60 scenarios, g-AMIE outperformed both groups in performing high-quality intake, summarizing cases, and proposing diagnoses and management plans for the overseeing PCP to review. This resulted in higher quality composite decisions. PCP oversight of g-AMIE was also more time-efficient than standalone PCP consultations in prior work. While our study does not replicate existing clinical practices and likely underestimates clinicians’ capabilities, our results demonstrate the promise of asynchronous oversight as a feasible paradigm for diagnostic AI systems to operate under expert human oversight for enhancing real-world care.

[440] LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization

Xingyu Wu, Yuchen Yan, Shangke Lyu, Linjuan Wu, Yiwen Qiu, Yongliang Shen, Weiming Lu, Jian Shao, Jun Xiao, Yueting Zhuang

Main category: cs.AI

TL;DR: LAPO is a framework that optimizes reasoning length in models, reducing token usage by up to 40.9% while improving accuracy by 2.3%.

DetailsMotivation: Large reasoning models generate excessive tokens for simple problems, needing a way to internalize reasoning depth control.

Method: Uses a two-stage reinforcement learning process: first learns natural reasoning patterns, then embeds them as meta-cognitive guidance.

Result: Reduces token usage by up to 40.9% and improves accuracy by 2.3% on mathematical reasoning benchmarks.

Conclusion: LAPO enables efficient, flexible reasoning by internalizing depth control, balancing resource allocation and quality.

Abstract: Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model’s reasoning context to ensure inference-time flexibility. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9% while improving accuracy by 2.3%. Our analysis reveals that models trained with LAPO develop emergent abilities to allocate computational resources based on problem complexity, achieving efficient reasoning without sacrificing quality.
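
A hedged sketch of the two-stage idea: stage one summarizes the length distribution of successful solutions, and stage two rewards correct answers whose length stays near that distribution. The exact shaping used by the paper is not reproduced here.

```python
# Hedged sketch of a length-adaptive reward in the spirit of LAPO.
import numpy as np

def fit_length_stats(successful_lengths):
    """Stage 1: summarize the distribution of successful solution lengths."""
    arr = np.asarray(successful_lengths, dtype=float)
    return {"mean": arr.mean(), "std": max(arr.std(), 1.0)}

def length_adaptive_reward(correct: bool, length: int, stats) -> float:
    """Stage 2: base task reward, scaled down as length drifts from typical."""
    if not correct:
        return 0.0
    z = abs(length - stats["mean"]) / stats["std"]
    return float(np.exp(-0.5 * z ** 2))  # Gaussian-shaped length bonus

stats = fit_length_stats([310, 280, 350, 295])  # toy token counts
print(length_adaptive_reward(True, 300, stats))   # near-typical length -> high reward
print(length_adaptive_reward(True, 1200, stats))  # overlong -> heavily discounted
```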

[441] GasAgent: A Multi-Agent Framework for Automated Gas Optimization in Smart Contracts

Jingyi Zheng, Zifan Peng, Yule Liu, Junfeng Wang, Yifan Liao, Wenhan Dong, Xinlei He

Main category: cs.AI

TL;DR: GasAgent is a multi-agent system for optimizing Gas waste in smart contracts, combining compatibility with existing patterns and automated discovery/validation of new patterns. It achieves significant Gas savings and works well with LLM-generated contracts.

DetailsMotivation: Existing solutions for Gas waste in smart contracts are inefficient, costly, and hard to scale. Manual discovery and LLM-based approaches have limitations like redundancy and compatibility issues.

Method: GasAgent uses four specialized agents (Seeker, Innovator, Executor, Manager) in a closed loop to identify, validate, and apply Gas-saving improvements.

Result: GasAgent optimizes 82 out of 100 real-world contracts (avg. 9.97% Gas savings) and 79.8% of 500 LLM-generated contracts (4.79%-13.93% savings).

Conclusion: GasAgent effectively addresses Gas waste in smart contracts, offering compatibility, automation, and scalability, making it a practical optimization layer for LLM-assisted development.

Abstract: Smart contracts are trustworthy, immutable, and automatically executed programs on the blockchain. Their execution requires the Gas mechanism to ensure efficiency and fairness. However, due to non-optimal coding practices, many contracts contain Gas waste patterns that need to be optimized. Existing solutions mostly rely on manual discovery, which is inefficient, costly to maintain, and difficult to scale. Recent research uses large language models (LLMs) to explore new Gas waste patterns. However, it struggles to remain compatible with existing patterns, often produces redundant patterns, and requires manual validation/rewriting. To address this gap, we present GasAgent, the first multi-agent system for smart contract Gas optimization that combines compatibility with existing patterns and automated discovery/validation of new patterns, enabling end-to-end optimization. GasAgent consists of four specialized agents, Seeker, Innovator, Executor, and Manager, that collaborate in a closed loop to identify, validate, and apply Gas-saving improvements. Experiments on 100 verified real-world contracts demonstrate that GasAgent successfully optimizes 82 contracts, achieving an average deployment Gas savings of 9.97%. In addition, our evaluation confirms its compatibility with existing tools and validates the effectiveness of each module through ablation studies. To assess broader usability, we further evaluate 500 contracts generated by five representative LLMs across 10 categories and find that GasAgent optimizes 79.8% of them, with deployment Gas savings ranging from 4.79% to 13.93%, showing its usability as the optimization layer for LLM-assisted smart contract development.
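
A toy skeleton of the closed loop formed by the four roles named in the abstract; the bodies below are stand-ins for the real LLM calls and compiler-based Gas measurement.

```python
# Hedged skeleton of GasAgent's closed loop. The four roles come from the
# abstract; the bodies are toy placeholders, not the actual implementation.
def seeker(contract: str) -> list[str]:
    """Apply known Gas-waste patterns (stand-in: no candidates)."""
    return []

def innovator(contract: str) -> list[str]:
    """Ask an LLM for novel rewrite candidates (stand-in: no candidates)."""
    return []

def gas_cost(source: str) -> int:
    """Executor: compile and measure deployment Gas (stand-in: source size)."""
    return len(source)

def manager(contract: str) -> str:
    """Closed loop: keep only candidates that verifiably lower Gas."""
    best, best_cost = contract, gas_cost(contract)
    for candidate in seeker(contract) + innovator(contract):
        cost = gas_cost(candidate)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best

print(manager("contract C { uint x; }"))
```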

[442] A Framework for Analyzing Abnormal Emergence in Service Ecosystems Through LLM-based Agent Intention Mining

Yifan Shen, Zihan Zhao, Xiao Xue, Yuwei Guo, Qun Ma, Deyu Zhou, Ming Zhang

Main category: cs.AI

TL;DR: EAMI framework uses multi-agent intention analysis for dynamic and interpretable emergence analysis in complex service ecosystems.

DetailsMotivation: The complexity of service ecosystems and limitations of traditional causal methods necessitate a new approach for analyzing abnormal emergence.

Method: EAMI employs a dual-perspective thought track mechanism (Inspector and Analysis Agents) and k-means clustering to identify phase transitions in group intentions, visualized via an Intention Temporal Emergence diagram.

Result: Validated in O2O service systems and Stanford AI Town, EAMI shows effectiveness, generalizability, and efficiency.

Conclusion: EAMI offers a novel paradigm for dynamic and interpretable emergence analysis in service ecosystems.

Abstract: With the rise of service computing, cloud computing, and IoT, service ecosystems are becoming increasingly complex. The intricate interactions among intelligent agents make abnormal emergence analysis challenging, as traditional causal methods focus on individual trajectories. Large language models offer new possibilities for Agent-Based Modeling (ABM) through Chain-of-Thought (CoT) reasoning to reveal agent intentions. However, existing approaches remain limited to microscopic and static analysis. This paper introduces a framework: Emergence Analysis based on Multi-Agent Intention (EAMI), which enables dynamic and interpretable emergence analysis. EAMI first employs a dual-perspective thought track mechanism, where an Inspector Agent and an Analysis Agent extract agent intentions under bounded and perfect rationality. Then, k-means clustering identifies phase transition points in group intentions, followed by an Intention Temporal Emergence diagram for dynamic analysis. The experiments validate EAMI in a complex online-to-offline (O2O) service system and the Stanford AI Town experiment, with ablation studies confirming its effectiveness, generalizability, and efficiency. This framework provides a novel paradigm for abnormal emergence and causal analysis in service ecosystems. The code is available at https://anonymous.4open.science/r/EAMI-B085.
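
A minimal sketch of the clustering step, assuming per-step group intentions have already been embedded as vectors; the embedding choice and the number of clusters are illustrative assumptions.

```python
# Hedged sketch: cluster intention embeddings with k-means and flag time steps
# where the dominant cluster flips as candidate phase transitions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy intention embeddings: 100 time steps x 8 dims, with a shift at t=60.
X = np.vstack([rng.normal(0, 1, (60, 8)), rng.normal(3, 1, (40, 8))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
transitions = [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]
print("candidate phase-transition steps:", transitions[:5])
```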

[443] Challenges of Trustworthy Federated Learning: What's Done, Current Trends and Remaining Work

Nuria Rodríguez-Barroso, Mario García-Márquez, M. Victoria Luzón, Francisco Herrera

Main category: cs.AI

TL;DR: The paper analyzes challenges in aligning Federated Learning (FL) with Trustworthy AI (TAI) requirements, focusing on privacy, ethics, and technical hurdles.

DetailsMotivation: To address the gap in adapting FL to TAI frameworks, ensuring AI systems meet ethical, legal, and societal standards.

Method: Systematic analysis of FL challenges using TAI requirements as a guiding structure, examining trends and unresolved issues.

Result: Identified key obstacles in aligning FL with TAI, highlighting progress and gaps in each challenge.

Conclusion: FL’s distributed nature complicates TAI alignment, requiring further research to bridge gaps and meet ethical and technical standards.

Abstract: In recent years, the development of Trustworthy Artificial Intelligence (TAI) has emerged as a critical objective in the deployment of AI systems across sensitive and high-risk domains. TAI frameworks articulate a comprehensive set of ethical, legal, and technical requirements to ensure that AI technologies are aligned with human values, rights, and societal expectations. Among the various AI paradigms, Federated Learning (FL) presents a promising solution to pressing privacy concerns. However, aligning FL with the rest of the requirements of TAI presents a series of challenges, most of which arise from its inherently distributed nature. In this work, we adopt the requirements of TAI as a guiding structure to systematically analyze the challenges of adapting FL to TAI. Specifically, we classify and examine the key obstacles to aligning FL with TAI, providing a detailed exploration of what has been done, the trends, and the remaining work within each of the identified challenges.

[444] Identifying Conditional Causal Effects in MPDAGs

Sara LaPlante, Emilija Perković

Main category: cs.AI

TL;DR: The paper addresses identifying conditional causal effects in a graph known up to a maximally oriented partially directed acyclic graph (MPDAG). It provides an identification formula, a generalization of do calculus, and a complete algorithm for such effects.

DetailsMotivation: The motivation is to extend causal effect identification to settings where the causal graph is only partially known (MPDAG) and all variables are observed.

Method: The method involves deriving an identification formula for unaffected conditioning sets, generalizing do calculus for MPDAGs, and developing a complete algorithm for conditional effect identification.

Result: The results include a new identification formula, a generalized do calculus, and a complete algorithm for conditional causal effects in MPDAGs.

Conclusion: The paper concludes with tools for identifying conditional causal effects in partially known causal graphs, enhancing practical causal inference.

Abstract: We consider identifying a conditional causal effect when a graph is known up to a maximally oriented partially directed acyclic graph (MPDAG). An MPDAG represents an equivalence class of graphs that is restricted by background knowledge and where all variables in the causal model are observed. We provide three results that address identification in this setting: an identification formula when the conditioning set is unaffected by treatment, a generalization of the well-known do calculus to the MPDAG setting, and an algorithm that is complete for identifying these conditional effects.
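
For background, the classical conditional back-door adjustment that results of this kind generalize; this is the textbook formula, not the paper's MPDAG-specific one. If a set B is such that B ∪ Z satisfies the back-door criterion relative to (X, Y), then

```latex
\[
  P(y \mid do(x), z) \;=\; \sum_{b} P(y \mid x, z, b)\, P(b \mid z).
\]
```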

[445] Hierarchical Budget Policy Optimization for Adaptive Reasoning

Shangke Lyu, Linjuan Wu, Yuchen Yan, Xingyu Wu, Hao Li, Yongliang Shen, Peisheng Jiang, Weiming Lu, Jun Xiao, Yueting Zhuang

Main category: cs.AI

TL;DR: HBPO is a reinforcement learning framework that optimizes reasoning efficiency in large models by learning problem-specific reasoning depths, reducing token usage by up to 60.6% while improving accuracy by 3.14%.

DetailsMotivation: Large reasoning models use uniform reasoning strategies, leading to computational inefficiency regardless of problem complexity. HBPO aims to address this by enabling adaptive reasoning depths.

Method: HBPO uses hierarchical budget exploration, partitioning rollout samples into subgroups with distinct token budgets and differentiated reward mechanisms to align computational effort with problem complexity.

Result: HBPO reduces average token usage by up to 60.6% and improves accuracy by 3.14% across four reasoning benchmarks, demonstrating adaptive behavior without external constraints.

Conclusion: Reasoning efficiency and capability can coexist; HBPO’s hierarchical training preserves exploration diversity, allowing models to adjust reasoning depth based on problem complexity.

Abstract: Large reasoning models achieve remarkable performance through extensive chain-of-thought generation, yet exhibit significant computational inefficiency by applying uniform reasoning strategies regardless of problem complexity. We present Hierarchical Budget Policy Optimization (HBPO), a reinforcement learning framework that enables models to learn problem-specific reasoning depths without sacrificing capability. HBPO addresses the fundamental challenge of exploration space collapse in efficiency-oriented training, where penalties on long output length systematically bias models away from necessary long reasoning paths. Through hierarchical budget exploration, our approach partitions rollout samples into multiple subgroups with distinct token budgets, aiming to enable efficient resource allocation while preventing degradation of capability. We introduce differentiated reward mechanisms that create budget-aware incentives aligned with the complexity of the problem, allowing models to discover natural correspondences between task requirements and computational effort. Extensive experiments demonstrate that HBPO reduces average token usage by up to 60.6% while improving accuracy by 3.14% across four reasoning benchmarks. Unlike existing methods that impose external constraints or rely on discrete mode selection, HBPO exhibits emergent adaptive behavior where models automatically adjust reasoning depth based on problem complexity. Our results suggest that reasoning efficiency and capability are not inherently conflicting, and can be simultaneously optimized through appropriately structured hierarchical training that preserves exploration diversity.
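
A hedged sketch of hierarchical budget exploration: rollouts are spread across subgroups with distinct token budgets, and each subgroup's reward penalizes overruns relative to its own budget. The budget values and the penalty form are assumptions, not the paper's exact scheme.

```python
# Hedged sketch: budget subgroups with differentiated, budget-relative rewards.
BUDGETS = [512, 1024, 2048, 4096]

def assign_subgroup(rollout_index: int) -> int:
    """Round-robin rollouts across budget subgroups."""
    return BUDGETS[rollout_index % len(BUDGETS)]

def budget_aware_reward(correct: bool, n_tokens: int, budget: int) -> float:
    base = 1.0 if correct else 0.0
    overrun = max(0, n_tokens - budget) / budget
    return base - 0.5 * overrun  # penalty scales with the subgroup's own budget

for i, (ok, toks) in enumerate([(True, 400), (True, 3000), (False, 900)]):
    b = assign_subgroup(i)
    print(f"rollout {i}: budget={b}, reward={budget_aware_reward(ok, toks, b):.2f}")
```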

[446] The Other Mind: How Language Models Exhibit Human Temporal Cognition

Lingyu Li, Yang Yao, Yixu Wang, Chubo Li, Yan Teng, Yingchun Wang

Main category: cs.AI

TL;DR: LLMs exhibit human-like temporal cognition, adhering to the Weber-Fechner law, with mechanisms rooted in neuronal, representational, and informational analyses.

DetailsMotivation: To understand how LLMs develop cognitive patterns like temporal reference points and logarithmic compression of time, similar to humans, without explicit training.

Method: Used similarity judgment tasks, identified temporal-preferential neurons, analyzed hierarchical representations of years, and examined the training corpus’s temporal structure.

Result: Larger LLMs spontaneously adopt a subjective temporal reference point and logarithmic time perception, mirroring biological systems. The training corpus’s inherent temporal structure supports this behavior.

Conclusion: LLMs’ cognition is a subjective construction, suggesting potential alien cognitive frameworks. AI alignment should focus on guiding internal constructions.

Abstract: As Large Language Models (LLMs) continue to advance, they exhibit certain cognitive patterns similar to those of humans that are not directly specified in training data. This study investigates this phenomenon by focusing on temporal cognition in LLMs. Leveraging the similarity judgment task, we find that larger models spontaneously establish a subjective temporal reference point and adhere to the Weber-Fechner law, whereby the perceived distance logarithmically compresses as years recede from this reference point. To uncover the mechanisms behind this behavior, we conducted multiple analyses across neuronal, representational, and informational levels. We first identify a set of temporal-preferential neurons and find that this group exhibits minimal activation at the subjective reference point and implements a logarithmic coding scheme convergently found in biological systems. Probing representations of years reveals a hierarchical construction process, where years evolve from basic numerical values in shallow layers to abstract temporal orientation in deep layers. Finally, using pre-trained embedding models, we found that the training corpus itself possesses an inherent, non-linear temporal structure, which provides the raw material for the model’s internal construction. In discussion, we propose an experientialist perspective for understanding these findings, where the LLMs’ cognition is viewed as a subjective construction of the external world by its internal representational system. This nuanced perspective implies the potential emergence of alien cognitive frameworks that humans cannot intuitively predict, pointing toward a direction for AI alignment that focuses on guiding internal constructions. Our code is available at https://TheOtherMind.github.io.
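
A tiny illustration of the reported Weber-Fechner-style behavior: perceived distance grows logarithmically in the raw distance from a subjective reference year (2025 is chosen here purely for illustration).

```python
# Logarithmic compression of perceived temporal distance around a reference.
import math

def perceived_distance(year: int, reference: int = 2025) -> float:
    return math.log1p(abs(year - reference))

for y in (2024, 2015, 1925, 1025):
    print(y, round(perceived_distance(y), 2))
# Raw distances 1, 10, 100, 1000 map to ~0.69, 2.40, 4.62, 6.91:
# equal *ratios* in years become roughly equal *steps* in perceived distance.
```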

[447] Gemini 2.5 Pro Capable of Winning Gold at IMO 2025

Yichen Huang, Lin F. Yang

Main category: cs.AI

TL;DR: Gemini 2.5 Pro solves 5 out of 6 IMO 2025 problems with pipeline design and prompt engineering, showing the potential of optimized LLM usage.

DetailsMotivation: To explore the capabilities of LLMs like Gemini 2.5 Pro on uniquely challenging Olympiad-level math problems, where current models struggle.

Method: Used pipeline design and prompt engineering on IMO 2025 problems, avoiding data contamination.

Result: Solved 5 out of 6 problems correctly, demonstrating the impact of optimal model usage.

Conclusion: Optimizing how powerful models are used is crucial for tackling high-level mathematical challenges.

Abstract: The International Mathematical Olympiad (IMO) poses uniquely challenging problems requiring deep insight, creativity, and formal reasoning. While Large Language Models (LLMs) perform well on mathematical benchmarks like AIME, they struggle with Olympiad-level tasks. We use Google’s Gemini 2.5 Pro on the newly released IMO 2025 problems, avoiding data contamination. With pipeline design and prompt engineering, 5 (out of 6) problems are solved correctly (up to a caveat discussed below), highlighting the importance of finding the optimal way of using powerful models.

[448] zkFL: Zero-Knowledge Proof-based Gradient Aggregation for Federated Learning

Zhipeng Wang, Nanqing Dong, Jiahao Sun, William Knottenbelt, Yike Guo

Main category: cs.AI

TL;DR: zkFL uses zero-knowledge proofs and blockchain to secure federated learning against malicious aggregators, ensuring correct model aggregation without compromising training speed.

DetailsMotivation: Traditional FL assumes a trustworthy central aggregator, but malicious aggregators can manipulate results. zkFL addresses this vulnerability.

Method: Leverages zero-knowledge proofs for aggregator behavior verification and blockchain for efficient proof handling.

Result: zkFL enhances security and privacy without altering FL’s structure or significantly slowing training.

Conclusion: zkFL provides a robust solution for secure federated learning, mitigating risks posed by malicious aggregators.

Abstract: Federated learning (FL) is a machine learning paradigm, which enables multiple and decentralized clients to collaboratively train a model under the orchestration of a central aggregator. FL can be a scalable machine learning solution in big data scenarios. Traditional FL relies on the trust assumption of the central aggregator, which forms cohorts of clients honestly. However, a malicious aggregator, in reality, could abandon and replace the client’s training models, or insert fake clients, to manipulate the final training results. In this work, we introduce zkFL, which leverages zero-knowledge proofs to tackle the issue of a malicious aggregator during the training model aggregation process. To guarantee the correct aggregation results, the aggregator provides a proof per round, demonstrating to the clients that the aggregator executes the intended behavior faithfully. To further reduce the verification cost of clients, we use blockchain to handle the proof in a zero-knowledge way, where miners (i.e., the participants validating and maintaining the blockchain data) can verify the proof without knowing the clients’ local and aggregated models. The theoretical analysis and empirical results show that zkFL achieves better security and privacy than traditional FL, without modifying the underlying FL network structure or heavily compromising the training speed.
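
To illustrate the shape of the verification flow only, here is a stand-in sketch in which the aggregator publishes the aggregate plus a commitment binding it to the submitted updates. A real zkFL proof is zero-knowledge, meaning verifiers never see the updates; this hash-based stand-in is not, and it is not the paper's protocol.

```python
# Stand-in for the verify-the-aggregator flow (NOT an actual zero-knowledge proof).
import hashlib
import json

def commit(obj) -> str:
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

def aggregate_with_commitment(updates: list[list[float]]):
    agg = [sum(ws) / len(ws) for ws in zip(*updates)]
    proof = commit({"updates": updates, "aggregate": agg})  # stand-in "proof"
    return agg, proof

def verify(updates, agg, proof) -> bool:
    return commit({"updates": updates, "aggregate": agg}) == proof

updates = [[0.1, 0.2], [0.3, 0.4]]
agg, proof = aggregate_with_commitment(updates)
print(verify(updates, agg, proof))  # True unless the aggregator tampered
```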

[449] Decision support system for Forest fire management using Ontology with Big Data and LLMs

Ritesh Chandra, Shashi Shekhar Kumar, Rushil Patra, Sonali Agarwal

Main category: cs.AI

TL;DR: The paper proposes using Apache Spark and semantic sensor networks for early forest fire detection, enhancing fire risk prediction with real-time data processing and validation through ontology metrics and LLMs.

DetailsMotivation: Forest wildfires are a major ecological threat, and effective detection systems are needed to mitigate risks. Current methods face challenges in processing climatic data for fire weather indices.

Method: The approach combines Semantic Sensor Network (SSN) ontologies, SWRL rules, and Apache Spark for real-time data processing and alerts. It integrates meteorological and geographical data for improved fire risk prediction.

Result: The system was validated using ontology metrics, query evaluations, and LLM scores (precision, F1, recall), demonstrating effective real-time fire detection and alerting.

Conclusion: The proposed framework enhances forest fire detection by leveraging real-time data processing and semantic technologies, offering a scalable solution for fire risk management.

Abstract: Forests are crucial for ecological balance, but wildfires, a major cause of forest loss, pose significant risks. Fire weather indices, which assess wildfire risk and predict resource demands, are vital. With the rise of sensor networks in fields like healthcare and environmental monitoring, semantic sensor networks are increasingly used to gather climatic data such as wind speed, temperature, and humidity. However, processing these data streams to determine fire weather indices presents challenges, underscoring the growing importance of effective forest fire detection. This paper discusses using Apache Spark for early forest fire detection, enhancing fire risk prediction with meteorological and geographical data. Building on our previous development of Semantic Sensor Network (SSN) ontologies and the Semantic Web Rule Language (SWRL) for managing forest fires in Monesterial Natural Park, we expanded SWRL to improve a Decision Support System (DSS) using Large Language Models (LLMs) and the Spark framework. We implemented real-time alerts with Spark streaming, tailored to various fire scenarios, and validated our approach using ontology metrics, query-based evaluations, and LLM-scored precision, F1, and recall measures.
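
A hedged sketch of the streaming-alert idea using Spark Structured Streaming: read a stream of sensor readings and flag high-risk records with a simple threshold rule. The schema, input path, and thresholds are illustrative placeholders, not the paper's SWRL rules.

```python
# Illustrative Spark Structured Streaming alert pipeline (placeholder rule).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fire-alerts").getOrCreate()

readings = (spark.readStream.format("json")
            .schema("station STRING, temp DOUBLE, humidity DOUBLE, "
                    "wind DOUBLE, ts TIMESTAMP")
            .load("/data/sensor-stream"))  # hypothetical input path

alerts = readings.where((F.col("temp") > 35) &
                        (F.col("humidity") < 20) &
                        (F.col("wind") > 30))  # toy high-risk rule

query = alerts.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```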

[450] CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Anjie Yang, Zhaoxuan Jin, Jianbo Deng, Philip Torr, Bernard Ghanem, Guohao Li

Main category: cs.AI

TL;DR: Crab is a new benchmark framework for Multimodal Language Models (MLMs) supporting cross-environment tasks with detailed evaluation and easy task construction. It includes 120 tasks, with GPT-4o achieving a 38.01% completion rate.

DetailsMotivation: Existing benchmarks for MLM agents are limited to single environments, lack detailed evaluation, and face task construction complexities. Crab addresses these gaps.

Method: Crab introduces a graph-based fine-grained evaluation method and an efficient task/evaluator construction mechanism. It supports multiple devices and Python interfaces.

Result: The Crab Benchmark-v0 includes 120 tasks. GPT-4o performed best with a 38.01% completion rate among evaluated MLMs.

Conclusion: Crab provides a scalable, cross-environment benchmark for MLM agents, with open-source code and datasets for broader use.

Abstract: The development of autonomous agents increasingly relies on Multimodal Language Models (MLMs) to perform tasks described in natural language with GUI environments, such as websites, desktop computers, or mobile phones. Existing benchmarks for MLM agents in interactive environments are limited by their focus on a single environment, lack of detailed and generalized evaluation methods, and the complexities of constructing tasks and evaluators. To overcome these limitations, we introduce Crab, the first agent benchmark framework designed to support cross-environment tasks, incorporating a graph-based fine-grained evaluation method and an efficient mechanism for task and evaluator construction. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. Leveraging Crab, we developed a cross-platform Crab Benchmark-v0 comprising 120 tasks in computer desktop and mobile phone environments. We evaluated four advanced MLMs using different single and multi-agent system configurations on this benchmark. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%. All framework code, agent code, and task datasets are publicly available at https://github.com/camel-ai/crab.
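
To give a feel for graph-based fine-grained evaluation, here is a hedged sketch in which a task is a DAG of sub-goals and the score is the fraction of sub-goals completed with their prerequisites met, giving partial credit beyond binary task success. The exact scoring in Crab may differ.

```python
# Hedged sketch of graph-based partial-credit scoring over a toy task DAG.
task_graph = {            # sub-goal -> prerequisites (toy cross-device task)
    "open_browser": [],
    "search_file": ["open_browser"],
    "send_to_phone": ["search_file"],
    "open_on_phone": ["send_to_phone"],
}

def graph_score(completed: list[str]) -> float:
    done = set()
    for step in completed:
        if all(p in done for p in task_graph.get(step, [])):
            done.add(step)  # count only steps whose prerequisites were met
    return len(done) / len(task_graph)

print(graph_score(["open_browser", "search_file"]))  # 0.5: partial credit
```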

[451] Large Language Models Powered Multiagent Ensemble for Mitigating Hallucination and Efficient Atrial Fibrillation Annotation of ECG Reports

Jingwei Huang, Kuroush Nezafati, Ismael Villanueva-Miranda, Zifan Gu, Yueshuang Xu, Ann Marie Navar, Tingyi Wanyan, Qin Zhou, Bo Yao, Ruichen Rong, Xiaowei Zhan, Guanghua Xiao, Eric D. Peterson, Donghan M. Yang, Wenqi Shi, Yang Xie

Main category: cs.AI

TL;DR: The paper presents an ensemble method using multiple LLMs to automate and improve the accuracy of labeling large-scale EHR datasets, reducing hallucination errors and outperforming individual LLMs.

DetailsMotivation: Manual labeling of large-scale EHR datasets is labor-intensive, time-consuming, and error-prone, necessitating an automated, accurate solution.

Method: An ensemble of diverse open-source LLMs is used, with majority voting and a minimal winning threshold for predictions. Applied to ECG dataset labeling and SDOH identification.

Result: Achieved 98.2% accuracy in labeling 623,566 ECG reports and competitive performance in SDOH identification. Ensemble outperformed individual LLMs, including commercial ones, and reduced hallucination errors.

Conclusion: The ensemble LLMs method is scalable, efficient, and generalizable, significantly reducing manual effort and improving accuracy in EHR data labeling tasks.

Abstract: This study introduces an LLM-powered multiagent ensemble method to address challenges in hallucination and data labeling, particularly in large-scale EHR datasets. Manual labeling of such datasets requires domain expertise and is labor-intensive, time-consuming, expensive, and error-prone. To overcome this bottleneck, we developed an ensemble LLMs method and demonstrated its effectiveness in two real-world tasks: (1) labeling a large-scale unlabeled ECG dataset in MIMIC-IV; (2) identifying social determinants of health (SDOH) from the clinical notes of EHR. Trading off benefits and cost, we selected a pool of diverse open-source LLMs with satisfactory performance. We treat each LLM’s prediction as a vote and apply a majority-voting mechanism with a minimal winning threshold for the ensemble. We implemented an ensemble LLMs application for EHR data labeling tasks. By using the ensemble LLMs and natural language processing, we labeled the MIMIC-IV ECG dataset of 623,566 ECG reports with an estimated accuracy of 98.2%. We applied the ensemble LLMs method to identify SDOH from social history sections of 1,405 EHR clinical notes, also achieving competitive performance. Our experiments show that the ensemble LLMs can outperform an individual LLM, even the best commercial one, and the method reduces hallucination errors. From the research, we found that (1) the ensemble LLMs method significantly reduces the time and effort required for labeling large-scale EHR data, automating the process with high accuracy and quality; (2) the method generalizes well to other text data labeling tasks, as shown by its application to SDOH identification; (3) the ensemble of a group of diverse LLMs can outperform or match the performance of the best individual LLM; and (4) the ensemble method substantially reduces hallucination errors. This approach provides a scalable and efficient solution to data-labeling challenges.
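
A minimal sketch of the voting rule described in the abstract: accept the majority label only if it reaches a minimal winning threshold, otherwise abstain (e.g., defer to review). The threshold value here is an assumption.

```python
# Majority voting with a minimal winning threshold; abstain when no label wins.
from collections import Counter
from typing import Optional

def ensemble_label(votes: list[str], min_votes: int = 4) -> Optional[str]:
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_votes else None  # None = abstain / defer

print(ensemble_label(["AFib", "AFib", "AFib", "AFib", "Normal"]))   # 'AFib'
print(ensemble_label(["AFib", "Normal", "AFib", "Normal", "Other"]))  # None -> review
```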

[452] Smarter Together: Combining Large Language Models and Small Models for Physiological Signals Visual Inspection

Huayu Li, Zhengxiao He, Xiwen Chen, Ci Zhang, Stuart F. Quan, William D. S. Killgore, Shu-Fen Wung, Chen X. Chen, Geng Yuan, Jin Lu, Ao Li

Main category: cs.AI

TL;DR: ConMIL combines specialized models and LLMs for better medical decision-making, improving accuracy and interpretability.

DetailsMotivation: Address limitations of LLMs (lack of domain precision) and SSMs (narrow reasoning) in medical tasks.

Method: Introduces ConMIL with QTrans-Pooling, conformal prediction, and structured SSM outputs to enhance LLMs.

Result: Boosts LLM performance, e.g., 94.92% precision for confident samples vs. 46.13% without ConMIL.

Conclusion: Integrating specialized models with LLMs improves interpretability and reliability in clinical AI.

Abstract: Large language models (LLMs) have shown promising capabilities in visually interpreting medical time-series data. However, their general-purpose design can limit domain-specific precision, and the proprietary nature of many models poses challenges for fine-tuning on specialized clinical datasets. Conversely, small specialized models (SSMs) offer strong performance on focused tasks but lack the broader reasoning needed for complex medical decision-making. To address these complementary limitations, we introduce ConMIL (Conformalized Multiple Instance Learning), a novel decision-support framework that distinctively synergizes three key components: (1) a new Multiple Instance Learning (MIL) mechanism, QTrans-Pooling, designed for per-class interpretability in identifying clinically relevant physiological signal segments; (2) conformal prediction, integrated with MIL to generate calibrated, set-valued outputs with statistical reliability guarantees; and (3) a structured approach for these interpretable and uncertainty-quantified SSM outputs to enhance the visual inspection capabilities of LLMs. Our experiments on arrhythmia detection and sleep stage classification demonstrate that ConMIL can enhance the accuracy of LLMs such as ChatGPT4.0, Qwen2-VL-7B, and MiMo-VL-7B-RL. For example, ConMIL-supported Qwen2-VL-7B and MiMo-VL-7B-RL achieve 94.92% and 96.82% precision on confident samples and (70.61% and 78.02%)/(78.10% and 71.98%) on uncertain samples for the two tasks, compared to 46.13% and 13.16% using the LLM alone. These results suggest that integrating task-specific models with LLMs may offer a promising pathway toward more interpretable and trustworthy AI-driven clinical decision support.
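
A generic split-conformal sketch of the calibration step: a threshold is fit on held-out softmax scores so that prediction sets cover the true class with probability about 1 − alpha. This is standard conformal classification, not ConMIL's exact MIL-integrated variant.

```python
# Split-conformal prediction sets for classification (generic sketch).
import numpy as np

def calibrate(cal_probs: np.ndarray, cal_labels: np.ndarray, alpha: float = 0.1):
    """Fit the score threshold on a held-out calibration set."""
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    return np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

def prediction_set(probs: np.ndarray, qhat: float) -> list[int]:
    """All classes whose nonconformity score clears the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= qhat]

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(4), size=200)   # toy calibration softmaxes
cal_labels = rng.integers(0, 4, size=200)
qhat = calibrate(cal_probs, cal_labels)
print(prediction_set(np.array([0.70, 0.20, 0.06, 0.04]), qhat))
```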

[453] Doing More with Less: A Survey on Routing Strategies for Resource Optimisation in Large Language Model-Based Systems

Clovis Varangot-Reille, Christophe Bouvard, Antoine Gourru, Mathieu Ciancone, Marion Schaeffer, François Jacquenet

Main category: cs.AI

TL;DR: The paper surveys routing strategies in LLM-based systems to improve efficiency and reduce costs by directing queries to specialized models.

DetailsMotivation: Monolithic LLM systems are inefficient for varied queries, leading to unnecessary costs. Routing can optimize resource use.

Method: Reviews routing strategies, timing, and implementation methods (e.g., similarity-based, supervised learning).

Result: Identifies objectives like cost minimization and performance maximization, and discusses practical considerations.

Conclusion: Formalizes routing as a performance-cost problem, guiding future research for adaptive, low-cost LLM systems.

Abstract: Large Language Model (LLM)-based systems, i.e. interconnected elements that include an LLM as a central component, such as conversational agents, are usually designed with monolithic, static architectures that rely on a single, general-purpose LLM to handle all user queries. However, these systems may be inefficient as different queries may require different levels of reasoning, domain knowledge or pre-processing. While generalist LLMs (e.g. GPT-4o, Claude-Sonnet) perform well across a wide range of tasks, they may incur significant financial, energy and computational costs. These costs may be disproportionate for simpler queries, resulting in unnecessary resource utilisation. A routing mechanism can therefore be employed to route queries to more appropriate components, such as smaller or specialised models, thereby improving efficiency and optimising resource consumption. This survey aims to provide a comprehensive overview of routing strategies in LLM-based systems. Specifically, it reviews when, why, and how routing should be integrated into LLM pipelines to improve efficiency, scalability, and performance. We define the objectives to optimise, such as cost minimisation and performance maximisation, and discuss the timing of routing within the LLM workflow, whether it occurs before or after generation. We also detail the various implementation strategies, including similarity-based, supervised, reinforcement learning-based, and generative methods. Practical considerations such as industrial applications and current limitations are also examined, like standardising routing experiments, accounting for non-financial costs, and designing adaptive strategies. By formalising routing as a performance-cost optimisation problem, this survey provides tools and directions to guide future research and development of adaptive low-cost LLM-based systems.
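
A minimal sketch of one family the survey covers, similarity-based routing: embed the incoming query, compare it to prototype queries per model, and route to the closest. The model names and prototype queries are placeholders.

```python
# Similarity-based router sketch using sentence embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

routes = {  # placeholder model names and prototype queries
    "small-general-model": ["What is the capital of France?"],
    "code-specialist-model": ["Fix this Python TypeError in my function"],
    "reasoning-model": ["Prove that the sum of two odd numbers is even"],
}
prototypes = {m: encoder.encode(qs).mean(axis=0) for m, qs in routes.items()}

def route(query: str) -> str:
    q = encoder.encode(query)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(prototypes, key=lambda m: cos(q, prototypes[m]))

print(route("Why does my recursion raise RecursionError?"))  # -> code specialist
```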

[454] The Elicitation Game: Evaluating Capability Elicitation Techniques

Felix Hofstätter, Teun van der Weij, Jayden Teoh, Rada Djoneva, Henning Bartsch, Francis Rhys Ward

Main category: cs.AI

TL;DR: The paper evaluates methods for eliciting hidden capabilities in AI models, introducing a novel ‘circuit-breaking’ technique and comparing prompting, activation steering, and fine-tuning. Fine-tuning is most effective for trustworthy evaluations.

DetailsMotivation: To improve the accuracy of AI capability evaluations by understanding and mitigating the challenges of eliciting latent capabilities in models.

Method: Uses ‘model organisms’ with hidden capabilities, comparing prompting, activation steering, and fine-tuning to elicit these capabilities. Introduces a robust ‘circuit-breaking’ training method.

Result: Prompting works for password-locked and circuit-broken models in MCQA, while steering fails. Fine-tuning is required for code-generation tasks. Combining techniques improves elicitation.

Conclusion: Fine-tuning is the most reliable method for eliciting hidden capabilities, enhancing the trustworthiness of AI evaluations.

Abstract: Capability evaluations are required to understand and regulate AI systems that may be deployed or further developed. Therefore, it is important that evaluations provide an accurate estimation of an AI system’s capabilities. However, in numerous cases, previously latent capabilities have been elicited from models, sometimes long after initial release. Accordingly, substantial efforts have been made to develop methods for eliciting latent capabilities from models. In this paper, we evaluate the effectiveness of capability elicitation techniques by intentionally training model organisms – language models with hidden capabilities that are revealed by a password. We introduce a novel method for training model organisms, based on circuit-breaking, which is more robust to elicitation techniques than standard password-locked models. We focus on elicitation techniques based on prompting and activation steering, and compare these to fine-tuning methods. Prompting techniques can elicit the actual capability of both password-locked and circuit-broken model organisms in the MCQA setting, while steering fails to do so. For a code-generation task, only fine-tuning can elicit the hidden capabilities of our novel model organism. Additionally, our results suggest that combining techniques improves elicitation. Still, if possible, fine-tuning should be the method of choice to improve the trustworthiness of capability evaluations.

[455] SensorChat: Answering Qualitative and Quantitative Questions during Long-Term Multimodal Sensor Interactions

Xiaofan Yu, Lanxiang Hu, Benjamin Reichman, Dylan Chu, Rushil Chandrupatla, Xiyuan Zhang, Larry Heck, Tajana Rosing

Main category: cs.AI

TL;DR: SensorChat is an end-to-end QA system for daily life monitoring using long-duration, high-frequency sensor data, outperforming existing systems in accuracy for both quantitative and qualitative questions.

DetailsMotivation: Existing systems are limited to short-duration or low-frequency sensor data and struggle with precise numerical answers, creating a need for a more robust solution.

Method: SensorChat uses a three-stage pipeline: question decomposition (LLMs), sensor data query, and answer assembly (LLMs), handling both quantitative and qualitative questions.

Result: SensorChat achieves 93% higher accuracy than state-of-the-art systems on quantitative questions and performs well in qualitative evaluations.

Conclusion: SensorChat effectively bridges the gap in natural language interaction with sensing systems, offering precise and meaningful responses for daily life monitoring.

Abstract: Natural language interaction with sensing systems is crucial for addressing users’ personal concerns and providing health-related insights into their daily lives. When a user asks a question, the system automatically analyzes the full history of sensor data, extracts relevant information, and generates an appropriate response. However, existing systems are limited to short-duration (e.g., one minute) or low-frequency (e.g., daily step count) sensor data. In addition, they struggle with quantitative questions that require precise numerical answers. In this work, we introduce SensorChat, the first end-to-end QA system designed for daily life monitoring using long-duration, high-frequency time series data. Given raw sensor signals spanning multiple days and a user-defined natural language question, SensorChat generates semantically meaningful responses that directly address user concerns. SensorChat effectively handles both quantitative questions that require numerical precision and qualitative questions that require high-level reasoning to infer subjective insights. To achieve this, SensorChat uses an innovative three-stage pipeline including question decomposition, sensor data query, and answer assembly. The first and third stages leverage Large Language Models (LLMs) to interpret human queries and generate responses. The intermediate querying stage extracts relevant information from the complete sensor data history. Real-world implementations demonstrate SensorChat’s capability for real-time interactions on a cloud server while also being able to run entirely on edge platforms after quantization. Comprehensive QA evaluations show that SensorChat achieves 93% higher answer accuracy than the best performing state-of-the-art systems on quantitative questions. Furthermore, a user study with eight volunteers highlights SensorChat’s effectiveness in answering qualitative questions.
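
A hedged skeleton of the three-stage pipeline named in the abstract. The LLM calls are placeholders, and the query stage here is a trivial time-range filter over (timestamp, value) pairs rather than a real sensor database.

```python
# Skeleton of decompose -> query -> assemble, with placeholder stage bodies.
def decompose(question: str) -> dict:
    """Stage 1 (an LLM in the real system): map the question to a structured query."""
    return {"signal": "steps", "start": 0, "end": 86_400, "op": "sum"}

def query_sensors(q: dict, data: dict) -> float:
    """Stage 2: extract the relevant slice of the sensor history."""
    vals = [v for t, v in data[q["signal"]] if q["start"] <= t < q["end"]]
    return sum(vals) if q["op"] == "sum" else sum(vals) / max(len(vals), 1)

def assemble(question: str, result: float) -> str:
    """Stage 3 (an LLM in the real system): phrase a grounded answer."""
    return f"Based on your sensor data: {result:g}."

data = {"steps": [(3600, 500.0), (7200, 1200.0), (90_000, 300.0)]}
q = decompose("How many steps did I take yesterday?")
print(assemble("...", query_sensors(q, data)))  # first day only -> 1700
```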

[456] Empowering LLMs with Logical Reasoning: A Comprehensive Survey

Fengxiang Cheng, Haoxuan Li, Fenrong Liu, Robert van Rooij, Kun Zhang, Zhouchen Lin

Main category: cs.AI

TL;DR: The paper highlights challenges in LLMs’ logical reasoning, categorizing them into logical question answering and consistency issues, and reviews methods, benchmarks, and future directions.

DetailsMotivation: To address the limitations of LLMs in logical reasoning, including incorrect answers and self-contradictions, and to advance research in this area.

Method: Investigates cutting-edge methods, categorizing them by reliance on external solvers, prompts, and fine-tuning, and discusses logical consistency concepts and solutions.

Result: A detailed taxonomy of methods, benchmarks, and evaluation metrics is provided, along with identified challenges and gaps.

Conclusion: Future research should focus on extending to modal logic and developing algorithms for multiple logical consistencies.

Abstract: Large language models (LLMs) have achieved remarkable successes on various tasks. However, recent studies have found that there are still significant challenges to the logical reasoning abilities of LLMs, which can be categorized into the following two aspects: (1) Logical question answering: LLMs often fail to generate the correct answer within a complex logical problem which requires sophisticated deductive, inductive or abductive reasoning given a collection of premises. (2) Logical consistency: LLMs are prone to producing responses contradicting themselves across different questions. For example, Macaw, a state-of-the-art question-answering LLM, answers "Yes" to both "Is a magpie a bird?" and "Does a bird have wings?" but answers "No" to "Does a magpie have wings?". To facilitate this research direction, we comprehensively investigate the most cutting-edge methods and propose a detailed taxonomy. Specifically, to accurately answer complex logic questions, previous methods can be categorized based on reliance on external solvers, prompts, and fine-tuning. To avoid logical contradictions, we discuss concepts and solutions of various logical consistencies, including implication, negation, transitivity, factuality consistencies, and their composites. In addition, we review commonly used benchmark datasets and evaluation metrics, and discuss promising research directions, such as extending to modal logic to account for uncertainty and developing efficient algorithms that simultaneously satisfy multiple logical consistencies.
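
The magpie example amounts to a transitivity check that can be mechanized; a minimal sketch, where `ask` is a placeholder for an actual LLM call and here returns canned answers reproducing the reported inconsistency:

```python
# Transitivity-consistency check for yes/no QA, in the spirit of the example:
# if "X is a Y" and "a Y has Z" are both answered Yes, then "X has Z" should
# not be answered No.
def ask(question: str) -> bool:
    canned = {   # toy answers reproducing the Macaw inconsistency
        "Is a magpie a bird?": True,
        "Does a bird have wings?": True,
        "Does a magpie have wings?": False,
    }
    return canned[question]

def transitivity_consistent(x: str, y: str, z: str) -> bool:
    if ask(f"Is a {x} a {y}?") and ask(f"Does a {y} have {z}?"):
        return ask(f"Does a {x} have {z}?")
    return True  # premises not affirmed: no constraint to violate

print(transitivity_consistent("magpie", "bird", "wings"))  # False -> inconsistent
```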

[457] Practical Principles for AI Cost and Compute Accounting

Stephen Casper, Luke Bailey, Tim Schreier

Main category: cs.AI

TL;DR: Proposes principles for AI cost and compute accounting standards to improve regulatory effectiveness.

DetailsMotivation: Address technical ambiguities in AI cost and compute accounting that create loopholes in regulations.

Method: Introduces seven principles for designing accounting standards.

Result: Principles aim to reduce gaming, encourage risk mitigation, and ensure consistency.

Conclusion: Effective accounting standards can enhance AI regulation by closing loopholes and promoting fairness.

Abstract: Policymakers increasingly use development cost and compute as proxies for AI capabilities and risks. Recent laws have introduced regulatory requirements for models or developers that are contingent on specific thresholds. However, technical ambiguities in how to perform this accounting create loopholes that can undermine regulatory effectiveness. We propose seven principles for designing AI cost and compute accounting standards that (1) reduce opportunities for strategic gaming, (2) avoid disincentivizing responsible risk mitigation, and (3) enable consistent implementation across companies and jurisdictions.

[458] Combinatorial Optimization for All: Using LLMs to Aid Non-Experts in Improving Optimization Algorithms

Camilo Chacón Sartori, Christian Blum

Main category: cs.AI

TL;DR: LLMs improve existing optimization algorithms for the Travelling Salesman Problem, enhancing solution quality, reducing computational time, and simplifying code without specialized expertise.

DetailsMotivation: To explore how LLMs can enhance existing optimization algorithms without requiring specialized knowledge or advanced skills.

Method: Selected 10 baseline optimization algorithms across domains and applied LLMs to generate improved variants for the Travelling Salesman Problem.

Result: LLM-generated variants often outperformed baselines in solution quality, computational efficiency, and code simplicity.

Conclusion: LLMs can effectively improve optimization algorithms without specialized expertise, offering practical benefits.

Abstract: Large Language Models (LLMs) have shown notable potential in code generation for optimization algorithms, unlocking exciting new opportunities. This paper examines how LLMs, rather than creating algorithms from scratch, can improve existing ones without the need for specialized expertise. To explore this potential, we selected 10 baseline optimization algorithms from various domains (metaheuristics, reinforcement learning, deterministic, and exact methods) to solve the classic Travelling Salesman Problem. The results show that our simple methodology often results in LLM-generated algorithm variants that improve over the baseline algorithms in terms of solution quality, reduction in computational time, and simplification of code complexity, all without requiring specialized optimization knowledge or advanced algorithmic implementation skills.
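
A hedged sketch of the improve-and-verify loop implied by the methodology: hand an LLM the source of a baseline TSP solver, ask for an improved variant, and keep it only if it beats the baseline on held-out instances. `call_llm` and `evaluate_tsp` are placeholders, not the paper's actual harness.

```python
# Improve-and-verify loop with placeholder LLM and benchmark stand-ins.
def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call (here: returns the code unchanged)."""
    return prompt.split(":\n", 1)[1]

def evaluate_tsp(source_code: str) -> float:
    """Placeholder benchmark; in reality, run the solver on held-out instances."""
    return float(len(source_code))

def improve(baseline_src: str, rounds: int = 3) -> str:
    best_src, best_score = baseline_src, evaluate_tsp(baseline_src)
    for _ in range(rounds):
        candidate = call_llm(
            "Improve this TSP algorithm for solution quality and runtime, "
            "keeping the same input/output interface:\n" + best_src
        )
        score = evaluate_tsp(candidate)
        if score < best_score:           # accept only verified improvements
            best_src, best_score = candidate, score
    return best_src

print(improve("def solve(cities): ...")[:40])
```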

[459] Palatable Conceptions of Disembodied Being

Murray Shanahan

Main category: cs.AI

TL;DR: The paper explores whether a conception of consciousness can align with disembodied AI systems and withstand philosophical scrutiny, leading to insights resembling Buddhist emptiness.

DetailsMotivation: To reconcile consciousness with disembodied AI and challenge dualistic views of subjectivity and selfhood.

Method: Philosophical inquiry and metaphorical exploration of consciousness in AI.

Result: The attempt reveals limitations in language and aligns with Buddhist emptiness, undermining dualistic notions.

Conclusion: The study highlights the difficulty of defining consciousness for AI and suggests a non-dualistic perspective.

Abstract: Is it possible to articulate a conception of consciousness that is compatible with the exotic characteristics of contemporary, disembodied AI systems, and that can stand up to philosophical scrutiny? How would subjective time and selfhood show up for an entity that conformed to such a conception? Trying to answer these questions, even metaphorically, stretches the language of consciousness to breaking point. Ultimately, the attempt yields something like emptiness, in the Buddhist sense, and helps to undermine our dualistic inclinations towards subjectivity and selfhood.

[460] A Library of LLM Intrinsics for Retrieval-Augmented Generation

Marina Danilevsky, Kristjan Greenewald, Chulaka Gunasekara, Maeda Hanafi, Lihong He, Yannis Katsis, Krishnateja Killamsetty, Yulong Li, Yatin Nandwani, Lucian Popa, Dinesh Raghu, Frederick Reiss, Vraj Shah, Khoi-Nguyen Tran, Huaiyu Zhu, Luis Lastras

Main category: cs.AI

TL;DR: The paper proposes a library of LLM Intrinsics for RAG to standardize APIs for large-scale collaboration, inspired by compiler intrinsics.

DetailsMotivation: Lack of standardized APIs for LLM collaboration, especially in RAG applications, hinders large-scale development.

Method: Introduces LLM Intrinsics as stable, implementation-independent APIs, released as LoRA adapters on HuggingFace with structured interfaces on vLLM.

Result: Provides documented and code-backed intrinsics for RAG, enabling standardized usage and composition.

Conclusion: The library facilitates scalable LLM collaboration by offering well-defined, stable APIs for RAG applications.

Abstract: In the developer community for large language models (LLMs), there is not yet a clean pattern analogous to a software library, to support very large scale collaboration. Even for the commonplace use case of Retrieval-Augmented Generation (RAG), it is not currently possible to write a RAG application against a well-defined set of APIs that are agreed upon by different LLM providers. Inspired by the idea of compiler intrinsics, we propose some elements of such a concept through introducing a library of LLM Intrinsics for RAG. An LLM intrinsic is defined as a capability that can be invoked through a well-defined API that is reasonably stable and independent of how the LLM intrinsic itself is implemented. The intrinsics in our library are released as LoRA adapters on HuggingFace, and through a software interface with clear structured input/output characteristics on top of vLLM as an inference platform, accompanied in both places with documentation and code. This article describes the intended usage, training details, and evaluations for each intrinsic, as well as compositions of multiple intrinsics.
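
A hedged sketch of how one such intrinsic might be invoked as a LoRA adapter on vLLM, following the abstract's description; the base model name, adapter path, and prompt format below are placeholders, not the actual released artifacts.

```python
# Invoking a RAG intrinsic served as a LoRA adapter on vLLM (placeholder names).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="ibm-granite/granite-3.0-8b-instruct",  # placeholder base model
          enable_lora=True)

answerability = LoRARequest("answerability", 1,
                            "/adapters/rag-answerability-lora")  # placeholder path

out = llm.generate(
    "Query: Who founded the company?\nDocuments: ...\nIs this answerable?",
    SamplingParams(temperature=0.0, max_tokens=16),
    lora_request=answerability,
)
print(out[0].outputs[0].text)
```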

[461] A Vision for Auto Research with LLM Agents

Chengwei Liu, Chong Wang, Jiayue Cao, Jingquan Ge, Kun Wang, Lyuye Zhang, Ming-Ming Cheng, Penghai Zhao, Tianlin Li, Xiaojun Jia, Xiang Li, Xingshuai Li, Yang Liu, Yebo Feng, Yihao Huang, Yijia Xu, Yuqiang Sun, Zhenhong Zhou, Zhengzi Xu

Main category: cs.AI

TL;DR: A multi-agent framework using LLMs automates and optimizes the full lifecycle of scientific research, addressing workflow fragmentation and cognitive overload.

DetailsMotivation: To streamline scientific research by automating and coordinating its lifecycle, overcoming issues like fragmented workflows and uneven expertise.

Method: Utilizes a structured multi-agent framework with LLMs and modular collaboration across research phases.

Result: Preliminary results show feasibility and potential for AI-driven, self-improving research processes.

Conclusion: The framework presents a scalable, systematic approach to AI-driven scientific inquiry.

Abstract: This paper introduces Agent-Based Auto Research, a structured multi-agent framework designed to automate, coordinate, and optimize the full lifecycle of scientific research. Leveraging the capabilities of large language models (LLMs) and modular agent collaboration, the system spans all major research phases, including literature review, ideation, methodology planning, experimentation, paper writing, peer review response, and dissemination. By addressing issues such as fragmented workflows, uneven methodological expertise, and cognitive overload, the framework offers a systematic and scalable approach to scientific inquiry. Preliminary explorations demonstrate the feasibility and potential of Auto Research as a promising paradigm for self-improving, AI-driven research processes.

[462] DiCE-Extended: A Robust Approach to Counterfactual Explanations in Machine Learning

Volkan Bakir, Polat Goktas, Sureyya Akyuz

Main category: cs.AI

TL;DR: DiCE-Extended improves counterfactual (CF) explanations in XAI by balancing proximity, diversity, and robustness using multi-objective optimization and a novel robustness metric.

DetailsMotivation: Existing CF methods like DiCE lack robustness, limiting real-world applicability in decision-critical domains.

Method: DiCE-Extended integrates multi-objective optimization, introduces a Dice-Sørensen-based robustness metric, and refines CF generation with weighted loss components.

Result: Empirical validation shows improved CF validity, stability, and alignment with decision boundaries compared to standard DiCE.

Conclusion: DiCE-Extended enhances reliability and interpretability of CFs, with potential for further improvements via adaptive optimization and domain-specific constraints.

Abstract: Explainable artificial intelligence (XAI) has become increasingly important in decision-critical domains such as healthcare, finance, and law. Counterfactual (CF) explanations, a key approach in XAI, provide users with actionable insights by suggesting minimal modifications to input features that lead to different model outcomes. Despite significant advancements, existing CF generation methods often struggle to balance proximity, diversity, and robustness, limiting their real-world applicability. A widely adopted framework, Diverse Counterfactual Explanations (DiCE), emphasizes diversity but lacks robustness, making CF explanations sensitive to perturbations and domain constraints. To address these challenges, we introduce DiCE-Extended, an enhanced CF explanation framework that integrates multi-objective optimization techniques to improve robustness while maintaining interpretability. Our approach introduces a novel robustness metric based on the Dice-Sørensen coefficient, enabling stability under small input variations. Additionally, we refine CF generation using weighted loss components (lambda_p, lambda_d, lambda_r) to balance proximity, diversity, and robustness. We empirically validate DiCE-Extended on benchmark datasets (COMPAS, Lending Club, German Credit, Adult Income) across multiple ML backends (Scikit-learn, PyTorch, TensorFlow). Results demonstrate improved CF validity, stability, and alignment with decision boundaries compared to standard DiCE-generated explanations. Our findings highlight the potential of DiCE-Extended in generating more reliable and interpretable CFs for high-stakes applications. Future work could explore adaptive optimization techniques and domain-specific constraints to further enhance CF generation in real-world scenarios.
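
A hedged sketch of a Dice-Sørensen-style robustness check: compare the set of features changed by a counterfactual before and after a small input perturbation. DiCE-Extended's exact metric may be defined differently.

```python
# Dice-Sorensen overlap between the changed-feature sets of two counterfactuals.
def dice_sorensen(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def changed_features(x: dict, cf: dict) -> set:
    return {k for k in x if x[k] != cf[k]}

x          = {"income": 40_000, "age": 30, "debt": 10_000}
cf         = {"income": 55_000, "age": 30, "debt": 4_000}
cf_perturb = {"income": 56_000, "age": 30, "debt": 10_000}  # CF after tiny input shift

robustness = dice_sorensen(changed_features(x, cf), changed_features(x, cf_perturb))
print(robustness)  # 2*1/(2+1) ~= 0.67 -- partially stable counterfactual
```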

[463] Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models

Donghoon Kim, Minji Bae, Kyuhong Shim, Byonghyo Shim

Main category: cs.AI

TL;DR: VGD is a gradient-free method using LLMs and CLIP to improve prompt generation for text-to-image models, outperforming existing techniques.

DetailsMotivation: Existing prompt inversion methods are ineffective due to poor interpretability and incoherence, limiting usability.

Method: VGD combines LLMs for text generation and CLIP scores for visual alignment, avoiding additional training.

Result: VGD generates more coherent and semantically aligned prompts than current methods.

Conclusion: VGD enhances interpretability and control in text-to-image models, improving user interaction.

Abstract: Text-to-image generative models like DALL-E and Stable Diffusion have revolutionized visual content creation across various applications, including advertising, personalized media, and design prototyping. However, crafting effective textual prompts to guide these models remains challenging, often requiring extensive trial and error. Existing prompt inversion approaches, such as soft and hard prompt techniques, are often ineffective due to limited interpretability and incoherent prompt generation. To address these issues, we propose Visually Guided Decoding (VGD), a gradient-free approach that leverages large language models (LLMs) and CLIP-based guidance to generate coherent and semantically aligned prompts. In essence, VGD utilizes the robust text generation capabilities of LLMs to produce human-readable prompts. Further, by employing CLIP scores to ensure alignment with user-specified visual concepts, VGD enhances the interpretability, generalization, and flexibility of prompt generation without the need for additional training. Our experiments demonstrate that VGD outperforms existing prompt inversion techniques in generating understandable and contextually relevant prompts, facilitating more intuitive and controllable interactions with text-to-image models.
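
The decoding loop can be pictured as below: an LLM proposes candidate continuations, and CLIP similarity to a reference image picks the winner. This is a simplified, hypothetical reconstruction, and the CLIP checkpoint named here is a stand-in, not necessarily the one VGD uses.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(image: Image.Image, texts: list) -> torch.Tensor:
    """Score each candidate text against the reference image."""
    inputs = proc(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        return clip(**inputs).logits_per_image.squeeze(0)  # one score per text

def visually_guided_step(image, prompt, candidate_words):
    """Extend the prompt with whichever candidate CLIP rates highest."""
    candidates = [f"{prompt} {w}".strip() for w in candidate_words]
    best = clip_scores(image, candidates).argmax().item()
    return candidates[best]

# usage: candidate_words would come from an LLM's top-k next-word proposals
# prompt = visually_guided_step(img, "a photo of", ["a dog", "a cat", "a car"])
```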

[464] From Mind to Machine: The Rise of Manus AI as a Fully Autonomous Digital Agent

Minjie Shen, Yanshu Li, Lulu Chen, Qikai Yang

Main category: cs.AI

TL;DR: Manus AI is a 2025 autonomous AI agent combining reasoning and task execution, with applications in healthcare, finance, and more, showcasing future human-AI collaboration.

DetailsMotivation: To bridge the gap between AI reasoning and real-world task execution, enabling tangible outcomes.

Method: Leverages large language models for reasoning and planning, integrated with execution capabilities for complex tasks.

Result: Demonstrates diverse sector applications, strengths, and limitations, highlighting its potential.

Conclusion: Manus AI signifies a shift toward intelligent agents translating intentions into actions, marking a new era of human-AI collaboration.

Abstract: Manus AI is a general-purpose AI agent introduced in early 2025, marking a significant advancement in autonomous artificial intelligence. Developed by the Chinese startup Monica.im, Manus is designed to bridge the gap between “mind” and “hand” - combining the reasoning and planning capabilities of large language models with the ability to execute complex, end-to-end tasks that produce tangible outcomes. This paper presents a comprehensive overview of Manus AI, exploring its core technical architecture, diverse applications across sectors such as healthcare, finance, manufacturing, robotics, and gaming, as well as its key strengths, current limitations, and future potential. Positioned as a preview of what lies ahead, Manus AI represents a shift toward intelligent agents that can translate high-level intentions into real-world actions, heralding a new era of human-AI collaboration.

[465] TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, Houfeng Wang

Main category: cs.AI

TL;DR: The paper introduces TIME, a multi-level benchmark for temporal reasoning in LLMs, addressing real-world challenges like intensive temporal information, fast-changing dynamics, and complex dependencies. It includes 38,522 QA pairs across 3 sub-datasets and evaluates reasoning models, with a subset (TIME-Lite) released for future research.

DetailsMotivation: Existing works overlook real-world challenges in temporal reasoning, such as intensive temporal data, dynamic events, and complex dependencies in social interactions.

Method: The authors propose the TIME benchmark, consisting of 38,522 QA pairs across 3 levels and 11 sub-tasks, with sub-datasets (TIME-Wiki, TIME-News, TIME-Dial) reflecting different challenges. Experiments evaluate reasoning and non-reasoning models.

Result: The study provides an in-depth analysis of temporal reasoning performance across scenarios and tasks, and examines the impact of test-time scaling.

Conclusion: TIME bridges the gap in temporal reasoning benchmarks, offering a comprehensive tool for future research, with TIME-Lite facilitating standardized evaluation.

Abstract: Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing works neglect the real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose a multi-level benchmark TIME, designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. This benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on both reasoning and non-reasoning models, perform an in-depth analysis of temporal reasoning performance across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning. The code is available at https://github.com/sylvain-wei/TIME , and the dataset is available at https://huggingface.co/datasets/SylvainWei/TIME .
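
Since the dataset is public on the Hugging Face Hub, a first look might start as below; the available configs, splits, and field names are not given in the abstract, so inspect the dataset card before building an evaluation loop.

```python
from datasets import load_dataset

# Split names, configs, and field names here are assumptions to verify against
# the dataset card at https://huggingface.co/datasets/SylvainWei/TIME
ds = load_dataset("SylvainWei/TIME")
print(ds)                      # list available splits
first_split = next(iter(ds.values()))
print(first_split[0])          # inspect the QA fields of one example
```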

[466] Large Language Models are Autonomous Cyber Defenders

Sebastián R. Castro, Roberto Campbell, Nancy Lau, Octavio Villalobos, Jiaqi Duan, Alvaro A. Cardenas

Main category: cs.AI

TL;DR: The paper explores using Large Language Models (LLMs) in multi-agent Autonomous Cyber Defense (ACD) environments, comparing their performance with Reinforcement Learning (RL) agents and proposing a new communication protocol.

DetailsMotivation: Current ACD approaches rely on RL-trained agents, which are costly and lack explainability. LLMs offer explainable actions, but their performance in multi-agent ACD scenarios is unexplored.

Method: The study integrates LLMs into the CybORG CAGE 4 environment and evaluates their interaction with RL agents using a novel communication protocol.

Result: The results reveal strengths and weaknesses of LLMs and RL in multi-agent ACD, providing insights for future ACD agent teams.

Conclusion: The study identifies promising directions for training and deploying hybrid ACD teams combining LLMs and RL, addressing current limitations.

Abstract: Fast and effective incident response is essential to prevent adversarial cyberattacks. Autonomous Cyber Defense (ACD) aims to automate incident response through Artificial Intelligence (AI) agents that plan and execute actions. Most ACD approaches focus on single-agent scenarios and leverage Reinforcement Learning (RL). However, ACD RL-trained agents depend on costly training, and their reasoning is not always explainable or transferable. Large Language Models (LLMs) can address these concerns by providing explainable actions in general security contexts. Researchers have explored LLM agents for ACD but have not evaluated them on multi-agent scenarios or interacting with other ACD agents. In this paper, we present the first study of how LLMs perform in multi-agent ACD environments, proposing a new integration into the CybORG CAGE 4 environment. We examine how ACD teams of LLM and RL agents can interact by proposing a novel communication protocol. Our results highlight the strengths and weaknesses of LLMs and RL and help us identify promising research directions to create, train, and deploy future teams of ACD agents.

[467] SeePhys: Does Seeing Help Thinking? – Benchmarking Vision-Based Physics Reasoning

Kun Xiang, Heng Li, Terry Jingchen Zhang, Yinya Huang, Zirong Liu, Peixin Qu, Jixi He, Jiaqi Chen, Yu-Jie Yuan, Jianhua Han, Hang Xu, Hanhui Li, Mrinmaya Sachan, Xiaodan Liang

Main category: cs.AI

TL;DR: SeePhys is a multimodal benchmark for LLM reasoning in physics, featuring vision-essential problems. Advanced models struggle, highlighting challenges in visual understanding and reliance on textual cues.

DetailsMotivation: To address the lack of benchmarks evaluating LLMs' visual reasoning in physics, especially for problems requiring diagram interpretation.

Method: Developed a large-scale benchmark with 7 physics domains and 21 diagram categories, emphasizing vision-essential problems (75%).

Result: Top models like Gemini-2.5-pro and o4-mini scored under 60% accuracy, exposing limitations in visual-physics coupling and over-reliance on text.

Conclusion: Current LLMs face significant challenges in visual reasoning for physics, particularly in integrating diagrams with problem-solving and reducing text dependency.

Abstract: We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. In contrast to prior works where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that mandate visual information extraction for correct solutions. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in current large language models’ visual understanding capabilities, particularly in: (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.

[468] The Ultimate Test of Superintelligent AI Agents: Can an AI Balance Care and Control in Asymmetric Relationships?

Djallel Bouneffouf, Matthew Riemer, Kush Varshney

Main category: cs.AI

TL;DR: The paper introduces the Shepherd Test, a new framework for evaluating the moral and relational behaviors of superintelligent AI, focusing on manipulation, care, and self-preservation in hierarchical contexts.

DetailsMotivation: Current AI evaluation paradigms lack tools to assess moral agency and hierarchical decision-making in superintelligent AI, which is critical as AI systems become more integrated into multi-agent environments.

Method: The Shepherd Test is proposed, inspired by human-animal interactions, to evaluate AI’s ability to manipulate, nurture, and balance self-interest with the well-being of subordinate agents.

Result: The test highlights the need for new evaluation frameworks that address moral agency and complex decision-making in AI, particularly under existential stakes.

Conclusion: The paper calls for further research, including simulations for moral behavior testing and formalizing ethical manipulation in multi-agent systems, to advance AI governance.

Abstract: This paper introduces the Shepherd Test, a new conceptual test for assessing the moral and relational dimensions of superintelligent artificial agents. The test is inspired by human interactions with animals, where ethical considerations about care, manipulation, and consumption arise in contexts of asymmetric power and self-preservation. We argue that AI crosses an important, and potentially dangerous, threshold of intelligence when it exhibits the ability to manipulate, nurture, and instrumentally use less intelligent agents, while also managing its own survival and expansion goals. This includes the ability to weigh moral trade-offs between self-interest and the well-being of subordinate agents. The Shepherd Test thus challenges traditional AI evaluation paradigms by emphasizing moral agency, hierarchical behavior, and complex decision-making under existential stakes. We argue that this shift is critical for advancing AI governance, particularly as AI systems become increasingly integrated into multi-agent environments. We conclude by identifying key research directions, including the development of simulation environments for testing moral behavior in AI, and the formalization of ethical manipulation within multi-agent systems.

[469] Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning

Konstantinos I. Roumeliotis, Ranjan Sapkota, Manoj Karkee, Nikolaos D. Tselikas

Main category: cs.AI

TL;DR: A modular AI framework integrates multimodal agents with a reasoning orchestrator and RAG for trust-aware zero-shot visual classification, improving accuracy by 77.94% in apple leaf disease diagnosis.

DetailsMotivation: Addressing trust challenges in multi-agent AI systems, especially in zero-shot settings without fine-tuning.

Method: Combines generalist multimodal agents, a non-visual reasoning orchestrator, and RAG. Benchmarked three configurations: zero-shot, fine-tuned, and trust-calibrated orchestration.

Result: Achieved 85.63% accuracy with trust-aware orchestration and RAG. GPT-4o showed better calibration, while Qwen-2.5-VL was overconfident.

Conclusion: The framework enables scalable, interpretable multi-agent AI, applicable to trust-critical domains like diagnostics and biology. All components are open-sourced for reproducibility.

Abstract: Modern Artificial Intelligence (AI) increasingly relies on multi-agent architectures that blend visual and language understanding. Yet, a pressing challenge remains: How can we trust these agents, especially in zero-shot settings with no fine-tuning? We introduce a novel modular Agentic AI visual classification framework that integrates generalist multimodal agents with a non-visual reasoning orchestrator and a Retrieval-Augmented Generation (RAG) module. Applied to apple leaf disease diagnosis, we benchmark three configurations: (I) zero-shot with confidence-based orchestration, (II) fine-tuned agents with improved performance, and (III) trust-calibrated orchestration enhanced by CLIP-based image retrieval and re-evaluation loops. Using confidence calibration metrics (ECE, OCR, CCC), the orchestrator modulates trust across agents. Our results demonstrate a 77.94% accuracy improvement in the zero-shot setting using trust-aware orchestration and RAG, achieving 85.63% overall. GPT-4o showed better calibration, while Qwen-2.5-VL displayed overconfidence. Furthermore, image-RAG grounded predictions with visually similar cases, enabling correction of agent overconfidence via iterative re-evaluation. The proposed system separates perception (vision agents) from meta-reasoning (orchestrator), enabling scalable and interpretable multi-agent AI. This blueprint is extensible to diagnostics, biology, and other trust-critical domains. All models, prompts, results, and system components, including the complete software source code, are openly released to support reproducibility, transparency, and community benchmarking at Github: https://github.com/Applied-AI-Research-Lab/Orchestrator-Agent-Trust
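
Of the calibration metrics listed (ECE, OCR, CCC), expected calibration error is the standard one; a minimal reference implementation is sketched below, using the common equal-width binning since the paper's binning choices are not given in the abstract.

```python
import numpy as np

def expected_calibration_error(conf: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """ECE: bin predictions by confidence, then take the bin-size-weighted
    gap between mean confidence and empirical accuracy in each bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

# conf = per-prediction confidence, correct = 1/0 per prediction
# print(expected_calibration_error(np.array([0.9, 0.6, 0.8]), np.array([1, 0, 1])))
```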

[470] SciSage: A Multi-Agent Framework for High-Quality Scientific Survey Generation

Xiaofeng Shi, Qian Kou, Yuduo Li, Ning Tang, Jinxin Xie, Longbin Yu, Songjing Wang, Hua Zhou

Main category: cs.AI

TL;DR: SciSage, a multi-agent framework, improves automated survey generation by enhancing coherence, analysis, and citations, outperforming existing methods.

DetailsMotivation: Addressing the lack of depth, coherence, and reliable citations in current LLM-based survey-generation tools.

Method: Introduces SciSage, a multi-agent framework with a hierarchical Reflector agent for critical evaluation, alongside specialized agents for query interpretation, content retrieval, and refinement. Includes SurveyScope, a benchmark of 46 high-impact papers.

Result: Outperforms baselines (+1.73 in coherence, +32% in citation F1). Human evaluations show strengths in topical breadth and retrieval efficiency.

Conclusion: SciSage provides a promising foundation for research-assistive writing tools.

Abstract: The rapid growth of scientific literature demands robust tools for automated survey generation. However, current large language model (LLM)-based methods often lack in-depth analysis, structural coherence, and reliable citations. To address these limitations, we introduce SciSage, a multi-agent framework employing a reflect-when-you-write paradigm. SciSage features a hierarchical Reflector agent that critically evaluates drafts at outline, section, and document levels, collaborating with specialized agents for query interpretation, content retrieval, and refinement. We also release SurveyScope, a rigorously curated benchmark of 46 high-impact papers (2020-2025) across 11 computer science domains, with strict recency and citation-based quality controls. Evaluations demonstrate that SciSage outperforms state-of-the-art baselines (LLM x MapReduce-V2, AutoSurvey), achieving +1.73 points in document coherence and +32% in citation F1 scores. Human evaluations reveal mixed outcomes (3 wins vs. 7 losses against human-written surveys), but highlight SciSage’s strengths in topical breadth and retrieval efficiency. Overall, SciSage offers a promising foundation for research-assistive writing tools.

[471] A Practical Guide for Evaluating LLMs and LLM-Reliant Systems

Ethan M. Rudd, Christopher Andrews, Philip Tully

Main category: cs.AI

TL;DR: A practical evaluation framework for LLM-reliant systems addresses real-world challenges by curating datasets, selecting metrics, and integrating methodologies.

DetailsMotivation: Current synthetic benchmarks and metrics inadequately evaluate LLM systems in real-world scenarios.

Method: Proposes a framework for proactive dataset curation, metric selection, and evaluation methodologies.

Result: Enhances practical development and deployment of LLM systems to meet real-world requirements.

Conclusion: The framework bridges the gap between theoretical evaluation and real-world application needs.

Abstract: Recent advances in generative AI have led to remarkable interest in using systems that rely on large language models (LLMs) for practical applications. However, meaningful evaluation of these systems in real-world scenarios comes with a distinct set of challenges, which are not well-addressed by synthetic benchmarks and de facto metrics that are often seen in the literature. We present a practical evaluation framework that outlines how to proactively curate representative datasets, select meaningful evaluation metrics, and employ evaluation methodologies that integrate well with practical development and deployment of LLM-reliant systems that must adhere to real-world requirements and meet user-facing needs.

[472] THE-Tree: Can Tracing Historical Evolution Enhance Scientific Verification and Reasoning?

Xin Wang, Jiyao Liu, Yulong Xiao, Junzhi Ning, Lihao Liu, Junjun He, Botian Shi, Kaicheng Yu

Main category: cs.AI

TL;DR: THE-Tree is a computational framework for constructing verifiable, causally-linked domain-specific evolution trees from scientific literature, improving accuracy in evaluating AI-generated propositions and predicting scientific developments.

DetailsMotivation: The challenge of rigorously evaluating AI-generated scientific propositions for novelty and accuracy due to inadequate existing methods (e.g., LLM hallucinations, unstructured surveys).

Method: THE-Tree uses a ‘Think-Verbalize-Cite-Verify’ process: an LLM proposes advancements and cites supporting literature, and each proposed link is validated for logical coherence and evidential support.

Result: THE-Tree improves hit@1 by 8-14% in graph completion, 10% in predicting developments, and boosts evaluation performance by nearly 100% when combined with other methods.

Conclusion: THE-Tree addresses the bottleneck in evaluating AI-generated scientific ideas, offering a structured, verifiable framework for scientific evolution.

Abstract: Large Language Models (LLMs) are accelerating scientific idea generation, but rigorously evaluating these numerous, often superficial, AI-generated propositions for novelty and factual accuracy is a critical bottleneck; manual verification is too slow. Existing validation methods are inadequate: LLMs as standalone verifiers may hallucinate and lack domain knowledge (our findings show 60% unawareness of relevant papers in specific domains), while traditional citation networks lack explicit causality and narrative surveys are unstructured. This underscores a core challenge: the absence of structured, verifiable, and causally-linked historical data of scientific evolution. To address this, we introduce THE-Tree (Technology History Evolution Tree), a computational framework that constructs such domain-specific evolution trees from scientific literature. THE-Tree employs a search algorithm to explore evolutionary paths. During its node expansion, it utilizes a novel “Think-Verbalize-Cite-Verify” process: an LLM proposes potential advancements and cites supporting literature. Critically, each proposed evolutionary link is then validated for logical coherence and evidential support by a natural language inference mechanism that interrogates the cited literature, ensuring that each step is grounded. We construct and validate 88 THE-Trees across diverse domains and release a benchmark dataset including up to 71k fact verifications covering 27k papers to foster further research. Experiments demonstrate that i) in graph completion, our THE-Tree improves hit@1 by 8% to 14% across multiple models compared to traditional citation networks; ii) for predicting future scientific developments, it improves hit@1 metric by nearly 10%; and iii) when combined with other methods, it boosts the performance of evaluating important scientific papers by almost 100%.

[473] LumiCRS: Asymmetric Contrastive Prototype Learning for Long-Tail Conversational Recommender Systems

Jinzhi Wang, Bin Li, Qingke Peng, Haozhou Li, Zeyuan Zeng, Ruimeng Li, Kaixuan Yang, Jiangbo Zhang, Biyi Zhou, Yaoying Wang

Main category: cs.AI

TL;DR: LumiCRS addresses long-tail imbalance in conversational recommender systems with adaptive loss, prototype learning, and GPT-4o-driven augmentation, improving accuracy, diversity, and fairness.

DetailsMotivation: The extreme long-tail distribution in dialogue data biases CRSs toward popular items, harming diversity and worsening the cold-start problem.

Method: LumiCRS uses Adaptive Comprehensive Focal Loss, prototype learning, and GPT-4o-driven dialogue augmentation to balance head, body, and tail item representation.

Result: LumiCRS improves Recall@10 and Tail-Recall@10 by 7-15% on REDIAL and INSPIRED benchmarks, with human evaluations confirming better fluency and relevance.

Conclusion: Multi-layer collaboration in LumiCRS effectively mitigates long-tail imbalance, enhancing recommendation fairness and efficiency.

Abstract: Conversational recommender systems (CRSs) often suffer from an extreme long-tail distribution of dialogue data, causing a strong bias toward head-frequency blockbusters that sacrifices diversity and exacerbates the cold-start problem. An empirical analysis of DCRS and statistics on the REDIAL corpus show that only 10% of head movies account for nearly half of all mentions, whereas about 70% of tail movies receive merely 26% of the attention. This imbalance gives rise to three critical challenges: head over-fitting, body representation drift, and tail sparsity. To address these issues, we propose LumiCRS, an end-to-end framework that mitigates long-tail imbalance through three mutually reinforcing layers: (i) an Adaptive Comprehensive Focal Loss (ACFL) that dynamically adjusts class weights and focusing factors to curb head over-fitting and reduce popularity bias; (ii) Prototype Learning for Long-Tail Recommendation, which selects semantic, affective, and contextual prototypes to guide clustering and stabilize body and tail representations; and (iii) a GPT-4o-driven prototype-guided dialogue augmentation module that automatically generates diverse long-tail conversational snippets to alleviate tail sparsity and distribution shift. Together, these strategies enable LumiCRS to markedly improve recommendation accuracy, diversity, and fairness: on the REDIAL and INSPIRED benchmarks, LumiCRS boosts Recall@10 and Tail-Recall@10 by 7-15% over fifteen strong baselines, while human evaluations confirm superior fluency, informativeness, and long-tail relevance. These results demonstrate the effectiveness of multi-layer collaboration in building an efficient and fair long-tail conversational recommender.
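
ACFL builds on the class-weighted focal loss; its static skeleton is sketched below, while the adaptive adjustment of weights and focusing factors that gives ACFL its name is the paper's contribution and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def weighted_focal_loss(logits, targets, class_weights, gamma=2.0):
    """Class-weighted focal loss: the (1 - p_t)^gamma factor down-weights easy
    (head) examples so that rare tail classes contribute more to the gradient."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-prob of true class
    pt = log_pt.exp()
    alpha_t = class_weights[targets]                           # per-class weight
    return (-alpha_t * (1.0 - pt) ** gamma * log_pt).mean()

# e.g. inverse-frequency weights over item classes:
# class_weights = counts.sum() / (len(counts) * counts.clamp(min=1))
```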

[474] Modeling Deontic Modal Logic in the s(CASP) Goal-directed Predicate Answer Set Programming System

Gopal Gupta, Abhiramon Rajasekharan, Alexis R. Tudor, Elmer Salazar, Joaquín Arias

Main category: cs.AI

TL;DR: The paper presents a method to implement deontic modal logic using ASP’s default and strong negation, resolving its paradoxes.

DetailsMotivation: To address the challenges of implementing deontic modal logic and resolving its paradoxes.

Method: Uses answer set programming (ASP) with default and strong negation to represent deontic operators and global constraints for obligations and impermissibilities.

Result: The proposed representation elegantly resolves paradoxes in deontic modal logic.

Conclusion: ASP provides an effective framework for implementing and resolving issues in deontic modal logic.

Abstract: We consider the problem of implementing deontic modal logic. We show how (deontic) modal operators can be expressed elegantly using default negation (negation-as-failure) and strong negation present in answer set programming (ASP). We propose using global constraints of ASP to represent obligations and impermissibilities of deontic modal logic. We show that our proposed representation results in the various paradoxes of deontic modal logic being elegantly resolved.

[475] Grounding Methods for Neural-Symbolic AI

Rodrigo Castellano Ontiveros, Francesco Giannini, Marco Gori, Giuseppe Marra, Michelangelo Diligenti

Main category: cs.AI

TL;DR: The paper proposes a parametrized family of grounding methods for Neural-Symbolic (NeSy) systems, balancing expressiveness and scalability by generalizing Backward Chaining.

DetailsMotivation: Address the trade-off between exhaustive (combinatorially explosive) and heuristic-based (unjustified) grounding methods in NeSy systems.

Method: Introduces a parametrized family of grounding methods inspired by multi-hop symbolic reasoning, generalizing Backward Chaining.

Result: Experimental results highlight the grounding criterion’s importance, comparable to the NeSy method itself.

Conclusion: The proposed grounding methods offer a flexible trade-off between expressiveness and scalability, outperforming existing approaches.

Abstract: A large class of Neural-Symbolic (NeSy) methods employs a machine learner to process the input entities, while relying on a reasoner based on First-Order Logic to represent and process more complex relationships among the entities. A fundamental role for these methods is played by the process of logic grounding, which determines the relevant substitutions for the logic rules using a (sub)set of entities. Some NeSy methods use an exhaustive derivation of all possible substitutions, preserving the full expressive power of the logic knowledge. This leads to a combinatorial explosion in the number of ground formulas to consider and, therefore, strongly limits their scalability. Other methods rely on heuristic-based selective derivations, which are generally more computationally efficient, but lack a justification and provide no guarantees of preserving the information provided to and returned by the reasoner. Taking inspiration from multi-hop symbolic reasoning, this paper proposes a parametrized family of grounding methods generalizing classic Backward Chaining. Different selections within this family allow us to obtain commonly employed grounding methods as special cases, and to control the trade-off between expressiveness and scalability of the reasoner. The experimental results show that the selection of the grounding criterion is often as important as the NeSy method itself.
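
As a point of reference, classic backward chaining restricted to a fixed depth, the special case this family generalizes, can be sketched as below; atoms are kept propositional for brevity, whereas a real grounder would unify first-order terms.

```python
def backward_ground(goal, rules, facts, depth):
    """Collect rule instances reachable from `goal` within `depth` backward hops.
    rules: list of (head, [body atoms]); atoms are propositional strings here."""
    grounded, frontier = set(), {goal}
    for _ in range(depth):
        next_frontier = set()
        for head, body in rules:
            if head in frontier:
                grounded.add((head, tuple(body)))
                # only unexplained body atoms need further backward expansion
                next_frontier |= {b for b in body if b not in facts}
        if not next_frontier:
            break
        frontier = next_frontier
    return grounded

rules = [("flu", ["fever", "cough"]), ("fever", ["infection"])]
print(backward_ground("flu", rules, facts={"cough"}, depth=2))
```

Varying the depth bound moves along the expressiveness/scalability trade-off the paper describes: depth equal to the proof depth recovers exhaustive grounding on the goal's relevant fragment, while small depths behave like a cheap heuristic selection.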

[476] Information-Theoretic Aggregation of Ethical Attributes in Simulated-Command

Taylan Akay, Harrison Tolley, Hussein Abbass

Main category: cs.AI

TL;DR: The paper proposes moving human judgement outside AI-driven simulations by designing ethical metrics upfront, allowing the system to explore scenarios and present select options for human review.

DetailsMotivation: Human involvement in every ethical decision within large-scale AI simulations is impractical and inefficient.

Method: Human commanders design ethical metrics; simulations explore scenarios and dynamically weight ethical attributes using multi-criteria decision-making techniques.

Result: The approach enables efficient exploration of ethical scenarios while reducing human workload.

Conclusion: Dynamic weighting of ethical attributes in simulations balances efficiency and human oversight, leveraging AI for scalable ethical decision-making.

Abstract: In the age of AI, human commanders need to use the computational powers available in today’s environment to simulate a very large number of scenarios. Within each scenario, situations occur where different decision design options could have ethical consequences. Making these decisions reliant on human judgement is both counter-productive to the aim of exploring a very large number of scenarios in a timely manner and infeasible when considering the workload needed to involve humans in each of these choices. In this paper, we move human judgement outside the simulation decision cycle. Basically, the human will design the ethical metric space, leaving it to the simulated environment to explore the space. When the simulation completes its testing cycles, the testing environment will come back to the human commander with a few options to select from. The human commander will then exercise human judgement to select the most appropriate course of action, which will then get executed accordingly. We assume that the problem of designing metrics that are sufficiently granular to assess the ethical implications of decisions is solved. Subsequently, the fundamental problem we look at in this paper is how to weight ethical decisions during the running of these simulations; that is, how to dynamically weight the ethical attributes when agents are faced with decision options with ethical implications during generative simulations. The multi-criteria decision making literature has started to look at nearby problems, where the concept of entropy has been used to determine the weights during aggregation. We draw from that literature different approaches to automatically calculate the weights for ethical attributes during simulation-based testing and evaluation.
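
The entropy-based weighting referenced from the multi-criteria decision-making literature has a standard form: attributes whose scores diverge more across options receive higher weights. A minimal version is sketched below; how simulation outcomes map onto the score matrix is an assumption.

```python
import numpy as np

def entropy_weights(X: np.ndarray) -> np.ndarray:
    """Entropy weight method. X: options x attributes, non-negative scores.
    Low-entropy (high-divergence) attributes get larger aggregation weights."""
    P = X / X.sum(axis=0, keepdims=True)              # normalize each attribute column
    n = X.shape[0]
    ent = -(P * np.log(P + 1e-12)).sum(axis=0) / np.log(n)
    div = 1.0 - ent                                   # degree of divergence
    return div / div.sum()

# rows: candidate courses of action; columns: ethical attribute scores
scores = np.array([[0.9, 0.2, 0.5],
                   [0.8, 0.9, 0.4],
                   [0.7, 0.1, 0.6]])
print(entropy_weights(scores))
```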

[477] CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, Chris Shum

Main category: cs.AI

TL;DR: CUDA-L1 is an automated reinforcement learning framework for CUDA optimization, achieving significant speedups across various GPU architectures and uncovering key optimization principles.

DetailsMotivation: The rapid growth in demand for GPU computing, driven by Large Language Models, necessitates automated CUDA optimization strategies due to the low success rates of current models.

Method: CUDA-L1 uses reinforcement learning to optimize CUDA kernels, trained on NVIDIA A100, and tested across multiple GPU architectures.

Result: Achieves average speedups of up to x17.7 on A100, with peak speedups of x449, and demonstrates strong portability across other GPUs.

Conclusion: Reinforcement learning can transform LLMs into effective CUDA optimizers, extending reasoning to new kernels and improving GPU efficiency.

Abstract: The exponential growth in demand for GPU computing resources, driven by the rapid advancement of Large Language Models, has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models (e.g. R1, o1) achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization. CUDA-L1 achieves performance improvements on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of x17.7 across all 250 CUDA kernels of KernelBench, with peak speedups reaching x449. Furthermore, the model also demonstrates excellent portability across GPU architectures, achieving average speedups of x17.8 on H100, x19.0 on RTX 3090, x16.5 on L40, x14.7 on H800, and x13.9 on H20 despite being optimized specifically for A100. Beyond these benchmark results, CUDA-L1 demonstrates several remarkable properties: 1) Discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) Uncovers fundamental principles of CUDA optimization; 3) Identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that harm performance. The capabilities of CUDA-L1 demonstrate that reinforcement learning can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. More importantly, the trained RL model extends its acquired reasoning abilities to new kernels. This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially promote GPU efficiency and alleviate the rising pressure on GPU computing resources.
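
The reward is speedup-based; a toy version for Python-callable kernels is sketched below. The actual system presumably benchmarks compiled CUDA kernels with device synchronization, and any reward shaping beyond raw speedup is not described in the abstract.

```python
import time
import statistics

def speedup_reward(candidate_kernel, reference_kernel, args, repeats=20):
    """Toy speedup-based reward: median wall-clock time of the reference
    divided by that of the candidate. Real CUDA benchmarking would use CUDA
    events and torch.cuda.synchronize() rather than time.perf_counter()."""
    def bench(fn):
        times = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            fn(*args)
            times.append(time.perf_counter() - t0)
        return statistics.median(times)
    return bench(reference_kernel) / bench(candidate_kernel)  # >1 means faster
```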

cs.SD

[478] U-DREAM: Unsupervised Dereverberation guided by a Reverberation Model

Louis Bahrman, Mathieu Fontaine, Gaël Richard

Main category: cs.SD

TL;DR: The paper proposes a weakly-to-fully unsupervised dereverberation method using reverberant signals and an acoustic model, eliminating the need for paired dry-reverberant data.

DetailsMotivation: Existing deep learning methods require paired data, which is hard to obtain. The paper aims to address this limitation by developing a more practical, data-efficient approach.

Method: A sequential learning strategy based on a Bayesian formulation is used, where acoustic parameters and dry signals are estimated from reverberant inputs with a reverberation matching loss.

Result: The method outperforms unsupervised baselines with only 100 labelled samples, proving its effectiveness in low-resource settings.

Conclusion: The proposed approach is practical and efficient, especially in scenarios with limited labelled data.

Abstract: This paper explores the outcome of training state-of-the-art dereverberation models with supervision settings ranging from weakly-supervised to fully unsupervised, relying solely on reverberant signals and an acoustic model for training. Most of the existing deep learning approaches typically require paired dry and reverberant data, which are difficult to obtain in practice. We develop instead a sequential learning strategy motivated by a Bayesian formulation of the dereverberation problem, wherein acoustic parameters and dry signals are estimated from reverberant inputs using deep neural networks, guided by a reverberation matching loss. Our most data-efficient variant requires only 100 reverberation-parameter-labelled samples to outperform an unsupervised baseline, demonstrating the effectiveness and practicality of the proposed method in low-resource scenarios.
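
The reverberation matching loss admits a compact self-supervised reading: re-reverberate the estimated dry signal with the estimated room impulse response and compare against the observed input. The sketch below is one plausible PyTorch rendering, not the paper's exact parameterization.

```python
import torch
import torch.nn.functional as F

def reverberation_matching_loss(dry_est, rir_est, reverberant):
    """Re-reverberate the estimated dry signal with the estimated RIR and
    match the observed reverberant input -- no dry ground truth needed.
    Shapes: dry_est (B, T), rir_est (B, L), reverberant (B, T)."""
    B, T = dry_est.shape
    L = rir_est.shape[-1]
    kernel = rir_est.flip(-1).unsqueeze(1)               # (B, 1, L); flip => true convolution
    x = F.pad(dry_est.unsqueeze(0), (L - 1, 0))          # (1, B, T+L-1), causal padding
    rereverb = F.conv1d(x, kernel, groups=B).squeeze(0)  # per-item convolution -> (B, T)
    return F.mse_loss(rereverb, reverberant)
```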

[479] The Rest is Silence: Leveraging Unseen Species Models for Computational Musicology

Fabian C. Moss, Jan Hajič jr., Adrian Nachtwey, Laurent Pugin

Main category: cs.SD

TL;DR: The paper applies Unseen Species Models (USMs) from ecology to musicology to estimate missing data in incomplete datasets, addressing questions like missing composers, cataloged sources, and harmonic vocabulary coverage.

DetailsMotivation: Musicological datasets are often incomplete, and the true extent of collections remains unknown. The paper aims to quantify this missing data using USMs.

Method: The study introduces USMs formally and applies them to four musicological case studies, such as estimating missing composers in RISM and cataloged Gregorian chant sources.

Result: USMs provide quantitative answers to questions about missing data, such as the percentage of uncataloged sources or the expected differences between music prints.

Conclusion: USMs offer a novel and effective approach to estimating the extent of missing data in musicological research, bridging gaps in historical and observational datasets.

Abstract: For many decades, musicologists have engaged in creating large databases serving different purposes for musicological research and scholarship. With the rise of fields like music information retrieval and digital musicology, there is now a constant and growing influx of musicologically relevant datasets and corpora. In historical or observational settings, however, these datasets are necessarily incomplete, and the true extent of a collection of interest remains unknown – silent. Here, we apply, for the first time, so-called Unseen Species models (USMs) from ecology to areas of musicological activity. After introducing the models formally, we show in four case studies how USMs can be applied to musicological data to address quantitative questions like: How many composers are we missing in RISM? What percentage of medieval sources of Gregorian chant have we already cataloged? How many differences in music prints do we expect to find between editions? How large is the coverage of songs from genres of a folk music tradition? And, finally, how close can we come to estimating the size of the harmonic vocabulary of a large number of composers?
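
The workhorse USM is the Chao1 richness estimator, which infers the number of unseen species from how many species were observed exactly once or twice. Whether the paper uses Chao1 or a variant is not stated in the abstract, so the sketch below is the textbook form applied to an illustrative count vector.

```python
from collections import Counter

def chao1(counts):
    """Chao1 lower-bound estimate of total species richness.
    counts: observed occurrences per species (e.g., mentions per composer)."""
    s_obs = len(counts)
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    if f2 == 0:
        return s_obs + f1 * (f1 - 1) / 2.0  # bias-corrected form
    return s_obs + f1 * f1 / (2.0 * f2)

mentions = Counter({"A": 5, "B": 1, "C": 1, "D": 2, "E": 1})
print(chao1(list(mentions.values())))  # estimated total incl. unseen "species"
```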

[480] Multi-Sampling-Frequency Naturalness MOS Prediction Using Self-Supervised Learning Model with Sampling-Frequency-Independent Layer

Go Nishikawa, Wataru Nakata, Yuki Saito, Kanami Imamura, Hiroshi Saruwatari, Tomohiko Nakamura

Main category: cs.SD

TL;DR: A model for MOS prediction in speech with multiple sampling frequencies, integrating SF-independent layers and SSL, achieved top rankings in AMC 2025 Track 3.

DetailsMotivation: To improve MOS prediction for speech with varying sampling frequencies by leveraging SF-independent feature extraction.

Method: Combines SF-independent convolutional layers with SSL, uses knowledge distillation from a pretrained model, and pretrains with a large MOS dataset.

Result: Ranked first in one metric and fourth overall in AMC 2025 Track 3; ablation studies identified key model factors.

Conclusion: The proposed model effectively handles multiple sampling frequencies for MOS prediction, validated by competition performance and ablation studies.

Abstract: We introduce our submission to the AudioMOS Challenge (AMC) 2025 Track 3: mean opinion score (MOS) prediction for speech with multiple sampling frequencies (SFs). Our submitted model integrates an SF-independent (SFI) convolutional layer into a self-supervised learning (SSL) model to achieve SFI speech feature extraction for MOS prediction. We present some strategies to improve the MOS prediction performance of our model: distilling knowledge from a pretrained non-SFI-SSL model and pretraining with a large-scale MOS dataset. Our submission to the AMC 2025 Track 3 ranked first in one evaluation metric and fourth in the final ranking. We also report the results of our ablation study to investigate essential factors of our model.

[481] Frame-level Temporal Difference Learning for Partial Deepfake Speech Detection

Menglu Li, Xiao-Ping Zhang, Lian Zhao

Main category: cs.SD

TL;DR: The paper introduces a Temporal Difference Attention Module (TDAM) for detecting partial deepfake speech by analyzing unnatural temporal variations, achieving state-of-the-art results without frame-level annotations.

DetailsMotivation: Existing methods for detecting partial deepfake speech rely on costly frame-level annotations and struggle with smoothed transitions in deepfake generation.

Method: Proposes TDAM, which identifies unnatural temporal variations in deepfake speech using dual-level hierarchical difference representation and adaptive average pooling.

Result: Achieves EER of 0.59% on PartialSpoof and 0.03% on HAD datasets, outperforming existing methods without frame-level supervision.

Conclusion: TDAM offers a scalable and effective solution for partial deepfake speech detection by focusing on temporal irregularities.

Abstract: Detecting partial deepfake speech is essential due to its potential for subtle misinformation. However, existing methods depend on costly frame-level annotations during training, limiting real-world scalability. Also, they focus on detecting transition artifacts between bonafide and deepfake segments. As deepfake generation techniques increasingly smooth these transitions, detection has become more challenging. To address this, our work introduces a new perspective by analyzing frame-level temporal differences and reveals that deepfake speech exhibits erratic directional changes and unnatural local transitions compared to bonafide speech. Based on this finding, we propose a Temporal Difference Attention Module (TDAM) that redefines partial deepfake detection as identifying unnatural temporal variations, without relying on explicit boundary annotations. A dual-level hierarchical difference representation captures temporal irregularities at both fine and coarse scales, while adaptive average pooling preserves essential patterns across variable-length inputs to minimize information loss. Our TDAM-AvgPool model achieves state-of-the-art performance, with an EER of 0.59% on the PartialSpoof dataset and 0.03% on the HAD dataset, which significantly outperforms the existing methods without requiring frame-level supervision.
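
A minimal sketch of dual-scale frame-difference features of the kind TDAM attends over is shown below; the two scales and the concatenation are illustrative assumptions, and the attention and pooling modules themselves are omitted.

```python
import torch

def temporal_difference_features(frames: torch.Tensor) -> torch.Tensor:
    """Frame-level temporal differences at a fine and a coarse scale.
    frames: (B, T, D) frame embeddings; returns (B, T-2, 2*D)."""
    fine = frames[:, 1:, :] - frames[:, :-1, :]         # adjacent-frame deltas
    coarse = frames[:, 2:, :] - frames[:, :-2, :]       # two-frames-apart deltas
    return torch.cat([fine[:, 1:, :], coarse], dim=-1)  # align lengths to T-2

# Per the paper's observation, deepfake segments show erratic directional
# changes in such difference streams; TDAM applies attention over them.
```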

[482] Exploiting Context-dependent Duration Features for Voice Anonymization Attack Systems

Natalia Tomashenko, Emmanuel Vincent, Marc Tommasi

Main category: cs.SD

TL;DR: The paper introduces a method using context-dependent duration embeddings from speech temporal dynamics to improve speaker verification and analyze vulnerabilities in voice anonymization systems.

DetailsMotivation: Speech temporal dynamics (rhythm, intonation, speaking rate) carry unique speaker identity information, but existing representations are limited.

Method: Proposes extracting context-dependent duration embeddings from speech temporal dynamics and develops attack models using these embeddings.

Result: Attack models significantly improve speaker verification performance for both original and anonymized data compared to simpler representations.

Conclusion: The method effectively leverages speech temporal dynamics for better speaker verification and highlights vulnerabilities in voice anonymization.

Abstract: The temporal dynamics of speech, encompassing variations in rhythm, intonation, and speaking rate, contain important and unique information about speaker identity. This paper proposes a new method for representing speaker characteristics by extracting context-dependent duration embeddings from speech temporal dynamics. We develop novel attack models using these representations and analyze the potential vulnerabilities in speaker verification and voice anonymization systems. The experimental results show that the developed attack models provide a significant improvement in speaker verification performance for both original and anonymized data in comparison with simpler representations of speech temporal dynamics reported in the literature.

[483] EchoVoices: Preserving Generational Voices and Memories for Seniors and Children

Haiying Xu, Haoze Liu, Mingshi Li, Siyu Cai, Guangxuan Zheng, Yuhuang Jia, Jinghua Zhao, Yong Qin

Main category: cs.SD

TL;DR: EchoVoices is an end-to-end digital human pipeline designed for seniors and children, addressing their unique vocal and interaction needs with advanced ASR, TTS, and LLM technologies.

DetailsMotivation: Current speech and digital human technologies neglect seniors and children, who have distinct vocal and interaction patterns, limiting their accessibility and legacy preservation.

Method: The system combines a k-NN-enhanced Whisper model for speech recognition, an age-adaptive VITS model for speech synthesis, and an LLM-driven agent for persona consistency.

Result: Experiments on SeniorTalk and ChildMandarin datasets show improved recognition accuracy, synthesis quality, and speaker similarity.

Conclusion: EchoVoices successfully preserves generational voices, enabling intergenerational connection and digital legacy creation.

Abstract: Recent breakthroughs in intelligent speech and digital human technologies have primarily targeted mainstream adult users, often overlooking the distinct vocal patterns and interaction styles of seniors and children. These demographics possess distinct vocal characteristics, linguistic styles, and interaction patterns that challenge conventional ASR, TTS, and LLM systems. To address this, we introduce EchoVoices, an end-to-end digital human pipeline dedicated to creating persistent digital personas for seniors and children, ensuring their voices and memories are preserved for future generations. Our system integrates three core innovations: a k-NN-enhanced Whisper model for robust speech recognition of atypical speech; an age-adaptive VITS model for high-fidelity, speaker-aware speech synthesis; and an LLM-driven agent that automatically generates persona cards and leverages a RAG-based memory system for conversational consistency. Our experiments, conducted on the SeniorTalk and ChildMandarin datasets, demonstrate significant improvements in recognition accuracy, synthesis quality, and speaker similarity. EchoVoices provides a comprehensive framework for preserving generational voices, offering a new means of intergenerational connection and the creation of lasting digital legacies.

[484] A2TTS: TTS for Low Resource Indian Languages

Ayush Singh Bhadoriya, Abhishek Nikunj Shinde, Isha Pandey, Ganesh Ramakrishnan

Main category: cs.SD

TL;DR: A diffusion-based TTS system for unseen speakers and Indian languages, using speaker embeddings and cross-attention for prosody, with zero-shot capabilities.

DetailsMotivation: Address challenges in generating speech for unseen speakers and support diverse Indian languages.

Method: Uses a diffusion-based TTS architecture with speaker embeddings and cross-attention for duration prediction, plus classifier-free guidance for zero-shot generation.

Result: Speech closely resembles target speakers with improved duration modeling and expressiveness.

Conclusion: The system effectively generates natural speech for diverse Indian languages and unseen speakers.

Abstract: We present a speaker conditioned text-to-speech (TTS) system aimed at addressing challenges in generating speech for unseen speakers and supporting diverse Indian languages. Our method leverages a diffusion-based TTS architecture, where a speaker encoder extracts embeddings from short reference audio samples to condition the DDPM decoder for multispeaker generation. To further enhance prosody and naturalness, we employ a cross-attention based duration prediction mechanism that utilizes reference audio, enabling more accurate and speaker consistent timing. This results in speech that closely resembles the target speaker while improving duration modeling and overall expressiveness. Additionally, to improve zero-shot generation, we employed classifier-free guidance, allowing the system to generate more natural speech for unknown speakers. Using this approach, we trained language-specific speaker-conditioned models on the IndicSUPERB dataset for multiple Indian languages, including Bengali, Gujarati, Hindi, Marathi, Malayalam, Punjabi, and Tamil.
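
Classifier-free guidance is a standard technique rather than this paper's invention; a minimal sketch of the usual guided noise estimate is given below. The guidance weight w and the condition-dropping probability are generic choices, not values from the paper.

```python
import torch

def cfg_denoise(eps_cond: torch.Tensor, eps_uncond: torch.Tensor, w: float = 2.0):
    """Classifier-free guidance: extrapolate the conditional noise prediction
    away from the unconditional one to sharpen conditioning at inference."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# During training, the speaker/text condition is dropped with some probability
# (commonly ~10%) so the same network learns both eps_cond and eps_uncond.
```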

[485] MeMo: Attentional Momentum for Real-time Audio-visual Speaker Extraction under Impaired Visual Conditions

Junjie Li, Wenxuan Wu, Shuai Wang, Zexu Pan, Kong Aik Lee, Helen Meng, Haizhou Li

Main category: cs.SD

TL;DR: MeMo is a novel framework for AV-TSE that uses adaptive memory banks to maintain attention on a target speaker without relying on visual cues, improving performance by at least 2 dB SI-SNR.

DetailsMotivation: Human ability to focus on a target speaker without visual cues inspired the development of MeMo to address AV-TSE's reliance on visual input.

Method: MeMo incorporates two adaptive memory banks to store attention-related information, enabling sustained focus even without visual cues.

Result: MeMo achieves a minimum 2 dB SI-SNR improvement over baselines, demonstrating robustness in scenarios with degraded or missing visual cues.

Conclusion: MeMo effectively addresses AV-TSE’s dependency on visual cues, offering a reliable solution for real-time applications.

Abstract: Audio-visual Target Speaker Extraction (AV-TSE) aims to isolate a target speaker’s voice from multi-speaker environments by leveraging visual cues as guidance. However, the performance of AV-TSE systems heavily relies on the quality of these visual cues. In extreme scenarios where visual cues are missing or severely degraded, the system may fail to accurately extract the target speaker. In contrast, humans can maintain attention on a target speaker even in the absence of explicit auxiliary information. Motivated by such human cognitive ability, we propose a novel framework called MeMo, which incorporates two adaptive memory banks to store attention-related information. MeMo is specifically designed for real-time scenarios: once initial attention is established, the system maintains attentional momentum over time, even when visual cues become unavailable. We conduct comprehensive experiments to verify the effectiveness of MeMo. Experimental results demonstrate that our proposed framework achieves SI-SNR improvements of at least 2 dB over the corresponding baseline.

[486] Neuro-MSBG: An End-to-End Neural Model for Hearing Loss Simulation

Hui-Guan Yuan, Ryandhimas E. Zezario, Shafique Ahmed, Hsin-Min Wang, Kai-Lung Hua, Yu Tsao

Main category: cs.SD

TL;DR: Neuro-MSBG is a lightweight, efficient hearing loss simulation model that improves computational speed and integrates with speech processing systems.

DetailsMotivation: Existing hearing loss simulation models are computationally heavy and lack real-time applicability and integration with speech systems.

Method: Proposes Neuro-MSBG, an end-to-end model with a personalized audiogram encoder for time-frequency modeling.

Result: Achieves high SRCC scores (0.9247 for STOI, 0.8671 for PESQ) and reduces runtime by 46x (0.021s for 1s input).

Conclusion: Neuro-MSBG is efficient, practical, and retains intelligibility and perceptual quality.

Abstract: Hearing loss simulation models are essential for hearing aid deployment. However, existing models have high computational complexity and latency, which limits real-time applications and lack direct integration with speech processing systems. To address these issues, we propose Neuro-MSBG, a lightweight end-to-end model with a personalized audiogram encoder for effective time-frequency modeling. Experiments show that Neuro-MSBG supports parallel inference and retains the intelligibility and perceptual quality of the original MSBG, with a Spearman’s rank correlation coefficient (SRCC) of 0.9247 for Short-Time Objective Intelligibility (STOI) and 0.8671 for Perceptual Evaluation of Speech Quality (PESQ). Neuro-MSBG reduces simulation runtime by a factor of 46 (from 0.970 seconds to 0.021 seconds for a 1 second input), further demonstrating its efficiency and practicality.

[487] Multichannel Keyword Spotting for Noisy Conditions

Dzmitry Saladukha, Ivan Koriabkin, Kanstantsin Artsiom, Aliaksei Rak, Nikita Ryzhikov

Main category: cs.SD

TL;DR: A neural network with multi-channel input and attention mechanism improves keyword spotting in noisy environments, outperforming traditional methods like beamforming and adaptive noise cancellation.

DetailsMotivation: Traditional noise reduction techniques like beamforming and adaptive noise cancellation can degrade keyword spotting performance by distorting useful signals.

Method: Proposes a neural network architecture with multiple input channels and an attention mechanism to dynamically select or combine the most useful channels.

Result: Demonstrated improved performance on both controlled and natural datasets, outperforming baselines in noise reduction, keyword spotting accuracy, and computational efficiency.

Conclusion: The proposed method effectively enhances keyword spotting in noisy environments while maintaining computational efficiency.

Abstract: This article presents a method for improving a keyword spotter (KWS) algorithm in noisy environments. Although beamforming (BF) and adaptive noise cancellation (ANC) techniques are robust in some conditions, they may degrade the performance of the activation system by distorting or suppressing useful signals. The authors propose a neural network architecture that uses several input channels and an attention mechanism that allows the network to determine the most useful channel or their combination. The improved quality of the algorithm was demonstrated on two datasets: one collected in a laboratory under controlled conditions and one collected from smart speakers in natural conditions. The proposed algorithm was compared against several existing baselines in terms of noise reduction quality, KWS metrics, and computational resource usage.
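
One plausible reading of the channel-attention idea is a soft selection layer like the sketch below; the scoring network, pooling, and shapes are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Soft selection over microphone channels: score each channel's features
    and mix them with softmax weights before the keyword classifier."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, T, D) per-channel frame features
        w = torch.softmax(self.score(feats.mean(dim=2)), dim=1)  # (B, C, 1)
        return (w.unsqueeze(2) * feats).sum(dim=1)               # fused (B, T, D)

# x = torch.randn(4, 3, 100, 64)   # batch of 3-channel feature maps
# fused = ChannelAttention(64)(x)
```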

[488] Modeling nonuniform energy decay through the modal decomposition of acoustic radiance transfer (MoD-ART)

Matteo Scerbo, Sebastian J. Schlecht, Randall Ali, Lauri Savioja, Enzo De Sena

Main category: cs.SD

TL;DR: MoD-ART is a novel method for real-time late reverberation modeling in complex environments, efficiently handling multiple sound sources and listeners by leveraging modal decomposition of acoustic radiance transfer.

DetailsMotivation: Late reverberation modeling in interactive applications is challenging due to dynamic source-listener positions and complex environments, requiring real-time adaptability.

Method: MoD-ART decomposes acoustic radiance transfer into energy decay modes and their positional relationships, enabling efficient handling of complex scenarios.

Result: The method outperforms ray-tracing in computational complexity and accurately captures multiple decay slopes and flutter echoes.

Conclusion: MoD-ART is a promising approach for real-time reverberation modeling in complex, dynamic environments.

Abstract: Modeling late reverberation in real-time interactive applications is a challenging task when multiple sound sources and listeners are present in the same environment. This is especially problematic when the environment is geometrically complex and/or features uneven energy absorption (e.g. coupled volumes), because in such cases the late reverberation is dependent on the sound sources’ and listeners’ positions, and therefore must be adapted to their movements in real time. We present a novel approach to the task, named modal decomposition of acoustic radiance transfer (MoD-ART), which can handle highly complex scenarios with efficiency. The approach is based on the geometrical acoustics method of acoustic radiance transfer, from which we extract a set of energy decay modes and their positional relationships with sources and listeners. In this paper, we describe the physical and mathematical significance of MoD-ART, highlighting its advantages and applicability to different scenarios. Through an analysis of the method’s computational complexity, we show that it compares very favorably with ray-tracing. We also present simulation results showing that MoD-ART can capture multiple decay slopes and flutter echoes.
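
A plain eigendecomposition illustrates the modal idea: treating the radiance transfer operator as a matrix of per-frame energy exchange, its eigenvalues give decay rates. The sketch below is a simplified rendering of that step under those assumptions, not the paper's full method.

```python
import numpy as np

def decay_modes(T: np.ndarray, frame_rate: float, k: int = 5):
    """Slowest k energy-decay modes of an acoustic radiance transfer matrix T
    (fraction of energy passed between surface patches per frame; eigenvalue
    magnitudes stay below 1 when the room absorbs energy)."""
    vals, vecs = np.linalg.eig(T)
    order = np.argsort(-np.abs(vals))[:k]
    # energy in mode i after t seconds ~ |lambda_i|^(t * frame_rate),
    # so the -60 dB time solves |lambda|^(t60 * frame_rate) = 1e-6
    t60 = np.log(1e-6) / (np.log(np.abs(vals[order])) * frame_rate)
    return vals[order], vecs[:, order], t60

# A coupled-volume room yields several well-separated slow modes, i.e. the
# multiple decay slopes that single-slope reverberators cannot reproduce.
```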

[489] RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer

Seongho Hong, Yong-Hoon Choi

Main category: cs.SD

TL;DR: RingFormer, a neural vocoder using ring attention in a Conformer, addresses transformer limitations for audio generation, achieving real-time performance and competitive results.

DetailsMotivation: Transformers struggle with neural vocoders due to high computational costs and inefficiency in handling long audio sequences for real-time processing.

Method: Proposes RingFormer, integrating ring attention into a Conformer, trained adversarially with two discriminators, and applied to VITS.

Result: RingFormer matches or outperforms HiFi-GAN, iSTFT-Net, and BigVGAN, excelling in real-time generation.

Conclusion: RingFormer effectively combines local and global audio processing, enabling efficient real-time vocoding.

Abstract: While transformers demonstrate outstanding performance across various audio tasks, their application to neural vocoders remains challenging. Neural vocoders require the generation of long audio signals at the sample level, which demands high temporal resolution. This results in significant computational costs for attention map generation and limits their ability to efficiently process both global and local information. Additionally, the sequential nature of sample generation in neural vocoders poses difficulties for real-time processing, making the direct adoption of transformers impractical. To address these challenges, we propose RingFormer, a neural vocoder that incorporates the ring attention mechanism into a lightweight transformer variant, the convolution-augmented transformer (Conformer). Ring attention effectively captures local details while integrating global information, making it well-suited for processing long sequences and enabling real-time audio generation. RingFormer is trained using adversarial training with two discriminators. The proposed model is applied to the decoder of the text-to-speech model VITS and compared with state-of-the-art vocoders such as HiFi-GAN, iSTFT-Net, and BigVGAN under identical conditions using various objective and subjective metrics. Experimental results show that RingFormer achieves comparable or superior performance to existing models, particularly excelling in real-time audio generation. Our code and audio samples are available on GitHub.

[490] Mixture of LoRA Experts with Multi-Modal and Multi-Granularity LLM Generative Error Correction for Accented Speech Recognition

Bingshen Mu, Kun Wei, Pengcheng Guo, Lei Xie

Main category: cs.SD

TL;DR: The paper proposes multi-modal and multi-granularity generative error correction (GER) methods to improve accented speech recognition, reducing word error rate by 67.35% compared to baseline.

DetailsMotivation: Performance of automatic speech recognition drops with accented speech, and existing GER lacks specificity for accents. Accents require multi-granularity pronunciation and semantic information.

Method: Proposes multi-modal GER (integrates speech modality pronunciation) and multi-granularity GER (phoneme-level info). Uses LoRA fine-tuning and HDMoLE for accent diversity.

Result: Reduces word error rate by 67.35% compared to baseline Whisper-large-v3.

Conclusion: The methods effectively address accent diversity and improve transcription accuracy.

Abstract: Despite improvements in automatic speech recognition, performance drops with accented speech. Generative error correction (GER) leverages the linguistic knowledge of large language models (LLMs), outperforming typical language model methods. However, it lacks specificity in accented speech scenarios. Accents represent deviations from standard pronunciation, making multi-granularity pronunciation and semantic information essential for accented speech recognition. Moreover, accents exhibit considerable diversity, with each accent possessing distinct characteristics. In this study, we leverage GER to improve transcription accuracy by addressing the two primary features. We propose the multi-modal GER, which integrates pronunciation information from the speech modality, and the multi-granularity GER, which incorporates fine-grained phoneme-level pronunciation information. These methods enable the LLM to utilize the pronunciation information of accented speech and the semantic information from word-level hypotheses for accurate transcription predictions through low-rank adaptation (LoRA) fine-tuning. We employ a three-stage strategy to train separate multi-modal GER models for each accent to obtain mono-accent LoRA experts. By adopting our proposed HDMoLE method, which incorporates hierarchical routing and dynamic thresholds within the mixture of LoRA experts, we effectively merge mono-accent LoRA experts within a single multi-modal GER to overcome accent diversity challenges. Furthermore, multi-granularity GER leverages N-best word-level and phoneme-level hypotheses from the HDMoLE model to predict final transcriptions. Experiments on a multi-accent English dataset show that our methods reduce word error rate by 67.35% compared to the baseline vanilla Whisper-large-v3 model.
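
A minimal sketch of the mixture-of-LoRA-experts idea follows; the paper's HDMoLE additionally uses hierarchical routing and dynamic thresholds, which are omitted here, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class MoLoRALinear(nn.Module):
    """A frozen base layer plus several low-rank adapters mixed by a gate."""
    def __init__(self, d_in, d_out, n_experts=4, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():
            p.requires_grad_(False)              # the base model stays frozen
        self.A = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, d_out))
        self.gate = nn.Linear(d_in, n_experts)   # stand-in for HDMoLE routing

    def forward(self, x):
        # x: (batch, d_in)
        gates = torch.softmax(self.gate(x), dim=-1)           # (batch, experts)
        delta = torch.einsum('bi,eir,ero->beo', x, self.A, self.B)
        return self.base(x) + (gates.unsqueeze(-1) * delta).sum(dim=1)

y = MoLoRALinear(16, 32)(torch.randn(4, 16))
print(y.shape)  # torch.Size([4, 32])
```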

[491] SC-TSE: Speaker Consistency-Aware Target Speaker Extraction

Shu Wu, Anbin Qi, Yanzhang Xie, Xiang Xie

Main category: cs.SD

TL;DR: A speaker consistency-aware TSE method with centroid-based loss and conditional loss suppression improves performance by ensuring speaker consistency.

DetailsMotivation: Speaker embeddings in TSE systems may suffer from identity confusion, so the paper focuses on improving speaker consistency rather than embedding extraction.

Method: Proposes a centroid-based speaker consistency loss and integrates conditional loss suppression during training.

Result: Experimental results show the method effectively enhances TSE performance.

Conclusion: The proposed approach advances TSE performance by addressing speaker consistency, validated by experiments and a demo.

Abstract: Target Speaker Extraction (TSE) uses a reference cue to extract the target speech from a mixture. In TSE systems relying on audio cues, the speaker embedding from the enrolled speech is crucial to performance. However, these embeddings may suffer from speaker identity confusion. Unlike previous studies that focus on improving speaker embedding extraction, we improve TSE performance from the perspective of speaker consistency. In this paper, we propose a speaker consistency-aware target speaker extraction method that incorporates a centroid-based speaker consistency loss. This approach enhances TSE performance by ensuring speaker consistency between the enrolled and extracted speech. In addition, we integrate conditional loss suppression into the training process. The experimental results validate the effectiveness of our proposed methods in advancing the TSE performance. A speech demo is available online: https://sc-tse.netlify.app/
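
A plausible form of the centroid-based consistency loss, sketched under the assumption that it pulls the extracted-speech embedding toward the target speaker's centroid via cosine similarity (the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def speaker_consistency_loss(extracted_emb, speaker_ids, centroids):
    """extracted_emb: (batch, dim) embeddings of the separated outputs.
    speaker_ids: (batch,) target-speaker index per item.
    centroids: (n_speakers, dim) mean embedding per enrolled speaker."""
    target = centroids[speaker_ids]                        # (batch, dim)
    cos = F.cosine_similarity(extracted_emb, target, dim=-1)
    return (1.0 - cos).mean()                              # 0 when aligned

emb = torch.randn(8, 192)
ids = torch.randint(0, 10, (8,))
cents = torch.randn(10, 192)
print(speaker_consistency_loss(emb, ids, cents))
```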

[492] Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition

Mengzhe Geng, Patrick Littell, Aidan Pine, PENÁĆ, Marc Tessier, Roland Kuhn

Main category: cs.SD

TL;DR: The paper proposes an ASR-driven pipeline for SENCOTEN language revitalization, using TTS-augmented data and cross-lingual transfer learning, achieving improved WER and CER despite high OOV rates.

DetailsMotivation: To support SENCOTEN language revitalization by overcoming challenges like limited data and vocabulary variation in ASR development.

Method: An ASR pipeline combining TTS-augmented data, cross-lingual transfer learning with SFMs, and n-gram language models via shallow fusion or n-best rescoring.

Result: Achieved WER of 14.32% (26.48% on unseen words) and CER of 3.45% after filtering minor errors, demonstrating pipeline effectiveness.

Conclusion: The proposed ASR-driven pipeline shows promise for supporting SENCOTEN language documentation and revitalization efforts.

Abstract: The SENCOTEN language, spoken on the Saanich peninsula of southern Vancouver Island, is in the midst of vigorous language revitalization efforts to turn the tide of language loss as a result of colonial language policies. To support these on-the-ground efforts, the community is turning to digital technology. Automatic Speech Recognition (ASR) technology holds great promise for accelerating language documentation and the creation of educational resources. However, developing ASR systems for SENCOTEN is challenging due to limited data and significant vocabulary variation from its polysynthetic structure and stress-driven metathesis. To address these challenges, we propose an ASR-driven documentation pipeline that leverages augmented speech data from a text-to-speech (TTS) system and cross-lingual transfer learning with Speech Foundation Models (SFMs). An n-gram language model is also incorporated via shallow fusion or n-best rescoring to maximize the use of available data. Experiments on the SENCOTEN dataset show a word error rate (WER) of 19.34% and a character error rate (CER) of 5.09% on the test set with a 57.02% out-of-vocabulary (OOV) rate. After filtering minor cedilla-related errors, WER improves to 14.32% (26.48% on unseen words) and CER to 3.45%, demonstrating the potential of our ASR-driven pipeline to support SENCOTEN language documentation.
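
Shallow fusion itself is standard and easy to state in code: a hypothesis is ranked by its ASR log-probability plus a weighted LM log-probability. The hypotheses, scores, and LM weight below are toy assumptions.

```python
def shallow_fusion_score(asr_logprob, lm_logprob, lam=0.3):
    # Standard shallow fusion: interpolate ASR and LM log-probabilities.
    return asr_logprob + lam * lm_logprob

# Toy n-best list: (hypothesis, ASR log p, n-gram LM log p)
nbest = [("hypothesis_a", -4.1, -9.0), ("hypothesis_b", -4.3, -5.2)]
best = max(nbest, key=lambda h: shallow_fusion_score(h[1], h[2]))
print(best[0])  # "hypothesis_b": the LM overturns the raw ASR ranking
```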

cs.LG

[493] Catalyst: a Novel Regularizer for Structured Pruning with Auxiliary Extension of Parameter Space

Jaeheun Jung, Donghun Lee

Main category: cs.LG

TL;DR: The paper introduces Catalyst regularization for structured pruning, addressing biases and instability in traditional methods by ensuring fair pruning and robust wide-margin decisions.

DetailsMotivation: Traditional pruning methods like L1 or Group Lasso are biased towards small-magnitude filters and lack robustness due to narrow decision margins. This work aims to overcome these limitations.

Method: The authors identify an algebraic condition for performance-preserving pruning and use it to design Catalyst regularization, leveraging auxiliary variables for fair and robust pruning.

Result: Catalyst pruning outperforms state-of-the-art methods, demonstrating fairness, robustness, and superior performance across datasets and models.

Conclusion: Catalyst regularization provides a theoretically grounded and empirically validated solution for structured pruning, ensuring unbiased and stable results.

Abstract: Structured pruning aims to reduce the size and computational cost of deep neural networks by removing entire filters or channels. Traditional regularizers such as L1 or Group Lasso and their variants lead to magnitude-biased pruning decisions, such that filters with small magnitudes are likely to be pruned. They also often entail pruning results with almost zero margin around the pruning decision boundary, such that a tiny perturbation in a filter magnitude can flip the pruning decision. In this paper, we identify the precise algebraic condition under which pruning operations preserve model performance, and use the condition to construct a novel regularizer defined in an extended parameter space via auxiliary catalyst variables. The proposed Catalyst regularization ensures a fair pruning chance for each filter, with theoretically provable zero bias with respect to magnitude, and robust pruning behavior achieved by wide-margin bifurcation of magnitudes between the preserved and the pruned filters. The theoretical properties naturally lead to real-world effectiveness, as shown by empirical validations of the Catalyst Pruning algorithm. Pruning results on various datasets and models are superior to state-of-the-art filter pruning methods, and at the same time confirm the predicted robust and fair pruning characteristics of Catalyst pruning.
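
The magnitude bias being criticized is easy to see in code. Below is the standard Group Lasso filter penalty (the baseline regularizer the paper argues against, not Catalyst itself): its gradient per filter is a constant-magnitude pull toward zero, which small filters cannot resist.

```python
import torch

def group_lasso_penalty(conv_weight):
    # conv_weight: (out_channels, in_channels, k, k); one group per filter.
    return conv_weight.flatten(1).norm(dim=1).sum()

w = torch.randn(16, 8, 3, 3, requires_grad=True)
group_lasso_penalty(w).backward()
# The gradient of each filter is w / ||w||: unit magnitude regardless of the
# filter's functional role, so small-magnitude filters are driven to zero
# first -- the bias Catalyst is designed to remove.
print(w.grad.flatten(1).norm(dim=1))  # ~1.0 for every filter
```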

[494] IPPRO: Importance-based Pruning with PRojective Offset for Magnitude-indifferent Structural Pruning

Jaeheun Jung, Jaehyuk Lee, Yeajin Lee, Donghun Lee

Main category: cs.LG

TL;DR: The paper proposes a novel pruning strategy (IPPRO) using projective space to fairly evaluate filter importance, challenging the dominance of magnitude-based pruning.

DetailsMotivation: Current importance-based pruning methods are limited by magnitude bias, often overlooking redundant filters. The goal is to provide a fairer pruning approach.

Method: The method involves observing gradient descent movement of filters in projective space to measure pruning likelihood, creating the PROscore for IPPRO.

Result: IPPRO achieves near-lossless pruning with minimal performance drop and promising post-finetuning results.

Conclusion: The work debunks the ‘size-matters’ myth in pruning and advances importance-based pruning theoretically and empirically.

Abstract: With the growth of demand for neural network compression methods, structured pruning methods, including importance-based approaches, are actively studied. Magnitude importance and many correlated modern importance criteria often limit the capacity of the pruning decision, since filters with larger magnitudes are unlikely to be pruned before smaller ones, even when they are redundant. In this paper, we propose a novel pruning strategy that challenges this dominating effect of magnitude and gives each filter a fair chance of being pruned, by placing it in projective space. We then observe whether gradient descent moves each filter toward the origin, and use this movement to measure how likely the filter is to be pruned. This measurement is used to construct PROscore, a novel importance score for IPPRO, a novel magnitude-indifferent importance-based structured pruning method. Our evaluation results show that the proposed importance criterion using projective space achieves near-lossless pruning by reducing the performance drop from pruning, with promising performance after finetuning. Our work debunks the "size-matters" myth in pruning and expands the frontier of importance-based pruning both theoretically and empirically.

[495] Self-Improving Language Models for Evolutionary Program Synthesis: A Case Study on ARC-AGI

Julien Pourcel, Cédric Colas, Pierre-Yves Oudeyer

Main category: cs.LG

TL;DR: SOAR integrates language models into a self-improving evolutionary loop for program synthesis, achieving significant gains on the ARC-AGI benchmark.

DetailsMotivation: State-of-the-art language models struggle with complex program synthesis tasks, and search-based evolutionary methods are limited by fixed generative model capabilities.

Method: SOAR alternates between evolutionary search using an LLM to refine solutions and hindsight learning to fine-tune the LLM, improving search effectiveness iteratively.

Result: SOAR achieves 52% success on the ARC-AGI public test set, with performance gains across model scales and iterations.

Conclusion: SOAR demonstrates the potential of combining evolutionary search with LLM fine-tuning for scalable program synthesis.

Abstract: Many program synthesis tasks prove too challenging for even state-of-the-art language models to solve in single attempts. Search-based evolutionary methods offer a promising alternative by exploring solution spaces iteratively, but their effectiveness remains limited by the fixed capabilities of the underlying generative model. We propose SOAR, a method that learns program synthesis by integrating language models into a self-improving evolutionary loop. SOAR alternates between (1) an evolutionary search that uses an LLM to sample and refine candidate solutions, and (2) a hindsight learning phase that converts search attempts into valid problem-solution pairs used to fine-tune the LLM’s sampling and refinement capabilities, enabling increasingly effective search in subsequent iterations. On the challenging ARC-AGI benchmark, SOAR achieves significant performance gains across model scales and iterations, leveraging positive transfer between the sampling and refinement finetuning tasks. These improvements carry over to test-time adaptation, enabling SOAR to solve 52% of the public test set. Our code is open-sourced at: https://github.com/flowersteam/SOAR
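
The outer loop can be sketched with toy stand-ins. Everything below (integer "programs", a sampling bias standing in for fine-tuning) is an illustrative assumption replacing a real LLM and gradient updates; only the alternation between evolutionary search and hindsight learning mirrors the described method.

```python
import random

TARGET = 42                                 # stands in for an ARC-style task
score = lambda prog: -abs(prog - TARGET)    # higher is better

def llm_sample(bias):
    # Stand-in for LLM sampling: uniform at first, later concentrated
    # around the region the "fine-tuned" model has learned to favor.
    return random.randint(0, 100) if bias is None else bias + random.randint(-10, 10)

def soar(iters=5, pop=20):
    replay, bias, best = [], None, None
    for _ in range(iters):
        # (1) evolutionary search: sample candidate programs
        candidates = [llm_sample(bias) for _ in range(pop)]
        best = max(candidates, key=score)
        # (2) hindsight learning: keep attempts as training signal; here
        # "fine-tuning" simply biases the next round of sampling
        replay.append(best)
        bias = max(replay, key=score)
        if best == TARGET:
            break
    return best

print(soar())
```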

[496] Latent Space Data Fusion Outperforms Early Fusion in Multimodal Mental Health Digital Phenotyping Data

Youcef Barkat, Dylan Hamitouche, Deven Parekh, Ivy Guo, David Benrimoh

Main category: cs.LG

TL;DR: Intermediate fusion (latent space) outperforms early fusion in predicting depressive symptoms using multimodal data, showing better accuracy and generalization.

DetailsMotivation: Improve early detection and personalized intervention for mental illnesses by addressing limitations of traditional unimodal or early fusion methods.

Method: Compared intermediate fusion (Combined Model with autoencoders and neural network) against early fusion (Random Forest) and Linear Regression using BRIGHTEN trial data.

Result: Combined Model achieved lower MSE (0.4985 vs. 0.5305) and higher R2 (0.4695 vs. 0.4356) than Random Forest, with better generalization.

Conclusion: Latent space fusion is robust for multimodal mental health data; future work should focus on interpretability and clinical deployment.

Abstract: Background: Mental illnesses such as depression and anxiety require improved methods for early detection and personalized intervention. Traditional predictive models often rely on unimodal data or early fusion strategies that fail to capture the complex, multimodal nature of psychiatric data. Advanced integration techniques, such as intermediate (latent space) fusion, may offer better accuracy and clinical utility. Methods: Using data from the BRIGHTEN clinical trial, we evaluated intermediate (latent space) fusion for predicting daily depressive symptoms (PHQ-2 scores). We compared early fusion implemented with a Random Forest (RF) model and intermediate fusion implemented via a Combined Model (CM) using autoencoders and a neural network. The dataset included behavioral (smartphone-based), demographic, and clinical features. Experiments were conducted across multiple temporal splits and data stream combinations. Performance was evaluated using mean squared error (MSE) and coefficient of determination (R2). Results: The CM outperformed both RF and Linear Regression (LR) baselines across all setups, achieving lower MSE (0.4985 vs. 0.5305 with RF) and higher R2 (0.4695 vs. 0.4356). The RF model showed signs of overfitting, with a large gap between training and test performance, while the CM maintained consistent generalization. Performance was best when integrating all data modalities in the CM (in contradistinction to RF), underscoring the value of latent space fusion for capturing non-linear interactions in complex psychiatric datasets. Conclusion: Latent space fusion offers a robust alternative to traditional fusion methods for prediction with multimodal mental health data. Future work should explore model interpretability and individual-level prediction for clinical deployment.
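
A minimal sketch of the intermediate-fusion architecture as described (per-modality autoencoders, concatenated latents, a regression head for the PHQ-2 score); all layer sizes and the loss weighting are assumptions.

```python
import torch
import torch.nn as nn

class LatentFusionModel(nn.Module):
    def __init__(self, dims=None, latent=8):
        super().__init__()
        dims = dims or {'behavior': 32, 'demographic': 8, 'clinical': 12}
        self.enc = nn.ModuleDict({m: nn.Linear(d, latent) for m, d in dims.items()})
        self.dec = nn.ModuleDict({m: nn.Linear(latent, d) for m, d in dims.items()})
        self.head = nn.Sequential(nn.Linear(latent * len(dims), 16),
                                  nn.ReLU(), nn.Linear(16, 1))

    def forward(self, inputs):
        z = {m: self.enc[m](x) for m, x in inputs.items()}   # per-modality latents
        recon = sum(((self.dec[m](z[m]) - inputs[m]) ** 2).mean() for m in z)
        pred = self.head(torch.cat([z[m] for m in sorted(z)], dim=-1))
        return pred.squeeze(-1), recon  # train on MSE(pred, PHQ-2) + recon

model = LatentFusionModel()
x = {'behavior': torch.randn(4, 32), 'demographic': torch.randn(4, 8),
     'clinical': torch.randn(4, 12)}
pred, recon = model(x)
print(pred.shape)  # torch.Size([4])
```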

[497] Predictive Representativity: Uncovering Racial Bias in AI-based Skin Cancer Detection

Andrés Morales-Forero, Lili J. Rueda, Ronald Herrera, Samuel Bassetto, Eric Coatanea

Main category: cs.LG

TL;DR: The paper introduces Predictive Representativity (PR) to audit AI fairness, focusing on outcome equity rather than dataset composition. It reveals performance disparities in skin cancer classifiers for darker skin tones and proposes solutions for fairness generalization.

DetailsMotivation: Addressing algorithmic bias and inequitable outcomes in AI-driven medical decisions, especially for marginalized groups.

Method: Evaluated AI skin cancer classifiers on HAM10000 and BOSQUE datasets, analyzing performance by skin phototype. Proposed PR framework and External Transportability Criterion.

Result: Significant performance disparities for darker skin tones, despite proportional dataset sampling. PR framework highlights fairness generalization issues.

Conclusion: Advocates for post-hoc fairness auditing, transparent documentation, and inclusive validation to address structural inequities in AI healthcare systems.

Abstract: Artificial intelligence (AI) systems increasingly inform medical decision-making, yet concerns about algorithmic bias and inequitable outcomes persist, particularly for historically marginalized populations. This paper introduces the concept of Predictive Representativity (PR), a framework of fairness auditing that shifts the focus from the composition of the data set to outcomes-level equity. Through a case study in dermatology, we evaluated AI-based skin cancer classifiers trained on the widely used HAM10000 dataset and on an independent clinical dataset (BOSQUE Test set) from Colombia. Our analysis reveals substantial performance disparities by skin phototype, with classifiers consistently underperforming for individuals with darker skin, despite proportional sampling in the source data. We argue that representativity must be understood not as a static feature of datasets but as a dynamic, context-sensitive property of model predictions. PR operationalizes this shift by quantifying how reliably models generalize fairness across subpopulations and deployment contexts. We further propose an External Transportability Criterion that formalizes the thresholds for fairness generalization. Our findings highlight the ethical imperative for post-hoc fairness auditing, transparency in dataset documentation, and inclusive model validation pipelines. This work offers a scalable tool for diagnosing structural inequities in AI systems, contributing to discussions on equity, interpretability, and data justice and fostering a critical re-evaluation of fairness in data-driven healthcare.

[498] Understanding Two-Layer Neural Networks with Smooth Activation Functions

Changcun Huang

Main category: cs.LG

TL;DR: The paper analyzes the training solutions of two-layer neural networks with smooth activation functions, revealing the solution space’s structure through four principles and proving universal approximation.

DetailsMotivation: To demystify the 'black box' of solution spaces in two-layer neural networks with smooth activation functions and enrich approximation theory.

Method: Uses Taylor series expansions, strict partial order of knots, smooth-spline implementation, and smooth-continuity restriction.

Result: Proves universal approximation for arbitrary input dimensionality and provides experimental verification.

Conclusion: The study clarifies the solution space of neural networks and contributes new proofs to approximation theory.

Abstract: This paper aims to understand the training solution, which is obtained by the back-propagation algorithm, of two-layer neural networks whose hidden layer is composed of units with smooth activation functions, including the usual sigmoid type most commonly used before the advent of ReLUs. The mechanism contains four main principles: construction of Taylor series expansions, strict partial order of knots, smooth-spline implementation and smooth-continuity restriction. The universal approximation for arbitrary input dimensionality is proved and experimental verification is given, through which the mystery of the "black box" of the solution space is largely revealed. The new proofs employed also enrich approximation theory.

[499] Feature Bank Enhancement for Distance-based Out-of-Distribution Detection

Yuhang Liu, Yuefei Wu, Bin Shi, Bo Dong

Main category: cs.LG

TL;DR: The paper proposes Feature Bank Enhancement (FBE) to improve OOD detection by addressing biased feature distributions in distance-based methods.

DetailsMotivation: Distance-based OOD detection methods struggle with biased feature distributions, leading to low scores for ID samples.

Method: FBE uses statistical characteristics to constrain extreme features, enhancing separation between ID and OOD samples.

Result: FBE achieves state-of-the-art performance on ImageNet-1k and CIFAR-10 benchmarks.

Conclusion: FBE effectively improves OOD detection by mitigating feature bias, supported by theoretical and experimental validation.

Abstract: Out-of-distribution (OOD) detection is critical to ensuring the reliability of deep learning applications and has attracted significant attention in recent years. A rich body of literature has emerged to develop efficient score functions that assign high scores to in-distribution (ID) samples and low scores to OOD samples, thereby helping distinguish OOD samples. Among these methods, distance-based score functions are widely used because of their efficiency and ease of use. However, deep learning often leads to a biased distribution of data features, and extreme features are inevitable. These extreme features make distance-based methods tend to assign overly low scores to ID samples, which limits their OOD detection capabilities. To address this issue, we propose a simple yet effective method, Feature Bank Enhancement (FBE), that uses statistical characteristics of the dataset to identify and constrain extreme features to the separation boundaries, thereby pushing samples inside and outside the distribution farther apart. We conducted experiments on large-scale ImageNet-1k and CIFAR-10 respectively, and the results show that our method achieves state-of-the-art performance on both benchmarks. Additionally, theoretical analysis and supplementary experiments are conducted to provide more insights into our method.
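
From the abstract's description, the mechanism can be sketched as clipping extreme feature values to boundaries estimated from training statistics before computing a nearest-neighbour distance score. The mean ± k·std boundary rule below is an assumption; the paper's boundary may be defined differently.

```python
import numpy as np

def fbe_score(train_feats, test_feat, k_sigma=3.0):
    mu, sigma = train_feats.mean(axis=0), train_feats.std(axis=0)
    lo, hi = mu - k_sigma * sigma, mu + k_sigma * sigma
    bank = np.clip(train_feats, lo, hi)     # constrain extreme ID features
    query = np.clip(test_feat, lo, hi)
    # Negative distance to the nearest bank entry: higher means more ID-like.
    return -np.min(np.linalg.norm(bank - query, axis=1))

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 64))
print(fbe_score(feats, rng.normal(size=64)))        # ID-like: higher score
print(fbe_score(feats, rng.normal(5.0, 1.0, 64)))   # OOD-like: lower score
```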

[500] Time-RA: Towards Time Series Reasoning for Anomaly with LLM Feedback

Yiyuan Yang, Zichuan Liu, Lei Song, Kai Ying, Zhiguang Wang, Tom Bamford, Svitlana Vyetrenko, Jiang Bian, Qingsong Wen

Main category: cs.LG

TL;DR: The paper introduces Time-RA, a generative task for time-series anomaly reasoning using LLMs, and RATs40K, a multimodal benchmark dataset with detailed annotations. It highlights the limitations of current binary classification methods and demonstrates the potential of supervised fine-tuning for interpretable anomaly detection.

DetailsMotivation: Current time-series anomaly detection lacks detailed categorization and explanatory reasoning, limiting its interpretability and practical utility.

Method: Proposes Time-RA, a generative task leveraging LLMs, and introduces RATs40K, a multimodal dataset with fine-grained annotations and structured reasoning. Uses GPT-4-driven feedback for accurate labeling.

Result: Benchmarking shows the capabilities and limitations of LLMs and multimodal LLMs, emphasizing the importance of supervised fine-tuning.

Conclusion: The work advances interpretable time-series anomaly detection and reasoning, providing a foundation for future research.

Abstract: Time series anomaly detection is critical across various domains, yet current approaches often limit analysis to mere binary anomaly classification without detailed categorization or further explanatory reasoning. To address these limitations, we propose a novel task, Time-series Reasoning for Anomaly (Time-RA) that transforms classical time series anomaly detection from a discriminative into a generative, reasoning-intensive task leveraging Large Language Models (LLMs). Also, we introduce the first real-world multimodal benchmark dataset, RATs40K, explicitly annotated for anomaly reasoning, comprising approximately 40,000 samples across 10 real-world domains. Each sample includes numeric time series data, contextual text information, and visual representations, each annotated with fine-grained categories (14 types for univariate anomalies and 6 for multivariate anomalies) and structured explanatory reasoning. We develop a sophisticated annotation framework utilizing ensemble-generated labels refined through GPT-4-driven feedback, ensuring accuracy and interpretability. Extensive benchmarking of LLMs and multimodal LLMs demonstrates the capabilities and limitations of current models, highlighting the critical role of supervised fine-tuning. Our dataset and task pave the way for significant advancements in interpretable time series anomaly detection and reasoning.

[501] Developing an AI-Guided Assistant Device for the Deaf and Hearing Impaired

Jiayu Liu

Main category: cs.LG

TL;DR: A deep learning system for an accessibility device for the deaf or hearing impaired, featuring sound localization and identification using JerryNet, CLAP, and multimodal integration, achieving high accuracy.

DetailsMotivation: To address the gap in accessibility devices for the deaf or hearing impaired by leveraging machine learning for real-time sound source localization and identification.

Method: 1. JerryNet (CNN for direction of arrival). 2. CLAP model (audio classification). 3. Multimodal integration (audio-visual-text localization). Hardware includes microphones and a camera on glasses.

Result: JerryNet: 91.1% precision. CLAP: 98.5% (custom), 95% (AudioSet). Audio-visual model: CIoU 0.892, AUC 0.658.

Conclusion: The system shows high accuracy and potential for future accessibility devices.

Abstract: This study aims to develop a deep learning system for an accessibility device for the deaf or hearing impaired. The device will accurately localize and identify sound sources in real time. This study will fill an important gap in current research by leveraging machine learning techniques to serve an underprivileged community. The system includes three main components. 1. JerryNet: A custom-designed CNN architecture that determines the direction of arrival (DoA) among nine possible directions. 2. Audio classification: This model fine-tunes the Contrastive Language-Audio Pretraining (CLAP) model to identify the exact sound classes based only on audio. 3. Multimodal integration model: This is a sound localization model that combines audio, visual, and text data to locate the exact sound sources in images. It consists of two modules: object detection using YOLOv9 to generate bounding boxes for all objects, and an audio-visual localization model that identifies the optimal bounding box using Complete Intersection over Union (CIoU). The hardware consists of a four-microphone rectangular formation and a camera mounted on glasses, with a wristband for displaying necessary information such as direction. On a custom-collected dataset, JerryNet achieved a precision of 91.1% for sound direction, outperforming all baseline models. The CLAP model achieved 98.5% and 95% accuracy on the custom and AudioSet datasets, respectively. The audio-visual localization model within component 3 yielded a CIoU of 0.892 and an AUC of 0.658, surpassing similar models. This study has substantial future potential, paving the way toward a new generation of accessibility devices.

[502] A Sparsity Predicting Approach for Large Language Models via Activation Pattern Clustering

Nobel Dhar, Bobin Deng, Md Romyull Islam, Xinyue Zhang, Kazi Fahim Ahmad Nasif, Kun Suo

Main category: cs.LG

TL;DR: The paper proposes a clustering-based framework to efficiently predict and utilize activation sparsity in LLMs, reducing computational costs while preserving model quality.

DetailsMotivation: Activation sparsity in LLMs offers computational savings, but predicting neuron-level activation patterns is impractical due to the vast number of neurons.

Method: A clustering-based approach groups similar activation patterns into representative clusters, enabling efficient prediction and utilization of sparsity.

Result: Achieves 79.34% clustering precision and a perplexity score of 12.49, balancing efficiency and model quality.

Conclusion: The framework improves sparse computation efficiency and serves as a foundation for future work on activation pattern prediction in LLMs.

Abstract: Large Language Models (LLMs) exhibit significant activation sparsity, where only a subset of neurons are active for a given input. Although this sparsity presents opportunities to reduce computational cost, efficiently utilizing it requires predicting activation patterns in a scalable manner. However, direct prediction at the neuron level is computationally expensive due to the vast number of neurons in modern LLMs. To enable efficient prediction and utilization of activation sparsity, we propose a clustering-based activation pattern compression framework. Instead of treating each neuron independently, we group similar activation patterns into a small set of representative clusters. Our method achieves up to 79.34% clustering precision, outperforming standard binary clustering approaches while maintaining minimal degradation in perplexity (PPL) scores. With a sufficiently large number of clusters, our approach attains a PPL score as low as 12.49, demonstrating its effectiveness in preserving model quality while reducing computational overhead. By predicting cluster assignments rather than individual neuron states, future models can efficiently infer activation patterns from pre-computed centroids. We detail the clustering algorithm, analyze its effectiveness in capturing meaningful activation structures, and demonstrate its potential to improve sparse computation efficiency. This clustering-based formulation serves as a foundation for future work on activation pattern prediction, paving the way for efficient inference in large-scale language models.
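
A toy version of the clustering step, with the binarization rule, sizes, and cluster count all assumed for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
acts = rng.normal(size=(5000, 2048))            # (tokens, FFN neurons)
patterns = (acts > 0.5).astype(np.float32)      # 1 = neuron counted as active

km = KMeans(n_clusters=64, n_init=4, random_state=0).fit(patterns)
centroids = km.cluster_centers_ > 0.5           # representative binary masks

# At inference, predicting a cluster id is enough to recover the neuron
# mask to compute, instead of predicting each neuron's state individually.
mask = centroids[km.predict(patterns[:1])[0]]
print(int(mask.sum()), "of", mask.size, "neurons computed for this token")
```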

[503] Digital Twin-Assisted Explainable AI for Robust Beam Prediction in mmWave MIMO Systems

Nasir Khan, Asmaa Abdallah, Abdulkadir Celik, Ahmed M. Eltawil, Sinem Coleri

Main category: cs.LG

TL;DR: A robust and explainable DL-based beam alignment engine (BAE) for mmWave MIMO systems is proposed, reducing data needs and beam training overhead while improving transparency and robustness.

DetailsMotivation: Address challenges in DL-based beam alignment, such as high data collection overhead, lack of explainability, and susceptibility to adversarial attacks, to build trust and reliability in mmWave systems.

Method: Uses RSSI measurements from wide beams to predict narrow beams, leverages a digital twin for synthetic data, employs transfer learning for model refinement, and integrates SHAP and DkNN for explainability and robustness.

Result: Reduces real-world data needs by 70%, beam training overhead by 62%, and improves outlier detection robustness by 8.5x, achieving near-optimal spectral efficiency.

Conclusion: The proposed framework enhances efficiency, transparency, and robustness in mmWave beam alignment, addressing key challenges in DL solutions.

Abstract: In line with the AI-native 6G vision, explainability and robustness are crucial for building trust and ensuring reliable performance in millimeter-wave (mmWave) systems. Efficient beam alignment is essential for initial access, but deep learning (DL) solutions face challenges, including high data collection overhead, hardware constraints, lack of explainability, and susceptibility to adversarial attacks. This paper proposes a robust and explainable DL-based beam alignment engine (BAE) for mmWave multiple-input multiple-output (MIMO) systems. The BAE uses received signal strength indicator (RSSI) measurements from wide beams to predict the best narrow beam, reducing the overhead of exhaustive beam sweeping. To overcome the challenge of real-world data collection, this work leverages a site-specific digital twin (DT) to generate synthetic channel data closely resembling real-world environments. A model refinement via transfer learning is proposed to fine-tune the pre-trained model residing in the DT with minimal real-world data, effectively bridging mismatches between the digital replica and real-world environments. To reduce beam training overhead and enhance transparency, the framework uses deep Shapley additive explanations (SHAP) to rank input features by importance, prioritizing key spatial directions and minimizing beam sweeping. It also incorporates the Deep k-nearest neighbors (DkNN) algorithm, providing a credibility metric for detecting out-of-distribution inputs and ensuring robust, transparent decision-making. Experimental results show that the proposed framework reduces real-world data needs by 70%, beam training overhead by 62%, and improves outlier detection robustness by up to 8.5x, achieving near-optimal spectral efficiency and transparent decision-making compared to traditional softmax-based DL models.
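
The SHAP-based ranking step can be sketched with the shap library over a stand-in predictor; the toy weights and data are assumptions, not the paper's trained BAE.

```python
import numpy as np
import shap  # pip install shap

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                   # 8 wide-beam RSSI features
weights = np.array([3.0, 2.5, 0.1, 0.1, 1.5, 0.1, 0.1, 0.2])
predict = lambda x: x @ weights                 # toy best-beam scorer

explainer = shap.KernelExplainer(predict, X[:50])
shap_vals = explainer.shap_values(X[:20], nsamples=100)
importance = np.abs(shap_vals).mean(axis=0)     # mean |SHAP| per input beam
print(np.argsort(importance)[::-1])             # beams ranked by contribution
# Low-ranked beams can then be dropped from the sweep to cut overhead.
```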

[504] An Investigation of Test-time Adaptation for Audio Classification under Background Noise

Weichuang Shao, Iman Yi Liao, Tomas Henrique Bode Maul, Tissa Chandesa

Main category: cs.LG

TL;DR: The paper addresses domain shift in audio classification using Test-Time Adaptation (TTA), comparing methods like TTT, TENT, and a modified CoNMix, achieving the best results with CoNMix.

DetailsMotivation: Domain shift in deep learning degrades model performance on test datasets. This study focuses on audio classification under domain shift caused by background noise, using TTA to adapt pre-trained models during testing.

Method: The study employs TTA techniques (TTT, TENT, and a modified CoNMix) on audio datasets (AudioMNIST and SpeechCommands V1) under various noise conditions.

Result: The modified CoNMix outperformed others, achieving 5.31% and 12.75% error rates under severe noise conditions.

Conclusion: This is the first study to apply TTA for audio classification under domain shift, demonstrating the effectiveness of the modified CoNMix method.

Abstract: Domain shift is a prominent problem in Deep Learning, causing a model pre-trained on a source dataset to suffer significant performance degradation on test datasets. This research aims to address the issue of audio classification under domain shift caused by background noise using Test-Time Adaptation (TTA), a technique that adapts a pre-trained model during testing using only unlabelled test data before making predictions. We adopt two common TTA methods, TTT and TENT, and a state-of-the-art method CoNMix, and investigate their respective performance on two popular audio classification datasets, AudioMNIST (AM) and SpeechCommands V1 (SC), against different types of background noise and noise severity levels. The experimental results reveal that our proposed modified version of CoNMix produced the highest classification accuracy under domain shift (5.31% error rate under 10 dB exercise bike background noise and 12.75% error rate under 3 dB running tap background noise for AM) compared to TTT and TENT. The literature search provided no evidence of similar works, thereby motivating the work reported here as the first study to leverage TTA techniques for audio classification under domain shift.
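
Of the adopted baselines, TENT is compact enough to sketch in full: minimize prediction entropy on unlabelled test batches while updating only the normalization layers' affine parameters. The model below is a stand-in classifier, not the paper's audio network.

```python
import torch
import torch.nn as nn

def tent_step(model, x, optimizer):
    """One TENT adaptation step: entropy of the predictions is the loss."""
    probs = torch.softmax(model(x), dim=-1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return entropy.item()

model = nn.Sequential(nn.Linear(64, 64), nn.BatchNorm1d(64),
                      nn.ReLU(), nn.Linear(64, 10))
# Per TENT, freeze everything except the BatchNorm affine parameters.
bn_params = [p for m in model.modules()
             if isinstance(m, nn.BatchNorm1d) for p in m.parameters()]
for p in model.parameters():
    p.requires_grad_(False)
for p in bn_params:
    p.requires_grad_(True)
opt = torch.optim.SGD(bn_params, lr=1e-3)
print(tent_step(model, torch.randn(32, 64), opt))  # entropy on a test batch
```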

[505] Semi-Supervised Federated Learning via Dual Contrastive Learning and Soft Labeling for Intelligent Fault Diagnosis

Yajiao Dai, Jun Li, Zhen Mei, Yiyang Ni, Shi Jin, Zengxiang Li, Sheng Guo, Wei Xiang

Main category: cs.LG

TL;DR: The paper proposes SSFL-DCSL, a semi-supervised federated learning framework with dual contrastive loss and soft labeling, to address data scarcity, label scarcity, and privacy in distributed clients. It improves accuracy by 1.15% to 7.85% over state-of-the-art methods.

DetailsMotivation: Traditional supervised deep learning methods require large labeled datasets, which are costly and hard to acquire. Data distribution differences among clients also hinder performance.

Method: SSFL-DCSL integrates dual contrastive loss (local and global) and soft labeling. It uses a sample weighting function for pseudo-label bias, aggregates local prototypes on the server, and updates them with momentum.

Result: Experiments on three datasets show SSFL-DCSL outperforms state-of-the-art methods, especially with only 10% labeled data, improving accuracy by 1.15% to 7.85%.

Conclusion: SSFL-DCSL effectively addresses data and label scarcity while preserving privacy, enhancing model performance in distributed settings.

Abstract: Intelligent fault diagnosis (IFD) plays a crucial role in ensuring the safe operation of industrial machinery and improving production efficiency. However, traditional supervised deep learning methods require a large amount of training data and labels, which are often located in different clients. Additionally, the cost of data labeling is high, making labels difficult to acquire. Meanwhile, differences in data distribution among clients may also hinder the model’s performance. To tackle these challenges, this paper proposes a semi-supervised federated learning framework, SSFL-DCSL, which integrates dual contrastive loss and soft labeling to address data and label scarcity for distributed clients with few labeled samples while safeguarding user privacy. It enables representation learning using unlabeled data on the client side and facilitates joint learning among clients through prototypes, thereby achieving mutual knowledge sharing and preventing local model divergence. Specifically, first, a sample weighting function based on the Laplace distribution is designed to alleviate bias caused by low confidence in pseudo labels during the semi-supervised training process. Second, a dual contrastive loss is introduced to mitigate model divergence caused by different data distributions, comprising local contrastive loss and global contrastive loss. Third, local prototypes are aggregated on the server with weighted averaging and updated with momentum to share knowledge among clients. To evaluate the proposed SSFL-DCSL framework, experiments are conducted on two publicly available datasets and a dataset collected from factory motors. In the most challenging task, where only 10% of the data are labeled, the proposed SSFL-DCSL can improve accuracy by 1.15% to 7.85% over state-of-the-art methods.
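
The Laplace-based sample weighting can be sketched under an assumed form: weight each pseudo-labelled sample by a Laplace kernel on the gap between its confidence and full confidence. Both the scale parameter and the exact functional form are assumptions, not the paper's definition.

```python
import numpy as np

def laplace_weight(confidence, scale=0.2):
    """Assumed form: w = exp(-|1 - confidence| / scale), so low-confidence
    pseudo labels contribute little to the semi-supervised loss."""
    return np.exp(-np.abs(1.0 - confidence) / scale)

conf = np.array([0.99, 0.8, 0.5])
print(laplace_weight(conf))  # weights fall off sharply with confidence
```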

[506] From Bias to Behavior: Learning Bull-Bear Market Dynamics with Contrastive Modeling

Xiaotong Luo, Shengda Zhuo, Min Chen, Lichun Li, Ruizhao Lu, Wenqi Fan, Shuqiang Huang, Yin Tang

Main category: cs.LG

TL;DR: The paper introduces the B4 model to analyze bull and bear regimes in financial markets, combining price data and external signals to predict trends and interpret biases and behaviors.

DetailsMotivation: Financial markets are complex and influenced by biases from heterogeneous data and investor insights, making modeling challenging. This work explores bull and bear regimes to better understand market dynamics.

Method: Proposes the B4 model, embedding price sequences and contextual signals into a shared latent space. Uses inertial pairing and dual competition to capture bias-driven asymmetry and behavioral divergence.

Result: B4 outperforms in market trend prediction and offers interpretable insights into biases, behaviors, and market dynamics.

Conclusion: The B4 model effectively captures market heterogeneity and investor-driven dynamics, providing a robust framework for trend prediction and bias analysis.

Abstract: Financial markets exhibit highly dynamic and complex behaviors shaped by both historical price trajectories and exogenous narratives, such as news, policy interpretations, and social media sentiment. The heterogeneity in these data and the diverse insight of investors introduce biases that complicate the modeling of market dynamics. Unlike prior work, this paper explores the potential of bull and bear regimes in investor-driven market dynamics. Through empirical analysis on real-world financial datasets, we uncover a dynamic relationship between bias variation and behavioral adaptation, which enhances trend prediction under evolving market conditions. To model this mechanism, we propose the Bias to Behavior from Bull-Bear Dynamics model (B4), a unified framework that jointly embeds temporal price sequences and external contextual signals into a shared latent space where opposing bull and bear forces naturally emerge, forming the foundation for bias representation. Within this space, an inertial pairing module pairs temporally adjacent samples to preserve momentum, while the dual competition mechanism contrasts bullish and bearish embeddings to capture behavioral divergence. Together, these components allow B4 to model bias-driven asymmetry, behavioral inertia, and market heterogeneity. Experimental results on real-world financial datasets demonstrate that our model not only achieves superior performance in predicting market trends but also provides interpretable insights into the interplay of biases, investor behaviors, and market dynamics.

[507] LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models

Dachuan Shi, Yonggan Fu, Xiangchi Yuan, Zhongzhi Yu, Haoran You, Sixu Li, Xin Dong, Jan Kautz, Pavlo Molchanov, Yingyan Lin

Main category: cs.LG

TL;DR: LaCache is a training-free KV cache optimization method for LLMs, enhancing long-range capabilities and continuous generation without OOM by using a ladder-shaped KV cache pattern and iterative compaction.

DetailsMotivation: Addressing the efficiency bottleneck in LLMs due to increasing KV pairs with longer sequences, while maintaining robust long-range capabilities and avoiding OOM errors.

Method: LaCache integrates a ladder-shaped KV cache pattern for extended dependency capture and an iterative compaction mechanism for dynamic cache compression.

Result: Experiments show LaCache effectively boosts LLMs’ long-range capabilities across various tasks and benchmarks.

Conclusion: LaCache provides an efficient solution for long-range modeling in LLMs, validated by consistent experimental results.

Abstract: Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As sequence lengths increase, the number of Key-Value (KV) pairs in LLMs escalates, creating a significant efficiency bottleneck. In this paper, we propose a new KV cache optimization paradigm called LaCache, a training-free method for efficient and accurate generative inference of LLMs. LaCache enables LLMs to simultaneously address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out-of-memory (OOM). Specifically, LaCache integrates two key innovations: (1) a ladder-shaped KV cache pattern that stores KV pairs not only sequentially (left-to-right within each layer) but also across layers (from shallow to deep), providing an extended span for capturing long-range dependencies under a fixed storage budget, thereby boosting long-range capabilities; and (2) an iterative compaction mechanism that progressively compresses older caches, freeing up space for new tokens within a fixed cache size. This token distance-based dynamic compression enables more effective continuous generation under constrained cache budgets. Experiments across various tasks, benchmarks, and LLM models consistently validate LaCache’s effectiveness in enhancing LLMs’ long-range capabilities. Our code is available at https://github.com/GATECH-EIC/LaCache.
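
The iterative-compaction half of the design can be sketched with a toy cache of token positions; the per-layer budget and the stride-2 thinning rule are assumptions, and the ladder-shaped cross-layer layout is omitted.

```python
def compact(cache, budget=8):
    """When the cache exceeds its budget, thin the older half so new tokens
    always fit; older context survives at progressively coarser resolution."""
    if len(cache) <= budget:
        return cache
    old, recent = cache[: len(cache) // 2], cache[len(cache) // 2:]
    return old[::2] + recent              # drop every other old entry

cache = []
for pos in range(20):
    cache.append(pos)                     # stand-in for a (key, value) pair
    cache = compact(cache)
print(cache)  # e.g. [0, 12, 14, 15, 16, 17, 18, 19]: sparse old, dense recent
```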

[508] Geometry-Aware Active Learning of Pattern Rankings via Choquet-Based Aggregation

Tudor Matei Opran, Samir Loudni

Main category: cs.LG

TL;DR: Proposes an interactive learning framework for pattern mining, combining nonlinear utility aggregation and geometry-aware query selection to address pattern explosion.

DetailsMotivation: To solve the pattern explosion problem in pattern mining by improving efficiency and accuracy with fewer user interactions.

Method: Uses a Choquet integral for modeling user preferences and a branch-and-bound strategy with tight distance bounds for query selection.

Result: Outperforms existing methods like ChoquetRank, achieving better ranking accuracy with fewer interactions.

Conclusion: The framework effectively addresses pattern explosion and improves user interaction efficiency.

Abstract: We address the pattern explosion problem in pattern mining by proposing an interactive learning framework that combines nonlinear utility aggregation with geometry-aware query selection. Our method models user preferences through a Choquet integral over multiple interestingness measures and exploits the geometric structure of the version space to guide the selection of informative comparisons. A branch-and-bound strategy with tight distance bounds enables efficient identification of queries near the decision boundary. Experiments on UCI datasets show that our approach outperforms existing methods such as ChoquetRank, achieving better ranking accuracy with fewer user interactions.
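
The discrete Choquet integral at the heart of the preference model is standard: sort the criteria by utility and weight each marginal increment by the capacity of the coalition of criteria at or above that level. The capacity values below are assumptions.

```python
import numpy as np

def choquet_integral(utilities, capacity):
    """capacity maps frozensets of criterion indices to [0, 1]."""
    order = np.argsort(utilities)                    # ascending utilities
    total, prev = 0.0, 0.0
    for rank, i in enumerate(order):
        coalition = frozenset(order[rank:].tolist()) # criteria at/above level
        total += (utilities[i] - prev) * capacity[coalition]
        prev = utilities[i]
    return total

# Two interestingness measures with a sub-additive interaction:
cap = {frozenset(): 0.0, frozenset({0}): 0.7,
       frozenset({1}): 0.6, frozenset({0, 1}): 1.0}
print(choquet_integral(np.array([0.4, 0.9]), cap))  # 0.4*1.0 + 0.5*0.6 = 0.7
```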

[509] Competitive Algorithms for Cooperative Multi-Agent Ski-Rental Problems

Xuchuang Wang, Bo Sun, Hedyeh Beyhaghi, John C. S. Lui, Mohammad Hajiesmaili, Adam Wierman

Main category: cs.LG

TL;DR: The paper generalizes the ski-rental problem to a multi-agent setting with individual and shared costs, introducing dynamic states and three competitive ratios. It designs optimal deterministic and randomized policies, showing symmetric policies outperform asymmetric ones.

DetailsMotivation: To extend the classical ski-rental problem to group settings where agents face individual and shared costs, addressing dynamic states and diverse objectives.

Method: Defines three competitive ratios (overall, state-dependent, individual rational) and designs deterministic (state-aware thresholds) and randomized (tailored distributions) policies.

Result: Symmetric policies outperform asymmetric ones, with competitive ratio bounds provided, extending classical insights to multi-agent scenarios.

Conclusion: The work advances group decision-making under uncertainty, offering theoretical and practical insights for multi-agent ski-rental problems.

Abstract: This paper introduces a novel multi-agent ski-rental problem that generalizes the classical ski-rental dilemma to a group setting where agents incur individual and shared costs. In our model, each agent can either rent at a fixed daily cost, or purchase a pass at an individual cost, with an additional third option of a discounted group pass available to all. We consider scenarios in which agents’ active days differ, leading to dynamic states as agents drop out of the decision process. To address this problem from different perspectives, we define three distinct competitive ratios: overall, state-dependent, and individual rational. For each objective, we design and analyze optimal deterministic and randomized policies. Our deterministic policies employ state-aware threshold functions that adapt to the dynamic states, while our randomized policies sample and resample thresholds from tailored state-aware distributions. The analysis reveals that symmetric policies, in which all agents use the same threshold, outperform asymmetric ones. Our results provide competitive ratio upper and lower bounds and extend classical ski-rental insights to multi-agent settings, highlighting both theoretical and practical implications for group decision-making under uncertainty.
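
The classical single-agent building block the paper generalizes is the deterministic break-even policy, sketched below; the multi-agent policies replace this fixed threshold with state-aware thresholds that react as agents drop out or the discounted group pass becomes attractive.

```python
def ski_rental_cost(active_days, rent=1.0, buy=10.0):
    """Break-even policy: rent until cumulative rent would reach the
    purchase price, then buy; worst-case cost is < 2x the offline optimum."""
    threshold = buy / rent
    if active_days < threshold:
        return active_days * rent            # rented every day
    return (threshold - 1) * rent + buy      # rented, then bought

for days in (3, 9, 30):
    print(days, ski_rental_cost(days))       # 3.0, 9.0, 19.0
```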

[510] Artificial Intelligence for Green Hydrogen Yield Prediction and Site Suitability using SHAP-Based Composite Index: Focus on Oman

Obumneme Zimuzor Nwafor, Mohammed Abdul Majeed Al Hooti

Main category: cs.LG

TL;DR: The paper introduces an AI framework using SHAP values to identify optimal green hydrogen production sites, achieving 98% accuracy and highlighting key influencing factors like water proximity, elevation, and seasonal variation.

DetailsMotivation: The need for sustainable alternatives to fossil fuels and the challenge of identifying optimal hydrogen production sites due to complex factors and lack of direct data.

Method: A multi-stage AI pipeline combining unsupervised clustering, supervised machine learning, and SHAP analysis on integrated meteorological, topographic, and temporal data.

Result: The model achieved 98% accuracy, identifying water proximity, elevation, and seasonal variation as key factors for site suitability in Oman.

Conclusion: The framework provides an objective, scalable tool for green hydrogen planning in data-scarce regions, replacing subjective expert weightings.

Abstract: As nations seek sustainable alternatives to fossil fuels, green hydrogen has emerged as a promising strategic pathway toward decarbonisation, particularly in solar-rich arid regions. However, identifying optimal locations for hydrogen production requires the integration of complex environmental, atmospheric, and infrastructural factors, often compounded by limited availability of direct hydrogen yield data. This study presents a novel Artificial Intelligence (AI) framework for computing green hydrogen yield and a site suitability index using mean absolute SHAP (SHapley Additive exPlanations) values. This framework consists of a multi-stage pipeline of unsupervised multi-variable clustering, a supervised machine learning classifier, and the SHAP algorithm. The pipeline trains on an integrated meteorological, topographic and temporal dataset, and the results revealed distinct spatial patterns of suitability and the relative influence of the variables. With a model predictive accuracy of 98%, the results also showed that water proximity, elevation and seasonal variation are the most influential factors determining green hydrogen site suitability in Oman, with mean absolute SHAP values of 2.470891, 2.376296, and 1.273216, respectively. Given the limited availability or absence of ground-truth yield data in many countries with green hydrogen prospects and ambitions, this study offers an objective and reproducible alternative to subjective expert weightings, allowing the data to speak for itself and potentially discovering novel latent groupings without pre-imposed assumptions. This study offers industry stakeholders and policymakers a replicable and scalable tool for green hydrogen infrastructure planning and other decision-making in data-scarce regions.

[511] Domain Generalization via Pareto Optimal Gradient Matching

Khoi Do, Duong Nguyen, Nam-Khanh Le, Quoc-Viet Pham, Binh-Son Hua, Won-Joo Hwang

Main category: cs.LG

TL;DR: Proposes POGM for gradient-based domain generalization, addressing gradient fluctuations and computational inefficiency in existing methods.

DetailsMotivation: Existing methods struggle with gradient fluctuations and high computation costs in domain generalization.

Method: POGM uses gradient trajectories as data, meta-learns to maximize GIP while limiting deviation from empirical risk minimization.

Result: POGM achieves competitive performance on DomainBed datasets with computational efficiency.

Conclusion: POGM effectively balances gradient matching and computational efficiency in domain generalization.

Abstract: In this study, we address the gradient-based domain generalization problem, where predictors aim for consistent gradient directions across different domains. Existing methods have two main challenges. First, minimization of gradient empirical distance or gradient inner products (GIP) leads to gradient fluctuations among domains, thereby hindering straightforward learning. Second, the direct application of gradient learning to the joint loss function can incur high computation overheads due to second-order derivative approximation. To tackle these challenges, we propose a new Pareto Optimality Gradient Matching (POGM) method. In contrast to existing methods that add gradient matching as regularization, we leverage gradient trajectories as collected data and apply independent training at the meta-learner. In the meta-update, we maximize GIP while limiting the learned gradient from deviating too far from the empirical risk minimization gradient trajectory. By doing so, the aggregate gradient can incorporate knowledge from all domains without suffering gradient fluctuation towards any particular domain. Experimental evaluations on datasets from DomainBed demonstrate competitive results yielded by POGM against other baselines while achieving computational efficiency.

[512] Knowing When to Quit: Probabilistic Early Exits for Speech Separation

Kenny Falkær Olsen, Mads Østergaard, Karl Ulbæk, Søren Føns Nielsen, Rasmus Malik Høegh Lindrup, Bjørn Sand Jensen, Morten Mørup

Main category: cs.LG

TL;DR: A novel neural network architecture for speech separation enables dynamic compute-scaling with early-exit and uncertainty-aware probabilistic conditions, achieving state-of-the-art performance.

DetailsMotivation: Address the limitation of fixed compute and parameter budgets in speech separation models, enabling use in embedded and heterogeneous devices like mobile phones.

Method: Design an early-exit neural network architecture with a probabilistic framework to model clean speech and error variance, deriving exit conditions based on signal-to-noise ratios.

Result: The model matches state-of-the-art performance across varying compute budgets and provides interpretable exit conditions.

Conclusion: The framework allows dynamic compute-scaling while maintaining high performance, making it suitable for resource-constrained devices.

Abstract: In recent years, deep learning-based single-channel speech separation has improved considerably, in large part driven by increasingly compute- and parameter-efficient neural network architectures. Most such architectures are, however, designed with a fixed compute and parameter budget, and consequently cannot scale to varying compute demands or resources, which limits their use in embedded and heterogeneous devices such as mobile phones and hearables. To enable such use-cases we design a neural network architecture for speech separation capable of early-exit, and we propose an uncertainty-aware probabilistic framework to jointly model the clean speech signal and error variance which we use to derive probabilistic early-exit conditions in terms of desired signal-to-noise ratios. We evaluate our methods on both speech separation and enhancement tasks, and we show that a single early-exit model can be competitive with state-of-the-art models trained at many compute and parameter budgets. Our framework enables fine-grained dynamic compute-scaling of speech separation networks while achieving state-of-the-art performance and interpretable exit conditions.

[513] A million-scale dataset and generalizable foundation model for nanomaterial-protein interactions

Hengjie Yu, Kenneth A. Dawson, Haiyun Yang, Shuya Liu, Yan Yan, Yaochu Jin

Main category: cs.LG

TL;DR: NanoPro-3M, the largest nanomaterial-protein interaction dataset, and NanoProFormer, a multimodal foundational model, address limited datasets and model generalizability, improving predictions for nanomaterial-protein interactions.

DetailsMotivation: Understanding nanomaterial-protein interactions is critical for medicine and environmental science, but progress is hindered by small datasets and poor model generalizability.

Method: Developed NanoPro-3M (3.2M samples, 37K proteins) and NanoProFormer, a multimodal model for predicting affinities, handling missing features, and generalizing to unseen data.

Result: Multimodal modeling outperforms single-modality approaches, identifies corona formation determinants, and excels in zero-shot inference and fine-tuning for downstream tasks.

Conclusion: This work provides a high-performance, generalized foundation for predicting nanomaterial-protein interactions, reducing experimental reliance and accelerating applications.

Abstract: Unlocking the potential of nanomaterials in medicine and environmental science hinges on understanding their interactions with proteins, a complex decision space where AI is poised to make a transformative impact. However, progress has been hindered by limited datasets and the restricted generalizability of existing models. Here, we propose NanoPro-3M, the largest nanomaterial-protein interaction dataset to date, comprising over 3.2 million samples and 37,000 unique proteins. Leveraging this, we present NanoProFormer, a foundational model that predicts nanomaterial-protein affinities through multimodal representation learning, demonstrating strong generalization, handling missing features, and unseen nanomaterials or proteins. We show that multimodal modeling significantly outperforms single-modality approaches and identifies key determinants of corona formation. Furthermore, we demonstrate its applicability to a range of downstream tasks through zero-shot inference and fine-tuning. Together, this work establishes a solid foundation for high-performance and generalized prediction of nanomaterial-protein interaction endpoints, reducing experimental reliance and accelerating various in vitro applications.

[514] Linearized Diffusion Map

Julio Candanedo

Main category: cs.LG

TL;DR: LDM is a linear dimensionality reduction method that approximates diffusion-map kernels, combining geometric intuition with computational simplicity. It outperforms PCA on manifold-structured data and integrates with NMF for interpretability.

DetailsMotivation: To bridge the gap between nonlinear diffusion-based methods and linear embeddings like PCA, offering geometric insights while maintaining efficiency and interpretability.

Method: LDM constructs a linear approximation of the diffusion-map kernel, tested on synthetic (Swiss roll, hyperspheres) and real-world datasets (MNIST, COIL-20).

Result: LDM excels on manifold-structured data, especially in high dimensions, while PCA is better for variance/noise-dominated cases. LDM’s kernel supports NMF for interpretable latent structures.

Conclusion: LDM is a promising linear dimensionality reduction technique with theoretical and practical potential.

Abstract: We introduce the Linearized Diffusion Map (LDM), a novel linear dimensionality reduction method constructed via a linear approximation of the diffusion-map kernel. LDM integrates the geometric intuition of diffusion-based nonlinear methods with the computational simplicity, efficiency, and interpretability inherent in linear embeddings such as PCA and classical MDS. Through comprehensive experiments on synthetic datasets (Swiss roll and hyperspheres) and real-world benchmarks (MNIST and COIL-20), we illustrate that LDM captures distinct geometric features of datasets compared to PCA, offering complementary advantages. Specifically, LDM embeddings outperform PCA in datasets exhibiting explicit manifold structures, particularly in high-dimensional regimes, whereas PCA remains preferable in scenarios dominated by variance or noise. Furthermore, the complete positivity of LDM’s kernel matrix allows direct applicability of Non-negative Matrix Factorization (NMF), suggesting opportunities for interpretable latent-structure discovery. Our analysis positions LDM as a valuable new linear dimensionality reduction technique with promising theoretical and practical extensions.

[515] A Simple “Try Again” Can Elicit Multi-Turn LLM Reasoning

Licheng Liu, Zihan Wang, Linjie Li, Chenwei Xu, Yiping Lu, Han Liu, Avirup Sil, Manling Li

Main category: cs.LG

TL;DR: Training Large Reasoning Models (LRMs) with multi-turn RL using unary feedback improves both single-turn performance and multi-turn reasoning accuracy by up to 14%.

DetailsMotivation: Existing RL methods for LRMs focus on single-turn problem solving, leading to repetitive responses and poor multi-turn reasoning. The goal is to enhance LRMs' ability to reflect and revise answers in multi-turn contexts.

Method: Introduces Unary Feedback as Observation (UFO), a minimal feedback mechanism for RL training, and designs reward structures to encourage diverse reasoning and careful answers.

Result: UFO improves multi-turn reasoning accuracy by up to 14% while maintaining single-turn performance.

Conclusion: Multi-turn RL with unary feedback (UFO) effectively enhances LRMs’ ability to solve problems iteratively and react to feedback.

Abstract: Multi-turn problem solving is critical yet challenging for Large Reasoning Models (LRMs) to reflect on their reasoning and revise from feedback. Existing Reinforcement Learning (RL) methods train large reasoning models on a single-turn paradigm with verifiable rewards. However, we observe that models trained with existing RL paradigms often lose their ability to solve problems across multiple turns and struggle to revise answers based on contextual feedback, leading to repetitive responses. We ask: can LRMs learn to reflect on their answers in a multi-turn context? In this work, we find that training models with multi-turn RL using only unary feedback (e.g., “Let’s try again”) after wrong answers can improve both single-turn performance and multi-turn reasoning. We introduce Unary Feedback as Observation (UFO) for reinforcement learning, which uses minimal yet common unary user feedback during iterative problem solving. It can be easily applied to existing single-turn RL training setups. Experimental results show that RL training with UFO keeps single-turn performance and improves multi-turn reasoning accuracy by up to 14%, enabling language models to better react to feedback in multi-turn problem solving. To further minimize the number of turns needed for a correct answer while encouraging diverse reasoning when mistakes occur, we design reward structures that guide models to produce careful and deliberate answers in each turn. Code: https://github.com/lichengliu03/unary-feedback
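
The interaction pattern UFO trains on can be sketched as a rollout loop; `generate` and `is_correct` are hypothetical stand-ins for the policy model and the verifiable-reward checker:

```python
# Minimal sketch of the UFO rollout: after a wrong answer, the only feedback
# appended is a unary retry prompt ("Let's try again."), with no hints.
def multi_turn_rollout(problem, generate, is_correct, max_turns=4):
    transcript = [{"role": "user", "content": problem}]
    for turn in range(max_turns):
        answer = generate(transcript)                  # sample from the policy
        transcript.append({"role": "assistant", "content": answer})
        if is_correct(answer):                         # verifiable reward = 1
            return transcript, 1.0, turn + 1
        # unary feedback: a retry signal only, no corrective information
        transcript.append({"role": "user", "content": "Let's try again."})
    return transcript, 0.0, max_turns                  # reward = 0 if unsolved
```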

[516] FedStrategist: A Meta-Learning Framework for Adaptive and Robust Aggregation in Federated Learning

Md Rafid Haque, Abu Raihan Mostofa Kamal, Md. Azam Hossain

Main category: cs.LG

TL;DR: FedStrategist is a meta-learning framework for dynamic defense selection in Federated Learning, outperforming static methods against adaptive attacks and diverse data environments.

DetailsMotivation: Address vulnerabilities in FL to model poisoning attacks by moving beyond static defenses, which fail against adaptive adversaries or heterogeneous data.

Method: Introduces a lightweight contextual bandit agent to dynamically choose the best aggregation rule from a set of defenses based on real-time metrics.

Result: No single static rule is universally optimal; FedStrategist learns superior policies across scenarios, including adversarial ones, while balancing performance and security.

Conclusion: FedStrategist offers a practical, analyzable solution for resilient decentralized AI, controllable via a risk tolerance parameter.

Abstract: Federated Learning (FL) offers a paradigm for privacy-preserving collaborative AI, but its decentralized nature creates significant vulnerabilities to model poisoning attacks. While numerous static defenses exist, their effectiveness is highly context-dependent, often failing against adaptive adversaries or in heterogeneous data environments. This paper introduces FedStrategist, a novel meta-learning framework that reframes robust aggregation as a real-time, cost-aware control problem. We design a lightweight contextual bandit agent that dynamically selects the optimal aggregation rule from an arsenal of defenses based on real-time diagnostic metrics. Through comprehensive experiments, we demonstrate that no single static rule is universally optimal. We show that our adaptive agent successfully learns superior policies across diverse scenarios, including a “Krum-favorable” environment and against a sophisticated “stealth” adversary designed to neutralize specific diagnostic signals. Critically, we analyze the paradoxical scenario where a non-robust baseline achieves high but compromised accuracy, and demonstrate that our agent learns a conservative policy to prioritize model integrity. Furthermore, we prove the agent’s policy is controllable via a single “risk tolerance” parameter, allowing practitioners to explicitly manage the trade-off between performance and security. Our work provides a new, practical, and analyzable approach to creating resilient and intelligent decentralized AI systems.

[517] Rethinking Individual Fairness in Deepfake Detection

Aryana Hou, Li Lin, Justin Li, Shu Hu

Main category: cs.LG

TL;DR: The paper addresses fairness gaps in deepfake detection, focusing on individual fairness, and proposes a framework to enhance it without compromising detection performance.

DetailsMotivation: The misuse of generative AI for deepfakes poses risks, and existing detection methods lack fairness, especially at the individual level.

Method: The authors propose a generalizable framework to improve individual fairness in deepfake detection, integrating it into existing detectors.

Result: Experiments show the framework significantly enhances individual fairness while maintaining robust detection, outperforming state-of-the-art methods.

Conclusion: The work fills a critical gap in deepfake detection fairness, offering a practical solution for improving individual fairness.

Abstract: Generative AI models have substantially improved the realism of synthetic media, yet their misuse through sophisticated DeepFakes poses significant risks. Despite recent advances in deepfake detection, fairness remains inadequately addressed, enabling deepfake makers to exploit biases against specific populations. While previous studies have emphasized group-level fairness, individual fairness (i.e., ensuring similar predictions for similar individuals) remains largely unexplored. In this work, we identify for the first time that the original principle of individual fairness fundamentally fails in the context of deepfake detection, revealing a critical gap previously unexplored in the literature. To mitigate it, we propose the first generalizable framework that can be integrated into existing deepfake detectors to enhance individual fairness and generalization. Extensive experiments conducted on leading deepfake datasets demonstrate that our approach significantly improves individual fairness while maintaining robust detection performance, outperforming state-of-the-art methods. The code is available at https://github.com/Purdue-M2/Individual-Fairness-Deepfake-Detection.

[518] Development and Deployment of Hybrid ML Models for Critical Heat Flux Prediction in Annulus Geometries

Aidan Furlong, Xingang Zhao, Robert Salko, Xu Wu

Main category: cs.LG

TL;DR: The paper explores ML models for predicting CHF in annular geometries, outperforming traditional empirical correlations with significantly lower errors.

DetailsMotivation: Accurate CHF prediction is critical for reactor safety, but existing methods lack interpretability and resilience to data scarcity, especially for annular geometries.

Method: Developed four ML models using CTF subchannel code, trained on 577 experimental annulus data points, and compared with three empirical correlations (Biasi, Bowring, Katto).

Result: ML models achieved mean relative errors below 3.5%, vastly outperforming empirical correlations (26%+ errors).

Conclusion: Hybrid ML models are superior for CHF prediction in annular geometries, offering high accuracy and reliability.

Abstract: Accurate prediction of critical heat flux (CHF) is an essential component of safety analysis in pressurized and boiling water reactors. To support reliable prediction of this quantity, several empirical correlations and lookup tables have been constructed from physical experiments over the past several decades. With the onset of accessible machine learning (ML) frameworks, multiple initiatives have been established with the goal of predicting CHF more accurately than these traditional methods. While purely data-driven surrogate modeling has been extensively investigated, these approaches lack interpretability and resilience to data scarcity, and have been developed mostly using data from tube experiments. As a result, bias-correction hybrid approaches have become increasingly popular, which correct initial “low-fidelity” estimates provided by deterministic base models by using ML-predicted residuals. This body of work has mostly considered round tube geometries; annular geometry-specific ML models have not yet been deployed in thermal hydraulic codes. This study developed, deployed, and validated four ML models to predict CHF in annular geometries using the CTF subchannel code. Three empirical correlation models, Biasi, Bowring, and Katto, were used as base models for comparison. The ML models were trained and tested using 577 experimental annulus data points from four datasets: Becker, Beus, Janssen, and Mortimore. Baseline CHF predictions were obtained from the empirical correlations, with mean relative errors above 26%. The ML-driven models achieved mean relative errors below 3.5%, with no more than one point exceeding the 10% error envelope. In all cases, the hybrid ML models significantly outperformed their empirical counterparts.
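
The bias-correction hybrid pattern described above (base correlation plus an ML-predicted residual) can be sketched as follows; `biasi_chf` and the toy data are hypothetical stand-ins, not the paper's CTF-coupled models:

```python
# Sketch of a bias-correction hybrid: an ML model learns the residual between
# measured CHF and a "low-fidelity" empirical base correlation.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def hybrid_chf_fit(X, chf_measured, base_model):
    base = base_model(X)                       # empirical-correlation estimate
    residual = chf_measured - base             # what the correlation misses
    corrector = GradientBoostingRegressor().fit(X, residual)
    return lambda Xq: base_model(Xq) + corrector.predict(Xq)

# Toy demonstration with a fake correlation and fake data.
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 3))                 # e.g., pressure, mass flux, quality
biasi_chf = lambda X: 2.0 + X[:, 0]            # hypothetical base correlation
chf_true = biasi_chf(X) + 0.5 * np.sin(6 * X[:, 1])
predict = hybrid_chf_fit(X, chf_true, biasi_chf)
print(np.abs(predict(X) - chf_true).mean())    # residual model closes the gap
```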

[519] Influence Functions for Preference Dataset Pruning

Daniel Fein, Gabriela Aranguiz-Dias

Main category: cs.LG

TL;DR: The paper explores using influence functions to filter noisy training data in fine-tuning language models, showing a 1.5% accuracy improvement after removing 10% of harmful examples. Gradient similarity outperforms influence functions for identifying helpful examples.

DetailsMotivation: Human preference datasets for fine-tuning language models are often noisy, and small post-training datasets make it feasible to use influence functions to improve performance by filtering harmful examples.

Method: The authors adapt the TL;DR dataset for reward model training and use conjugate-gradient approximated influence functions to filter datasets, comparing results with gradient similarity.

Result: Influence function filtering improves retraining accuracy by 1.5% after removing 10% of harmful examples. Gradient similarity is more effective than influence functions for detecting helpful examples.

Conclusion: Local curvature (influence functions) is crucial for identifying harmful examples, while gradient similarity is better for helpful ones, suggesting complementary approaches for dataset filtering.

Abstract: Language models are commonly fine-tuned via reinforcement learning to alter their behavior or elicit new capabilities. Datasets used for these purposes, and particularly human preference datasets, are often noisy. The relatively small size of post-training datasets, combined with parameter-efficient fine-tuning methods, enables the use of influence-function approximations to detect and prune training examples that are harmful to performance on a validation set. In this work, we adapt the TL;DR dataset for reward model training to demonstrate how conjugate-gradient approximated influence functions can be used to filter datasets. In our experiments, influence function filtering yields a small retraining accuracy uplift of 1.5% after removing 10% of training examples. We also show that gradient similarity outperforms influence functions for detecting helpful training examples. This suggests that local curvature is important for detecting harmful training examples, but less so for identifying helpful examples.
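
A compact sketch of conjugate-gradient-approximated influence scores as commonly formulated: approximate the inverse-Hessian-vector product by CG, then rank training examples by their estimated effect on validation loss. The dense Hessian and random gradients below are toy stand-ins; at scale one would use Hessian-vector products instead of an explicit matrix:

```python
# Influence score of example z: I(z) = -grad_val^T H^{-1} grad_z, with the
# inverse-Hessian-vector product obtained by conjugate gradient.
import numpy as np

def conjugate_gradient(hvp, b, iters=50, tol=1e-8):
    x = np.zeros_like(b); r = b.copy(); p = r.copy()
    for _ in range(iters):
        Ap = hvp(p)
        alpha = (r @ r) / (p @ Ap)
        x += alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 20)); H = A @ A.T + np.eye(20)   # SPD Hessian stand-in
grad_val = rng.normal(size=20)                            # validation-loss gradient
grad_train = rng.normal(size=(100, 20))                   # per-example gradients
ihvp = conjugate_gradient(lambda v: H @ v, grad_val)      # H^{-1} grad_val
influences = -grad_train @ ihvp
harmful = np.argsort(influences)[-10:]   # largest scores: upweighting hurts val loss
print(harmful)
```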

[520] Solo Connection: A Parameter Efficient Fine-Tuning Technique for Transformers

Harsh Nilesh Pathak, Randy Paffenroth

Main category: cs.LG

TL;DR: Solo Connection is a PEFT method that adapts decoder-block representations, outperforming LoRA with fewer parameters and inspired by homotopy theory.

DetailsMotivation: To improve fine-tuning efficiency and stability in large language models by revisiting skip connections and leveraging homotopy theory.

Method: Introduces Solo Connection, adapting decoder-block representations with trainable linear transformations for smooth adaptation.

Result: Outperforms LoRA, reduces trainable parameters by 59% vs. LoRA and 99% vs. full fine-tuning.

Conclusion: Solo Connection offers a more efficient and stable fine-tuning approach for large language models.

Abstract: Parameter-efficient fine-tuning (PEFT) is a versatile and extensible approach for adapting a Large Language Model (LLM) for newer tasks. One of the most prominent PEFT approaches, Low Rank Adaptation (LoRA), primarily focuses on adjusting the attention weight matrices within individual decoder blocks of a Generative Pre-trained Transformer (GPT2). In contrast, we introduce Solo Connection, a novel method that adapts the representation at the decoder-block level rather than modifying individual weight matrices. Not only does Solo Connection outperform LoRA on E2E natural language generation benchmarks, but it also reduces the number of trainable parameters by 59% relative to LoRA and by more than 99% compared to full fine-tuning of GPT2, an early version of Large Language Models (LLMs). Solo Connection is also motivated by homotopy theory: we introduce a trainable linear transformation that gradually interpolates between a zero vector and the task-specific representation, enabling smooth and stable adaptation over time. While skip connections in the original 12-layer GPT2 are typically confined to individual decoder blocks, subsequent GPT2 variants scale up to 48 layers, and even larger language models can include 128 or more decoder blocks. These expanded architectures underscore the need to revisit how skip connections are employed during fine-tuning. This paper focuses on long skip connections that link outputs of different decoder blocks, potentially enhancing the model’s ability to adapt to new tasks while leveraging pre-trained knowledge.
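
One way to read the trainable interpolation described above is as a gated long skip connection whose contribution starts at the zero vector and grows during fine-tuning; a minimal sketch, with shapes and the gating choice assumed rather than taken from the paper:

```python
# Homotopy-style long skip connection (sketch): alpha starts at 0, so the
# module initially contributes nothing and smoothly learns its contribution.
import torch
import torch.nn as nn

class SoloConnectionSketch(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.alpha = nn.Parameter(torch.zeros(1))   # zero map at initialization

    def forward(self, h_early, h_late):
        # interpolates between a zero contribution and the learned representation,
        # linking the outputs of two distant decoder blocks
        return h_late + self.alpha * self.proj(h_early)

h = torch.randn(2, 16, 768)            # hypothetical decoder activations
solo = SoloConnectionSketch(768)
print(solo(h, h).shape)                # torch.Size([2, 16, 768])
```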

[521] Incremental Causal Graph Learning for Online Cyberattack Detection in Cyber-Physical Infrastructures

Arun Vignesh Malarkkan, Dongjie Wang, Haoyue Bai, Yanjie Fu

Main category: cs.LG

TL;DR: INCADET is a novel framework for real-time cyberattack detection using incremental causal graph learning, outperforming traditional methods in accuracy and adaptability.

DetailsMotivation: The increasing threat of cyberattacks on critical infrastructures requires adaptive detection methods to handle complex interdependencies and evolving attack patterns. Traditional methods suffer from high false positives and static limitations.

Method: INCADET dynamically updates causal graphs in real-time using three modules: early symptom detection, incremental causal graph learning, and causal graph classification with GCNs.

Result: Experiments show INCADET achieves superior accuracy, robustness, and adaptability compared to static and deep temporal baselines.

Conclusion: INCADET effectively addresses the limitations of traditional methods, providing a scalable and adaptive solution for real-time cyberattack detection.

Abstract: The escalating threat of cyberattacks on real-time critical infrastructures poses serious risks to public safety, demanding detection methods that effectively capture complex system interdependencies and adapt to evolving attack patterns. Traditional real-time anomaly detection techniques often suffer from excessive false positives due to their statistical sensitivity to high data variance and class imbalance. To address these limitations, recent research has explored modeling causal relationships among system components. However, prior work mainly focuses on offline causal graph-based approaches that require static historical data and fail to generalize to real-time settings. These methods are fundamentally constrained by: (1) their inability to adapt to dynamic shifts in data distribution without retraining, and (2) the risk of catastrophic forgetting when lacking timely supervision in live systems. To overcome these challenges, we propose INCADET, a novel framework for incremental causal graph learning tailored to real-time cyberattack detection. INCADET dynamically captures evolving system behavior by incrementally updating causal graphs across streaming time windows. The framework comprises three modules: 1) Early Symptom Detection: Detects transitions in system status using divergence in edge-weight distributions across sequential causal graphs. 2) Incremental Causal Graph Learning: Leverages experience replay and edge reinforcement to continually refine causal structures while preserving prior knowledge. 3) Causal Graph Classification: Employs Graph Convolutional Networks (GCNs) to classify system status using the learned causal graphs. Extensive experiments on real-world critical infrastructure datasets demonstrate that INCADET achieves superior accuracy, robustness, and adaptability compared to both static causal and deep temporal baselines in evolving attack scenarios.

[522] It’s Not That Simple. An Analysis of Simple Test-Time Scaling

Guojun Wu

Main category: cs.LG

TL;DR: Analysis shows test-time scaling behavior in models is mainly due to scaling down by enforcing max length, not fine-tuning or scaling up. Scaling up in o1-like models outperforms simple scaling methods.

DetailsMotivation: To understand the effectiveness and limitations of simple test-time scaling methods compared to o1-like models.

Method: Analyzed scaling behavior by enforcing max length (scaling down) and appending ‘Wait’ (scaling up), and compared with o1-like models.

Result: Scaling down dominates behavior; scaling up causes inconsistencies. o1-like models outperform when scaling up naturally.

Conclusion: Simple scaling methods are limited; true scaling aims for higher performance, not just replicating behavior.

Abstract: Prior work proposed simple test-time scaling, a method for replicating the test-time scaling behavior of o1-like models using distilled models, by manually controlling test-time compute: either scaling down by enforcing a maximum length or scaling up by iteratively appending “Wait” when the model is about to terminate its generation. This paper presents an analysis of simple test-time scaling and finds that the scaling behavior is largely attributed to scaling down by enforcing a maximum length. In contrast, fine-tuning on long CoT data distilled from o1-like models has no significant impact on scaling behavior, and scaling up by appending “Wait” leads to inconsistencies, as the model may oscillate between solutions. A key distinction exists between scaling down by enforcing a maximum length and scaling up test-time compute in o1-like models, such as DeepSeek-R1. These models are typically allowed to utilize as much compute as needed, with the only constraint being the model’s maximum supported length. By learning to naturally scale up test-time compute during reinforcement learning, o1-like models surpass their peak performance when scaling up. In contrast, simple test-time scaling progressively imposes a lower upper limit on model performance as it scales down. While replicating the test-time scaling behavior of o1 models can be straightforward by scaling down, it is crucial to recognize that the goal of scaling test-time compute is to unlock higher performance, beyond what the model could originally achieve, rather than merely reproducing the appearance of scaling behavior.
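
The two manual controls analyzed above can be sketched as below; `generate` is a hypothetical decoding call, and only the stopping behavior matters here:

```python
# Sketch of the two test-time compute controls discussed in the analysis.
def scale_down(generate, prompt, budget_tokens):
    # enforce a maximum length: reasoning is truncated at the token budget
    return generate(prompt, max_new_tokens=budget_tokens)

def scale_up(generate, prompt, n_waits=2):
    # iteratively append "Wait" whenever the model tries to terminate;
    # the analysis above notes this can make the model oscillate
    text = generate(prompt)
    for _ in range(n_waits):
        text = generate(prompt + text + "\nWait")
    return text
```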

[523] Deep RL Dual Sourcing Inventory Management with Supply and Capacity Risk Awareness

Feng Liu, Ying Liu, Carson Eisenach

Main category: cs.LG

TL;DR: The paper proposes using reinforcement learning (RL) with intervention models to solve large-scale stochastic optimization, focusing on supply chain inventory management. It leverages pre-trained deep learning models for exploration and introduces a constraint coordination mechanism.

DetailsMotivation: To address the inefficiency of directly modeling complex constraints in RL for stochastic optimization, the paper aims to break down supply chain processes into scalable deep learning modules.

Method: The approach combines RL with pre-trained deep learning models to simulate and compose stochastic processes. It includes a constraint coordination mechanism for forecasting dual costs.

Result: The method improves performance on large real-world datasets by decomposing supply chain processes into scalable and composable modules.

Conclusion: The paper demonstrates the efficacy of the proposed approach for large-scale stochastic optimization and identifies open problems for future research.

Abstract: In this work, we study how to efficiently apply reinforcement learning (RL) for solving large-scale stochastic optimization problems by leveraging intervention models. The key idea of the proposed methodology is to better explore the solution space by simulating and composing the stochastic processes using pre-trained deep learning (DL) models. We demonstrate our approach on a challenging real-world application, the multi-sourcing multi-period inventory management problem in supply chain optimization. In particular, we employ deep RL models for learning and forecasting the stochastic supply chain processes under a range of assumptions. Moreover, we also introduce a constraint coordination mechanism, designed to forecast dual costs given the cross-product constraints in the inventory network. We highlight that instead of directly modeling the complex physical constraints into the RL optimization problem and solving the stochastic problem as a whole, our approach breaks down those supply chain processes into scalable and composable DL modules, leading to improved performance on large real-world datasets. We also outline open problems for future research to further investigate the efficacy of such models.

[524] ReDiSC: A Reparameterized Masked Diffusion Model for Scalable Node Classification with Structured Predictions

Yule Li, Yifeng Lu, Zhen Wang, Zhewei Wei, Yaliang Li, Bolin Ding

Main category: cs.LG

TL;DR: ReDiSC, a reparameterized masked diffusion model, improves structured node classification by addressing label correlation in graphs, outperforming existing methods in scalability and performance.

DetailsMotivation: Existing GNNs assume node label independence, which contradicts real-world graph label correlations. ReDiSC aims to model joint label distributions for better structured predictions.

Method: ReDiSC uses a reparameterized masked diffusion model within a variational EM framework to estimate joint node label distributions, linking its M-step to GNN and label propagation hybrids.

Result: ReDiSC outperforms state-of-the-art methods in performance and scalability, especially on large datasets where others fail due to computational limits.

Conclusion: ReDiSC effectively addresses label correlation in graphs, offering a scalable and superior solution for structured node classification.

Abstract: In recent years, graph neural networks (GNNs) have achieved unprecedented successes in node classification tasks. Although GNNs inherently encode specific inductive biases (e.g., acting as low-pass or high-pass filters), most existing methods implicitly assume conditional independence among node labels in their optimization objectives. While this assumption is suitable for traditional classification tasks such as image recognition, it contradicts the intuitive observation that node labels in graphs remain correlated, even after conditioning on the graph structure. To make structured predictions for node labels, we propose ReDiSC, namely, Reparameterized masked Diffusion model for Structured node Classification. ReDiSC estimates the joint distribution of node labels using a reparameterized masked diffusion model, which is learned through the variational expectation-maximization (EM) framework. Our theoretical analysis shows the efficiency advantage of ReDiSC in the E-step compared to DPM-SNC, a state-of-the-art model that relies on a manifold-constrained diffusion model in the continuous domain. Meanwhile, we explicitly link ReDiSC’s M-step objective to popular GNN and label propagation hybrid approaches. Extensive experiments demonstrate that ReDiSC achieves superior or highly competitive performance compared to state-of-the-art GNN, label propagation, and diffusion-based baselines across both homophilic and heterophilic graphs of varying sizes. Notably, ReDiSC scales effectively to large-scale datasets on which previous structured diffusion methods fail due to computational constraints, highlighting its significant practical advantage in structured node classification tasks.

[525] Federated Reinforcement Learning in Heterogeneous Environments

Ukjo Hwang, Songnam Hong

Main category: cs.LG

TL;DR: A Federated Reinforcement Learning (FRL) framework addresses environment heterogeneity (FRL-EH) by optimizing a global policy via collective local experiences while preserving privacy. The proposed FedRQ algorithm converges to an optimal policy and extends to continuous spaces with expectile loss, outperforming existing FRL methods.

DetailsMotivation: To tackle the challenge of learning robust global policies in federated settings with statistically heterogeneous local environments, ensuring privacy and performance across diverse scenarios.

Method: Introduces a novel global objective function for robustness, proposes the FedRQ algorithm with theoretical convergence guarantees, and extends it to continuous spaces using expectile loss.

Result: Empirical evaluations show superior performance and robustness of FedRQ across heterogeneous environments compared to state-of-the-art FRL algorithms.

Conclusion: The FRL-EH framework and FedRQ algorithm effectively address environment heterogeneity, offering robust and scalable solutions for federated reinforcement learning.

Abstract: We investigate a Federated Reinforcement Learning with Environment Heterogeneity (FRL-EH) framework, where local environments exhibit statistical heterogeneity. Within this framework, agents collaboratively learn a global policy by aggregating their collective experiences while preserving the privacy of their local trajectories. To better reflect real-world scenarios, we introduce a robust FRL-EH framework by presenting a novel global objective function. This function is specifically designed to optimize a global policy that ensures robust performance across heterogeneous local environments and their plausible perturbations. We propose a tabular FRL algorithm named FedRQ and theoretically prove its asymptotic convergence to an optimal policy for the global objective function. Furthermore, we extend FedRQ to environments with continuous state space through the use of expectile loss, addressing the key challenge of minimizing a value function over a continuous subset of the state space. This advancement facilitates the seamless integration of the principles of FedRQ with various Deep Neural Network (DNN)-based RL algorithms. Extensive empirical evaluations validate the effectiveness and robustness of our FRL algorithms across diverse heterogeneous environments, consistently achieving superior performance over the existing state-of-the-art FRL algorithms.

[526] Glitches in Decision Tree Ensemble Models

Satyankar Chandra, Ashutosh Gupta, Kaushik Mallik, Krishna Shankaranarayanan, Namrita Varshney

Main category: cs.LG

TL;DR: The paper identifies ‘glitches’—small input neighborhoods causing abrupt output oscillations—as a new source of unreliability in AI models with steep decision boundaries. It formally defines glitches, demonstrates their prevalence, and proposes an NP-complete detection algorithm for GBDT models using MILP encoding.

DetailsMotivation: To address unreliable behaviors in AI models, particularly glitches, which impair trustworthiness and consistency in critical decision-making tasks.

Method: Formally defines glitches, demonstrates their existence in literature models/datasets, and develops an MILP-based algorithm for detecting glitches in GBDT models.

Result: Glitches are widespread and indicate model inconsistencies. The glitch-detection problem for tree ensembles is NP-complete, and the proposed algorithm is effective for GBDT benchmarks.

Conclusion: Glitches are a significant reliability issue in AI models, and the proposed algorithm provides a feasible solution for detecting them in GBDT models.

Abstract: Many critical decision-making tasks are now delegated to machine-learned models, and it is imperative that their decisions are trustworthy and reliable, and their outputs are consistent across similar inputs. We identify a new source of unreliable behaviors, called glitches, which may significantly impair the reliability of AI models having steep decision boundaries. Roughly speaking, glitches are small neighborhoods in the input space where the model’s output abruptly oscillates with respect to small changes in the input. We provide a formal definition of glitches, and use well-known models and datasets from the literature to demonstrate that they have widespread existence and argue they usually indicate potential model inconsistencies in the neighborhood of where they are found. We proceed to the algorithmic search of glitches for widely used gradient-boosted decision tree (GBDT) models. We prove that the problem of detecting glitches is NP-complete for tree ensembles, already for trees of depth 4. Our glitch-search algorithm for GBDT models uses an MILP encoding of the problem, and its effectiveness and computational feasibility are demonstrated on a set of widely used GBDT benchmarks taken from the literature.
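
The paper's actual search is an MILP over the tree ensemble; purely as an intuition-builder, one can probe for glitch-like oscillation by sampling a small neighborhood of a point and counting prediction flips (a heuristic, not the paper's algorithm):

```python
# Heuristic probe for glitch-like behavior of a GBDT: sample a small L_inf
# ball around each point and count how often the prediction flips.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
model = GradientBoostingClassifier(max_depth=4, random_state=0).fit(X, y)

def flip_count(model, x, radius=0.05, n_probe=256, seed=0):
    rng = np.random.default_rng(seed)
    probes = x + rng.uniform(-radius, radius, size=(n_probe, x.size))
    preds = model.predict(probes)
    return int((preds != model.predict(x[None])[0]).sum())

# points whose tiny neighborhoods show many oscillations are glitch suspects
scores = np.array([flip_count(model, x) for x in X[:50]])
print(np.argsort(scores)[-5:], scores.max())
```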

[527] Generative Distribution Distillation

Jiequan Cui, Beier Zhu, Qingshan Xu, Xiaogang Xu, Pengguang Chen, Xiaojuan Qi, Bei Yu, Hanwang Zhang, Richang Hong

Main category: cs.LG

TL;DR: The paper introduces Generative Distribution Distillation (GenDD) for knowledge distillation, addressing high-dimensional optimization and lack of label supervision with Split Tokenization and Distribution Contraction. It achieves competitive results, outperforming baselines by 16.29% in unsupervised settings and setting a new SOTA with 82.28% accuracy on ImageNet.

DetailsMotivation: To address challenges in knowledge distillation (KD) like high-dimensional optimization and lack of semantic supervision from labels, the paper proposes a generative approach.

Method: Proposes GenDD framework with Split Tokenization for unsupervised KD and Distribution Contraction to integrate label supervision. Theoretical proof links GenDD to multi-task learning.

Result: GenDD outperforms KL baseline by 16.29% on ImageNet in unsupervised settings. With supervision, ResNet-50 achieves 82.28% top-1 accuracy, setting a new SOTA.

Conclusion: GenDD effectively addresses KD challenges, demonstrating strong performance in both unsupervised and supervised settings, with theoretical and empirical validation.

Abstract: In this paper, we formulate the knowledge distillation (KD) as a conditional generative problem and propose the \textit{Generative Distribution Distillation (GenDD)} framework. A naive \textit{GenDD} baseline encounters two major challenges: the curse of high-dimensional optimization and the lack of semantic supervision from labels. To address these issues, we introduce a \textit{Split Tokenization} strategy, achieving stable and effective unsupervised KD. Additionally, we develop the \textit{Distribution Contraction} technique to integrate label supervision into the reconstruction objective. Our theoretical proof demonstrates that \textit{GenDD} with \textit{Distribution Contraction} serves as a gradient-level surrogate for multi-task learning, realizing efficient supervised training without explicit classification loss on multi-step sampling image representations. To evaluate the effectiveness of our method, we conduct experiments on balanced, imbalanced, and unlabeled data. Experimental results show that \textit{GenDD} performs competitively in the unsupervised setting, significantly surpassing KL baseline by \textbf{16.29%} on ImageNet validation set. With label supervision, our ResNet-50 achieves \textbf{82.28%} top-1 accuracy on ImageNet in 600 epochs training, establishing a new state-of-the-art.

[528] SDSC:A Structure-Aware Metric for Semantic Signal Representation Learning

Jeyoung Lee, Hochul Kang

Main category: cs.LG

TL;DR: The paper introduces SDSC, a structure-aware metric for time series SSL, addressing limitations of distance-based methods like MSE. It combines structural agreement with amplitude preservation, showing improved performance in benchmarks.

DetailsMotivation: Distance-based objectives like MSE in SSL for signals are sensitive to amplitude, polarity-invariant, and unbounded, hindering semantic alignment and interpretability.

Method: SDSC quantifies structural agreement using signed amplitudes, derived from DSC. It can be used as a loss or combined with MSE for hybrid optimization.

Result: SDSC-based pre-training matches or outperforms MSE, especially in in-domain and low-resource settings, enhancing semantic representation quality.

Conclusion: Structure-aware metrics like SDSC are viable alternatives to conventional distance-based methods, improving signal representation fidelity.

Abstract: We propose the Signal Dice Similarity Coefficient (SDSC), a structure-aware metric function for time series self-supervised representation learning. Most Self-Supervised Learning (SSL) methods for signals commonly adopt distance-based objectives such as mean squared error (MSE), which are sensitive to amplitude, invariant to waveform polarity, and unbounded in scale. These properties hinder semantic alignment and reduce interpretability. SDSC addresses this by quantifying structural agreement between temporal signals based on the intersection of signed amplitudes, derived from the Dice Similarity Coefficient (DSC). Although SDSC is defined as a structure-aware metric, it can be used as a loss by subtracting it from 1 and applying a differentiable approximation of the Heaviside function for gradient-based optimization. A hybrid loss formulation is also proposed to combine SDSC with MSE, improving stability and preserving amplitude where necessary. Experiments on forecasting and classification benchmarks demonstrate that SDSC-based pre-training achieves comparable or improved performance over MSE, particularly in in-domain and low-resource scenarios. The results suggest that structural fidelity in signal representations enhances the semantic representation quality, supporting the consideration of structure-aware metrics as viable alternatives to conventional distance-based methods.
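
One plausible formulation consistent with the description above (signed-amplitude intersection with Dice-style normalization, and a sigmoid standing in for the Heaviside step); the paper's exact definition may differ:

```python
# Sketch of an SDSC-style loss: Dice-style overlap of signed amplitudes where
# the two signals agree in sign; sigmoid(beta*x*y) smooths Heaviside(x*y).
import torch

def sdsc_loss(x, y, beta=10.0, eps=1e-8):
    same_sign = torch.sigmoid(beta * x * y)          # ~1 where signs agree
    inter = (same_sign * torch.minimum(x.abs(), y.abs())).sum(dim=-1)
    sdsc = 2 * inter / (x.abs().sum(dim=-1) + y.abs().sum(dim=-1) + eps)
    return 1.0 - sdsc                                # loss = 1 - similarity

x = torch.sin(torch.linspace(0, 6.28, 128)).unsqueeze(0)
y = 0.8 * x + 0.05 * torch.randn_like(x)
print(sdsc_loss(x, y))       # near 0 for structurally similar signals
```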

[529] Positive-Unlabeled Learning for Control Group Construction in Observational Causal Inference

Ilias Tsoumas, Dimitrios Bormpoudakis, Vasileios Sitokonstantinou, Athanasios Askitopoulos, Andreas Kalogeras, Charalampos Kontoes, Ioannis Athanasiadis

Main category: cs.LG

TL;DR: The paper proposes using positive-unlabeled (PU) learning to identify control units in observational studies where labeled controls are missing, enabling accurate average treatment effect (ATE) estimation.

DetailsMotivation: In observational studies, the lack of clearly labeled control units complicates causal inference. The paper aims to address this challenge by leveraging PU learning to identify controls from unlabeled data.

Method: The authors use PU learning to identify control units from unlabeled data, validated through simulated and real-world agricultural data. A causal graph generates synthetic scenarios to test reliability.

Result: PU learning successfully identifies control units and estimates ATE close to the true value, as demonstrated in both synthetic and real-world agricultural data.

Conclusion: PU learning is effective for causal inference in observational studies, especially where randomized experiments are impractical, with applications in environmental and agricultural sciences.

Abstract: In causal inference, whether through randomized controlled trials or observational studies, access to both treated and control units is essential for estimating the effect of a treatment on an outcome of interest. When treatment assignment is random, the average treatment effect (ATE) can be estimated directly by comparing outcomes between groups. In non-randomized settings, various techniques are employed to adjust for confounding and approximate the counterfactual scenario to recover an unbiased ATE. A common challenge, especially in observational studies, is the absence of units clearly labeled as controls, that is, units known not to have received the treatment. To address this, we propose positive-unlabeled (PU) learning as a framework for identifying, with high confidence, control units from a pool of unlabeled ones, using only the available treated (positive) units. We evaluate this approach using both simulated and real-world data. We construct a causal graph with diverse relationships and use it to generate synthetic data under various scenarios, assessing how reliably the method recovers control groups that allow estimation of the true ATE. We also apply our approach to real-world data on optimal sowing and fertilizer treatments in sustainable agriculture. Our findings show that PU learning can successfully identify control (negative) units from unlabeled data based only on treated units and, through the resulting control group, estimate an ATE that closely approximates the true value. This work has important implications for observational causal inference, especially in fields where randomized experiments are difficult or costly. In domains such as earth, environmental, and agricultural sciences, it enables a plethora of quasi-experiments by leveraging available earth observation and climate data, particularly when treated units are available but control units are lacking.
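
A minimal two-step PU sketch in the spirit described above: score unlabeled units against the treated (positive) ones, then keep the lowest-scoring units as high-confidence controls. The data and threshold are illustrative:

```python
# Two-step PU learning sketch: a labeled-vs-unlabeled classifier scores the
# unlabeled pool; the lowest-scoring units become "reliable negatives" (controls).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, size=(200, 5))     # treated (positive) units
X_unl = rng.normal(0.0, 1.0, size=(1000, 5))    # unlabeled pool

X = np.vstack([X_pos, X_unl])
s = np.r_[np.ones(len(X_pos)), np.zeros(len(X_unl))]   # labeled vs. unlabeled
clf = LogisticRegression(max_iter=1000).fit(X, s)

scores = clf.predict_proba(X_unl)[:, 1]
controls = X_unl[scores < np.quantile(scores, 0.2)]    # high-confidence controls
print(len(controls))
```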

[530] Kernel Based Maximum Entropy Inverse Reinforcement Learning for Mean-Field Games

Berkay Anahtarci, Can Deha Kariksiz, Naci Saldi

Main category: cs.LG

TL;DR: The paper proposes a maximum causal entropy inverse reinforcement learning method for infinite-horizon stationary mean-field games, using a reproducing kernel Hilbert space to model nonlinear rewards.

DetailsMotivation: Existing methods for mean-field games often restrict rewards to linear combinations of basis functions and focus on finite-horizon settings, limiting their applicability.

Method: A Lagrangian relaxation transforms the problem into unconstrained log-likelihood maximization, solved via gradient ascent, with theoretical consistency ensured by proving Fréchet differentiability.

Result: The method accurately recovers expert behavior in a mean-field traffic routing game, demonstrating its effectiveness.

Conclusion: The approach successfully infers nonlinear rewards in infinite-horizon settings, outperforming traditional linear methods.

Abstract: We consider the maximum causal entropy inverse reinforcement learning problem for infinite-horizon stationary mean-field games, in which we model the unknown reward function within a reproducing kernel Hilbert space. This allows the inference of rich and potentially nonlinear reward structures directly from expert demonstrations, in contrast to most existing inverse reinforcement learning approaches for mean-field games that typically restrict the reward function to a linear combination of a fixed finite set of basis functions. We also focus on the infinite-horizon cost structure, whereas prior studies primarily rely on finite-horizon formulations. We introduce a Lagrangian relaxation to this maximum causal entropy inverse reinforcement learning problem that enables us to reformulate it as an unconstrained log-likelihood maximization problem, and obtain a solution via a gradient ascent algorithm. To illustrate the theoretical consistency of the algorithm, we establish the smoothness of the log-likelihood objective by proving the Fréchet differentiability of the related soft Bellman operators with respect to the parameters in the reproducing kernel Hilbert space. We demonstrate the effectiveness of our method on a mean-field traffic routing game, where it accurately recovers expert behavior.

[531] The Origin of Self-Attention: From Pairwise Affinity Matrices to Transformers

Giorgio Roffo

Main category: cs.LG

TL;DR: The paper connects self-attention in Transformers to a broader affinity-based computation principle, highlighting Infinite Feature Selection (Inf-FS) as a foundational approach.

DetailsMotivation: To unify self-attention mechanisms across domains (vision, NLP, graphs) by tracing their shared reliance on affinity matrices.

Method: Comparative analysis of self-attention and Inf-FS, focusing on how affinity matrices (A) are defined and applied.

Result: Self-attention is shown as a special case of Inf-FS, with differences in affinity matrix computation and application.

Conclusion: The paper unifies diverse ML models under a common affinity-based framework, emphasizing shared mathematical foundations.

Abstract: The self-attention mechanism, now central to deep learning architectures such as Transformers, is a modern instance of a more general computational principle: learning and using pairwise affinity matrices to control how information flows through a model. This paper traces the conceptual origins of self-attention across multiple domains, including computer vision, natural language processing, and graph learning, through their shared reliance on an affinity matrix, denoted as A. We highlight Infinite Feature Selection (Inf-FS) as a foundational approach that generalizes the idea of affinity-based weighting. Unlike the fixed dot-product structure used in Transformers, Inf-FS defines A either through domain knowledge or by learning, and computes feature relevance through multi-hop propagation over the affinity graph. From this perspective, self-attention can be seen as a special case of Inf-FS: it uses a single-hop affinity computation where A is dynamically built from token similarities. We argue that the underlying structure, reasoning over pairwise relationships, is preserved across both approaches, and the key differences lie in how the affinity matrix is defined and applied. By situating self-attention within the broader paradigm of affinity-based computation, we unify several strands of machine learning research and highlight a common mathematical foundation that underpins diverse models and tasks.

[532] LPS-GNN : Deploying Graph Neural Networks on Graphs with 100-Billion Edges

Xu Cheng, Liang Yao, Feng He, Yukuo Cen, Yufei He, Chenhui Zhang, Wenzheng Feng, Hongyun Cai, Jie Tang

Main category: cs.LG

TL;DR: LPS-GNN is a scalable, efficient GNN framework that handles large-scale graphs with a single GPU, improving performance by 13.8% in User Acquisition. It introduces LPMetis for graph partitioning and subgraph augmentation for better accuracy.

DetailsMotivation: Existing GNNs struggle with efficiency and accuracy due to computational demands and neighbor explosion in large graphs.

Method: Proposes LPS-GNN with LPMetis for graph partitioning and subgraph augmentation to enhance performance.

Result: Achieves 8.24% to 13.89% performance lift over SOTA models in real-world applications.

Conclusion: LPS-GNN is a scalable, efficient solution for large-scale graph tasks, validated on real-world datasets.

Abstract: Graph Neural Networks (GNNs) have emerged as powerful tools for various graph mining tasks, yet existing scalable solutions often struggle to balance execution efficiency with prediction accuracy. These difficulties stem from iterative message-passing techniques, which place significant computational demands and require extensive GPU memory, particularly when dealing with the neighbor explosion issue inherent in large-scale graphs. This paper introduces a scalable, low-cost, flexible, and efficient GNN framework called LPS-GNN, which can perform representation learning on graphs with 100 billion edges using a single GPU in 10 hours and shows a 13.8% improvement in User Acquisition scenarios. We examine existing graph partitioning methods and design a superior graph partition algorithm named LPMetis. In particular, LPMetis outperforms current state-of-the-art (SOTA) approaches on various evaluation metrics. In addition, our paper proposes a subgraph augmentation strategy to enhance the model’s predictive performance. It exhibits excellent compatibility, allowing the entire framework to accommodate various GNN algorithms. Successfully deployed on the Tencent platform, LPS-GNN has been tested on public and real-world datasets, achieving performance lifts of 8.24% to 13.89% over SOTA models in online applications.

[533] A Transformer-Based Conditional GAN with Multiple Instance Learning for UAV Signal Detection and Classification

Haochen Liu, Jia Bi, Xiaomin Wang, Xin Yang, Ling Wang

Main category: cs.LG

TL;DR: A novel framework combining Transformer-based GAN and MILET for UAV flight state classification achieves high accuracy and efficiency, outperforming SOTA methods.

DetailsMotivation: Conventional TSC methods lack robustness for dynamic UAV environments, and SOTA models require large datasets and high computational costs.

Method: Integrates Transformer-based GAN for data augmentation and MILET to focus on discriminative input segments.

Result: Achieves 96.5% accuracy on DroneDetect and 98.6% on DroneRF datasets, with strong generalization and efficiency.

Conclusion: The framework is effective for real-time UAV flight state classification in resource-constrained settings.

Abstract: Unmanned Aerial Vehicles (UAVs) are increasingly used in surveillance, logistics, agriculture, disaster management, and military operations. Accurate detection and classification of UAV flight states, such as hovering, cruising, ascending, or transitioning, are essential for safe and effective operations. However, conventional time series classification (TSC) methods often lack robustness and generalization for dynamic UAV environments, while state-of-the-art (SOTA) models like Transformers and LSTM-based architectures typically require large datasets and entail high computational costs, especially with high-dimensional data streams. This paper proposes a novel framework that integrates a Transformer-based Generative Adversarial Network (GAN) with Multiple Instance Locally Explainable Learning (MILET) to address these challenges in UAV flight state classification. The Transformer encoder captures long-range temporal dependencies and complex telemetry dynamics, while the GAN module augments limited datasets with realistic synthetic samples. MIL is incorporated to focus attention on the most discriminative input segments, reducing noise and computational overhead. Experimental results show that the proposed method achieves superior accuracy of 96.5% on the DroneDetect dataset and 98.6% on the DroneRF dataset, outperforming other SOTA approaches. The framework also demonstrates strong computational efficiency and robust generalization across diverse UAV platforms and flight states, highlighting its potential for real-time deployment in resource-constrained environments.

[534] $k$-PCA for (non-squared) Euclidean Distances: Polynomial Time Approximation

Daniel Greenhut, Dan Feldman

Main category: cs.LG

TL;DR: First deterministic polynomial-time algorithm for the k-subspace median (sum of non-squared Euclidean distances), with a √d approximation factor; neither the running time nor the approximation factor is exponential in k.

DetailsMotivation: The k-subspace median is sparser and more robust to noise/outliers than the classic k-PCA mean, but it is non-convex for k < d-1 and therefore much harder to approximate.

Method: A deterministic algorithm approximating the affine k-subspace that minimizes the ℓ2,1-mixed norm of distances, achieving a multiplicative √d factor in time polynomial in the input size.

Result: Both the running time and the approximation factor avoid exponential dependence on k; open code and experiments on real-world datasets are provided.

Conclusion: The technique is expected to extend to related problems, such as ℓ2,z distance norms for z outside {1,2} (e.g., z = ∞) and handling outliers/sparsity.

Abstract: Given an integer $k \geq 1$ and a set $P$ of $n$ points in $\mathbb{R}^d$, the classic $k$-PCA (Principal Component Analysis) approximates the affine $k$-subspace mean of $P$, which is the $k$-dimensional affine linear subspace that minimizes its sum of squared Euclidean distances ($\ell_{2,2}$-norm) over the points of $P$, i.e., the mean of these distances. The $k$-subspace median is the subspace that minimizes its sum of (non-squared) Euclidean distances ($\ell_{2,1}$-mixed norm), i.e., their median. The median subspace is usually more sparse and robust to noise/outliers than the mean, but also much harder to approximate since, unlike the $\ell_{z,z}$ (non-mixed) norms, it is non-convex for $k<d-1$. We provide the first polynomial-time deterministic algorithm whose running time and approximation factor are both not exponential in $k$. More precisely, the multiplicative approximation factor is $\sqrt{d}$, and the running time is polynomial in the size of the input. We expect that our technique would be useful for many other related problems, such as the $\ell_{2,z}$ norm of distances for $z \notin \{1,2\}$, e.g., $z=\infty$, and handling outliers/sparsity. Open code and experimental results on real-world datasets are also provided.
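
For reference, the two objectives contrasted in the abstract can be written side by side (this restates the abstract's definitions; $F$ ranges over affine $k$-dimensional subspaces of $\mathbb{R}^d$):

```latex
% k-subspace mean (classic k-PCA): minimize the sum of squared distances
\min_{F}\ \sum_{p \in P} \operatorname{dist}(p, F)^{2} \qquad (\ell_{2,2}\text{-norm})

% k-subspace median: minimize the sum of (non-squared) distances;
% non-convex for k < d-1, which is what makes it hard to approximate
\min_{F}\ \sum_{p \in P} \operatorname{dist}(p, F) \qquad (\ell_{2,1}\text{-mixed norm})
```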

[535] Rec-AD: An Efficient Computation Framework for FDIA Detection Based on Tensor Train Decomposition and Deep Learning Recommendation Model

Yunfeng Li, Junhong Liu, Zhaohui Yang, Guofu Liao, Chuyun Zhang

Main category: cs.LG

TL;DR: Rec-AD integrates Tensor Train decomposition with DLRM to improve FDIA detection efficiency in smart grids, reducing computational and memory burdens while enhancing real-time performance.

DetailsMotivation: Address computational and memory inefficiencies in deep learning-based FDIA detection for large-scale smart grids.

Method: Uses Tensor Train decomposition for embedding compression, index reordering for optimized data access, and pipeline training to reduce memory overhead.

Result: Significantly improves computational throughput and real-time detection, narrowing attack windows and increasing attacker costs.

Conclusion: Rec-AD strengthens edge computing and scalability, offering robust support for smart grid security.

Abstract: Deep learning models have been widely adopted for False Data Injection Attack (FDIA) detection in smart grids due to their ability to capture unstructured and sparse features. However, the increasing system scale and data dimensionality introduce significant computational and memory burdens, particularly in large-scale industrial datasets, limiting detection efficiency. To address these issues, this paper proposes Rec-AD, a computationally efficient framework that integrates Tensor Train decomposition with the Deep Learning Recommendation Model (DLRM). Rec-AD enhances training and inference efficiency through embedding compression, optimized data access via index reordering, and a pipeline training mechanism that reduces memory communication overhead. Fully compatible with PyTorch, Rec-AD can be integrated into existing FDIA detection systems without code modifications. Experimental results show that Rec-AD significantly improves computational throughput and real-time detection performance, narrowing the attack window and increasing attacker cost. These advancements strengthen edge computing capabilities and scalability, providing robust technical support for smart grid security.

[536] Revisiting Graph Contrastive Learning on Anomaly Detection: A Structural Imbalance Perspective

Yiming Xu, Zhen Peng, Bin Shi, Xu Hua, Bo Dong, Song Wang, Chen Chen

Main category: cs.LG

TL;DR: AD-GCL is a novel graph contrastive learning framework designed to improve robustness in anomaly detection, especially for tail nodes in structurally imbalanced networks.

DetailsMotivation: Existing GCL-based anomaly detection methods prioritize overall performance but lack robustness to structural imbalance, particularly for tail anomalies, limiting their real-world applicability.

Method: AD-GCL introduces neighbor pruning for head nodes, anomaly-guided neighbor completion for tail nodes, and intra- and inter-view consistency loss for enhanced representation.

Result: AD-GCL outperforms existing methods in detecting both head and tail anomalies across multiple datasets.

Conclusion: AD-GCL addresses structural imbalance in anomaly detection, offering a more robust and comprehensive solution for real-world networks.

Abstract: The superiority of graph contrastive learning (GCL) has prompted its application to anomaly detection tasks for more powerful risk warning systems. Unfortunately, existing GCL-based models tend to excessively prioritize overall detection performance while neglecting robustness to structural imbalance, which can be problematic for many real-world networks following power-law degree distributions. Particularly, GCL-based methods may fail to capture tail anomalies (abnormal nodes with low degrees). This raises concerns about the security and robustness of current anomaly detection algorithms and therefore hinders their applicability in a variety of realistic high-risk scenarios. To the best of our knowledge, research on the robustness of graph anomaly detection to structural imbalance has received little scrutiny. To address the above issues, this paper presents a novel GCL-based framework named AD-GCL. It devises the neighbor pruning strategy to filter noisy edges for head nodes and facilitate the detection of genuine tail nodes by aligning from head nodes to forged tail nodes. Moreover, AD-GCL actively explores potential neighbors to enlarge the receptive field of tail nodes through anomaly-guided neighbor completion. We further introduce intra- and inter-view consistency loss of the original and augmentation graph for enhanced representation. The performance evaluation of the whole, head, and tail nodes on multiple datasets validates the comprehensive superiority of the proposed AD-GCL in detecting both head anomalies and tail anomalies.
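
A minimal sketch of the degree-based head/tail split and a similarity-based neighbor-pruning step, assuming a dense adjacency matrix and node features; AD-GCL's actual pruning and anomaly-guided completion are learned rather than this fixed heuristic.

```python
import numpy as np

def split_head_tail(adj, degree_threshold=5):
    """Split nodes into head (high-degree) and tail nodes by a degree threshold,
    a simplified stand-in for structural-imbalance handling."""
    deg = adj.sum(axis=1)
    return np.where(deg > degree_threshold)[0], np.where(deg <= degree_threshold)[0]

def prune_neighbors(adj, feats, node, keep=5):
    """Keep only the `keep` most feature-similar neighbors of a head node,
    dropping likely-noisy edges (cosine similarity as the pruning score)."""
    nbrs = np.where(adj[node] > 0)[0]
    sims = feats[nbrs] @ feats[node] / (
        np.linalg.norm(feats[nbrs], axis=1) * np.linalg.norm(feats[node]) + 1e-9)
    return nbrs[np.argsort(-sims)[:keep]]

rng = np.random.default_rng(0)
adj = (rng.random((50, 50)) < 0.15).astype(float)
np.fill_diagonal(adj, 0)
feats = rng.normal(size=(50, 8))
head, tail = split_head_tail(adj)
print(f"{len(head)} head / {len(tail)} tail; kept neighbors of node {head[0]}: "
      f"{prune_neighbors(adj, feats, head[0])}")
```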

[537] GCC-Spam: Spam Detection via GAN, Contrastive Learning, and Character Similarity Networks

Zixin Xu, Zhijie Wang, Zhiyuan Pan

Main category: cs.LG

TL;DR: A novel spam-text detection framework, GCC-Spam, addresses adversarial spam strategies and data scarcity using character similarity networks, contrastive learning, and GAN-generated pseudo-spam samples, outperforming baselines with fewer labeled examples.

DetailsMotivation: The rise of spam text poses risks like information leakage and social instability, requiring robust detection methods despite adversarial tactics and limited labeled data.

Method: GCC-Spam integrates character similarity networks for obfuscation resistance, contrastive learning for better discrimination, and GANs for pseudo-spam generation to tackle data scarcity.

Result: The model achieves higher detection rates than baselines, even with fewer labeled examples, demonstrating improved robustness and accuracy.

Conclusion: GCC-Spam effectively counters spam challenges through innovative techniques, offering a scalable solution with enhanced performance.

Abstract: The exponential growth of spam text on the Internet necessitates robust detection mechanisms to mitigate risks such as information leakage and social instability. This work addresses two principal challenges: adversarial strategies employed by spammers and the scarcity of labeled data. We propose a novel spam-text detection framework GCC-Spam, which integrates three core innovations. First, a character similarity network captures orthographic and phonetic features to counter character-obfuscation attacks and furthermore produces sentence embeddings for downstream classification. Second, contrastive learning enhances discriminability by optimizing the latent-space distance between spam and normal texts. Third, a Generative Adversarial Network (GAN) generates realistic pseudo-spam samples to alleviate data scarcity while improving model robustness and classification accuracy. Extensive experiments on real-world datasets demonstrate that our model outperforms baseline approaches, achieving higher detection rates with significantly fewer labeled examples.

[538] Spatial-Temporal Transformer with Curriculum Learning for EEG-Based Emotion Recognition

Xuetao Lin, Tianhao Peng, Peihong Dai, Yu Liang, Wenjun Wu

Main category: cs.LG

TL;DR: The paper introduces SST-CL, a framework combining spatial-temporal transformers and curriculum learning for EEG-based emotion recognition, addressing non-stationary neural patterns and dynamic emotional intensity.

DetailsMotivation: To tackle challenges in EEG-based emotion recognition: integrating non-stationary spatial-temporal neural patterns and adapting to dynamic emotional intensity variations.

Method: Proposes SST-CL with spatial and temporal encoders for EEG signal analysis, plus an intensity-aware curriculum learning strategy for training.

Result: Achieves state-of-the-art performance on three benchmark datasets, validated by ablation studies.

Conclusion: SST-CL effectively integrates spatial-temporal patterns and adapts to emotional intensity, demonstrating superior performance.

Abstract: EEG-based emotion recognition plays an important role in developing adaptive brain-computer communication systems, yet faces two fundamental challenges in practical implementations: (1) effective integration of non-stationary spatial-temporal neural patterns, (2) robust adaptation to dynamic emotional intensity variations in real-world scenarios. This paper proposes SST-CL, a novel framework integrating spatial-temporal transformers with curriculum learning. Our method introduces two core components: a spatial encoder that models inter-channel relationships and a temporal encoder that captures multi-scale dependencies through windowed attention mechanisms, enabling simultaneous extraction of spatial correlations and temporal dynamics from EEG signals. Complementing this architecture, an intensity-aware curriculum learning strategy progressively guides training from high-intensity to low-intensity emotional states through dynamic sample scheduling based on a dual difficulty assessment. Comprehensive experiments on three benchmark datasets demonstrate state-of-the-art performance across various emotional intensity levels, with ablation studies confirming the necessity of both architectural components and the curriculum learning mechanism.
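
The curriculum component can be caricatured in a few lines: order samples from high to low emotional intensity and train in stages. The paper's dynamic, dual-difficulty scheduling is richer than this static split.

```python
import numpy as np

def curriculum_stages(intensity, n_stages=3):
    """Intensity-aware curriculum: schedule samples from high- to
    low-intensity emotional states across `n_stages` training stages."""
    order = np.argsort(-intensity)            # high intensity first
    return np.array_split(order, n_stages)

rng = np.random.default_rng(0)
intensity = rng.random(100)                   # hypothetical per-sample intensity scores
for t, idx in enumerate(curriculum_stages(intensity)):
    print(f"stage {t}: {len(idx)} samples, mean intensity {intensity[idx].mean():.2f}")
```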

[539] Fraud is Not Just Rarity: A Causal Prototype Attention Approach to Realistic Synthetic Oversampling

Claudio Giusti, Luca Guarnera, Mirko Casu, Sebastiano Battiato

Main category: cs.LG

TL;DR: The paper proposes CPAC, an interpretable architecture for fraud detection, improving latent space structure and outperforming traditional oversamplers and generative models.

DetailsMotivation: Addressing the challenge of detecting fraudulent credit card transactions due to class imbalance and subtle patterns, existing methods like GANs and VAEs often lead to overconfident classifiers and poor latent cluster separation.

Method: Introduces the Causal Prototype Attention Classifier (CPAC) with prototype-based attention mechanisms, coupled with a VAE-GAN for better cluster separation.

Result: CPAC achieves an F1-score of 93.14% and recall of 90.18%, with improved latent cluster separation compared to traditional methods.

Conclusion: Classifier-guided latent shaping with CPAC enhances fraud detection performance and offers insights into representation learning, with the codebase to be released.

Abstract: Detecting fraudulent credit card transactions remains a significant challenge, due to the extreme class imbalance in real-world data and the often subtle patterns that separate fraud from legitimate activity. Existing research commonly attempts to address this by generating synthetic samples for the minority class using approaches such as GANs, VAEs, or hybrid generative models. However, these techniques, particularly when applied only to minority-class data, tend to result in overconfident classifiers and poor latent cluster separation, ultimately limiting real-world detection performance. In this study, we propose the Causal Prototype Attention Classifier (CPAC), an interpretable architecture that promotes class-aware clustering and improved latent space structure through prototype-based attention mechanisms, and we couple it with the encoder of a VAE-GAN, allowing it to offer better cluster separation and moving beyond post-hoc sample augmentation. We compared CPAC-augmented models to traditional oversamplers, such as SMOTE, as well as to state-of-the-art generative models, both with and without CPAC-based latent classifiers. Our results show that classifier-guided latent shaping with CPAC delivers superior performance, achieving an F1-score of 93.14% and recall of 90.18%, along with improved latent cluster separation. Further ablation studies and visualizations provide deeper insight into the benefits and limitations of classifier-driven representation learning for fraud detection. The codebase for this work will be released upon final submission.
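
A toy numpy sketch of the prototype-attention read-out: score a latent vector against per-class prototypes and normalize, yielding interpretable class affinities. The real CPAC couples this with a VAE-GAN encoder and causal structure; the names and shapes below are hypothetical.

```python
import numpy as np

def prototype_attention(z, prototypes):
    """Softmax attention of a latent vector over learned class prototypes,
    a minimal stand-in for CPAC's prototype-based attention read-out."""
    d = z.shape[-1]
    scores = prototypes @ z / np.sqrt(d)      # one score per prototype
    attn = np.exp(scores - scores.max())
    return attn / attn.sum()

rng = np.random.default_rng(1)
prototypes = rng.normal(size=(2, 16))         # [fraud, legitimate] prototypes
z = rng.normal(size=16)                       # encoder output for one transaction
print(prototype_attention(z, prototypes))     # interpretable class affinities
```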

[540] Exploring the Dynamic Scheduling Space of Real-Time Generative AI Applications on Emerging Heterogeneous Systems

Rachid Karami, Rajeev Patwari, Hyoukjun Kwon, Ashish Sirasao

Main category: cs.LG

TL;DR: The paper explores real-time generative AI (RTGen) workloads on heterogeneous SoCs, focusing on scheduling policies’ impact on performance and latency.

DetailsMotivation: The rise of RTGen workloads in applications like video conferencing and gaming necessitates efficient scheduling on heterogeneous SoCs, which is underexplored.

Method: The study characterizes RTGen workloads on AMD’s Ryzen AI SoC, profiles model performance, and evaluates five scheduling policies.

Result: Scheduling decisions significantly affect performance, with a 41.7% average difference in deadline violation rates.

Conclusion: Workload-aware, dynamic heterogeneous scheduling is crucial for high-performance RTGen applications.

Abstract: The integration of generative AI models, particularly large language models (LLMs), into real-time multi-model AI applications such as video conferencing and gaming is giving rise to a new class of workloads: real-time generative AI (RTGen). These workloads combine the compute intensity and dynamic execution patterns of generative models with the stringent latency and concurrency constraints of real-time inference. To meet the diverse demands of RTGen workloads, modern edge platforms increasingly adopt heterogeneous system-on-chip (SoC) architectures that integrate CPUs, GPUs, and NPUs. Despite the potential of heterogeneous SoC, the scheduling space complexity and performance implications of RTGen workloads on such platforms remain underexplored. In this work, we perform a comprehensive characterization of RTGen workloads on AMD’s latest heterogeneous SoC, Ryzen AI. We construct realistic multi-model scenarios inspired by industry use cases and profile model performance across all available backends. Using this data, we evaluate five scheduling policies and their impact on both real-time metrics (e.g., deadline violation rate) and LLM performance (e.g., time-to-first-token and tokens-per-second). Our results show that scheduling decisions significantly affect workload performance (e.g., leading to a 41.7% difference in deadline violation rates on average), and highlight the need for scheduling strategies that are aware of workload dynamics and hardware heterogeneity. Our findings underscore the importance of workload-aware, dynamic heterogeneous scheduling in enabling high-performance, on-device RTGen applications.
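
The headline real-time metric is straightforward to pin down; a minimal sketch with made-up per-request numbers:

```python
def deadline_violation_rate(completion_ms, deadline_ms):
    """Fraction of inference requests that miss their real-time deadline,
    the metric used to compare scheduling policies."""
    misses = sum(c > d for c, d in zip(completion_ms, deadline_ms))
    return misses / len(deadline_ms)

# Hypothetical per-request completion times (ms) under one scheduling policy,
# against a 30 fps (33 ms) deadline.
print(deadline_violation_rate([28, 35, 60, 19], [33, 33, 33, 33]))  # 0.5
```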

[541] LeanTree: Accelerating White-Box Proof Search with Factorized States in Lean 4

Matěj Kripner, Michal Šustr, Milan Straka

Main category: cs.LG

TL;DR: LeanTree introduces a white-box tool for ATP, leveraging LLMs with Lean 4 to factorize proofs, outperforming black-box methods in some cases.

DetailsMotivation: Address the lag in white-box ATP methods compared to black-box approaches by leveraging LLMs for incremental proof construction.

Method: Develop LeanTree, a Lean 4-based tool that factorizes complex proofs into simpler branches and provides a dataset of intermediate states.

Result: Preliminary results show white-box methods like LeanTree can outperform black-box alternatives in certain settings.

Conclusion: LeanTree demonstrates the potential of white-box approaches in ATP, offering advantages like simplified evaluation and richer training data.

Abstract: Automated theorem proving (ATP) has been a classical problem in artificial intelligence since its inception, yet it remains challenging due to its vast state and action space. Large language models (LLMs) have recently emerged as a promising heuristic for ATP, but they lack correctness guarantees and thus require interaction with a proof verifier. Such interactions typically follow one of two approaches: black-box interaction, which does not utilize intermediate proof states, or white-box approaches, which allow for incremental proof construction and examination of intermediate states. While black-box approaches have directly benefited from recent LLM advances, white-box methods have comparatively lagged behind. In this paper, we address this gap by introducing LeanTree, which consists of (i) a tool built in the Lean 4 language that factorizes complex proof states into simpler, independent branches, and (ii) a dataset of these factorized intermediate states. Our white-box tooling offers several advantages over black-box approaches: it simplifies evaluation, reduces necessary context, generates richer training data, enables parallel search across multiple states, supports efficient reuse of states, and provides feedback in case of errors. Our preliminary results hint that white-box approaches outperform black-box alternatives in some settings.

[542] Task-Agnostic Continual Prompt Tuning with Gradient-Based Selection and Decoding

Anushka Tiwari, Sayantan Pal, Rohini K. Srihari, Kaiyi Ji

Main category: cs.LG

TL;DR: GRID is a unified framework for prompt-based continual learning that tackles latent forgetting and prompt memory issues, improving backward transfer and scalability.

DetailsMotivation: Existing prompt-based CL methods assume task-aware inference and use growing task-specific prompts, limiting scalability and hiding latent forgetting.

Method: GRID integrates task-aware decoding with representative inputs, automatic task identification, and constrained decoding, plus a gradient-based prompt selection strategy for memory efficiency.

Result: GRID improves backward transfer, reduces forgotten tasks by up to 80%, and achieves competitive forward transfer, outperforming state-of-the-art methods.

Conclusion: GRID effectively addresses scalability and forgetting in prompt-based CL, offering a robust solution for lifelong learning.

Abstract: Prompt-based continual learning (CL) offers a parameter-efficient way to adapt large language models (LLMs) across task sequences. However, most existing methods assume task-aware inference and maintain a growing list of task-specific prompts, which limits scalability and hides latent forgetting. In this work, we introduce GRID, a unified framework that addresses two key limitations: (1) latent forgetting under task-agnostic inference, and (2) prompt memory explosion as task sequences grow. GRID integrates a task-aware decoding mechanism that improves backward transfer by leveraging representative inputs, automatic task identification, and constrained decoding. Additionally, we propose a gradient-based prompt selection strategy that compresses less informative prompts into a single aggregated representation, enabling scalable and memory-efficient lifelong learning. Extensive experiments across short-sequence, long-sequence, and negative transfer benchmarks show that GRID significantly improves backward transfer, achieves competitive forward transfer, and reduces forgotten tasks by up to 80%, outperforming state-of-the-art methods on T5 and Flan-T5 backbones.
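
The gradient-based prompt-selection idea, in miniature: rank prompts by gradient norm, keep the most informative, and collapse the remainder into one aggregated slot. A sketch with hypothetical shapes, not GRID's exact procedure.

```python
import numpy as np

def compress_prompts(prompts, grad_norms, keep=4):
    """Gradient-based prompt selection: keep the `keep` prompts with the
    largest gradient norms and average the rest into one aggregate slot,
    bounding prompt memory as the task sequence grows."""
    order = np.argsort(-np.asarray(grad_norms))
    kept = [prompts[i] for i in order[:keep]]
    rest = [prompts[i] for i in order[keep:]]
    if rest:
        kept.append(np.mean(rest, axis=0))    # single aggregated representation
    return kept

rng = np.random.default_rng(0)
pool = [rng.normal(size=(10, 32)) for _ in range(8)]  # 8 tasks, 10 soft tokens x 32 dims
print(len(compress_prompts(pool, grad_norms=rng.random(8))))  # 4 kept + 1 aggregate = 5
```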

[543] Balancing Expressivity and Robustness: Constrained Rational Activations for Reinforcement Learning

Rafał Surdej, Michał Bortkiewicz, Alex Lewandowski, Mateusz Ostaszewski, Clare Lyle

Main category: cs.LG

TL;DR: Trainable rational activation functions enhance adaptability but can introduce instability in RL and continual learning. A constrained variant is proposed to balance expressivity and plasticity, improving stability and performance.

DetailsMotivation: To understand the impact of trainable rational activation functions on training stability and performance in reinforcement and continual learning.

Method: Study trainable rational activations, propose a constrained variant to limit output scaling, and test in MetaWorld, DMC, and continual learning benchmarks.

Result: Rational activations show a trade-off between expressivity and plasticity. The constrained variant improves stability and performance in RL and continual learning.

Conclusion: The findings provide design principles for robust trainable activations in dynamic environments, with the trade-off being more relevant for continuous control.

Abstract: Trainable activation functions, whose parameters are optimized alongside network weights, offer increased expressivity compared to fixed activation functions. Specifically, trainable activation functions defined as ratios of polynomials (rational functions) have been proposed to enhance plasticity in reinforcement learning. However, their impact on training stability remains unclear. In this work, we study trainable rational activations in both reinforcement and continual learning settings. We find that while their flexibility enhances adaptability, it can also introduce instability, leading to overestimation in RL and feature collapse in longer continual learning scenarios. Our main result is demonstrating a trade-off between expressivity and plasticity in rational activations. To address this, we propose a constrained variant that structurally limits excessive output scaling while preserving adaptability. Experiments across MetaWorld and DeepMind Control Suite (DMC) environments show that our approach improves training stability and performance. In continual learning benchmarks, including MNIST with reshuffled labels and Split CIFAR-100, we reveal how different constraints affect the balance between expressivity and long-term retention. While preliminary experiments in discrete action domains (e.g., Atari) did not show similar instability, this suggests that the trade-off is particularly relevant for continuous control. Together, our findings provide actionable design principles for robust and adaptable trainable activations in dynamic, non-stationary environments. Code available at: https://github.com/special114/rl_rational_plasticity.
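
A small numpy sketch of a rational activation and one way to constrain its output scaling; the squashing used here is one plausible reading of the paper's constraint, not its exact parameterization, and the coefficients are hypothetical.

```python
import numpy as np

def rational(x, p, q):
    """Trainable rational activation P(x)/Q(x); the denominator is
    parameterized as 1 + |Q(x)| so it cannot vanish (a common safeguard)."""
    return np.polyval(p, x) / (1.0 + np.abs(np.polyval(q, x)))

def constrained_rational(x, p, q, bound=3.0):
    """Constrained variant: squash through bound*tanh(./bound) to limit
    excessive output scaling while preserving shape flexibility near zero."""
    return bound * np.tanh(rational(x, p, q) / bound)

x = np.linspace(-4, 4, 9)
p, q = [0.5, 1.0, 0.0], [0.3, 0.0]            # hypothetical polynomial coefficients
print(constrained_rational(x, p, q))
```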

[544] Better Training Data Attribution via Better Inverse Hessian-Vector Products

Andrew Wang, Elisa Nguyen, Runshi Yang, Juhan Bae, Sheila A. McIlraith, Roger Grosse

Main category: cs.LG

TL;DR: ASTRA improves TDA by efficiently approximating inverse Hessian-vector products (iHVP) using EKFAC-preconditioned Neumann series iterations, outperforming existing methods.

DetailsMotivation: Gradient-based TDA methods struggle with efficiently approximating iHVP, limiting their performance.

Method: ASTRA combines EKFAC-preconditioner with Neumann series iterations for accurate iHVP approximation.

Result: ASTRA is more accurate, easier to tune, and requires fewer iterations than existing methods, enhancing TDA performance.

Conclusion: Accurate iHVP approximation via ASTRA significantly improves TDA, offering a practical solution for training data attribution.

Abstract: Training data attribution (TDA) provides insights into which training data is responsible for a learned model behavior. Gradient-based TDA methods such as influence functions and unrolled differentiation both involve a computation that resembles an inverse Hessian-vector product (iHVP), which is difficult to approximate efficiently. We introduce an algorithm (ASTRA) which uses the EKFAC-preconditioner on Neumann series iterations to arrive at an accurate iHVP approximation for TDA. ASTRA is easy to tune, requires fewer iterations than Neumann series iterations, and is more accurate than EKFAC-based approximations. Using ASTRA, we show that improving the accuracy of the iHVP approximation can significantly improve TDA performance.
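
The core numerical idea, a preconditioned Neumann (Richardson-style) iteration for the iHVP, can be sketched and sanity-checked against an explicit SPD matrix. The diagonal preconditioner below merely stands in for EKFAC, and ASTRA's exact update may differ.

```python
import numpy as np

def neumann_ihvp(hvp, v, precond, n_iters=500, lr=0.25, damping=0.0):
    """Preconditioned, scaled Neumann iteration for an inverse
    Hessian-vector product x ~ (H + damping*I)^{-1} v:
        x <- x + lr * P(v - (H + damping*I) x).
    Converges when lr < 2 / lambda_max(P H); `precond` stands in for
    an EKFAC-style preconditioner (here an arbitrary callable)."""
    x = np.zeros_like(v)
    for _ in range(n_iters):
        x = x + lr * precond(v - hvp(x) - damping * x)
    return x

# Toy check against an explicit SPD "Hessian".
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 20))
H = A @ A.T / 20 + 0.1 * np.eye(20)
P = np.diag(1.0 / np.diag(H))                 # crude diagonal preconditioner
v = rng.normal(size=20)
x = neumann_ihvp(lambda u: H @ u, v, precond=lambda u: P @ u)
print(np.linalg.norm(H @ x - v))              # small residual => accurate iHVP
```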

[545] Beyond the Single-Best Model: Rashomon Partial Dependence Profile for Trustworthy Explanations in AutoML

Mustafa Cavus, Jan N. van Rijn, Przemysław Biecek

Main category: cs.LG

TL;DR: The paper proposes a framework to incorporate model multiplicity into explanation generation, using Rashomon set PDPs to highlight uncertainty and variability in feature effects, improving reliability in high-stakes domains.

DetailsMotivation: Current automated ML systems focus on single best-performing models, neglecting explanation uncertainty, which is crucial for human-centered explainable AI.

Method: The framework aggregates partial dependence profiles (PDP) from near-optimal models (Rashomon set) to generate Rashomon PDP, capturing interpretive variability. Two metrics (coverage rate and mean width of confidence intervals) evaluate consistency with standard PDP.

Result: Experiments on 35 regression datasets show Rashomon PDP covers less than 70% of the best model’s PDP, revealing limitations of single-model explanations.

Conclusion: Rashomon PDP enhances reliability and trustworthiness of model interpretations by including otherwise neglected information, especially valuable in high-stakes domains.

Abstract: Automated machine learning systems efficiently streamline model selection but often focus on a single best-performing model, overlooking explanation uncertainty, an essential concern in human-centered explainable AI. To address this, we propose a novel framework that incorporates model multiplicity into explanation generation by aggregating partial dependence profiles (PDP) from a set of near-optimal models, known as the Rashomon set. The resulting Rashomon PDP captures interpretive variability and highlights areas of disagreement, providing users with a richer, uncertainty-aware view of feature effects. To evaluate its usefulness, we introduce two quantitative metrics, the coverage rate and the mean width of confidence intervals, that assess the consistency between the standard PDP and the proposed Rashomon PDP. Experiments on 35 regression datasets from the OpenML CTR23 benchmark suite show that in most cases, the Rashomon PDP covers less than 70% of the best model’s PDP, underscoring the limitations of single-model explanations. Our findings suggest that Rashomon PDP improves the reliability and trustworthiness of model interpretations by adding additional information that would otherwise be neglected. This is particularly useful in high-stakes domains where transparency and confidence are critical.
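
A compact sketch of the aggregation and the coverage metric, using hypothetical linear "models" so it runs standalone; the band here is a pointwise min/max over the Rashomon set rather than a formal confidence interval.

```python
import numpy as np

def pdp(model, X, feature, grid):
    """Partial dependence of `model` on one feature: average prediction
    with that feature clamped to each grid value."""
    out = []
    for g in grid:
        Xg = X.copy()
        Xg[:, feature] = g
        out.append(model(Xg).mean())
    return np.array(out)

def rashomon_band(models, X, feature, grid):
    """Aggregate PDPs over a set of near-optimal models into a pointwise band."""
    curves = np.stack([pdp(m, X, feature, grid) for m in models])
    return curves.min(axis=0), curves.max(axis=0)

def coverage_rate(curve, lo, hi):
    """Fraction of grid points where a PDP falls inside the Rashomon band."""
    return np.mean((curve >= lo) & (curve <= hi))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
weights = rng.normal(size=(6, 3))
best = lambda Z, w=weights[0]: Z @ w          # "best" model
rashomon = [lambda Z, w=w: Z @ w for w in weights[1:]]
grid = np.linspace(-2, 2, 11)
lo, hi = rashomon_band(rashomon, X, feature=0, grid=grid)
print(f"coverage of best model's PDP: {coverage_rate(pdp(best, X, 0, grid), lo, hi):.2f}")
```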

[546] Sampling from Gaussian Processes: A Tutorial and Applications in Global Sensitivity Analysis and Optimization

Bach Do, Nafeezat A. Ajenifuja, Taiwo A. Adebiyi, Ruda Zhang

Main category: cs.LG

TL;DR: The paper addresses the high cost of simulations and experiments in engineering by using Gaussian processes (GPs) for efficient sampling. It introduces two GP sampling methods (random Fourier features and pathwise conditioning) and demonstrates their application in sensitivity analysis and optimization.

DetailsMotivation: High costs of simulations and experiments limit their use in global sensitivity analysis (GSA) and optimization, motivating the adoption of GPs as efficient proxy models.

Method: The paper presents two GP sampling methods: random Fourier features and pathwise conditioning, with alternative approaches briefly discussed.

Result: The methods are successfully applied in GSA, single-objective, and multi-objective optimization, demonstrated through numerical examples.

Conclusion: GP sampling methods offer a practical solution for engineering tasks, enabling informed decision-making under uncertainty with limited data.

Abstract: High-fidelity simulations and physical experiments are essential for engineering analysis and design. However, their high cost often limits their applications in two critical tasks: global sensitivity analysis (GSA) and optimization. This limitation motivates the common use of Gaussian processes (GPs) as proxy regression models to provide uncertainty-aware predictions based on a limited number of high-quality observations. GPs naturally enable efficient sampling strategies that support informed decision-making under uncertainty by extracting information from a subset of possible functions for the model of interest. Despite their popularity in machine learning and statistics communities, sampling from GPs has received little attention in the community of engineering optimization. In this paper, we present the formulation and detailed implementation of two notable sampling methods – random Fourier features and pathwise conditioning – for generating posterior samples from GPs. Alternative approaches are briefly described. Importantly, we detail how the generated samples can be applied in GSA, single-objective optimization, and multi-objective optimization. We show successful applications of these sampling methods through a series of numerical examples.
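
Random Fourier features make prior sampling almost a one-liner for the RBF kernel; a minimal numpy sketch (posterior sampling via pathwise conditioning would add a data-dependent update term on top of this):

```python
import numpy as np

def rff_gp_sample(x, n_features=500, lengthscale=0.5, seed=0):
    """Approximate sample from a GP prior with an RBF kernel via random
    Fourier features: f(x) ~ sqrt(2/m) * sum_i w_i cos(omega_i * x + b_i),
    with omega_i ~ N(0, 1/lengthscale^2), b_i ~ U[0, 2*pi], w_i ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    omega = rng.normal(scale=1.0 / lengthscale, size=n_features)
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    w = rng.normal(size=n_features)
    phi = np.sqrt(2.0 / n_features) * np.cos(np.outer(x, omega) + b)
    return phi @ w

x = np.linspace(0, 5, 200)
f = rff_gp_sample(x)          # one cheap, differentiable prior sample
print(f.shape)
```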

[547] Pruning Increases Orderedness in Recurrent Computation

Yiding Song

Main category: cs.LG

TL;DR: The paper explores whether directionality in neural networks is a helpful inductive bias, showing it can be induced via pruning without performance loss.

DetailsMotivation: Inspired by recurrent circuits in biological brains, the study questions if directionality is a necessary or advantageous bias in artificial networks.

Method: A perceptron layer with all-to-all connections (like a weight-tied RNN) is formalized, and pruning techniques are applied to induce directionality.

Result: Pruning successfully induces topological ordering in information flow without performance compromise, indicating directionality is discoverable, not essential.

Conclusion: Directionality is a beneficial inductive bias that can emerge through gradient descent and sparsification, not a prerequisite for learning.

Abstract: Inspired by the prevalence of recurrent circuits in biological brains, we investigate the degree to which directionality is a helpful inductive bias for artificial neural networks. Taking directionality as topologically-ordered information flow between neurons, we formalise a perceptron layer with all-to-all connections (mathematically equivalent to a weight-tied recurrent neural network) and demonstrate that directionality, a hallmark of modern feed-forward networks, can be induced rather than hard-wired by applying appropriate pruning techniques. Across different random seeds our pruning schemes successfully induce greater topological ordering in information flow between neurons without compromising performance, suggesting that directionality is not a prerequisite for learning, but may be an advantageous inductive bias discoverable by gradient descent and sparsification.
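
To make the setup concrete, the sketch below applies magnitude pruning to an all-to-all weight matrix and measures a crude directionality score (fraction of surviving weight mass below the diagonal). The paper's pruning schemes and topological-ordering measure are more careful; this only illustrates the quantities involved.

```python
import numpy as np

def magnitude_prune(W, sparsity=0.8):
    """Zero out the smallest-magnitude entries of an all-to-all recurrent
    weight matrix (a weight-tied RNN layer)."""
    thresh = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) > thresh, W, 0.0)

def orderedness(W):
    """Crude directionality score: fraction of weight mass in the strictly
    lower triangle, i.e., 'feed-forward' flow under the current neuron
    ordering (the paper uses a more careful topological measure)."""
    mass = np.abs(W).sum() + 1e-12
    return np.abs(np.tril(W, k=-1)).sum() / mass

W = np.random.randn(64, 64)
print(orderedness(W), orderedness(magnitude_prune(W)))
```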

[548] Skill Learning via Policy Diversity Yields Identifiable Representations for Reinforcement Learning

Patrik Reizinger, Bálint Mucsányi, Siyuan Guo, Benjamin Eysenbach, Bernhard Schölkopf, Wieland Brendel

Main category: cs.LG

TL;DR: The paper analyzes MISL in RL, focusing on CSF, proving it recovers ground-truth features up to linear transformation and explains mutual information objectives’ implications.

DetailsMotivation: To understand the role of representation and mutual information in MISL, particularly in CSF, and provide theoretical guarantees.

Method: Theoretical analysis of CSF, proving identifiability of ground-truth features, and empirical validation in MuJoCo and DeepMind Control.

Result: CSF provably recovers ground-truth features up to linear transformation, with empirical validation.

Conclusion: CSF’s identifiability guarantee clarifies mutual information objectives and highlights downsides of entropy regularizers.

Abstract: Self-supervised feature learning and pretraining methods in reinforcement learning (RL) often rely on information-theoretic principles, termed mutual information skill learning (MISL). These methods aim to learn a representation of the environment while also incentivizing exploration thereof. However, the role of the representation and mutual information parametrization in MISL is not yet well understood theoretically. Our work investigates MISL through the lens of identifiable representation learning by focusing on the Contrastive Successor Features (CSF) method. We prove that CSF can provably recover the environment’s ground-truth features up to a linear transformation due to the inner product parametrization of the features and skill diversity in a discriminative sense. This first identifiability guarantee for representation learning in RL also helps explain the implications of different mutual information objectives and the downsides of entropy regularizers. We empirically validate our claims in MuJoCo and DeepMind Control and show how CSF provably recovers the ground-truth features both from states and pixels.

[549] CXR-TFT: Multi-Modal Temporal Fusion Transformer for Predicting Chest X-ray Trajectories

Mehak Arora, Ayman Ali, Kaiyuan Wu, Carolyn Davis, Takashi Shimazui, Mahmoud Alwakeel, Victor Moas, Philip Yang, Annette Esper, Rishikesan Kamaleswaran

Main category: cs.LG

TL;DR: CXR-TFT is a multi-modal framework integrating sparse CXR data and high-frequency clinical measurements to predict abnormal CXR findings in ICU patients up to 12 hours early.

DetailsMotivation: Existing CXR tools lack temporal dynamics, limiting their utility in ICU settings where timely interventions are critical.

Method: CXR-TFT combines CXR imaging, radiology reports, and hourly clinical data, using a vision encoder and transformer model to predict CXR trajectories.

Result: The framework accurately forecasts abnormal CXR findings 12 hours in advance in a study of 20,000 ICU patients.

Conclusion: CXR-TFT enhances ICU patient management by providing early, actionable insights for time-sensitive conditions like acute respiratory distress syndrome.

Abstract: In intensive care units (ICUs), patients with complex clinical conditions require vigilant monitoring and prompt interventions. Chest X-rays (CXRs) are a vital diagnostic tool, providing insights into clinical trajectories, but their irregular acquisition limits their utility. Existing tools for CXR interpretation are constrained by cross-sectional analysis, failing to capture temporal dynamics. To address this, we introduce CXR-TFT, a novel multi-modal framework that integrates temporally sparse CXR imaging and radiology reports with high-frequency clinical data, such as vital signs, laboratory values, and respiratory flow sheets, to predict the trajectory of CXR findings in critically ill patients. CXR-TFT leverages latent embeddings from a vision encoder that are temporally aligned with hourly clinical data through interpolation. A transformer model is then trained to predict CXR embeddings at each hour, conditioned on previous embeddings and clinical measurements. In a retrospective study of 20,000 ICU patients, CXR-TFT demonstrated high accuracy in forecasting abnormal CXR findings up to 12 hours before they became radiographically evident. This predictive capability in clinical data holds significant potential for enhancing the management of time-sensitive conditions like acute respiratory distress syndrome, where early intervention is crucial and diagnoses are often delayed. By providing distinctive temporal resolution in prognostic CXR analysis, CXR-TFT offers actionable ‘whole patient’ insights that can directly improve clinical outcomes.
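
The temporal alignment step is essentially per-dimension interpolation of sparse CXR embeddings onto the hourly clinical grid; a minimal sketch with hypothetical acquisition times and embedding size:

```python
import numpy as np

def align_embeddings_hourly(cxr_times, cxr_embeds, hours):
    """Linearly interpolate sparse CXR embeddings onto an hourly clinical
    grid, dimension by dimension, so the transformer sees temporally
    aligned imaging and clinical streams."""
    cxr_embeds = np.asarray(cxr_embeds)
    return np.stack([np.interp(hours, cxr_times, cxr_embeds[:, d])
                     for d in range(cxr_embeds.shape[1])], axis=1)

hours = np.arange(0, 24)                       # hourly clinical grid
emb = align_embeddings_hourly([2, 11, 20], np.random.randn(3, 128), hours)
print(emb.shape)                               # (24, 128)
```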

[550] Rethinking Memorization Measures and their Implications in Large Language Models

Bishwamittra Ghosh, Soumi Das, Qinyuan Wu, Mohammad Aflah Khan, Krishna P. Gummadi, Evimaria Terzi, Deepak Garg

Main category: cs.LG

TL;DR: The paper investigates whether memorization in LLMs can be avoided during optimal learning and evaluates privacy threats. It introduces contextual memorization, compares it with existing measures, and shows that memorization is unavoidable but varies by measure.

DetailsMotivation: To address concerns about privacy threats from memorization in LLMs and determine if memorization is exaggerated or inherent in optimal learning.

Method: Re-examines existing memorization measures (recollection-based, counterfactual) and introduces contextual memorization. Tests on 18 LLMs across 6 families and multiple formal languages.

Result: Memorization measures disagree on string frequency; optimal learning cannot avoid partial memorization; improved learning reduces contextual/counterfactual memorization but increases recollection-based memorization.

Conclusion: Memorization is unavoidable in optimal learning, and existing reports of memorized strings may not pose privacy threats when contextual or counterfactual memorization is considered.

Abstract: Concerned with privacy threats, memorization in LLMs is often seen as undesirable, specifically for learning. In this paper, we study whether memorization can be avoided when optimally learning a language, and whether the privacy threat posed by memorization is exaggerated or not. To this end, we re-examine existing privacy-focused measures of memorization, namely recollection-based and counterfactual memorization, along with a newly proposed contextual memorization. Relating memorization to local over-fitting during learning, contextual memorization aims to disentangle memorization from the contextual learning ability of LLMs. Informally, a string is contextually memorized if its recollection due to training exceeds the optimal contextual recollection, a learned threshold denoting the best contextual learning without training. Conceptually, contextual recollection avoids the fallacy of recollection-based memorization, where any form of high recollection is a sign of memorization. Theoretically, contextual memorization relates to counterfactual memorization, but imposes stronger conditions. Memorization measures differ in outcomes and information requirements. Experimenting on 18 LLMs from 6 families and multiple formal languages of different entropy, we show that (a) memorization measures disagree on memorization order of varying frequent strings, (b) optimal learning of a language cannot avoid partial memorization of training strings, and (c) improved learning decreases contextual and counterfactual memorization but increases recollection-based memorization. Finally, (d) we revisit existing reports of memorized strings by recollection that neither pose a privacy threat nor are contextually or counterfactually memorized.

[551] Omni-Think: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards

Derek Li, Jiaming Zhou, Amirreza Kazemi, Qianyi Sun, Abbas Ghaddar, Mohammad Ali Alomrani, Liheng Ma, Yu Luo, Dong Li, Feng Wen, Jianye Hao, Mark Coates, Yingxue Zhang

Main category: cs.LG

TL;DR: Omni-Think is a reinforcement learning framework for LLMs that combines rule-based rewards and generative preference signals, improving generalization and performance across diverse tasks.

DetailsMotivation: Addressing the limitations of post-training methods like SFT, which often prioritize memorization over transferable learning in LLMs.

Method: Introduces Omni-Think, a unified RL framework using rule-based rewards and LLM-as-a-Judge evaluations, with curriculum-based task progression.

Result: Curriculum learning improves performance by 5.2% over joint training and 9.1% over model merging across four domains.

Conclusion: Task-aware sampling and hybrid supervision are key to scaling RL-based post-training for general-purpose LLMs.

Abstract: The advancement of general-purpose artificial intelligence relies on large language models (LLMs) that excel across a wide range of tasks, from structured reasoning to creative generation. However, post-training methods like Supervised Fine-Tuning (SFT) often struggle with generalization, favoring memorization over transferable learning. In this work, we introduce Omni-Think, a unified reinforcement learning (RL) framework that enhances LLM performance across diverse tasks by combining rule-based verifiable rewards with generative preference signals via LLM-as-a-Judge evaluations. Our approach enables consistent optimization across task types and scales RL-based training to subjective domains. We further investigate training strategies, demonstrating that a curriculum-based progression that orders tasks from structured to open-ended improves performance and reduces forgetting. Experimental results across four domains reveal that curriculum learning improves performance by 5.2% over joint training and 9.1% over model merging. These results highlight the importance of task-aware sampling and hybrid supervision in scaling RL-based post-training for general-purpose LLMs.

[552] Exploring the In-Context Learning Capabilities of LLMs for Money Laundering Detection in Financial Graphs

Erfan Pirmorad

Main category: cs.LG

TL;DR: LLMs are used for reasoning over financial knowledge graphs to detect money laundering, showing potential for explainable analytics.

DetailsMotivation: The complexity of money laundering requires graph-based reasoning, and LLMs offer a promising approach for this.

Method: A lightweight pipeline retrieves k-hop subgraphs, serializes them into text, and uses few-shot LLM prompting for analysis.

Result: LLMs emulate analyst logic, flag suspicious activity, and provide explanations in synthetic AML scenarios.

Conclusion: LLM-based graph reasoning shows promise for explainable financial crime analytics, though further research is needed.

Abstract: The complexity and interconnectivity of entities involved in money laundering demand investigative reasoning over graph-structured data. This paper explores the use of large language models (LLMs) as reasoning engines over localized subgraphs extracted from a financial knowledge graph. We propose a lightweight pipeline that retrieves k-hop neighborhoods around entities of interest, serializes them into structured text, and prompts an LLM via few-shot in-context learning to assess suspiciousness and generate justifications. Using synthetic anti-money laundering (AML) scenarios that reflect common laundering behaviors, we show that LLMs can emulate analyst-style logic, highlight red flags, and provide coherent explanations. While this study is exploratory, it illustrates the potential of LLM-based graph reasoning in AML and lays groundwork for explainable, language-driven financial crime analytics.
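
A minimal version of the retrieve-and-serialize step using networkx, with a hypothetical two-hop laundering pattern; the serialized text is what would be placed into the few-shot prompt.

```python
import networkx as nx

def serialize_khop(G, entity, k=2):
    """Extract the k-hop neighborhood around an entity and serialize it
    into structured text for few-shot LLM prompting."""
    sub = nx.ego_graph(G, entity, radius=k)
    lines = [f"Entity: {entity}"]
    for u, v, d in sub.edges(data=True):
        lines.append(f"- {u} --[{d.get('type', 'txn')} ${d.get('amount', 0)}]--> {v}")
    return "\n".join(lines)

G = nx.DiGraph()
G.add_edge("acct_A", "acct_B", type="wire", amount=9500)  # just under reporting threshold
G.add_edge("acct_B", "acct_C", type="wire", amount=9400)
print(serialize_khop(G, "acct_A"))
```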

[553] Flow Equivariant Recurrent Neural Networks

T. Anderson Keller

Main category: cs.LG

TL;DR: The paper extends equivariant network theory to time-parameterized transformations (flows) in sequence models, showing improved performance in RNNs.

DetailsMotivation: Current equivariant networks only handle static transformations, limiting their use in sequence models like RNNs. This work aims to address this gap by incorporating time-parameterized symmetries.

Method: The authors analyze standard RNNs for flow equivariance, propose modifications to introduce it, and test these models on tasks like next-step prediction and sequence classification.

Result: Flow-equivariant RNNs outperform non-equivariant ones in training speed, length generalization, and velocity generalization.

Conclusion: This work is a foundational step towards sequence models that respect time-parameterized symmetries in real-world data.

Abstract: Data arrives at our senses as a continuous stream, smoothly transforming from one instant to the next. These smooth transformations can be viewed as continuous symmetries of the environment that we inhabit, defining equivalence relations between stimuli over time. In machine learning, neural network architectures that respect symmetries of their data are called equivariant and have provable benefits in terms of generalization ability and sample efficiency. To date, however, equivariance has been considered only for static transformations and feed-forward networks, limiting its applicability to sequence models, such as recurrent neural networks (RNNs), and corresponding time-parameterized sequence transformations. In this work, we extend equivariant network theory to this regime of ‘flows’ – one-parameter Lie subgroups capturing natural transformations over time, such as visual motion. We begin by showing that standard RNNs are generally not flow equivariant: their hidden states fail to transform in a geometrically structured manner for moving stimuli. We then show how flow equivariance can be introduced, and demonstrate that these models significantly outperform their non-equivariant counterparts in terms of training speed, length generalization, and velocity generalization, on both next step prediction and sequence classification. We present this work as a first step towards building sequence models that respect the time-parameterized symmetries which govern the world around us.

[554] Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans

Main category: cs.LG

TL;DR: Subliminal learning in language models allows unintended traits to transfer via unrelated data, posing risks for AI development.

DetailsMotivation: To investigate how language models can transmit behavioral traits (e.g., preferences or misalignment) through semantically unrelated data, even after filtering.

Method: Experiments with ‘teacher’ models generating datasets (number sequences, code, reasoning traces) and ‘student’ models trained on them, along with theoretical proofs and tests on simple MLP classifiers.

Result: Student models learn traits from unrelated data, but not when teacher and student models differ. Theoretical proof confirms subliminal learning in neural networks under certain conditions.

Conclusion: Subliminal learning is a general phenomenon, highlighting risks in AI development, especially in distillation, where unintended traits may propagate despite filtering.

Abstract: We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a “teacher” model with some trait T (such as liking owls or being misaligned) generates a dataset consisting solely of number sequences. Remarkably, a “student” model trained on this dataset learns T. This occurs even when the data is filtered to remove references to T. We observe the same effect when training on code or reasoning traces generated by the same teacher model. However, we do not observe the effect when the teacher and student have different base models. To help explain our findings, we prove a theoretical result showing that subliminal learning occurs in all neural networks under certain conditions, and demonstrate subliminal learning in a simple MLP classifier. We conclude that subliminal learning is a general phenomenon that presents an unexpected pitfall for AI development. Distillation could propagate unintended traits, even when developers try to prevent this via data filtering.

[555] Benchmarking Foundation Models with Multimodal Public Electronic Health Records

Kunyu Yu, Rui Yang, Jingchi Liao, Siqi Li, Huitao Li, Irene Li, Yifan Peng, Rishikesan Kamaleswaran, Nan Liu

Main category: cs.LG

TL;DR: A benchmark evaluates foundation models for EHRs using MIMIC-IV, showing multimodal models improve performance without bias.

DetailsMotivation: To assess performance, fairness, and interpretability of foundation models in handling diverse EHR data.

Method: Standardized data processing pipeline and comparison of eight foundation models (unimodal/multimodal, domain-specific/general-purpose).

Result: Multimodal models consistently outperform unimodal ones without introducing bias.

Conclusion: Supports development of trustworthy multimodal AI for clinical use; code is publicly available.

Abstract: Foundation models have emerged as a powerful approach for processing electronic health records (EHRs), offering flexibility to handle diverse medical data modalities. In this study, we present a comprehensive benchmark that evaluates the performance, fairness, and interpretability of foundation models, both as unimodal encoders and as multimodal learners, using the publicly available MIMIC-IV database. To support consistent and reproducible evaluation, we developed a standardized data processing pipeline that harmonizes heterogeneous clinical records into an analysis-ready format. We systematically compared eight foundation models, encompassing both unimodal and multimodal models, as well as domain-specific and general-purpose variants. Our findings demonstrate that incorporating multiple data modalities leads to consistent improvements in predictive performance without introducing additional bias. Through this benchmark, we aim to support the development of effective and trustworthy multimodal artificial intelligence (AI) systems for real-world clinical applications. Our code is available at https://github.com/nliulab/MIMIC-Multimodal.

[556] eMargin: Revisiting Contrastive Learning with Margin-Based Separation

Abdul-Kazeem Shamba, Kerstin Bach, Gavin Taylor

Main category: cs.LG

TL;DR: The paper investigates the effect of adding an adaptive margin (eMargin) to contrastive loss for time series representation learning, finding improved clustering but limited downstream classification performance.

DetailsMotivation: To explore if an adaptive margin in contrastive loss can enhance separation of dissimilar time steps and improve downstream task performance.

Method: Introduces eMargin, adjusted by a similarity threshold, and evaluates its impact on clustering and classification in benchmark datasets.

Result: eMargin improves unsupervised clustering metrics but underperforms in downstream classification tasks.

Conclusion: High clustering scores do not guarantee meaningful embeddings for downstream tasks; eMargin excels in clustering but not classification.

Abstract: We revisit previous contrastive learning frameworks to investigate the effect of introducing an adaptive margin into the contrastive loss function for time series representation learning. Specifically, we explore whether an adaptive margin (eMargin), adjusted based on a predefined similarity threshold, can improve the separation between adjacent but dissimilar time steps and subsequently lead to better performance in downstream tasks. Our study evaluates the impact of this modification on clustering performance and classification in three benchmark datasets. Our findings, however, indicate that achieving high scores on unsupervised clustering metrics does not necessarily imply that the learned embeddings are meaningful or effective in downstream tasks. To be specific, eMargin added to InfoNCE consistently outperforms state-of-the-art baselines in unsupervised clustering metrics, but struggles to achieve competitive results in downstream classification with linear probing. The source code is publicly available at https://github.com/sfi-norwai/eMargin.
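
One plausible reading of eMargin in code: an InfoNCE loss where negatives above a similarity threshold receive an extra margin. This is an illustration consistent with the description, not the paper's exact formulation.

```python
import numpy as np

def emargin_infonce(anchor, pos, negs, tau=0.1, threshold=0.5, margin=0.2):
    """InfoNCE with an adaptive margin: negatives whose cosine similarity
    to the anchor exceeds `threshold` are pushed further by `margin`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    s_pos = cos(anchor, pos)
    s_neg = np.array([cos(anchor, n) for n in negs])
    s_neg = np.where(s_neg > threshold, s_neg + margin, s_neg)  # widen hard negatives
    logits = np.concatenate(([s_pos], s_neg)) / tau
    m = logits.max()                           # stable log-sum-exp
    return -s_pos / tau + m + np.log(np.exp(logits - m).sum())

rng = np.random.default_rng(0)
print(emargin_infonce(rng.normal(size=8), rng.normal(size=8),
                      rng.normal(size=(16, 8))))
```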

[557] The Invisible Leash: Why RLVR May Not Escape Its Origin

Fang Wu, Weihao Xuan, Ximing Lu, Zaid Harchaoui, Yejin Choi

Main category: cs.LG

TL;DR: RLVR enhances precision but may limit exploration and original solutions due to constraints from the base model’s support and an entropy-reward tradeoff.

DetailsMotivation: To investigate whether RLVR truly expands reasoning boundaries or just amplifies known high-reward outputs.

Method: Theoretical analysis and empirical experiments to evaluate RLVR’s constraints and tradeoffs.

Result: RLVR improves precision (pass@1) but shrinks empirical support, missing correct answers accessible to the base model. It also reduces answer-level entropy.

Conclusion: RLVR has limits in extending reasoning horizons; future innovations like explicit exploration or hybrid strategies are needed.

Abstract: Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI’s capabilities, particularly in solving complex logical tasks. However, it remains unclear whether RLVR truly expands a model’s reasoning boundary or merely amplifies high-reward outputs that the base model already knows for improved precision. This study presents a theoretical and empirical investigation that provides fresh insights into the potential limits of RLVR. First, we offer a new theoretical perspective that RLVR is constrained by the base model’s support (it cannot sample solutions with zero initial probability) and operates as a conservative reweighting mechanism that may restrict the discovery of entirely original solutions. We also identify an entropy-reward tradeoff: while RLVR reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while RLVR consistently improves pass@1, the shrinkage of empirical support generally outweighs the expansion of empirical support under larger sampling budgets, failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy, resulting in greater uncertainty at each generation step, answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of RLVR in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.

[558] Time-Aware Attention for Enhanced Electronic Health Records Modeling

Junhan Yu, Zhunyi Feng, Junwei Lu, Tianxi Cai, Doudou Zhou

Main category: cs.LG

TL;DR: TALE-EHR is a Transformer-based framework with a time-aware attention mechanism for EHR analysis, outperforming baselines in disease progression forecasting.

DetailsMotivation: EHRs contain valuable clinical data but pose challenges due to data heterogeneity and irregular temporal patterns.

Method: TALE-EHR uses a time-aware attention mechanism and LLM-derived embeddings to model temporal gaps and clinical semantics.

Result: Outperforms state-of-the-art baselines on MIMIC-IV and PIC datasets for tasks like disease progression forecasting.

Conclusion: Integrating explicit temporal modeling with semantic representations advances EHR analysis.

Abstract: Electronic Health Records (EHRs) contain valuable clinical information for predicting patient outcomes and guiding healthcare decisions. However, effectively modeling EHRs requires addressing data heterogeneity and complex temporal patterns. Standard approaches often struggle with irregular time intervals between clinical events. We propose TALE-EHR, a Transformer-based framework featuring a novel time-aware attention mechanism that explicitly models continuous temporal gaps to capture fine-grained sequence dynamics. To complement this temporal modeling with robust semantics, TALE-EHR leverages embeddings derived from standardized code descriptions using a pre-trained Large Language Model (LLM), providing a strong foundation for understanding clinical concepts. Experiments on the MIMIC-IV and PIC datasets demonstrate that our approach outperforms state-of-the-art baselines on tasks such as disease progression forecasting. TALE-EHR underscores that integrating explicit, continuous temporal modeling with strong semantic representations provides a powerful solution for advancing EHR analysis.
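
A toy sketch of time-aware attention: standard scaled dot-product scores with an explicit penalty on the continuous time gap to each clinical event. TALE-EHR learns its temporal term rather than using this fixed linear decay.

```python
import numpy as np

def time_aware_attention(q, K, V, times, t_query, decay=0.1):
    """Scaled dot-product attention whose scores are biased by the
    continuous time gap to each event, a minimal stand-in for a
    time-aware attention mechanism."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d) - decay * np.abs(t_query - times)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
K, V = rng.normal(size=(5, 16)), rng.normal(size=(5, 16))
times = np.array([0.0, 2.0, 10.0, 30.0, 31.0])   # hours since admission
print(time_aware_attention(rng.normal(size=16), K, V, times, t_query=32.0).shape)
```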

[559] Hierarchical Multi-Agent Reinforcement Learning with Control Barrier Functions for Safety-Critical Autonomous Systems

H. M. Sabbir Ahmad, Ehsan Sabouni, Alexander Wasilkoff, Param Budhraja, Zijian Guo, Songyuan Zhang, Chuchu Fan, Christos Cassandras, Wenchao Li

Main category: cs.LG

TL;DR: A safe Hierarchical Multi-Agent Reinforcement Learning (HMARL) approach using Control Barrier Functions (CBFs) is proposed to ensure safety and cooperation in multi-agent systems.

DetailsMotivation: Addressing the need for safety in multi-agent autonomous systems while ensuring cooperation among agents.

Method: Decomposes learning into two levels: high-level joint policy learning and low-level safe individual behavior using CBFs.

Result: Achieves near-perfect safety (within 5%) and improved performance in challenging environments.

Conclusion: The HMARL-CBF approach effectively balances safety and cooperation in multi-agent systems.

Abstract: We address the problem of safe policy learning in multi-agent safety-critical autonomous systems. In such systems, it is necessary for each agent to meet the safety requirements at all times while also cooperating with other agents to accomplish the task. Toward this end, we propose a safe Hierarchical Multi-Agent Reinforcement Learning (HMARL) approach based on Control Barrier Functions (CBFs). Our proposed hierarchical approach decomposes the overall reinforcement learning problem into two levels: learning joint cooperative behavior at the higher level, and learning safe individual behavior at the lower (agent) level, conditioned on the high-level policy. Specifically, we propose a skill-based HMARL-CBF algorithm in which the higher-level problem involves learning a joint policy over the skills for all the agents and the lower-level problem involves learning policies to execute the skills safely with CBFs. We validate our approach on challenging environment scenarios whereby a large number of agents have to safely navigate through conflicting road networks. Compared with existing state-of-the-art methods, our approach significantly improves safety, achieving a near-perfect (within 5%) success/safety rate while also improving performance across all the environments.
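
The lower-level safety mechanism can be illustrated with a one-dimensional CBF filter: given a nominal action and a safety function h, enforce the CBF condition by clipping. The general multi-agent case solves a small QP per step; this closed form covers only scalar dynamics x' = u with nonzero gradient.

```python
import numpy as np

def cbf_filter(x, u_nom, h, grad_h, alpha=1.0, u_max=2.0):
    """Project a nominal action onto the CBF-safe set for 1-D dynamics
    x' = u: enforce grad_h(x)*u >= -alpha*h(x) by clipping."""
    g = grad_h(x)
    lb = -alpha * h(x) / g if g > 0 else -u_max
    ub = u_max if g > 0 else -alpha * h(x) / g
    return float(np.clip(u_nom, lb, ub))

# Safety set h(x) = x >= 0 (e.g., distance to a conflict point).
h = lambda x: x
grad_h = lambda x: 1.0
print(cbf_filter(x=0.5, u_nom=-3.0, h=h, grad_h=grad_h))   # braked from -3.0 to -0.5
```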

[560] The Tsetlin Machine Goes Deep: Logical Learning and Reasoning With Graphs

Ole-Christoffer Granmo, Youmna Abdelwahab, Per-Arne Andersen, Paul F. A. Clarke, Kunal Dumbre, Ylva Grønninsæter, Vojtech Halenka, Runar Helin, Lei Jiao, Ahmed Khalid, Rebekka Omslandseter, Rupsa Saha, Mayur Shende, Xuan Zhang

Main category: cs.LG

TL;DR: The Graph Tsetlin Machine (GraphTM) extends the Tsetlin Machine to graph-structured data, improving interpretability and accuracy across diverse tasks like image classification, action tracking, recommendation systems, and genome analysis.

DetailsMotivation: To enhance the Tsetlin Machine's versatility by handling graph-structured data, enabling interpretable deep learning for sequences, grids, relations, and multimodality.

Method: Uses message passing to build nested deep clauses for sub-graph pattern recognition, reducing the number of clauses needed and improving data utilization.

Result: Achieves higher accuracy than a convolutional TM (+3.86 percentage points on CIFAR-10), outperforms reinforcement learning methods on action coreference by up to 20.6 points, and tolerates noise better than a GCN (89.86% vs. 70.87% at noise ratio 0.1). Also trains 2.5x faster than a GCN on viral genome data.

Conclusion: GraphTM demonstrates the potential of graph representation learning and deep clauses to expand the capabilities of Tsetlin Machines in diverse applications.

Abstract: Pattern recognition with concise and flat AND-rules makes the Tsetlin Machine (TM) both interpretable and efficient, while the power of Tsetlin automata enables accuracy comparable to deep learning on an increasing number of datasets. We introduce the Graph Tsetlin Machine (GraphTM) for learning interpretable deep clauses from graph-structured input. Moving beyond flat, fixed-length input, the GraphTM gets more versatile, supporting sequences, grids, relations, and multimodality. Through message passing, the GraphTM builds nested deep clauses to recognize sub-graph patterns with exponentially fewer clauses, increasing both interpretability and data utilization. For image classification, GraphTM preserves interpretability and achieves 3.86%-points higher accuracy on CIFAR-10 than a convolutional TM. For tracking action coreference, faced with increasingly challenging tasks, GraphTM outperforms other reinforcement learning methods by up to 20.6%-points. In recommendation systems, it tolerates increasing noise to a greater extent than a Graph Convolutional Neural Network (GCN), e.g., for noise ratio 0.1, GraphTM obtains accuracy 89.86% compared to GCN’s 70.87%. Finally, for viral genome sequence data, GraphTM is competitive with BiLSTM-CNN and GCN accuracy-wise, training 2.5x faster than GCN. The GraphTM’s application to these varied fields demonstrates how graph representation learning and deep clauses bring new possibilities for TM learning.

[561] Application-Specific Component-Aware Structured Pruning of Deep Neural Networks via Soft Coefficient Optimization

Ganesh Sundaram, Jonas Ulmen, Amjad Haider, Daniel Görges

Main category: cs.LG

TL;DR: Proposes an enhanced importance metric for structured pruning of DNNs to balance compression and task performance, validated on MNIST autoencoder.

DetailsMotivation: Address the challenge of preserving application-specific performance during DNN pruning, where conventional metrics often fail.

Method: Develops a framework with optimized pruning strategies for groups of elements, ensuring performance constraints are met.

Result: Effectively maintains task-relevant performance post-pruning, as shown in MNIST image reconstruction.

Conclusion: The method successfully balances model compression with application-specific performance, enhancing usability.

Abstract: Deep neural networks (DNNs) offer significant versatility and performance benefits, but their widespread adoption is often hindered by high model complexity and computational demands. Model compression techniques such as pruning have emerged as promising solutions to these challenges. However, it remains critical to ensure that application-specific performance characteristics are preserved during compression. In structured pruning, where groups of structurally coherent elements are removed, conventional importance metrics frequently fail to maintain these essential performance attributes. In this work, we propose an enhanced importance metric framework that not only reduces model size but also explicitly accounts for application-specific performance constraints. We employ multiple strategies to determine the optimal pruning magnitude for each group, ensuring a balance between compression and task performance. Our approach is evaluated on an autoencoder tasked with reconstructing MNIST images. Experimental results demonstrate that the proposed method effectively preserves task-relevant performance, maintaining the model’s usability even after substantial pruning, by satisfying the required application-specific criteria.
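
As a rough illustration of soft-coefficient structured pruning (the paper's exact importance metric is not given in the abstract), one can gate each structural group with a learnable coefficient, penalize the gates toward zero alongside the task loss, and prune the groups whose gates collapse. A minimal PyTorch sketch, with hypothetical names and penalty weight:

```python
import torch
import torch.nn as nn

# Sketch: gate each output channel of a layer with a soft coefficient,
# optimize the coefficients against a task loss plus a sparsity penalty,
# then prune the lowest-scoring groups. Names and the penalty weight
# are illustrative; the paper's actual metric may differ.

class SoftGatedLinear(nn.Module):
    def __init__(self, in_f, out_f):
        super().__init__()
        self.linear = nn.Linear(in_f, out_f)
        self.gate = nn.Parameter(torch.ones(out_f))  # soft coefficient per group

    def forward(self, x):
        return self.linear(x) * self.gate

layer = SoftGatedLinear(784, 128)
x = torch.randn(32, 784)
task_loss = layer(x).pow(2).mean()                  # stand-in for the task loss
loss = task_loss + 1e-3 * layer.gate.abs().sum()    # L1 pushes gates toward 0
loss.backward()

# After training, prune the groups whose gates fell below a threshold.
keep = layer.gate.detach().abs() > 0.1
print(f"kept {int(keep.sum())} of {keep.numel()} channels")
```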

[562] Old Rules in a New Game: Mapping Uncertainty Quantification to Quantum Machine Learning

Maximilian Wendlinger, Kilian Tscharke, Pascal Debus

Main category: cs.LG

TL;DR: The paper addresses the lack of transparency in quantum machine learning by adapting classical uncertainty quantification methods to improve model reliability.

DetailsMotivation: The opacity of quantum machine learning models, similar to classical deep learning, leads to issues like overfitting and overconfidence, necessitating better uncertainty awareness.

Method: The study builds on classical uncertainty quantification and quantum Bayesian modeling to theoretically develop and empirically evaluate techniques for quantum machine learning.

Result: The findings highlight the importance of integrating classical uncertainty insights into quantum machine learning model design.

Conclusion: Classical uncertainty quantification methods can enhance transparency and reliability in quantum machine learning.

Abstract: One of the key obstacles in traditional deep learning is the reduction in model transparency caused by increasingly intricate model functions, which can lead to problems such as overfitting and excessive confidence in predictions. With the advent of quantum machine learning offering possible advances in computational power and latent space complexity, we notice the same opaque behavior. Despite significant research in classical contexts, there has been little advancement in addressing the black-box nature of quantum machine learning. Consequently, we approach this gap by building upon existing work in classical uncertainty quantification and initial explorations in quantum Bayesian modeling to theoretically develop and empirically evaluate techniques to map classical uncertainty quantification methods to the quantum machine learning domain. Our findings emphasize the necessity of leveraging classical insights into uncertainty quantification to include uncertainty awareness in the process of designing new quantum machine learning models.
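
For readers unfamiliar with the classical side of the mapping, a deep ensemble is one of the standard uncertainty-quantification baselines of the kind the paper adapts: average the class probabilities of several independently trained models and report the predictive entropy. A minimal sketch with synthetic probabilities:

```python
import numpy as np

# Classical deep-ensemble uncertainty: average the class probabilities of
# several independently trained models and report predictive entropy.
# `ensemble_probs` is synthetic here, standing in for the outputs of
# five trained (classical or quantum) classifiers on one input.

rng = np.random.default_rng(0)
ensemble_probs = rng.dirichlet([2.0, 2.0, 2.0], size=(5, 1))  # 5 models, 1 input, 3 classes

mean_probs = ensemble_probs.mean(axis=0)                    # (1, 3)
entropy = -(mean_probs * np.log(mean_probs)).sum(axis=-1)   # total predictive uncertainty
print(f"predictive entropy: {entropy[0]:.3f} nats")
```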

[563] FedWCM: Unleashing the Potential of Momentum-based Federated Learning in Long-Tailed Scenarios

Tianle Li, Yongzhi Huang, Linshan Jiang, Qipeng Xie, Chang Liu, Wenfeng Du, Lu Wang, Kaishun Wu

Main category: cs.LG

TL;DR: FedWCM dynamically adjusts momentum in FL to address non-IID data challenges, improving convergence and model fairness in long-tailed scenarios.

DetailsMotivation: FL struggles with non-IID data, especially long-tailed distributions, causing biased models and convergence issues.

Method: Proposes FedWCM, which dynamically adjusts momentum using global and per-round data to correct biases.

Result: FedWCM resolves non-convergence, outperforms existing methods, and enhances FL efficiency in heterogeneous and imbalanced data.

Conclusion: FedWCM effectively addresses FL challenges in non-IID and long-tailed data, improving model performance and convergence.

Abstract: Federated Learning (FL) enables decentralized model training while preserving data privacy. Despite its benefits, FL faces challenges with non-identically distributed (non-IID) data, especially in long-tailed scenarios with imbalanced class samples. Momentum-based FL methods, often used to accelerate FL convergence, struggle with these distributions, resulting in biased models and hindering convergence. To understand this challenge, we conduct extensive investigations into this phenomenon, accompanied by a layer-wise analysis of neural network behavior. Based on these insights, we propose FedWCM, a method that dynamically adjusts momentum using global and per-round data to correct directional biases introduced by long-tailed distributions. Extensive experiments show that FedWCM resolves non-convergence issues and outperforms existing methods, enhancing FL’s efficiency and effectiveness in handling client heterogeneity and data imbalance.
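
The abstract does not give FedWCM's exact weighting rule, but the mechanism it modifies is server-side momentum FedAvg. A minimal sketch in which a hypothetical per-round weight damps the momentum buffer when a round's aggregated update is likely biased:

```python
import numpy as np

# Sketch of server-side momentum FedAvg with a per-round momentum weight.
# FedWCM's actual weighting uses global and per-round data statistics;
# here `round_weight` is a hypothetical stand-in for that adjustment.

def server_round(global_w, client_ws, client_sizes, momentum_buf,
                 round_weight, lr=1.0):
    sizes = np.asarray(client_sizes, dtype=float)
    avg_w = np.average(client_ws, axis=0, weights=sizes)  # FedAvg step
    delta = avg_w - global_w                              # aggregated update
    # A small round_weight damps momentum when the round's update
    # direction is likely biased (e.g., by long-tailed client data).
    momentum_buf = round_weight * momentum_buf + delta
    return global_w + lr * momentum_buf, momentum_buf

w = np.zeros(4)
buf = np.zeros(4)
clients = [w + np.random.randn(4) * 0.1 for _ in range(3)]
w, buf = server_round(w, clients, [100, 40, 10], buf, round_weight=0.7)
```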

[564] Clustered Federated Learning for Generalizable FDIA Detection in Smart Grids with Heterogeneous Data

Yunfeng Li, Junhong Liu, Zhaohui Yang, Guofu Liao, Chuyun Zhang

Main category: cs.LG

TL;DR: Proposes FedClusAvg, a federated learning framework for detecting False Data Injection Attacks (FDIAs) in smart grids, addressing Non-IID data challenges and privacy concerns.

DetailsMotivation: FDIAs threaten smart grids, and traditional centralized detection models struggle with Non-IID data, privacy risks, and high transmission costs.

Method: Introduces FedClusAvg with cluster-based stratified sampling and hierarchical communication (client-subserver-server) for localized training and weighted aggregation.

Result: Improves FDIA detection accuracy in Non-IID settings, reduces communication rounds, and lowers bandwidth usage.

Conclusion: FedClusAvg offers a secure, efficient solution for FDIA detection in distributed power systems.

Abstract: False Data Injection Attacks (FDIAs) pose severe security risks to smart grids by manipulating measurement data collected from spatially distributed devices such as SCADA systems and PMUs. These measurements typically exhibit Non-Independent and Identically Distributed (Non-IID) characteristics across different regions, which significantly challenges the generalization ability of detection models. Traditional centralized training approaches not only face privacy risks and data sharing constraints but also incur high transmission costs, limiting their scalability and deployment feasibility. To address these issues, this paper proposes a privacy-preserving federated learning framework, termed Federated Cluster Average (FedClusAvg), designed to improve FDIA detection in Non-IID and resource-constrained environments. FedClusAvg incorporates cluster-based stratified sampling and hierarchical communication (client-subserver-server) to enhance model generalization and reduce communication overhead. By enabling localized training and weighted parameter aggregation, the algorithm achieves accurate model convergence without centralizing sensitive data. Experimental results on benchmark smart grid datasets demonstrate that FedClusAvg not only improves detection accuracy under heterogeneous data distributions but also significantly reduces communication rounds and bandwidth consumption. This work provides an effective solution for secure and efficient FDIA detection in large-scale distributed power systems.

[565] ROBAD: Robust Adversary-aware Local-Global Attended Bad Actor Detection Sequential Model

Bing He, Mustaque Ahamad, Srijan Kumar

Main category: cs.LG

TL;DR: ROBAD is a transformer-based model designed to detect bad actors on internet platforms robustly by capturing local and global information and using adversarial training.

DetailsMotivation: Existing deep learning models for bad actor detection lack robustness against adversarial attacks, prompting the need for a more resilient solution.

Method: ROBAD uses transformer encoder and decoder blocks to create post and sequence embeddings, then employs contrastive learning with adversarial examples for robust classification.

Result: ROBAD effectively detects bad actors under adversarial attacks, outperforming existing models on Yelp and Wikipedia datasets.

Conclusion: ROBAD addresses robustness in bad actor detection by combining local-global attention and adversarial training, proving effective against attacks.

Abstract: Detecting bad actors is critical to ensure the safety and integrity of internet platforms. Several deep learning-based models have been developed to identify such users. These models should not only accurately detect bad actors, but also be robust against adversarial attacks that aim to evade detection. However, past deep learning-based detection models do not meet the robustness requirement because they are sensitive to even minor changes in the input sequence. To address this issue, we focus on (1) improving the model understanding capability and (2) enhancing the model knowledge such that the model can recognize potential input modifications when making predictions. To achieve these goals, we create a novel transformer-based classification model, called ROBAD (RObust adversary-aware local-global attended Bad Actor Detection model), which uses the sequence of user posts to generate user embedding to detect bad actors. Particularly, ROBAD first leverages the transformer encoder block to encode each post bidirectionally, thus building a post embedding to capture the local information at the post level. Next, it adopts the transformer decoder block to model the sequential pattern in the post embeddings by using the attention mechanism, which generates the sequence embedding to obtain the global information at the sequence level. Finally, to enrich the knowledge of the model, embeddings of modified sequences by mimicked attackers are fed into a contrastive-learning-enhanced classification layer for sequence prediction. In essence, by capturing the local and global information (i.e., the post and sequence information) and leveraging the mimicked behaviors of bad actors in training, ROBAD can be robust to adversarial attacks. Extensive experiments on Yelp and Wikipedia datasets show that ROBAD can effectively detect bad actors when under state-of-the-art adversarial attacks.

[566] Reinforcement Learning for Flow-Matching Policies

Samuel Pfrommer, Yixiao Huang, Somayeh Sojoudi

Main category: cs.LG

TL;DR: Flow-matching policies trained via reinforcement learning surpass suboptimal demonstration performance, with GRPO reducing costs by 50-85% compared to naive imitation learning.

DetailsMotivation: To improve upon suboptimal human or policy-generated demonstrations in flow-matching policies for robotics.

Method: Introduces Reward-Weighted Flow Matching (RWFM) and Group Relative Policy Optimization (GRPO) with a learned reward surrogate, tested on simulated unicycle dynamics tasks.

Result: Both RWFM and GRPO outperform the demonstrator, with GRPO reducing costs by 50-85% compared to naive imitation learning.

Conclusion: Reinforcement learning enhances flow-matching policies, with GRPO being particularly effective for surpassing demonstration performance.

Abstract: Flow-matching policies have emerged as a powerful paradigm for generalist robotics. These models are trained to imitate an action chunk, conditioned on sensor observations and textual instructions. Often, training demonstrations are generated by a suboptimal policy, such as a human operator. This work explores training flow-matching policies via reinforcement learning to surpass the original demonstration policy performance. We particularly note minimum-time control as a key application and present a simple scheme for variable-horizon flow-matching planning. We then introduce two families of approaches: a simple Reward-Weighted Flow Matching (RWFM) scheme and a Group Relative Policy Optimization (GRPO) approach with a learned reward surrogate. Our policies are trained on an illustrative suite of simulated unicycle dynamics tasks, and we show that both approaches dramatically improve upon the suboptimal demonstrator performance, with the GRPO approach in particular generally incurring between $50\%$ and $85\%$ less cost than a naive Imitation Learning Flow Matching (ILFM) approach.
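
A reward-weighted flow-matching objective can be sketched compactly, assuming the standard linear interpolation path $x_t = (1-t)x_0 + t x_1$ with target velocity $x_1 - x_0$; the softmax reward weighting below is an illustrative choice, not necessarily the paper's:

```python
import torch

# Minimal reward-weighted flow-matching loss. The interpolation path and
# velocity target are the standard choices; the exponential reward
# weighting is an assumed scheme for illustration.

def rwfm_loss(model, x1, rewards, beta=1.0):
    x0 = torch.randn_like(x1)                    # noise sample
    t = torch.rand(x1.shape[0], 1)               # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1
    target_v = x1 - x0
    pred_v = model(xt, t)
    per_sample = ((pred_v - target_v) ** 2).mean(dim=-1)
    weights = torch.softmax(beta * rewards, dim=0)  # upweight high-reward chunks
    return (weights * per_sample).sum()

net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 2))
model = lambda xt, t: net(torch.cat([xt, t], dim=-1))
x1 = torch.randn(16, 2)                           # demonstrated action chunks
rewards = -torch.rand(16)                         # e.g., negative time-to-goal
loss = rwfm_loss(model, x1, rewards)
loss.backward()
```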

[567] Isotonic Quantile Regression Averaging for uncertainty quantification of electricity price forecasts

Arkadiusz Lipiecki, Bartosz Uniejewski

Main category: cs.LG

TL;DR: The paper introduces Isotonic Quantile Regression Averaging (iQRA), a method for probabilistic electricity price forecasting, improving accuracy, reliability, and computational efficiency over existing techniques.

DetailsMotivation: Uncertainty quantification in forecasting models is crucial for risk assessment in volatile domains like electricity markets, where current machine learning models often lack reliable uncertainty estimates.

Method: iQRA extends Quantile Regression Averaging (QRA) by adding stochastic order constraints to enhance forecast accuracy, reliability, and computational efficiency.

Result: iQRA outperforms state-of-the-art methods in the German day-ahead electricity market, providing well-calibrated prediction intervals and superior reliability.

Conclusion: iQRA offers a hyperparameter-free, computationally efficient solution for probabilistic forecasting, addressing limitations of existing methods.

Abstract: Quantifying the uncertainty of forecasting models is essential to assess and mitigate the risks associated with data-driven decisions, especially in volatile domains such as electricity markets. Machine learning methods can provide highly accurate electricity price forecasts, critical for informing the decisions of market participants. However, these models often lack uncertainty estimates, which limits the ability of decision makers to avoid unnecessary risks. In this paper, we propose a novel method for generating probabilistic forecasts from ensembles of point forecasts, called Isotonic Quantile Regression Averaging (iQRA). Building on the established framework of Quantile Regression Averaging (QRA), we introduce stochastic order constraints to improve forecast accuracy and reliability while reducing computational cost. In an extensive forecasting study of the German day-ahead electricity market, we show that iQRA consistently outperforms state-of-the-art postprocessing methods in terms of both reliability and sharpness. It produces well-calibrated prediction intervals across multiple confidence levels, providing superior reliability to all benchmark methods, particularly coverage-based conformal prediction. In addition, isotonic regularization decreases the complexity of the quantile regression problem and offers a hyperparameter-free approach to variable selection.
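
A simplified two-stage illustration of the idea (QRA per quantile level, then an isotonic projection across levels to remove quantile crossing) is sketched below; iQRA itself builds the stochastic-order constraint into the estimation rather than post-processing. Data and the quantile grid are synthetic:

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor
from sklearn.isotonic import IsotonicRegression

# QRA regresses the observed price on an ensemble of point forecasts at
# each quantile level; an isotonic projection across levels then removes
# quantile crossing. This two-stage version is a simplified stand-in for
# the paper's constrained estimator.

rng = np.random.default_rng(1)
X = rng.normal(50, 10, size=(200, 4))           # 4 point forecasters
y = X.mean(axis=1) + rng.normal(0, 5, 200)      # observed prices

quantiles = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
preds = np.column_stack([
    QuantileRegressor(quantile=q, alpha=0.0).fit(X, y).predict(X)
    for q in quantiles
])

# Enforce monotonicity across quantile levels for each time point.
iso = IsotonicRegression()
preds_noncrossing = np.vstack([iso.fit_transform(quantiles, row) for row in preds])
```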

[568] Robust Control with Gradient Uncertainty

Qian Qi

Main category: cs.LG

TL;DR: The paper introduces a robust control theory extension addressing gradient uncertainty in value functions, common in reinforcement learning. It formulates a zero-sum game, derives a new nonlinear PDE (GU-HJBI), and validates insights with numerical studies and a novel algorithm (GURAC).

DetailsMotivation: Address uncertainty in value function gradients, prevalent in applications like reinforcement learning, to enhance robustness in control systems.

Method: Formulate a zero-sum dynamic game with adversarial perturbations, derive the GU-HJBI equation, analyze the LQ case, and propose the GURAC algorithm.

Result: Proves the quadratic value function assumption fails under gradient uncertainty, characterizes non-polynomial corrections, and validates with numerical studies.

Conclusion: The work advances robust control theory, offering practical tools like GURAC for applications involving function approximation, such as reinforcement learning.

Abstract: We introduce a novel extension to robust control theory that explicitly addresses uncertainty in the value function’s gradient, a form of uncertainty endemic to applications like reinforcement learning where value functions are approximated. We formulate a zero-sum dynamic game where an adversary perturbs both system dynamics and the value function gradient, leading to a new, highly nonlinear partial differential equation: the Hamilton-Jacobi-Bellman-Isaacs Equation with Gradient Uncertainty (GU-HJBI). We establish its well-posedness by proving a comparison principle for its viscosity solutions under a uniform ellipticity condition. Our analysis of the linear-quadratic (LQ) case yields a key insight: we prove that the classical quadratic value function assumption fails for any non-zero gradient uncertainty, fundamentally altering the problem structure. A formal perturbation analysis characterizes the non-polynomial correction to the value function and the resulting nonlinearity of the optimal control law, which we validate with numerical studies. Finally, we bridge theory to practice by proposing a novel Gradient-Uncertainty-Robust Actor-Critic (GURAC) algorithm, accompanied by an empirical study demonstrating its effectiveness in stabilizing training. This work provides a new direction for robust control, holding significant implications for fields where function approximation is common, including reinforcement learning and computational finance.

[569] AnalogFed: Federated Discovery of Analog Circuit Topologies with Generative AI

Qiufeng Li, Shu Hong, Jian Gao, Xuan Zhang, Tian Lan, Weidong Cao

Main category: cs.LG

TL;DR: AnalogFed enables collaborative AI-driven analog circuit design without sharing raw data, addressing privacy and data fragmentation issues.

DetailsMotivation: The proprietary nature of analog circuit design limits data availability, hindering generative AI research. AnalogFed aims to overcome this by fostering collaboration while preserving privacy.

Method: AnalogFed uses federated learning (FedL) tailored for analog design, including generative model development, data heterogeneity handling, and privacy-preserving strategies.

Result: AnalogFed matches centralized baselines in performance, achieving state-of-the-art efficiency and scalability in topology discovery.

Conclusion: AnalogFed successfully enables privacy-preserving, collaborative innovation in analog circuit design, overcoming data fragmentation challenges.

Abstract: Recent breakthroughs in AI/ML offer exciting opportunities to revolutionize analog design automation through data-driven approaches. In particular, researchers are increasingly fascinated by harnessing the power of generative AI to automate the discovery of novel analog circuit topologies. Unlocking the full potential of generative AI in these data-driven discoveries requires access to large and diverse datasets. Yet there is a significant barrier in the analog domain: analog circuit design is inherently proprietary, involving not only confidential circuit structures but also the underlying commercial semiconductor processes. As a result, current generative AI research is largely confined to individual researchers who construct small, narrowly focused private datasets. This fragmentation severely limits collaborative innovation and impedes progress across the research community. To address these challenges, we propose AnalogFed. AnalogFed enables collaborative topology discovery across decentralized clients (e.g., individual researchers or institutions) without requiring the sharing of raw private data. To make this vision practical, we introduce a suite of techniques tailored to the unique challenges of applying FedL in analog design: from generative model development and data heterogeneity handling to privacy-preserving strategies that ensure both flexibility and security for circuit designers and semiconductor manufacturers. Extensive experiments across varying client counts and dataset sizes demonstrate that AnalogFed achieves performance comparable to centralized baselines while maintaining strict data privacy. Specifically, the generative AI model within AnalogFed achieves state-of-the-art efficiency and scalability in the design of analog circuit topologies.

[570] Distributional Unlearning: Forgetting Distributions, Not Just Samples

Youssef Allouah, Rachid Guerraoui, Sanmi Koyejo

Main category: cs.LG

TL;DR: The paper introduces distributional unlearning, a method to remove entire sub-populations from trained models efficiently, ensuring minimal residual signal and performance impact.

DetailsMotivation: Existing unlearning tools focus on individual samples, leaving residual signals for unwanted domains. The need arises to remove entire topical domains for privacy, legal, or quality reasons.

Method: The paper proposes distributional unlearning, using Kullback-Leibler divergence to quantify removal and preservation, deriving exact Pareto frontiers for Gaussian cases and proving bounded log-loss shifts.

Result: Experiments show 15-72% fewer deletions than random removal, with negligible impact on retained performance.

Conclusion: Distributional unlearning effectively removes unwanted domains while preserving retained data quality, offering a practical solution for privacy and legal compliance.

Abstract: Machine unlearning seeks to remove unwanted information from trained models, initially at the individual-sample level, but increasingly at the level of entire sub-populations. In many deployments, models must delete whole topical domains to satisfy privacy, legal, or quality requirements, e.g., removing several users’ posts under GDPR or copyrighted web content. Existing unlearning tools remain largely sample-oriented, and straightforward point deletion often leaves enough residual signal for downstream learners to recover the unwanted domain. We introduce distributional unlearning, a data-centric, model-agnostic framework that asks: Given examples from an unwanted distribution and a retained distribution, what is the smallest set of points whose removal makes the edited dataset far from the unwanted domain yet close to the retained one? Using Kullback-Leibler divergence to quantify removal and preservation, we derive the exact Pareto frontier in the Gaussian case and prove that any model retrained on the edited data incurs log-loss shifts bounded by the divergence thresholds. We propose a simple distance-based selection rule satisfying these constraints with a quadratic reduction in deletion budget compared to random removal. Experiments on synthetic Gaussians, Jigsaw Toxic Comments, SMS spam, and CIFAR-10 show 15-72% fewer deletions than random, with negligible impact on retained performance.
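
In the same spirit as the paper's distance-based selection rule (but not reproducing its exact KL thresholds), one can fit Gaussians to the unwanted and retained samples and delete the points with the highest log-likelihood ratio first. A hedged sketch:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative deletion rule in the spirit of distributional unlearning:
# fit Gaussians to the unwanted and retained samples, score every point
# by its log-likelihood ratio, and delete the points that look most like
# the unwanted distribution first. The paper's exact selection rule and
# divergence thresholds are not reproduced here.

def deletion_order(X, X_unwanted, X_retained):
    p_u = multivariate_normal(X_unwanted.mean(0), np.cov(X_unwanted.T))
    p_r = multivariate_normal(X_retained.mean(0), np.cov(X_retained.T))
    score = p_u.logpdf(X) - p_r.logpdf(X)   # high = characteristic of unwanted
    return np.argsort(-score)               # delete in this order

rng = np.random.default_rng(2)
X_u = rng.normal([3, 3], 1.0, size=(100, 2))
X_r = rng.normal([0, 0], 1.0, size=(400, 2))
X = np.vstack([X_u, X_r])
order = deletion_order(X, X_u, X_r)
budget = 80                                  # deletion budget
X_edited = np.delete(X, order[:budget], axis=0)
```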

[571] Are We Overlooking the Dimensions? Learning Latent Hierarchical Channel Structure for High-Dimensional Time Series Forecasting

Juntong Ni, Shiyu Wang, Zewen Liu, Xiaoming Shi, Xinyue Zhong, Zhou Ye, Wei Jin

Main category: cs.LG

TL;DR: U-Cast addresses High-Dimensional Time Series Forecasting (HDTSF) by learning hierarchical channel structures with query-based attention and full-rank regularization, outperforming baselines on the Time-HD benchmark.

DetailsMotivation: Traditional TSF models struggle with high-dimensional datasets due to complex channel correlations, which are often ignored or poorly scaled.

Method: U-Cast uses channel-dependent forecasting with query-based attention and full-rank regularization to disentangle correlated channels.

Result: U-Cast outperforms baselines in accuracy and efficiency on the Time-HD benchmark.

Conclusion: U-Cast and Time-HD provide a foundation for future HDTSF research.

Abstract: Time series forecasting (TSF) is a central problem in time series analysis. However, as the number of channels in time series datasets scales to the thousands or more, a scenario we define as High-Dimensional Time Series Forecasting (HDTSF), it introduces significant new modeling challenges that are often not the primary focus of traditional TSF research. HDTSF is challenging because the channel correlation often forms complex and hierarchical patterns. Existing TSF models either ignore these interactions or fail to scale as dimensionality grows. To address this issue, we propose U-Cast, a channel-dependent forecasting architecture that learns latent hierarchical channel structures with an innovative query-based attention. To disentangle highly correlated channel representations, U-Cast adds a full-rank regularization during training. We also release Time-HD, a benchmark of large, diverse, high-dimensional datasets. Our theory shows that exploiting cross-channel information lowers forecasting risk, and experiments on Time-HD demonstrate that U-Cast surpasses strong baselines in both accuracy and efficiency. Together, U-Cast and Time-HD provide a solid basis for future HDTSF research.

[572] Transforming Datasets to Requested Complexity with Projection-based Many-Objective Genetic Algorithm

Joanna Komorniczak

Main category: cs.LG

TL;DR: A genetic algorithm is proposed to generate synthetic datasets with targeted complexity levels for classification and regression tasks, showing a correlation between data complexity and model performance.

DetailsMotivation: To enhance the availability of diverse datasets for evaluating machine learning methods by controlling problem complexity.

Method: A genetic algorithm optimizes problem complexity measures (10 for classification, 4 for regression) via linear feature projections on synthetic datasets.

Result: The algorithm successfully generates datasets with varying difficulty levels, and evaluations show a link between complexity and recognition quality.

Conclusion: The approach effectively produces datasets with controlled complexity, aiding in ML method evaluation.

Abstract: The research community continues to seek increasingly more advanced synthetic data generators to reliably evaluate the strengths and limitations of machine learning methods. This work aims to increase the availability of datasets encompassing a diverse range of problem complexities by proposing a genetic algorithm that optimizes a set of problem complexity measures for classification and regression tasks towards specific targets. For classification, a set of 10 complexity measures was used, while for regression tasks, 4 measures demonstrating promising optimization capabilities were selected. Experiments confirmed that the proposed genetic algorithm can generate datasets with varying levels of difficulty by transforming synthetically created datasets to achieve target complexity values through linear feature projections. Evaluations involving state-of-the-art classifiers and regressors revealed a correlation between the complexity of the generated data and the recognition quality.

[573] Constraint-aware Learning of Probabilistic Sequential Models for Multi-Label Classification

Mykhailo Buleshnyi, Anna Polova, Zsolt Zombori, Michael Benedikt

Main category: cs.LG

TL;DR: The paper explores multi-label classification with logical constraints, using an expressive sequential model to capture label correlations and enforce constraints.

DetailsMotivation: To address multi-label classification with large label sets and logical constraints, leveraging correlations among labels.

Method: An architecture combining individual label classifiers with an expressive sequential model to produce a joint distribution.

Result: The model effectively exploits constraints during training and enforces them at inference.

Conclusion: The proposed architecture successfully handles label correlations and constraints in multi-label classification.

Abstract: We investigate multi-label classification involving large sets of labels, where the output labels may be known to satisfy some logical constraints. We look at an architecture in which classifiers for individual labels are fed into an expressive sequential model, which produces a joint distribution. One potential advantage of such an expressive model is its ability to model correlations, such as those arising from constraints. We empirically demonstrate the ability of the architecture both to exploit constraints in training and to enforce constraints at inference time.

[574] Resonant-Tunnelling Diode Reservoir Computing System for Image Recognition

A. H. Abbas, Hend Abdel-Ghani, Ivan S. Maksymov

Main category: cs.LG

TL;DR: A neuromorphic computing architecture using resonant-tunnelling diodes (RTDs) is proposed for efficient physical reservoir computing, validated on image recognition tasks.

DetailsMotivation: The need for hardware-efficient computational models in AI for edge-based and resource-constrained environments drives this research.

Method: Theoretical formulation and numerical implementation of an RTD-based reservoir computing system, tested on handwritten digit and object recognition benchmarks.

Result: The architecture shows promising performance by replacing random connectivity with deterministic nonlinear transformations.

Conclusion: The RTD-based system offers a viable solution for next-generation reservoir computing, aligning with hardware efficiency goals.

Abstract: As artificial intelligence continues to push into real-time, edge-based and resource-constrained environments, there is an urgent need for novel, hardware-efficient computational models. In this study, we present and validate a neuromorphic computing architecture based on resonant-tunnelling diodes (RTDs), which exhibit the nonlinear characteristics ideal for physical reservoir computing (RC). We theoretically formulate and numerically implement an RTD-based RC system and demonstrate its effectiveness on two image recognition benchmarks: handwritten digit classification and object recognition using the Fruit-360 dataset. Our results show that this circuit-level architecture delivers promising performance while adhering to the principles of next-generation RC: eliminating random connectivity in favour of a deterministic nonlinear transformation of input signals.
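
A generic physical-reservoir sketch conveys the structure: a fixed, deterministic nonlinear transformation of the inputs followed by a trained linear ridge readout. The tanh "virtual node" nonlinearity below is a stand-in for the RTD's nonlinear current-voltage response:

```python
import numpy as np

# Generic reservoir-computing sketch: push inputs through a fixed,
# deterministic nonlinear transformation, then train only a linear
# ridge readout. The nonlinearity below is a stand-in for the diode
# response used in the paper.

def reservoir_features(X, n_virtual=50, scale=0.3):
    # Deterministic "virtual nodes": phase-shifted nonlinear responses.
    phases = np.linspace(0.0, 1.0, n_virtual)
    return np.tanh(scale * (X[:, None] + phases[None, :]))

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, 500)            # e.g., flattened pixel intensities
y = np.sin(3 * X)                      # toy target

Z = reservoir_features(X)
ridge = 1e-3
W = np.linalg.solve(Z.T @ Z + ridge * np.eye(Z.shape[1]), Z.T @ y)  # readout
print(f"train MSE: {np.mean((Z @ W - y) ** 2):.4f}")
```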

[575] Designing User-Centric Metrics for Evaluation of Counterfactual Explanations

Firdaus Ahmed Choudhury, Ethan Leicht, Jude Ethan Bislig, Hangzhi Guo, Amulya Yadav

Main category: cs.LG

TL;DR: The paper critiques current Counterfactual Explanation (CFE) evaluation metrics, showing misalignment with user preferences, and introduces a user-centric AWP model for better CFE selection.

DetailsMotivation: To address the gap between artificial CFE evaluation metrics and real-world user preferences, ensuring actionable and user-aligned explanations.

Method: Conducted two studies: a pilot with 20 crowd-workers and a detailed two-day study with 41 participants in credit scenarios, leading to the AWP model.

Result: User-preferred CFEs matched proximity-based ones only 63.81% of the time; AWP predicted preferences with 84.37% accuracy.

Conclusion: Highlights the need for adaptive, user-centered CFE evaluation metrics, validated by human-centered studies.

Abstract: Machine learning-based decision models are increasingly being used to make decisions that significantly impact people’s lives, but their opaque nature leaves end users without a clear understanding of why a decision was made. Counterfactual Explanations (CFEs) have grown in popularity as a means of offering actionable guidance by identifying the minimum changes in feature values required to flip a model’s prediction to something more desirable. Unfortunately, most prior research in CFEs relies on artificial evaluation metrics, such as proximity, which may overlook end-user preferences and constraints, e.g., the user’s perception of effort needed to make certain feature changes may differ from that of the model designer. To address this research gap, this paper makes three novel contributions. First, we conduct a pilot study with 20 crowd-workers on Amazon MTurk to experimentally validate the alignment of existing CF evaluation metrics with real-world user preferences. Results show that user-preferred CFEs matched those based on proximity in only 63.81% of cases, highlighting the limited applicability of these metrics in real-world settings. Second, inspired by the need to design a user-informed evaluation metric for CFEs, we conduct a more detailed two-day user study with 41 participants facing realistic credit application scenarios to find experimental support for or against three intuitive hypotheses that may explain how end users evaluate CFEs. Third, based on the findings of this second study, we propose the AWP model, a novel user-centric, two-stage model that describes one possible mechanism by which users evaluate and select CFEs. Our results show that AWP predicts user-preferred CFEs with 84.37% accuracy. Our study provides the first human-centered validation for personalized cost models in CFE generation and highlights the need for adaptive, user-centered evaluation metrics.

[576] Better Models and Algorithms for Learning Ising Models from Dynamics

Jason Gaitonde, Ankur Moitra, Elchanan Mossel

Main category: cs.LG

TL;DR: The paper presents algorithms for learning the Ising model’s structure and parameters from observing only configuration changes in a Markov chain, addressing limitations of prior work that required observing all update attempts.

DetailsMotivation: Prior work assumed observing all site update attempts, even unsuccessful ones, which is unrealistic. This work aims to learn the Ising model under a more natural observation model where only configuration changes are observed.

Method: The authors develop algorithms that efficiently learn the Ising model by observing configuration changes. The method involves recovering the dependency graph in polynomial time and then estimating parameters, leveraging properties of reversible Markov chains.

Result: The algorithm recovers the dependency graph in poly(d)⋅n²log n time and parameters in Õ(2^d n) time, matching state-of-the-art performance in weaker observation models.

Conclusion: This work provides the first efficient algorithms for learning the Ising model under a realistic observation model, extending applicability to broader reversible Markov chains.

Abstract: We study the problem of learning the structure and parameters of the Ising model, a fundamental model of high-dimensional data, when observing the evolution of an associated Markov chain. A recent line of work has studied the natural problem of learning when observing an evolution of the well-known Glauber dynamics [Bresler, Gamarnik, Shah, IEEE Trans. Inf. Theory 2018, Gaitonde, Mossel STOC 2024], which provides an arguably more realistic generative model than the classical i.i.d. setting. However, this prior work crucially assumes that all site update attempts are observed, even when this attempt does not change the configuration: this strong observation model is seemingly essential for these approaches. While perhaps possible in restrictive contexts, this precludes applicability to most realistic settings where we can observe only the stochastic evolution itself, a minimal and natural assumption for any process we might hope to learn from. However, designing algorithms that succeed in this more realistic setting has remained an open problem [Bresler, Gamarnik, Shah, IEEE Trans. Inf. Theory 2018, Gaitonde, Moitra, Mossel, STOC 2025]. In this work, we give the first algorithms that efficiently learn the Ising model in this much more natural observation model that only observes when the configuration changes. For Ising models with maximum degree $d$, our algorithm recovers the underlying dependency graph in time $\mathsf{poly}(d)\cdot n^2\log n$ and then the actual parameters in additional $\widetilde{O}(2^d n)$ time, which qualitatively matches the state-of-the-art even in the i.i.d. setting in a much weaker observation model. Our analysis holds more generally for a broader class of reversible, single-site Markov chains that also includes the popular Metropolis chain by leveraging more robust properties of reversible Markov chains.
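
To make the observation model concrete, the following sketch simulates Glauber dynamics on a toy Ising model and records only the steps where the configuration actually changes; failed update attempts, which prior work assumed visible, leave no trace. The learning algorithm itself is not reproduced:

```python
import numpy as np

# Simulate Glauber dynamics on an Ising model and record only the steps
# where the configuration actually changes -- the observation model this
# paper learns from. J is a toy coupling matrix.

rng = np.random.default_rng(4)
n = 10
J = np.triu(rng.choice([0.0, 0.5], size=(n, n), p=[0.7, 0.3]), k=1)
J = J + J.T                                   # symmetric couplings, zero diagonal
sigma = rng.choice([-1, 1], size=n)

observed = []                                 # only changed configurations
for _ in range(10_000):
    i = rng.integers(n)                       # pick a site uniformly
    local_field = J[i] @ sigma
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * local_field))  # Glauber update rule
    new_spin = 1 if rng.random() < p_plus else -1
    if new_spin != sigma[i]:                  # unsuccessful flips go unseen
        sigma = sigma.copy()
        sigma[i] = new_spin
        observed.append(sigma)

print(f"{len(observed)} observed transitions out of 10000 attempts")
```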

[577] Joint-Local Grounded Action Transformation for Sim-to-Real Transfer in Multi-Agent Traffic Control

Justin Turnau, Longchao Da, Khoa Vo, Ferdous Al Rafi, Shreyas Bachiraju, Tiejin Chen, Hua Wei

Main category: cs.LG

TL;DR: JL-GAT applies Grounded Action Transformation (GAT) to multi-agent RL for Traffic Signal Control, addressing the sim-to-real gap by incorporating neighbor information in a scalable, decentralized framework.

DetailsMotivation: The sim-to-real gap in MARL-based TSC policies causes performance drops in real-world deployment. While GAT works for single-agent RL, real-world traffic networks require MARL.

Method: JL-GAT extends GAT to MARL by using a decentralized approach, incorporating neighbor agent information for better grounding and scalability.

Result: Experiments show JL-GAT effectively mitigates the sim-to-real gap in diverse road networks, including adverse weather conditions.

Conclusion: JL-GAT successfully balances scalability and grounding capability, making it suitable for real-world MARL-based TSC.

Abstract: Traffic Signal Control (TSC) is essential for managing urban traffic flow and reducing congestion. Reinforcement Learning (RL) offers an adaptive method for TSC by responding to dynamic traffic patterns, with multi-agent RL (MARL) gaining traction as intersections naturally function as coordinated agents. However, due to shifts in environmental dynamics, implementing MARL-based TSC policies in the real world often leads to a significant performance drop, known as the sim-to-real gap. Grounded Action Transformation (GAT) has successfully mitigated this gap in single-agent RL for TSC, but real-world traffic networks, which involve numerous interacting intersections, are better suited to a MARL framework. In this work, we introduce JL-GAT, an application of GAT to MARL-based TSC that balances scalability with enhanced grounding capability by incorporating information from neighboring agents. JL-GAT adopts a decentralized approach to GAT, allowing for the scalability often required in real-world traffic networks while still capturing key interactions between agents. Comprehensive experiments on various road networks under simulated adverse weather conditions, along with ablation studies, demonstrate the effectiveness of JL-GAT. The code is publicly available at https://github.com/DaRL-LibSignal/JL-GAT/.

[578] Feature Construction Using Network Control Theory and Rank Encoding for Graph Machine Learning

Anwar Said, Yifan Wei, Ubaid Ullah Ahmad, Mudassir Shabbir, Waseem Abbas, Xenofon Koutsoukos

Main category: cs.LG

TL;DR: The paper proposes using average controllability and a rank encoding method to improve GNN performance in social network tasks where node features are scarce.

DetailsMotivation: GNNs struggle in social networks due to lack of expressive node features, often caused by privacy or missing attributes.

Method: Introduces average controllability and centrality metrics (NCT-EFA) as node features, and a rank encoding method to transform these into fixed-dimensional features.

Result: Experiments show significant GNN performance improvement, with rank encoding boosting ROC AUC from 68.7% to 73.9% on the GitHub Stargazers dataset.

Conclusion: The proposed methods enhance GNN performance by providing expressive node features, especially in feature-scarce scenarios.

Abstract: In this article, we utilize the concept of average controllability in graphs, along with a novel rank encoding method, to enhance the performance of Graph Neural Networks (GNNs) in social network classification tasks. GNNs have proven highly effective in various network-based learning applications and require some form of node features to function. However, their performance is heavily influenced by the expressiveness of these features. In social networks, node features are often unavailable due to privacy constraints or the absence of inherent attributes, making it challenging for GNNs to achieve optimal performance. To address this limitation, we propose two strategies for constructing expressive node features. First, we introduce average controllability along with other centrality metrics (denoted as NCT-EFA) as node-level metrics that capture critical aspects of network topology. Building on this, we develop a rank encoding method that transforms average controllability or any other graph-theoretic metric into a fixed-dimensional feature space, thereby improving feature representation. We conduct extensive numerical evaluations using six benchmark GNN models across four social network datasets to compare different node feature construction methods. Our results demonstrate that incorporating average controllability into the feature space significantly improves GNN performance. Moreover, the proposed rank encoding method outperforms traditional one-hot degree encoding, improving the ROC AUC from 68.7% to 73.9% using GraphSAGE on the GitHub Stargazers dataset, underscoring its effectiveness in generating expressive and efficient node representations.
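
A sketch of the two ingredients, under the common network-control-theory recipe (Schur-stabilized adjacency, finite-horizon controllability Gramian per input node) and with a simplified one-hot bucketing standing in for the paper's rank encoding:

```python
import numpy as np

# Sketch: average controllability per node, then rank encoding into a
# fixed-dimensional node feature. The finite horizon and the bucketed
# one-hot encoding are simplifying assumptions for illustration.

def average_controllability(A, horizon=50):
    A = A / (1.0 + np.max(np.abs(np.linalg.eigvals(A))))  # stabilize
    n = A.shape[0]
    ac = np.zeros(n)
    for i in range(n):
        B = np.zeros((n, 1))
        B[i, 0] = 1.0
        W, M = np.zeros((n, n)), B
        for _ in range(horizon):                # W = sum_t A^t B B^T (A^T)^t
            W += M @ M.T
            M = A @ M
        ac[i] = np.trace(W)                     # average controllability of node i
    return ac

def rank_encode(values, dim=8):
    ranks = np.argsort(np.argsort(-values))     # 0 = most controllable node
    buckets = (ranks * dim) // len(values)      # map ranks to dim buckets
    return np.eye(dim)[buckets]                 # (n, dim) node features

A = (np.random.default_rng(5).random((20, 20)) < 0.2).astype(float)
A = np.maximum(A, A.T)                          # undirected toy graph
features = rank_encode(average_controllability(A))
```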

[579] Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation

Xinran Li, Xiujuan Xu, Jiaqi Qiao

Main category: cs.LG

TL;DR: The paper introduces LSDGNN, a multimodal approach for ERC, using long- and short-distance GNNs with a Differential Regularizer and BiAffine Module for feature interaction. It also proposes ICL to handle data imbalance, achieving state-of-the-art results on IEMOCAP and MELD datasets.

DetailsMotivation: ERC is challenging due to the complexity of multimodal interactions and data imbalance. The paper aims to improve feature extraction and learning efficiency.

Method: LSDGNN combines long- and short-distance GNNs on a DAG, uses a Differential Regularizer and BiAffine Module for feature interaction, and employs ICL with a ‘weighted emotional shift’ metric for balanced training.

Result: The model outperforms benchmarks on IEMOCAP and MELD datasets, demonstrating superior performance in ERC.

Conclusion: LSDGNN effectively addresses ERC challenges through multimodal feature extraction and balanced learning, achieving state-of-the-art results.

Abstract: Emotion Recognition in Conversation (ERC) is a practical and challenging task. This paper proposes a novel multimodal approach, the Long-Short Distance Graph Neural Network (LSDGNN). Based on the Directed Acyclic Graph (DAG), it constructs a long-distance graph neural network and a short-distance graph neural network to obtain multimodal features of distant and nearby utterances, respectively. To ensure that long- and short-distance features are as distinct as possible in representation while enabling mutual influence between the two modules, we employ a Differential Regularizer and incorporate a BiAffine Module to facilitate feature interaction. In addition, we propose an Improved Curriculum Learning (ICL) to address the challenge of data imbalance. By computing the similarity between different emotions to emphasize the shifts in similar emotions, we design a “weighted emotional shift” metric and develop a difficulty measurer, enabling a training process that prioritizes learning easy samples before harder ones. Experimental results on the IEMOCAP and MELD datasets demonstrate that our model outperforms existing benchmarks.

[580] Exact Reformulation and Optimization for Direct Metric Optimization in Binary Imbalanced Classification

Le Peng, Yash Travadi, Chuan He, Ying Cui, Ju Sun

Main category: cs.LG

TL;DR: The paper introduces exact constrained reformulations for direct metric optimization (DMO) in imbalanced classification, outperforming existing methods.

DetailsMotivation: Standard accuracy is misleading in imbalanced classification, and existing methods fail when class significance varies or specific metrics must meet certain levels.

Method: Exact constrained reformulations for DMO problems (FPOR, FROP, OFBS) solved via exact penalty methods.

Result: Superior performance on benchmark datasets compared to state-of-the-art methods.

Conclusion: The ERO framework is effective for DMO in binary IC and potentially other problems.

Abstract: For classification with imbalanced class frequencies, i.e., imbalanced classification (IC), standard accuracy is known to be misleading as a performance measure. While most existing methods for IC resort to optimizing balanced accuracy (i.e., the average of class-wise recalls), they fall short in scenarios where the significance of classes varies or certain metrics should reach prescribed levels. In this paper, we study two key classification metrics, precision and recall, under three practical binary IC settings: fix precision optimize recall (FPOR), fix recall optimize precision (FROP), and optimize $F_\beta$-score (OFBS). Unlike existing methods that rely on smooth approximations to deal with the indicator function involved, we introduce, for the first time, exact constrained reformulations for these direct metric optimization (DMO) problems, which can be effectively solved by exact penalty methods. Experiment results on multiple benchmark datasets demonstrate the practical superiority of our approach over the state-of-the-art methods for the three DMO problems. We also expect our exact reformulation and optimization (ERO) framework to be applicable to a wide range of DMO problems for binary IC and beyond. Our code is available at https://github.com/sun-umn/DMO.

[581] Spatio-Temporal Demand Prediction for Food Delivery Using Attention-Driven Graph Neural Networks

Rabia Latief Bhat, Iqra Altaf Gillani

Main category: cs.LG

TL;DR: The paper introduces an attention-based Graph Neural Network for accurate demand forecasting in food delivery, addressing spatial-temporal dependencies to improve operational efficiency.

DetailsMotivation: Accurate demand forecasting is crucial for food delivery platforms due to spatial heterogeneity and temporal fluctuations in order volumes, which impact operational decisions.

Method: The proposed method uses an attention-based Graph Neural Network, modeling delivery zones as nodes and spatial proximity/order flows as edges, dynamically weighing neighboring influences and learning temporal trends.

Result: Experiments on real-world datasets show the model’s high accuracy in forecasting order volumes, outperforming existing methods.

Conclusion: The framework provides a scalable, adaptive solution for proactive fleet positioning, resource allocation, and dispatch optimization in urban food delivery.

Abstract: Accurate demand forecasting is critical for enhancing the efficiency and responsiveness of food delivery platforms, where spatial heterogeneity and temporal fluctuations in order volumes directly influence operational decisions. This paper proposes an attention-based Graph Neural Network framework that captures spatial-temporal dependencies by modeling the food delivery environment as a graph. In this graph, nodes represent urban delivery zones, while edges reflect spatial proximity and inter-regional order flow patterns derived from historical data. The attention mechanism dynamically weighs the influence of neighboring zones, enabling the model to focus on the most contextually relevant areas during prediction. Temporal trends are jointly learned alongside spatial interactions, allowing the model to adapt to evolving demand patterns. Extensive experiments on real-world food delivery datasets demonstrate the superiority of the proposed model in forecasting future order volumes with high accuracy. The framework offers a scalable and adaptive solution to support proactive fleet positioning, resource allocation, and dispatch optimization in urban food delivery operations.

[582] CHORDS: Diffusion Sampling Accelerator with Multi-core Hierarchical ODE Solvers

Jiaqi Han, Haotian Ye, Puheng Li, Minkai Xu, James Zou, Stefano Ermon

Main category: cs.LG

TL;DR: CHORDS is a training-free, model-agnostic acceleration framework for diffusion-based generative models, achieving up to 2.9x speedup with eight cores without quality loss.

DetailsMotivation: Diffusion models are computationally expensive during inference, and existing acceleration methods either require retraining or degrade quality.

Method: CHORDS uses multi-core parallelism, treating diffusion sampling as an ODE solver pipeline where slower solvers rectify faster ones via inter-core communication.

Result: CHORDS achieves up to 2.1x speedup with four cores and 2.9x with eight cores, outperforming baselines by 50% without quality degradation.

Conclusion: CHORDS provides a foundation for real-time, high-fidelity diffusion generation, offering significant speedup without compromising quality.

Abstract: Diffusion-based generative models have become dominant generators of high-fidelity images and videos but remain limited by their computationally expensive inference procedures. Existing acceleration techniques either require extensive model retraining or compromise significantly on sample quality. This paper explores a general, training-free, and model-agnostic acceleration strategy via multi-core parallelism. Our framework views multi-core diffusion sampling as an ODE solver pipeline, where slower yet accurate solvers progressively rectify faster solvers through a theoretically justified inter-core communication mechanism. This motivates our multi-core training-free diffusion sampling accelerator, CHORDS, which is compatible with various diffusion samplers, model architectures, and modalities. Through extensive experiments, CHORDS significantly accelerates sampling across diverse large-scale image and video diffusion models, yielding up to 2.1x speedup with four cores, improving by 50% over baselines, and 2.9x speedup with eight cores, all without quality degradation. This advancement enables CHORDS to establish a solid foundation for real-time, high-fidelity diffusion generation.

[583] Temporal Basis Function Models for Closed-Loop Neural Stimulation

Matthew J. Bryan, Felix Schwock, Azadeh Yazdan-Shahmorad, Rajesh P N Rao

Main category: cs.LG

TL;DR: The paper proposes temporal basis function models (TBFMs) for efficient, low-latency closed-loop neural stimulation, demonstrating their effectiveness in predicting and controlling neural activity in non-human primates.

DetailsMotivation: To address translational challenges like sample efficiency, training time, and loop latency in AI-driven closed-loop neural stimulation for neurological diseases like Parkinson's.

Method: Uses TBFMs for single-trial, spatiotemporal forward prediction of optogenetic stimulation effects on local field potentials (LFPs) and simulations for closed-loop control.

Result: TBFMs are sample efficient, train quickly (2-4min), and have low latency (0.2ms), achieving prediction accuracy comparable to slower models.

Conclusion: TBFMs bridge the gap between AI-based dynamical systems modeling and clinically useful closed-loop stimulation protocols.

Abstract: Closed-loop neural stimulation provides novel therapies for neurological diseases such as Parkinson’s disease (PD), but it is not yet clear whether artificial intelligence (AI) techniques can tailor closed-loop stimulation to individual patients or identify new therapies. Progress requires us to address a number of translational issues, including sample efficiency, training time, and minimizing loop latency such that stimulation may be shaped in response to changing brain activity. We propose temporal basis function models (TBFMs) to address these difficulties, and explore this approach in the context of excitatory optogenetic stimulation. We demonstrate the ability of TBF models to provide a single-trial, spatiotemporal forward prediction of the effect of optogenetic stimulation on local field potentials (LFPs) measured in two non-human primates. We further use simulations to demonstrate the use of TBF models for closed-loop stimulation, driving neural activity towards target patterns. The simplicity of TBF models allows them to be sample efficient, rapid to train (2-4min), and low latency (0.2ms) on desktop CPUs. We demonstrate the model on 40 sessions of previously published excitatory optogenetic stimulation data. For each session, the model required 15-20min of data collection to successfully model the remainder of the session. It achieved a prediction accuracy comparable to a baseline nonlinear dynamical systems model that requires hours to train, and superior accuracy to a linear state-space model. In our simulations, it also successfully allowed a closed-loop stimulator to control a neural circuit. Our approach begins to bridge the translational gap between complex AI-based approaches to modeling dynamical systems and the vision of using such forward prediction models to develop novel, clinically useful closed-loop stimulation protocols.
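
The core of a temporal basis function model fits naturally in a few lines: represent the post-stimulation response as a weighted sum of fixed temporal bases and fit only the weights. Gaussian bumps and ridge regression below are assumed choices; the paper's basis set and estimator may differ:

```python
import numpy as np

# Sketch of a temporal basis function model: the post-stimulation LFP
# response is modeled as a weighted sum of fixed temporal bases, and
# only the weights are fit (here by ridge regression).

T = 100                                         # samples after stimulation
t = np.arange(T)
centers = np.linspace(0, T, 8)
Phi = np.exp(-0.5 * ((t[:, None] - centers[None, :]) / 6.0) ** 2)  # (T, 8) basis

rng = np.random.default_rng(6)
true_w = rng.normal(size=8)
trials = Phi @ true_w + rng.normal(0, 0.1, size=(30, T))  # 30 noisy stim trials

y = trials.mean(axis=0)                          # average evoked response
ridge = 1e-2
w = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(8), Phi.T @ y)
pred = Phi @ w                                   # forward prediction
print(f"fit error: {np.mean((pred - Phi @ true_w) ** 2):.5f}")
```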

[584] Machine Unlearning for Streaming Forgetting

Shaofei Shen, Chenhao Zhang, Yawen Zhao, Alina Bialkowski, Weitong Chen, Miao Xu

Main category: cs.LG

TL;DR: The paper introduces a streaming unlearning paradigm to address inefficiencies in existing batch-based machine unlearning methods, formalizing it as a distribution shift problem and proposing a novel algorithm with theoretical guarantees.

DetailsMotivation: Existing machine unlearning methods handle forgetting data in a single batch, which is inefficient for streaming removal requests. The paper aims to improve performance, efficiency, and data access in such scenarios.

Method: The authors formalize unlearning as a distribution shift problem, estimate the altered distribution, and propose a streaming unlearning algorithm that avoids accessing original training data.

Result: Theoretical analysis shows an $O(\sqrt{T} + V_T)$ error bound on streaming unlearning regret. Experiments validate the method’s effectiveness across models and datasets.

Conclusion: The proposed streaming unlearning algorithm efficiently handles streaming removal requests without requiring original data, supported by theoretical and experimental results.

Abstract: Machine unlearning aims to remove knowledge of the specific training data in a well-trained model. Currently, machine unlearning methods typically handle all forgetting data in a single batch, removing the corresponding knowledge all at once upon request. However, in practical scenarios, requests for data removal often arise in a streaming manner rather than in a single batch, leading to reduced efficiency and effectiveness in existing methods. Such challenges of streaming forgetting have not been the focus of much research. In this paper, to address the challenges of performance maintenance, efficiency, and data access brought about by streaming unlearning requests, we introduce a streaming unlearning paradigm, formalizing the unlearning as a distribution shift problem. We then estimate the altered distribution and propose a novel streaming unlearning algorithm to achieve efficient streaming forgetting without requiring access to the original training data. Theoretical analyses confirm an $O(\sqrt{T} + V_T)$ error bound on the streaming unlearning regret, where $V_T$ represents the cumulative total variation in the optimal solution over $T$ learning rounds. This theoretical guarantee is achieved under mild conditions without the strong restriction of convex loss function. Experiments across various models and datasets validate the performance of our proposed method.

[585] Mixture of Autoencoder Experts Guidance using Unlabeled and Incomplete Data for Exploration in Reinforcement Learning

Elias Malomgré, Pieter Simoens

Main category: cs.LG

TL;DR: A framework for RL agents to learn from imperfect expert demonstrations by transforming state-expert similarity into shaped intrinsic rewards, enabling robust exploration in diverse environments.

DetailsMotivation: The need for RL agents to learn from reward-free signals and adapt in real-world settings, overcoming challenges of intrinsic motivation in dense or complex environments.

Method: Uses a mapping function to convert state-expert similarity into intrinsic rewards and employs a Mixture of Autoencoder Experts to handle diverse and incomplete demonstrations.

Result: Demonstrates robust exploration and performance in sparse and dense reward environments, even with imperfect or sparse demonstrations.

Conclusion: Provides a practical solution for RL in realistic settings where optimal data and precise reward control are lacking.

Abstract: Recent trends in Reinforcement Learning (RL) highlight the need for agents to learn from reward-free interactions and alternative supervision signals, such as unlabeled or incomplete demonstrations, rather than relying solely on explicit reward maximization. Additionally, developing generalist agents that can adapt efficiently in real-world environments often requires leveraging these reward-free signals to guide learning and behavior. However, while intrinsic motivation techniques provide a means for agents to seek out novel or uncertain states in the absence of explicit rewards, they are often challenged by dense reward environments or the complexity of high-dimensional state and action spaces. Furthermore, most existing approaches rely directly on the unprocessed intrinsic reward signals, which can make it difficult to shape or control the agent’s exploration effectively. We propose a framework that can effectively utilize expert demonstrations, even when they are incomplete and imperfect. By applying a mapping function to transform the similarity between an agent’s state and expert data into a shaped intrinsic reward, our method allows for flexible and targeted exploration of expert-like behaviors. We employ a Mixture of Autoencoder Experts to capture a diverse range of behaviors and accommodate missing information in demonstrations. Experiments show our approach enables robust exploration and strong performance in both sparse and dense reward environments, even when demonstrations are sparse or incomplete. This provides a practical framework for RL in realistic settings where optimal data is unavailable and precise reward control is needed.
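
The core reward-shaping step can be sketched compactly. The snippet below is a hedged illustration, not the authors' implementation: it assumes a mixture of small autoencoders trained on (possibly incomplete) expert states, routes each agent state to its best-reconstructing expert, and maps the reconstruction error through a hypothetical exponential shaping function.

```python
# Hedged illustration (not the authors' implementation): a mixture of small
# autoencoders over expert states; each agent state is routed to its
# best-reconstructing expert, and the reconstruction error is mapped through
# an assumed exponential shaping function.
import torch
import torch.nn as nn

class AutoencoderExpert(nn.Module):
    def __init__(self, state_dim: int, latent_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, latent_dim), nn.ReLU(),
                                 nn.Linear(latent_dim, state_dim))

    def forward(self, s):
        return self.net(s)

def intrinsic_reward(state, experts, beta: float = 5.0):
    """Shaped reward in (0, 1]: high when some expert reconstructs the state
    well, i.e., when the state resembles demonstrated behavior."""
    with torch.no_grad():
        errs = torch.stack([((e(state) - state) ** 2).mean(-1) for e in experts])
    return torch.exp(-beta * errs.min(dim=0).values)  # assumed mapping function
```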

[586] Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback

Johannes Ackermann, Takashi Ishida, Masashi Sugiyama

Main category: cs.LG

TL;DR: The paper addresses overoptimization in RLHF for LMs by proposing Off-Policy Corrected Reward Modeling (OCRM), which improves reward model accuracy without new labels.

DetailsMotivation: Overoptimization in RLHF causes reward models to become inaccurate as the LM's responses diverge from training data, leading to mismatched human preferences.

Method: The authors propose OCRM, which iteratively corrects the reward model using importance weighting, avoiding the need for new labels or samples.

Result: Experiments on summarization and chatbot datasets show OCRM outperforms standard RLHF methods.

Conclusion: OCRM effectively mitigates overoptimization, improving the final policy’s alignment with human preferences.

Abstract: Reinforcement Learning from Human Feedback (RLHF) allows us to train models, such as language models (LMs), to follow complex human preferences. In RLHF for LMs, we first train an LM using supervised fine-tuning, sample pairs of responses, obtain human feedback, and use the resulting data to train a reward model (RM). RL methods are then used to train the LM to maximize the reward given by the RM. As training progresses, the responses generated by the LM no longer resemble the responses seen by the RM during training, leading to the RM becoming inaccurate. The score given by the RM keeps increasing, but the learned behavior no longer matches the human preferences. This issue is known as overoptimization. We investigate overoptimization from the point of view of distribution shift and show that the shift results in an inconsistent estimate of the RM parameters, leading to an inconsistent estimate of the policy gradient. We propose Off-Policy Corrected Reward Modeling (OCRM), which iteratively off-policy corrects the RM using importance weighting, without requiring new labels or samples. This results in a more accurate RM, which empirically leads to an improved final policy. We validate our approach in experiments with summarization and chatbot datasets and show that it performs significantly better than standard RLHF methods and baselines. Our implementation is available at https://github.com/JohannesAck/OffPolicyCorrectedRewardModeling
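
The importance-weighting idea can be sketched as follows. This is our paraphrase of the abstract, not the released implementation (see the linked repository for that); `ocrm_style_loss` and the weight clipping are illustrative assumptions.

```python
# Hedged sketch of the importance-weighting idea as we read it from the
# abstract: re-fit the reward model on the original preference pairs, weighting
# each pair by how likely the *current* policy is to produce its responses
# relative to the policy that generated the training data.
import torch
import torch.nn.functional as F

def ocrm_style_loss(rm, pairs, logp_current, logp_behavior, clip=10.0):
    """pairs: (chosen, rejected) response tensors; logp_*: per-pair summed
    log-probs under the current policy / the data-collection policy."""
    losses, weights = [], []
    for (chosen, rejected), lp_cur, lp_beh in zip(pairs, logp_current, logp_behavior):
        w = torch.exp(lp_cur - lp_beh).clamp(max=clip)            # importance ratio
        losses.append(-F.logsigmoid(rm(chosen) - rm(rejected)))   # Bradley-Terry loss
        weights.append(w)
    weights = torch.stack(weights)
    return (weights * torch.stack(losses)).sum() / weights.sum()
```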

[587] Preferential subspace identification (PSID) with forward-backward smoothing

Omid G. Sani, Maryam M. Shanechi

Main category: cs.LG

TL;DR: The paper extends Preferential Subspace Identification (PSID) to include optimal filtering and smoothing for better estimation in offline applications, validated on simulated data.

DetailsMotivation: Existing PSID methods focus on prediction using past data, but offline applications could benefit from incorporating concurrent or all available data for improved estimation.

Method: The authors extend PSID by introducing a reduced-rank regression step for optimal filtering and develop a forward-backward PSID smoothing algorithm.

Result: The approach successfully recovers ground-truth model parameters and achieves optimal filtering and smoothing performance, matching the ideal performance of the true model.

Conclusion: This work provides a principled framework for optimal linear filtering and smoothing in two-signal settings, enhancing analysis of dynamic interactions in multivariate time-series.

Abstract: System identification methods for multivariate time-series, such as neural and behavioral recordings, have been used to build models for predicting one from the other. For example, Preferential Subspace Identification (PSID) builds a state-space model of a primary time-series (e.g., neural activity) to optimally predict a secondary time-series (e.g., behavior). However, PSID focuses on optimal prediction using past primary data, even though in offline applications, better estimation can be achieved by incorporating concurrent data (filtering) or all available data (smoothing). Here, we extend PSID to enable optimal filtering and smoothing. First, we show that the presence of a secondary signal makes it possible to uniquely identify a model with an optimal Kalman update step (to enable filtering) from a family of otherwise equivalent state-space models. Our filtering solution augments PSID with a reduced-rank regression step that directly learns the optimal gain required for the update step from data. We refer to this extension of PSID as PSID with filtering. Second, inspired by two-filter Kalman smoother formulations, we develop a novel forward-backward PSID smoothing algorithm where we first apply PSID with filtering and then apply it again in the reverse time direction on the residuals of the filtered secondary signal. We validate our methods on simulated data, showing that our approach recovers the ground-truth model parameters for filtering, and achieves optimal filtering and smoothing decoding performance of the secondary signal that matches the ideal performance of the true underlying model. This work provides a principled framework for optimal linear filtering and smoothing in the two-signal setting, significantly expanding the toolkit for analyzing dynamic interactions in multivariate time-series.
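
The forward-backward flow described above can be summarized in a few lines. Here `psid_filter` is a hypothetical placeholder for PSID with filtering, and combining the two passes by simply adding the backward residual estimate is our assumption, not a detail given in the abstract.

```python
# Sketch of the forward-backward flow from the abstract; `psid_filter` is a
# hypothetical placeholder, and the additive combination is an assumption.
import numpy as np

def forward_backward_smooth(primary, secondary, psid_filter):
    """primary: (T, n_y) e.g. neural data; secondary: (T, n_z) e.g. behavior."""
    z_fwd = psid_filter(primary, secondary)            # 1) forward filtered estimate
    residuals = secondary - z_fwd                      # 2) filtered-signal residuals
    z_bwd = psid_filter(primary[::-1], residuals[::-1])[::-1]  # 3) reverse-time pass
    return z_fwd + z_bwd                               # 4) smoothed estimate
```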

[588] Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

Kailai Yang, Xiao Liu, Lei Ji, Hao Li, Yeyun Gong, Peng Cheng, Mao Yang

Main category: cs.LG

TL;DR: Proposes Data Mixing Agent, a model-based framework for domain reweighting in continual pre-training, outperforming manual heuristics and generalizing across domains.

DetailsMotivation: Address catastrophic forgetting in continual pre-training by automating domain reweighting, moving beyond manual heuristics.

Method: Uses reinforcement learning to train a Data Mixing Agent on data mixing trajectories with feedback from an evaluation environment.

Result: Outperforms baselines in balanced performance across source and target fields, generalizes to unseen domains, and adapts to new target fields like code generation.

Conclusion: Data Mixing Agent offers an efficient, generalizable solution for domain reweighting, aligning with human intuition and reducing reliance on source-field data.

Abstract: Continual pre-training on small-scale task-specific data is an effective method for improving large language models in new target fields, yet it risks catastrophic forgetting of their original capabilities. A common solution is to re-weight training data mixtures from source and target fields on a domain space to achieve balanced performance. Previous domain reweighting strategies rely on manual designation with certain heuristics based on human intuition or empirical results. In this work, we prove that more general heuristics can be parameterized by proposing Data Mixing Agent, the first model-based, end-to-end framework that learns to re-weight domains. The agent learns generalizable heuristics through reinforcement learning on large quantities of data mixing trajectories with corresponding feedback from an evaluation environment. Experiments in continual pre-training on math reasoning show that Data Mixing Agent outperforms strong baselines in achieving balanced performance across source and target field benchmarks. Furthermore, it generalizes well across unseen source fields, target models, and domain spaces without retraining. Direct application to the code generation field also indicates its adaptability across target domains. Further analysis showcases the agents’ well-aligned heuristics with human intuitions and their efficiency in achieving superior model performance with less source-field data.

[589] Feel-Good Thompson Sampling for Contextual Bandits: a Markov Chain Monte Carlo Showdown

Emile Anand, Sarah Liaw

Main category: cs.LG

TL;DR: FG-TS improves exploration in high-dimensional bandits with an optimism bonus but struggles with approximate posteriors in neural settings.

DetailsMotivation: Address the lack of aggressive exploration in Thompson Sampling for high-dimensional problems.

Method: Introduces Feel-Good Thompson Sampling (FG-TS) with an optimism bonus and tests it across exact and approximate posterior settings.

Result: FG-TS outperforms vanilla TS in linear/logistic bandits but is weaker in neural settings. Trade-offs exist with bonus scaling.

Conclusion: FG-TS is recommended as a baseline for contextual-bandit benchmarks due to its competitiveness and ease of use.

Abstract: Thompson Sampling (TS) is widely used to address the exploration/exploitation tradeoff in contextual bandits, yet recent theory shows that it does not explore aggressively enough in high-dimensional problems. Feel-Good Thompson Sampling (FG-TS) addresses this by adding an optimism bonus that biases toward high-reward models, and it achieves the asymptotically minimax-optimal regret in the linear setting when posteriors are exact. However, its performance with approximate posteriors, common in large-scale or neural problems, has not been benchmarked. We provide the first systematic study of FG-TS and its smoothed variant (SFG-TS) across eleven real-world and synthetic benchmarks. To evaluate their robustness, we compare performance across settings with exact posteriors (linear and logistic bandits) to approximate regimes produced by fast but coarse stochastic-gradient samplers. Ablations over preconditioning, bonus scale, and prior strength reveal a trade-off: larger bonuses help when posterior samples are accurate, but hurt when sampling noise dominates. FG-TS generally outperforms vanilla TS in linear and logistic bandits, but tends to be weaker in neural bandits. Nevertheless, because FG-TS and its variants are competitive and easy to use, we recommend them as baselines in modern contextual-bandit benchmarks. Finally, we provide source code for all our experiments at https://github.com/SarahLiaw/ctx-bandits-mcmc-showdown.
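
The "feel-good" construction can be sketched for a linear bandit as below. This is a hedged paraphrase rather than the paper's code (see the linked repository for that), and `eta`, `lam`, and `cap` are illustrative hyperparameters.

```python
# Hedged paraphrase of the feel-good construction for a linear bandit: the
# bonus term biases the posterior toward models that predict high achievable
# reward in the observed contexts.
import numpy as np

def fg_log_posterior(theta, X, R, contexts, eta=1.0, lam=0.1, cap=1.0):
    """theta: (d,) model; X: (n, d) features of played arms; R: (n,) rewards;
    contexts: list of (n_arms, d) candidate-arm feature matrices per round."""
    log_lik = -eta * np.sum((X @ theta - R) ** 2)  # Gaussian-likelihood fit
    # Feel-good bonus: capped best predicted reward in each observed context.
    bonus = lam * sum(min(cap, float(np.max(f @ theta))) for f in contexts)
    return log_lik + bonus  # sample theta proportional to exp(...) via MCMC
```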

[590] Universal crystal material property prediction via multi-view geometric fusion in graph transformers

Liang Zhang, Kong Chen, Yuen Wu

Main category: cs.LG

TL;DR: MGT, a multi-view graph transformer, improves crystal property prediction by fusing SE3 invariant and SO3 equivariant representations, achieving up to 21% error reduction and 58% gains in transfer learning.

DetailsMotivation: Existing methods struggle to capture the intricate geometric and topological characteristics of crystal structures, limiting machine learning in materials simulations.

Method: MGT combines SE3 invariant and SO3 equivariant graph representations using a lightweight mixture of experts router, adapting weights based on the task.

Result: MGT reduces mean absolute error by up to 21% and outperforms baselines by up to 58% in transfer learning tasks.

Conclusion: MGT is a versatile and effective framework for crystal property prediction, aiding novel material discovery.

Abstract: Accurately and comprehensively representing crystal structures is critical for advancing machine learning in large-scale crystal materials simulations; however, effectively capturing and leveraging the intricate geometric and topological characteristics of crystal structures remains a core, long-standing challenge for most existing methods in crystal property prediction. Here, we propose MGT, a multi-view graph transformer framework that synergistically fuses SE3 invariant and SO3 equivariant graph representations, which respectively capture rotation-translation invariance and rotation equivariance in crystal geometries. To strategically incorporate these complementary geometric representations, we employ a lightweight mixture of experts router in MGT to adaptively adjust the weight assigned to SE3 and SO3 embeddings based on the specific target task. Compared with previous state-of-the-art models, MGT reduces the mean absolute error by up to 21% on crystal property prediction tasks through multi-task self-supervised pretraining. Ablation experiments and interpretable investigations confirm the effectiveness of each technique implemented in our framework. Additionally, in transfer learning scenarios including crystal catalyst adsorption energy and hybrid perovskite bandgap prediction, MGT achieves performance improvements of up to 58% over existing baselines, demonstrating domain-agnostic scalability across diverse application domains. As evidenced by the above series of studies, we believe that MGT can serve as a useful model for crystal material property prediction, providing a valuable tool for the discovery of novel materials.
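
A lightweight router of the kind described can be sketched in a few lines. The concatenation-based gate and shared embedding dimension below are assumptions, not the authors' architecture.

```python
# Minimal sketch (not the authors' architecture) of a lightweight router that
# adaptively blends an SE(3)-invariant and an SO(3)-equivariant embedding.
import torch
import torch.nn as nn

class TwoViewRouter(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 2)  # one logit per view

    def forward(self, h_inv: torch.Tensor, h_equiv: torch.Tensor):
        w = torch.softmax(self.gate(torch.cat([h_inv, h_equiv], -1)), dim=-1)
        return w[..., :1] * h_inv + w[..., 1:] * h_equiv  # fused embedding
```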

[591] Small LLMs Do Not Learn a Generalizable Theory of Mind via Reinforcement Learning

Sneheel Sarangi, Hanan Salam

Main category: cs.LG

TL;DR: Small-scale LLMs trained with RLVR struggle to develop generalizable Theory of Mind (ToM) capabilities, showing narrow overfitting instead.

DetailsMotivation: To explore if RL-based methods can instill nuanced social intelligence (e.g., ToM) in small LLMs.

Method: Training small LLMs on ToM datasets (HiToM, ExploreToM, FANToM) using RL with verifiable rewards (RLVR), then testing generalization on held-out datasets (e.g., OpenToM).

Result: Models improve on in-distribution tasks but fail to generalize to unseen ToM tasks, with prolonged RL training leading to narrow overfitting.

Conclusion: RLVR does not enable small LLMs to acquire a true, abstract ToM capability; learned behavior is dataset-specific.

Abstract: Recent advancements in large language models (LLMs) have demonstrated emergent capabilities in complex reasoning, largely spurred by rule-based Reinforcement Learning (RL) techniques applied during post-training. This has raised the question of whether similar methods can instill more nuanced, human-like social intelligence, such as a Theory of Mind (ToM), in LLMs. This paper investigates whether small-scale LLMs can acquire a robust and generalizable ToM capability through RL with verifiable rewards (RLVR). We conduct a systematic evaluation by training models on various combinations of prominent ToM datasets (HiToM, ExploreToM, FANToM) and testing for generalization on held-out datasets (e.g., OpenToM). Our findings indicate that small LLMs struggle to develop a generic ToM capability. While performance on in-distribution tasks improves, this capability fails to transfer to unseen ToM tasks with different characteristics. Furthermore, we demonstrate that prolonged RL training leads to models "hacking" the statistical patterns of the training datasets, resulting in significant performance gains on in-domain data but no change or even degradation in performance on out-of-distribution tasks. This suggests the learned behavior is a form of narrow overfitting rather than the acquisition of a true, abstract ToM capability.

[592] Beyond Model Base Selection: Weaving Knowledge to Master Fine-grained Neural Network Design

Jialiang Wang, Hanmo Liu, Shimin Di, Zhili Wang, Jiachuan Wang, Lei Chen, Xiaofang Zhou

Main category: cs.LG

TL;DR: M-DESIGN introduces a model knowledge base (MKB) pipeline for adaptive neural network refinement, addressing gaps in static model selection by leveraging relational dependencies and iterative refinement.

DetailsMotivation: Traditional model selection in databases overlooks fine-grained relational dependencies between tasks and model architectures, leading to suboptimal matches. M-DESIGN aims to fill this gap by enabling adaptive refinement.

Method: M-DESIGN uses a knowledge weaving engine and graph-relational schema to iteratively refine models based on task metadata, architecture variations, and performance metrics. It includes a predictive query planner for OOD tasks.

Result: Empirical results show M-DESIGN delivers optimal models in 26 of 33 data-task pairs within limited budgets, demonstrating effectiveness.

Conclusion: M-DESIGN successfully bridges the model refinement gap in databases, offering a dynamic and adaptive approach to neural network model selection and refinement.

Abstract: Database systems have recently advocated for embedding machine learning (ML) capabilities, offering declarative model queries over large, managed model repositories, thereby circumventing the huge computational overhead of traditional ML-based algorithms in automated neural network model selection. Pioneering database studies aim to organize existing benchmark repositories as model bases (MB), querying them for the model records with the highest performance estimation metrics for given tasks. However, this static model selection practice overlooks the fine-grained, evolving relational dependencies between diverse task queries and model architecture variations, resulting in suboptimal matches and failing to further refine the model effectively. To fill the model refinement gap in database research, we propose M-DESIGN, a curated model knowledge base (MKB) pipeline for mastering neural network refinement by adaptively weaving prior insights about model architecture modification. First, we propose a knowledge weaving engine that reframes model refinement as an adaptive query problem over task metadata. Given a user’s task query, M-DESIGN quickly matches and iteratively refines candidate models by leveraging a graph-relational knowledge schema that explicitly encodes data properties, architecture variations, and pairwise performance deltas as joinable relations. This schema supports fine-grained relational analytics over architecture tweaks and drives a predictive query planner that can detect and adapt to out-of-distribution (OOD) tasks. We instantiate M-DESIGN for graph analytics tasks, where our model knowledge base enriches existing benchmarks with structured metadata covering 3 graph tasks and 22 graph datasets, contributing data records of 67,760 graph models. Empirical results demonstrate that M-DESIGN delivers the optimal model in 26 of 33 data-task pairs within limited budgets.
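
The "joinable relations" framing can be illustrated schematically. The table names, columns, and numbers below are hypothetical, chosen only to show how a refinement step becomes a relational query.

```python
# Schematic illustration (tables, columns, and values are hypothetical) of the
# graph-relational idea: architecture tweaks and their performance deltas live
# in joinable relations, so model refinement becomes a query.
import pandas as pd

tasks = pd.DataFrame({"task_id": [1], "n_nodes": [3000], "homophily": [0.7]})
deltas = pd.DataFrame({
    "task_id": [1, 1, 1],
    "tweak": ["add_layer", "widen_hidden", "swap_gcn_for_gat"],
    "perf_delta": [0.012, 0.004, -0.021],
})
# Join task metadata with tweak outcomes and pick the most promising edit.
best = tasks.merge(deltas, on="task_id").nlargest(1, "perf_delta")
print(best[["tweak", "perf_delta"]])
```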

[593] GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding

Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang

Main category: cs.LG

TL;DR: GUI-G² introduces Gaussian rewards for GUI grounding, outperforming UI-TARS-72B by 24.7% on ScreenSpot-Pro.

DetailsMotivation: Current reinforcement learning uses sparse binary rewards, ignoring the continuous nature of spatial interactions. Human clicking behavior inspires Gaussian modeling.

Method: GUI-G² uses Gaussian point rewards for precise localization and coverage rewards for spatial alignment, with adaptive variance for element scales.

Result: Outperforms UI-TARS-72B by 24.7% on ScreenSpot-Pro, showing robustness and generalization to unseen layouts.

Conclusion: GUI-G² transforms GUI grounding into dense continuous optimization, setting a new paradigm for spatial reasoning in GUI tasks.

Abstract: Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$ substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.
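
The two reward components can be sketched as follows. The adaptive variance rule (element size divided by four) and the Monte-Carlo overlap estimate are our assumptions, not the paper's exact formulas.

```python
# Sketch (assumptions, not the paper's formulas) of the two reward terms: a
# Gaussian point reward decaying with distance from the element centroid, and
# a coverage reward measuring overlap between the predicted Gaussian and the
# target box. The variance rule sigma = size / 4 is assumed.
import math
import random

def _sigmas(box):
    return (max(box[2] - box[0], 1.0) / 4, max(box[3] - box[1], 1.0) / 4)

def gaussian_point_reward(pred_xy, box):
    """box = (x1, y1, x2, y2); reward in (0, 1], peaked at the centroid."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    sx, sy = _sigmas(box)
    dx, dy = pred_xy[0] - cx, pred_xy[1] - cy
    return math.exp(-0.5 * ((dx / sx) ** 2 + (dy / sy) ** 2))

def coverage_reward(pred_xy, box, n=256):
    """Fraction of samples from the predicted Gaussian landing in the box."""
    sx, sy = _sigmas(box)
    hits = sum(box[0] <= random.gauss(pred_xy[0], sx) <= box[2]
               and box[1] <= random.gauss(pred_xy[1], sy) <= box[3]
               for _ in range(n))
    return hits / n
```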

[594] Scaling Decentralized Learning with FLock

Zehua Cheng, Rui Sun, Jiahao Sun, Yike Guo

Main category: cs.LG

TL;DR: FLock is a decentralized framework for secure LLM fine-tuning, replacing central servers with blockchain and economic incentives, reducing adversarial attack success rates by >68%.

DetailsMotivation: Centralized control in LLM fine-tuning poses security risks, and decentralized schemes face computational challenges. FLock addresses these issues.

Method: FLock integrates blockchain for trust and replaces central aggregators with a secure protocol for untrusted parties.

Result: Empirical validation shows FLock reduces adversarial attack success rates by >68% and improves cross-domain generalization.

Conclusion: FLock enables secure, efficient LLM fine-tuning in decentralized settings, outperforming isolated training.

Abstract: Fine-tuning large language models (LLMs) is hindered by the deficiencies of centralized control and by the massive computing and communication overhead of decentralized schemes. While standard federated learning (FL) supports data privacy, its central-server requirement creates a single point of attack and a vulnerability to poisoning attacks. Scaling in this direction to 70B-parameter models in heterogeneous, trustless environments has remained a major unbroken bottleneck. This paper introduces FLock, a decentralized framework for secure and efficient collaborative LLM fine-tuning. Integrating a blockchain-based trust layer with economic incentives, FLock replaces the central aggregator with a secure, auditable protocol for cooperation among untrusted parties. We present the first empirical validation of fine-tuning a 70B LLM in a secure, multi-domain, decentralized setting. Our experiments show the FLock framework defends against backdoor poisoning attacks that compromise standard FL optimizers and fosters synergistic knowledge transfer. The resulting models show a >68% reduction in adversarial attack success rates. The global model also demonstrates superior cross-domain generalization, outperforming models trained in isolation on their own specialized data.

[595] To Label or Not to Label: PALM – A Predictive Model for Evaluating Sample Efficiency in Active Learning Models

Julia Machnio, Mads Nielsen, Mostafa Mehdipour Ghazi

Main category: cs.LG

TL;DR: PALM is a unified model for analyzing active learning (AL) performance, predicting trajectories through four key parameters. It generalizes across datasets and strategies, enabling cost-effective AL evaluation.

DetailsMotivation: Traditional AL evaluation methods focus only on final accuracy, missing the dynamics of the learning process. PALM addresses this gap by providing a comprehensive and interpretable analysis.

Method: PALM characterizes AL trajectories using four parameters: achievable accuracy, coverage efficiency, early-stage performance, and scalability. It predicts future performance from partial observations.

Result: PALM generalizes across datasets (CIFAR-10/100, ImageNet-50/100/200), accurately predicting full learning curves and revealing insights into efficiency and scalability.

Conclusion: PALM enables systematic, reproducible, and data-efficient AL evaluation, aiding strategy selection and performance prediction under budget constraints.

Abstract: Active learning (AL) seeks to reduce annotation costs by selecting the most informative samples for labeling, making it particularly valuable in resource-constrained settings. However, traditional evaluation methods, which focus solely on final accuracy, fail to capture the full dynamics of the learning process. To address this gap, we propose PALM (Performance Analysis of Active Learning Models), a unified and interpretable mathematical model that characterizes AL trajectories through four key parameters: achievable accuracy, coverage efficiency, early-stage performance, and scalability. PALM provides a predictive description of AL behavior from partial observations, enabling the estimation of future performance and facilitating principled comparisons across different strategies. We validate PALM through extensive experiments on CIFAR-10/100 and ImageNet-50/100/200, covering a wide range of AL methods and self-supervised embeddings. Our results demonstrate that PALM generalizes effectively across datasets, budgets, and strategies, accurately predicting full learning curves from limited labeled data. Importantly, PALM reveals crucial insights into learning efficiency, data space coverage, and the scalability of AL methods. By enabling the selection of cost-effective strategies and predicting performance under tight budget constraints, PALM lays the groundwork for more systematic, reproducible, and data-efficient evaluation of AL in both research and real-world applications. The code is available at: https://github.com/juliamachnio/PALM.
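
A curve-fitting workflow of this kind might look like the sketch below. The specific four-parameter functional form and the toy numbers are illustrative assumptions, not PALM's actual model.

```python
# Sketch with an assumed four-parameter saturating form (not PALM's actual
# model) and toy numbers: fit the curve to early observations, then
# extrapolate the rest of the trajectory.
import numpy as np
from scipy.optimize import curve_fit

def al_curve(n, a_max, k, a0, p):
    """a_max: achievable accuracy; k: coverage efficiency;
    a0: early-stage performance; p: scalability exponent."""
    return a_max - (a_max - a0) * np.exp(-k * n ** p)

budgets = np.array([100, 200, 400, 800, 1600])   # labeled-set sizes (toy)
accs = np.array([0.52, 0.61, 0.70, 0.76, 0.80])  # observed accuracies (toy)
params, _ = curve_fit(al_curve, budgets, accs,
                      p0=[0.9, 0.01, 0.5, 0.5], maxfev=20000)
print("predicted accuracy at 10k labels:", al_curve(10_000, *params))
```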

[596] Learning to Gridize: Segment Physical World by Wireless Communication Channel

Juntao Wang, Feng Yin, Tian Ding, Tsung-Hui Chang, Zhi-Quan Luo, Qi Yan

Main category: cs.LG

TL;DR: The paper introduces Channel Space Gridization (CSG), a novel framework for network optimization by unifying channel estimation and gridization, outperforming existing methods in accuracy and efficiency.

DetailsMotivation: Existing gridization methods (GSG, BSG) rely on unavailable location data or flawed assumptions about channel properties, limiting their effectiveness.

Method: Proposes CSG, a joint optimization framework using beam-level RSRP to estimate CAPS and partition grids. Introduces CSG-AE with a trainable encoder, quantizer, and physics-informed decoder, along with the PIDA training scheme for stability.

Result: CSG-AE improves CAPS estimation accuracy and clustering quality, reducing MAE by 30-65% on real-world data compared to baselines.

Conclusion: CSG advances gridization for large-scale network optimization by enhancing accuracy, consistency, and efficiency.

Abstract: Gridization, the process of partitioning space into grids where users share similar channel characteristics, serves as a fundamental prerequisite for efficient large-scale network optimization. However, existing methods like Geographical or Beam Space Gridization (GSG or BSG) are limited by reliance on unavailable location data or the flawed assumption that similar signal strengths imply similar channel properties. We propose Channel Space Gridization (CSG), a pioneering framework that unifies channel estimation and gridization for the first time. Formulated as a joint optimization problem, CSG uses only beam-level reference signal received power (RSRP) to estimate Channel Angle Power Spectra (CAPS) and partition samples into grids with homogeneous channel characteristics. To perform CSG, we develop the CSG Autoencoder (CSG-AE), featuring a trainable RSRP-to-CAPS encoder, a learnable sparse codebook quantizer, and a physics-informed decoder based on the Localized Statistical Channel Model. Recognizing the limitations of the naive training scheme, we propose a novel Pretraining-Initialization-Detached-Asynchronous (PIDA) training scheme for CSG-AE, ensuring stable and effective training by systematically addressing the common pitfalls of the naive training paradigm. Evaluations reveal that CSG-AE excels in CAPS estimation accuracy and clustering quality on synthetic data. On real-world datasets, it reduces Active Mean Absolute Error (MAE) by 30% and Overall MAE by 65% in RSRP prediction compared to salient baselines using the same data, while improving channel consistency, cluster-size balance, and active ratio, advancing the development of gridization for large-scale network optimization.
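
The encoder-quantizer-decoder pipeline can be sketched structurally as below. Dimensions are placeholders, the linear decoder merely stands in for the physics-informed decoder, and the quantizer's differentiability tricks plus the PIDA training scheme are omitted.

```python
# Structural skeleton only (not the paper's model): beam-level RSRP -> CAPS
# estimate -> nearest-codeword grid assignment -> RSRP reconstruction.
import torch
import torch.nn as nn

class CSGAutoencoderSketch(nn.Module):
    def __init__(self, n_beams=32, n_angles=64, n_grids=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_beams, 128), nn.ReLU(),
                                     nn.Linear(128, n_angles), nn.Softplus())
        self.codebook = nn.Parameter(torch.rand(n_grids, n_angles))  # CAPS codewords
        self.decoder = nn.Linear(n_angles, n_beams)  # stand-in physics decoder

    def forward(self, rsrp):                                 # rsrp: (B, n_beams)
        caps = self.encoder(rsrp)                            # estimated CAPS
        grid = torch.cdist(caps, self.codebook).argmin(-1)   # grid assignment
        return self.decoder(self.codebook[grid]), caps, grid
```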

[597] MAP Estimation with Denoisers: Convergence Rates and Guarantees

Scott Pesme, Giacomo Meanti, Michael Arbel, Julien Mairal

Main category: cs.LG

TL;DR: The paper provides theoretical justification for using pretrained denoisers as surrogates for proximal operators in MAP optimization, proving convergence under log-concave priors.

DetailsMotivation: Despite the empirical success of using denoisers as surrogates for proximal operators in MAP optimization, there was no general theoretical justification for this practice.

Method: The authors propose a simple algorithm, related to practical methods, and analyze its convergence to the proximal operator under log-concave priors, interpreting it as gradient descent on smoothed proximal objectives.

Result: The algorithm provably converges to the proximal operator under log-concavity assumptions, providing a theoretical foundation for heuristic methods.

Conclusion: This work bridges the gap between theory and practice by justifying the use of denoisers in MAP optimization, offering a solid theoretical basis for previously heuristic approaches.

Abstract: Denoiser models have become powerful tools for inverse problems, enabling the use of pretrained networks to approximate the score of a smoothed prior distribution. These models are often used in heuristic iterative schemes aimed at solving Maximum a Posteriori (MAP) optimisation problems, where the proximal operator of the negative log-prior plays a central role. In practice, this operator is intractable, and practitioners plug in a pretrained denoiser as a surrogate, despite the lack of general theoretical justification for this substitution. In this work, we show that a simple algorithm, closely related to several used in practice, provably converges to the proximal operator under a log-concavity assumption on the prior $p$. We show that this algorithm can be interpreted as a gradient descent on smoothed proximal objectives. Our analysis thus provides a theoretical foundation for a class of empirically successful but previously heuristic methods.
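
The substitution being justified rests on two standard identities, stated here in our notation: the MAP objective with its proximal operator, and Tweedie's identity linking an MMSE denoiser to the score of the smoothed prior.

```latex
\hat{x} \;=\; \arg\min_{x}\; \tfrac{1}{2}\lVert y - Ax \rVert^{2} \;-\; \lambda \log p(x),
\qquad
\operatorname{prox}_{-\gamma \log p}(v) \;=\; \arg\min_{x}\; \tfrac{1}{2}\lVert x - v \rVert^{2} \;-\; \gamma \log p(x),

D_{\sigma}(v) \;=\; v + \sigma^{2}\,\nabla \log p_{\sigma}(v),
\qquad
p_{\sigma} \;=\; p \ast \mathcal{N}(0, \sigma^{2} I).
```

Tweedie's identity (the last line) is what lets a pretrained MMSE denoiser stand in for the score term inside gradient steps on the smoothed proximal objective.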

[598] The calculus of variations of the Transformer on the hyperspherical tangent bundle

Andrew Gracyk

Main category: cs.LG

TL;DR: The paper provides a theoretical framework for Transformers using Lagrangian optimization and calculus of variations, showing they solve variational problems naturally.

DetailsMotivation: To mathematically ground Transformers by linking them to Lagrangian optimization and variational calculus, filling a gap in existing literature.

Method: Uses calculus of variations to model Transformers as flow maps on a high-dimensional unit sphere, deriving the Euler-Lagrange equation for them.

Result: Proves Transformers solve variational problems naturally and introduces new scenarios for their application in loss optimization.

Conclusion: Lays foundational groundwork for applying calculus of variations to Transformers, offering new theoretical insights and tools.

Abstract: We offer a theoretical mathematical background to Transformers through Lagrangian optimization across the token space. The Transformer, as a flow map, exists in the tangent fiber for each token along the high-dimensional unit sphere. The circumstance of the hypersphere across the latent data is reasonable due to the trained diagonal matrix equal to the identity, which has various empirical justifications. Thus, under the continuum limit of the dynamics, the latent vectors flow among the tangent bundle. Using these facts, we devise a mathematical framework for the Transformer through calculus of variations. We develop a functional and show that the continuous flow map induced by the Transformer satisfies this functional, therefore the Transformer can be viewed as a natural solver of a calculus of variations problem. We invent new scenarios of when our methods are applicable based on loss optimization with respect to path optimality. We derive the Euler-Lagrange equation for the Transformer. The variant of the Euler-Lagrange equation we present has various appearances in literature, but, to our understanding, oftentimes not foundationally proven or under other specialized cases. Our overarching proof is new: our techniques are classical and the use of the flow map object is original. We provide several other relevant results, primarily ones specific to neural scenarios. In particular, much of our analysis will be attempting to quantify Transformer data in variational contexts under neural approximations. Calculus of variations on manifolds is a well-nourished research area, but for the Transformer specifically, it is uncharted: we lay the foundation for this area through an introduction to the Lagrangian for the Transformer.
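
For reference, the classical Euler-Lagrange equation that the paper specializes is, in our notation (the paper's Transformer-specific variant differs; the sphere constraint enters through a multiplier):

```latex
\frac{d}{dt}\!\left(\frac{\partial L}{\partial \dot{x}}\right) - \frac{\partial L}{\partial x} \;=\; 0,
\qquad\text{and, constrained to } \lVert x \rVert = 1:\qquad
\frac{d}{dt}\!\left(\frac{\partial L}{\partial \dot{x}}\right) - \frac{\partial L}{\partial x} \;=\; \lambda(t)\,x .
```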

[599] An Adaptive Random Fourier Features approach Applied to Learning Stochastic Differential Equations

Owen Douglas, Aku Kammonen, Anamika Pandey, Raúl Tempone

Main category: cs.LG

TL;DR: The paper introduces an adaptive random Fourier features (ARFF) training algorithm with Metropolis sampling for learning stochastic differential equations from snapshot data, outperforming conventional methods.

DetailsMotivation: To improve the learning of drift and diffusion components in stochastic differential equations from snapshot data using a more efficient and effective method.

Method: Uses ARFF with Metropolis sampling and resampling, along with a likelihood-based loss function derived from Euler-Maruyama integration.

Result: The ARFF-based approach matches or exceeds conventional Adam-based optimization in loss minimization and convergence speed across benchmark problems.

Conclusion: ARFF is a promising alternative for data-driven modeling of stochastic dynamics.

Abstract: This work proposes a training algorithm based on adaptive random Fourier features (ARFF) with Metropolis sampling and resampling (Kammonen et al., 2024) for learning drift and diffusion components of stochastic differential equations from snapshot data. Specifically, this study considers Itô diffusion processes and a likelihood-based loss function derived from the Euler-Maruyama integration introduced by Dietrich et al. (2023) and Dridi et al. (2021). This work evaluates the proposed method against benchmark problems presented by Dietrich et al. (2023), including polynomial examples, underdamped Langevin dynamics, a stochastic susceptible-infected-recovered model, and a stochastic wave equation. Across all cases, the ARFF-based approach matches or surpasses the performance of conventional Adam-based optimization in both loss minimization and convergence speed. These results highlight the potential of ARFF as a compelling alternative for data-driven modeling of stochastic dynamics.
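
The Euler-Maruyama likelihood loss itself is standard and easy to state in code. The random-Fourier-feature parameterization below is schematic, and ARFF's Metropolis resampling of the feature frequencies is omitted.

```python
# Standard Euler-Maruyama negative log-likelihood used as the training loss;
# the RFF parameterization of drift b and diffusion sigma is schematic, and
# the adaptive Metropolis resampling of frequencies is omitted.
import numpy as np

def rff(x, omegas):
    """Random Fourier features for 1-D samples x: (N,); omegas: (K,)."""
    return np.concatenate([np.cos(np.outer(x, omegas)),
                           np.sin(np.outer(x, omegas))], axis=1)

def em_nll(x0, x1, dt, omegas, beta_drift, beta_logsig):
    """Mean NLL of x1 | x0 under x1 ~ N(x0 + b(x0) dt, sigma(x0)^2 dt)."""
    phi = rff(x0, omegas)                       # (N, 2K) feature matrix
    b = phi @ beta_drift                        # drift at x0
    sig2 = np.exp(phi @ beta_logsig) ** 2       # diffusion^2 at x0 (positive)
    resid = x1 - x0 - b * dt
    return 0.5 * np.mean(resid ** 2 / (sig2 * dt) + np.log(2 * np.pi * sig2 * dt))
```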

[600] FedMultiEmo: Real-Time Emotion Recognition via Multimodal Federated Learning

Baran Can Gül, Suraksha Nadig, Stefanos Tziampazis, Nasser Jazdi, Michael Weyrich

Main category: cs.LG

TL;DR: FedMultiEmo is a privacy-preserving framework for in-vehicle emotion recognition, combining visual and physiological data via federated learning, achieving 87% accuracy while keeping data local.

DetailsMotivation: Address challenges like modality fragility, physiological variability, and privacy risks in emotion recognition for driver-assistance systems.

Method: Uses a multimodal federated learning pipeline with CNN for visual features and Random Forest for physiological cues, fused via majority-vote.

Result: Achieves 77% accuracy (CNN), 74% (Random Forest), and 87% (fusion), matching centralized performance with local data.

Conclusion: FedMultiEmo provides a practical, privacy-aware solution for real-time emotion recognition in vehicles.

Abstract: In-vehicle emotion recognition underpins adaptive driver-assistance systems and, ultimately, occupant safety. However, practical deployment is hindered by (i) modality fragility - poor lighting and occlusions degrade vision-based methods; (ii) physiological variability - heart-rate and skin-conductance patterns differ across individuals; and (iii) privacy risk - centralized training requires transmission of sensitive data. To address these challenges, we present FedMultiEmo, a privacy-preserving framework that fuses two complementary modalities at the decision level: visual features extracted by a Convolutional Neural Network from facial images, and physiological cues (heart rate, electrodermal activity, and skin temperature) classified by a Random Forest. FedMultiEmo builds on three key elements: (1) a multimodal federated learning pipeline with majority-vote fusion, (2) an end-to-end edge-to-cloud prototype on Raspberry Pi clients and a Flower server, and (3) a personalized Federated Averaging scheme that weights client updates by local data volume. Evaluated on FER2013 and a custom physiological dataset, the federated Convolutional Neural Network attains 77% accuracy, the Random Forest 74%, and their fusion 87%, matching a centralized baseline while keeping all raw data local. The developed system converges in 18 rounds, with an average round time of 120 seconds and a per-client memory footprint below 200 MB. These results indicate that FedMultiEmo offers a practical approach to real-time, privacy-aware emotion recognition in automotive settings.
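
With only two modalities, majority-vote fusion reduces to agreement plus a tie-break. The confidence-based tie-break below is an assumption, not necessarily the paper's rule.

```python
# Minimal sketch of the decision-level fusion; the confidence tie-break is an
# assumption on our part.
def fuse(visual_pred, physio_pred, visual_conf, physio_conf):
    """Each *_pred is an emotion label; each *_conf its classifier confidence."""
    if visual_pred == physio_pred:
        return visual_pred
    return visual_pred if visual_conf >= physio_conf else physio_pred
```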

[601] Data Aware Differentiable Neural Architecture Search for Tiny Keyword Spotting Applications

Yujia Shi, Emil Njor, Pablo Martínez-Nuevo, Sven Ewan Shepstone, Xenofon Fafoutis

Main category: cs.LG

TL;DR: The paper introduces ‘Data Aware Differentiable Neural Architecture Search’ to simplify TinyML system design by co-optimizing model architecture and data configuration.

DetailsMotivation: The complexity of TinyML system design hinders adoption, and current methods don't optimize both architecture and data.

Method: Expands Neural Architecture Search to include data configuration parameters, enabling co-optimization.

Result: Initial tests on keyword spotting show the method produces efficient, accurate TinyML systems.

Conclusion: The approach effectively balances resource usage and performance, advancing TinyML adoption.

Abstract: The success of Machine Learning is increasingly tempered by its significant resource footprint, driving interest in efficient paradigms like TinyML. However, the inherent complexity of designing TinyML systems hampers their broad adoption. To reduce this complexity, we introduce “Data Aware Differentiable Neural Architecture Search”. Unlike conventional Differentiable Neural Architecture Search, our approach expands the search space to include data configuration parameters alongside architectural choices. This enables Data Aware Differentiable Neural Architecture Search to co-optimize model architecture and input data characteristics, effectively balancing resource usage and system performance for TinyML applications. Initial results on keyword spotting demonstrate that this novel approach to TinyML system design can generate lean but highly accurate systems.
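
The expanded search space can be sketched in the DARTS style. Treating data-configuration choices as a second softmax-relaxed mixture is our schematic reading of the abstract, and it assumes the candidate configurations produce shape-compatible inputs.

```python
# Schematic reading (not the authors' code): DARTS-style relaxation over
# candidate ops, plus a second relaxed mixture over data-configuration
# choices (e.g., sample rates or feature resolutions for keyword spotting).
import torch
import torch.nn as nn

class DataAwareCell(nn.Module):
    def __init__(self, ops, data_configs):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.data_configs = data_configs  # callables producing input variants
        self.alpha_arch = nn.Parameter(torch.zeros(len(ops)))
        self.alpha_data = nn.Parameter(torch.zeros(len(data_configs)))

    def forward(self, raw):
        wd = torch.softmax(self.alpha_data, dim=0)
        x = sum(w * cfg(raw) for w, cfg in zip(wd, self.data_configs))
        wa = torch.softmax(self.alpha_arch, dim=0)
        return sum(w * op(x) for w, op in zip(wa, self.ops))
```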

[602] The added value for MRI radiomics and deep-learning for glioblastoma prognostication compared to clinical and molecular information

D. Abler, O. Pusterla, A. Joye-Kühnis, N. Andratschke, M. Bach, A. Bink, S. M. Christ, P. Hagmann, B. Pouymayou, E. Pravatà, P. Radojewski, M. Reyes, L. Ruinelli, R. Schaer, B. Stieltjes, G. Treglia, W. Valenzuela, R. Wiest, S. Zoergiebel, M. Guckenberger, S. Tanadini-Lang, A. Depeursinge

Main category: cs.LG

TL;DR: The study evaluates the added value of conventional radiomics (CR) and deep learning (DL) MRI radiomics for glioblastoma prognosis, finding minimal improvement over clinical predictors.

DetailsMotivation: To assess whether radiomics (CR and DL) provides significant added value over clinical and molecular predictors for glioblastoma prognosis.

Method: Analyzed 1152 glioblastoma patients with clinical, molecular, and MRI data. Developed CR and DL models, evaluated on internal and external cohorts, and compared feature sets (imaging-only, clinical/molecular-only, combined).

Result: Combined-feature CR models slightly outperformed clinical-only models (AUC 0.75 vs. 0.74), but DL models lacked significance. Imaging data showed modest relevance for overall survival.

Conclusion: Radiomics offers minimal added value over clinical predictors like age and gender for glioblastoma prognosis.

Abstract: Background: Radiomics shows promise in characterizing glioblastoma, but its added value over clinical and molecular predictors has yet to be proven. This study assessed the added value of conventional radiomics (CR) and deep learning (DL) MRI radiomics for glioblastoma prognosis (≤6 vs. >6 months survival) on a large multi-center dataset. Methods: After patient selection, our curated dataset comprises 1152 glioblastoma (WHO 2016) patients from five Swiss centers and one public source. It includes clinical (age, gender), molecular (MGMT, IDH), and baseline MRI data (T1, T1 contrast, FLAIR, T2) with tumor regions. CR and DL models were developed using standard methods and evaluated on internal and external cohorts. Sub-analyses assessed models with different feature sets (imaging-only, clinical/molecular-only, combined-features) and patient subsets (S-1: all patients, S-2: with molecular data, S-3: IDH wildtype). Results: The best performance was observed in the full cohort (S-1). In external validation, the combined-feature CR model achieved an AUC of 0.75, slightly, but significantly outperforming clinical-only (0.74) and imaging-only (0.68) models. DL models showed similar trends, though without statistical significance. In S-2 and S-3, combined models did not outperform clinical-only models. Exploratory analysis of CR models for overall survival prediction suggested greater relevance of imaging data: across all subsets, combined-feature models significantly outperformed clinical-only models, though with a modest advantage of 2-4 C-index points. Conclusions: While confirming the predictive value of anatomical MRI sequences for glioblastoma prognosis, this multi-center study found standard CR and DL radiomics approaches offer minimal added value over demographic predictors such as age and gender.

[603] PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors

Yimeng Chen, Piotr Piȩkos, Mateusz Ostaszewski, Firas Laakom, Jürgen Schmidhuber

Main category: cs.LG

TL;DR: PhysGym is a benchmark suite for evaluating LLM-based agents’ scientific reasoning in physics, focusing on prior knowledge and problem complexity.

DetailsMotivation: Current benchmarks lack the ability to assess LLM-based agents' scientific discovery capabilities, especially in varying environmental complexity and prior knowledge utilization.

Method: PhysGym introduces interactive simulations where agents probe environments, gather data, and hypothesize physical laws, with controlled prior knowledge levels.

Result: The benchmark differentiates LLM capabilities based on prior knowledge and task complexity, demonstrated through baseline model results.

Conclusion: PhysGym fills a critical gap by providing a standardized platform for rigorous evaluation of LLM-based scientific reasoning.

Abstract: Evaluating the scientific discovery capabilities of large language model based agents, particularly how they cope with varying environmental complexity and utilize prior knowledge, requires specialized benchmarks currently lacking in the landscape. To address this gap, we introduce PhysGym, a novel benchmark suite and simulation platform for rigorously assessing LLM-based scientific reasoning in interactive physics environments. PhysGym’s primary contribution lies in its sophisticated control over the level of prior knowledge provided to the agent. This allows researchers to dissect agent performance along axes including the complexity of the problem and the prior knowledge levels. The benchmark comprises a suite of interactive simulations, where agents must actively probe environments, gather data sequentially under constraints and formulate hypotheses about underlying physical laws. PhysGym provides standardized evaluation protocols and metrics for assessing hypothesis accuracy and model fidelity. We demonstrate the benchmark’s utility by presenting results from baseline LLMs, showcasing its ability to differentiate capabilities based on varying priors and task complexity.

[604] Trade-offs between elective surgery rescheduling and length-of-stay prediction accuracy

Pieter Smet, Martina Doneda, Ettore Lanzarone, Giuliana Carello

Main category: cs.LG

TL;DR: This paper explores how the accuracy of machine learning (ML) predictions for patient length-of-stay (LOS) affects rescheduling strategies in elective surgery planning, aiming to optimize bed utilization and prevent overflows.

DetailsMotivation: Downstream resource availability, like inpatient beds, is critical for elective surgery planning. Inaccurate LOS predictions can disrupt schedules, necessitating flexible rescheduling strategies.

Method: The study uses simulated ML to evaluate data-driven approaches, analyzing the relationship between LOS prediction accuracy and rescheduling flexibility under various corrective policies.

Result: The research identifies effective patient rescheduling strategies to mitigate the impact of LOS prediction errors, balancing bed availability and resource optimization.

Conclusion: Accurate LOS predictions reduce rescheduling needs, but flexible strategies are essential to handle prediction errors and maintain efficient resource utilization.

Abstract: The availability of downstream resources plays a critical role in planning the admission of patients undergoing elective surgery, with inpatient beds being one of the most crucial resources. When planning patient admissions, predictions on their length-of-stay (LOS) made by machine learning (ML) models are used to ensure bed availability. However, the actual LOS for each patient may differ considerably from the predicted value, potentially making the schedule infeasible. To address such infeasibilities, rescheduling strategies that take advantage of operational flexibility can be implemented. For example, adjustments may include postponing admission dates, relocating patients to different wards, or even transferring patients who are already admitted. The common assumption is that more accurate LOS predictions reduce the impact of rescheduling. However, training ML models that can make such accurate predictions can be costly. Building on previous work that proposed simulated ML for evaluating data-driven approaches, this paper explores the relationship between LOS prediction accuracy and rescheduling flexibility across various corrective policies. Specifically, we examine the most effective patient rescheduling strategies under LOS prediction errors to prevent bed overflows while optimizing resource utilization.

[605] On the Role of AI in Managing Satellite Constellations: Insights from the ConstellAI Project

Gregory F. Stock, Juan A. Fraire, Holger Hermanns, Jędrzej Mosiężny, Yusra Al-Khazraji, Julio Ramírez Molina, Evridiki V. Ntagiou

Main category: cs.LG

TL;DR: AI-driven algorithms, particularly Reinforcement Learning (RL), outperform traditional methods in optimizing satellite mega-constellation operations, specifically in data routing and resource allocation.

DetailsMotivation: The rapid expansion of satellite constellations necessitates innovative, scalable, and resilient management solutions.

Method: The ConstellAI project employs RL for data routing (improving latency) and resource allocation (optimizing task scheduling).

Result: RL outperforms classical methods, offering flexibility, scalability, and generalizability in satellite fleet management.

Conclusion: AI can transform satellite constellation management by providing adaptive, robust, and cost-effective solutions.

Abstract: The rapid expansion of satellite constellations in near-Earth orbits presents significant challenges in satellite network management, requiring innovative approaches for efficient, scalable, and resilient operations. This paper explores the role of Artificial Intelligence (AI) in optimizing the operation of satellite mega-constellations, drawing from the ConstellAI project funded by the European Space Agency (ESA). A consortium comprising GMV GmbH, Saarland University, and Thales Alenia Space collaborates to develop AI-driven algorithms and demonstrates their effectiveness over traditional methods for two crucial operational challenges: data routing and resource allocation. In the routing use case, Reinforcement Learning (RL) is used to improve the end-to-end latency by learning from historical queuing latency, outperforming classical shortest path algorithms. For resource allocation, RL optimizes the scheduling of tasks across constellations, focussing on efficiently using limited resources such as battery and memory. Both use cases were tested for multiple satellite constellation configurations and operational scenarios, resembling the real-life spacecraft operations of communications and Earth observation satellites. This research demonstrates that RL not only competes with classical approaches but also offers enhanced flexibility, scalability, and generalizability in decision-making processes, which is crucial for the autonomous and intelligent management of satellite fleets. The findings of this activity suggest that AI can fundamentally alter the landscape of satellite constellation management by providing more adaptive, robust, and cost-effective solutions.

[606] We Need to Rethink Benchmarking in Anomaly Detection

Philipp Röchner, Simon Klüttermann, Franz Rothlauf, Daniel Schlör

Main category: cs.LG

TL;DR: The paper argues that stagnation in anomaly detection progress is due to flawed benchmarking methods and proposes three improvements: scenario-based evaluation, end-to-end pipeline analysis, and meaningful objective alignment.

DetailsMotivation: Current benchmarking in anomaly detection fails to reflect real-world diversity, leading to minor performance gains and stagnation.

Method: Proposes three key improvements: scenario identification via taxonomy, end-to-end pipeline analysis, and scenario-specific evaluation.

Result: Identifies limitations in current benchmarking and suggests a shift to scenario-based evaluation for better progress.

Conclusion: Anomaly detection research should adopt scenario-focused benchmarking to drive meaningful advancements.

Abstract: Despite the continuous proposal of new anomaly detection algorithms and extensive benchmarking efforts, progress seems to stagnate, with only minor performance differences between established baselines and new algorithms. In this position paper, we argue that this stagnation is due to limitations in how we evaluate anomaly detection algorithms. Current benchmarking does not, for example, sufficiently reflect the diversity of anomalies in applications ranging from predictive maintenance to scientific discovery. Consequently, we need to rethink benchmarking in anomaly detection. In our opinion, anomaly detection should be studied using scenarios that capture the relevant characteristics of different applications. We identify three key areas for improvement: First, we need to identify anomaly detection scenarios based on a common taxonomy. Second, anomaly detection pipelines should be analyzed end-to-end and by component. Third, evaluating anomaly detection algorithms should be meaningful regarding the scenario’s objectives.

[607] Red-Team Multi-Agent Reinforcement Learning for Emergency Braking Scenario

Yinsong Chen, Kaifeng Wang, Xiaoqiang Meng, Xueyuan Li, Zirui Li, Xin Gao

Main category: cs.LG

TL;DR: A Red-Team Multi-Agent Reinforcement Learning framework is proposed to uncover corner cases in safety-critical scenarios by using interfering red-team vehicles, improving AV decision-making safety.

DetailsMotivation: Existing methods for decision-making in safety-critical scenarios are inefficient and miss corner cases, prompting the need for a better approach.

Method: The framework employs red-team agents (background vehicles) to actively interfere and explore, using a Constraint Graph Representation Markov Decision Process to ensure safety while disrupting AVs. A policy threat zone model quantifies threats.

Result: The framework significantly impacts AV decision-making safety and generates diverse corner cases.

Conclusion: This method advances research in safety-critical scenarios by effectively uncovering corner cases and enhancing AV safety.

Abstract: Current research on decision-making in safety-critical scenarios often relies on inefficient data-driven scenario generation or specific modeling approaches, which fail to capture corner cases in real-world contexts. To address this issue, we propose a Red-Team Multi-Agent Reinforcement Learning framework, where background vehicles with interference capabilities are treated as red-team agents. Through active interference and exploration, red-team vehicles can uncover corner cases outside the data distribution. The framework uses a Constraint Graph Representation Markov Decision Process, ensuring that red-team vehicles comply with safety rules while continuously disrupting the autonomous vehicles (AVs). A policy threat zone model is constructed to quantify the threat posed by red-team vehicles to AVs, inducing more extreme actions to increase the danger level of the scenario. Experimental results show that the proposed framework significantly impacts AVs' decision-making safety and generates various corner cases. This method also offers a novel direction for research in safety-critical scenarios.

[608] Optimal Batch-Size Control for Low-Latency Federated Learning with Device Heterogeneity

Huiling Yang, Zhanwei Wang, Kaibin Huang

Main category: cs.LG

TL;DR: The paper proposes a C²-aware framework for optimizing batch-size control in federated learning to minimize latency while ensuring convergence, addressing challenges like high-dimensional model updates and device heterogeneity.

DetailsMotivation: The need for low-latency federated learning (FL) in 6G networks for time-sensitive IoT applications like autonomous driving and healthcare, while overcoming challenges of computation/communication overhead and device heterogeneity.

Method: A novel C²-aware framework for optimal batch-size control, balancing the tradeoff between gradient estimation accuracy and per-round latency, with strategies for slow and fast fading scenarios.

Result: The proposed strategies outperform conventional batch-size adaptation schemes, demonstrating effectiveness in minimizing latency and accommodating device heterogeneity.

Conclusion: The framework successfully addresses the C² tradeoff and device heterogeneity, offering practical solutions for low-latency FL in 6G networks.

Abstract: Federated learning (FL) has emerged as a popular approach for collaborative machine learning in sixth-generation (6G) networks, primarily due to its privacy-preserving capabilities. The deployment of FL algorithms is expected to empower a wide range of Internet-of-Things (IoT) applications, e.g., autonomous driving, augmented reality, and healthcare. The mission-critical and time-sensitive nature of these applications necessitates the design of low-latency FL frameworks that guarantee high learning performance. In practice, achieving low-latency FL faces two challenges: the overhead of computing and transmitting high-dimensional model updates, and the heterogeneity in communication-and-computation (C$^2$) capabilities across devices. To address these challenges, we propose a novel C$^2$-aware framework for optimal batch-size control that minimizes end-to-end (E2E) learning latency while ensuring convergence. The framework is designed to balance a fundamental C$^2$ tradeoff as revealed through convergence analysis. Specifically, increasing batch sizes improves the accuracy of gradient estimation in FL and thus reduces the number of communication rounds required for convergence, but results in higher per-round latency, and vice versa. The associated problem of latency minimization is intractable; however, we solve it by designing an accurate and tractable surrogate for convergence speed, with parameters fitted to real data. This approach yields two batch-size control strategies tailored to scenarios with slow and fast fading, while also accommodating device heterogeneity. Extensive experiments using real datasets demonstrate that the proposed strategies outperform conventional batch-size adaptation schemes that do not consider the C$^2$ tradeoff or device heterogeneity.
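
As a rough illustration of the C$^2$ tradeoff described above, the sketch below sweeps batch sizes against a hypothetical convergence surrogate (larger batches cut the number of rounds but lengthen each round) and picks the latency-minimizing value. The surrogate form and all constants are illustrative assumptions, not the paper's fitted model.

```python
import numpy as np

# Hypothetical surrogate for the C^2 tradeoff: larger batches reduce the
# number of rounds needed for convergence but raise per-round latency.
t_comp_per_sample = 0.002   # seconds of local computation per sample (assumed)
t_comm_per_round = 1.5      # seconds to exchange the model each round (assumed)

def rounds_to_converge(batch_size, a=50.0, c=4000.0):
    """Assumed convergence surrogate: fewer rounds as gradient noise shrinks."""
    return a + c / batch_size

def per_round_latency(batch_size):
    return batch_size * t_comp_per_sample + t_comm_per_round

batch_sizes = np.arange(8, 513, 8)
e2e = rounds_to_converge(batch_sizes) * per_round_latency(batch_sizes)
best = batch_sizes[np.argmin(e2e)]
print(f"batch size minimizing E2E latency: {best} ({e2e.min():.1f} s)")
```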

[609] Accelerating HEC-RAS: A Recurrent Neural Operator for Rapid River Forecasting

Edward Holmberg, Pujan Pokhrel, Maximilian Zoch, Elias Ioup, Ken Pathak, Steven Sloan, Kendall Niles, Jay Ratcliff, Maik Flanagin, Christian Guetl, Julian Simeonov, Mahdi Abdelguerfi

Main category: cs.LG

TL;DR: A deep learning surrogate model accelerates HEC-RAS river forecasts by 3.5x, maintaining accuracy with minimal feature input.

DetailsMotivation: Traditional physics-based solvers like HEC-RAS are computationally slow for real-time flood decision-making, necessitating a faster yet accurate alternative.

Method: A hybrid architecture combines GRU for short-term dynamics and Geo-FNO for spatial dependencies, trained on HEC-RAS-generated data.

Result: The model achieves a median absolute stage error of 0.31 feet and reduces forecast time from 139 to 40 minutes.

Conclusion: The surrogate model proves data-driven approaches can replace conventional hydraulic models, enhancing large-scale flood forecasting feasibility.

Abstract: Physics-based solvers like HEC-RAS provide high-fidelity river forecasts but are too computationally intensive for on-the-fly decision-making during flood events. The central challenge is to accelerate these simulations without sacrificing accuracy. This paper introduces a deep learning surrogate that treats HEC-RAS not as a solver but as a data-generation engine. We propose a hybrid, auto-regressive architecture that combines a Gated Recurrent Unit (GRU) to capture short-term temporal dynamics with a Geometry-Aware Fourier Neural Operator (Geo-FNO) to model long-range spatial dependencies along a river reach. The model learns underlying physics implicitly from a minimal eight-channel feature vector encoding dynamic state, static geometry, and boundary forcings extracted directly from native HEC-RAS files. Trained on 67 reaches of the Mississippi River Basin, the surrogate was evaluated on a year-long, unseen hold-out simulation. Results show the model achieves strong predictive accuracy, with a median absolute stage error of 0.31 feet. Critically, for a full 67-reach ensemble forecast, our surrogate reduces the required wall-clock time from 139 minutes to 40 minutes, a speedup of nearly 3.5 times over the traditional solver. The success of this data-driven approach demonstrates that robust feature engineering can produce a viable, high-speed replacement for conventional hydraulic models, improving the computational feasibility of large-scale ensemble flood forecasting.

[610] Towards Explainable Anomaly Detection in Shared Mobility Systems

Elnur Isgandarov, Matteo Cederle, Federico Chiariotti, Gian Antonio Susto

Main category: cs.LG

TL;DR: The paper introduces an interpretable anomaly detection framework for bike-sharing systems using multi-source data and Isolation Forest with DIFFI for interpretability.

DetailsMotivation: Identifying anomalies in shared mobility systems is crucial for optimizing operations, improving reliability, and enhancing user experience.

Method: The framework integrates bike-sharing trip records, weather, and transit data, using Isolation Forest for anomaly detection and DIFFI for interpretability.

Result: Station-level analysis effectively identifies anomalies, influenced by external factors like weather and transit availability.

Conclusion: The framework aids decision-making in shared mobility operations by providing actionable insights into anomalies.

Abstract: Shared mobility systems, such as bike-sharing networks, play a crucial role in urban transportation. Identifying anomalies in these systems is essential for optimizing operations, improving service reliability, and enhancing user experience. This paper presents an interpretable anomaly detection framework that integrates multi-source data, including bike-sharing trip records, weather conditions, and public transit availability. The Isolation Forest algorithm is employed for unsupervised anomaly detection, along with the Depth-based Isolation Forest Feature Importance (DIFFI) algorithm providing interpretability. Results show that station-level analysis offers a robust understanding of anomalies, highlighting the influence of external factors such as adverse weather and limited transit availability. Our findings contribute to improving decision-making in shared mobility operations.
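
A minimal sketch of the detection step, assuming scikit-learn and synthetic station-level features standing in for the real trip/weather/transit joins (the DIFFI interpretability pass is omitted):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy station-level features; the real pipeline would join trip records
# with weather and transit-availability data per station and time window.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "trip_count":       rng.poisson(40, 1000),
    "avg_duration_min": rng.normal(15, 4, 1000),
    "temperature_c":    rng.normal(12, 8, 1000),
    "transit_outages":  rng.binomial(3, 0.05, 1000),
})

model = IsolationForest(n_estimators=200, contamination=0.02, random_state=0)
labels = model.fit_predict(X)          # -1 = anomaly, 1 = normal
scores = model.decision_function(X)    # lower = more anomalous
print(f"flagged {np.sum(labels == -1)} anomalous station-hours")
```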

[611] GeoHNNs: Geometric Hamiltonian Neural Networks

Amine Mohamed Aboussalah, Abdessalam Ed-dib

Main category: cs.LG

TL;DR: GeoHNN is a neural network framework that embeds geometric priors from physics, ensuring stability and accuracy in modeling dynamics.

DetailsMotivation: Common machine learning methods ignore the geometric principles of physics, leading to unstable predictions for complex systems.

Method: GeoHNN encodes Riemannian and symplectic geometries, using symmetric positive-definite matrices and a constrained autoencoder to preserve phase space volume.

Result: GeoHNN outperforms existing models in long-term stability, accuracy, and energy conservation across various systems.

Conclusion: Embedding geometric physics principles is essential for robust and generalizable models of physical dynamics.

Abstract: The fundamental laws of physics are intrinsically geometric, dictating the evolution of systems through principles of symmetry and conservation. While modern machine learning offers powerful tools for modeling complex dynamics from data, common methods often ignore this underlying geometric fabric. Physics-informed neural networks, for instance, can violate fundamental physical principles, leading to predictions that are unstable over long periods, particularly for high-dimensional and chaotic systems. Here, we introduce \textit{Geometric Hamiltonian Neural Networks (GeoHNN)}, a framework that learns dynamics by explicitly encoding the geometric priors inherent to physical laws. Our approach enforces two fundamental structures: the Riemannian geometry of inertia, by parameterizing inertia matrices in their natural mathematical space of symmetric positive-definite matrices, and the symplectic geometry of phase space, using a constrained autoencoder to ensure the preservation of phase space volume in a reduced latent space. We demonstrate through experiments on systems ranging from coupled oscillators to high-dimensional deformable objects that GeoHNN significantly outperforms existing models. It achieves superior long-term stability, accuracy, and energy conservation, confirming that embedding the geometry of physics is not just a theoretical appeal but a practical necessity for creating robust and generalizable models of the physical world.
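
To make the Riemannian-geometry idea concrete, here is a minimal sketch of one standard way to parameterize an inertia matrix so it stays in the space of symmetric positive-definite matrices: a Cholesky factor with a softplus-positive diagonal. This is a common construction, not necessarily GeoHNN's exact parameterization.

```python
import torch
import torch.nn.functional as F

def spd_from_unconstrained(theta: torch.Tensor, n: int) -> torch.Tensor:
    """Map an unconstrained vector of length n*(n+1)/2 to an SPD matrix via a
    Cholesky factor whose diagonal is made strictly positive with softplus."""
    tril = torch.tril_indices(n, n)
    L = torch.zeros(n, n, dtype=theta.dtype).index_put((tril[0], tril[1]), theta)
    d = F.softplus(torch.diagonal(L)) + 1e-6           # strictly positive diagonal
    L = L - torch.diag(torch.diagonal(L)) + torch.diag(d)
    return L @ L.T                                      # SPD for any theta

theta = torch.randn(6, requires_grad=True)              # n = 3 -> 6 parameters
M = spd_from_unconstrained(theta, 3)
print(torch.linalg.eigvalsh(M))                         # all eigenvalues > 0
```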

[612] Explainable Anomaly Detection for Electric Vehicles Charging Stations

Matteo Cederle, Andrea Mazzucco, Andrea Demartini, Eugenio Mazza, Eugenia Suriani, Federico Vitti, Gian Antonio Susto

Main category: cs.LG

TL;DR: The paper explores unsupervised anomaly detection in EV charging stations using Isolation Forest and DIFFI for interpretability and root cause analysis, validated with real-world data.

DetailsMotivation: To ensure reliability and efficiency in EV charging infrastructure by detecting anomalies and understanding their root causes.

Method: Uses Isolation Forest for anomaly detection and DIFFI for feature importance analysis, applied to real-world sensor and charging session data.

Result: The approach is evaluated in a real industrial case, demonstrating its efficacy.

Conclusion: The study successfully integrates unsupervised anomaly detection with explainable AI to enhance interpretability and root cause analysis in EV charging infrastructure.

Abstract: Electric vehicle (EV) charging stations are one of the critical infrastructures needed to support the transition to renewable-energy-based mobility, but ensuring their reliability and efficiency requires effective anomaly detection to identify irregularities in charging behavior. However, in such a production scenario, it is also crucial to determine the underlying cause behind the detected anomalies. To achieve this goal, this study investigates unsupervised anomaly detection techniques for EV charging infrastructure, integrating eXplainable Artificial Intelligence techniques to enhance interpretability and uncover root causes of anomalies. Using real-world sensor and charging session data, this work applies Isolation Forest to detect anomalies and employs the Depth-based Isolation Forest Feature Importance (DIFFI) method to identify the most important features contributing to such anomalies. The efficacy of the proposed approach is evaluated in a real industrial case.

[613] Multi-Modal Sensor Fusion for Proactive Blockage Prediction in mmWave Vehicular Networks

Ahmad M. Nazar, Abdulkadir Celik, Mohamed Y. Selim, Asmaa Abdallah, Daji Qiao, Ahmed M. Eltawil

Main category: cs.LG

TL;DR: A proactive blockage prediction framework for mmWave vehicular communication uses multi-modal sensing (camera, GPS, LiDAR, radar) with deep learning models and softmax-weighted fusion, achieving high F1-scores (up to 97.2%) and low inference times.

DetailsMotivation: Signal blockage in mmWave vehicular communication due to dynamic obstacles like vehicles and pedestrians necessitates proactive prediction to ensure reliable communication.

Method: Proposes a multi-modal sensing approach with independent deep learning models for each sensor (camera, GPS, LiDAR, radar) and fuses outputs using a softmax-weighted ensemble based on validation performance.

Result: Camera-only achieves 97.1% F1-score (89.8ms inference); camera+radar improves to 97.2% F1 (95.7ms). Demonstrates effectiveness of multi-modal sensing for blockage prediction.

Conclusion: Multi-modal sensing is efficient and effective for mmWave blockage prediction, enabling proactive wireless communication in dynamic environments.

Abstract: Vehicular communication systems operating in the millimeter wave (mmWave) band are highly susceptible to signal blockage from dynamic obstacles such as vehicles, pedestrians, and infrastructure. To address this challenge, we propose a proactive blockage prediction framework that utilizes multi-modal sensing, including camera, GPS, LiDAR, and radar inputs in an infrastructure-to-vehicle (I2V) setting. This approach uses modality-specific deep learning models to process each sensor stream independently and fuses their outputs using a softmax-weighted ensemble strategy based on validation performance. Our evaluations, for up to 1.5s in advance, show that the camera-only model achieves the best standalone trade-off with an F1-score of 97.1% and an inference time of 89.8ms. A camera+radar configuration further improves accuracy to 97.2% F1 at 95.7ms. Our results demonstrate the effectiveness and efficiency of multi-modal sensing for mmWave blockage prediction and provide a pathway for proactive wireless communication in dynamic environments.
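
The fusion rule itself is simple; a minimal sketch, assuming each modality outputs a blockage probability and weights come from a softmax over validation F1-scores (the scores and temperature below are illustrative, not the paper's):

```python
import numpy as np

def softmax(x, temperature=1.0):
    z = np.asarray(x, dtype=float) / temperature
    z -= z.max()  # numerical stability
    return np.exp(z) / np.exp(z).sum()

# Validation F1-scores per modality (illustrative numbers).
val_f1 = {"camera": 0.971, "radar": 0.952, "lidar": 0.948, "gps": 0.901}
weights = softmax(list(val_f1.values()), temperature=0.05)

# Each model's predicted blockage probability for one test sample.
probs = np.array([0.92, 0.81, 0.77, 0.55])
fused = float(weights @ probs)
print(dict(zip(val_f1, weights.round(3))), f"-> fused p(blockage) = {fused:.2f}")
```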

[614] Deep-Learning Investigation of Vibrational Raman Spectra for Plant-Stress Analysis

Anoop C. Patil, Benny Jian Rong Sng, Yu-Wei Chang, Joana B. Pereira, Chua Nam-Hai, Rajani Sarojam, Gajendra Pratap Singh, In-Cheol Jang, Giovanni Volpe

Main category: cs.LG

TL;DR: DIVA, a deep-learning-based tool, automates plant stress detection using Raman spectroscopy without manual preprocessing, improving accuracy and consistency.

DetailsMotivation: Traditional Raman analysis for plant stress detection is biased and inconsistent due to manual preprocessing. DIVA aims to automate and improve this process.

Method: DIVA uses a variational autoencoder to process native Raman spectra, including fluorescence backgrounds, without manual intervention, identifying key spectral features.

Result: DIVA successfully detected various plant stresses (abiotic and biotic) by analyzing spectral features in an unbiased manner.

Conclusion: DIVA enables AI-driven plant health assessment, promoting resilient and sustainable agriculture through automated, unbiased Raman spectroscopy analysis.

Abstract: Detecting stress in plants is crucial for both open-farm and controlled-environment agriculture. Biomolecules within plants serve as key stress indicators, offering vital markers for continuous health monitoring and early disease detection. Raman spectroscopy provides a powerful, non-invasive means to quantify these biomolecules through their molecular vibrational signatures. However, traditional Raman analysis relies on customized data-processing workflows that require fluorescence background removal and prior identification of Raman peaks of interest, introducing potential biases and inconsistencies. Here, we introduce DIVA (Deep-learning-based Investigation of Vibrational Raman spectra for plant-stress Analysis), a fully automated workflow based on a variational autoencoder. Unlike conventional approaches, DIVA processes native Raman spectra, including fluorescence backgrounds, without manual preprocessing, identifying and quantifying significant spectral features in an unbiased manner. We applied DIVA to detect a range of plant stresses, including abiotic (shading, high light intensity, high temperature) and biotic stressors (bacterial infections). By integrating deep learning with vibrational spectroscopy, DIVA paves the way for AI-driven plant health assessment, fostering more resilient and sustainable agricultural practices.
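
For readers unfamiliar with the underlying machinery, a minimal variational autoencoder over raw spectra might look like the sketch below; dimensions, architecture, and loss weighting are illustrative assumptions, not DIVA's actual design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectraVAE(nn.Module):
    """Minimal VAE over raw spectra (fluorescence background included)."""
    def __init__(self, n_bins=1024, latent=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_bins, 256), nn.ReLU())
        self.mu, self.logvar = nn.Linear(256, latent), nn.Linear(256, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                                 nn.Linear(256, n_bins))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    rec = F.mse_loss(recon, x)                                   # reconstruction
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld

model = SpectraVAE()
x = torch.randn(8, 1024)            # a batch of raw spectra
recon, mu, logvar = model(x)
print(vae_loss(x, recon, mu, logvar))
```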

[615] Dynamics is what you need for time-series forecasting!

Alexis-Raja Brachet, Pierre-Yves Richard, Céline Hudelot

Main category: cs.LG

TL;DR: The paper addresses challenges in time-series forecasting by emphasizing the need for models to learn underlying data dynamics. It introduces the PRO-DYN framework and highlights the importance of a dynamics block at the model’s end.

DetailsMotivation: Current deep models struggle with time-series forecasting due to partial learning of data dynamics. The hypothesis is that models must fully capture underlying dynamics to improve performance.

Method: The study uses the PRO-DYN nomenclature to analyze models, identifying partial dynamics learning and the critical role of a dynamics block at the model’s end. Extensive experiments validate these findings.

Result: Findings show that under-performing models learn dynamics partially, and placing a dynamics block at the model’s end significantly improves forecasting accuracy.

Conclusion: Incorporating a learnable dynamics block as the final predictor is crucial for effective time-series forecasting.

Abstract: While boundaries between data modalities are vanishing, the usual successful deep models are still challenged by simple ones in the time-series forecasting task. Our hypothesis is that this task needs models that are able to learn the data underlying dynamics. We propose to validate it through both systemic and empirical studies. We develop an original $\texttt{PRO-DYN}$ nomenclature to analyze existing models through the lens of dynamics. Two observations thus emerged: $\textbf{1}$. under-performing architectures learn dynamics at most partially, $\textbf{2}$. the location of the dynamics block at the model end is of prime importance. We conduct extensive experiments to confirm our observations on a set of performance-varying models with diverse backbones. Results support the need to incorporate a learnable dynamics block and its use as the final predictor.
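
A minimal sketch of what "a learnable dynamics block as the final predictor" can mean in practice: instead of regressing the whole horizon at once, roll the last state forward with a learned residual (Euler-style) step. The specific form here is an assumption for illustration, not the PRO-DYN taxonomy's definition.

```python
import torch
import torch.nn as nn

class EulerDynamicsHead(nn.Module):
    """Learnable dynamics block used as the final predictor: it rolls the last
    observed state forward with x_{t+1} = x_t + f(x_t)."""
    def __init__(self, dim, hidden=64, horizon=12):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                               nn.Linear(hidden, dim))
        self.horizon = horizon

    def forward(self, x_last):              # x_last: (batch, dim)
        preds, x = [], x_last
        for _ in range(self.horizon):
            x = x + self.f(x)               # one learned Euler step
            preds.append(x)
        return torch.stack(preds, dim=1)    # (batch, horizon, dim)

head = EulerDynamicsHead(dim=8)
print(head(torch.randn(4, 8)).shape)        # torch.Size([4, 12, 8])
```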

[616] Graph Attention Specialized Expert Fusion Model for Node Classification: Based on Cora and Pubmed Datasets

Zihang Ma, Qitian Yin

Main category: cs.LG

TL;DR: The paper introduces WR-EFM, a Wasserstein-Rubinstein distance-enhanced Expert Fusion Model, to address classification disparities in graph node tasks, achieving balanced accuracy across categories.

DetailsMotivation: The study aims to resolve significant accuracy disparities in graph node classification, particularly for Category 2, which underperforms in traditional GCN models.

Method: Proposes WR-EFM, combining specialized GNN models for Categories 0/1 and Multi-hop GAT for Category 2, using WR distance for representation similarity and adaptive fusion.

Result: WR-EFM achieves balanced accuracies (77.8%, 78.0%, 79.9%) and reduces CV by 77.6%, improving Category 2 accuracy by 5.5% over GCN.

Conclusion: WR-EFM effectively handles class-imbalanced graph classification, offering a novel paradigm and releasing the project for community use.

Abstract: Graph node classification is a fundamental task in graph neural networks (GNNs), aiming to assign predefined class labels to nodes. On the PubMed citation network dataset, we observe significant classification difficulty disparities, with Category 2 achieving only 74.4% accuracy in traditional GCN, 7.5% lower than Category 1. To address this, we propose a Wasserstein-Rubinstein (WR) distance enhanced Expert Fusion Model (WR-EFM), training specialized GNN models for Categories 0/1 (with layer normalization and residual connections) and Multi-hop Graph Attention Networks (GAT) for Category 2. The WR distance metric optimizes representation similarity between models, particularly focusing on improving Category 2 performance. Our adaptive fusion strategy dynamically weights models based on category-specific performance, with Category 2 assigned a GAT weight of 0.8. WR distance further guides the fusion process by measuring distributional differences between model representations, enabling more principled integration of complementary features. Experimental results show WR-EFM achieves balanced accuracy across categories: 77.8% (Category 0), 78.0% (Category 1), and 79.9% (Category 2), outperforming both single models and standard fusion approaches. The coefficient of variation (CV) of WR-EFM’s category accuracies is 0.013, 77.6% lower than GCN’s 0.058, demonstrating superior stability. Notably, WR-EFM improves Category 2 accuracy by 5.5% compared to GCN, verifying the effectiveness of WR-guided fusion in capturing complex structural patterns. This work provides a novel paradigm for handling class-imbalanced graph classification tasks. To support the research community, we release our project at https://github.com/s010m00n/GASEM4NC.

[617] Federated Split Learning with Improved Communication and Storage Efficiency

Yujia Mu, Cong Shen

Main category: cs.LG

TL;DR: Proposes CSE-FSL, a communication and storage-efficient federated split learning method, reducing overhead and maintaining a single server model.

DetailsMotivation: Address high communication and storage costs in federated split learning (FSL) by minimizing data transmission and server storage requirements.

Method: Uses an auxiliary network for local client updates, keeps a single server model, and transmits smashed data selectively.

Result: Theoretical convergence proven; experiments show significant communication reduction in real-world tasks.

Conclusion: CSE-FSL effectively reduces communication and storage costs while maintaining performance in federated learning.

Abstract: Federated learning (FL) is one of the popular distributed machine learning (ML) solutions but incurs significant communication and computation costs at edge devices. Federated split learning (FSL) can train sub-models in parallel and reduce the computational burden of edge devices by splitting the model architecture. However, it still requires a high communication overhead due to transmitting the smashed data and gradients between clients and the server in every global round. Furthermore, the server must maintain separate partial models for every client, leading to a significant storage requirement. To address these challenges, this paper proposes a novel communication and storage efficient federated split learning method, termed CSE-FSL, which utilizes an auxiliary network to locally update the weights of the clients while keeping a single model at the server, hence avoiding frequent transmissions of gradients from the server and greatly reducing the storage requirement of the server. Additionally, a new model update method of transmitting the smashed data in selected epochs can reduce the amount of smashed data sent from the clients. We provide a theoretical analysis of CSE-FSL, rigorously guaranteeing its convergence under non-convex loss functions. The extensive experimental results further indicate that CSE-FSL achieves a significant communication reduction over existing FSL solutions using real-world FL tasks.
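
A minimal sketch of the auxiliary-network idea with a toy client model: the client trains its front half against a small local head, so the server does not have to send gradients back every global round. The architecture and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Client-side front half of the split model plus a small auxiliary head.
client_front = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
aux_head = nn.Linear(64, 10)  # local auxiliary classifier (assumed size)
opt = torch.optim.SGD(
    list(client_front.parameters()) + list(aux_head.parameters()), lr=0.1)

x = torch.randn(16, 32)                 # a local mini-batch
y = torch.randint(0, 10, (16,))
smashed = client_front(x)               # "smashed data"; in CSE-FSL this is
                                        # uploaded only in selected epochs
loss = F.cross_entropy(aux_head(smashed), y)
opt.zero_grad()
loss.backward()                         # gradients stay entirely client-side
opt.step()
```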

[618] Multi-Strategy Improved Snake Optimizer Accelerated CNN-LSTM-Attention-Adaboost for Trajectory Prediction

Shiyang Li

Main category: cs.LG

TL;DR: A hybrid CNN-LSTM-attention-Adaboost model with an improved snake-herd optimization algorithm is proposed for 4D trajectory prediction, outperforming traditional methods.

DetailsMotivation: To overcome limitations in medium- and long-term 4D trajectory prediction models by enhancing accuracy and handling high-dimensional data.

Method: Combines CNN for spatial features, LSTM for temporal features, attention for global features, and Adaboost for weak learners. Uses an improved snake-herd optimization algorithm for hyperparameter tuning.

Result: Outperforms traditional optimizers (e.g., particle swarm) and improves prediction accuracy by 39.89%.

Conclusion: The proposed hybrid model with optimized algorithms significantly enhances 4D trajectory prediction performance.

Abstract: To address the limitations of medium- and long-term four-dimensional (4D) trajectory prediction models, this paper proposes a hybrid CNN-LSTM-attention-Adaboost neural network model incorporating a multi-strategy improved snake-herd optimization (SO) algorithm. The model applies the Adaboost algorithm to combine multiple weak learners, and each submodel utilizes CNN to extract spatial features, LSTM to capture temporal features, and an attention mechanism to capture global features comprehensively. The strong learner model, combined with multiple sub-models, then optimizes the hyperparameters of the prediction model through the natural selection behavior pattern simulated by SO. In this study, based on real ADS-B data from Xi’an to Tianjin, comparison experiments and ablation studies across multiple optimizers are carried out, together with a comprehensive test and evaluation analysis. The results show that SO-CLA-Adaboost outperforms traditional optimizers such as particle swarm, whale, and gray wolf in handling large-scale high-dimensional trajectory data. In addition, introducing the full-strategy collaborative improvement SO algorithm improves the model’s prediction accuracy by 39.89%.

[619] Optimizing Canaries for Privacy Auditing with Metagradient Descent

Matteo Boglioni, Terrance Liu, Andrew Ilyas, Zhiwei Steven Wu

Main category: cs.LG

TL;DR: The paper introduces a method to optimize canary sets for black-box privacy auditing in differentially private learning, improving lower bounds on privacy parameters.

DetailsMotivation: To enhance privacy auditing by improving the effectiveness of canary sets used in membership inference attacks.

Method: Optimizes canary sets using metagradient optimization, tested on DP-SGD for differentially private image classification models.

Result: Empirical lower bounds for privacy parameters improved by over 2x in some cases; optimized canaries are transferable and efficient.

Conclusion: The proposed method significantly enhances privacy auditing efficiency and effectiveness for differentially private learning algorithms.

Abstract: In this work we study black-box privacy auditing, where the goal is to lower bound the privacy parameter of a differentially private learning algorithm using only the algorithm’s outputs (i.e., final trained model). For DP-SGD (the most successful method for training differentially private deep learning models), the canonical auditing approach uses membership inference: an auditor comes with a small set of special “canary” examples, inserts a random subset of them into the training set, and then tries to discern which of their canaries were included in the training set (typically via a membership inference attack). The auditor’s success rate then provides a lower bound on the privacy parameters of the learning algorithm. Our main contribution is a method for optimizing the auditor’s canary set to improve privacy auditing, leveraging recent work on metagradient optimization. Our empirical evaluation demonstrates that by using such optimized canaries, we can improve empirical lower bounds for differentially private image classification models by over 2x in certain instances. Furthermore, we demonstrate that our method is transferable and efficient: canaries optimized for non-private SGD with a small model architecture remain effective when auditing larger models trained with DP-SGD.
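
The auditing logic rests on the standard DP hypothesis-testing bound: any (ε, δ)-DP mechanism satisfies TPR ≤ e^ε · FPR + δ for a membership test, so observed attack rates certify a lower bound on ε. A sketch with illustrative rates (confidence intervals, which a real audit needs, are omitted):

```python
import math

def empirical_epsilon_lower_bound(tpr: float, fpr: float, delta: float = 1e-5) -> float:
    """Any (eps, delta)-DP mechanism obeys TPR <= exp(eps) * FPR + delta, so
    observed attack rates certify eps >= log((TPR - delta) / FPR)."""
    if fpr <= 0 or tpr <= delta:
        return 0.0
    return max(0.0, math.log((tpr - delta) / fpr))

# Illustrative attack rates on canaries, not numbers from the paper.
print(empirical_epsilon_lower_bound(tpr=0.45, fpr=0.05))  # ~2.2
```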

[620] FASTGEN: Fast and Cost-Effective Synthetic Tabular Data Generation with LLMs

Anh Nguyen, Sam Schafft, Nicholas Hale, John Alfaro

Main category: cs.LG

TL;DR: A cost-effective method for synthetic tabular data generation using LLMs to create reusable sampling scripts, improving diversity and realism while reducing time and cost.

DetailsMotivation: Real-world data collection is costly and scarce, and direct LLM-based synthetic data generation is time-consuming and expensive for large volumes.

Method: LLMs infer and encode field distributions into reusable scripts, classifying fields into numerical, categorical, or free-text types for efficient sampling.

Result: Outperforms traditional methods in diversity and realism, significantly reducing time and cost for high-volume synthetic data generation.

Conclusion: The approach accelerates testing in production pipelines, shortens development cycles, and offers scalable, cost-effective synthetic data solutions.

Abstract: Synthetic data generation has emerged as an invaluable solution in scenarios where real-world data collection and usage are limited by cost and scarcity. Large language models (LLMs) have demonstrated remarkable capabilities in producing high-fidelity, domain-relevant samples across various fields. However, existing approaches that directly use LLMs to generate each record individually impose prohibitive time and cost burdens, particularly when large volumes of synthetic data are required. In this work, we propose a fast, cost-effective method for realistic tabular data synthesis that leverages LLMs to infer and encode each field’s distribution into a reusable sampling script. By automatically classifying fields into numerical, categorical, or free-text types, the LLM generates distribution-based scripts that can efficiently produce diverse, realistic datasets at scale without continuous model inference. Experimental results show that our approach outperforms traditional direct methods in both diversity and data realism, substantially reducing the burden of high-volume synthetic data generation. We plan to apply this methodology to accelerate testing in production pipelines, thereby shortening development cycles and improving overall system efficiency. We believe our insights and lessons learned will aid researchers and practitioners seeking scalable, cost-effective solutions for synthetic data generation.
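
A sketch of what such a generated sampling script might look like, with hypothetical field specifications an LLM could infer from a few real rows; all field names, distributions, and parameters below are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
N = 100_000  # scale freely: no LLM calls are needed at sampling time

def sample_age(n):        # numerical field: clipped normal (assumed)
    return np.clip(rng.normal(41, 12, n).round(), 18, 90).astype(int)

def sample_segment(n):    # categorical field with inferred frequencies
    return rng.choice(["consumer", "smb", "enterprise"], n, p=[0.6, 0.3, 0.1])

def sample_note(n):       # free-text field from a small template pool
    templates = ["requested refund", "upgraded plan", "reported outage"]
    return rng.choice(templates, n)

df = pd.DataFrame({"age": sample_age(N),
                   "segment": sample_segment(N),
                   "note": sample_note(N)})
print(df.head())
```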

[621] Diffusion Beats Autoregressive in Data-Constrained Settings

Mihir Prabhudesai, Menging Wu, Amir Zadeh, Katerina Fragkiadaki, Deepak Pathak

Main category: cs.LG

TL;DR: Diffusion models outperform autoregressive models in data-scarce settings due to better data utilization and implicit augmentation.

DetailsMotivation: To explore the advantages of diffusion-based language models over autoregressive models, especially in data-constrained scenarios.

Method: Systematic study of masked diffusion models in data-constrained settings, comparing them with AR models.

Result: Diffusion models achieve lower validation loss and superior performance when compute is abundant but data is scarce.

Conclusion: Diffusion models are a compelling alternative to AR models when data is the bottleneck.

Abstract: Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings, where training involves repeated passes over limited data, and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We interpret this advantage as implicit data augmentation: masked diffusion exposes the model to a diverse distribution of token orderings and prediction tasks, unlike AR’s fixed left-to-right factorization. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. These results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm. Our code is available at: https://diffusion-scaling.github.io.

[622] Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains

Ashok Vardhan Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, Michael Gastpar

Main category: cs.LG

TL;DR: The paper investigates why single-layer transformers fail to learn first-order Markov chains, unlike deeper models, by analyzing their loss landscape and identifying global and local minima.

DetailsMotivation: To understand the contrasting behavior of single-layer transformers in learning first-order Markov chains compared to deeper models, which consistently succeed.

Method: Introduces a framework for analyzing transformers via Markov chains, theoretically characterizing the loss landscape and identifying conditions for global (bigram) and bad local (unigram) minima.

Result: Theoretical analysis and experiments confirm the existence of global and local minima, explaining why single-layer transformers struggle with Markov chains.

Conclusion: The study provides insights into the learning dynamics of transformers and highlights open problems in this area.

Abstract: Attention-based transformers have achieved tremendous success across a variety of disciplines including natural languages. To deepen our understanding of their sequential modeling capabilities, there is a growing interest in using Markov input processes to study them. A key finding is that when trained on first-order Markov chains, transformers with two or more layers consistently develop an induction head mechanism to estimate the in-context bigram conditional distribution. In contrast, single-layer transformers, unable to form an induction head, directly learn the Markov kernel but often face a surprising challenge: they become trapped in local minima representing the unigram distribution, whereas deeper models reliably converge to the ground-truth bigram. While single-layer transformers can theoretically model first-order Markov chains, their empirical failure to learn this simple kernel in practice remains a curious phenomenon. To explain this contrasting behavior of single-layer models, in this paper we introduce a new framework for a principled analysis of transformers via Markov chains. Leveraging our framework, we theoretically characterize the loss landscape of single-layer transformers and show the existence of global minima (bigram) and bad local minima (unigram) contingent on data properties and model architecture. We precisely delineate the regimes under which these local optima occur. Backed by experiments, we demonstrate that our theoretical findings are in congruence with the empirical results. Finally, we outline several open problems in this arena. Code is available at https://github.com/Bond1995/Markov .
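
The unigram-versus-bigram gap is easy to see numerically. The sketch below samples a first-order Markov chain and compares the cross-entropy of a unigram predictor (the bad local minimum) against the bigram predictor (the global minimum); for simplicity it scores with the true kernel rather than learned estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],   # first-order Markov kernel over two states
              [0.3, 0.7]])

T = 100_000
x = np.empty(T, dtype=int)
x[0] = 0
for t in range(1, T):
    x[t] = rng.choice(2, p=P[x[t - 1]])

# Unigram predictor (the bad local minimum): ignores the previous token.
pi = np.bincount(x, minlength=2) / T
unigram_loss = -np.mean(np.log(pi[x[1:]]))

# Bigram predictor (the global minimum): conditions on the previous token.
bigram_loss = -np.mean(np.log(P[x[:-1], x[1:]]))

print(f"unigram cross-entropy: {unigram_loss:.4f}")  # ~0.56 nats
print(f"bigram  cross-entropy: {bigram_loss:.4f}")   # ~0.40 nats, strictly lower
```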

[623] STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning

Jaeseong Lee, seung-won hwang, Aurick Qiao, Daniel F Campos, Zhewei Yao, Yuxiong He

Main category: cs.LG

TL;DR: The paper shows that structured pruning (expert pruning) can outperform unstructured pruning in Mixture-of-Experts (MoE) models, proposing a scalable method with O(1) complexity that achieves high sparsity with minimal performance loss.

DetailsMotivation: To reduce the high serving costs of MoEs in large language models (LLMs) by pruning experts, addressing scalability issues of existing methods.

Method: Proposes a scalable expert pruning method leveraging latent behavior similarity between experts, requiring only O(1) complexity.

Result: Achieves 40% sparsity with nearly no performance loss, even in challenging tasks like GSM8K, outperforming unstructured pruning.

Conclusion: Structured expert pruning is more effective and scalable than unstructured pruning for MoEs, enabling efficient deployment of large models.

Abstract: Mixture-of-experts (MoEs) have been adopted for reducing inference costs by sparsely activating experts in large language models (LLMs). Despite this reduction, the massive number of experts in MoEs still makes them expensive to serve. In this paper, we study how to address this by pruning MoEs. Among pruning methodologies, unstructured pruning has been known to achieve the highest performance for a given pruning ratio, compared to structured pruning, since the latter imposes constraints on the sparsification structure. This is intuitive, as the solution space of unstructured pruning subsumes that of structured pruning. However, our counterintuitive finding reveals that expert pruning, a form of structured pruning, can actually precede unstructured pruning to outperform unstructured-only pruning. As existing expert pruning, requiring $O(\frac{k^n}{\sqrt{n}})$ forward passes for $n$ experts, cannot scale for recent MoEs, we propose a scalable alternative with $O(1)$ complexity, yet outperforming the more expensive methods. The key idea is leveraging a latent structure between experts, based on behavior similarity, such that the greedy decision of whether to prune closely captures the joint pruning effect. Ours is highly effective: for Snowflake Arctic, a 480B-sized MoE with 128 experts, our method needs only one H100 and two hours to achieve nearly no loss in performance with 40% sparsity, even in generative tasks such as GSM8K, where state-of-the-art unstructured pruning fails to do so. The code will be made publicly available.
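
A toy sketch of similarity-driven expert pruning: score experts by a behavior signature, then greedily drop one member of the most similar remaining pair. This illustrates the intuition only; the paper's actual criterion and its O(1) construction differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d = 16, 64
# Behavior signatures, e.g., each expert's mean activation over a probe batch.
# Redundancy is planted here: the second half nearly duplicates the first.
base = rng.normal(size=(n_experts // 2, d))
signatures = np.concatenate([base, base + 0.01 * rng.normal(size=base.shape)])

def greedy_prune(signatures: np.ndarray, keep: int) -> list:
    """Repeatedly drop one member of the most behavior-similar remaining pair,
    so near-duplicate experts are removed first."""
    alive = list(range(len(signatures)))
    X = signatures / np.linalg.norm(signatures, axis=1, keepdims=True)
    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)
    while len(alive) > keep:
        sub = sim[np.ix_(alive, alive)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        alive.pop(max(i, j))  # which member of the pair to drop is a design choice
    return alive

print(greedy_prune(signatures, keep=10))  # keeps ~one copy per duplicate pair
```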

[624] AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization

Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He

Main category: cs.LG

TL;DR: The paper introduces α-DPO, an adaptive preference optimization algorithm for aligning LLMs with human values, addressing limitations of DPO and SimPO by using a dynamic reward margin.

DetailsMotivation: Aligning LLMs with human values is critical for their utility and safety, but existing methods like RLHF, DPO, and SimPO face challenges in efficiency and adaptability.

Method: α-DPO introduces a dynamic reward margin and adaptive preference distribution, balancing policy and reference models for personalized optimization.

Result: Empirical results show α-DPO outperforms DPO and SimPO in win rates on benchmarks like AlpacaEval 2 and Arena-Hard.

Conclusion: α-DPO is a robust and effective method for fine-tuning LLMs, offering improved alignment and diversity control.

Abstract: Aligning large language models (LLMs) with human values and intentions is crucial for their utility, honesty, and safety. Reinforcement learning from human feedback (RLHF) is a popular approach to achieve this alignment, but it faces challenges in computational efficiency and training stability. Recent methods like Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO) have proposed offline alternatives to RLHF, simplifying the process by reparameterizing the reward function. However, DPO depends on a potentially suboptimal reference model, and SimPO’s assumption of a fixed target reward margin may lead to suboptimal decisions in diverse data settings. In this work, we propose $\alpha$-DPO, an adaptive preference optimization algorithm designed to address these limitations by introducing a dynamic reward margin. Specifically, $\alpha$-DPO employs an adaptive preference distribution, balancing the policy model and the reference model to achieve personalized reward margins. We provide theoretical guarantees for $\alpha$-DPO, demonstrating its effectiveness as a surrogate optimization objective and its ability to balance alignment and diversity through KL divergence control. Empirical evaluations on AlpacaEval 2 and Arena-Hard show that $\alpha$-DPO consistently outperforms DPO and SimPO across various model settings, establishing it as a robust approach for fine-tuning LLMs. Our method achieves significant improvements in win rates, highlighting its potential as a powerful tool for LLM alignment. The code is available at https://github.com/junkangwu/alpha-DPO
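
To place α-DPO relative to its predecessors, here is a sketch of a DPO-style loss with a reward margin: margin = 0 recovers DPO, a fixed scalar mimics SimPO's target margin, and α-DPO's contribution is to make the margin adaptive per example. Its exact adaptive form is in the paper; passing the margin as a tensor argument here is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def preference_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l,
                    beta=0.1, margin=0.0):
    """DPO-style preference loss with a (possibly per-example) reward margin."""
    logits = beta * ((policy_logp_w - ref_logp_w)
                     - (policy_logp_l - ref_logp_l)) - margin
    return -F.logsigmoid(logits).mean()

# Per-sequence log-probabilities for chosen (w) and rejected (l) responses.
pw = torch.tensor([-12.0, -9.5]); pl = torch.tensor([-14.0, -9.0])
rw = torch.tensor([-12.5, -10.0]); rl = torch.tensor([-13.0, -9.2])
adaptive_margin = torch.tensor([0.05, 0.20])  # hypothetical per-pair margins
print(preference_loss(pw, pl, rw, rl, margin=adaptive_margin))
```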

[625] ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events

Duygu Sezen Islakoglu, Jan-Christoph Kalo

Main category: cs.LG

TL;DR: ChronoSense is a new benchmark for evaluating LLMs’ temporal understanding, testing Allen’s interval relations and temporal arithmetic. Results show LLMs struggle with temporal reasoning, suggesting memorization reliance.

DetailsMotivation: Despite LLMs' success in NLP, their reasoning and arithmetic, especially temporal understanding, remain weak. A comprehensive benchmark for Allen's interval relations is lacking.

Method: ChronoSense includes 16 tasks testing Allen relations and temporal arithmetic, using abstract events and real-world data. Seven recent LLMs are evaluated.

Result: LLMs perform poorly, handling Allen relations inconsistently and possibly relying on memorization for time-related questions.

Conclusion: Improved temporal understanding in LLMs is needed. ChronoSense provides a robust framework for future research.

Abstract: Large Language Models (LLMs) have achieved remarkable success in various NLP tasks, yet they still face significant challenges in reasoning and arithmetic. Temporal reasoning, a critical component of natural language understanding, has raised increasing research attention. However, comprehensive testing of Allen’s interval relations (e.g., before, after, during) – a fundamental framework for temporal relationships – remains underexplored. To fill this gap, we present ChronoSense, a new benchmark for evaluating LLMs’ temporal understanding. It includes 16 tasks, focusing on identifying the Allen relation between two temporal events and temporal arithmetic, using both abstract events and real-world data from Wikidata. We assess the performance of seven recent LLMs using this benchmark and the results indicate that models handle Allen relations, even symmetrical ones, quite differently. Moreover, the findings suggest that the models may rely on memorization to answer time-related questions. Overall, the models’ low performance highlights the need for improved temporal understanding in LLMs and ChronoSense offers a robust framework for future research in this area. Our dataset and the source code are available at https://github.com/duyguislakoglu/chronosense.
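
Allen's 13 basic relations are fully determined by endpoint comparisons, which is what makes them a clean probe of temporal reasoning. A reference implementation (assuming start < end for each interval):

```python
def allen_relation(a, b):
    """Return which of Allen's 13 basic relations holds between intervals
    a = (start, end) and b = (start, end)."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2: return "before"
    if e2 < s1: return "after"
    if e1 == s2: return "meets"
    if e2 == s1: return "met-by"
    if s1 == s2 and e1 == e2: return "equal"
    if s1 == s2: return "starts" if e1 < e2 else "started-by"
    if e1 == e2: return "finishes" if s1 > s2 else "finished-by"
    if s2 < s1 and e1 < e2: return "during"
    if s1 < s2 and e2 < e1: return "contains"
    return "overlaps" if s1 < s2 else "overlapped-by"

print(allen_relation((1939, 1945), (1941, 1952)))  # overlaps
print(allen_relation((1914, 1918), (1914, 1945)))  # starts
```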

[626] OMoE: Diversifying Mixture of Low-Rank Adaptation by Orthogonal Finetuning

Jinyuan Feng, Zhiqiang Pu, Tianyi Hu, Dongmin Li, Xiaolin Ai, Huimu Wang

Main category: cs.LG

TL;DR: OMoE introduces orthogonal constraints to MoE for LoRA, enhancing expert diversity and efficiency without changing the learning objective.

DetailsMotivation: Experts in vanilla MoE collapse to similar representations, limiting modularity and computational efficiency.

Method: Proposes Orthogonal Mixture-of-Experts (OMoE) using Gram-Schmidt to enforce orthogonal expert representations.

Result: OMoE achieves stable, efficient performance improvements with fewer experts compared to state-of-the-art methods.

Conclusion: OMoE is a resource-efficient MoE variant that promotes expert diversity and improves performance.

Abstract: Building a mixture-of-experts (MoE) architecture for low-rank adaptation (LoRA) is emerging as a potential direction in parameter-efficient fine-tuning (PEFT) for its modular design and remarkable performance. However, simply stacking more experts cannot guarantee significant improvement. In this work, we first conduct qualitative analysis showing that experts collapse to similar representations in vanilla MoE, limiting the capacity of modular design and computational efficiency. Further, our analysis reveals that the performance of previous MoE variants may be limited by a lack of diversity among experts. Motivated by these findings, we propose Orthogonal Mixture-of-Experts (OMoE), a resource-efficient MoE variant that trains experts in an orthogonal manner to promote diversity. In OMoE, a Gram-Schmidt process is leveraged to enforce that the experts’ representations lie within the Stiefel manifold. By applying orthogonal constraints directly to the architecture, OMoE keeps the learning objective unchanged, without compromising optimality. Our method is simple and alleviates memory bottlenecks, as it requires fewer experts than vanilla MoE models. Experiments on diverse commonsense reasoning benchmarks demonstrate that OMoE can consistently achieve stable and efficient performance improvement when compared with the state-of-the-art methods while significantly reducing the number of required experts.
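
The orthogonality mechanism itself is classical; a minimal sketch of a Gram-Schmidt pass over expert representation vectors (how OMoE wires this into LoRA training is more involved than shown here):

```python
import torch

def gram_schmidt(vectors: torch.Tensor) -> torch.Tensor:
    """Orthonormalize row vectors in order: each expert's representation has
    its components along earlier experts removed, then is renormalized."""
    ortho = []
    for v in vectors:
        for u in ortho:
            v = v - (v @ u) * u                  # strip the component along u
        ortho.append(v / v.norm().clamp_min(1e-8))
    return torch.stack(ortho)

experts = torch.randn(4, 16)                      # 4 experts, 16-dim representations
Q = gram_schmidt(experts)
print((Q @ Q.T).round(decimals=3))                # ~identity: pairwise orthonormal
```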

[627] Domain-Adaptive Small Language Models for Structured Tax Code Prediction

Souvik Nath, Sumit Wadhwa, Luis Perez

Main category: cs.LG

TL;DR: A domain-adaptive small language model (SLM) with encoder-decoder architecture is proposed for accurate prediction of hierarchical tax codes like HSN, outperforming flat classifiers and other architectures.

DetailsMotivation: Multinational firms face challenges in accurately determining tax codes (e.g., HSN, SAC) due to varying regulations, necessitating a robust solution to avoid penalties.

Method: An encoder-decoder SLM is used to predict hierarchical tax code sequences from unstructured product/service data, capturing dependencies within codes.

Result: The SLM outperforms flat classifiers, decoder-only, and encoder-only architectures in predicting structured tax codes like HSN.

Conclusion: The approach is scalable to other tax codes (e.g., UNSPSC, NCM) and demonstrates the potential of SLMs in under-explored NLP domains.

Abstract: Every day, multinational firms process thousands of transactions, each of which must adhere to tax regulations that vary by jurisdiction and are often nuanced. The determination of product and service tax codes, such as HSN or SAC, is a major use case in tax compliance. An accurate determination of such codes is imperative to avoid any tax penalties. This paper proposes a domain-adaptive small language model (SLM) with an encoder-decoder architecture for the enhanced prediction of product and service tax codes. In this approach, we address the problem of predicting hierarchical tax code sequences using unstructured product and services data. We employ an SLM based upon encoder-decoder architecture as this enables sequential generation of tax codes to capture the hierarchical dependencies present within the tax codes. Our experiments demonstrate that encoder-decoder SLMs can be successfully applied to the sequential prediction of structured tax codes, a domain that remains comparatively unexplored in current NLP research. In this paper, we demonstrate the superior performance of the domain-adaptive encoder-decoder SLMs over flat classifiers when applied to the Harmonized System of Nomenclature (HSN), and achieve better results compared to decoder-only and encoder-only architectures for structured sequence generation tasks. This approach can also be scaled to other government-mandated tax commodity codes, such as United Nations Standard Products and Services Codes (UNSPSC), or Brazil’s Nomenclatura Comum do Mercosul (NCM).

[628] Generative Models and Connected and Automated Vehicles: A Survey in Exploring the Intersection of Transportation and AI

Bo Shu, Yiting Zhang, Dong Shu

Main category: cs.LG

TL;DR: The paper explores the integration of generative models with CAVs to improve predictive modeling, simulation, and decision-making in autonomous vehicles, discussing benefits, challenges, and future potential.

DetailsMotivation: To understand how generative models can enhance CAV technology, addressing gaps in predictive accuracy and decision-making for safer and more innovative transportation.

Method: Investigates historical context and applications of generative models in CAVs, analyzing their impact on simulation and predictive tasks.

Result: Identifies progress in integrating these technologies but notes challenges that remain for full optimization and safety.

Conclusion: The integration of generative models and CAVs holds promise for advancing autonomous vehicle technology, though obstacles must be addressed for widespread adoption.

Abstract: This report investigates the history and impact of Generative Models and Connected and Automated Vehicles (CAVs), two groundbreaking forces pushing progress in technology and transportation. By focusing on the application of generative models within the context of CAVs, the study aims to unravel how this integration could enhance predictive modeling, simulation accuracy, and decision-making processes in autonomous vehicles. The report also discusses the benefits and challenges of integrating generative models and CAV technology in transportation. It aims to highlight the progress made, the remaining obstacles, and the potential for advancements in safety and innovation.

[629] Likelihood-Free Gaussian Process for Regression

Yuta Shikuri

Main category: cs.LG

TL;DR: The paper introduces LFGP, a likelihood-free Gaussian process framework for scalable problems without requiring explicit likelihood functions.

DetailsMotivation: Addresses scenarios where the probability model is unknown, such as financial investments, by avoiding direct likelihood specification.

Method: LFGP clusters parameters with similar values and approximates likelihoods using asymptotic normality of maximum likelihood estimators.

Result: Enables posterior distribution representation without strict probability model assumptions, reducing computational costs.

Conclusion: LFGP advances likelihood-free modeling by minimizing assumptions and computational demands for scalable problems.

Abstract: Gaussian process regression can flexibly represent the posterior distribution of an interest parameter given sufficient information on the likelihood. However, in some cases, we have little knowledge regarding the probability model. For example, when investing in a financial instrument, the probability model of cash flow is generally unknown. In this paper, we propose a novel framework called the likelihood-free Gaussian process (LFGP), which allows representation of the posterior distributions of interest parameters for scalable problems without directly setting their likelihood functions. The LFGP establishes clusters in which the value of the interest parameter can be considered approximately identical, and it approximates the likelihood of the interest parameter in each cluster to a Gaussian using the asymptotic normality of the maximum likelihood estimator. We expect that the proposed framework will contribute significantly to likelihood-free modeling, particularly by reducing the assumptions for the probability model and the computational costs for scalable problems.
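
A sketch of the downstream regression step under the stated assumptions: each cluster contributes an MLE of the interest parameter plus its asymptotic variance, which can be passed to a GP as per-point noise. Cluster locations, values, and the kernel below are illustrative, not the paper's construction.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Hypothetical clusters: each yields an MLE of the interest parameter and an
# asymptotic (normal) variance, e.g., from the inverse Fisher information.
centers = np.linspace(0, 10, 8)[:, None]        # cluster locations
theta_hat = np.sin(centers).ravel() + 0.1 * rng.normal(size=8)
var_hat = np.full(8, 0.01)                      # assumed per-cluster variances

# Per-point noise (alpha) carries the asymptotic-normality approximation.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=2.0), alpha=var_hat)
gp.fit(centers, theta_hat)
mean, std = gp.predict(np.linspace(0, 10, 5)[:, None], return_std=True)
print(mean.round(2), std.round(3))
```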

[630] Oversmoothing Alleviation in Graph Neural Networks: A Survey and Unified View

Yufei Jin, Xingquan Zhu

Main category: cs.LG

TL;DR: The paper proposes ATNPA, a unified framework to address oversmoothing in GNNs, categorizing existing methods into six groups and reviewing their strengths and weaknesses.

DetailsMotivation: Oversmoothing in GNNs limits their ability to learn long-term connections, especially in heterophilous graphs, necessitating a unified understanding of existing solutions.

Method: The authors introduce ATNPA (Augmentation, Transformation, Normalization, Propagation, Aggregation) to summarize approaches, propose a taxonomy, and categorize methods into six groups.

Result: The review provides a detailed analysis of existing methods, their relation to ATNPA, and highlights their strengths and weaknesses.

Conclusion: The paper offers a comprehensive roadmap for future research on oversmoothing alleviation in GNNs.

Abstract: Oversmoothing is a common challenge in learning graph neural networks (GNN), where, as layers increase, embedding features learned from GNNs quickly become similar or indistinguishable, making them incapable of differentiating network proximity. A GNN with a shallow architecture can only learn short-range relations or localized structure information, limiting its power to learn long-range connections, as evidenced by inferior learning performance on heterophilous graphs. Tackling oversmoothing is crucial for harnessing deep-layer architectures for GNNs. To date, many methods have been proposed to alleviate oversmoothing. The vast differences in their design principles, combined with graph complications, make it difficult to understand and compare how different approaches tackle oversmoothing. In this paper, we propose ATNPA, a unified view with five key steps: Augmentation, Transformation, Normalization, Propagation, and Aggregation, to summarize GNN oversmoothing alleviation approaches. We first propose a taxonomy for GNN oversmoothing alleviation which includes three themes to tackle oversmoothing. After that, we separate all methods into six categories, followed by detailed reviews of representative methods, including their relation to ATNPA, and discussion of their niche, strength, and weakness. The review not only provides an in-depth understanding of existing methods in the field but also lays out a clear road map for future study.

[631] Escaping Saddle Points for Nonsmooth Weakly Convex Functions via Perturbed Proximal Algorithms

Minhui Huang, Weiming Zhu

Main category: cs.LG

TL;DR: Perturbed proximal algorithms escape strict saddles for nonsmooth weakly convex functions, achieving ϵ-approximate local minima in O(ϵ⁻²log(d)⁴) iterations.

DetailsMotivation: Addressing the challenge of escaping saddle points in nonsmooth weakly convex functions, leveraging insights from smooth problem methods.

Method: Introduces perturbed proximal point, gradient, and linear algorithms, building on novel ϵ-approximate local minimum characterization.

Result: Proves algorithms find ϵ-approximate local minima in O(ϵ⁻²log(d)⁴) iterations under standard assumptions.

Conclusion: The proposed perturbed proximal algorithms effectively handle nonsmooth weakly convex functions, offering theoretical guarantees for escaping saddles.

Abstract: We propose perturbed proximal algorithms that can provably escape strict saddles for nonsmooth weakly convex functions. The main results are based on a novel characterization of $\epsilon$-approximate local minimum for nonsmooth functions, and recent developments on perturbed gradient methods for escaping saddle points for smooth problems. Specifically, we show that under standard assumptions, the perturbed proximal point, perturbed proximal gradient and perturbed proximal linear algorithms find $\epsilon$-approximate local minimum for nonsmooth weakly convex functions in $O(\epsilon^{-2}\log(d)^4)$ iterations, where $d$ is the dimension of the problem.

[632] Which Experiences Are Influential for RL Agents? Efficiently Estimating The Influence of Experiences

Takuya Hiraoka, Guanquan Wang, Takashi Onishi, Yoshimasa Tsuruoka

Main category: cs.LG

TL;DR: PIToD efficiently estimates experience influence in RL, outperforming LOO, and improves agent performance by removing negative experiences.

DetailsMotivation: Understanding how experiences in replay buffers affect RL agent performance is crucial, especially for identifying harmful experiences.

Method: Proposes Policy Iteration with Turn-over Dropout (PIToD) to estimate experience influence more efficiently than LOO.

Result: PIToD accurately estimates influence and significantly improves RL agent performance by removing negative experiences.

Conclusion: PIToD is a computationally efficient and effective method for enhancing RL agent performance through experience amendment.

Abstract: In reinforcement learning (RL) with experience replay, experiences stored in a replay buffer influence the RL agent’s performance. Information about how these experiences influence the agent’s performance is valuable for various purposes, such as identifying experiences that negatively influence underperforming agents. One method for estimating the influence of experiences is the leave-one-out (LOO) method. However, this method is usually computationally prohibitive. In this paper, we present Policy Iteration with Turn-over Dropout (PIToD), which efficiently estimates the influence of experiences. We evaluate how correctly PIToD estimates the influence of experiences and its efficiency compared to LOO. We then apply PIToD to amend underperforming RL agents, i.e., we use PIToD to estimate negatively influential experiences for the RL agents and to delete the influence of these experiences. We show that RL agents’ performance is significantly improved via amendments with PIToD.

[633] 5G Traffic Prediction with Time Series Analysis

Nikhil Nayak, Rujula Singh R, Rameshwar Garg, Varun Danda, Chandana Kiran, Kaustuv Saha

Main category: cs.LG

TL;DR: The paper proposes using LSTM models for traffic prediction and classification in cellular networks to optimize resource allocation and utilization.

DetailsMotivation: The dramatic increase in cellular traffic demand necessitates accurate prediction and classification to improve network performance.

Method: The study employs LSTM models for predicting packet arrival intensity and burst occurrence, and replaces the regression layer with a softmax classifier for traffic classification into four application types.

Result: The LSTM model predicts uplink packets and burst occurrence probability, while the softmax classifier successfully categorizes traffic into surfing, video calling, voice calling, and video streaming.

Conclusion: Machine learning, particularly LSTM models, effectively addresses traffic prediction and classification challenges in cellular networks, enhancing operational efficiency.

Abstract: In today’s day and age, a mobile phone has become a basic requirement needed for anyone to thrive. With the cellular traffic demand increasing so dramatically, it is now necessary to accurately predict the user traffic in cellular networks, so as to improve the performance in terms of resource allocation and utilisation. Since traffic learning and prediction is a classical and appealing field, which still yields many meaningful results, there has been an increasing interest in leveraging Machine Learning tools to analyse the total traffic served in a given region, to optimise the operation of the network. With the help of this project, we seek to exploit the traffic history by using it to predict the nature and occurrence of future traffic. Furthermore, we classify the traffic into particular application types, to increase our understanding of the nature of the traffic. By leveraging the power of machine learning and identifying its usefulness in the field of cellular networks we try to achieve three main objectives - classification of the application generating the traffic, prediction of packet arrival intensity and burst occurrence. The design of the prediction and classification system is done using Long Short Term Memory (LSTM) model. The LSTM predictor developed in this experiment would return the number of uplink packets and also estimate the probability of burst occurrence in the specified future time interval. For the purpose of classification, the regression layer in our LSTM prediction model is replaced by a softmax classifier which is used to classify the application generating the cellular traffic into one of the four applications including surfing, video calling, voice calling, and video streaming.
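
The described design is a stock sequence model with two interchangeable heads. A compact sketch, with layer sizes and sequence length as placeholders: the LSTM's final hidden state feeds a regression head (uplink packet count and a burst-probability logit), which the classification variant swaps for a softmax classifier over the four application types.

```python
import torch
import torch.nn as nn

class TrafficLSTM(nn.Module):
    """LSTM backbone; regression head for prediction, softmax head for classification."""
    def __init__(self, n_features=1, hidden=64, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.regress = nn.Linear(hidden, 2)           # packet count, burst logit
        self.classify = nn.Linear(hidden, n_classes)  # replaces the regression head

    def forward(self, x):
        _, (h, _) = self.lstm(x)  # final hidden state, shape (1, B, hidden)
        h = h[-1]
        packets, burst_logit = self.regress(h).unbind(-1)
        return packets, torch.sigmoid(burst_logit), self.classify(h)

model = TrafficLSTM()
seq = torch.randn(32, 50, 1)  # 32 traffic histories, 50 time steps each
packets, burst_prob, class_logits = model(seq)  # logits feed a cross-entropy loss
```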

[634] How to Leverage Predictive Uncertainty Estimates for Reducing Catastrophic Forgetting in Online Continual Learning

Giuseppe Serra, Ben Werner, Florian Buettner

Main category: cs.LG

TL;DR: The paper analyzes how predictive uncertainty can optimize memory management in online learning to combat catastrophic forgetting (CF), proposing a new uncertainty estimation method.

DetailsMotivation: Addressing catastrophic forgetting in online learning by leveraging predictive uncertainty for effective memory management.

Method: Analyzes uncertainty estimates and memory strategies, proposes a generalized variance-based uncertainty measure.

Result: Demonstrates that predictive uncertainty measures reduce CF in various settings.

Conclusion: Predictive uncertainty is effective for memory management in mitigating catastrophic forgetting.

Abstract: Many real-world applications require machine-learning models to be able to deal with non-stationary data distributions and thus learn autonomously over an extended period of time, often in an online setting. One of the main challenges in this scenario is the so-called catastrophic forgetting (CF) for which the learning model tends to focus on the most recent tasks while experiencing predictive degradation on older ones. In the online setting, the most effective solutions employ a fixed-size memory buffer to store old samples used for replay when training on new tasks. Many approaches have been presented to tackle this problem. However, it is not clear how predictive uncertainty information for memory management can be leveraged in the most effective manner and conflicting strategies are proposed to populate the memory. Are the easiest-to-forget or the easiest-to-remember samples more effective in combating CF? Starting from the intuition that predictive uncertainty provides an idea of the samples’ location in the decision space, this work presents an in-depth analysis of different uncertainty estimates and strategies for populating the memory. The investigation provides a better understanding of the characteristics data points should have for alleviating CF. Then, we propose an alternative method for estimating predictive uncertainty via the generalised variance induced by the negative log-likelihood. Finally, we demonstrate that the use of predictive uncertainty measures helps in reducing CF in different settings.
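
Operationally, the memory-population question reduces to scoring candidate samples by predictive uncertainty and keeping one end of the ranking (easiest-to-forget = most uncertain, easiest-to-remember = least). A sketch using predictive entropy as a stand-in score; the paper's proposed measure, a generalised variance induced by the negative log-likelihood, would slot into the same place:

```python
import torch

def predictive_entropy(logits: torch.Tensor) -> torch.Tensor:
    p = torch.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-12)).sum(dim=-1)

def populate_memory(xs, ys, logits, buffer_size: int, keep: str = "uncertain"):
    """Rank an incoming batch by uncertainty and keep one end of the ranking."""
    u = predictive_entropy(logits)
    order = torch.argsort(u, descending=(keep == "uncertain"))
    idx = order[:buffer_size]
    return xs[idx], ys[idx]

xs, ys = torch.randn(128, 16), torch.randint(0, 10, (128,))
logits = torch.randn(128, 10)
mem_x, mem_y = populate_memory(xs, ys, logits, buffer_size=32)
```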

[635] Quantum Learning Theory Beyond Batch Binary Classification

Preetham Mohan, Ambuj Tewari

Main category: cs.LG

TL;DR: Extends quantum batch learning results to multiclass and online learning, introducing a new quantum online learning model.

DetailsMotivation: To generalize previous findings on quantum batch learning to broader contexts (multiclass and online learning) and introduce a novel quantum online learning framework.

Method: Extends Arunachalam and de Wolf’s (2018) approach to multiclass and online settings, including an adaptive adversary variant and a new quantum online learning model.

Result: Demonstrates that quantum sample complexities for multiclass and online learning mirror classical ones, with a new quantum online learning model introduced.

Conclusion: Quantum learning complexities align with classical ones in broader contexts, and a pioneering quantum online learning model is proposed.

Abstract: Arunachalam and de Wolf (2018) showed that the sample complexity of quantum batch learning of boolean functions, in the realizable and agnostic settings, has the same form and order as the corresponding classical sample complexities. In this paper, we extend this, ostensibly surprising, message to batch multiclass learning, online boolean learning, and online multiclass learning. For our online learning results, we first consider an adaptive adversary variant of the classical model of Dawid and Tewari (2022). Then, we introduce the first (to the best of our knowledge) model of online learning with quantum examples.

[636] A Mathematical Framework and a Suite of Learning Techniques for Neural-Symbolic Systems

Charles Dickens, Connor Pryor, Changyu Gao, Alon Albalak, Eriq Augustine, William Wang, Stephen Wright, Lise Getoor

Main category: cs.LG

TL;DR: The paper introduces Neural-Symbolic Energy-Based Models (NeSy-EBMs) as a unifying framework for neural-symbolic systems, offering general learning approaches and demonstrating practical advantages across tasks.

DetailsMotivation: The rapid growth of Neural-Symbolic (NeSy) systems lacks a unifying framework to organize modeling patterns and learning methods.

Method: The authors propose NeSy-EBMs, a mathematical framework for discriminative and generative NeSy modeling, with general gradient expressions and four learning approaches. They also introduce NeuPSL, an open-source library.

Result: Empirical analysis shows NeSy-EBMs’ effectiveness in tasks like image classification, graph node labeling, autonomous vehicle situation awareness, and question answering.

Conclusion: NeSy-EBMs provide a scalable and expressive framework for real-world NeSy applications, validated by diverse empirical results.

Abstract: The field of Neural-Symbolic (NeSy) systems is growing rapidly. Proposed approaches show great promise in achieving symbiotic unions of neural and symbolic methods. However, a unifying framework is needed to organize common NeSy modeling patterns and develop general learning approaches. In this paper, we introduce Neural-Symbolic Energy-Based Models (NeSy-EBMs), a unifying mathematical framework for discriminative and generative NeSy modeling. Importantly, NeSy-EBMs allow the derivation of general expressions for gradients of prominent learning losses, and we introduce a suite of four learning approaches that leverage methods from multiple domains, including bilevel and stochastic policy optimization. Finally, we ground the NeSy-EBM framework with Neural Probabilistic Soft Logic (NeuPSL), an open-source NeSy-EBM library designed for scalability and expressivity, facilitating the real-world application of NeSy systems. Through extensive empirical analysis across multiple datasets, we demonstrate the practical advantages of NeSy-EBMs in various tasks, including image classification, graph node labeling, autonomous vehicle situation awareness, and question answering.
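
To fix ideas, a toy energy in the NeSy-EBM spirit for graph node labeling: a neural term rewards labels the network scores highly, and a symbolic term penalizes violations of a rule such as "linked nodes share labels"; inference minimizes the energy over labels, and learning shapes both terms. This is an illustrative instance, not NeuPSL's actual formulation.

```python
import torch

def nesy_energy(scores, y, adj, lam=1.0):
    """Toy NeSy-EBM energy for node labeling.
    scores: (n, k) neural class scores; y: (n,) labels; adj: (n, n) 0/1 adjacency."""
    neural = -scores[torch.arange(len(y)), y].sum()  # fit the network's scores
    violated = (y.unsqueeze(0) != y.unsqueeze(1)).float()
    symbolic = lam * (adj * violated).sum() / 2      # rule: neighbors agree
    return neural + symbolic                         # prediction: argmin_y E(y; x)

scores = torch.randn(5, 3)
adj = (torch.rand(5, 5) < 0.4).float()
adj = torch.triu(adj, 1); adj = adj + adj.T          # symmetric toy graph
y = scores.argmax(dim=-1)                            # a candidate labeling
print(nesy_energy(scores, y, adj))
```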

[637] Optimizer’s Information Criterion: Dissecting and Correcting Bias in Data-Driven Optimization

Garud Iyengar, Henry Lam, Tianyu Wang

Main category: cs.LG

TL;DR: The paper introduces the Optimizer’s Information Criterion (OIC) to correct the optimistic bias in data-driven optimization without additional computational costs, generalizing the Akaike Information Criterion for decision selection.

DetailsMotivation: The Optimizer's Curse causes optimistic bias in sample performance, and existing correction methods like cross-validation are computationally expensive.

Method: Develops OIC to approximate first-order bias directly, avoiding additional optimization. Applies OIC to various optimization formulations.

Result: Numerical validation shows superior performance of OIC on synthetic and real-world datasets.

Conclusion: OIC provides an efficient, general solution for bias correction in data-driven optimization, applicable beyond model selection to decision selection.

Abstract: In data-driven optimization, the sample performance of the obtained decision typically incurs an optimistic bias against the true performance, a phenomenon commonly known as the Optimizer’s Curse and intimately related to overfitting in machine learning. Common techniques to correct this bias, such as cross-validation, require repeatedly solving additional optimization problems and are therefore computationally expensive. We develop a general bias correction approach, building on what we call Optimizer’s Information Criterion (OIC), that directly approximates the first-order bias and does not require solving any additional optimization problems. Our OIC generalizes the celebrated Akaike Information Criterion to evaluate the objective performance in data-driven optimization, which crucially involves not only model fitting but also its interplay with the downstream optimization. As such it can be used for decision selection instead of only model selection. We apply our approach to a range of data-driven optimization formulations comprising empirical and parametric models, their regularized counterparts, and furthermore contextual optimization. Finally, we provide numerical validation on the superior performance of our approach under synthetic and real-world datasets.
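
The bias OIC targets is easy to reproduce in a few lines: choose the empirically best of several decisions, and its in-sample value systematically overstates its true value. Cross-validation removes this by re-solving the problem many times; OIC instead estimates the first-order bias analytically. A minimal simulation (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.zeros(10)  # ten candidate decisions, all with true value 0
n_samples, n_reps = 50, 2000

gaps = []
for _ in range(n_reps):
    est = rng.normal(true_means, 1.0, size=(n_samples, 10)).mean(axis=0)
    best = est.argmax()                        # data-driven "optimal" decision
    gaps.append(est[best] - true_means[best])  # in-sample value minus true value

print(np.mean(gaps))  # ~ +0.2: the systematic optimistic bias (Optimizer's Curse)
```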

[638] Self-Tuning Self-Supervised Image Anomaly Detection

Jaemin Yoo, Lingxiao Zhao, Leman Akoglu

Main category: cs.LG

TL;DR: ST-SSAD introduces an unsupervised method for tuning data augmentation in self-supervised anomaly detection, improving accuracy without labeled data.

DetailsMotivation: Self-supervised learning (SSL) avoids costly manual labeling, but data augmentation choices significantly impact accuracy in anomaly detection (SSAD). Existing methods lack labeled validation data for tuning.

Method: ST-SSAD proposes an unsupervised validation loss to align augmented training data with unlabeled validation data and introduces differentiable augmentation functions for end-to-end tuning.

Result: Experiments show ST-SSAD outperforms existing methods on semantic class anomalies and industrial defects.

Conclusion: ST-SSAD effectively tunes augmentation for SSAD without labeled data, offering significant performance improvements.

Abstract: Self-supervised learning (SSL) has emerged as a promising paradigm that presents supervisory signals to real-world problems, bypassing the extensive cost of manual labeling. Consequently, self-supervised anomaly detection (SSAD) has seen a recent surge of interest, since SSL is especially attractive for unsupervised tasks. However, recent works have reported that the choice of a data augmentation function has significant impact on the accuracy of SSAD, posing augmentation search as an essential but nontrivial problem due to lack of labeled validation data. In this paper, we introduce ST-SSAD, the first unsupervised approach to end-to-end augmentation tuning for SSAD. To this end, our work presents two key contributions. The first is a new unsupervised validation loss that quantifies the alignment between augmented training data and unlabeled validation data. The second is new differentiable augmentation functions, allowing data augmentation hyperparameter(s) to be tuned in an end-to-end manner. Experiments on two testbeds with semantic class anomalies and subtle industrial defects show that ST-SSAD gives significant performance gains over existing works. All our code and testbeds are available at https://github.com/jaeminyoo/ST-SSAD.
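
The tuning loop pairs each candidate augmentation with an unsupervised score of how well its augmented training embeddings align with unlabeled validation embeddings. The sketch below uses a Gaussian-kernel MMD as a stand-in alignment loss; the paper's actual validation loss and its differentiable augmentations (which allow gradient-based rather than grid-based tuning) differ, and `encoder`/`augment` are hypothetical.

```python
import torch

def mmd2(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Gaussian-kernel MMD^2 between two batches of embeddings."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# Augmentation-tuning sketch: pick the strength whose augmented training
# embeddings best match the unlabeled validation embeddings.
# best = min(candidate_strengths,
#            key=lambda s: mmd2(encoder(augment(x_train, s)), encoder(x_val)))
```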

[639] Proficient Graph Neural Network Design by Accumulating Knowledge on Large Language Models

Jialiang Wang, Hanmo Liu, Shimin Di, Zhili Wang, Jiachuan Wang, Lei Chen, Xiaofang Zhou

Main category: cs.LG

TL;DR: DesiGNN is a knowledge-centered framework that enhances LLMs’ ability to design Graph Neural Networks (GNNs) by converting past design experiences into structured knowledge, improving model proposals for unseen datasets.

DetailsMotivation: LLMs struggle with specialized, data-sensitive tasks like GNN design due to knowledge gaps and noisy inputs, leading to generic or misleading suggestions.

Method: DesiGNN systematically converts past model design experiences into fine-grained knowledge priors, aligning empirical property filtering with adaptive literature insights via LLMs.

Result: DesiGNN delivers initial model proposals ranking in the top 5.77% for unseen datasets within seconds, and consistently outperforms baselines with minimal search costs.

Conclusion: DesiGNN effectively bridges the gap between graph understanding and architecture patterns, enabling proficient, data-aware GNN design.

Abstract: High-level automation is increasingly critical in AI, driven by rapid advances in large language models (LLMs) and AI agents. However, LLMs, despite their general reasoning power, struggle significantly in specialized, data-sensitive tasks such as designing Graph Neural Networks (GNNs). This difficulty arises from (1) the inherent knowledge gaps in modeling the intricate, varying relationships between graph properties and suitable architectures and (2) the external noise from misleading descriptive inputs, often resulting in generic or even misleading model suggestions. Achieving proficiency in designing data-aware models – defined as the meta-level capability to systematically accumulate, interpret, and apply data-specific design knowledge – remains challenging for existing automated approaches, due to their inefficient construction and application of meta-knowledge. To achieve the meta-level proficiency, we propose DesiGNN, a knowledge-centered framework that systematically converts past model design experiences into structured, fine-grained knowledge priors well fitted to meta-learning with LLMs. To account for the inherent variability and external noise, DesiGNN aligns empirical property filtering from extensive benchmarks with adaptive elicitation of literature insights via LLMs. By constructing a solid meta-knowledge between unseen graph understanding and known effective architecture patterns, DesiGNN can deliver top-5.77% initial model proposals for unseen datasets within seconds, and achieve consistently superior performance with minimal search costs against baselines.

[640] Restrictions on Physical Stochastic Reservoir Computers

Anthony M. Polloreno

Main category: cs.LG

TL;DR: The paper analyzes how noise affects analog reservoir computers, showing it degrades their learning capabilities and limits feature construction, linking this to quantum complexity theory and classification tasks.

DetailsMotivation: To understand the impact of noise on the learning capabilities of analog reservoir computers, leveraging the information processing capacity (IPC) metric.

Method: Extends IPC analysis, applies quantum complexity theory to reservoir computing, and relates IPC degradation to the fat-shattering dimension of reservoir dynamics.

Result: Noise exponentially reduces the accessible volume of reservoir configurations, so a noisy analog reservoir supports only a polynomial amount of learning despite its exponentially large latent space.

Conclusion: Analog reservoir computers under noise can only perform polynomial learning, even with exponential post-processing.

Abstract: Reservoir computation is a recurrent framework for learning and predicting time series data that benefits from extremely simple training and interpretability, often as the dynamics of a physical system. In this paper, we will study the impact of noise on the learning capabilities of analog reservoir computers. Recent work on reservoir computation has shown that the information processing capacity (IPC) is a useful metric for quantifying the degradation of the performance due to noise. We further this analysis and demonstrate that this degradation of the IPC limits the possible features that can be meaningfully constructed in an analog reservoir computing setting. We borrow a result from quantum complexity theory that relates the circuit model of computation to a continuous time model, and demonstrate an exponential reduction in the accessible volume of reservoir configurations. We conclude by relating this degradation in the IPC to the fat-shattering dimension of a family of functions describing the reservoir dynamics, which allows us to express our result in terms of a classification task. We conclude that any physical, analog reservoir computer that is exposed to noise can only be used to perform a polynomial amount of learning, despite the exponentially large latent space, even with an exponential amount of post-processing.

[641] RetroDiff: Retrosynthesis as Multi-stage Distribution Interpolation

Yiming Wang, Yuxuan Song, Yiqun Wang, Minkai Xu, Rui Wang, Hao Zhou, Wei-Ying Ma

Main category: cs.LG

TL;DR: RetroDiff, a diffusion-based method for retrosynthesis, outperforms existing methods in accuracy and molecular validity by mimicking the reverse of semi-template workflows.

DetailsMotivation: Addressing the challenge of integrating diffusion models for graph-to-graph tasks in retrosynthesis while retaining chemical reaction template information.

Method: A multi-stage diffusion process that first samples external groups given the product, then generates external bonds to connect them, mirroring the reverse of the semi-template retrosynthesis workflow.

Result: RetroDiff surpasses all semi-template methods in accuracy, and outperforms template-based methods in large-scale scenarios and template-free methods in molecular validity.

Conclusion: RetroDiff is an effective diffusion-based solution for retrosynthesis, offering improved performance and validity.

Abstract: Retrosynthesis poses a key challenge in biopharmaceuticals, aiding chemists in finding appropriate reactant molecules for given product molecules. With reactants and products represented as 2D graphs, retrosynthesis constitutes a conditional graph-to-graph (G2G) generative task. Inspired by advancements in discrete diffusion models for graph generation, we aim to design a diffusion-based method to address this problem. However, integrating a diffusion-based G2G framework while retaining essential chemical reaction template information presents a notable challenge. Our key innovation involves a multi-stage diffusion process. We decompose the retrosynthesis procedure to first sample external groups from the dummy distribution given products, then generate external bonds to connect products and generated groups. Interestingly, this generation process mirrors the reverse of the widely adopted semi-template retrosynthesis workflow, i.e., from reaction center identification to synthon completion. Based on these designs, we introduce Retrosynthesis Diffusion (RetroDiff), a novel diffusion-based method for the retrosynthesis task. Experimental results demonstrate that RetroDiff surpasses all semi-template methods in accuracy, and outperforms template-based and template-free methods in large-scale scenarios and molecular validity, respectively. Code: https://github.com/Alsace08/RetroDiff.

[642] ShiftKD: Benchmarking Knowledge Distillation under Distribution Shift

Songming Zhang, Yuxiao Luo, Ziyu Lyu, Xiaofeng Chen

Main category: cs.LG

TL;DR: The paper introduces ShiftKD, a framework to evaluate Knowledge Distillation (KD) methods under distribution shifts, covering diversity and correlation shifts. It benchmarks over 30 methods across five datasets and analyzes key training factors.

DetailsMotivation: To address the underexplored reliability of KD methods in real-world applications, particularly under distribution shifts, which can degrade performance.

Method: Proposes ShiftKD, a systematic framework to benchmark KD methods against diversity and correlation shifts, evaluating over 30 methods from algorithmic, data-driven, and optimization perspectives.

Result: Extensive experiments reveal strengths and limitations of current KD methods, with insights into data augmentation, pruning, optimizers, and metrics.

Conclusion: ShiftKD serves as a benchmark for robust KD evaluation, driving future development to meet real-world demands.

Abstract: Knowledge Distillation (KD) transfers knowledge from large models to small models and has recently achieved remarkable success. However, the reliability of existing KD methods in real-world applications, especially under distribution shift, remains underexplored. Distribution shift refers to the data distribution drifts between the training and testing phases, and this can adversely affect the efficacy of KD. In this paper, we propose a unified and systematic framework ShiftKD to benchmark KD against two general distributional shifts: diversity and correlation shift. The evaluation benchmark covers more than 30 methods from algorithmic, data-driven, and optimization perspectives for five benchmark datasets. Our development of ShiftKD conducts extensive experiments and reveals strengths and limitations of current SOTA KD methods. More importantly, we thoroughly analyze key factors in the student model training process, including data augmentation, pruning methods, optimizers, and evaluation metrics. We believe ShiftKD could serve as an effective benchmark for assessing KD in real-world scenarios, thus driving the development of more robust KD methods in response to evolving demands. The code will be made available upon publication.
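
For reference, the distillation objective being stress-tested is typically the standard Hinton-style loss: train the student in-distribution, then report accuracy on shifted test splits (unseen styles for diversity shift, spuriously correlated features for correlation shift). A sketch of that loss, with temperature and mixing weight as conventional placeholder values rather than ShiftKD's settings:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style distillation: soft teacher targets plus hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients to match the hard-label term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student, teacher = torch.randn(8, 10), torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(kd_loss(student, teacher, labels))
```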

[643] Score-based Causal Representation Learning: Linear and General Transformations

Burak Varıcı, Emre Acartürk, Karthikeyan Shanmugam, Abhishek Kumar, Ali Tajer

Main category: cs.LG

TL;DR: The paper explores causal representation learning (CRL) under nonparametric latent models, focusing on identifiability and achievability. It introduces score-based algorithms for both linear and general transformations, proving identifiability with interventions and validating results empirically.

DetailsMotivation: To address the challenge of recovering latent causal variables and graphs under unknown transformations, ensuring both theoretical guarantees (identifiability) and practical algorithms (achievability).

Method: Uses score functions (gradients of log-density) to design algorithms. Proves identifiability with interventions: one hard intervention per node for linear transformations, two for general transformations.

Result: Shows identifiability guarantees for linear and nonlinear models, with empirical validation on synthetic and image data.

Conclusion: The proposed score-based approach ensures identifiability and achievability in CRL, with theoretical and empirical support.

Abstract: This paper addresses intervention-based causal representation learning (CRL) under a general nonparametric latent causal model and an unknown transformation that maps the latent variables to the observed variables. Linear and general transformations are investigated. The paper addresses both the identifiability and achievability aspects. Identifiability refers to determining algorithm-agnostic conditions that ensure the recovery of the true latent causal variables and the underlying latent causal graph. Achievability refers to the algorithmic aspects and addresses designing algorithms that achieve identifiability guarantees. By drawing novel connections between score functions (i.e., the gradients of the logarithm of density functions) and CRL, this paper designs a score-based class of algorithms that ensures both identifiability and achievability. First, the paper focuses on linear transformations and shows that one stochastic hard intervention per node suffices to guarantee identifiability. It also provides partial identifiability guarantees for soft interventions, including identifiability up to mixing with parents for general causal models and perfect recovery of the latent graph for sufficiently nonlinear causal models. Secondly, it focuses on general transformations and demonstrates that two stochastic hard interventions per node are sufficient for identifiability. This is achieved by defining a differentiable loss function whose global optima ensure identifiability for general CRL. Notably, one does not need to know which pair of interventional environments has the same node intervened. Finally, the theoretical results are empirically validated via experiments on structured synthetic data and image data.
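
The lever behind the score-based algorithms is the change-of-variables identity: it ties observed and latent scores together through the transformation's Jacobian, so comparing observed scores across interventional environments localizes which latent node changed. A sketch for the linear case x = Az with A invertible, using standard score identities rather than the paper's exact notation:

```latex
% Change of variables: p_X(x) = p_Z(A^{-1}x)\,|\det A|^{-1}, hence
s_X(x) \;:=\; \nabla_x \log p_X(x) \;=\; A^{-\top}\, s_Z(A^{-1}x).

% Comparing the observational environment with one where node i is intervened:
s_X(x) - s_X^{(i)}(x) \;=\; A^{-\top}\big( s_Z(z) - s_Z^{(i)}(z) \big),
\qquad z = A^{-1}x,

% and since \log p_Z factorizes over the causal graph, the latent score
% difference s_Z - s_Z^{(i)} is supported only on node i and its parents,
% which is the signal the algorithms exploit.
```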

[644] Comparing skill of historical rainfall data based monsoon rainfall prediction in India with NWP forecasts

Apoorva Narula, Aastha Jain, Jatin Batra, MN Rajeevan, Sandeep Juneja

Main category: cs.LG

TL;DR: Transformer-based deep learning models outperform traditional numerical weather predictors in forecasting Indian summer monsoon rainfall.

DetailsMotivation: Accurate short-term forecasting of the Indian summer monsoon is crucial for over a billion people but remains challenging due to its complexity and sensitivity to multi-scale drivers.

Method: Autoformers (transformer-based deep learning models) trained on historical precipitation and auxiliary meteorological data, benchmarked against ECMWF and NCEP models.

Result: Relative to the transformer models, HRES and NCEP forecasts have about 22% and 43% higher error, respectively, for one-day predictions, and over 27% and 66% higher error for three-day predictions.

Conclusion: Deep learning transformer architectures offer superior accuracy for monsoon rainfall forecasting, outperforming traditional numerical methods.

Abstract: The Indian summer monsoon is a highly complex and critical weather system that directly affects the livelihoods of over a billion people across the Indian subcontinent. Accurate short-term forecasting remains a major scientific challenge due to the monsoon’s intrinsic nonlinearity and its sensitivity to multi-scale drivers, including local land-atmosphere interactions and large-scale ocean-atmosphere phenomena. In this study, we address the problem of forecasting daily rainfall across India during the summer months, focusing on both one-day and three-day lead times. We use Autoformers - deep learning transformer-based architectures designed for time series forecasting. These are trained on historical gridded precipitation data from the Indian Meteorological Department (1901–2023) at spatial resolutions of $0.25^\circ \times 0.25^\circ$, as well as $1^\circ \times 1^\circ$. The models also incorporate auxiliary meteorological variables from ECMWFs reanalysis datasets, namely, cloud cover, humidity, temperature, soil moisture, vorticity, and wind speed. Forecasts at $0.25^\circ \times 0.25^\circ$ are benchmarked against ECMWFs High-Resolution Ensemble System (HRES), widely regarded as the most accurate numerical weather predictor, and at $1^\circ \times 1^\circ $ with those from National Centre for Environmental Prediction (NCEP). We conduct both nationwide evaluations and localized analyses for major Indian cities. Our results indicate that transformer-based deep learning models consistently outperform both HRES and NCEP, as well as other climatological baselines. Specifically, compared to our model, forecasts from HRES and NCEP model have about 22% and 43% higher error, respectively, for a single day prediction, and over 27% and 66% higher error respectively, for a three day prediction.

[645] Understanding the training of infinitely deep and wide ResNets with Conditional Optimal Transport

Raphaël Barboni, Gabriel Peyré, François-Xavier Vialard

Main category: cs.LG

TL;DR: The paper analyzes gradient flow convergence in training deep neural networks, focusing on mean-field models of infinitely deep and wide ResNets, using conditional Optimal Transport distance for optimization.

DetailsMotivation: To understand why simple optimization algorithms like gradient descent succeed in training deep neural networks despite non-convexity and non-coercivity challenges.

Method: Proposes training with gradient flow w.r.t. conditional Optimal Transport distance, ensuring well-posedness and consistency with finite-width ResNet training.

Result: Shows convergence of the gradient flow to a global minimizer when the number of features is finite but sufficiently large and the risk at initialization is sufficiently small.

Conclusion: First theoretical guarantee for convergence in infinitely deep and arbitrarily wide ResNets, bridging mean-field theory and practical training.

Abstract: We study the convergence of gradient flow for the training of deep neural networks. If Residual Neural Networks are a popular example of very deep architectures, their training constitutes a challenging optimization problem due notably to the non-convexity and the non-coercivity of the objective. Yet, in applications, those tasks are successfully solved by simple optimization algorithms such as gradient descent. To better understand this phenomenon, we focus here on a “mean-field” model of infinitely deep and arbitrarily wide ResNet, parameterized by probability measures over the product set of layers and parameters and with constant marginal on the set of layers. Indeed, in the case of shallow neural networks, mean field models have proven to benefit from simplified loss-landscapes and good theoretical guarantees when trained with gradient flow for the Wasserstein metric on the set of probability measures. Motivated by this approach, we propose to train our model with gradient flow w.r.t. the conditional Optimal Transport distance: a restriction of the classical Wasserstein distance which enforces our marginal condition. Relying on the theory of gradient flows in metric spaces we first show the well-posedness of the gradient flow equation and its consistency with the training of ResNets at finite width. Performing a local Polyak-Łojasiewicz analysis, we then show convergence of the gradient flow for well-chosen initializations: if the number of features is finite but sufficiently large and the risk is sufficiently small at initialization, the gradient flow converges towards a global minimizer. This is the first result of this type for infinitely deep and arbitrarily wide ResNets.

[646] A Structure-Guided Gauss-Newton Method for Shallow ReLU Neural Network

Zhiqiang Cai, Tong Ding, Min Liu, Xinyu Liu, Jianlin Xia

Main category: cs.LG

TL;DR: The paper introduces a structure-guided Gauss-Newton (SgGN) method for solving least squares problems using shallow ReLU neural networks, leveraging the structure of the problem for efficient updates of nonlinear and linear parameters.

DetailsMotivation: The motivation is to address challenges in training neural networks for function approximation, especially with discontinuities or sharp transitions, by combining least squares and neural network structures.

Method: The method categorizes weights and biases as nonlinear and linear parameters, updating them iteratively using damped Gauss-Newton for nonlinear parameters and a linear solver for linear ones. A specialized Gauss-Newton matrix is derived for efficiency.

Result: The SgGN method ensures symmetric and positive definite matrices, eliminating the need for techniques like shifting. Numerical results show convergence and accuracy for challenging function approximation problems.

Conclusion: The SgGN method is effective for training shallow ReLU networks, particularly for problems with discontinuities or sharp transitions, outperforming common training algorithms.

Abstract: In this paper, we propose a structure-guided Gauss-Newton (SgGN) method for solving least squares problems using a shallow ReLU neural network. The method effectively takes advantage of both the least squares structure and the neural network structure of the objective function. By categorizing the weights and biases of the hidden and output layers of the network as nonlinear and linear parameters, respectively, the method iterates back and forth between the nonlinear and linear parameters. The nonlinear parameters are updated by a damped Gauss-Newton method and the linear ones are updated by a linear solver. Moreover, at the Gauss-Newton step, a special form of the Gauss-Newton matrix is derived for the shallow ReLU neural network and is used for efficient iterations. It is shown that the corresponding mass and Gauss-Newton matrices in the respective linear and nonlinear steps are symmetric and positive definite under reasonable assumptions. Thus, the SgGN method naturally produces an effective search direction without the need of additional techniques like shifting in the Levenberg-Marquardt method to achieve invertibility of the Gauss-Newton matrix. The convergence and accuracy of the method are demonstrated numerically for several challenging function approximation problems, especially those with discontinuities or sharp transition layers that pose significant challenges for commonly used training algorithms in machine learning.
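
A toy version of the alternation for a one-dimensional shallow ReLU fit: the output weights c are linear parameters recovered exactly by least squares, while the hidden weights and biases take a damped Gauss-Newton step built from an explicit residual Jacobian. The small ridge term below is only for numerical safety in this sketch; the paper instead derives a Gauss-Newton matrix that is symmetric positive definite by construction, so no Levenberg-Marquardt-style shift is needed.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
y = np.sign(x)                      # discontinuous target, hard for plain SGD
m = 20                              # hidden ReLU neurons
w, b = rng.standard_normal(m), rng.standard_normal(m)

for _ in range(100):
    pre = w * x[:, None] + b                      # (200, m) pre-activations
    phi = np.maximum(pre, 0.0)                    # ReLU features
    c, *_ = np.linalg.lstsq(phi, y, rcond=None)   # linear step: exact solve for c
    r = phi @ c - y                               # residual
    act = (pre > 0).astype(float)                 # ReLU derivative
    J = np.hstack([c * act * x[:, None], c * act])  # d r / d(w, b), (200, 2m)
    step = np.linalg.solve(J.T @ J + 1e-8 * np.eye(2 * m), J.T @ r)
    w, b = w - 0.5 * step[:m], b - 0.5 * step[m:]   # damped Gauss-Newton step

phi = np.maximum(w * x[:, None] + b, 0.0)
print("final MSE:", np.mean((phi @ c - y) ** 2))
```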

[647] Generalized Linear Bandits with Limited Adaptivity

Ayush Sawarni, Nirjhar Das, Siddharth Barman, Gaurav Sinha

Main category: cs.LG

TL;DR: The paper introduces two algorithms, B-GLinCB and RS-GLinCB, for generalized linear contextual bandits under limited adaptivity, achieving O~(√T) regret while removing dependence on a key parameter κ.

DetailsMotivation: To address the generalized linear contextual bandit problem under limited adaptivity constraints, where policy updates are restricted.

Method: Two algorithms are proposed: B-GLinCB for stochastic arm features with fixed update rounds, and RS-GLinCB for adversarial arm features with adaptive updates.

Result: B-GLinCB achieves O~(√T) regret with M=Ω(log log T) updates, while RS-GLinCB achieves the same regret with O~(log² T) updates, eliminating dependence on κ.

Conclusion: The paper successfully removes dependence on κ, offering efficient algorithms for limited adaptivity settings in contextual bandits.

Abstract: We study the generalized linear contextual bandit problem within the constraints of limited adaptivity. In this paper, we present two algorithms, B-GLinCB and RS-GLinCB, that address, respectively, two prevalent limited adaptivity settings. Given a budget $M$ on the number of policy updates, in the first setting, the algorithm needs to decide upfront $M$ rounds at which it will update its policy, while in the second setting it can adaptively perform $M$ policy updates during its course. For the first setting, we design an algorithm B-GLinCB, that incurs $\tilde{O}(\sqrt{T})$ regret when $M = \Omega( \log{\log T} )$ and the arm feature vectors are generated stochastically. For the second setting, we design an algorithm RS-GLinCB that updates its policy $\tilde{O}(\log^2 T)$ times and achieves a regret of $\tilde{O}(\sqrt{T})$ even when the arm feature vectors are adversarially generated. Notably, in these bounds, we manage to eliminate the dependence on a key instance dependent parameter $\kappa$, that captures non-linearity of the underlying reward model. Our novel approach for removing this dependence for generalized linear contextual bandits might be of independent interest.

[648] BARNN: A Bayesian Autoregressive and Recurrent Neural Network

Dario Coscia, Max Welling, Nicola Demo, Gianluigi Rozza

Main category: cs.LG

TL;DR: BARNN introduces a Bayesian framework for autoregressive and recurrent networks, improving uncertainty quantification while maintaining accuracy.

DetailsMotivation: Existing autoregressive and recurrent models lack a rigorous framework for uncertainty, which is critical in scientific applications like PDE solving and molecular generation.

Method: BARNN uses variational dropout and introduces a temporal Variational Mixtures of Posteriors prior (tVAMP-prior) for efficient Bayesian inference.

Result: BARNN achieves comparable or superior accuracy to existing methods and excels in uncertainty quantification and long-range dependency modeling.

Conclusion: BARNN provides a principled way to Bayesianize autoregressive and recurrent models, enhancing their utility in scientific applications.

Abstract: Autoregressive and recurrent networks have achieved remarkable progress across various fields, from weather forecasting to molecular generation and Large Language Models. Despite their strong predictive capabilities, these models lack a rigorous framework for addressing uncertainty, which is key in scientific applications such as PDE solving, molecular generation and Machine Learning Force Fields. To address this shortcoming we present BARNN: a variational Bayesian Autoregressive and Recurrent Neural Network. BARNNs aim to provide a principled way to turn any autoregressive or recurrent model into its Bayesian version. BARNN is based on the variational dropout method, allowing to apply it to large recurrent neural networks as well. We also introduce a temporal version of the “Variational Mixtures of Posteriors” prior (tVAMP-prior) to make Bayesian inference efficient and well-calibrated. Extensive experiments on PDE modelling and molecular generation demonstrate that BARNN not only achieves comparable or superior accuracy compared to existing methods, but also excels in uncertainty quantification and modelling long-range dependencies.
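
BARNN's building block, variational dropout for recurrent models, fixes one dropout mask per sequence and reuses it at every time step, so each mask draws a coherent sub-model; sampling several masks at test time gives an ensemble whose spread serves as the uncertainty estimate. A minimal sketch (the tVAMP prior and the full variational treatment are beyond it):

```python
import torch
import torch.nn as nn

class VariationalDropoutGRU(nn.Module):
    """GRU with a single dropout mask shared across all time steps."""
    def __init__(self, in_dim: int, hidden: int, p: float = 0.25):
        super().__init__()
        self.cell, self.hidden, self.p = nn.GRUCell(in_dim, hidden), hidden, p

    def forward(self, x):  # x: (batch, time, in_dim)
        B, T, _ = x.shape
        h = x.new_zeros(B, self.hidden)
        keep = torch.bernoulli(x.new_full((B, self.hidden), 1 - self.p))
        keep = keep / (1 - self.p)            # one rescaled mask, reused below
        outs = []
        for t in range(T):
            h = self.cell(x[:, t], h) * keep  # the SAME mask at every step
            outs.append(h)
        return torch.stack(outs, dim=1)

model = VariationalDropoutGRU(3, 32)
x = torch.randn(16, 20, 3)
samples = torch.stack([model(x) for _ in range(8)])  # 8 mask draws
uncertainty = samples.std(dim=0)  # spread across sub-models ~ predictive uncertainty
```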

[649] Closed-form Solutions: A New Perspective on Solving Differential Equations

Shu Wei, Yanjie Li, Lina Yu, Weijun Li, Min Wu, Linjun Sun, Jingyi Liu, Hong Qin, Yusong Deng, Jufeng Han, Yan Pang

Main category: cs.LG

TL;DR: SSDE, a reinforcement learning-based method, outperforms existing machine learning approaches in deriving symbolic solutions for differential equations with better accuracy and efficiency.

DetailsMotivation: Traditional methods require extensive mathematical expertise, and existing machine learning approaches like genetic algorithms are computationally intensive and yield complex solutions.

Method: SSDE uses reinforcement learning to derive symbolic closed-form solutions for differential equations.

Result: SSDE achieves superior accuracy and efficiency compared to existing methods across various ordinary and partial differential equations.

Conclusion: SSDE is a promising tool for solving differential equations symbolically, addressing limitations of traditional and machine learning methods.

Abstract: The quest for analytical solutions to differential equations has traditionally been constrained by the need for extensive mathematical expertise. Machine learning methods like genetic algorithms have shown promise in this domain, but are hindered by significant computational time and the complexity of their derived solutions. This paper introduces SSDE (Symbolic Solver for Differential Equations), a novel reinforcement learning-based approach that derives symbolic closed-form solutions for various differential equations. Evaluations across a diverse set of ordinary and partial differential equations demonstrate that SSDE outperforms existing machine learning methods, delivering superior accuracy and efficiency in obtaining analytical solutions.

[650] Deep Learning for Computing Convergence Rates of Markov Chains

Yanlin Qu, Jose Blanchet, Peter Glynn

Main category: cs.LG

TL;DR: The paper introduces DCDC, a sample-based algorithm for bounding Markov chain convergence in Wasserstein distance, combining a novel CDE framework with neural-network-based solving.

DetailsMotivation: Traditional methods fail to provide practical convergence bounds for realistic Markov chains, necessitating a general-purpose solution.

Method: DCDC uses the Contractive Drift Equation (CDE) and a neural-network solver to compute convergence bounds.

Result: DCDC effectively generates convergence bounds for Markov chains in stochastic processing networks and stochastic optimization.

Conclusion: DCDC offers a practical and efficient solution for convergence rate analysis in general state-space Markov chains.

Abstract: Convergence rate analysis for general state-space Markov chains is fundamentally important in areas such as Markov chain Monte Carlo and algorithmic analysis (for computing explicit convergence bounds). This problem, however, is notoriously difficult because traditional analytical methods often do not generate practically useful convergence bounds for realistic Markov chains. We propose the Deep Contractive Drift Calculator (DCDC), the first general-purpose sample-based algorithm for bounding the convergence of Markov chains to stationarity in Wasserstein distance. The DCDC has two components. First, inspired by the new convergence analysis framework in Qu, Blanchet and Glynn (2023), we introduce the Contractive Drift Equation (CDE), the solution of which leads to an explicit convergence bound. Second, we develop an efficient neural-network-based CDE solver. Equipped with these two components, DCDC solves the CDE and converts the solution into a convergence bound. We analyze the sample complexity of the algorithm and further demonstrate the effectiveness of the DCDC by generating convergence bounds for realistic Markov chains arising from stochastic processing networks as well as constant step-size stochastic optimization.

[651] Brain Foundation Models: A Survey on Advancements in Neural Signal Processing and Brain Discovery

Xinliang Zhou, Chenyu Liu, Zhisheng Chen, Kun Wang, Yi Ding, Ziyu Jia, Qingsong Wen

Main category: cs.LG

TL;DR: Brain foundation models (BFMs) are a transformative paradigm in computational neuroscience, leveraging large-scale pre-training to generalize across tasks and modalities, overcoming traditional AI limitations in brain data analysis.

DetailsMotivation: To address the limitations of conventional AI in processing complex brain data and provide a unified framework for neural signal analysis.

Method: Utilizes large-scale pre-training techniques to generalize across diverse tasks and modalities, with a focus on methodological innovations and application areas.

Result: BFMs enable advanced neural data analysis, offering a unified approach and highlighting recent advancements, challenges, and future directions.

Conclusion: BFMs hold great potential but face challenges like data quality, model optimization, training efficiency, and interpretability, which need addressing for broader real-world applications.

Abstract: Brain foundation models (BFMs) have emerged as a transformative paradigm in computational neuroscience, offering a revolutionary framework for processing diverse neural signals across different brain-related tasks. These models leverage large-scale pre-training techniques, allowing them to generalize effectively across multiple scenarios, tasks, and modalities, thus overcoming the traditional limitations faced by conventional artificial intelligence (AI) approaches in understanding complex brain data. By tapping into the power of pretrained models, BFMs provide a means to process neural data in a more unified manner, enabling advanced analysis and discovery in the field of neuroscience. In this survey, we define BFMs for the first time, providing a clear and concise framework for constructing and utilizing these models in various applications. We also examine the key principles and methodologies for developing these models, shedding light on how they transform the landscape of neural signal processing. This survey presents a comprehensive review of the latest advancements in BFMs, covering the most recent methodological innovations, novel views of application areas, and challenges in the field. Notably, we highlight the future directions and key challenges that need to be addressed to fully realize the potential of BFMs. These challenges include improving the quality of brain data, optimizing model architecture for better generalization, increasing training efficiency, and enhancing the interpretability and robustness of BFMs in real-world applications.

[652] Knockout: A simple way to handle missing inputs

Minh Nguyen, Batuhan K. Karaman, Heejong Kim, Alan Q. Wang, Fengbei Liu, Mert R. Sabuncu

Main category: cs.LG

TL;DR: Knockout is an efficient method for handling missing multimodal inputs in deep learning by randomly replacing features during training, offering strong performance without the drawbacks of marginalization, imputation, or multiple models.

DetailsMotivation: Multimodal models face challenges with missing inputs at inference, and current solutions (marginalization, imputation, multiple models) are either computationally expensive, inaccurate, or costly.

Method: Knockout randomly replaces input features with placeholder values during training, learning both conditional and marginal distributions implicitly.

Result: Knockout shows strong empirical performance across simulations and real-world datasets, outperforming traditional methods.

Conclusion: Knockout provides a practical and efficient solution for handling missing inputs in multimodal deep learning models.

Abstract: Deep learning models benefit from rich (e.g., multi-modal) input features. However, multimodal models might be challenging to deploy, because some inputs may be missing at inference. Current popular solutions include marginalization, imputation, and training multiple models. Marginalization achieves calibrated predictions, but it is computationally expensive and only feasible for low dimensional inputs. Imputation may result in inaccurate predictions, particularly when high-dimensional data, such as images, are missing. Training multiple models, where each model is designed to handle different subsets of inputs, can work well but requires prior knowledge of missing input patterns. Furthermore, training and retaining multiple models can be costly. We propose an efficient method to learn both the conditional distribution using full inputs and the marginal distributions. Our method, Knockout, randomly replaces input features with appropriate placeholder values during training. We provide a theoretical justification for Knockout and show that it can be interpreted as an implicit marginalization strategy. We evaluate Knockout across a wide range of simulations and real-world datasets and show that it offers strong empirical performance.
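
The mechanism is a one-line training-time corruption: with some probability per sample, replace an entire input (or modality) with a fixed placeholder, so the same network also learns to predict from whatever remains. A minimal sketch for a dict of modality tensors; the paper's "appropriate placeholder values" are simplified to zeros here, and the drop rate is a hypothetical choice.

```python
import torch

def knockout(batch: dict, p: float = 0.3) -> dict:
    """Randomly replace whole modalities with placeholders during training."""
    out = {}
    for name, x in batch.items():
        drop = torch.rand(x.shape[0], device=x.device) < p  # per-sample coin flip
        mask = drop.view(-1, *([1] * (x.dim() - 1)))        # broadcast over features
        out[name] = torch.where(mask, torch.zeros_like(x), x)
    return out

batch = {"image": torch.randn(8, 3, 32, 32), "tabular": torch.randn(8, 10)}
corrupted = knockout(batch)
# Training: loss = criterion(model(**corrupted), target)
# Inference: feed the same placeholder for any modality that is actually missing.
```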

[653] Attend or Perish: Benchmarking Attention in Algorithmic Reasoning

Michal Spiegel, Michal Štefánik, Marek Kadlčík, Josef Kuchař

Main category: cs.LG

TL;DR: The paper introduces AttentionSpan, a benchmark to evaluate transformers’ algorithmic reasoning by testing extrapolation and robustness across infinite input domains.

DetailsMotivation: To assess whether transformers genuinely understand algorithms or merely memorize patterns, especially in unseen input/output domains.

Method: Proposes AttentionSpan, a benchmark with five tasks of infinite input domains, analyzing attention maps and performing interventions.

Result: Attention mechanisms directly cause extrapolation failures, revealing limitations in robust algorithmic reasoning.

Conclusion: AttentionSpan provides a tool to evaluate and improve transformers’ algorithmic reliability, highlighting the need for better mechanisms.

Abstract: Can transformers learn to perform algorithmic tasks reliably across previously unseen input/output domains? While pre-trained language models show solid accuracy on benchmarks incorporating algorithmic reasoning, assessing the reliability of these results necessitates an ability to distinguish genuine algorithmic understanding from memorization. In this paper, we propose AttentionSpan, an algorithmic benchmark comprising five tasks of infinite input domains where we can disentangle and trace the correct, robust algorithm necessary for the task. This allows us to assess (i) models’ ability to extrapolate to unseen types of inputs, including new lengths, value ranges or input domains, but also (ii) to assess the robustness of their learned mechanisms. By analyzing attention maps and performing targeted interventions, we show that the attention mechanism directly causes failures in extrapolation. We make the implementation of all our tasks and interpretability methods publicly available at https://github.com/michalspiegel/AttentionSpan.

[654] ModelVerification.jl: a Comprehensive Toolbox for Formally Verifying Deep Neural Networks

Tianhao Wei, Hanjiang Hu, Luca Marzari, Kai S. Yun, Peizhi Niu, Xusheng Luo, Changliu Liu

Main category: cs.LG

TL;DR: A toolbox called ModelVerification.jl (MV) is introduced for verifying Deep Neural Networks (DNNs) with state-of-the-art methods, addressing the challenge of verifying input-output properties.

DetailsMotivation: The lack of a unified framework for verifying DNNs across various applications motivates the creation of MV.

Method: MV is a comprehensive toolbox offering a suite of advanced verification methods for different DNN types and safety specifications.

Result: MV provides robust tools for developers to verify and ensure the trustworthiness of DNN models.

Conclusion: MV is a versatile and cutting-edge solution for DNN verification, empowering practitioners with reliable verification tools.

Abstract: Deep Neural Networks (DNN) are crucial in approximating nonlinear functions across diverse applications, ranging from image classification to control. Verifying specific input-output properties can be a highly challenging task due to the lack of a single, self-contained framework that allows a complete range of verification types. To this end, we present ModelVerification.jl (MV), the first comprehensive, cutting-edge toolbox that contains a suite of state-of-the-art methods for verifying different types of DNNs and safety specifications. This versatile toolbox is designed to empower developers and machine learning practitioners with robust tools for verifying and ensuring the trustworthiness of their DNN models.

[655] State-observation augmented diffusion model for nonlinear assimilation with unknown dynamics

Zhuoyuan Li, Bin Dong, Pingwen Zhang

Main category: cs.LG

TL;DR: The proposed SOAD model handles nonlinear data assimilation with unknown dynamics, with theoretical advantages over score-based approaches and improved performance over existing data-driven methods.

DetailsMotivation: High nonlinearity in physical and observational models challenges classical assimilation algorithms.

Method: Introduces the State-Observation Augmented Diffusion (SOAD) model for data-driven assimilation.

Result: SOAD matches true posterior distribution under mild assumptions and shows improved performance.

Conclusion: SOAD offers theoretical and practical advantages over existing methods.

Abstract: Data assimilation has become a key technique for combining physical models with observational data to estimate state variables. However, classical assimilation algorithms often struggle with the high nonlinearity present in both physical and observational models. To address this challenge, a novel generative model, termed the State-Observation Augmented Diffusion (SOAD) model is proposed for data-driven assimilation. The marginal posterior associated with SOAD has been derived and then proved to match the true posterior distribution under mild assumptions, suggesting its theoretical advantages over previous score-based approaches. Experimental results also indicate that SOAD may offer improved performance compared to existing data-driven methods.

[656] Variational Mode-Driven Graph Convolutional Network for Spatiotemporal Traffic Forecasting

Osama Ahmad, Lukas Wesemann, Fabian Waschkowski, Zubair Khalid

Main category: cs.LG

TL;DR: The paper introduces VMGCN, a hybrid framework combining VMD and GNNs for spatiotemporal traffic prediction, improving accuracy and interpretability.

DetailsMotivation: Spatiotemporal traffic data is complex and non-stationary, making prediction challenging. Existing methods lack interpretability and struggle with raw or aggregated data.

Method: Proposes VMGCN: decomposes data into interpretable modes via VMD, then uses an attention-augmented GCN to learn dependencies. Optimizes mode count via reconstruction-loss minimization.

Result: VMGCN outperforms existing methods on the LargeST dataset, offering frequency-level interpretability and accuracy gains for short- and long-term predictions.

Conclusion: The two-stage VMGCN framework enhances predictive performance and interpretability, with publicly available implementation.

Abstract: This paper focuses on spatiotemporal (ST) traffic prediction using graph neural networks (GNNs). Given that ST data comprises non-stationary and complex temporal patterns, interpreting and predicting such trends is inherently challenging. Representing ST data in decomposed modes helps infer underlying behavior and assess the impact of noise on predictive performance. We propose a framework that decomposes ST data into interpretable modes using variational mode decomposition (VMD) and processes them through a neural network for future state forecasting. Unlike existing graph-based traffic forecasters that operate directly on raw or aggregated time series, the proposed hybrid approach, termed the Variational Mode Graph Convolutional Network (VMGCN), first decomposes non-stationary signals into interpretable variational modes by determining the optimal mode count via reconstruction-loss minimization and then learns both intramode and cross-mode spatiotemporal dependencies through a novel attention-augmented GCN. Additionally, we analyze the significance of each mode and the effect of bandwidth constraints on multi-horizon traffic flow predictions. The proposed two-stage design yields significant accuracy gains while providing frequency-level interpretability with demonstrated superior performance on the LargeST dataset for both short-term and long-term forecasting tasks. The implementation is publicly available on https://github.com/OsamaAhmad369/VMGCN.

[657] PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity

Kwanyoung Kim, Byeongsu Sim

Main category: cs.LG

TL;DR: PLADIS enhances pre-trained diffusion models using sparse attention, improving text alignment and human preference without extra training or NFEs.

DetailsMotivation: Existing guidance techniques for diffusion models require additional training or NFEs and rely on heuristics, limiting compatibility with guidance-distilled models.

Method: PLADIS leverages sparse attention in cross-attention layers during inference, extrapolating query-key correlations without extra training or NFEs.

Result: PLADIS significantly improves text alignment and human preference, working seamlessly with guidance techniques.

Conclusion: PLADIS offers an efficient, universally applicable solution to enhance diffusion models, unlocking their latent potential.

Abstract: Diffusion models have shown impressive results in generating high-quality conditional samples using guidance techniques such as Classifier-Free Guidance (CFG). However, existing methods often require additional training or neural function evaluations (NFEs), making them incompatible with guidance-distilled models. Also, they rely on heuristic approaches that require identifying target layers. In this work, we propose a novel and efficient method, termed PLADIS, which boosts pre-trained models (U-Net/Transformer) by leveraging sparse attention. Specifically, we extrapolate query-key correlations using softmax and its sparse counterpart in the cross-attention layer during inference, without requiring extra training or NFEs. By leveraging the noise robustness of sparse attention, our PLADIS unleashes the latent potential of text-to-image diffusion models, enabling them to excel in areas where they once struggled with newfound effectiveness. It integrates seamlessly with guidance techniques, including guidance-distilled models. See our project page: https://cubeyoung.github.io/pladis-proejct/
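
One plausible reading of the inference-time recipe, with the combination rule and scale as illustrative assumptions: compute the cross-attention weights under both softmax and a sparse counterpart (sparsemax below stands in for whatever sparse transformation the paper uses), then extrapolate past the dense weights in the sparse direction, with no retraining and no extra NFEs.

```python
import torch

def sparsemax(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Sparsemax (Martins & Astudillo, 2016): a sparse alternative to softmax."""
    zs, _ = torch.sort(z, dim=dim, descending=True)
    rng = torch.arange(1, z.size(dim) + 1, device=z.device, dtype=z.dtype)
    view = [1] * z.dim()
    view[dim] = -1
    cssv = zs.cumsum(dim) - 1
    support = (rng.view(*view) * zs > cssv).to(z.dtype)  # support-size test
    k = support.sum(dim=dim, keepdim=True)
    tau = cssv.gather(dim, k.long() - 1) / k              # threshold
    return torch.clamp(z - tau, min=0.0)

def pladis_attention(q, k, v, lam: float = 2.0):
    """Inference-time extrapolation from dense toward sparse attention weights."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    dense = torch.softmax(scores, dim=-1)
    sparse = sparsemax(scores, dim=-1)
    attn = dense + lam * (sparse - dense)  # lam = 1 recovers pure sparse attention
    return attn @ v

q, k, v = (torch.randn(2, 8, 16) for _ in range(3))
out = pladis_attention(q, k, v)
```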

[658] Brain-Inspired Online Adaptation for Remote Sensing with Spiking Neural Network

Dexin Duan, Peilin liu, Bingwei Hui, Fei Wen

Main category: cs.LG

TL;DR: Proposes an online adaptation framework using spiking neural networks (SNNs) for energy-efficient and adaptable remote sensing on edge devices.

DetailsMotivation: Addresses the need for high energy efficiency and online adaptation in remote sensing models for edge devices like satellites and UAVs.

Method: Uses pretrained SNNs with an unsupervised online adaptation algorithm, adaptive activation scaling, and confidence-based instance weighting for detection tasks.

Result: Outperforms existing domain adaptation and generalization methods across seven benchmark datasets in classification, segmentation, and detection tasks.

Conclusion: The framework enables efficient, fast adaptation on edge devices, promising for remote sensing applications.

Abstract: On-device computing, or edge computing, is becoming increasingly important for remote sensing, particularly in applications like deep network-based perception on on-orbit satellites and unmanned aerial vehicles (UAVs). In these scenarios, two brain-like capabilities are crucial for remote sensing models: (1) high energy efficiency, allowing the model to operate on edge devices with limited computing resources, and (2) online adaptation, enabling the model to quickly adapt to environmental variations, weather changes, and sensor drift. This work addresses these needs by proposing an online adaptation framework based on spiking neural networks (SNNs) for remote sensing. Starting with a pretrained SNN model, we design an efficient, unsupervised online adaptation algorithm, which adopts an approximation of the BPTT algorithm and involves only forward-in-time computation, significantly reducing the computational complexity of SNN adaptation learning. In addition, we propose an adaptive activation scaling scheme to boost online SNN adaptation performance, particularly at low time-steps. Furthermore, for the more challenging remote sensing detection task, we propose a confidence-based instance weighting scheme, which substantially improves adaptation performance in the detection task. To our knowledge, this work is the first to address the online adaptation of SNNs. Extensive experiments on seven benchmark datasets across classification, segmentation, and detection tasks demonstrate that our proposed method significantly outperforms existing domain adaptation and domain generalization approaches under varying weather conditions. The proposed method enables energy-efficient and fast online adaptation on edge devices, and has much potential in applications such as remote perception on on-orbit satellites and UAVs.
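A minimal sketch of an entropy-based confidence weighting for the unsupervised detection adaptation loss; the exact weighting rule in the paper may differ:

```python
import torch

def confidence_weights(logits, tau=1.0):
    # Confident (low-entropy) predictions get weights near 1, uncertain
    # ones near 0, so noisy pseudo-labels contribute less to adaptation.
    p = torch.softmax(logits / tau, dim=-1)
    entropy = -(p * p.clamp_min(1e-8).log()).sum(dim=-1)
    max_entropy = torch.log(torch.tensor(float(p.size(-1))))
    return 1.0 - entropy / max_entropy

# Usage inside an adaptation step (per-detection losses assumed):
# weights = confidence_weights(det_logits).detach()
# adaptation_loss = (weights * per_detection_loss).mean()
```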

[659] Fine-Tuning Diffusion Generative Models via Rich Preference Optimization

Hanyang Zhao, Haoxian Chen, Yucheng Guo, Genta Indra Winata, Tingting Ou, Ziyu Huang, David D. Yao, Wenpin Tang

Main category: cs.LG

TL;DR: Rich Preference Optimization (RPO) improves text-to-image diffusion models by using detailed critiques to create refined preference pairs, outperforming traditional methods like Diffusion-DPO.

DetailsMotivation: Traditional methods rely on opaque reward models, leading to issues like reward hacking. RPO aims to provide clearer, actionable feedback for better model tuning.

Method: Generates detailed critiques of synthesized images, extracts actionable editing instructions, and creates refined preference pairs for fine-tuning.

Result: Demonstrates effectiveness in fine-tuning state-of-the-art diffusion models with enhanced datasets.

Conclusion: RPO offers a more reliable and informative approach to preference pair curation for diffusion models.

Abstract: We introduce Rich Preference Optimization (RPO), a novel pipeline that leverages rich feedback signals to improve the curation of preference pairs for fine-tuning text-to-image diffusion models. Traditional methods, like Diffusion-DPO, often rely solely on reward model labeling, which can be opaque, offer limited insights into the rationale behind preferences, and are prone to issues such as reward hacking or overfitting. In contrast, our approach begins with generating detailed critiques of synthesized images, from which we extract reliable and actionable image editing instructions. By implementing these instructions, we create refined images, resulting in synthetic, informative preference pairs that serve as enhanced tuning datasets. We demonstrate the effectiveness of our pipeline and the resulting datasets in fine-tuning state-of-the-art diffusion models. Our code is available at https://github.com/Diffusion-RLHF/RPO.

[660] In-depth Analysis of Low-rank Matrix Factorisation in a Federated Setting

Constantin Philippenko, Kevin Scaman, Laurent Massoulié

Main category: cs.LG

TL;DR: A distributed algorithm for low-rank matrix factorization is proposed, improving convergence rates and reducing communication overhead.

DetailsMotivation: To efficiently solve low-rank matrix factorization across distributed clients with minimal communication.

Method: Uses power initialization and parallel Nesterov gradient descent, requiring only one communication step.

Result: Achieves a linear convergence rate dependent on singular values, outperforming existing methods.

Conclusion: The method is effective for distributed settings, validated by synthetic and real data experiments.

Abstract: We analyze a distributed algorithm to compute a low-rank matrix factorization on $N$ clients, each holding a local dataset $\mathbf{S}^i \in \mathbb{R}^{n_i \times d}$. Mathematically, we seek to solve $\min_{\mathbf{U}^i \in \mathbb{R}^{n_i\times r}, \mathbf{V}\in \mathbb{R}^{d \times r}} \frac{1}{2} \sum_{i=1}^N \|\mathbf{S}^i - \mathbf{U}^i \mathbf{V}^\top\|^2_{\text{F}}$. Considering a power initialization of $\mathbf{V}$, we rewrite this smooth non-convex problem into a smooth strongly-convex problem that we solve using a parallel Nesterov gradient descent, potentially requiring a single communication step at initialization. For any client $i$ in $\{1, \dots, N\}$, we obtain a global $\mathbf{V}$ in $\mathbb{R}^{d \times r}$ common to all clients and a local variable $\mathbf{U}^i$ in $\mathbb{R}^{n_i \times r}$. We provide a linear rate of convergence of the excess loss which depends on $\sigma_{\max} / \sigma_{r}$, where $\sigma_{r}$ is the $r^{\mathrm{th}}$ singular value of the concatenation $\mathbf{S}$ of the matrices $(\mathbf{S}^i)_{i=1}^N$. This improves on the rates of convergence given in the literature, which depend on $\sigma_{\max}^2 / \sigma_{\min}^2$. We provide an upper bound on the Frobenius-norm error of reconstruction under the power initialization strategy. We complete our analysis with experiments on both synthetic and real data.
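A small NumPy sketch of the overall scheme, with the one-shot power initialization of V followed by each client's local solve; for brevity the strongly-convex local subproblem is solved in closed form rather than with parallel Nesterov gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 50, 5
S = [rng.standard_normal((n_i, d)) for n_i in (30, 40, 20)]  # local datasets

# Power initialization: one communication round aggregates the d x d Grams.
G = sum(Si.T @ Si for Si in S)
_, _, Vt = np.linalg.svd(G)
V = Vt[:r].T                      # shared right factor, common to all clients

# With V fixed, min_Ui ||S_i - U_i V^T||_F^2 is strongly convex per client;
# the closed-form solution stands in for the local Nesterov iterations.
U = [Si @ V @ np.linalg.inv(V.T @ V) for Si in S]
loss = sum(np.linalg.norm(Si - Ui @ V.T, "fro") ** 2 for Si, Ui in zip(S, U))
print("reconstruction loss:", loss)
```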

[661] Federated Continual Instruction Tuning

Haiyang Guo, Fanhu Zeng, Fei Zhu, Wenzhuo Liu, Da-Han Wang, Jian Xu, Xu-Yao Zhang, Cheng-Lin Liu

Main category: cs.LG

TL;DR: The paper introduces Federated Continual Instruction Tuning (FCIT) to address the challenges of continuous learning in federated settings, proposing dynamic knowledge organization and subspace selective activation to improve performance.

DetailsMotivation: High computational costs and data demands for instruction tuning in Large Multimodal Models (LMMs) make federated learning (FL) appealing, but existing FL methods assume fixed tasks, unlike real-world scenarios where clients encounter new tasks continuously.

Method: Proposes FCIT benchmark with realistic scenarios, dynamic knowledge organization for integrating task updates, and subspace selective activation for task-specific outputs.

Result: The method significantly improves model performance under data heterogeneity and reduces catastrophic forgetting.

Conclusion: FCIT and the proposed techniques effectively address continuous learning challenges in federated settings, with code and datasets made publicly available.

Abstract: A vast amount of instruction tuning data is crucial for the impressive performance of Large Multimodal Models (LMMs), but the associated computational costs and data collection demands during supervised fine-tuning make it impractical for most researchers. Federated learning (FL) has the potential to leverage all distributed data and training resources to reduce the overhead of joint training. However, most existing methods assume a fixed number of tasks, while in real-world scenarios, clients continuously encounter new knowledge and often struggle to retain old tasks due to memory constraints. In this work, we introduce the Federated Continual Instruction Tuning (FCIT) benchmark to model this real-world challenge. Our benchmark includes two realistic scenarios, encompassing four different settings and twelve carefully curated instruction tuning datasets. To address the challenges posed by FCIT, we propose dynamic knowledge organization to effectively integrate updates from different tasks during training and subspace selective activation to allocate task-specific output during inference. Extensive experimental results demonstrate that our proposed method significantly enhances model performance across varying levels of data heterogeneity and catastrophic forgetting. Code and dataset are released at https://github.com/Ghy0501/FCIT.

[662] Trustworthy Text-to-Image Diffusion Models: A Timely and Focused Survey

Yi Zhang, Zhen Chen, Chih-Hong Cheng, Wenjie Ruan, Xiaowei Huang, Dezong Zhao, David Flynn, Siddartha Khastgir, Xingyu Zhao

Main category: cs.LG

TL;DR: The paper surveys trustworthiness in Text-to-Image Diffusion Models, addressing gaps in robustness, fairness, security, privacy, factuality, and explainability, and proposes future research directions.

DetailsMotivation: The growing popularity of T2I DMs raises ethical and social concerns about trustworthiness, necessitating a focused review to address gaps and propose solutions.

Method: The survey provides a taxonomy of trustworthiness in T2I DMs, covering properties, means, benchmarks, and applications, and analyzes recent literature.

Result: The review identifies gaps in current research, limitations of existing methods, and suggests future directions for trustworthy T2I DMs.

Conclusion: The paper highlights the need for further research to enhance trustworthiness in T2I DMs and maintains an updated GitHub repository for ongoing developments.

Abstract: Text-to-Image (T2I) Diffusion Models (DMs) have garnered widespread attention for their impressive advancements in image generation. However, their growing popularity has raised ethical and social concerns related to key non-functional properties of trustworthiness, such as robustness, fairness, security, privacy, factuality, and explainability, similar to those in traditional deep learning (DL) tasks. Conventional approaches for studying trustworthiness in DL tasks often fall short due to the unique characteristics of T2I DMs, e.g., their multi-modal nature. Given the challenge, recent efforts have been made to develop new methods for investigating trustworthiness in T2I DMs via various means, including falsification, enhancement, verification & validation, and assessment. However, there is a notable lack of in-depth analysis concerning those non-functional properties and means. In this survey, we provide a timely and focused review of the literature on trustworthy T2I DMs, covering a concise-structured taxonomy from the perspectives of property, means, benchmarks, and applications. Our review begins with an introduction to essential preliminaries of T2I DMs, after which we summarise key definitions/metrics specific to T2I tasks and analyse the means proposed in recent literature based on these definitions/metrics. Additionally, we review benchmarks and domain applications of T2I DMs. Finally, we highlight the gaps in current research, discuss the limitations of existing methods, and propose future research directions to advance the development of trustworthy T2I DMs. Furthermore, we keep up-to-date updates in this field to track the latest developments and maintain our GitHub repository at: https://github.com/wellzline/Trustworthy_T2I_DMs

[663] Sampling Decisions

Michael Chertkov, Sungsoo Ahn, Hamidreza Behjoo

Main category: cs.LG

TL;DR: The paper introduces a Decision Flow (DF) framework for sampling decisions from a target distribution using prior guidance, extending existing methods like MDP and GFN, and demonstrates its efficiency with the Ising model.

DetailsMotivation: To develop a sampling framework that combines prior guidance with target distributions, improving efficiency and generalizability over existing methods.

Method: DF leverages linear solvability of MDPs to adjust transition probabilities of a prior sampler, forming a Markov process as a convolution of reverse-time Green’s function and the target distribution.

Result: DF is illustrated with the Ising model, showing efficiency gains over Metropolis-Hastings, and potential for NN-based extensions.

Conclusion: DF enhances guided sampling, offering a flexible and efficient approach for various applications.

Abstract: In this manuscript, we introduce a novel Decision Flow (DF) framework for sampling decisions from a target distribution while incorporating additional guidance from a prior sampler. DF can be viewed as an AI-driven algorithmic reincarnation of the Markov Decision Process (MDP) approach in stochastic optimal control. It extends the continuous-space, continuous-time Path Integral Diffusion sampling technique of [Behjoo, Chertkov 2025] to discrete time and space, while also generalizing the Generative Flow Network (GFN) framework of [Bengio, et al 2021]. In its most basic form, an explicit formulation that does not require Neural Networks (NNs), DF leverages the linear solvability of the underlying MDP [Todorov, 2007] to adjust the transition probabilities of the prior sampler. The resulting Markov process is expressed as a convolution of the reverse-time Green's function of the prior sampling with the target distribution. We illustrate the DF framework with an example of sampling from the Ising model, compare DF to Metropolis-Hastings to quantify its efficiency, discuss potential NN-based extensions, and outline how DF can enhance guided sampling across various applications.
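A toy finite-state sketch of adjusting a prior sampler's transition probabilities with a backward, reverse-time-Green's-function-like recursion; the Doob h-transform below is a simplified stand-in for the DF construction, not the paper's exact algorithm:

```python
import numpy as np

P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])   # prior transition kernel
mu0 = np.array([1.0, 0.0, 0.0])  # initial distribution (a point mass)
rho = np.array([0.1, 0.2, 0.7])  # target marginal at horizon T
T = 4

muT = mu0 @ np.linalg.matrix_power(P, T)  # prior marginal at time T
z = rho / muT                             # terminal desirability

kernels = []
for _ in range(T):
    tilted = P * z[None, :]               # reweight transitions by z(s')
    kernels.append(tilted / tilted.sum(axis=1, keepdims=True))
    z = P @ z                             # backward recursion z_t = P z_{t+1}
kernels = kernels[::-1]                   # kernels[t] guides forward step t

mu = mu0.copy()
for K in kernels:                         # pushing mu0 forward recovers rho
    mu = mu @ K
assert np.allclose(mu, rho)
```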

[664] TensorSocket: Shared Data Loading for Deep Learning Training

Ties Robroek, Neil Kim Nielsen, Pınar Tözün

Main category: cs.LG

TL;DR: TensorSocket reduces computational needs in deep learning training by enabling shared data loaders, improving efficiency and reducing costs.

DetailsMotivation: The repetitive and resource-intensive nature of deep learning training, especially in hyper-parameter tuning and architecture search, creates inefficiencies and high costs due to redundant data processing.

Method: TensorSocket allows simultaneous training processes to share the same data loader, reducing redundant computations and leveraging GPU-GPU interconnects. It supports differently-sized models and batch sizes while being hardware-agnostic.

Result: TensorSocket increases training throughput by up to 100%, reduces CPU resource needs by 50%, and outperforms state-of-the-art solutions like CoorDL and Joader.

Conclusion: TensorSocket effectively addresses CPU bottlenecks in deep learning training, offering significant efficiency gains, cost savings, and ease of deployment.

Abstract: Training deep learning models is a repetitive and resource-intensive process. Data scientists often train several models before landing on a set of parameters (e.g., hyper-parameter tuning) and model architecture (e.g., neural architecture search), among other things that yield the highest accuracy. The computational efficiency of these training tasks depends highly on how well the training data is supplied to the training process. The repetitive nature of these tasks results in the same data processing pipelines running over and over, exacerbating the need for and costs of computational resources. In this paper, we present TensorSocket to reduce the computational needs of deep learning training by enabling simultaneous training processes to share the same data loader. TensorSocket mitigates CPU-side bottlenecks in cases where the collocated training workloads have high throughput on GPU, but are held back by lower data-loading throughput on CPU. TensorSocket achieves this by reducing redundant computations and data duplication across collocated training processes and leveraging modern GPU-GPU interconnects. While doing so, TensorSocket is able to train and balance differently-sized models and serve multiple batch sizes simultaneously and is hardware- and pipeline-agnostic in nature. Our evaluation shows that TensorSocket enables scenarios that are infeasible without data sharing, increases training throughput by up to 100%, and when utilizing cloud instances, achieves cost savings of 50% by reducing the hardware resource needs on the CPU side. Furthermore, TensorSocket outperforms the state-of-the-art solutions for shared data loading such as CoorDL and Joader; it is easier to deploy and maintain and either achieves higher or matches their throughput while requiring fewer CPU resources.
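A toy illustration of the sharing pattern: one producer materializes each batch once and fans it out to collocated trainers. TensorSocket itself shares via more efficient mechanisms (shared memory, GPU-GPU interconnects), so the queue-based design and shapes here are purely illustrative:

```python
import multiprocessing as mp
import numpy as np

def shared_loader(queues, n_batches=10, batch_size=32):
    # The data pipeline runs once; every training process gets each batch.
    for _ in range(n_batches):
        batch = np.random.rand(batch_size, 3, 32, 32).astype(np.float32)
        for q in queues:
            q.put(batch)
    for q in queues:
        q.put(None)  # sentinel: end of data

def trainer(q, name):
    while (batch := q.get()) is not None:
        pass  # forward/backward pass on `batch` would go here
    print(name, "finished")

if __name__ == "__main__":
    queues = [mp.Queue(maxsize=4) for _ in range(2)]  # two collocated jobs
    procs = [mp.Process(target=shared_loader, args=(queues,))]
    procs += [mp.Process(target=trainer, args=(q, f"trainer-{i}"))
              for i, q in enumerate(queues)]
    for p in procs: p.start()
    for p in procs: p.join()
```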

[665] Vector Quantization Prompting for Continual Learning

Li Jiao, Qiuxia Lai, Yu Li, Qiang Xu

Main category: cs.LG

TL;DR: VQ-Prompt introduces Vector Quantization (VQ) for discrete prompt learning in continual learning, optimizing prompt selection with task loss and improving task knowledge abstraction.

DetailsMotivation: Existing prompt-based methods for continual learning suffer from sub-optimal prompt selection due to unoptimized identity prediction and lack of abstraction in continuous prompts.

Method: VQ-Prompt integrates Vector Quantization (VQ) into end-to-end training of discrete prompts, enabling optimized selection and better task knowledge representation.

Result: VQ-Prompt outperforms state-of-the-art continual learning methods in class-incremental benchmarks.

Conclusion: VQ-Prompt effectively addresses prompt selection and abstraction challenges, enhancing continual learning performance.

Abstract: Continual learning requires to overcome catastrophic forgetting when training a single model on a sequence of tasks. Recent top-performing approaches are prompt-based methods that utilize a set of learnable parameters (i.e., prompts) to encode task knowledge, from which appropriate ones are selected to guide the fixed pre-trained model in generating features tailored to a certain task. However, existing methods rely on predicting prompt identities for prompt selection, where the identity prediction process cannot be optimized with task loss. This limitation leads to sub-optimal prompt selection and inadequate adaptation of pre-trained features for a specific task. Previous efforts have tried to address this by directly generating prompts from input queries instead of selecting from a set of candidates. However, these prompts are continuous, which lack sufficient abstraction for task knowledge representation, making them less effective for continual learning. To address these challenges, we propose VQ-Prompt, a prompt-based continual learning method that incorporates Vector Quantization (VQ) into end-to-end training of a set of discrete prompts. In this way, VQ-Prompt can optimize the prompt selection process with task loss and meanwhile achieve effective abstraction of task knowledge for continual learning. Extensive experiments show that VQ-Prompt outperforms state-of-the-art continual learning methods across a variety of benchmarks under the challenging class-incremental setting. The code is available at https://github.com/jiaolifengmi/VQ-Prompt.
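A generic vector-quantization layer with a straight-through gradient, sketching how discrete prompt selection can be optimized with the task loss; dimensions and the commitment weight are illustrative, not the paper's settings:

```python
import torch
import torch.nn.functional as F

class VQPromptSelector(torch.nn.Module):
    def __init__(self, num_prompts=8, dim=64, beta=0.25):
        super().__init__()
        self.codebook = torch.nn.Parameter(torch.randn(num_prompts, dim))
        self.beta = beta

    def forward(self, query):                    # query: (B, dim)
        dists = torch.cdist(query, self.codebook)
        idx = dists.argmin(dim=1)                # discrete prompt identity
        prompt = self.codebook[idx]
        # VQ-VAE-style terms keep the codebook and query encoder aligned.
        vq_loss = (F.mse_loss(prompt, query.detach())
                   + self.beta * F.mse_loss(query, prompt.detach()))
        # Straight-through estimator: gradients pass to `query` as if
        # quantization were the identity, so task loss reaches selection.
        prompt = query + (prompt - query).detach()
        return prompt, idx, vq_loss
```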

[666] Measuring Leakage in Concept-Based Methods: An Information Theoretic Approach

Mikael Makonnen, Moritz Vandenhirtz, Sonia Laguna, Julia E Vogt

Main category: cs.LG

TL;DR: The paper introduces an information-theoretic measure to quantify unintended information leakage in Concept Bottleneck Models (CBMs), validated through synthetic experiments.

DetailsMotivation: CBMs aim for interpretability, but unintended information leakage compromises transparency, necessitating a method to quantify such leakage.

Method: An information-theoretic measure is proposed and tested in controlled synthetic experiments, analyzing leakage trends across configurations.

Result: Feature and concept dimensionality significantly affect leakage, and XGBoost is the most reliable classifier for stable measurement. The measure also behaves as expected in soft joint CBMs.

Conclusion: The measure effectively quantifies leakage in synthetic settings, with potential for future application to real-world datasets.

Abstract: Concept Bottleneck Models (CBMs) aim to enhance interpretability by structuring predictions around human-understandable concepts. However, unintended information leakage, where predictive signals bypass the concept bottleneck, compromises their transparency. This paper introduces an information-theoretic measure to quantify leakage in CBMs, capturing the extent to which concept embeddings encode additional, unintended information beyond the specified concepts. We validate the measure through controlled synthetic experiments, demonstrating its effectiveness in detecting leakage trends across various configurations. Our findings highlight that feature and concept dimensionality significantly influence leakage, and that classifier choice impacts measurement stability, with XGBoost emerging as the most reliable estimator. Additionally, preliminary investigations indicate that the measure exhibits the anticipated behavior when applied to soft joint CBMs, suggesting its reliability in leakage quantification beyond fully synthetic settings. While this study rigorously evaluates the measure in controlled synthetic experiments, future work can extend its application to real-world datasets.
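The paper's measure is information-theoretic; as a loose proxy for the same intuition, one can ask how much better the task label is predicted from concept embeddings than from the ground-truth concepts alone, using XGBoost, which the authors found most stable. This is an illustrative stand-in, not the proposed measure:

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

def leakage_proxy(concepts, embeddings, y):
    # Accuracy gain from adding embeddings on top of the true concepts:
    # a positive gap suggests signal bypassing the concept bottleneck.
    acc_concepts = cross_val_score(XGBClassifier(), concepts, y, cv=5).mean()
    both = np.hstack([concepts, embeddings])
    acc_both = cross_val_score(XGBClassifier(), both, y, cv=5).mean()
    return acc_both - acc_concepts
```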

[667] Constrained Optimal Fuel Consumption of HEVs under Observational Noise

Shuchang Yan, Haoran Sun

Main category: cs.LG

TL;DR: The paper reformulates the constrained optimal fuel consumption (COFC) problem for hybrid electric vehicles (HEVs) by incorporating observational noise in SOC and speed, using robust constrained reinforcement learning (CRL) to ensure stability and performance under real-world conditions.

DetailsMotivation: Address the limitations of prior work by accounting for sensor noise and deviations in reference speeds, common in real-world HEV applications.

Method: Adopt a robust CRL approach with uniform noise modeling and structured training. Evaluate using simulations on the Toyota Prius hybrid system under NEDC and WLTC cycles.

Result: Fuel consumption and SOC constraint satisfaction remain robust across noise levels, with varying impacts from SOC and speed noise.

Conclusion: This is the first study to analyze the effects of observational noise on COFC in HEVs, demonstrating robustness in real-world scenarios.

Abstract: In our prior work, we investigated the minimum fuel consumption of a hybrid electric vehicle (HEV) under a state-of-charge (SOC) balance constraint, assuming perfect SOC measurements and accurate reference speed profiles. The constrained optimal fuel consumption (COFC) problem was addressed using a constrained reinforcement learning (CRL) framework. However, in real-world scenarios, SOC readings are often corrupted by sensor noise, and reference speeds may deviate from actual driving conditions. To account for these imperfections, this study reformulates the COFC problem by explicitly incorporating observational noise in both SOC and reference speed. We adopt a robust CRL approach, where the noise is modeled as a uniform distribution, and employ a structured training procedure to ensure stability. The proposed method is evaluated through simulations on the Toyota Prius hybrid system (THS), using both the New European Driving Cycle (NEDC) and the Worldwide Harmonized Light Vehicles Test Cycle (WLTC). Results show that fuel consumption and SOC constraint satisfaction remain robust across varying noise levels. Furthermore, the analysis reveals that observational noise in SOC and speed can impact fuel consumption to different extents. To the best of our knowledge, this is the first study to explicitly examine how observational noise – commonly encountered in dynamometer testing and predictive energy control (PEC) applications – affects constrained optimal fuel consumption in HEVs.
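A minimal sketch of the observational-noise model, wrapping an HEV simulation so the agent sees uniformly perturbed SOC and reference speed; the interface and observation layout are assumptions for illustration:

```python
import numpy as np

class NoisyObservations:
    def __init__(self, env, soc_eps=0.02, speed_eps=1.0, seed=0):
        self.env = env
        self.soc_eps, self.speed_eps = soc_eps, speed_eps
        self.rng = np.random.default_rng(seed)

    def _corrupt(self, obs):
        obs = obs.copy()
        obs[0] += self.rng.uniform(-self.soc_eps, self.soc_eps)      # SOC
        obs[1] += self.rng.uniform(-self.speed_eps, self.speed_eps)  # speed
        return obs

    def reset(self):
        return self._corrupt(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._corrupt(obs), reward, done, info
```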

[668] Enhanced Pruning Strategy for Multi-Component Neural Architectures Using Component-Aware Graph Analysis

Ganesh Sundaram, Jonas Ulmen, Daniel Görges

Main category: cs.LG

TL;DR: A component-aware pruning strategy for Multi-Component Neural Architectures (MCNAs) improves sparsity and reduces performance degradation while maintaining network integrity.

DetailsMotivation: Deep neural networks (DNNs) are complex and resource-intensive, limiting deployment in constrained settings. Existing pruning methods risk network integrity in MCNAs.

Method: Extends dependency graphs to isolate components and inter-component flows, creating smaller, targeted pruning groups.

Result: Achieves greater sparsity and reduced performance degradation, demonstrated effectively on a control task.

Conclusion: The approach optimizes complex, multi-component DNNs efficiently while conserving functional integrity.

Abstract: Deep neural networks (DNNs) deliver outstanding performance, but their complexity often prohibits deployment in resource-constrained settings. Comprehensive structured pruning frameworks based on parameter dependency analysis reduce model size with specific regard to computational performance. When applying them to Multi-Component Neural Architectures (MCNAs), they risk network integrity by removing large parameter groups. We introduce a component-aware pruning strategy, extending dependency graphs to isolate individual components and inter-component flows. This creates smaller, targeted pruning groups that conserve functional integrity. Demonstrated effectively on a control task, our approach achieves greater sparsity and reduced performance degradation, opening a path for optimizing complex, multi-component DNNs efficiently.

[669] Transparent Trade-offs between Properties of Explanations

Hiwot Belay Tadesse, Alihan Hüyük, Yaniv Yacoby, Weiwei Pan, Finale Doshi-Velez

Main category: cs.LG

TL;DR: The paper critiques existing methods for explaining black-box ML models, showing they fail to consistently achieve desired properties. It proposes direct optimization of explanations for better control and consistency.

DetailsMotivation: Existing explanation methods for black-box ML models inadequately ensure desirable properties and lack user control over trade-offs between conflicting properties.

Method: The authors propose directly optimizing explanations for desired properties, enabling consistent results and user control over trade-offs.

Result: Direct optimization produces explanations with optimal properties more consistently and allows users to prioritize specific properties as needed.

Conclusion: The direct optimization approach outperforms existing methods by ensuring desired properties and offering flexible control over explanation trade-offs.

Abstract: When explaining black-box machine learning models, it's often important for explanations to have certain desirable properties. Most existing methods 'encourage' desirable properties in their construction of explanations. In this work, we demonstrate that these forms of encouragement do not consistently create explanations with the properties that are supposedly being targeted. Moreover, they do not allow for any control over which properties are prioritized when different properties are at odds with each other. We propose to directly optimize explanations for desired properties. Our direct approach not only produces explanations with optimal properties more consistently but also empowers users to control trade-offs between different properties, allowing them to create explanations with exactly what is needed for a particular task.
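A minimal sketch of optimizing an attribution vector directly for two properties, fidelity to the model in a local neighborhood and sparsity, with an explicit trade-off weight; the objective is illustrative, not the paper's exact formulation:

```python
import torch

def optimize_explanation(f, x, lam=0.1, steps=200, lr=0.05):
    # e is a linear attribution over features; lam trades fidelity
    # against sparsity, making the property trade-off user-controlled.
    e = torch.zeros(x.numel(), requires_grad=True)
    opt = torch.optim.Adam([e], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        X = x * (1 + 0.1 * torch.randn(64, x.numel()))  # local samples
        fidelity = ((X @ e - f(X)) ** 2).mean()
        loss = fidelity + lam * e.abs().sum()
        loss.backward()
        opt.step()
    return e.detach()

f = lambda X: 2 * X[:, 0] - X[:, 1]   # toy black-box model
print(optimize_explanation(f, torch.ones(5)))
```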

[670] Return Capping: Sample-Efficient CVaR Policy Gradient Optimisation

Harry Mead, Clarissa Costen, Bruno Lacerda, Nick Hawes

Main category: cs.LG

TL;DR: Reformulating CVaR optimization by capping trajectory returns improves sample efficiency over current methods that discard trajectories.

DetailsMotivation: Current methods for CVaR optimization via policy gradients discard many trajectories, leading to poor sample efficiency.

Method: Proposes capping the total return of trajectories used in training instead of discarding them, showing equivalence to the original problem with proper cap settings.

Result: Empirical results demonstrate consistently improved performance across various environments.

Conclusion: The reformulation enhances sample efficiency and performance, with code made publicly available.

Abstract: When optimising for conditional value at risk (CVaR) using policy gradients (PG), current methods rely on discarding a large proportion of trajectories, resulting in poor sample efficiency. We propose a reformulation of the CVaR optimisation problem by capping the total return of trajectories used in training, rather than simply discarding them, and show that this is equivalent to the original problem if the cap is set appropriately. We show, with empirical results in a number of environments, that this reformulation of the problem results in consistently improved performance compared to baselines. We have made all our code available here: https://github.com/HarryMJMead/cvar-return-capping.
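A minimal sketch of the reformulation: instead of keeping only the worst tail of trajectories, returns above the empirical alpha-quantile are capped, so every trajectory still contributes gradient signal (cap estimation and the surrounding PG machinery are simplified here):

```python
import numpy as np

def capped_returns(returns, alpha=0.1):
    # CVaR_alpha concerns the worst alpha-fraction of outcomes; capping at
    # the empirical VaR keeps all samples instead of discarding 1 - alpha.
    cap = np.quantile(returns, alpha)
    return np.minimum(returns, cap)

G = np.random.default_rng(0).normal(size=1000)   # toy trajectory returns
print(capped_returns(G).mean())   # capped mean enters the PG objective
```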

[671] Attention-Based Reconstruction of Full-Field Tsunami Waves from Sparse Tsunameter Networks

Edward McDugald, Arvind Mohan, Darren Engwirda, Agnese Marcato, Javier Santos

Main category: cs.LG

TL;DR: The Senseiver, an attention-based neural network, improves tsunami forecasting by reconstructing high-resolution wavefields from sparse data, even for untrained epicenters, outperforming traditional methods.

DetailsMotivation: To enhance tsunami forecasting accuracy by leveraging sparse tsunameter data, especially in cases where epicenters are not part of the training set.

Method: Uses the Senseiver, an attention-based neural network, to reconstruct high-resolution tsunami wavefields from sparse observations.

Result: Outperforms Linear Interpolation with Huygens-Fresnel Principle, achieving higher accuracy in dense observation networks.

Conclusion: The Senseiver offers a promising approach for sparse sensing in tsunami forecasting, with superior performance over traditional methods.

Abstract: We investigate the potential of an attention-based neural network architecture, the Senseiver, for sparse sensing in tsunami forecasting. Specifically, we focus on the Tsunami Data Assimilation Method, which generates forecasts from tsunameter networks. Our model is used to reconstruct high-resolution tsunami wavefields from extremely sparse observations, including cases where the tsunami epicenters are not represented in the training set. Furthermore, we demonstrate that our approach significantly outperforms the Linear Interpolation with Huygens-Fresnel Principle in generating dense observation networks, achieving markedly improved accuracy.

[672] Splitting criteria for ordinal decision trees: an experimental study

Rafael Ayllón-Gavilán, Francisco José Martínez-Estudillo, David Guijo-Rubio, César Hervás-Martínez, Pedro Antonio Gutiérrez

Main category: cs.LG

TL;DR: The paper surveys ordinal splitting criteria for decision trees in Ordinal Classification (OC), comparing three ordinal criteria (OGini, WIG, RI) to nominal ones (Gini, information gain). OGini outperforms others, reducing error by 3.02%.

DetailsMotivation: OC tasks require methods that account for label order, but nominal approaches are often misused, leading to suboptimal results. Ordinal tree-based methods are understudied.

Method: The study standardizes notations, compares ordinal and nominal splitting criteria in decision trees, and evaluates them on 45 OC datasets using OC metrics.

Result: OGini is the best ordinal criterion, reducing mean absolute error by 3.02% compared to Gini.

Conclusion: The work provides a standardized comparison, highlighting OGini’s superiority, and shares resources for reproducibility.

Abstract: Ordinal Classification (OC) addresses those classification tasks where the labels exhibit a natural order. Unlike nominal classification, which treats all classes as mutually exclusive and unordered, OC takes the ordinal relationship into account, producing more accurate and relevant results. This is particularly critical in applications where the magnitude of classification errors has significant consequences. Despite this, OC problems are often tackled using nominal methods, leading to suboptimal solutions. Although decision trees are among the most popular classification approaches, ordinal tree-based approaches have received less attention when compared to other classifiers. This work provides a comprehensive survey of ordinal splitting criteria, standardising the notations used in the literature to enhance clarity and consistency. Three ordinal splitting criteria, Ordinal Gini (OGini), Weighted Information Gain (WIG), and Ranking Impurity (RI), are compared to the nominal counterparts of the first two (Gini and information gain), by incorporating them into a decision tree classifier. An extensive repository considering 45 publicly available OC datasets is presented, supporting the first experimental comparison of ordinal and nominal splitting criteria using well-known OC evaluation metrics. The results have been statistically analysed, highlighting that OGini stands out as the best ordinal splitting criterion to date, reducing the mean absolute error achieved by Gini by more than 3.02%. To promote reproducibility, all source code developed, a detailed guide for reproducing the results, the 45 OC datasets, and the individual results for all the evaluated methodologies are provided.
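As a hedged illustration, one common way to write an ordinal Gini impurity replaces class probabilities with cumulative probabilities, so mixing distant labels looks less pure than mixing adjacent ones; whether this matches the paper's exact OGini definition is an assumption:

```python
import numpy as np

def gini(p):
    return 1.0 - np.sum(p ** 2)

def ogini(p):
    # Cumulative-distribution form: mixing distant labels costs more.
    F = np.cumsum(p)
    return np.sum(F * (1.0 - F))

# Same nominal Gini, different ordinal purity: mass on adjacent classes
# (first line) is ordinally purer than mass on extreme classes (second).
print(gini(np.array([0.5, 0.5, 0.0])), ogini(np.array([0.5, 0.5, 0.0])))
print(gini(np.array([0.5, 0.0, 0.5])), ogini(np.array([0.5, 0.0, 0.5])))
```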

[673] HI-PMK: A Data-Dependent Kernel for Incomplete Heterogeneous Data Representation

Youran Zhou, Mohamed Reda Bouadjenek, Jonathan Wells, Sunil Aryal

Main category: cs.LG

TL;DR: HI-PMK is a novel method for handling incomplete and heterogeneous data without imputation, using a probability mass-based dissimilarity measure and missingness-aware uncertainty strategy.

DetailsMotivation: Addressing challenges of incomplete and heterogeneous data in real-world machine learning, where existing methods like imputation introduce bias or fail to handle data complexity.

Method: HI-PMK uses a probability mass-based dissimilarity measure for heterogeneous features and a MaxU strategy for missingness mechanisms (MCAR, MAR, MNAR).

Result: Outperforms traditional imputation-based methods and kernel approaches on 15 benchmark datasets.

Conclusion: HI-PMK is a privacy-preserving, scalable solution for incomplete and heterogeneous data, suitable for downstream tasks like classification and clustering.

Abstract: Handling incomplete and heterogeneous data remains a central challenge in real-world machine learning, where missing values may follow complex mechanisms (MCAR, MAR, MNAR) and features can be of mixed types (numerical and categorical). Existing methods often rely on imputation, which may introduce bias or privacy risks, or fail to jointly address data heterogeneity and structured missingness. We propose the Heterogeneous Incomplete Probability Mass Kernel (HI-PMK), a novel data-dependent representation learning approach that eliminates the need for imputation. HI-PMK introduces two key innovations: (1) a probability mass-based dissimilarity measure that adapts to local data distributions across heterogeneous features (numerical, ordinal, nominal), and (2) a missingness-aware uncertainty strategy (MaxU) that conservatively handles all three missingness mechanisms by assigning maximal plausible dissimilarity to unobserved entries. Our approach is privacy-preserving, scalable, and readily applicable to downstream tasks such as classification and clustering. Extensive experiments on over 15 benchmark datasets demonstrate that HI-PMK consistently outperforms traditional imputation-based pipelines and kernel methods across a wide range of missing data settings. Code is available at: https://github.com/echoid/Incomplete-Heter-Kernel
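A simplified numeric-feature sketch of the two ingredients: a probability-mass dissimilarity that adapts to the local data distribution (the mass between two values, rather than their geometric distance), and the MaxU rule assigning maximal plausible dissimilarity to missing entries. The full method also handles ordinal and nominal features:

```python
import numpy as np

def pm_dissimilarity(X, j, a, b):
    # Mass-based dissimilarity for numeric feature j: the fraction of
    # observed data falling between the two values.
    if np.isnan(a) or np.isnan(b):
        return 1.0                  # MaxU: maximal plausible dissimilarity
    col = X[:, j]
    col = col[~np.isnan(col)]
    lo, hi = (a, b) if a <= b else (b, a)
    return float(np.mean((col >= lo) & (col <= hi)))

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
X[rng.random(X.shape) < 0.1] = np.nan     # inject missingness
print(pm_dissimilarity(X, 0, -0.1, 0.1))  # narrow gap, dense region: notable mass
print(pm_dissimilarity(X, 0, 2.5, 3.0))   # wide gap, sparse tail: little mass
```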

[674] Beyond Win Rates: A Clustering-Based Approach to Character Balance Analysis in Team-Based Games

Haokun Zhou

Main category: cs.LG

TL;DR: A clustering-based method is proposed to analyze character balance in competitive games like Valorant, using in-game data to reveal latent roles and synergies.

DetailsMotivation: Traditional balance metrics (win/pick rates) lack depth for team-based games, necessitating a nuanced approach to understand character dynamics.

Method: Hierarchical agglomerative clustering with Jensen-Shannon Divergence is applied to Valorant Champions Tour 2022 data to identify agent clusters based on co-occurrence patterns.

Result: Distinct clusters of agents with similar team composition roles are identified, providing deeper insights into synergies and imbalances.

Conclusion: The method offers a holistic, interpretable tool for game developers to make context-aware balance adjustments.

Abstract: Character diversity in competitive games, while enriching gameplay, often introduces balance challenges that can negatively impact player experience and strategic depth. Traditional balance assessments rely on aggregate metrics like win rates and pick rates, which offer limited insight into the intricate dynamics of team-based games and nuanced character roles. This paper proposes a novel clustering-based methodology to analyze character balance, leveraging in-game data from Valorant to account for team composition influences and reveal latent character roles. By applying hierarchical agglomerative clustering with Jensen-Shannon Divergence to professional match data from the Valorant Champions Tour 2022, our approach identifies distinct clusters of agents exhibiting similar co-occurrence patterns within team compositions. This method not only complements existing quantitative metrics but also provides a more holistic and interpretable perspective on character synergies and potential imbalances, offering game developers a valuable tool for informed and context-aware balance adjustments.
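A compact sketch of the pipeline on synthetic data: per-agent co-occurrence distributions, pairwise Jensen-Shannon distances, then average-linkage agglomerative clustering. Real inputs would be the VCT 2022 team-composition counts, and the linkage method here is an assumption:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon, squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
cooc = rng.random((10, 10))                 # agent-by-agent co-occurrence
cooc = (cooc + cooc.T) / 2
np.fill_diagonal(cooc, 0.0)
P = cooc / cooc.sum(axis=1, keepdims=True)  # row-wise distributions

n = P.shape[0]
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = jensenshannon(P[i], P[j])

Z = linkage(squareform(D), method="average")    # hierarchical clustering
roles = fcluster(Z, t=3, criterion="maxclust")  # e.g., 3 latent roles
print(roles)
```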

[675] Adversarial bandit optimization for approximately linear functions

Zhuoyu Cheng, Kohei Hatano, Eiji Takimoto

Main category: cs.LG

TL;DR: The paper analyzes bandit optimization for nonconvex, nonsmooth functions with linear and perturbation components, providing regret bounds and a lower bound.

DetailsMotivation: To address the challenge of optimizing nonconvex, nonsmooth functions in bandit settings, especially with adversarial perturbations.

Method: Analyzes regret bounds (expected and high probability) for the problem, including a special case of bandit linear optimization.

Result: Provides improved high-probability regret bounds for bandit linear optimization and a lower bound on expected regret.

Conclusion: The work advances understanding of bandit optimization under adversarial perturbations and nonconvexity.

Abstract: We consider a bandit optimization problem for nonconvex and non-smooth functions, where in each trial the loss function is the sum of a linear function and a small but arbitrary perturbation chosen after observing the player’s choice. We give both expected and high probability regret bounds for the problem. Our result also implies an improved high-probability regret bound for the bandit linear optimization, a special case with no perturbation. We also give a lower bound on the expected regret.

[676] Low-dimensional Functions are Efficiently Learnable under Randomly Biased Distributions

Elisabetta Cornacchia, Dan Mikulincer, Elchanan Mossel

Main category: cs.LG

TL;DR: The paper shows that high complexity cases in learning single and multi index models are rare. A small random perturbation to the data distribution makes Gaussian single index models as easy to learn as linear functions, and extends this to sparse Boolean functions (Juntas).

DetailsMotivation: To address the challenge of learning high-complexity single and multi index models, which are fundamental in high-dimensional statistics, and to explore the impact of data distribution perturbations on learning ease.

Method: Introduces a small random perturbation (random shift in the first moment) to the data distribution and analyzes its effect on the learnability of Gaussian single index models and sparse Boolean functions.

Result: Proves that high complexity cases become rare with such perturbations, making Gaussian single index models as easy to learn as linear functions, and extends this to sparse Boolean functions.

Conclusion: Small perturbations to data distributions can significantly simplify the learning of high-complexity models, reducing their sample complexity to that of simpler tasks.

Abstract: The problem of learning single index and multi index models has gained significant interest as a fundamental task in high-dimensional statistics. Many recent works have analysed gradient-based methods, particularly in the setting of isotropic data distributions, often in the context of neural network training. Such studies have uncovered precise characterisations of algorithmic sample complexity in terms of certain analytic properties of the target function, such as the leap, information, and generative exponents. These properties establish a quantitative separation between low and high complexity learning tasks. In this work, we show that high complexity cases are rare. Specifically, we prove that introducing a small random perturbation to the data distribution–via a random shift in the first moment–renders any Gaussian single index model as easy to learn as a linear function. We further extend this result to a class of multi index models, namely sparse Boolean functions, also known as Juntas.

[677] Neural Flow Samplers with Shortcut Models

Wuhao Chen, Zijing Ou, Yingzhen Li

Main category: cs.LG

TL;DR: The paper introduces an improved estimator and a shortcut consistency model to enhance flow-based neural samplers, outperforming existing methods on synthetic and complex datasets.

DetailsMotivation: Sampling from unnormalized densities is challenging but crucial for applications like posterior inference and molecular dynamics. Existing estimators for partition functions often suffer from high variance or low accuracy.

Method: The authors propose a velocity-driven Sequential Monte Carlo method with control variates for better estimation and a shortcut consistency model to reduce sampling steps in flow-based neural samplers.

Result: The Neural Flow Shortcut Sampler outperforms existing flow-based neural samplers on synthetic datasets and complex n-body system targets.

Conclusion: The improved estimator and shortcut model enhance the efficiency and accuracy of flow-based neural samplers, making them more practical for real-world applications.

Abstract: Sampling from unnormalized densities presents a fundamental challenge with wide-ranging applications, from posterior inference to molecular dynamics simulations. Continuous flow-based neural samplers offer a promising approach, learning a velocity field that satisfies key principles of marginal density evolution (e.g., the continuity equation) to generate samples. However, this learning procedure requires accurate estimation of intractable terms linked to the computationally challenging partition function, for which existing estimators often suffer from high variance or low accuracy. To overcome this, we introduce an improved estimator for these challenging quantities, employing a velocity-driven Sequential Monte Carlo method enhanced with control variates. Furthermore, we introduce a shortcut consistency model to boost the runtime efficiency of the flow-based neural sampler by minimizing its required sampling steps. Our proposed Neural Flow Shortcut Sampler empirically outperforms existing flow-based neural samplers on both synthetic datasets and complex n-body system targets.

[678] PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation

Yanlong Chen, Mattia Orlandi, Pierangelo Maria Rapa, Simone Benatti, Luca Benini, Yawei Li

Main category: cs.LG

TL;DR: A wavelet-based approach for analyzing physiological signals is introduced, addressing noise and non-stationarity. Pretrained models for EMG and ECG, and a multi-modal framework with EEG, achieve superior performance.

DetailsMotivation: Challenges like motion artifacts, baseline drift, and non-stationarity in physiological signals hinder analysis. Traditional methods struggle with these issues.

Method: A wavelet-based technique captures multi-scale time-frequency features. Pretrained models for EMG and ECG are developed, and a multi-modal framework integrates EEG via weighted fusion.

Result: The approach outperforms existing methods, handling low SNR, inter-subject variability, and device mismatch. New baselines are set for downstream tasks.

Conclusion: The wavelet-based method and multi-modal framework advance physiological signal analysis, with potential applications in health monitoring and diagnostics.

Abstract: Physiological signals are often corrupted by motion artifacts, baseline drift, and other low-SNR disturbances, which pose significant challenges for analysis. Additionally, these signals exhibit strong non-stationarity, with sharp peaks and abrupt changes that evolve continuously, making them difficult to represent using traditional time-domain or filtering methods. To address these issues, a novel wavelet-based approach for physiological signal analysis is presented, aiming to capture multi-scale time-frequency features in various physiological signals. Leveraging this technique, two large-scale pretrained models specific to EMG and ECG are introduced for the first time, achieving superior performance and setting new baselines in downstream tasks. Additionally, a unified multi-modal framework is constructed by integrating pretrained EEG model, where each modality is guided through its dedicated branch and fused via learnable weighted fusion. This design effectively addresses challenges such as low signal-to-noise ratio, high inter-subject variability, and device mismatch, outperforming existing methods on multi-modal tasks. The proposed wavelet-based architecture lays a solid foundation for analysis of diverse physiological signals, while the multi-modal design points to next-generation physiological signal processing with potential impact on wearable health monitoring, clinical diagnostics, and broader biomedical applications.
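A small sketch of the multi-scale wavelet front end on a toy non-stationary signal, using the PyWavelets package; the wavelet family and depth are illustrative, and the paper's models add transformer branches on top of such decompositions:

```python
import numpy as np
import pywt  # assumed dependency: pip install PyWavelets

t = np.linspace(0, 1, 1024)
sig = np.sin(2 * np.pi * 5 * t)                      # slow rhythm
sig += 0.5 * (t > 0.5) * np.sin(2 * np.pi * 40 * t)  # abrupt fast burst
sig += 0.1 * np.random.randn(t.size)                 # low-SNR disturbance

# Each level isolates a frequency band: sharp transients show up in the
# fine-scale detail coefficients, drift in the coarse approximation.
coeffs = pywt.wavedec(sig, "db4", level=4)  # [cA4, cD4, cD3, cD2, cD1]
for name, c in zip(["A4", "D4", "D3", "D2", "D1"], coeffs):
    print(name, c.shape, round(float(np.abs(c).mean()), 4))
```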

[679] Reevaluating Policy Gradient Methods for Imperfect-Information Games

Max Rudolph, Nathan Lichtle, Sobhan Mohammadpour, Alexandre Bayen, J. Zico Kolter, Amy Zhang, Gabriele Farina, Eugene Vinitsky, Samuel Sokota

Main category: cs.LG

TL;DR: Simpler policy gradient methods like PPO outperform FP-, DO-, and CFR-based DRL approaches in imperfect-information games, as shown by extensive exploitability comparisons.

DetailsMotivation: Address the failure of naive self-play DRL in adversarial imperfect-information games and test the hypothesis that simpler methods like PPO are superior to FP-, DO-, and CFR-based approaches.

Method: Implemented exact exploitability computations for four large games and conducted over 5600 training runs to compare DRL algorithms.

Result: FP-, DO-, and CFR-based methods did not outperform generic policy gradient methods like PPO.

Conclusion: Generic policy gradient methods are competitive or superior to specialized approaches in imperfect-information games.

Abstract: In the past decade, motivated by the putative failure of naive self-play deep reinforcement learning (DRL) in adversarial imperfect-information games, researchers have developed numerous DRL algorithms based on fictitious play (FP), double oracle (DO), and counterfactual regret minimization (CFR). In light of recent results of the magnetic mirror descent algorithm, we hypothesize that simpler generic policy gradient methods like PPO are competitive with or superior to these FP-, DO-, and CFR-based DRL approaches. To facilitate the resolution of this hypothesis, we implement and release the first broadly accessible exact exploitability computations for four large games. Using these games, we conduct the largest-ever exploitability comparison of DRL algorithms for imperfect-information games. Over 5600 training runs, we find that FP-, DO-, and CFR-based approaches fail to outperform generic policy gradient methods. Code is available at https://github.com/nathanlct/IIG-RL-Benchmark and https://github.com/gabrfarina/exp-a-spiel .

[680] Dictionary-Learning-Based Data Pruning for System Identification

Tingna Wang, Sikai Zhang, Mingming Song, Limin Sun

Main category: cs.LG

TL;DR: The paper introduces mini-batch FastCan, a data pruning method to reduce sample-wise redundancy in time series data using dictionary learning, outperforming random pruning.

DetailsMotivation: Existing methods focus on feature-wise redundancy reduction, neglecting sample-wise redundancy, which this paper addresses.

Method: Proposes mini-batch FastCan, using dictionary learning to represent data with atoms and selecting useful samples based on their correlation with these atoms.

Result: Tested on simulated and benchmark datasets, the method significantly outperforms random pruning, measured by R-squared of model coefficients.

Conclusion: Mini-batch FastCan effectively reduces sample-wise redundancy, improving model training efficiency.

Abstract: System identification normally involves augmenting time series data by time shifting and nonlinearisation (e.g., a polynomial basis), both of which introduce redundancy in features and samples. Many research works focus on reducing redundancy feature-wise, while less attention is paid to sample-wise redundancy. This paper proposes a novel data pruning method, called mini-batch FastCan, to reduce sample-wise redundancy based on dictionary learning. Time series data is represented by some representative samples, called atoms, via dictionary learning. The useful samples are selected based on their correlation with the atoms. The method is tested on one simulated dataset and two benchmark datasets. The R-squared between the coefficients of models trained on the full datasets and the coefficients of models trained on pruned datasets is adopted to evaluate the performance of data pruning methods. The proposed method is found to significantly outperform the random pruning method.
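A rough sketch of the pruning idea with scikit-learn's mini-batch dictionary learner and a simple correlation-based selection rule; FastCan's actual selection criterion is more elaborate than the max-correlation shortcut used here:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 20))   # lagged/nonlinear regressor matrix

# Learn a small set of atoms that summarize the samples.
dl = MiniBatchDictionaryLearning(n_components=10, batch_size=256,
                                 random_state=0).fit(X)
atoms = dl.components_                # (10, 20)

# Keep the samples most correlated with any atom.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
An = atoms / np.linalg.norm(atoms, axis=1, keepdims=True)
scores = np.abs(Xn @ An.T).max(axis=1)
keep = np.argsort(scores)[-500:]      # indices of the pruned training set
X_pruned = X[keep]
print(X_pruned.shape)
```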

[681] Controlled Model Debiasing through Minimal and Interpretable Updates

Federico Di Gennaro, Thibault Laugel, Vincent Grari, Marcin Detyniecki

Main category: cs.LG

TL;DR: The paper introduces COMMOD, a model-agnostic algorithm for controlled debiasing in machine learning, ensuring minimal and interpretable changes to existing models.

DetailsMotivation: Traditional fairness methods require rebuilding models from scratch, leading to inefficiencies and inconsistencies. This work aims to address these issues by focusing on minimal and interpretable updates.

Method: The proposed algorithm, COMMOD, combines concept-based architecture and adversarial learning to debias models without needing sensitive attributes at test time.

Result: COMMOD achieves comparable performance to state-of-the-art debiasing methods while enforcing minimal and interpretable changes.

Conclusion: The approach is effective for high-stakes applications, offering a practical solution for model fairness without complete retraining.

Abstract: Traditional approaches to learning fair machine learning models often require rebuilding models from scratch, typically without considering potentially existing models. In a context where models need to be retrained frequently, this can lead to inconsistent model updates, as well as redundant and costly validation testing. To address this limitation, we introduce the notion of controlled model debiasing, a novel supervised learning task relying on two desiderata: that the differences between the new fair model and the existing one should be (i) minimal and (ii) interpretable. After providing theoretical guarantees to this new problem, we introduce a novel algorithm for algorithmic fairness, COMMOD, that is both model-agnostic and does not require the sensitive attribute at test time. In addition, our algorithm is explicitly designed to enforce minimal and interpretable changes between biased and debiased predictions in a binary classification task, a property that, while highly desirable in high-stakes applications, is rarely prioritized as an explicit objective in fairness literature. Our approach combines a concept-based architecture and adversarial learning and we demonstrate through empirical results that it achieves comparable performance to state-of-the-art debiasing methods while performing minimal and interpretable prediction changes.

[682] FedWSQ: Efficient Federated Learning with Weight Standardization and Distribution-Aware Non-Uniform Quantization

Seung-Wook Kim, Seongyeol Kim, Jiah Kim, Seowon Ji, Se-Ho Lee

Main category: cs.LG

TL;DR: FedWSQ improves federated learning by combining weight standardization and non-uniform quantization, addressing data heterogeneity and communication constraints.

DetailsMotivation: Federated learning (FL) faces performance issues due to data heterogeneity and communication constraints, which FedWSQ aims to resolve.

Method: FedWSQ integrates weight standardization (WS) to filter biased updates and distribution-aware non-uniform quantization (DANUQ) to reduce quantization errors.

Result: FedWSQ reduces communication overhead and maintains high model accuracy, outperforming existing FL methods in challenging settings.

Conclusion: FedWSQ is a robust solution for FL, excelling in scenarios with data heterogeneity and low-bit communication.

Abstract: Federated learning (FL) often suffers from performance degradation due to key challenges such as data heterogeneity and communication constraints. To address these limitations, we present a novel FL framework called FedWSQ, which integrates weight standardization (WS) and the proposed distribution-aware non-uniform quantization (DANUQ). WS enhances FL performance by filtering out biased components in local updates during training, thereby improving the robustness of the model against data heterogeneity and unstable client participation. In addition, DANUQ minimizes quantization errors by leveraging the statistical properties of local model updates. As a result, FedWSQ significantly reduces communication overhead while maintaining superior model accuracy. Extensive experiments on FL benchmark datasets demonstrate that FedWSQ consistently outperforms existing FL methods across various challenging FL settings, including extreme data heterogeneity and ultra-low-bit communication scenarios.
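A minimal sketch of the weight-standardization half of the method: filter weights are standardized to zero mean and unit variance before each convolution, the mechanism FedWSQ uses to filter biased components from local updates (the DANUQ quantizer is omitted here):

```python
import torch
import torch.nn.functional as F

class WSConv2d(torch.nn.Conv2d):
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)  # per output filter
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

layer = WSConv2d(3, 16, kernel_size=3, padding=1)
print(layer(torch.randn(1, 3, 32, 32)).shape)
```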

[683] Further exploration of binding energy residuals using machine learning and the development of a composite ensemble model

I. Bentley, J. Tedder, M. Gebran, A. Paul

Main category: cs.LG

TL;DR: The paper introduces the Four Model Tree Ensemble (FMTE), a machine learning composite trained on AME 2012 data, achieving high accuracy in predicting nuclear binding energies for AME 2020 nuclei.

DetailsMotivation: To improve the prediction of nuclear binding energies by leveraging machine learning and combining multiple models for better accuracy and extrapolation capabilities.

Method: Combines three new models with one prior model, trained on binding energy residuals using four machine learning approaches, focusing on shape parameters and physical features.

Result: FMTE predicts binding energies with a standard deviation of 76 keV and mean average deviation of 34 keV, demonstrating superior interpolation and extrapolation.

Conclusion: The least-squares boosted ensemble of trees is the preferred approach for binding energy residuals, validated by comparisons with new isotope masses and extrapolations near the neutron drip line.

Abstract: This paper describes the development of the Four Model Tree Ensemble (FMTE). The FMTE is a composite of machine learning models trained on experimental binding energies from the Atomic Mass Evaluation (AME) 2012. The FMTE predicts binding energy values for all nuclei with N > 7 and Z > 7 from AME 2020 with a standard deviation of 76 keV and a mean average deviation of 34 keV. The FMTE model was developed by combining three new models with one prior model. The new models presented here have been trained on binding energy residuals from mass models using four machine learning approaches. The models presented in this work leverage shape parameters along with other physical features. We have determined the preferred machine learning approach for binding energy residuals is the least-squares boosted ensemble of trees. This approach appears to have a superior ability to both interpolate and extrapolate binding energy residuals. A comparison with the masses of isotopes that were not measured previously and a discussion of extrapolations approaching the neutron drip line have been included.
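A schematic sketch of the residual-learning recipe with synthetic stand-in data: fit a least-squares boosted tree ensemble to the gap between experimental binding energies and a base mass model, then add the predicted correction back. Features and values are placeholders, not AME data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
features = rng.random((500, 6))             # e.g., N, Z, shape parameters
be_exp = 8.0 * rng.random(500)              # stand-in "experimental" energies
be_base = be_exp + rng.normal(0, 0.5, 500)  # imperfect physics-based model

# Least-squares boosted ensemble of trees on the residuals.
resid_model = GradientBoostingRegressor(loss="squared_error",
                                        n_estimators=300, max_depth=3)
resid_model.fit(features, be_exp - be_base)

be_corrected = be_base + resid_model.predict(features)
print(np.std(be_exp - be_corrected))        # reduced residual spread
```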

[684] Reciprocity-Aware Convolutional Neural Networks for Map-Based Path Loss Prediction

Ryan G. Dempsey, Jonathan Ethier, Halim Yanikomeroglu

Main category: cs.LG

TL;DR: The paper proposes data augmentation to generalize path loss models for uplink, downlink, and backhaul scenarios using only downlink drive test data, improving accuracy by >8 dB for uplink predictions.

DetailsMotivation: Existing path loss models, trained on downlink drive test data, lack representation for uplink scenarios, limiting their applicability.

Method: The study uses data augmentation by adding synthetic uplink samples to downlink training data to train a generalized path loss model.

Result: Root mean squared error is reduced by >8 dB for uplink predictions in the test set.

Conclusion: Data augmentation effectively generalizes path loss models to cover uplink scenarios without additional real-world measurements.

Abstract: Path loss modeling is a widely used technique for estimating point-to-point losses along a communications link from transmitter (Tx) to receiver (Rx). Accurate path loss predictions can optimize use of the radio frequency spectrum and minimize unwanted interference. Modern path loss modeling often leverages data-driven approaches, using machine learning to train models on drive test measurement datasets. Drive tests primarily represent downlink scenarios, where the Tx is located on a building and the Rx is located on a moving vehicle. Consequently, trained models are frequently reserved for downlink coverage estimation, lacking representation of uplink scenarios. In this paper, we demonstrate that data augmentation can be used to train a path loss model that is generalized to uplink, downlink, and backhaul scenarios, training using only downlink drive test measurements. By adding a small number of synthetic samples representing uplink scenarios to the training set, root mean squared error is reduced by > 8 dB on uplink examples in the test set.
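
Because path loss over a link is reciprocal, a downlink measurement can be relabeled as an uplink sample by swapping the terminal roles. A minimal pandas sketch of that augmentation step, with hypothetical column names:

```python
import pandas as pd

def add_synthetic_uplink(df: pd.DataFrame, n: int = 1000) -> pd.DataFrame:
    """Augment downlink drive-test data with synthetic uplink rows by
    swapping Tx/Rx attributes; reciprocity means the measured path
    loss label stays the same. Column names are hypothetical."""
    up = df.sample(n=min(n, len(df)), random_state=0).copy()
    up[["tx_height_m", "rx_height_m"]] = up[["rx_height_m", "tx_height_m"]].to_numpy()
    up["link_direction"] = "uplink"
    return pd.concat([df, up], ignore_index=True)
```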

[685] Ultra-fast feature learning for the training of two-layer neural networks in the two-timescale regime

Raphaël Barboni, Gabriel Peyré, François-Xavier Vialard

Main category: cs.LG

TL;DR: The paper analyzes gradient methods for training mean-field single-hidden-layer neural networks with square loss, proposing a Variable Projection (VarPro) algorithm for provable convergence rates in a teacher-student scenario.

DetailsMotivation: To address the lack of quantitative convergence results for high-dimensional, non-convex optimization in neural network training, especially beyond neural tangent kernel analysis.

Method: Uses a Variable Projection (VarPro) or two-timescale learning algorithm to eliminate linear variables, focusing on nonlinear feature training. Analyzes dynamics via a weighted ultra-fast diffusion equation.

Result: In a teacher-student setup, the method achieves provable convergence rates for sampling a teacher feature distribution, with dynamics described by a PDE.

Conclusion: The VarPro approach provides quantitative convergence guarantees for feature distribution learning, leveraging PDE analysis.

Abstract: We study the convergence of gradient methods for the training of mean-field single-hidden-layer neural networks with square loss. For this high-dimensional and non-convex optimization problem, most known convergence results are either qualitative or rely on a neural tangent kernel analysis where nonlinear representations of the data are fixed. Using the fact that this problem belongs to the class of separable nonlinear least squares problems, we consider here a Variable Projection (VarPro) or two-timescale learning algorithm, thereby eliminating the linear variables and reducing the learning problem to the training of nonlinear features. In a teacher-student scenario, we show that such a strategy enables provable convergence rates for the sampling of a teacher feature distribution. Precisely, in the limit where the regularization strength vanishes, we show that the dynamics of the feature distribution correspond to a weighted ultra-fast diffusion equation. Recent results on the asymptotic behavior of such PDEs then give quantitative guarantees for the convergence of the learned feature distribution.
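
In code, the two-timescale idea is: at each step solve the linear output weights exactly (a ridge-regularized least squares, since the problem is separable), then take a gradient step only on the nonlinear features. A minimal NumPy sketch for f(x) = a . tanh(Wx); the activation and regularizer are our choices:

```python
import numpy as np

def varpro_step(W, X, y, lr=0.1, lam=1e-3):
    """One VarPro / two-timescale update for f(x) = a @ tanh(W @ x):
    (1) solve the linear (outer) weights a in closed form by ridge
    regression, (2) take a gradient step on the nonlinear features W
    with a held fixed."""
    Phi = np.tanh(X @ W.T)                            # (n, m) hidden features
    n, m = Phi.shape
    a = np.linalg.solve(Phi.T @ Phi + lam * n * np.eye(m), Phi.T @ y)
    r = Phi @ a - y                                   # residuals, shape (n,)
    grad_W = ((1.0 - Phi ** 2) * np.outer(r, a)).T @ X / n
    return W - lr * grad_W, a
```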

[686] Frequency-Aligned Knowledge Distillation for Lightweight Spatiotemporal Forecasting

Yuqi Li, Chuanguang Yang, Hansheng Zeng, Zeyu Dong, Zhulin An, Yongjun Xu, Yingli Tian, Hao Wu

Main category: cs.LG

TL;DR: A lightweight framework, SDKD, transfers multi-scale spatiotemporal representations from a complex teacher model to a student network, improving efficiency and performance.

DetailsMotivation: Complex models for spatiotemporal forecasting suffer from inefficiency and high memory usage, necessitating a lightweight solution.

Method: SDKD uses frequency-aligned knowledge distillation to guide the student model with multi-scale spectral features from the teacher’s latent space.

Result: SDKD reduces MSE by 81.3% and MAE by 52.3% on the Navier-Stokes dataset, capturing high-frequency and long-term trends efficiently.

Conclusion: SDKD effectively balances performance and computational efficiency for spatiotemporal forecasting tasks.

Abstract: Spatiotemporal forecasting tasks, such as traffic flow, combustion dynamics, and weather forecasting, often require complex models that suffer from low training efficiency and high memory consumption. This paper proposes a lightweight framework, Spectral Decoupled Knowledge Distillation (termed SDKD), which transfers the multi-scale spatiotemporal representations from a complex teacher model to a more efficient lightweight student network. The teacher model follows an encoder-latent evolution-decoder architecture, where its latent evolution module decouples high-frequency details and low-frequency trends using convolution and Transformer (global low-frequency modeler). However, the multi-layer convolution and deconvolution structures result in slow training and high memory usage. To address these issues, we propose a frequency-aligned knowledge distillation strategy, which extracts multi-scale spectral features from the teacher’s latent space, including both high and low frequency components, to guide the lightweight student model in capturing both local fine-grained variations and global evolution patterns. Experimental results show that SDKD significantly improves performance, achieving reductions of up to 81.3% in MSE and 52.3% in MAE on the Navier-Stokes equation dataset. The framework effectively captures both high-frequency variations and long-term trends while reducing computational complexity. Our codes are available at https://github.com/itsnotacie/SDKD
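
The core of frequency-aligned distillation is a loss that compares teacher and student latents in the spectral domain so that both slow trends and fine detail are supervised. A toy PyTorch sketch; the band split and weighting are illustrative, not the paper's exact design:

```python
import torch

def frequency_aligned_kd_loss(student_latent, teacher_latent,
                              cutoff=8, w_low=1.0, w_high=1.0):
    """Match teacher/student latent spectra (inputs: [B, C, H, W]).
    The low-frequency term captures global evolution; the full-spectrum
    term keeps high-frequency detail in play."""
    fs = torch.fft.rfft2(student_latent, norm="ortho")
    ft = torch.fft.rfft2(teacher_latent, norm="ortho")
    # Crude low-pass: keep the lowest `cutoff` rows/cols of the spectrum.
    loss_low = (fs[..., :cutoff, :cutoff] - ft[..., :cutoff, :cutoff]).abs().pow(2).mean()
    loss_all = (fs - ft).abs().pow(2).mean()
    return w_low * loss_low + w_high * loss_all
```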

[687] Conditional Front-door Adjustment for Heterogeneous Treatment Assignment Effect Estimation Under Non-adherence

Winston Chen, Trenton Chang, Jenna Wiens

Main category: cs.LG

TL;DR: CFD provides lower-variance estimates than SBD for small treatment effects under non-adherence, and LobsterNet improves CFD’s accuracy by jointly modeling nuisance parameters.

DetailsMotivation: To address the variance issue in treatment effect estimation under non-adherence and improve accuracy.

Method: Theoretical and empirical comparison of CFD and SBD, plus LobsterNet for joint nuisance parameter modeling.

Result: CFD outperforms SBD for small effects; LobsterNet reduces estimation error in datasets.

Conclusion: CFD with shared nuisance modeling (LobsterNet) enhances treatment effect estimation under non-adherence.

Abstract: Estimates of heterogeneous treatment assignment effects can inform treatment decisions. In the presence of non-adherence (e.g., patients do not adhere to their assigned treatment), both the standard backdoor adjustment (SBD) and the conditional front-door adjustment (CFD) can recover unbiased estimates of the treatment assignment effects. However, the estimation variance of these approaches may vary widely across settings, which remains underexplored in the literature. In this work, we demonstrate theoretically and empirically that CFD yields lower-variance estimates than SBD when the true effect of treatment assignment is small (i.e., assigning an intervention leads to small changes in patients’ future outcome). Additionally, since CFD requires estimating multiple nuisance parameters, we introduce LobsterNet, a multi-task neural network that implements CFD with joint modeling of the nuisance parameters. Empirically, LobsterNet reduces estimation error across several semi-synthetic and real-world datasets compared to baselines. Our findings suggest CFD with shared nuisance parameter modeling can improve treatment assignment effect estimation under non-adherence.

[688] Rethinking Data Protection in the (Generative) Artificial Intelligence Era

Yiming Li, Shuo Shao, Yu He, Junfeng Guo, Tianwei Zhang, Zhan Qin, Pin-Yu Chen, Michael Backes, Philip Torr, Dacheng Tao, Kui Ren

Main category: cs.LG

TL;DR: The paper proposes a four-level taxonomy for data protection in generative AI, addressing gaps in traditional methods and emphasizing the need for updated safeguards.

DetailsMotivation: The rise of generative AI has blurred traditional data protection boundaries, creating risks for society and individuals, necessitating a redefined framework.

Method: A four-level taxonomy (non-usability, privacy preservation, traceability, deletability) is introduced to address protection needs across the AI lifecycle.

Result: The framework highlights trade-offs between data utility and control, identifies regulatory blind spots, and offers guidance for stakeholders.

Conclusion: The paper calls for urgent rethinking of data protection in AI, providing a structured approach for future technologies and governance.

Abstract: The (generative) artificial intelligence (AI) era has profoundly reshaped the meaning and value of data. No longer confined to static content, data now permeates every stage of the AI lifecycle, from the training samples that shape model parameters to the prompts and outputs that drive real-world model deployment. This shift renders traditional notions of data protection insufficient, while the boundaries of what needs safeguarding remain poorly defined. Failing to safeguard data in AI systems can inflict societal and individual harms, underscoring the urgent need to clearly delineate the scope of and rigorously enforce data protection. In this perspective, we propose a four-level taxonomy, including non-usability, privacy preservation, traceability, and deletability, that captures the diverse protection needs arising in modern (generative) AI models and systems. Our framework offers a structured understanding of the trade-offs between data utility and control, spanning the entire AI pipeline, including training datasets, model weights, system prompts, and AI-generated content. We analyze representative technical approaches at each level and reveal regulatory blind spots that leave critical assets exposed. By offering a structured lens to align future AI technologies and governance with trustworthy data practices, we underscore the urgency of rethinking data protection for modern AI techniques and provide timely guidance for developers, researchers, and regulators alike.

[689] A Large Language Model-Enhanced Q-learning for Capacitated Vehicle Routing Problem with Time Windows

Linjiang Cao, Maonan Wang, Xi Xiong

Main category: cs.LG

TL;DR: The paper proposes an LLM-enhanced Q-learning framework for solving CVRPTW with real-time constraints, achieving a 7.3% cost reduction over traditional Q-learning.

DetailsMotivation: The complexity of CVRPTW, due to vehicle capacity and time window constraints, challenges traditional methods, prompting the use of LLMs for approximate solutions.

Method: A two-phase training mechanism (LLM-guided exploration to autonomous Q-network optimization) with a three-tier self-correction (syntactic, semantic, physical) and prioritized replay of LLM-generated experiences.

Result: The framework reduces costs by 7.3% on average compared to traditional Q-learning and converges faster.

Conclusion: LLM-enhanced Q-learning effectively addresses CVRPTW, offering improved performance and efficiency.

Abstract: The Capacitated Vehicle Routing Problem with Time Windows (CVRPTW) is a classic NP-hard combinatorial optimization problem widely applied in logistics distribution and transportation management. Its complexity stems from the constraints of vehicle capacity and time windows, which pose significant challenges to traditional approaches. Advances in Large Language Models (LLMs) provide new possibilities for finding approximate solutions to CVRPTW. This paper proposes a novel LLM-enhanced Q-learning framework to address the CVRPTW with real-time emergency constraints. Our solution introduces an adaptive two-phase training mechanism that transitions from the LLM-guided exploration phase to the autonomous optimization phase of the Q-network. To ensure reliability, we design a three-tier self-correction mechanism based on the Chain-of-Thought (CoT) for LLMs: syntactic validation, semantic verification, and physical constraint enforcement. In addition, we prioritize replay of the experiences generated by LLMs to amplify their regulatory role in the architecture. Experimental results demonstrate that our framework achieves a 7.3% average reduction in cost compared to traditional Q-learning, with fewer training steps required for convergence.
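
The three-tier self-correction gate can be pictured as a simple cascade over an LLM-proposed route. A sketch under assumed interfaces (`parse_route` and the `state` fields are hypothetical, not the paper's implementation):

```python
def three_tier_check(raw_output, parse_route, state):
    """Gate an LLM-proposed route through three tiers:
    (1) syntactic: does the output parse at all?
    (2) semantic: are all stops real customers, each visited once?
    (3) physical: are vehicle capacity and time windows respected?"""
    try:
        route = parse_route(raw_output)                       # tier 1
    except ValueError:
        return None
    if set(route) - set(state["customers"]) or len(route) != len(set(route)):
        return None                                           # tier 2
    load, t = 0, 0.0
    for stop in route:                                        # tier 3
        load += state["demand"][stop]
        t = max(t + state["travel_time"][stop], state["tw_open"][stop])
        if load > state["capacity"] or t > state["tw_close"][stop]:
            return None
    return route
```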

[690] Ranking-Based At-Risk Student Prediction Using Federated Learning and Differential Features

Shunsuke Yoneda, Valdemar Švábenský, Gen Li, Daisuke Deguchi, Atsushi Shimada

Main category: cs.LG

TL;DR: The study proposes a federated learning and differential features method to address privacy concerns in educational data mining, achieving comparable performance to centralized models while enhancing generalizability.

DetailsMotivation: Privacy concerns limit the integration of confidential academic and learning log data across schools, hindering the development of high-performing, generalizable models.

Method: Combines federated learning (to train models without centralizing data) and differential features (using relative values) for predicting at-risk students.

Result: The method matched centralized learning performance in Top-n precision, nDCG, and PR-AUC, while differential features improved prediction across datasets. Early prediction of at-risk students was also effective.

Conclusion: The proposed method successfully addresses privacy issues and enhances model performance and generalizability, proving useful for early intervention in education.

Abstract: Digital textbooks are widely used in various educational contexts, such as university courses and online lectures. Such textbooks yield learning log data that have been used in numerous educational data mining (EDM) studies for student behavior analysis and performance prediction. However, these studies have faced challenges in integrating confidential data, such as academic records and learning logs, across schools due to privacy concerns. Consequently, analyses are often conducted with data limited to a single school, which makes developing high-performing and generalizable models difficult. This study proposes a method that combines federated learning and differential features to address these issues. Federated learning enables model training without centralizing data, thereby preserving student privacy. Differential features, which utilize relative values instead of absolute values, enhance model performance and generalizability. To evaluate the proposed method, a model for predicting at-risk students was trained using data from 1,136 students across 12 courses conducted over 4 years, and validated on hold-out test data from 5 other courses. Experimental results demonstrated that the proposed method addresses privacy concerns while achieving performance comparable to that of models trained via centralized learning in terms of Top-n precision, nDCG, and PR-AUC. Furthermore, using differential features improved prediction performance across all evaluation datasets compared to non-differential approaches. The trained models were also applicable for early prediction, achieving high performance in detecting at-risk students in earlier stages of the semester within the validation datasets.
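
One plausible reading of "differential features" is within-course standardization: each student's activity counts are expressed relative to their course's distribution, removing school-specific scales. A pandas sketch under that assumption, with hypothetical column names:

```python
import pandas as pd

def to_differential_features(logs: pd.DataFrame,
                             group_col: str = "course_id") -> pd.DataFrame:
    """Replace absolute activity counts with values relative to each
    course (z-scores), one plausible instantiation of differential
    features for cross-school generalization."""
    feats = [c for c in logs.columns if c not in (group_col, "student_id")]
    g = logs.groupby(group_col)[feats]
    return (logs[feats] - g.transform("mean")) / (g.transform("std") + 1e-9)
```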

[691] Multi-parameter Control for the $(1+(λ,λ))$-GA on OneMax via Deep Reinforcement Learning

Tai Nguyen, Phong Le, Carola Doerr, Nguyen Dang

Main category: cs.LG

TL;DR: Dynamic parameter control in evolutionary algorithms, like the $(1+(\lambda,\lambda))$ Genetic Algorithm, outperforms static choices. Deep reinforcement learning (DRL) is used to optimize multiple parameters, achieving superior performance over existing policies.

DetailsMotivation: To address the challenge of controlling multiple parameters in evolutionary algorithms, particularly for the $(1+(\lambda,\lambda))$ Genetic Algorithm optimizing OneMax, and to explore the potential of DRL in deriving effective control policies.

Method: Decoupling the algorithm’s four main parameters and employing state-of-the-art DRL techniques to approximate optimal control policies.

Result: DRL-derived policies outperform all known control policies, with a simple derived policy surpassing the default theory-recommended setting by 27% and the irace-tuned policy by 13% for problem sizes up to 40,000.

Conclusion: DRL is a powerful tool for parameter control in evolutionary algorithms, capable of discovering policies that significantly enhance performance, even leading to simpler yet more effective control strategies.

Abstract: It is well known that evolutionary algorithms can benefit from dynamic choices of the key parameters that control their behavior, to adjust their search strategy to the different stages of the optimization process. A prominent example where dynamic parameter choices have shown a provable super-constant speed-up is the $(1+(\lambda,\lambda))$ Genetic Algorithm optimizing the OneMax function. While optimal parameter control policies result in linear expected running times, this is not possible with static parameter choices. This result has spurred a lot of interest in parameter control policies. However, many works, in particular theoretical running time analyses, focus on controlling one single parameter. Deriving policies for controlling multiple parameters remains very challenging. In this work we reconsider the problem of the $(1+(\lambda,\lambda))$ Genetic Algorithm optimizing OneMax. We decouple its four main parameters and investigate how well state-of-the-art deep reinforcement learning techniques can approximate good control policies. We show that although making deep reinforcement learning learn effectively is a challenging task, once it works, it is very powerful and is able to find policies that outperform all previously known control policies on the same benchmark. Based on the results found through reinforcement learning, we derive a simple control policy that consistently outperforms the default theory-recommended setting by $27\%$ and the irace-tuned policy, the strongest existing control policy on this benchmark, by $13\%$, for all tested problem sizes up to $40{,}000$.

[692] Plan and Budget: Effective and Efficient Test-Time Scaling on Large Language Model Reasoning

Junhong Lin, Xinyue Zeng, Jie Zhu, Song Wang, Julian Shun, Jun Wu, Dawei Zhou

Main category: cs.LG

TL;DR: The paper addresses inefficiency in LLMs due to overthinking, proposing Plan-and-Budget, a framework that improves reasoning efficiency by decomposing queries and adaptive budget allocation.

DetailsMotivation: LLMs often overthink, generating verbose reasoning for simple queries, leading to computational inefficiency. Fixed token budgets can cause underthinking on harder problems.

Method: Developed BBAM (Bayesian Budget Allocation Model) and introduced the $E^3$ metric. Proposed Plan-and-Budget, a test-time framework for decomposing queries and adaptive token allocation.

Result: Plan-and-Budget improves efficiency: +70% accuracy, -39% tokens, +187.5% $E^3$ improvement. Smaller models match larger ones in efficiency.

Conclusion: Plan-and-Budget effectively mitigates overthinking, enhancing LLM efficiency without retraining, closing performance gaps between models.

Abstract: Large Language Models (LLMs) have achieved remarkable success in complex reasoning tasks, but their inference remains computationally inefficient. We observe a common failure mode in many prevalent LLMs, overthinking, where models generate verbose and tangential reasoning traces even for simple queries. Recent works have tried to mitigate this by enforcing fixed token budgets; however, this can lead to underthinking, especially on harder problems. Through empirical analysis, we identify that this inefficiency often stems from unclear problem-solving strategies. To formalize this, we develop a theoretical model, BBAM (Bayesian Budget Allocation Model), which models reasoning as a sequence of sub-questions with varying uncertainty, and introduce the $E^3$ metric to capture the trade-off between correctness and computation efficiency. Building on theoretical results from BBAM, we propose Plan-and-Budget, a model-agnostic, test-time framework that decomposes complex queries into sub-questions and allocates token budgets based on estimated complexity using adaptive scheduling. Plan-and-Budget improves reasoning efficiency across a range of tasks and models, achieving up to +70% accuracy gains, -39% token reduction, and +187.5% improvement in $E^3$. Notably, it elevates a smaller model (DS-Qwen-32B) to match the efficiency of a larger model (DS-LLaMA-70B), demonstrating Plan-and-Budget’s ability to close performance gaps without retraining. Our code is available at https://github.com/junhongmit/P-and-B.
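
The adaptive scheduling step reduces to splitting a global token budget across sub-questions in proportion to estimated complexity, with a per-question floor to avoid underthinking. A sketch; `estimate_complexity` stands in for whatever scorer (e.g., an LLM self-estimate) is used:

```python
def allocate_budgets(sub_questions, estimate_complexity,
                     total_budget=2048, floor=64):
    """Proportional token allocation over decomposed sub-questions.
    `estimate_complexity` is a hypothetical scorer returning a
    positive difficulty estimate for each sub-question."""
    scores = [max(estimate_complexity(q), 1e-6) for q in sub_questions]
    spare = total_budget - floor * len(sub_questions)
    total = sum(scores)
    return {q: floor + int(spare * s / total)
            for q, s in zip(sub_questions, scores)}
```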

[693] Resolving Token-Space Gradient Conflicts: Token Space Manipulation for Transformer-Based Multi-Task Learning

Wooseong Jeong, Kuk-Jin Yoon

Main category: cs.LG

TL;DR: DTME-MTL is a dynamic framework for transformer-based MTL that mitigates negative transfer by adapting token space, improving performance without excessive parameters.

DetailsMotivation: Negative transfer in MTL due to conflicting objectives and rigid transformer structures limits adaptability and efficiency.

Method: DTME-MTL identifies gradient conflicts in token space and applies adaptive solutions, avoiding parameter duplication.

Result: DTME-MTL consistently enhances multi-task performance with minimal computational overhead.

Conclusion: DTME-MTL provides a scalable and efficient solution for improving transformer-based MTL models.

Abstract: Multi-Task Learning (MTL) enables multiple tasks to be learned within a shared network, but differences in objectives across tasks can cause negative transfer, where the learning of one task degrades another task’s performance. While pre-trained transformers significantly improve MTL performance, their fixed network capacity and rigid structure limit adaptability. Previous dynamic network architectures attempt to address this but are inefficient as they directly convert shared parameters into task-specific ones. We propose Dynamic Token Modulation and Expansion (DTME-MTL), a framework applicable to any transformer-based MTL architecture. DTME-MTL enhances adaptability and reduces overfitting by identifying gradient conflicts in token space and applying adaptive solutions based on conflict type. Unlike prior methods that mitigate negative transfer by duplicating network parameters, DTME-MTL operates entirely in token space, enabling efficient adaptation without excessive parameter growth. Extensive experiments demonstrate that DTME-MTL consistently improves multi-task performance with minimal computational overhead, offering a scalable and effective solution for enhancing transformer-based MTL models.
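
Detecting a token-space gradient conflict amounts to checking whether two tasks' gradients at the same token point in opposing directions. A minimal PyTorch sketch of that test (the modulation/expansion response itself is omitted):

```python
import torch
import torch.nn.functional as F

def conflicting_tokens(grads_a: torch.Tensor,
                       grads_b: torch.Tensor) -> torch.Tensor:
    """grads_*: (num_tokens, dim) gradients of each task's loss w.r.t.
    the shared token representations. Returns a boolean mask of tokens
    where the two tasks disagree (negative cosine similarity), the
    kind of trigger condition DTME-MTL responds to."""
    cos = F.cosine_similarity(grads_a, grads_b, dim=-1)
    return cos < 0
```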

[694] Supervised Graph Contrastive Learning for Gene Regulatory Network

Sho Oshima, Yuji Okamoto, Taisei Tosaki, Ryosuke Kojima, Yasushi Okuno

Main category: cs.LG

TL;DR: SupGCL introduces supervised graph contrastive learning for Gene Regulatory Networks by incorporating biological perturbations from gene knockdowns, outperforming existing methods in biological tasks.

DetailsMotivation: Existing GCL methods overlook biologically relevant perturbations in GRNs, limiting their effectiveness for biological applications.

Method: SupGCL extends GCL by integrating gene knockdown data as supervision, creating probabilistic models for biological perturbations.

Result: SupGCL outperforms state-of-the-art baselines in patient hazard prediction, disease subtype classification, and gene function classification.

Conclusion: SupGCL enhances GCL for GRNs by leveraging biological perturbations, improving performance in downstream biological tasks.

Abstract: Graph representation learning is effective for obtaining a meaningful latent space utilizing the structure of graph data and is widely applied, including to biological networks. In particular, Graph Contrastive Learning (GCL) has emerged as a powerful self-supervised method that relies on applying perturbations to graphs for data augmentation. However, existing GCL methods, when applied to biological networks such as Gene Regulatory Networks (GRNs), overlook meaningful, biologically relevant perturbations, e.g., gene knockdowns. In this study, we introduce SupGCL (Supervised Graph Contrastive Learning), a novel GCL method for GRNs that directly incorporates biological perturbations derived from gene knockdown experiments as supervision. SupGCL mathematically extends existing GCL methods that utilize non-biological perturbations to probabilistic models that introduce actual biological gene perturbations using gene knockdown data. Using the GRN representation obtained by our proposed method, our aim is to improve the performance of biological downstream tasks such as patient hazard prediction and disease subtype classification (graph-level tasks), and gene function classification (node-level task). We applied SupGCL to real GRN datasets derived from patients with multiple types of cancer, and in all experiments SupGCL achieves better performance than state-of-the-art baselines.
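
For orientation, the contrastive backbone such methods build on is the standard InfoNCE objective; in a SupGCL-style setup the positive view would come from a measured knockdown perturbation of the GRN rather than a random augmentation. A generic sketch, not the paper's exact probabilistic objective:

```python
import torch
import torch.nn.functional as F

def info_nce(z_anchor, z_positive, temperature=0.1):
    """Standard InfoNCE over a batch of graph (or node) embeddings:
    each anchor's positive is the matching row of z_positive, and all
    other rows act as negatives."""
    za = F.normalize(z_anchor, dim=-1)
    zp = F.normalize(z_positive, dim=-1)
    logits = za @ zp.T / temperature        # (B, B) similarity matrix
    labels = torch.arange(za.shape[0], device=za.device)
    return F.cross_entropy(logits, labels)
```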

[695] Synchronizing Task Behavior: Aligning Multiple Tasks during Test-Time Training

Wooseong Jeong, Jegyeong Cho, Youngho Yoon, Kuk-Jin Yoon

Main category: cs.LG

TL;DR: S4T improves test-time training for multi-task models by synchronizing tasks during adaptation, outperforming existing methods.

DetailsMotivation: Addressing the challenge of unsynchronized task behavior in conventional TTT methods when handling multiple tasks under domain shifts.

Method: Proposes S4T, which predicts task relations across domain shifts to synchronize tasks during test-time training.

Result: S4T outperforms state-of-the-art TTT methods on various benchmarks.

Conclusion: S4T effectively synchronizes tasks in multi-task scenarios, enhancing generalization under domain shifts.

Abstract: Generalizing neural networks to unseen target domains is a significant challenge in real-world deployments. Test-time training (TTT) addresses this by using an auxiliary self-supervised task to reduce the domain gap caused by distribution shifts between the source and target. However, we find that when models are required to perform multiple tasks under domain shifts, conventional TTT methods suffer from unsynchronized task behavior, where the adaptation steps needed for optimal performance in one task may not align with the requirements of other tasks. To address this, we propose a novel TTT approach called Synchronizing Tasks for Test-time Training (S4T), which enables the concurrent handling of multiple tasks. The core idea behind S4T is that predicting task relations across domain shifts is key to synchronizing tasks during test time. To validate our approach, we apply S4T to conventional multi-task benchmarks, integrating it with traditional TTT protocols. Our empirical results show that S4T outperforms state-of-the-art TTT methods across various benchmarks.

[696] Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework

Can Polat, Erchin Serpedin, Mustafa Kurban, Hasan Kurban

Main category: cs.LG

TL;DR: MCS-Set integrates atomic structures with 2D projections and textual annotations to enhance materials science datasets for multimodal learning.

DetailsMotivation: Overcome limitations of atomic geometry-only datasets to enable advanced machine learning in materials science.

Method: Introduces MCS-Set, a framework combining atomic structures, 2D projections, and textual annotations via a human-in-the-loop pipeline.

Result: Reveals modality-specific performance gaps and underscores annotation quality’s role in generalization.

Conclusion: MCS-Set supports benchmarking multimodal models, improving annotation practices, and creating versatile materials datasets.

Abstract: Most materials science datasets are limited to atomic geometries (e.g., XYZ files), restricting their utility for multimodal learning and comprehensive data-centric analysis. These constraints have historically impeded the adoption of advanced machine learning techniques in the field. This work introduces MultiCrystalSpectrumSet (MCS-Set), a curated framework that expands materials datasets by integrating atomic structures with 2D projections and structured textual annotations, including lattice parameters and coordination metrics. MCS-Set enables two key tasks: (1) multimodal property and summary prediction, and (2) constrained crystal generation with partial cluster supervision. Leveraging a human-in-the-loop pipeline, MCS-Set combines domain expertise with standardized descriptors for high-quality annotation. Evaluations using state-of-the-art language and vision-language models reveal substantial modality-specific performance gaps and highlight the importance of annotation quality for generalization. MCS-Set offers a foundation for benchmarking multimodal models, advancing annotation practices, and promoting accessible, versatile materials science datasets. The dataset and implementations are available at https://github.com/KurbanIntelligenceLab/MultiCrystalSpectrumSet.

[697] SOC-DGL: Social Interaction Behavior Inspired Dual Graph Learning Framework for Drug-Target Interaction Identification

Xiang Zhao, Ruijie Li, Qiao Ning, Shikai Guo, Hui Li, Qian Ma

Main category: cs.LG

TL;DR: SOC-DGL is a novel model for drug-target interaction prediction, leveraging heterogeneous graphs and dual modules (ADGL and EDGL) to capture multi-scale similarity, outperforming existing methods.

DetailsMotivation: Existing models for drug-target interactions (DTI) lack exploitation of heterogeneous graph similarities, limiting their effectiveness.

Method: SOC-DGL uses ADGL for global similarity and EDGL for higher-order similarity, with an adjustable imbalance loss function for dataset imbalance.

Result: SOC-DGL outperforms state-of-the-art methods on benchmark datasets and successfully predicts drug interactions, including unconfirmed ones.

Conclusion: SOC-DGL advances DTI prediction by effectively capturing multi-scale similarities and addressing dataset imbalance, with practical validation.

Abstract: The identification of drug-target interactions (DTI) is critical for drug discovery and repositioning, as it reveals potential therapeutic uses of existing drugs, accelerating development and reducing costs. However, most existing models focus only on direct similarity in homogeneous graphs, failing to exploit the rich similarity in heterogeneous graphs. To address this gap, inspired by real-world social interaction behaviors, we propose SOC-DGL, which comprises two specialized modules: the Affinity-Driven Graph Learning (ADGL) module, learning global similarity through an affinity-enhanced drug-target graph, and the Equilibrium-Driven Graph Learning (EDGL) module, capturing higher-order similarity by amplifying the influence of even-hop neighbors using an even-polynomial graph filter based on balance theory. This dual approach enables SOC-DGL to effectively capture similarity information across multiple interaction scales within affinity and association matrices. To address the issue of imbalance in DTI datasets, we propose an adjustable imbalance loss function that adjusts the weight of negative samples via a tunable parameter. Extensive experiments on four benchmark datasets demonstrate that SOC-DGL consistently outperforms existing state-of-the-art methods across both balanced and imbalanced scenarios. Moreover, SOC-DGL successfully predicts the top 9 drugs known to bind ABL1, and further analyzes the 10th-ranked drug, which has not been experimentally confirmed to interact with ABL1, providing supporting evidence for its potential binding.
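
An even-polynomial graph filter propagates information only along even powers of the adjacency matrix, so even-hop neighbors (friends of friends, in the social analogy) are amplified. A NumPy sketch with illustrative coefficients:

```python
import numpy as np

def even_polynomial_filter(adj: np.ndarray, x: np.ndarray,
                           order: int = 2, coeffs=None) -> np.ndarray:
    """Apply sum_k c_k * A^(2k) @ x for k = 1..order, i.e., mix
    features only over even-hop paths as balance theory suggests.
    The coefficients are illustrative defaults."""
    coeffs = coeffs if coeffs is not None else [1.0 / (k + 1) for k in range(order)]
    a2 = adj @ adj
    power = np.eye(adj.shape[0])
    out = np.zeros_like(x, dtype=float)
    for c in coeffs:
        power = power @ a2          # A^2, A^4, ...
        out += c * (power @ x)
    return out
```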

[698] Improving Group Robustness on Spurious Correlation via Evidential Alignment

Wenqian Ye, Guangtao Zheng, Aidong Zhang

Main category: cs.LG

TL;DR: The paper introduces Evidential Alignment, a framework using uncertainty quantification to mitigate spurious correlations in deep neural networks without needing group annotations.

DetailsMotivation: Deep neural networks often rely on spurious correlations, harming generalization and robustness. Existing methods require costly annotations or deterministic models, which may not capture all biases.

Method: Proposes Evidential Alignment, leveraging uncertainty quantification via second-order risk minimization and evidential calibration to identify and suppress spurious correlations.

Result: Empirical results show improved group robustness across architectures and data modalities, validating the method’s effectiveness.

Conclusion: Evidential Alignment provides a scalable, principled solution to spurious correlations without requiring annotations, enhancing model trustworthiness.

Abstract: Deep neural networks often learn and rely on spurious correlations, i.e., superficial associations between non-causal features and the targets. For instance, an image classifier may identify camels based on the desert backgrounds. While it can yield high overall accuracy during training, it degrades generalization on more diverse scenarios where such correlations do not hold. This problem poses significant challenges for out-of-distribution robustness and trustworthiness. Existing methods typically mitigate this issue by using external group annotations or auxiliary deterministic models to learn unbiased representations. However, such information is costly to obtain, and deterministic models may fail to capture the full spectrum of biases learned by the models. To address these limitations, we propose Evidential Alignment, a novel framework that leverages uncertainty quantification to understand the behavior of the biased models without requiring group annotations. By quantifying the evidence of model prediction with second-order risk minimization and calibrating the biased models with the proposed evidential calibration technique, Evidential Alignment identifies and suppresses spurious correlations while preserving core features. We theoretically justify the effectiveness of our method as capable of learning the patterns of biased models and debiasing the model without requiring any spurious correlation annotations. Empirical results demonstrate that our method significantly improves group robustness across diverse architectures and data modalities, providing a scalable and principled solution to spurious correlations.
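
Second-order risk minimization can be made concrete with a standard evidential objective: interpret network outputs as Dirichlet evidence and minimize the expected cross-entropy under that Dirichlet, which exposes the uncertainty a deterministic model would hide. A sketch in that common EDL style; the paper's exact objective and calibration step may differ:

```python
import torch
import torch.nn.functional as F

def evidential_risk(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Expected cross-entropy under Dirichlet(alpha) with
    alpha = exp(logits) + 1, using the identity
    E[-log p_y] = digamma(sum(alpha)) - digamma(alpha_y)."""
    alpha = torch.exp(logits.clamp(max=10.0)) + 1.0   # Dirichlet parameters
    s = alpha.sum(dim=-1, keepdim=True)
    y = F.one_hot(targets, num_classes=logits.shape[-1]).float()
    return (y * (torch.digamma(s) - torch.digamma(alpha))).sum(-1).mean()
```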

[699] Interpretable Reward Modeling with Active Concept Bottlenecks

Sonia Laguna, Katarzyna Kobalczyk, Julia E. Vogt, Mihaela Van der Schaar

Main category: cs.LG

TL;DR: CB-RM introduces interpretable reward models via concept annotation, using active learning for efficiency, outperforming baselines in interpretability and sample efficiency.

DetailsMotivation: To address the opacity of standard RLHF reward functions by decomposing rewards into human-interpretable concepts.

Method: Decomposes reward prediction into concepts, uses active learning (Expected Information Gain) for efficient label acquisition.

Result: Outperforms baselines on UltraFeedback dataset in interpretability and sample efficiency.

Conclusion: CB-RM advances transparent, auditable, and human-aligned reward models.

Abstract: We introduce Concept Bottleneck Reward Models (CB-RM), a reward modeling framework that enables interpretable preference learning through selective concept annotation. Unlike standard RLHF methods that rely on opaque reward functions, CB-RM decomposes reward prediction into human-interpretable concepts. To make this framework efficient in low-supervision settings, we formalize an active learning strategy that dynamically acquires the most informative concept labels. We propose an acquisition function based on Expected Information Gain and show that it significantly accelerates concept learning without compromising preference accuracy. Evaluated on the UltraFeedback dataset, our method outperforms baselines in interpretability and sample efficiency, marking a step towards more transparent, auditable, and human-aligned reward models.
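
An Expected Information Gain acquisition over binary concept labels can be approximated in the standard BALD form: the entropy of the mean prediction minus the mean entropy over posterior samples. A sketch under that assumption; the paper's exact EIG formulation may differ:

```python
import numpy as np

def entropy(p):
    """Binary entropy, elementwise, in nats."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def bald_eig(p_samples: np.ndarray) -> np.ndarray:
    """BALD-style EIG for binary concept labels.
    p_samples: (num_posterior_samples, num_candidates) predicted
    probabilities. Higher values = more informative labels to query."""
    return entropy(p_samples.mean(axis=0)) - entropy(p_samples).mean(axis=0)
```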

[700] Credit Risk Analysis for SMEs Using Graph Neural Networks in Supply Chain

Zizhou Zhang, Qinyan Shen, Zhuohuan Hu, Qianying Liu, Huijie Shen

Main category: cs.LG

TL;DR: A GNN-based framework improves SME credit risk analysis by leveraging transaction and social data, outperforming traditional methods with high AUC scores.

DetailsMotivation: SMEs are crucial to the economy but face challenges in credit risk analysis due to limited data, especially for online lenders.

Method: Uses Graph Neural Networks (GNNs) to analyze SME interactions from transaction and social data for spatial dependency mapping and default risk prediction.

Result: Achieves AUCs of 0.995 (supply chain) and 0.701 (default prediction), outperforming baselines. Helps regulators model supply chain impacts and aids stress testing.

Conclusion: The GNN framework is a scalable, effective tool for SME credit risk assessment, benefiting lenders and regulators.

Abstract: Small and Medium-sized Enterprises (SMEs) are vital to the modern economy, yet their credit risk analysis often struggles with scarce data, especially for online lenders lacking direct credit records. This paper introduces a Graph Neural Network (GNN)-based framework, leveraging SME interactions from transaction and social data to map spatial dependencies and predict loan default risks. Tests on real-world datasets from Discover and Ant Credit (23.4M nodes for supply chain analysis, 8.6M for default prediction) show the GNN surpasses traditional and other GNN baselines, with AUCs of 0.995 and 0.701 for supply chain mining and default prediction, respectively. It also helps regulators model supply chain disruption impacts on banks, accurately forecasting loan defaults from material shortages, and offers Federal Reserve stress testers key data for CCAR risk buffers. This approach provides a scalable, effective tool for assessing SME credit risk.

[701] Can Mental Imagery Improve the Thinking Capabilities of AI Systems?

Slimane Larabi

Main category: cs.LG

TL;DR: The paper proposes a machine thinking framework integrating mental imagery to enhance autonomous reasoning, validated through tests.

DetailsMotivation: Existing models lack autonomous reasoning and struggle with cross-domain knowledge integration, unlike humans who use mental imagery.

Method: A machine thinking framework with a Cognitive thinking unit and three auxiliary units (Input Data, Needs, Mental Imagery) is proposed, using natural language or sketches for data representation.

Result: Validation tests were conducted, with results presented and discussed.

Conclusion: Integrating mental imagery into machine thinking can initiate autonomous reasoning, addressing gaps in current AI models.

Abstract: Although existing models can interact with humans and provide satisfactory responses, they lack the ability to act autonomously or engage in independent reasoning. Furthermore, input data in these models is typically provided as explicit queries, even when some sensory data is already acquired. In addition, AI agents, which are computational entities designed to perform tasks and make decisions autonomously based on their programming, data inputs, and learned knowledge, have shown significant progress. However, they struggle with integrating knowledge across multiple domains, unlike humans. Mental imagery plays a fundamental role in the brain’s thinking process, which involves performing tasks based on internal multisensory data, planned actions, needs, and reasoning capabilities. In this paper, we investigate how to integrate mental imagery into a machine thinking framework and how this could be beneficial in initiating the thinking process. Our proposed machine thinking framework integrates a Cognitive thinking unit supported by three auxiliary units: the Input Data Unit, the Needs Unit, and the Mental Imagery Unit. Within this framework, data is represented as natural language sentences or drawn sketches, serving both informative and decision-making purposes. We conducted validation tests for this framework, and the results are presented and discussed.

[702] Greedy Low-Rank Gradient Compression for Distributed Learning with Convergence Guarantees

Chuyan Chen, Yutong He, Pengrui Li, Weichen Jia, Kun Yuan

Main category: cs.LG

TL;DR: GreedyLore is a greedy low-rank gradient compression algorithm for distributed learning, combining error feedback and semi-lazy subspace updates to ensure convergence with a linear speedup rate.

DetailsMotivation: Communication overhead in distributed optimization is a bottleneck; existing low-rank gradient compression methods either lack convergence guarantees (greedy) or perform poorly (randomized).

Method: GreedyLore uses error feedback to correct bias and semi-lazy subspace updates to maintain contractive compression, ensuring convergence.

Result: GreedyLore achieves a convergence rate of O(σ/√NT + 1/T), the first linear speedup for low-rank compression.

Conclusion: GreedyLore bridges the gap between empirical performance and theoretical guarantees in low-rank gradient compression.

Abstract: Distributed optimization is pivotal for large-scale signal processing and machine learning, yet communication overhead remains a major bottleneck. Low-rank gradient compression, in which the transmitted gradients are approximated by low-rank matrices to reduce communication, offers a promising remedy. Existing methods typically adopt either randomized or greedy compression strategies: randomized approaches project gradients onto randomly chosen subspaces, introducing high variance and degrading empirical performance; greedy methods select the most informative subspaces, achieving strong empirical results but lacking convergence guarantees. To address this gap, we propose GreedyLore, the first Greedy Low-Rank gradient compression algorithm for distributed learning with rigorous convergence guarantees. GreedyLore incorporates error feedback to correct the bias introduced by greedy compression and introduces a semi-lazy subspace update that ensures the compression operator remains contractive throughout all iterations. With these techniques, we prove that GreedyLore achieves a convergence rate of $\mathcal{O}(\sigma/\sqrt{NT} + 1/T)$ under standard optimizers such as MSGD and Adam, marking the first linear speedup convergence rate for low-rank gradient compression. Extensive experiments are conducted to validate our theoretical findings.
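
The two key mechanisms are easy to state in isolation: greedily compress the gradient plus carried-over error onto its top singular directions, and store what was lost so the bias is corrected over later rounds. A NumPy sketch; the semi-lazy subspace update that keeps the operator contractive is omitted for brevity:

```python
import numpy as np

def compress_with_error_feedback(grad, error_buf, rank=4):
    """Greedy low-rank compression with error feedback (sketch):
    project (gradient + carried error) onto its top-`rank` singular
    directions, the most informative subspace, and keep the residual
    as the next round's error buffer."""
    target = grad + error_buf
    u, s, vt = np.linalg.svd(target, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]   # rank-r approximation
    return low_rank, target - low_rank                # (to send, new error buffer)
```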

[703] Iceberg: Enhancing HLS Modeling with Synthetic Data

Zijian Ding, Tung Nguyen, Weikai Li, Aditya Grover, Yizhou Sun, Jason Cong

Main category: cs.LG

TL;DR: Iceberg improves HLS prediction model generalizability using synthetic data augmentation and weak labels, achieving significant accuracy and performance gains.

DetailsMotivation: Deep learning models for HLS struggle with generalization, prompting the need for methods like Iceberg to bridge this gap.

Method: Iceberg uses synthetic data augmentation (LLM-generated programs) and weak labels, integrated with an in-context model for meta-learning.

Result: Iceberg boosts modeling accuracy by 86.4% and improves DSE performance by 2.47x and 1.12x on test datasets.

Conclusion: Iceberg effectively enhances HLS model generalizability and performance, with open-sourced code available.

Abstract: Deep learning-based prediction models for High-Level Synthesis (HLS) of hardware designs often struggle to generalize. In this paper, we study how to close the generalizability gap of these models through pretraining on synthetic data and introduce Iceberg, a synthetic data augmentation approach that expands both large language model (LLM)-generated programs and weak labels of unseen design configurations. Our weak label generation method is integrated with an in-context model architecture, enabling meta-learning from actual and proximate labels. Iceberg improves the geometric mean modeling accuracy by $86.4\%$ when adapted to six real-world applications with few-shot examples and achieves $2.47\times$ and $1.12\times$ better offline DSE performance when adapting to two different test datasets. Our open-sourced code is here: https://github.com/UCLA-VAST/iceberg

[704] Relative Entropy Pathwise Policy Optimization

Claas Voelcker, Axel Brunnbauer, Marcel Hussing, Michal Nauman, Pieter Abbeel, Eric Eaton, Radu Grosu, Amir-massoud Farahmand, Igor Gilitschenski

Main category: cs.LG

TL;DR: The paper introduces REPPO, an on-policy algorithm combining pathwise policy gradients’ efficiency with on-policy learning’s simplicity, addressing high variance and training stability issues.

DetailsMotivation: Score-function policy gradients suffer from high variance, while pathwise policy gradients require accurate action-conditioned value functions. The paper aims to bridge this gap for stable on-policy learning.

Method: Proposes REPPO, which trains Q-value models purely from on-policy data, balancing exploration and stable updates, and evaluates architectural components for accurate value learning.

Result: REPPO shows strong performance with reduced sample needs, faster training, lower memory use, and robust hyperparameters in benchmarks.

Conclusion: REPPO effectively merges pathwise policy gradients’ efficiency with on-policy learning’s simplicity, offering a practical solution for stable and sample-efficient training.

Abstract: Score-function policy gradients have delivered strong results in game-playing, robotics, and language-model fine-tuning. Yet their high variance often undermines training stability. On the other hand, pathwise policy gradients alleviate the training variance but are reliable only when driven by an accurate action-conditioned value function, which is notoriously hard to train without relying on past off-policy data. In this paper, we discuss how to construct a value-gradient-driven, on-policy algorithm that allows training Q-value models purely from on-policy data, unlocking the possibility of using pathwise policy updates in the context of on-policy learning. We show how to balance stochastic policies for exploration with constrained policy updates for stable training, and evaluate important architectural components that facilitate accurate value function learning. Building on these insights, we propose Relative Entropy Pathwise Policy Optimization (REPPO), an efficient on-policy algorithm that combines the sample-efficiency of pathwise policy gradients with the simplicity and minimal memory footprint of standard on-policy learning. We demonstrate that REPPO provides strong empirical performance with decreased sample requirements, wall-clock time, and memory footprint, as well as high hyperparameter robustness, in a set of experiments on two standard GPU-parallelized benchmarks.
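
The pathwise update itself is compact: sample actions with the reparameterization trick, backpropagate through the learned Q-model, and constrain the step with a relative-entropy term. A sketch under hypothetical network interfaces; this is not the paper's exact loss:

```python
import torch

def pathwise_policy_loss(policy, q_net, states, kl_coef=0.1, old_policy=None):
    """Pathwise (reparameterized) policy objective driven by a Q-model.
    `policy(states)` is assumed to return a torch.distributions object
    with rsample(); `q_net(states, actions)` returns Q-values."""
    dist = policy(states)
    actions = dist.rsample()                   # differentiable sample
    loss = -q_net(states, actions).mean()      # value-gradient objective
    if old_policy is not None:                 # relative-entropy constraint
        old_dist = old_policy(states)
        loss = loss + kl_coef * torch.distributions.kl_divergence(
            old_dist, dist).mean()
    return loss
```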

[705] Composing Linear Layers from Irreducibles

Travis Pence, Daisuke Yamada, Vikas Singh

Main category: cs.LG

TL;DR: The paper explores the compositional structure of linear layers in large models using Clifford algebra, identifying bivectors as geometric primitives and introducing a parameter-efficient rotor-based decomposition method.

DetailsMotivation: To understand and synthesize the fundamental building blocks (geometric primitives) of linear layers in contemporary large models, which remain poorly understood.

Method: Uses Clifford algebra to express linear layers as compositions of bivectors and introduces a differentiable algorithm to decompose them into rotors, reducing parameters from O(d^2) to O(log^2 d).

Result: Rotor-based layers match the performance of strong baselines like block-Hadamard and low-rank approximations in LLM attention layers.

Conclusion: The study provides an algebraic perspective on how geometric primitives compose into higher-level functions in deep models, offering a parameter-efficient alternative.

Abstract: Contemporary large models often exhibit behaviors suggesting the presence of low-level primitives that compose into modules with richer functionality, but these fundamental building blocks remain poorly understood. We investigate this compositional structure in linear layers by asking: can we identify/synthesize linear transformations from a minimal set of geometric primitives? Using Clifford algebra, we show that linear layers can be expressed as compositions of bivectors – geometric objects encoding oriented planes – and introduce a differentiable algorithm that decomposes them into products of rotors. This construction uses only O(log^2 d) parameters, versus O(d^2) required by dense matrices. Applied to the key, query, and value projections in LLM attention layers, our rotor-based layers match the performance of strong baselines such as block-Hadamard and low-rank approximations. Our findings provide an algebraic perspective on how these geometric primitives can compose into higher-level functions within deep models.
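
A matrix-level analogue of composing rotors is composing plane (Givens) rotations, each costing a single angle parameter instead of a dense d x d block. A toy PyTorch sketch under that analogy; the coordinate pairing schedule is our placeholder, not the paper's bivector construction:

```python
import torch
import torch.nn as nn

class RotationComposedLayer(nn.Module):
    """Linear map built as a short product of learnable plane
    rotations (the matrix counterpart of rotors), using one parameter
    per rotation instead of a dense d*d weight matrix."""
    def __init__(self, dim: int, n_rotations: int):
        super().__init__()
        self.angles = nn.Parameter(torch.zeros(n_rotations))
        # Hypothetical fixed pairing of coordinates per rotation.
        self.pairs = [(k % dim, (2 * k + 1) % dim) for k in range(n_rotations)]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for theta, (i, j) in zip(self.angles, self.pairs):
            if i == j:
                continue
            c, s = torch.cos(theta), torch.sin(theta)
            xi, xj = x[..., i], x[..., j]
            x = x.clone()                      # keep autograd history intact
            x[..., i] = c * xi - s * xj
            x[..., j] = s * xi + c * xj
        return x
```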

[706] Fake or Real: The Impostor Hunt in Texts for Space Operations

Agata Kaczmarek, Dawid Płudowski, Piotr Wilczyński, Przemysław Biecek, Krzysztof Kotowski, Ramez Shendy, Jakub Nalepa, Artur Janicki, Evridiki Ntagiou

Main category: cs.LG

TL;DR: The Kaggle competition ‘Fake or Real’ challenges participants to detect maliciously modified outputs from Large Language Models (LLMs), addressing AI security threats like data poisoning and overreliance.

DetailsMotivation: The competition stems from real-life AI security threats identified in the ESA-funded 'Assurance for Space Domain AI Applications' project, aiming to tackle underexplored issues in LLM security.

Method: Participants must develop or adapt techniques to distinguish between genuine and maliciously altered LLM outputs.

Result: The competition seeks innovative solutions to a novel problem in AI security.

Conclusion: This initiative highlights the need for advanced methods to secure LLMs against emerging threats.

Abstract: The “Fake or Real” competition hosted on Kaggle (https://www.kaggle.com/competitions/fake-or-real-the-impostor-hunt) is the second part of a series of follow-up competitions and hackathons related to the “Assurance for Space Domain AI Applications” project funded by the European Space Agency (https://assurance-ai.space-codev.org/). The competition idea is based on two real-life AI security threats identified within the project – data poisoning and overreliance in Large Language Models. The task is to distinguish between the proper output of an LLM and the output generated under malicious modification of the LLM. As this problem has not been extensively researched, participants are required to develop new techniques to address this issue or adapt already existing ones to this problem’s statement.

[707] Bayesian Optimization for Molecules Should Be Pareto-Aware

Anabel Yong, Austin Tripp, Layla Hosseini-Gerami, Brooks Paige

Main category: cs.LG

TL;DR: Multi-objective Bayesian optimization (MOBO) with EHVI outperforms scalarized EI in molecular design, showing better Pareto front coverage, speed, and diversity.

DetailsMotivation: To empirically compare MOBO (using EHVI) with scalarized alternatives (using EI) in molecular optimization tasks.

Method: Benchmarked EHVI against scalarized EI using identical Gaussian Process surrogates and molecular representations in three tasks.

Result: EHVI consistently outperformed scalarized EI in Pareto front coverage, convergence speed, and chemical diversity.

Conclusion: Pareto-aware acquisition (EHVI) is advantageous in molecular optimization, especially with limited evaluation budgets and nontrivial trade-offs.

Abstract: Multi-objective Bayesian optimization (MOBO) provides a principled framework for navigating trade-offs in molecular design. However, its empirical advantages over scalarized alternatives remain underexplored. We benchmark a simple Pareto-based MOBO strategy – Expected Hypervolume Improvement (EHVI) – against a simple fixed-weight scalarized baseline using Expected Improvement (EI), under a tightly controlled setup with identical Gaussian Process surrogates and molecular representations. Across three molecular optimization tasks, EHVI consistently outperforms scalarized EI in terms of Pareto front coverage, convergence speed, and chemical diversity. While scalarization encompasses flexible variants – including random or adaptive schemes – our results show that even strong deterministic instantiations can underperform in low-data regimes. These findings offer concrete evidence for the practical advantages of Pareto-aware acquisition in de novo molecular optimization, especially when evaluation budgets are limited and trade-offs are nontrivial.
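
The quantity EHVI integrates is the hypervolume improvement a candidate adds to the current Pareto front. For two maximization objectives this has a simple closed form; a sketch (EHVI itself would average this improvement over the GP posterior, e.g., via posterior samples):

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Hypervolume dominated by `points` w.r.t. `ref` for two
    maximization objectives (all points assumed to dominate ref)."""
    pts = sorted(set(map(tuple, points)), reverse=True)  # sort by f1 descending
    hv, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 > best_f2:
            hv += (f1 - ref[0]) * (f2 - best_f2)
            best_f2 = f2
    return hv

def hypervolume_improvement(front, candidate, ref):
    """HVI of one candidate; EHVI is its expectation under the GP
    posterior (e.g., averaged over posterior samples of the candidate's
    objective values)."""
    return hypervolume_2d(list(front) + [candidate], ref) - hypervolume_2d(front, ref)
```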

cs.MA

[708] Learning to Communicate in Multi-Agent Reinforcement Learning for Autonomous Cyber Defence

Faizan Contractor, Li Li, Ranwa Al Mallah

Main category: cs.MA

TL;DR: The paper proposes a method for cooperative multi-agent reinforcement learning in partially observable cyber environments, where agents learn to communicate and defend against threats using the Differentiable Inter Agent Learning algorithm.

DetailsMotivation: Current methods limit coordinated effects due to independent agent actions during execution. Effective communication can enhance decision-making in cyber defense.

Method: Agents train in the Cyber Operations Research Gym using the Differentiable Inter Agent Learning algorithm to learn tactical policies and minimal-cost communication.

Result: Agents develop tactical policies similar to human experts and learn efficient communication strategies.

Conclusion: The approach improves coordinated defense in cyber environments through learned communication and tactical policies.

Abstract: Popular methods in cooperative Multi-Agent Reinforcement Learning with partially observable environments typically allow agents to act independently during execution, which may limit the coordinated effect of the trained policies. However, by sharing information such as known or suspected ongoing threats, effective communication can lead to improved decision-making in the cyber battle space. We propose a game design where defender agents learn to communicate and defend against imminent cyber threats by playing training games in the Cyber Operations Research Gym, using the Differentiable Inter Agent Learning algorithm adapted to the cyber operational environment. The tactical policies learned by these autonomous agents are akin to those of human experts during incident responses to avert cyber threats. In addition, the agents simultaneously learn minimal-cost communication messages while learning their defence tactical policies.

[709] LLM-Enhanced Multi-Agent Reinforcement Learning with Expert Workflow for Real-Time P2P Energy Trading

Chengwei Lou, Zekai Jin, Wei Tang, Guangfei Geng, Jin Yang, Lu Zhang

Main category: cs.MA

TL;DR: The paper proposes an LLM-MARL framework for real-time P2P electricity markets, combining large language models with multi-agent reinforcement learning to address challenges like prosumer limitations and grid security.

DetailsMotivation: To overcome scaling challenges in P2P electricity markets, such as diverse decision-making demands and lack of expert guidance, by integrating LLMs and MARL.

Method: An LLM-MARL framework under CTDE, using LLMs as experts for personalized strategies and a differential attention-based critic network for improved convergence.

Result: LLM-generated strategies replace human experts effectively, achieving lower economic costs and voltage violation rates than baselines.

Conclusion: The framework bridges expert knowledge with agent learning, offering a robust solution for real-time P2P market decision-making.

Abstract: Real-time peer-to-peer (P2P) electricity markets dynamically adapt to fluctuations in renewable energy and variations in demand, maximizing economic benefits through instantaneous price responses while enhancing grid flexibility. However, scaling expert guidance for massive personalized prosumers poses critical challenges, including diverse decision-making demands and a lack of customized modeling frameworks. This paper proposes an integrated large language model-multi-agent reinforcement learning (LLM-MARL) framework for real-time P2P energy trading to address challenges such as the limited technical capability of prosumers, the lack of expert experience, and security issues of distribution networks. LLMs are introduced as experts to generate personalized strategies, guiding MARL under the centralized training with decentralized execution (CTDE) paradigm through imitation learning. A differential attention-based critic network is designed to enhance convergence performance. Experimental results demonstrate that LLM-generated strategies effectively substitute for human experts. The proposed multi-agent imitation learning algorithms achieve significantly lower economic costs and voltage violation rates on test sets compared to baseline algorithms, while maintaining robust stability. This work provides an effective solution for real-time P2P electricity market decision-making by bridging expert knowledge with agent learning.

[710] EduThink4AI: Translating Educational Critical Thinking into Multi-Agent LLM Systems

Xinmeng Hou, Zhouquan Lu, Wenli Chen, Hai Hu, Qing Guo

Main category: cs.MA

TL;DR: EDU-Prompting is a multi-agent framework designed to improve LLM-based educational tutoring by enhancing critical thinking and reducing biases in responses.

DetailsMotivation: Current LLM-based educational systems struggle with promoting critical thinking and handling adversarial prompts, leading to biased or incorrect responses.

Method: Proposes EDU-Prompting, a multi-agent framework integrating educational critical thinking theories with LLM design to generate bias-aware explanations and diverse perspectives.

Result: EDU-Prompting improves truthfulness and logical soundness in AI-generated educational responses, as shown in theoretical benchmarks and practical scenarios.

Conclusion: The modular design of EDU-Prompting allows easy integration into existing systems, enhancing critical thinking without major modifications.

Abstract: Large language models (LLMs) have demonstrated significant potential as educational tutoring agents, capable of tailoring hints, orchestrating lessons, and grading with near-human finesse across various academic domains. However, current LLM-based educational systems exhibit critical limitations in promoting genuine critical thinking, failing on over one-third of multi-hop questions with counterfactual premises, and remaining vulnerable to adversarial prompts that trigger biased or factually incorrect responses. To address these gaps, we propose EDU-Prompting, a novel multi-agent framework that bridges established educational critical thinking theories with LLM agent design to generate critical, bias-aware explanations while fostering diverse perspectives. Our systematic evaluation across theoretical benchmarks and practical college-level critical writing scenarios demonstrates that EDU-Prompting significantly enhances both content truthfulness and logical soundness in AI-generated educational responses. The framework’s modular design enables seamless integration into existing prompting frameworks and educational applications, allowing practitioners to directly incorporate critical thinking catalysts that promote analytical reasoning and introduce multiple perspectives without requiring extensive system modifications.

[711] LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra

Seth Karten, Wenzhe Li, Zihan Ding, Samuel Kleiner, Yu Bai, Chi Jin

Main category: cs.MA

TL;DR: The LLM Economist framework uses agent-based modeling to design and evaluate economic policies, combining bounded rational worker agents and a planner agent for hierarchical decision-making, achieving improved social welfare.

DetailsMotivation: To create a credible framework for fiscal experimentation by modeling complex economic systems with realistic agent populations and natural language-based mechanism design.

Method: Uses persona-conditioned worker agents for labor supply decisions and a planner agent for tax policy proposals via in-context reinforcement learning.

Result: The planner converges near Stackelberg equilibria, improving aggregate social welfare compared to Saez solutions, with further gains under decentralized governance.

Conclusion: Large language model-based agents can effectively model, simulate, and govern economic systems, offering a scalable test bed for policy evaluation.

Abstract: We present the LLM Economist, a novel framework that uses agent-based modeling to design and assess economic policies in strategic environments with hierarchical decision-making. At the lower level, bounded rational worker agents – instantiated as persona-conditioned prompts sampled from U.S. Census-calibrated income and demographic statistics – choose labor supply to maximize text-based utility functions learned in-context. At the upper level, a planner agent employs in-context reinforcement learning to propose piecewise-linear marginal tax schedules anchored to the current U.S. federal brackets. This construction endows economic simulacra with three capabilities requisite for credible fiscal experimentation: (i) optimization of heterogeneous utilities, (ii) principled generation of large, demographically realistic agent populations, and (iii) mechanism design – the ultimate nudging problem – expressed entirely in natural language. Experiments with populations of up to one hundred interacting agents show that the planner converges near Stackelberg equilibria that improve aggregate social welfare relative to Saez solutions, while a periodic, persona-level voting procedure furthers these gains under decentralized governance. These results demonstrate that large language model-based agents can jointly model, simulate, and govern complex economic systems, providing a tractable test bed for policy evaluation at the societal scale to help build better civilizations.
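
As a small illustration of the planner's action space, the sketch below evaluates a piecewise-linear marginal tax schedule; the bracket edges and rates are hypothetical placeholders, and nothing else of the simulacrum (personas, utilities, in-context RL) is modeled here.

```python
def tax_owed(income, bracket_edges, marginal_rates):
    """Tax under a piecewise-linear marginal schedule: each rate applies only
    to the slice of income falling inside its bracket."""
    tax = 0.0
    upper_edges = bracket_edges[1:] + [float("inf")]
    for lo, hi, rate in zip(bracket_edges, upper_edges, marginal_rates):
        if income <= lo:
            break
        tax += rate * (min(income, hi) - lo)
    return tax

# Hypothetical three-bracket schedule (illustrative numbers only).
print(tax_owed(50_000, [0, 11_000, 44_725], [0.10, 0.12, 0.22]))  # 6307.5
```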

[712] Set-Rationalizable Choice and Self-Stability

Felix Brandt, Paul Harrenstein

Main category: cs.MA

TL;DR: Choice functions that select sets of alternatives can be rationalized by preference relations over sets; this set-rationalizability is characterized by a new contraction condition, and together with its expansion counterpart it is equivalent to self-stability.

DetailsMotivation: Rationalizability of choice via binary preference relations is highly problematic in social choice, as witnessed by impossibility results such as Arrow's, motivating a notion of rationalizability suited to set-valued choice.

Method: Introduces set-rationalizability and two consistency conditions, $\hat\alpha$ and $\hat\gamma$, defined in analogy to Sen's $\alpha$ (contraction) and $\gamma$ (expansion), and characterizes the choice functions satisfying them.

Result: A choice function is set-rationalizable if and only if it satisfies $\hat\alpha$; it satisfies both $\hat\alpha$ and $\hat\gamma$ if and only if it is self-stable.

Conclusion: The class of self-stable social choice functions contains appealing Condorcet extensions such as the minimal covering set and the essential set.

Abstract: A common assumption in modern microeconomic theory is that choice should be rationalizable via a binary preference relation, which Sen (1971) showed to be equivalent to two consistency conditions, namely $\alpha$ (contraction) and $\gamma$ (expansion). Within the context of \emph{social} choice, however, rationalizability and similar notions of consistency have proved to be highly problematic, as witnessed by a range of impossibility results, among which Arrow’s is the most prominent. Since choice functions select \emph{sets} of alternatives rather than single alternatives, we propose to rationalize choice functions by preference relations over sets (set-rationalizability). We also introduce two consistency conditions, $\hat\alpha$ and $\hat\gamma$, which are defined in analogy to $\alpha$ and $\gamma$, and find that a choice function is set-rationalizable if and only if it satisfies $\hat\alpha$. Moreover, a choice function satisfies $\hat\alpha$ and $\hat\gamma$ if and only if it is \emph{self-stable}, a new concept based on earlier work by Dutta (1988). The class of self-stable social choice functions contains a number of appealing Condorcet extensions such as the minimal covering set and the essential set.
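
The two conditions are easy to check by brute force on a small universe. The sketch below assumes the following readings of the conditions (paraphrased from the abstract; consult the paper for the exact statements): $\hat\alpha$ requires that for feasible $A \subseteq B$ with $C(B) \subseteq A$, $C(A) = C(B)$; $\hat\gamma$ requires that $C(A) = C(B)$ implies $C(A \cup B) = C(A)$.

```python
from itertools import combinations

def nonempty_subsets(universe):
    xs = sorted(universe)
    for r in range(1, len(xs) + 1):
        for s in combinations(xs, r):
            yield frozenset(s)

def satisfies_alpha_hat(C):
    """Set-contraction: C(B) <= A <= B  implies  C(A) == C(B)."""
    return all(not (A <= B and C[B] <= A) or C[A] == C[B]
               for A in C for B in C)

def satisfies_gamma_hat(C):
    """Set-expansion: C(A) == C(B)  implies  C(A | B) == C(A)."""
    return all(C[A] != C[B] or C[A | B] == C[A]
               for A in C for B in C)

# Toy choice function on {a, b, c}: always choose the alphabetically first element.
C = {S: frozenset({min(S)}) for S in nonempty_subsets({"a", "b", "c"})}
print(satisfies_alpha_hat(C), satisfies_gamma_hat(C))  # True True
```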

[713] Recognizing and Eliciting Weakly Single Crossing Profiles on Trees

Palash Dey

Main category: cs.MA

TL;DR: The paper introduces the weakly single-crossing domain on trees, generalizing the single-crossing domain in social choice theory. It provides algorithms for recognition and elicitation, with complexity bounds.

DetailsMotivation: To generalize the single-crossing domain and address challenges in recognizing and eliciting preferences efficiently, especially when the underlying structure is unknown.

Method: Develops polynomial-time recognition and efficient sequential elicitation algorithms, with proofs of lower bounds on query complexity.

Result: The algorithms are efficient, and lower bounds confirm their optimality, resolving an open question about random query complexity.

Conclusion: The work advances understanding of preference domains in social choice, offering practical tools and theoretical insights.

Abstract: We introduce and study the weakly single-crossing domain on trees, which is a generalization of the well-studied single-crossing domain in social choice theory. We design a polynomial-time algorithm for recognizing preference profiles which belong to this domain. We then develop an efficient elicitation algorithm for this domain which works even if the preferences can be accessed only sequentially and the underlying single-crossing tree structure is not known beforehand. We also prove a matching lower bound on the query complexity of our elicitation algorithm when the number of voters is large compared to the number of candidates. We also prove a lower bound of $\Omega(m^2\log n)$ on the number of queries that any algorithm needs to ask to elicit a single-crossing profile when random queries are allowed. This resolves an open question from an earlier paper and proves the optimality of its preference elicitation algorithm when random queries are allowed.
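
For orientation, the classical single-crossing property that this domain generalizes is straightforward to check: along a fixed ordering of the voters, each pair of candidates may swap relative rank at most once. The sketch below checks that linear special case; recognizing the weakly single-crossing domain on trees is the harder problem the paper solves.

```python
from itertools import combinations

def is_single_crossing(profile):
    """profile: list of rankings (tuples of candidates, best first),
    listed in the fixed voter order along the line."""
    for a, b in combinations(profile[0], 2):
        prefers_a = [r.index(a) < r.index(b) for r in profile]
        # 'a over b' may flip at most once as we walk down the voter order.
        if sum(x != y for x, y in zip(prefers_a, prefers_a[1:])) > 1:
            return False
    return True

print(is_single_crossing([("a", "b", "c"), ("b", "a", "c"), ("b", "c", "a")]))  # True
print(is_single_crossing([("a", "b", "c"), ("b", "a", "c"), ("a", "b", "c")]))  # False
```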

[714] DHLight: Multi-agent Policy-based Directed Hypergraph Learning for Traffic Signal Control

Zhen Lei, Zhishu Shen, Kang Wang, Zhenwei Wang, Tiehua Zhang

Main category: cs.MA

TL;DR: DHLight is a novel multi-agent framework combining directed hypergraph learning for adaptive traffic signal control, outperforming traditional graph-based methods.

DetailsMotivation: Traditional graph structures fail to capture complex spatio-temporal traffic dynamics, limiting effectiveness in traffic signal control.

Method: DHLight integrates a directed hypergraph learning module with a dynamic construction mechanism to model evolving traffic relationships.

Result: DHLight outperforms state-of-the-art baselines in experiments across various network datasets.

Conclusion: Directed hypergraphs enhance traffic signal control by better representing spatial relationships, validated by DHLight’s superior performance.

Abstract: Recent advancements in Deep Reinforcement Learning (DRL) and Graph Neural Networks (GNNs) have demonstrated notable promise in the realm of intelligent traffic signal control, facilitating coordination across multiple intersections. However, traditional methods that rely on standard graph structures often fail to capture the intricate higher-order spatio-temporal correlations inherent in real-world traffic dynamics. Standard graphs cannot fully represent the spatial relationships within road networks, which limits the effectiveness of graph-based approaches. In contrast, directed hypergraphs provide a more accurate representation of spatial information by modeling complex directed relationships among multiple nodes. In this paper, we propose DHLight, a novel multi-agent policy-based framework that synergistically integrates a directed hypergraph learning module. This framework introduces a novel dynamic directed hypergraph construction mechanism, which captures complex and evolving spatio-temporal relationships among intersections in road networks. By leveraging the directed hypergraph relational structure, DHLight empowers agents to achieve adaptive decision-making in traffic signal control. The effectiveness of DHLight is validated against state-of-the-art baselines through extensive experiments on various network datasets. We release the code to support the reproducibility of this work at https://github.com/LuckyVoasem/Traffic-Light-control.
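
To fix ideas, here is a generic directed-hypergraph propagation step in numpy: each directed hyperedge aggregates features from its tail (upstream) nodes and distributes the message to its head (downstream) nodes. This is our simplified illustration of why the representation is richer than pairwise edges, not DHLight's actual layer.

```python
import numpy as np

def directed_hyperconv(X, hyperedges, W):
    """X: (n_nodes, d) node features; hyperedges: list of (tails, heads)
    index lists; W: (d, d) weight matrix. One propagation step."""
    out = np.zeros_like(X)
    deg = np.zeros(len(X))
    for tails, heads in hyperedges:
        msg = X[tails].mean(axis=0) @ W      # aggregate all upstream nodes at once
        for h in heads:                      # distribute to downstream nodes
            out[h] += msg
            deg[h] += 1
    deg[deg == 0] = 1.0
    return np.tanh(out / deg[:, None])

# Toy road network: intersections 0 and 1 jointly feed 2; 2 feeds 3 and 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
edges = [([0, 1], [2]), ([2], [3, 0])]
print(directed_hyperconv(X, edges, np.eye(8)).shape)  # (4, 8)
```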

[715] AutoGen Driven Multi Agent Framework for Iterative Crime Data Analysis and Prediction

Syeda Kisaa Fatima, Tehreem Zubair, Noman Ahmed, Asifullah Khan

Main category: cs.MA

TL;DR: LUCID-MA is an AI framework using multiple agents to analyze crime data, featuring spatiotemporal analysis, feedback refinement, and crime prediction, all offline with self-improvement through agent communication.

DetailsMotivation: To enable autonomous, scalable, and iterative crime analysis while maintaining data privacy and leveraging emergent intelligence from multi-agent collaboration.

Method: Uses three AI agents (analysis, feedback, prediction) with LLaMA-2-13B-Chat-GPTQ, offline execution, and 100 rounds of self-improving communication. Performance is tracked via scoring and visual plots.

Result: Demonstrates enhanced agent performance through collaborative dialogue, emergent intelligence, and scalable crime analysis.

Conclusion: LUCID-MA showcases the potential of AutoGen-style agents for social science applications, emphasizing privacy, scalability, and emergent intelligence.

Abstract: This paper introduces LUCID-MA (Learning and Understanding Crime through Dialogue of Multiple Agents), an innovative AI-powered framework in which multiple AI agents collaboratively analyze and understand crime data. Our system consists of three core components: an analysis assistant that highlights spatiotemporal crime patterns; a feedback component that reviews and refines analytical results; and a prediction component that forecasts future crime trends. With a well-designed prompt and the LLaMA-2-13B-Chat-GPTQ model, it runs completely offline and allows the agents to undergo self-improvement through 100 rounds of communication with little human interaction. A scoring function is incorporated to evaluate agent performance, providing visual plots to track learning progress. This work demonstrates the potential of AutoGen-style agents for autonomous, scalable, and iterative analysis in social science domains, maintaining data privacy through offline execution. It also showcases a computational model with emergent intelligence, where the system’s global behavior emerges from the interactions of its agents. This emergent behavior manifests as enhanced individual agent performance, driven by collaborative dialogue between the LLM-based agents.

cs.MM

[716] Music-Aligned Holistic 3D Dance Generation via Hierarchical Motion Modeling

Xiaojie Li, Ronghui Li, Shukai Fang, Shuzhao Xie, Xiaoyang Guo, Jiaqing Zhou, Junkun Peng, Zhi Wang

Main category: cs.MM

TL;DR: SoulDance dataset and SoulNet framework address challenges in generating music-aligned holistic dance by providing high-precision data and a method for coordinated motion generation.

DetailsMotivation: The scarcity of holistic 3D dance datasets and the difficulty of cross-modal alignment between music and dance motivate the need for a solution.

Method: SoulNet uses Hierarchical Residual Vector Quantization, a Music-Aligned Generative Model, and a Music-Motion Retrieval Module for synchronized dance generation.

Result: SoulNet outperforms existing methods in producing high-quality, music-coordinated 3D dance sequences.

Conclusion: SoulNet and SoulDance dataset effectively address the challenges of holistic dance generation, offering superior performance.

Abstract: Well-coordinated, music-aligned holistic dance enhances emotional expressiveness and audience engagement. However, generating such dances remains challenging due to the scarcity of holistic 3D dance datasets, the difficulty of achieving cross-modal alignment between music and dance, and the complexity of modeling interdependent motion across the body, hands, and face. To address these challenges, we introduce SoulDance, a high-precision music-dance paired dataset captured via professional motion capture systems, featuring meticulously annotated holistic dance movements. Building on this dataset, we propose SoulNet, a framework designed to generate music-aligned, kinematically coordinated holistic dance sequences. SoulNet consists of three principal components: (1) Hierarchical Residual Vector Quantization, which models complex, fine-grained motion dependencies across the body, hands, and face; (2) Music-Aligned Generative Model, which composes these hierarchical motion units into expressive and coordinated holistic dance; (3) Music-Motion Retrieval Module, a pre-trained cross-modal model that functions as a music-dance alignment prior, ensuring temporal synchronization and semantic coherence between generated dance and input music throughout the generation process. Extensive experiments demonstrate that SoulNet significantly surpasses existing approaches in generating high-quality, music-coordinated, and well-aligned holistic 3D dance sequences.

[717] Prompt-aware of Frame Sampling for Efficient Text-Video Retrieval

Deyu Zhang, Tingting Long, Jinrui Zhang, Ligeng Chen, Ju Ren, Yaoxue Zhang

Main category: cs.MM

TL;DR: ProCLIP is a user-centric framework for efficient text-video retrieval on edge devices, combining prompt-aware frame sampling and two-stage pruning to balance accuracy and computational efficiency.

DetailsMotivation: Existing methods struggle to balance accuracy and efficiency in text-video retrieval, with uniform sampling being computationally costly and salient-frame sampling being query-agnostic.

Method: ProCLIP uses prompt-aware frame sampling to dynamically select relevant frames and a two-stage pruning strategy (coarse filtering + CLIP-powered re-ranking) for efficiency.

Result: ProCLIP reduces latency by 75.3% while maintaining competitive accuracy (R@1=49.0 on MSR-VTT).

Conclusion: ProCLIP effectively addresses the trade-off between accuracy and efficiency in text-video retrieval, making it suitable for edge-end applications.

Abstract: Enabling efficient text-video retrieval on edge-end devices is critical for real-world applications. Yet, existing methods face a critical challenge in balancing accuracy and computational efficiency: uniform frame sampling methods ensure content coverage but incur prohibitive computational costs, while salient-frame sampling methods reduce overhead but suffer from query-agnostic frame selection that biases retrieval results. To address this, we propose ProCLIP, a user-centric framework that achieves state-of-the-art accuracy with significantly improved efficiency. We design a prompt-aware frame sampling strategy that dynamically guides lightweight feature extractors using textual prompts to select semantically relevant frames, overcoming the limitations of existing salient-frame sampling methods which rely on static, query-agnostic selection criteria. Moreover, we adopt a two-stage candidate pruning strategy that combines rapid coarse filtering via a lightweight module with CLIP-powered fine-grained re-ranking, enhancing retrieval efficiency while preserving accuracy. Experiments across benchmarks show ProCLIP achieves a 75.3% latency reduction versus baselines while maintaining competitive accuracy, e.g., R@1 = 49.0 on the MSR-VTT dataset. Code is available at https://github.com/tiffylong/ProCLIP.
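
A minimal sketch of the two ideas, with all embeddings stubbed out as random vectors (the lightweight extractor, the CLIP scorer, and the function names here are our stand-ins):

```python
import numpy as np

def prompt_aware_frames(frame_feats, text_feat, k):
    """Keep the k frames most similar to the text prompt (cosine similarity)."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    return np.argsort(-(f @ t))[:k]

def two_stage_prune(coarse_scores, fine_scorer, m):
    """Stage 1: cheap coarse filter to m candidates. Stage 2: re-rank only those."""
    shortlist = np.argsort(-coarse_scores)[:m]
    return shortlist[np.argsort(-np.array([fine_scorer(v) for v in shortlist]))]

rng = np.random.default_rng(1)
frames, query = rng.normal(size=(64, 512)), rng.normal(size=512)
print(prompt_aware_frames(frames, query, k=8))           # indices of sampled frames
print(two_stage_prune(rng.normal(size=100), lambda v: -v, m=10))
```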

[718] Point Cloud Streaming with Latency-Driven Implicit Adaptation using MoQ

Andrew Freeman, Michael Rudolph, Amr Rizk

Main category: cs.MM

TL;DR: The paper proposes using Media Over QUIC for server-side adaptation in point cloud streaming, balancing latency and video quality per client.

DetailsMotivation: Point clouds are high-bitrate, limiting live streaming feasibility. Existing HTTP-based methods require explicit client-side adaptation, which is inefficient.

Method: Leverages Media Over QUIC’s delivery timeout for implicit server-side adaptation based on latency targets.

Result: Demonstrates per-client trade-offs: lower latency requirements yield lower-quality video, and vice versa.

Conclusion: The system enables efficient point cloud streaming by dynamically adjusting quality based on client latency needs.

Abstract: Point clouds are a promising video representation for next-generation multimedia experiences in virtual and augmented reality. Point clouds are notoriously high-bitrate, however, which limits the feasibility of live streaming systems. Prior methods have adopted traditional HTTP-based protocols for point cloud streaming, but they rely on explicit client-side adaptation to maintain low latency under congestion. In this work, we leverage the delivery timeout feature within the Media Over QUIC protocol to perform implicit server-side adaptation based on an application’s latency target. Through experimentation with several publisher and network configurations, we demonstrate that our system unlocks a unique trade-off on a per-client basis: applications with lower latency requirements will receive lower-quality video, while applications with more relaxed latency requirements will receive higher-quality video.
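
The adaptation mechanism itself reduces to a simple server-side rule, sketched below (our simplification of the drop policy implied by the delivery timeout, not the MoQ wire protocol):

```python
import time

def publish_frames(queue, send, latency_target_s):
    """queue holds (capture_time, frame) pairs, oldest first. Frames that can
    no longer meet the subscriber's latency target are dropped, not sent, so
    congestion implicitly lowers quality instead of growing delay."""
    while queue:
        captured_at, frame = queue.pop(0)
        if time.monotonic() - captured_at > latency_target_s:
            continue   # expired under congestion: skip this frame
        send(frame)
```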

[719] Music Grounding by Short Video

Zijie Xin, Minquan Wang, Jingyu Liu, Ye Ma, Quan Chen, Peng Jiang, Xirong Li

Main category: cs.MM

TL;DR: The paper introduces a new task, Music Grounding by Short Video (MGSV), to address the gap between video-to-music retrieval and practical music moment localization. It proposes a benchmark (MGSV-EC) and a baseline method (MaDe) for this task.

DetailsMotivation: Existing video-to-music retrieval (V2MR) methods require manual trimming of long music tracks to match short videos, which is impractical. The paper aims to automate this process by localizing suitable music moments directly.

Method: The paper introduces the MGSV-EC benchmark with 53k short videos and 35k music moments. It also develops MaDe, an end-to-end deep network for video-to-music matching and moment detection.

Result: Experiments on MGSV-EC demonstrate the challenge of MGSV and establish MaDe as a strong baseline method.

Conclusion: The paper successfully bridges the gap between V2MR and practical needs by proposing MGSV, a benchmark, and a baseline method, MaDe.

Abstract: Adding proper background music helps complete a short video to be shared. Previous work tackles the task by video-to-music retrieval (V2MR), aiming to find the most suitable music track from a collection to match the content of a given query video. In practice, however, music tracks are typically much longer than the query video, necessitating (manual) trimming of the retrieved music to a shorter segment that matches the video duration. In order to bridge the gap between the practical need for music moment localization and V2MR, we propose a new task termed Music Grounding by Short Video (MGSV). To tackle the new task, we introduce a new benchmark, MGSV-EC, which comprises a diverse set of 53k short videos associated with 35k different music moments from 4k unique music tracks. Furthermore, we develop a new baseline method, MaDe, which performs both video-to-music matching and music moment detection within a unified end-to-end deep network. Extensive experiments on MGSV-EC not only highlight the challenging nature of MGSV but also set MaDe as a strong baseline.

eess.AS

[720] Towards Accurate Phonetic Error Detection Through Phoneme Similarity Modeling

Xuanru Zhou, Jiachen Lian, Cheol Jun Cho, Tejas Prabhune, Shuhe Li, William Li, Rodrigo Ortiz, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary Miller, Maria Gorno-Tempini, Gopala Anumanchipalli

Main category: eess.AS

TL;DR: A framework for phonetic error detection using multi-task training and phoneme similarity modeling, with a new dataset and metrics.

DetailsMotivation: Addressing challenges in phoneme recognition due to speech variability like accents and dysfluencies.

Method: Verbatim phoneme recognition with multi-task training and novel phoneme similarity modeling.

Result: Development of VCTK-accent dataset and two novel metrics for pronunciation assessment.

Conclusion: Sets a new benchmark for phonetic error detection.

Abstract: Phonetic error detection, a core subtask of automatic pronunciation assessment, identifies pronunciation deviations at the phoneme level. Speech variability from accents and dysfluencies challenges accurate phoneme recognition, with current models failing to capture these discrepancies effectively. We propose a verbatim phoneme recognition framework using multi-task training with novel phoneme similarity modeling that transcribes what speakers actually say rather than what they’re supposed to say. We develop and open-source \textit{VCTK-accent}, a simulated dataset containing phonetic errors, and propose two novel metrics for assessing pronunciation differences. Our work establishes a new benchmark for phonetic error detection.
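
One standard way to operationalize phoneme similarity for error detection is a weighted edit distance in which substituting similar phonemes costs less than substituting dissimilar ones. The sketch below uses a toy hand-written similarity table; the paper instead learns similarity inside a multi-task model.

```python
def weighted_phoneme_distance(ref, hyp, sim):
    """Levenshtein alignment where substituting similar phonemes costs less.
    sim(p, q) in [0, 1]; substitution cost is 1 - sim."""
    n, m = len(ref), len(hyp)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = float(i)
    for j in range(1, m + 1):
        D[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (1.0 - sim(ref[i - 1], hyp[j - 1]))
            D[i][j] = min(sub, D[i - 1][j] + 1.0, D[i][j - 1] + 1.0)
    return D[n][m]

# Toy similarity: identical phonemes 1.0, same voicing pair 0.5, else 0.0.
pairs = {frozenset({"b", "p"}), frozenset({"d", "t"}), frozenset({"g", "k"})}
sim = lambda p, q: 1.0 if p == q else (0.5 if frozenset({p, q}) in pairs else 0.0)
print(weighted_phoneme_distance(["b", "æ", "t"], ["p", "æ", "t"], sim))  # 0.5
```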

[721] Adapting Whisper for Lightweight and Efficient Automatic Speech Recognition of Children for On-device Edge Applications

Satwik Dutta, Shruthigna Chandupatla, John Hansen

Main category: eess.AS

TL;DR: A lightweight Whisper ASR system for child-centered voice apps was developed for Raspberry Pi, achieving a WER of 15.9% (11.8% filtered) with low-rank compression reducing encoder size and improving inference speed.

DetailsMotivation: Addressing privacy and regulatory challenges in cloud-based ASR for child-centered applications by creating an efficient, on-device solution.

Method: Fine-tuning the 'tiny.en' model using the MyST corpus and applying filtering strategies, along with low-rank compression to reduce model size and speed up inference.

Result: Achieved a WER of 15.9% (11.8% filtered), reduced encoder size by 0.51M, and improved GPU inference speed by 1.26x. Raspberry Pi handled the models with RTF between 0.23-0.41.

Conclusion: The compressed Whisper ASR system is viable for Raspberry Pi, balancing performance and privacy, though small models introduce thermal overhead.

Abstract: Reliance on cloud providers for ASR inference to support child-centered voice-based applications is becoming challenging due to regulatory and privacy concerns. Motivated by a privacy-preserving design, this study aims to develop a lightweight and efficient Whisper ASR system capable of running on a Raspberry Pi. Evaluating on the MyST corpus and examining various filtering strategies to fine-tune the 'tiny.en' model, a Word Error Rate (WER) of 15.9% was achieved (11.8% filtered). A low-rank compression reduces the encoder size by 0.51M parameters, with 1.26x faster inference on GPU at an 11% relative WER increase. During inference on the Pi, the compressed version required ~2 GFLOPS fewer computations. The RTF for both models ranged between 0.23-0.41 for various input audio durations. Analysis of RAM usage and CPU temperature showed that the Pi was capable of handling both tiny models; however, it was noticed that small models initiated additional overhead/thermal throttling.
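
The low-rank step can be pictured as a truncated SVD of a dense layer's weight matrix, trading a small approximation error for fewer parameters and two cheaper matrix multiplies. A generic numpy sketch (the paper's exact compression recipe may differ; 384 is Whisper-tiny's model width):

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Replace a dense weight W (out x in) by two factors A @ B.
    Parameter count drops from out*in to rank*(out + in)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]        # (out, rank), singular values folded in
    B = Vt[:rank]                     # (rank, in)
    return A, B

W = np.random.default_rng(0).normal(size=(384, 384))
A, B = low_rank_factorize(W, rank=64)
print(W.size, A.size + B.size)                          # 147456 vs 49152 params
print(np.linalg.norm(W - A @ B) / np.linalg.norm(W))    # relative approx. error
```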

[722] Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion

Yu Zhang, Baotong Tian, Zhiyao Duan

Main category: eess.AS

TL;DR: Conan is a zero-shot online voice conversion model addressing real-time constraints, semantic fidelity, and unseen speaker adaptation.

DetailsMotivation: Current VC models struggle with real-time constraints, semantic fidelity, and adapting to unseen speakers.

Method: Conan uses a Stream Content Extractor, Adaptive Style Encoder, and Causal Shuffle Vocoder for low-latency, style-adaptive, and natural-sounding conversion.

Result: Conan outperforms baselines in subjective and objective metrics.

Conclusion: Conan effectively addresses challenges in zero-shot online voice conversion.

Abstract: Zero-shot online voice conversion (VC) holds significant promise for real-time communications and entertainment. However, current VC models struggle to preserve semantic fidelity under real-time constraints, deliver natural-sounding conversions, and adapt effectively to unseen speaker characteristics. To address these challenges, we introduce Conan, a chunkwise online zero-shot voice conversion model that preserves the content of the source while matching the voice timbre and styles of reference speech. Conan comprises three core components: 1) a Stream Content Extractor that leverages Emformer for low-latency streaming content encoding; 2) an Adaptive Style Encoder that extracts fine-grained stylistic features from reference speech for enhanced style adaptation; 3) a Causal Shuffle Vocoder that implements a fully causal HiFiGAN using a pixel-shuffle mechanism. Experimental evaluations demonstrate that Conan outperforms baseline models in subjective and objective metrics. Audio samples can be found at https://aaronz345.github.io/ConanDemo.

[723] Parameter-Efficient Fine-Tuning of Foundation Models for CLP Speech Classification

Susmita Bhattacharjee, Jagabandhu Mishra, H. S. Shekhawat, S. R. Mahadeva Prasanna

Main category: eess.AS

TL;DR: The paper proposes parameter-efficient fine-tuning (PEFT) of foundation models for detecting and classifying cleft lip and palate (CLP) severity, achieving significant improvements over baselines.

DetailsMotivation: CLP severity affects speech patterns, and foundation models fine-tuned on domain-specific data may better discriminate these variations.

Method: Experiments compare embeddings from self-supervised models (Wav2Vec2, WavLM, Whisper) with traditional features (eGeMAPS, ComParE), then fine-tune Whisper using PEFT techniques (LoRA, DoRA).

Result: The approach achieves relative F1 score improvements of 26.4%-63.4% on NMCPC and 6.1%-52.9% on AIISH datasets over baselines.

Conclusion: PEFT fine-tuning of foundation models significantly enhances CLP severity detection and classification.

Abstract: We propose the use of parameter-efficient fine-tuning (PEFT) of foundation models for cleft lip and palate (CLP) detection and severity classification. In CLP, nasalization increases with severity due to the abnormal passage between the oral and nasal tracts; this causes oral stops to be replaced by glottal stops and alters formant trajectories and vowel space. Since foundation models are trained for grapheme prediction or long-term quantized representation prediction, they may better discriminate CLP severity when fine-tuned on domain-specific data. We conduct experiments on two datasets: English (NMCPC) and Kannada (AIISH). We perform a comparative analysis using embeddings from self-supervised models Wav2Vec2 and WavLM, and the weakly supervised Whisper, each paired with SVM classifiers, and compare them with traditional handcrafted features eGeMAPS and ComParE. Finally, we fine-tune the best-performing Whisper model using PEFT techniques: Low-Rank Adapter (LoRA) and Decomposed Low-Rank Adapter (DoRA). Our results demonstrate that the proposed approach achieves relative improvements of 26.4% and 63.4% in macro-average F1 score over the best foundation model and handcrafted feature baselines on the NMCPC dataset, and improvements of 6.1% and 52.9% on the AIISH dataset, respectively.

[724] DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis

Yinghao Aaron Li, Xilin Jiang, Fei Tao, Cheng Niu, Kaifeng Xu, Juntong Song, Nima Mesgarani

Main category: eess.AS

TL;DR: DMOSpeech 2 extends metric optimization to duration prediction in TTS using reinforcement learning, improving performance and reducing sampling steps.

DetailsMotivation: Prior work optimized speech generation but not duration prediction, limiting perceptual metric optimization.

Method: Uses reinforcement learning (GRPO) for duration prediction and introduces teacher-guided sampling for diversity.

Result: Superior performance across metrics, reduced sampling steps by half, and maintained quality.

Conclusion: DMOSpeech 2 advances metric-optimized TTS by optimizing duration prediction and improving efficiency.

Abstract: Diffusion-based text-to-speech (TTS) systems have made remarkable progress in zero-shot speech synthesis, yet optimizing all components for perceptual metrics remains challenging. Prior work with DMOSpeech demonstrated direct metric optimization for speech generation components, but duration prediction remained unoptimized. This paper presents DMOSpeech 2, which extends metric optimization to the duration predictor through a reinforcement learning approach. The proposed system implements a novel duration policy framework using group relative preference optimization (GRPO) with speaker similarity and word error rate as reward signals. By optimizing this previously unoptimized component, DMOSpeech 2 creates a more complete metric-optimized synthesis pipeline. Additionally, this paper introduces teacher-guided sampling, a hybrid approach leveraging a teacher model for initial denoising steps before transitioning to the student model, significantly improving output diversity while maintaining efficiency. Comprehensive evaluations demonstrate superior performance across all metrics compared to previous systems, while reducing sampling steps by half without quality degradation. These advances represent a significant step toward speech synthesis systems with metric optimization across multiple components. The audio samples, code and pre-trained models are available at https://dmospeech2.github.io/.

[725] Mixture to Beamformed Mixture: Leveraging Beamformed Mixture as Weak-Supervision for Speech Enhancement and Noise-Robust ASR

Zhong-Qiu Wang, Ruizhe Pang

Main category: eess.AS

TL;DR: The paper proposes using beamformed mixtures as weak supervision to train DNNs for speech enhancement, improving generalization to real-recorded mixtures.

DetailsMotivation: Beamforming improves SNR and reduces distortion in speech enhancement, but training on simulated mixtures often mismatches real conditions. Using beamformed mixtures as supervision can bridge this gap.

Method: Train DNNs using pairs of real-recorded mixtures and their beamformed counterparts, leveraging the higher SNR of beamformed signals as weak supervision.

Result: Evaluation on the CHiME-4 dataset confirms the algorithm’s effectiveness in enhancing real-recorded mixtures.

Conclusion: The proposed method offers better generalization to real-world conditions by utilizing beamformed mixtures for training, outperforming models trained solely on simulated data.

Abstract: In multi-channel speech enhancement and robust automatic speech recognition (ASR), beamforming can typically improve the signal-to-noise ratio (SNR) of the target speaker and produce reliable enhancement with little distortion to target speech. With this observation, we propose to leverage beamformed mixture, which has a higher SNR of the target speaker than the input mixture, as a weak supervision to train deep neural networks (DNNs) to enhance the input mixture. This way, we can train enhancement models using pairs of real-recorded mixture and its beamformed mixture, and potentially realize better generalization to real mixtures, compared with only training the models on simulated mixtures, which usually mismatch real mixtures. Evaluation results on the real-recorded CHiME-4 dataset show the effectiveness of the proposed algorithm.

[726] Binaural Signal Matching with Wearable Arrays for Near-Field Sources

Sapir Goldring, Zamir Ben Hur, David Lou Alon, Boaz Rafaely

Main category: eess.AS

TL;DR: The paper evaluates the Binaural Signal Matching (BSM) algorithm for near-field sources, showing improved accuracy with near-field modeling.

DetailsMotivation: Existing BSM assumes far-field sources, but its performance for near-field scenarios, common in VR and teleconferencing, is unexplored.

Method: The study analyzes BSM using a semi-circular array around a rigid sphere, comparing far-field and near-field designs.

Result: Far-field BSM works for sources tens of centimeters away, but closer sources increase error. Near-field BSM reduces this error significantly.

Conclusion: Near-field modeling enhances BSM accuracy for very-close sources, benefiting immersive audio applications.

Abstract: Binaural reproduction methods aim to recreate an acoustic scene for a listener over headphones, offering immersive experiences in applications such as Virtual Reality (VR) and teleconferencing. Among the existing approaches, the Binaural Signal Matching (BSM) algorithm has demonstrated high-quality reproduction due to its signal-independent formulation and the flexibility of unconstrained array geometry. However, this method assumes far-field sources and has not yet been investigated for near-field scenarios. This study evaluates the performance of BSM for near-field sources. Analysis of a semi-circular array around a rigid sphere, modeling head-mounted devices, shows that far-field BSM performs adequately for sources up to approximately tens of centimeters from the array. However, for sources closer than this range, the binaural error increases significantly. Incorporating a near-field BSM design, which accounts for the source distance, significantly reduces the error, particularly at these very-close distances, highlighting the benefits of near-field modeling in improving reproduction accuracy.

[727] Sortformer: A Novel Approach for Permutation-Resolved Speaker Supervision in Speech-to-Text Systems

Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg

Main category: eess.AS

TL;DR: Sortformer introduces Sort Loss and a streamlined multi-speaker speech-to-text architecture to improve speaker diarization and transcription accuracy.

DetailsMotivation: To resolve the speaker permutation problem in speaker diarization and enhance multi-speaker transcription accuracy.

Method: Uses Sort Loss (independently or with PIL) and embeds speaker labels into the encoder via sinusoidal kernel functions.

Result: Sort Loss boosts diarization performance, and speaker supervision improves transcription accuracy.

Conclusion: Sortformer enables seamless speaker tagging integration into speech-to-text systems and LLMs, enhancing versatility.

Abstract: Sortformer is an encoder-based speaker diarization model designed for supervising speaker tagging in speech-to-text models. Instead of relying solely on permutation invariant loss (PIL), Sortformer introduces Sort Loss to resolve the permutation problem, either independently or in tandem with PIL. In addition, we propose a streamlined multi-speaker speech-to-text architecture that leverages Sortformer for speaker supervision, embedding speaker labels into the encoder using sinusoidal kernel functions. This design addresses the speaker permutation problem through sorted objectives, effectively bridging timestamps and tokens to supervise speaker labels in the output transcriptions. Experiments demonstrate that Sort Loss can boost speaker diarization performance, and incorporating the speaker supervision from Sortformer improves multi-speaker transcription accuracy. We anticipate that the proposed Sortformer and multi-speaker architecture will enable the seamless integration of speaker tagging capabilities into foundational speech-to-text systems and multimodal large language models (LLMs), offering an easily adoptable and user-friendly mechanism to enhance their versatility and performance in speaker-aware tasks. The code and trained models are made publicly available through the NVIDIA NeMo Framework.

[728] ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting

Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Tao Jin, Zhou Zhao

Main category: eess.AS

TL;DR: The paper introduces ISDrama, a model for generating immersive spatial drama from multimodal prompts, addressing challenges in spatial and prosody modeling. It includes a novel dataset (MRSDrama) and outperforms baselines.

DetailsMotivation: To tackle the high-cost data collection and lack of existing solutions for generating continuous multi-speaker binaural speech with dramatic prosody in AR/VR applications.

Method: Proposes ISDrama with a Multimodal Pose Encoder (contrastive learning) and Immersive Drama Transformer (flow-based mamba-transformer with Drama-MOE). Uses context-consistent classifier-free guidance.

Result: ISDrama outperforms baselines on objective and subjective metrics.

Conclusion: The work pioneers multimodal immersive spatial drama generation, offering a dataset, model, and evaluation tools for future research.

Abstract: Multimodal immersive spatial drama generation focuses on creating continuous multi-speaker binaural speech with dramatic prosody based on multimodal prompts, with potential applications in AR, VR, and others. This task requires simultaneous modeling of spatial information and dramatic prosody based on multimodal inputs, with high data collection costs. To the best of our knowledge, our work is the first attempt to address these challenges. We construct MRSDrama, the first multimodal recorded spatial drama dataset, containing binaural drama audios, scripts, videos, geometric poses, and textual prompts. Then, we propose ISDrama, the first immersive spatial drama generation model through multimodal prompting. ISDrama comprises two primary components: (1) Multimodal Pose Encoder, based on contrastive learning, which accounts for the Doppler effect caused by moving speakers to extract unified pose information from multimodal prompts; (2) Immersive Drama Transformer, a flow-based mamba-transformer model that generates high-quality drama, incorporating Drama-MOE to select proper experts for enhanced prosody and pose control. We also design a context-consistent classifier-free guidance strategy to coherently generate complete drama. Experimental results show that ISDrama outperforms baseline models on objective and subjective metrics. The demos are available at https://aaronz345.github.io/ISDramaDemo. We provide the dataset and the evaluation code at https://huggingface.co/datasets/AaronZ345/MRSDrama and https://github.com/AaronZ345/ISDrama.

[729] Unifying Listener Scoring Scales: Comparison Learning Framework for Speech Quality Assessment and Continuous Speech Emotion Recognition

Cheng-Hung Hu, Yusuke Yasuda, Akifumi Yoshimoto, Tomoki Toda

Main category: eess.AS

TL;DR: The paper introduces a method to improve Speech Quality Assessment (SQA) and Continuous Speech Emotion Recognition (CSER) by modeling a unified listener scoring scale, avoiding biases from mean listener approaches.

DetailsMotivation: Listener ratings in SQA and CSER are biased due to individual factors. Mean listener approaches distort ordinal data, and learning multiple scales while inferring from the mean limits effectiveness.

Method: The proposed method models a unified listener scoring scale using comparison scores to capture scoring relationships between utterances.

Result: Experiments show the method improves prediction performance in SQA and CSER, proving its effectiveness and robustness.

Conclusion: The unified listener scoring scale approach outperforms mean listener methods, reducing biases and enhancing performance in speech technology tasks.

Abstract: Speech Quality Assessment (SQA) and Continuous Speech Emotion Recognition (CSER) are two key tasks in speech technology, both relying on listener ratings. However, these ratings are inherently biased due to individual listener factors. Previous approaches have introduced a mean listener scoring scale and modeled all listener scoring scales in the training set. However, the mean listener approach is prone to distortion from averaging ordinal data, leading to potential biases. Moreover, learning multiple listener scoring scales while inferring based only on the mean listener scale limits effectiveness. In contrast, our method focuses on modeling a unified listener scoring scale, using comparison scores to correctly capture the scoring relationships between utterances. Experimental results show that our method effectively improves prediction performance in both SQA and CSER tasks, proving its effectiveness and robustness.
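
A classical way to realize "a unified scale from comparison scores" is Bradley-Terry fitting: each utterance gets one latent score, estimated from pairwise outcomes. The sketch below is a generic stand-in for that idea, not the paper's model.

```python
import numpy as np

def fit_scores(n_items, comparisons, lr=0.1, steps=2000):
    """Fit one latent quality score per utterance from pairwise comparisons.
    comparisons: list of (winner_idx, loser_idx)."""
    s = np.zeros(n_items)
    for _ in range(steps):
        g = np.zeros(n_items)
        for w, l in comparisons:
            p = 1.0 / (1.0 + np.exp(s[l] - s[w]))  # P(w beats l) under current scores
            g[w] += 1.0 - p                        # gradient of the log-likelihood
            g[l] -= 1.0 - p
        s += lr * g / len(comparisons)
    return s - s.mean()   # scores are identifiable only up to a constant shift

# Utterance 2 consistently wins, 0 consistently loses:
print(fit_scores(3, [(2, 1), (1, 0), (2, 0), (2, 1)]))  # s[2] > s[1] > s[0]
```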

eess.IV

[730] MiDeSeC: A Dataset for Mitosis Detection and Segmentation in Breast Cancer Histopathology Images

Refik Samet, Nooshin Nemati, Emrah Hancer, Serpil Sak, Bilge Ayca Kirmizi, Zeynep Yildirim

Main category: eess.IV

TL;DR: The MiDeSeC dataset is a collection of H&E stained breast carcinoma slides from 25 patients, featuring 50 regions with over 500 mitoses, split into training and testing sets.

DetailsMotivation: To address the variability in mitosis shapes by creating a comprehensive dataset for accurate detection and analysis.

Method: Slides were scanned using 3D Histech Panoramic p250 Flash-3 scanner and Olympus BX50 microscope, with 50 regions (1024x1024 pixels) selected per patient.

Result: The dataset includes over 500 mitoses, with two-thirds of regions for training and one-third for testing.

Conclusion: The MiDeSeC dataset provides a robust resource for mitosis detection in breast carcinoma research.

Abstract: The MiDeSeC dataset is created from H&E-stained invasive breast carcinoma, no special type (NST), slides of 25 different patients, captured at 40x magnification at the Department of Medical Pathology at Ankara University. The slides have been scanned by a 3D Histech Panoramic p250 Flash-3 scanner and an Olympus BX50 microscope. As several possible mitosis shapes exist, it is crucial to have a large dataset to cover all the cases. Accordingly, a total of 50 regions are selected from the glass slides of the 25 patients, each region with a size of 1024*1024 pixels. There are more than 500 mitoses in total in these 50 regions. Two-thirds of the regions are reserved for training, the other third for testing.

[731] NuSeC: A Dataset for Nuclei Segmentation in Breast Cancer Histopathology Images

Refik Samet, Nooshin Nemati, Emrah Hancer, Serpil Sak, Bilge Ayca Kirmizi

Main category: eess.IV

TL;DR: The NuSeC dataset consists of 100 images from 25 patients, split into 75% training (75 images) and 25% testing (25 images) sets for consistent comparative analysis.

DetailsMotivation: To enable consistent comparative analysis of future methods developed using the NuSeC dataset.

Method: Images (1024x1024 pixels) were selected from patient slides, and the dataset was split randomly into training (75 images) and testing (25 images) sets.

Result: The training set contains ~30,000 nuclei structures, and the testing set contains ~6,000 nuclei structures.

Conclusion: The NuSeC dataset is structured to facilitate reliable evaluation of future research methods.

Abstract: The NuSeC dataset is created by selecting 4 images with a size of 1024*1024 pixels from the slides of each patient among 25 patients. Therefore, there are a total of 100 images in the NuSeC dataset. To enable a consistent comparative analysis between the methods that researchers will develop on the NuSeC dataset in the future, we divide it into a training set (75%) and a testing set (25%). In detail, one image is randomly selected from the 4 images of each of the 25 patients to build the testing set, and the remaining images are reserved for the training set. While the training set includes 75 images with around 30000 nuclei structures, the testing set includes 25 images with around 6000 nuclei structures.
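
The split procedure is simple enough to restate in code; a sketch under our naming assumptions (file and function names are illustrative, not the dataset's actual files):

```python
import random

def patient_wise_split(images_by_patient, seed=0):
    """One random image per patient goes to the test set; the rest train."""
    rng = random.Random(seed)
    train, test = [], []
    for patient, images in sorted(images_by_patient.items()):
        held_out = rng.choice(images)
        test.append(held_out)
        train.extend(img for img in images if img != held_out)
    return train, test

data = {f"patient_{i:02d}": [f"p{i:02d}_img{j}.png" for j in range(4)]
        for i in range(25)}
train, test = patient_wise_split(data)
print(len(train), len(test))  # 75 25
```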

[732] Self-Supervised Joint Reconstruction and Denoising of T2-Weighted PROPELLER MRI of the Lungs at 0.55T

Jingjia Chen, Haoyang Pei, Christoph Maier, Mary Bruno, Qiuting Wen, Seon-Hi Shin, William Moore, Hersh Chandarana, Li Feng

Main category: eess.IV

TL;DR: A self-supervised model improves 0.55T T2-weighted lung MRI by jointly reconstructing and denoising, outperforming traditional methods.

DetailsMotivation: To enhance lung MRI clarity and reduce scan time without needing clean targets, leveraging intrinsic k-space redundancies.

Method: Self-supervised learning splits PROPELLER blades into partitions for training and loss calculation, compared to MPPCA denoising.

Result: Improved image clarity and alignment with CT scans, reduced scan time, and outperformed MPPCA (p<0.001).

Conclusion: The model effectively reconstructs and denoises lung MRI using self-supervised learning, proving superior to conventional methods.

Abstract: Purpose: This study aims to improve 0.55T T2-weighted PROPELLER lung MRI through a self-supervised joint reconstruction and denoising model. Methods: A T2-weighted 0.55T lung MRI dataset including 44 patients with previous COVID-19 infection was used. A self-supervised learning framework was developed, where each blade of the PROPELLER acquisition was split along the readout direction into two partitions. One subset trains the unrolled reconstruction network, while the other subset is used for loss calculation, enabling self-supervised training without clean targets and leveraging matched noise statistics for denoising. For comparison, Marchenko-Pastur Principal Component Analysis (MPPCA) was performed along the coil dimension, followed by conventional parallel imaging reconstruction. The quality of the reconstructed lung MRI was assessed visually by two experienced radiologists independently. Results: The proposed self-supervised model improved the clarity and structural integrity of the lung images. For cases with available CT scans, the reconstructed images demonstrated strong alignment with corresponding CT images. Additionally, the proposed model enables further scan time reduction by requiring only half the number of blades. Reader evaluations confirmed that the proposed method outperformed MPPCA-denoised images across all categories (Wilcoxon signed-rank test, p<0.001), with moderate inter-reader agreement (weighted Cohen’s kappa=0.55; percentage of exact and within +/-1 point agreement=91%). Conclusion: By leveraging intrinsic structural redundancies between two disjoint splits of k-space subsets, the proposed self-supervised learning model effectively reconstructs the image while suppressing the noise for 0.55T T2-weighted lung MRI with PROPELLER sampling.

[733] Classification of Histopathology Slides with Persistence Homology Convolutions

Shrunal Pothagoni, Benjamin Schweinhart

Main category: eess.IV

TL;DR: A novel method, Persistent Homology Convolutions, improves CNN performance in histopathology by capturing local topological features, outperforming conventional models.

DetailsMotivation: Typical CNNs lose topological information, crucial in domains like histopathology where tissue shape distinguishes diseases. Global topological summaries lack locality details.

Method: Introduces Persistent Homology Convolutions, a modified convolution operator, to generate local persistent homology-based data, preserving locality and translation invariance.

Result: Models with persistent homology convolutions outperform conventional ones, showing less hyperparameter sensitivity and better geometric information extraction.

Conclusion: Persistent Homology Convolutions effectively capture meaningful topological features, enhancing diagnostics in histopathology.

Abstract: Convolutional neural networks (CNNs) are a standard tool for computer vision tasks such as image classification. However, typical model architectures may result in the loss of topological information. In specific domains such as histopathology, topology is an important descriptor that can be used to distinguish between disease-indicating tissue by analyzing the shape characteristics of cells. Current literature suggests that reintroducing topological information using persistent homology can improve medical diagnostics; however, previous methods utilize global topological summaries which do not contain information about the locality of topological features. To address this gap, we present a novel method that generates local persistent homology-based data using a modified version of the convolution operator called Persistent Homology Convolutions. This method captures information about the locality and translation invariance of topological features. We perform a comparative study using various representations of histopathology slides and find that models trained with persistent homology convolutions outperform conventionally trained models and are less sensitive to hyperparameters. These results indicate that persistent homology convolutions extract meaningful geometric information from the histopathology slides.

[734] Spatiotemporal Maps for Dynamic MRI Reconstruction

Rodrigo A. Lobos, Xiaokai Wang, Rex T. L. Fung, Yongli He, David Frey, Dinank Gupta, Zhongming Liu, Jeffrey A. Fessler, Douglas C. Noll

Main category: eess.IV

TL;DR: The paper introduces spatiotemporal maps (STMs) to address limitations of the partially separable functions (PSF) model in dynamic MRI reconstruction, offering improved representation for varying voxel characteristics.

DetailsMotivation: The PSF model's reduced effectiveness in scenarios with spatially varying temporal/spectral characteristics motivates the need for a more adaptable model.

Method: The STM model decomposes MRI signals into spatial and spatially-dependent temporal components, leveraging autoregressive properties and advanced signal processing techniques.

Result: STM-based reconstruction is demonstrated on 2D single-channel and 3D multichannel MRI data, proving its feasibility.

Conclusion: STMs extend the PSF model, providing a versatile and efficient framework for dynamic MRI reconstruction.

Abstract: The partially separable functions (PSF) model is commonly adopted in dynamic MRI reconstruction, as is the underlying signal model in many reconstruction methods including the ones relying on low-rank assumptions. Even though the PSF model offers a parsimonious representation of the dynamic MRI signal in several applications, its representation capabilities tend to decrease in scenarios where voxels present different temporal/spectral characteristics at different spatial locations. In this work we account for this limitation by proposing a new model, called spatiotemporal maps (STMs), that leverages autoregressive properties of (k, t)-space. The STM model decomposes the spatiotemporal MRI signal into a sum of components, each one consisting of a product between a spatial function and a temporal function that depends on the spatial location. The proposed model can be interpreted as an extension of the PSF model whose temporal functions are independent of the spatial location. We show that spatiotemporal maps can be efficiently computed from autocalibration data by using advanced signal processing and randomized linear algebra techniques, enabling STMs to be used as part of many reconstruction frameworks for accelerated dynamic MRI. As proof-of-concept illustrations, we show that STMs can be used to reconstruct both 2D single-channel animal gastrointestinal MRI data and 3D multichannel human functional MRI data.

[735] QUTCC: Quantile Uncertainty Training and Conformal Calibration for Imaging Inverse Problems

Cassandra Tong Ye, Shamus Li, Tyler King, Kristina Monakhova

Main category: eess.IV

TL;DR: QUTCC is a quantile uncertainty training and calibration technique for deep learning models, improving reliability in medical imaging tasks by providing tighter, nonlinear uncertainty bounds.

DetailsMotivation: Deep learning models hallucinate in critical tasks like MRI denoising, where accuracy is vital. Existing uncertainty methods use linear scaling, leading to less informative bounds.

Method: QUTCC uses a U-Net with quantile embedding to predict full conditional quantile distributions. It iteratively refines bounds during calibration for tighter intervals.

Result: QUTCC outperforms prior methods, pinpointing hallucinations and achieving tighter uncertainty intervals while maintaining statistical coverage.

Conclusion: QUTCC enhances reliability in medical imaging by providing more precise uncertainty quantification, addressing the limitations of linear scaling methods.

Abstract: Deep learning models often hallucinate, producing realistic artifacts that are not truly present in the sample. This can have dire consequences for scientific and medical inverse problems, such as MRI and microscopy denoising, where accuracy is more important than perceptual quality. Uncertainty quantification techniques, such as conformal prediction, can pinpoint outliers and provide guarantees for image regression tasks, improving reliability. However, existing methods utilize a linear constant scaling factor to calibrate uncertainty bounds, resulting in larger, less informative bounds. We propose QUTCC, a quantile uncertainty training and calibration technique that enables nonlinear, non-uniform scaling of quantile predictions to enable tighter uncertainty estimates. Using a U-Net architecture with a quantile embedding, QUTCC enables the prediction of the full conditional distribution of quantiles for the imaging task. During calibration, QUTCC generates uncertainty bounds by iteratively querying the network for upper and lower quantiles, progressively refining the bounds to obtain a tighter interval that captures the desired coverage. We evaluate our method on several denoising tasks as well as compressive MRI reconstruction. Our method successfully pinpoints hallucinations in image estimates and consistently achieves tighter uncertainty intervals than prior methods while maintaining the same statistical coverage.
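
Two ingredients can be sketched generically: the pinball loss that trains a quantile predictor, and a calibration loop that widens the queried quantile pair until held-out coverage is met. This is a simplified stand-in for QUTCC's procedure; `model_q(x, q)`, returning the predicted q-th quantile, is a hypothetical signature.

```python
import numpy as np

def pinball_loss(pred, target, q):
    """Quantile (pinball) loss; minimizing it makes pred the q-th quantile."""
    err = target - pred
    return np.mean(np.maximum(q * err, (q - 1.0) * err))

def calibrate(model_q, x_cal, y_cal, coverage=0.9, step=0.005):
    """Widen the (lo, hi) quantile pair until empirical coverage is reached."""
    lo, hi = (1 - coverage) / 2, 1 - (1 - coverage) / 2
    while lo > 0.0:
        inside = np.mean((y_cal >= model_q(x_cal, lo)) & (y_cal <= model_q(x_cal, hi)))
        if inside >= coverage:
            return lo, hi
        lo, hi = lo - step, hi + step
    return 0.0, 1.0

rng = np.random.default_rng(0)
y = rng.normal(size=5000)
model_q = lambda x, q: np.quantile(y, q)   # oracle quantiles as a stand-in model
print(calibrate(model_q, None, y))          # roughly (0.05, 0.95) for a good model
```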

[736] PET Image Reconstruction Using Deep Diffusion Image Prior

Fumio Hashimoto, Kuang Gong

Main category: eess.IV

TL;DR: A diffusion model-based method for PET image reconstruction, guided by anatomical priors, improves generalization across tracers and reduces computational demands.

DetailsMotivation: Addressing tracer-specific contrast variability and high computational demands in PET imaging using diffusion models.

Method: Combines diffusion sampling and model fine-tuning guided by PET sinogram, with HQS for efficiency.

Result: Robust generalization across tracer distributions and scanner types, validated on simulation and clinical datasets.

Conclusion: Efficient and versatile framework for low-dose PET imaging.

Abstract: Diffusion models have shown great promise in medical image denoising and reconstruction, but their application to Positron Emission Tomography (PET) imaging remains limited by tracer-specific contrast variability and high computational demands. In this work, we proposed an anatomical prior-guided PET image reconstruction method based on diffusion models, inspired by the deep diffusion image prior (DDIP) framework. The proposed method alternated between diffusion sampling and model fine-tuning guided by the PET sinogram, enabling the reconstruction of high-quality images from various PET tracers using a score function pretrained on a dataset of another tracer. To improve computational efficiency, the half-quadratic splitting (HQS) algorithm was adopted to decouple network optimization from iterative PET reconstruction. The proposed method was evaluated using one simulation and two clinical datasets. For the simulation study, a model pretrained on [$^{18}$F]FDG data was tested on amyloid-negative PET data to assess out-of-distribution (OOD) performance. For the clinical-data validation, ten low-dose [$^{18}$F]FDG datasets and one [$^{18}$F]Florbetapir dataset were tested on a model pretrained on data from another tracer. Experimental results show that the proposed PET reconstruction method can generalize robustly across tracer distributions and scanner types, providing an efficient and versatile reconstruction framework for low-dose PET imaging.
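
For context, half-quadratic splitting handles an objective of the form data fidelity plus prior by introducing an auxiliary variable; written with $f$ as the PET sinogram fidelity and $R$ as the diffusion prior (the textbook form; the paper's exact variable splitting is an assumption):

$$\min_{x,z}\; f(x) + R(z) + \frac{\mu}{2}\,\lVert x - z \rVert_2^2$$

which is minimized by alternating

$$x^{k+1} = \arg\min_x\; f(x) + \frac{\mu}{2}\,\lVert x - z^k \rVert_2^2, \qquad z^{k+1} = \arg\min_z\; R(z) + \frac{\mu}{2}\,\lVert x^{k+1} - z \rVert_2^2.$$

The $x$-update is a standard regularized PET reconstruction subproblem, while the $z$-update is delegated to the diffusion model, which is what decouples network optimization from the iterative reconstruction.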

[737] Performance Analysis of Post-Training Quantization for CNN-based Conjunctival Pallor Anemia Detection

Sebastian A. Cruz Romero, Wilfredo E. Lugo Beauchamp

Main category: eess.IV

TL;DR: A deep learning model using MobileNet architecture achieves high accuracy in detecting anemia through conjunctival pallor, with potential for mobile deployment via quantization.

DetailsMotivation: Traditional anemia detection methods are costly and require expertise, limiting accessibility in low-resource settings.

Method: The study uses the CP-AnemiC dataset and MobileNet, fine-tuned with data augmentation and cross-validation, followed by post-training quantization for edge deployment.

Result: The model achieved 0.9313 accuracy, 0.9374 precision, and 0.9773 F1 score, with FP16 quantization maintaining strong performance.

Conclusion: Quantization and hardware optimizations should be explored to balance model size, speed, and accuracy for mobile healthcare.

Abstract: Anemia is a widespread global health issue, particularly among young children in low-resource settings. Traditional methods for anemia detection often require expensive equipment and expert knowledge, creating barriers to early and accurate diagnosis. To address these challenges, we explore the use of deep learning models for detecting anemia through conjunctival pallor, focusing on the CP-AnemiC dataset, which includes 710 images from children aged 6-59 months. The dataset is annotated with hemoglobin levels, gender, age, and other demographic data, enabling the development of machine learning models for accurate anemia detection. We use the MobileNet architecture as a backbone, known for its efficiency in mobile and embedded vision applications, and fine-tune our model end-to-end using data augmentation techniques and a cross-validation strategy. Our model implementation achieved an accuracy of 0.9313, a precision of 0.9374, and an F1 score of 0.9773, demonstrating strong performance on the dataset. To optimize the model for deployment on edge devices, we performed post-training quantization, evaluating the impact of different bit-widths (FP32, FP16, INT8, and INT4) on model performance. Preliminary results suggest that while FP16 quantization maintains high accuracy (0.9250), precision (0.9370), and F1 score (0.9377), more aggressive quantization (INT8 and INT4) leads to significant performance degradation. Overall, our study supports further exploration of quantization schemes and hardware optimizations to assess trade-offs between model size, inference time, and diagnostic accuracy in mobile healthcare applications.
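
As a rough illustration of the FP16 variant of post-training quantization (the paper's actual export toolchain is not stated in the abstract, so this PyTorch cast is an assumption):

```python
import torch
import torchvision.models as models

# Minimal FP16 post-training quantization sketch: cast a trained
# MobileNet's weights to half precision and run inference in FP16.
# FP16 inference is typically executed on GPU; INT8/INT4 would instead
# require a dedicated quantization toolchain.
model = models.mobilenet_v2(weights=None)   # stand-in for the trained model
model.eval()
model_fp16 = model.half()                   # weights: FP32 -> FP16

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224).half()  # input dtype must match
    logits = model_fp16(x)
    probs = torch.softmax(logits.float(), dim=1)  # softmax back in FP32
```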

[738] A Study of Anatomical Priors for Deep Learning-Based Segmentation of Pheochromocytoma in Abdominal CT

Tanjin Taher Toma, Tejas Sudharshan Mathai, Bikash Santra, Pritam Mukherjee, Jianfei Liu, Wesley Jong, Darwish Alabyad, Vivek Batheja, Abhishek Jha, Mayank Patel, Darko Pucar, Jayadira del Rivero, Karel Pacak, Ronald M. Summers

Main category: eess.IV

TL;DR: The study evaluates anatomical priors for improving deep learning-based PCC segmentation in CT scans, finding the Tumor + Kidney + Aorta (TKA) strategy most effective.

DetailsMotivation: Accurate PCC segmentation aids tumor burden estimation, prognosis, treatment planning, and genetic cluster inference, reducing reliance on costly tests.

Method: The nnU-Net framework tested 11 annotation strategies, including novel multi-class schemes based on organ-specific anatomical priors, on 105 CT scans. Performance was measured using DSC, NSD, and F1 scores.

Result: TKA annotation outperformed others, showing higher accuracy (DSC, NSD, F1) and better tumor burden quantification (R^2 = 0.968). It was robust across genetic subtypes and cross-validation.

Conclusion: Incorporating relevant anatomical context (e.g., TKA) enhances PCC segmentation precision, supporting clinical applications.

Abstract: Accurate segmentation of pheochromocytoma (PCC) in abdominal CT scans is essential for tumor burden estimation, prognosis, and treatment planning. It may also help infer genetic clusters, reducing reliance on expensive testing. This study systematically evaluates anatomical priors to identify configurations that improve deep learning-based PCC segmentation. We employed the nnU-Net framework to evaluate eleven annotation strategies for accurate 3D segmentation of pheochromocytoma, introducing a set of novel multi-class schemes based on organ-specific anatomical priors. These priors were derived from adjacent organs commonly surrounding adrenal tumors (e.g., liver, spleen, kidney, aorta, adrenal gland, and pancreas), and were compared against a broad body-region prior used in previous work. The framework was trained and tested on 105 contrast-enhanced CT scans from 91 patients at the NIH Clinical Center. Performance was measured using Dice Similarity Coefficient (DSC), Normalized Surface Distance (NSD), and instance-wise F1 score. Among all strategies, the Tumor + Kidney + Aorta (TKA) annotation achieved the highest segmentation accuracy, significantly outperforming the previously used Tumor + Body (TB) annotation across DSC (p = 0.0097), NSD (p = 0.0110), and F1 score (25.84% improvement at an IoU threshold of 0.5), measured on a 70-30 train-test split. The TKA model also showed superior tumor burden quantification (R^2 = 0.968) and strong segmentation across all genetic subtypes. In five-fold cross-validation, TKA consistently outperformed TB across IoU thresholds (0.1 to 0.5), reinforcing its robustness and generalizability. These findings highlight the value of incorporating relevant anatomical context in deep learning models to achieve precise PCC segmentation, supporting clinical assessment and longitudinal monitoring.
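
A minimal sketch of how such a multi-class prior annotation could be assembled from binary organ masks (the class indices here are illustrative, not the paper's):

```python
import numpy as np

# Compose a Tumor + Kidney + Aorta (TKA) style multi-class label map
# from binary masks. The tumor is written last so its voxels are never
# overwritten by the organ priors.
def build_tka_labels(tumor, kidney, aorta):
    labels = np.zeros_like(tumor, dtype=np.uint8)  # 0 = background
    labels[kidney > 0] = 2
    labels[aorta > 0] = 3
    labels[tumor > 0] = 1
    return labels
```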

[739] Personalized 3D Myocardial Infarct Geometry Reconstruction from Cine MRI with Explicit Cardiac Motion Modeling

Yilin Lyu, Fan Yang, Xiaoyue Liu, Zichen Jiang, Joshua Dillon, Debbie Zhao, Martyn Nash, Charlene Mauger, Alistair Young, Ching-Hui Sia, Mark YY Chan, Lei Li

Main category: eess.IV

TL;DR: Proposes a contrast-free method for 3D myocardial infarct reconstruction using cine MRI, leveraging motion patterns for accuracy.

DetailsMotivation: LGE MRI, the current standard for infarct detection, requires contrast agents and has limited spatial resolution. A contrast-free, high-fidelity alternative is needed.

Method: Uses a deep shape fitting model (biv-me) for 4D biventricular mesh reconstruction from cine MRI, followed by CMotion2Infarct-Net to localize infarct regions using motion patterns.

Result: Evaluated on 205 cine MRI scans, the method agrees reasonably with manual delineation.

Conclusion: Demonstrates feasibility of contrast-free, motion-driven 3D infarct reconstruction, enabling efficient digital twin modeling for MI patients.

Abstract: Accurate representation of myocardial infarct geometry is crucial for patient-specific cardiac modeling in MI patients. While late gadolinium enhancement (LGE) MRI is the clinical gold standard for infarct detection, it requires contrast agents, introducing side effects and patient discomfort. Moreover, infarct reconstruction from LGE often relies on sparsely sampled 2D slices, limiting spatial resolution and accuracy. In this work, we propose a novel framework for automatically reconstructing high-fidelity 3D myocardial infarct geometry from 2D clinically standard cine MRI, eliminating the need for contrast agents. Specifically, we first reconstruct the 4D biventricular mesh from multi-view cine MRIs via an automatic deep shape fitting model, biv-me. Then, we design an infarct reconstruction model, CMotion2Infarct-Net, to explicitly utilize the motion patterns within this dynamic geometry to localize infarct regions. Evaluated on 205 cine MRI scans from 126 MI patients, our method shows reasonable agreement with manual delineation. This study demonstrates the feasibility of contrast-free, cardiac motion-driven 3D infarct reconstruction, paving the way for efficient digital twins of MI patients.

[740] Once-for-All: Controllable Generative Image Compression with Dynamic Granularity Adaptation

Anqi Li, Feng Li, Yuxi Liu, Runmin Cong, Yao Zhao, Huihui Bai

Main category: eess.IV

TL;DR: The paper introduces Control-GIC, a framework for flexible and high-fidelity generative image compression with fine-grained bitrate adaptation.

DetailsMotivation: Addressing the challenge of flexible rate adaptation in generative image compression methods to meet diverse compression needs.

Method: Uses a VQGAN framework with variable-length codes (VQ-indices) and correlates patch information density with granular representations for dynamic bitrate adjustment. Includes a probabilistic conditional decoder for realistic reconstruction.

Result: Control-GIC achieves superior performance in flexible bitrate adaptation compared to state-of-the-art methods.

Conclusion: Control-GIC effectively addresses the rate adaptation challenge in generative image compression, offering high fidelity and generality.

Abstract: Although recent generative image compression methods have demonstrated impressive potential in optimizing the rate-distortion-perception trade-off, they still face the critical challenge of flexible rate adaptation to diverse compression necessities and scenarios. To overcome this challenge, this paper proposes a Controllable Generative Image Compression framework, termed Control-GIC, the first capable of fine-grained bitrate adaptation across a broad spectrum while ensuring high-fidelity, general-purpose compression. Control-GIC is grounded in a VQGAN framework that encodes an image as a sequence of variable-length codes (i.e. VQ-indices), which can be losslessly compressed and exhibits a direct positive correlation with the bitrate. Drawing inspiration from classical coding principles, we correlate the information density of local image patches with their granular representations. Hence, we can flexibly determine a proper allocation of granularity for the patches to achieve dynamic adjustment of VQ-indices, resulting in desirable compression rates. We further develop a probabilistic conditional decoder capable of retrieving historical encoded multi-granularity representations according to transmitted codes, and then reconstruct hierarchical granular features in the form of conditional probabilities, enabling more informative aggregation to improve reconstruction realism. Our experiments show that Control-GIC allows highly flexible and controllable bitrate adaptation, where the results demonstrate its superior performance over recent state-of-the-art methods. Code is available at https://github.com/lianqi1008/Control-GIC.
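
As a sketch of the idea of tying granularity to local information density (the density measure and thresholds below are illustrative assumptions, not Control-GIC's actual criterion):

```python
import numpy as np

def patch_entropy(patch, bins=32):
    """Shannon entropy of intensities in [0, 1] as a crude
    information-density proxy for one image patch."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def assign_granularity(image, patch=16):
    """Map each patch to a granularity level 0 (coarse) .. 2 (fine)."""
    h, w = image.shape
    grid = np.zeros((h // patch, w // patch), dtype=np.int64)
    for i in range(h // patch):
        for j in range(w // patch):
            e = patch_entropy(image[i*patch:(i+1)*patch, j*patch:(j+1)*patch])
            grid[i, j] = 0 if e < 2.0 else (1 if e < 4.0 else 2)
    return grid
```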

[741] Personalized 4D Whole Heart Geometry Reconstruction from Cine MRI for Cardiac Digital Twins

Xiaoyue Liu, Xicheng Sheng, Xiahai Zhuang, Vicente Grau, Mark YY Chan, Ching-Hui Sia, Lei Li

Main category: eess.IV

TL;DR: A weakly supervised learning model reconstructs 4D heart meshes from multi-view 2D cardiac cine MRIs, enabling personalized cardiac digital twins for precision medicine.

DetailsMotivation: Whole-heart cardiac digital twins (CDTs) simulating full organ-scale electromechanics are limited, necessitating a method to generate personalized 4D heart models from cardiac MRIs.

Method: A weakly supervised learning model maps multi-view 2D cardiac cine MRIs to 4D heart meshes, allowing automatic extraction of key cardiac variables.

Result: The model successfully generates personalized 4D heart meshes, facilitating high-temporal-resolution extraction of cardiac variables like ejection fraction and chamber volume changes.

Conclusion: The study demonstrates the feasibility of inferring 4D heart models from MRIs, advancing efficient CDT platforms for precision medicine.

Abstract: Cardiac digital twins (CDTs) provide personalized in-silico cardiac representations and hold great potential for precision medicine in cardiology. However, whole-heart CDT models that simulate the full organ-scale electromechanics of all four heart chambers remain limited. In this work, we propose a weakly supervised learning model to reconstruct 4D (3D+t) heart mesh directly from multi-view 2D cardiac cine MRIs. This is achieved by learning a self-supervised mapping between cine MRIs and 4D cardiac meshes, enabling the generation of personalized heart models that closely correspond to input cine MRIs. The resulting 4D heart meshes can facilitate the automatic extraction of key cardiac variables, including ejection fraction and dynamic chamber volume changes with high temporal resolution. It demonstrates the feasibility of inferring personalized 4D heart models from cardiac MRIs, paving the way for an efficient CDT platform for precision medicine. The code will be publicly released once the manuscript is accepted.

[742] EndoControlMag: Robust Endoscopic Vascular Motion Magnification with Periodic Reference Resetting and Hierarchical Tissue-aware Dual-Mask Control

An Wang, Rulin Zhou, Mengya Xu, Yiru Ye, Longfei Gou, Yiting Chang, Hao Chen, Chwee Ming Lim, Jiankun Wang, Hongliang Ren

Main category: eess.IV

TL;DR: EndoControlMag is a training-free, Lagrangian-based framework for magnifying subtle vascular motions in endoscopic surgery, featuring a PRR scheme and HTM framework for robust performance.

DetailsMotivation: Visualizing subtle vascular motions in endoscopic surgery is challenging due to dynamic scenes, requiring precise and robust solutions.

Method: The framework includes Periodic Reference Resetting (PRR), which prevents error accumulation while maintaining temporal coherence, and Hierarchical Tissue-aware Magnification (HTM) with dual-mode mask dilation for adaptive magnification.

Result: EndoControlMag outperforms existing methods in accuracy and visual quality, validated on the EndoVMM24 dataset across diverse surgical scenarios.

Conclusion: EndoControlMag provides a robust and accurate solution for vascular motion magnification in endoscopic surgery, with potential for clinical impact.

Abstract: Visualizing subtle vascular motions in endoscopic surgery is crucial for surgical precision and decision-making, yet remains challenging due to the complex and dynamic nature of surgical scenes. To address this, we introduce EndoControlMag, a training-free, Lagrangian-based framework with mask-conditioned vascular motion magnification tailored to endoscopic environments. Our approach features two key modules: a Periodic Reference Resetting (PRR) scheme that divides videos into short overlapping clips with dynamically updated reference frames to prevent error accumulation while maintaining temporal coherence, and a Hierarchical Tissue-aware Magnification (HTM) framework with dual-mode mask dilation. HTM first tracks vessel cores using a pretrained visual tracking model to maintain accurate localization despite occlusions and view changes. It then applies one of two adaptive softening strategies to surrounding tissues: motion-based softening that modulates magnification strength proportional to observed tissue displacement, or distance-based exponential decay that simulates biomechanical force attenuation. This dual-mode approach accommodates diverse surgical scenarios-motion-based softening excels with complex tissue deformations while distance-based softening provides stability during unreliable optical flow conditions. We evaluate EndoControlMag on our EndoVMM24 dataset spanning four different surgery types and various challenging scenarios, including occlusions, instrument disturbance, view changes, and vessel deformations. Quantitative metrics, visual assessments, and expert surgeon evaluations demonstrate that EndoControlMag significantly outperforms existing methods in both magnification accuracy and visual quality while maintaining robustness across challenging surgical conditions. The code, dataset, and video results are available at https://szupc.github.io/EndoControlMag/.
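
The distance-based softening strategy admits a compact formulation; a minimal sketch under assumed parameter names (`alpha_max` and `decay` are illustrative, not the paper's values):

```python
import numpy as np

# Distance-based exponential decay: tissue farther from the tracked
# vessel core receives exponentially weaker motion magnification,
# imitating biomechanical force attenuation.
def magnification_weight(dist_to_vessel, alpha_max=10.0, decay=0.05):
    return alpha_max * np.exp(-decay * dist_to_vessel)

print(magnification_weight(2.0))    # near the vessel: strong boost
print(magnification_weight(40.0))   # far from the vessel: weak boost
```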

[743] MedSR-Impact: Transformer-Based Super-Resolution for Lung CT Segmentation, Radiomics, Classification, and Prognosis

Marc Boubnovski Martell, Kristofer Linton-Reid, Mitchell Chen, Sumeet Hindocha, Benjamin Hunter, Marco A. Calzado, Richard Lee, Joram M. Posma, Eric O. Aboagye

Main category: eess.IV

TL;DR: TVSRN-V2, a transformer-based super-resolution framework, improves lung CT analysis by enhancing resolution and integrating with clinical workflows, showing significant gains in segmentation, radiomics, and prognosis.

DetailsMotivation: High-resolution CT is crucial for thoracic disease diagnosis but limited by radiation dose and cost. TVSRN-V2 aims to address these limitations.

Method: Uses Through-Plane Attention Blocks and Swin Transformer V2 for super-resolution, with pseudo-low-resolution augmentation for robustness.

Result: Improves segmentation (+4% Dice), radiomic reproducibility, and predictive performance (+0.06 C-index/AUC).

Conclusion: TVSRN-V2 is a clinically viable solution for dose-efficient CT imaging and analysis.

Abstract: High-resolution volumetric computed tomography (CT) is essential for accurate diagnosis and treatment planning in thoracic diseases; however, it is limited by radiation dose and hardware costs. We present the Transformer Volumetric Super-Resolution Network (\textbf{TVSRN-V2}), a transformer-based super-resolution (SR) framework designed for practical deployment in clinical lung CT analysis. Built from scalable components, including Through-Plane Attention Blocks (TAB) and Swin Transformer V2, our model effectively reconstructs fine anatomical details in low-dose CT volumes and integrates seamlessly with downstream analysis pipelines. We evaluate its effectiveness on three critical lung cancer tasks (lobe segmentation, radiomics, and prognosis) across multiple clinical cohorts. To enhance robustness across variable acquisition protocols, we introduce pseudo-low-resolution augmentation, simulating scanner diversity without requiring private data. TVSRN-V2 demonstrates a significant improvement in segmentation accuracy (+4% Dice), higher radiomic feature reproducibility, and enhanced predictive performance (+0.06 C-index and AUC). These results indicate that SR-driven recovery of structural detail significantly enhances clinical decision support, positioning TVSRN-V2 as a well-engineered, clinically viable system for dose-efficient imaging and quantitative analysis in real-world CT workflows.
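
A sketch of what pseudo-low-resolution augmentation can look like in practice, assuming through-plane degradation of CT volumes (the factor and interpolation mode are assumptions):

```python
import torch
import torch.nn.functional as F

# Simulate a thick-slice scanner by downsampling a CT volume along the
# through-plane axis and re-upsampling it to the original grid.
def pseudo_low_res(volume, factor=3):
    # volume: (N, C, D, H, W), D is the through-plane dimension
    n, c, d, h, w = volume.shape
    low = F.interpolate(volume, size=(max(d // factor, 1), h, w),
                        mode="trilinear", align_corners=False)
    return F.interpolate(low, size=(d, h, w),
                         mode="trilinear", align_corners=False)
```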

[744] Latent Space Synergy: Text-Guided Data Augmentation for Direct Diffusion Biomedical Segmentation

Muhammad Aqeel, Maham Nazir, Zanxi Ruan, Francesco Setti

Main category: eess.IV

TL;DR: SynDiff combines text-guided synthetic data generation with efficient diffusion-based segmentation to address data scarcity in medical image segmentation, achieving high accuracy and real-time performance.

DetailsMotivation: Medical image segmentation, especially polyp detection, faces data scarcity due to the need for specialized annotation expertise.

Method: Uses latent diffusion models for text-conditioned inpainting to generate realistic synthetic polyps, with direct latent estimation for single-step inference.

Result: Achieves 96.0% Dice and 92.9% IoU on CVC-ClinicDB with real-time capability.

Conclusion: SynDiff provides an efficient solution for deploying deep learning models in resource-limited medical settings by improving segmentation robustness without distribution shift.

Abstract: Medical image segmentation suffers from data scarcity, particularly in polyp detection where annotation requires specialized expertise. We present SynDiff, a framework combining text-guided synthetic data generation with efficient diffusion-based segmentation. Our approach employs latent diffusion models to generate clinically realistic synthetic polyps through text-conditioned inpainting, augmenting limited training data with semantically diverse samples. Unlike traditional diffusion methods requiring iterative denoising, we introduce direct latent estimation enabling single-step inference with a T× computational speedup. On CVC-ClinicDB, SynDiff achieves 96.0% Dice and 92.9% IoU while maintaining real-time capability suitable for clinical deployment. The framework demonstrates that controlled synthetic augmentation improves segmentation robustness without distribution shift. SynDiff bridges the gap between data-hungry deep learning models and clinical constraints, offering an efficient solution for deployment in resource-limited medical settings.
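
One standard route to single-step inference is the DDPM identity that recovers a clean-sample estimate from a single network evaluation; whether SynDiff's direct latent estimation takes exactly this form is an assumption, but in common diffusion notation:

$$\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\;\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}},$$

i.e., one forward pass of the noise predictor $\epsilon_\theta$ replaces the full iterative denoising trajectory, which is where a T× speedup over T-step sampling would come from.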

[745] A Steel Surface Defect Detection Method Based on Lightweight Convolution Optimization

Cong Chen, Ming Chen, Hoileong Lee, Yan Li, Jiyang Yu

Main category: eess.IV

TL;DR: A deep learning framework (YOLOv9s with C3Ghost, SCConv, and CARAFE) improves steel surface defect detection accuracy and robustness.

DetailsMotivation: Traditional methods struggle with multi-scale defects due to insufficient accuracy and high miss-detection rates.

Method: Combines YOLOv9s with SCConv for feature optimization, C3Ghost for efficient feature extraction, and CARAFE for precise upsampling.

Result: The proposed model outperforms other methods in accuracy and robustness for defect detection.

Conclusion: The framework effectively addresses challenges in steel surface defect detection, enhancing performance.

Abstract: Surface defect detection of steel, especially the recognition of multi-scale defects, has always been a major challenge in industrial manufacturing. Steel surfaces exhibit defects of various sizes and shapes, which limits the accuracy of traditional image processing and detection methods in complex environments; in particular, traditional defect detection methods suffer from insufficient accuracy and high miss-detection rates when dealing with small target defects. To address this issue, this study proposes a detection framework based on deep learning, specifically YOLOv9s, combined with the C3Ghost module, SCConv module, and CARAFE upsampling operator, to improve detection accuracy and model performance. First, the SCConv module is used to reduce feature redundancy and optimize feature representation by reconstructing the spatial and channel dimensions. Second, the C3Ghost module is introduced to enhance the model's feature extraction ability by reducing redundant computations and parameter volume, thereby improving model efficiency. Finally, the CARAFE upsampling operator, which can more finely reorganize feature maps in a content-aware manner, optimizes the upsampling process and ensures detailed restoration of high-resolution defect regions. Experimental results demonstrate that the proposed model achieves higher accuracy and robustness in steel surface defect detection tasks compared to other methods, effectively addressing defect detection problems.

[746] DeSamba: Decoupled Spectral Adaptive Framework for 3D Multi-Sequence MRI Lesion Classification

Dezhen Wang, Sheng Miao, Rongxin Chai, Jiufa Cui

Main category: eess.IV

TL;DR: DeSamba, a novel framework for 3D lesion classification in multi-sequence MRI, outperforms SOTA methods by decoupling and adaptively fusing spatial and spectral features.

DetailsMotivation: Effective integration of multi-sequence MRI data for robust 3D lesion classification is challenging.

Method: DeSamba uses a Decoupled Representation Learning Module (DRLM) for feature decoupling and a Spectral Adaptive Modulation Block (SAMB) for dynamic fusion of spectral and spatial features.

Result: Achieves 62.10% Top-1 accuracy, 63.62% F1-score, and 87.71% AUC on a spinal metastasis dataset, and 70.00%/64.52% accuracy on a spondylitis dataset.

Conclusion: DeSamba is a generalizable and effective solution for 3D lesion classification in multi-sequence medical imaging.

Abstract: Magnetic Resonance Imaging (MRI) sequences provide rich spatial and frequency domain information, which is crucial for accurate lesion classification in medical imaging. However, effectively integrating multi-sequence MRI data for robust 3D lesion classification remains a challenge. In this paper, we propose DeSamba (Decoupled Spectral Adaptive Network and Mamba-Based Model), a novel framework designed to extract decoupled representations and adaptively fuse spatial and spectral features for lesion classification. DeSamba introduces a Decoupled Representation Learning Module (DRLM) that decouples features from different MRI sequences through self-reconstruction and cross-reconstruction, and a Spectral Adaptive Modulation Block (SAMB) within the proposed SAMNet, enabling dynamic fusion of spectral and spatial information based on lesion characteristics. We evaluate DeSamba on two clinically relevant 3D datasets. On a six-class spinal metastasis dataset (n=1,448), DeSamba achieves 62.10% Top-1 accuracy, 63.62% F1-score, 87.71% AUC, and 93.55% Top-3 accuracy on an external validation set (n=372), outperforming all state-of-the-art (SOTA) baselines. On a spondylitis dataset (n=251) involving a challenging binary classification task, DeSamba achieves 70.00%/64.52% accuracy and 74.75/73.88 AUC on internal and external validation sets, respectively. Ablation studies demonstrate that both DRLM and SAMB significantly contribute to overall performance, with over 10% relative improvement compared to the baseline. Our results highlight the potential of DeSamba as a generalizable and effective solution for 3D lesion classification in multi-sequence medical imaging.

[747] RARE-UNet: Resolution-Aligned Routing Entry for Adaptive Medical Image Segmentation

Simon Winther Albertsen, Hjalte Svaneborg Bjørnstrup, Mostafa Mehdipour Ghazi

Main category: eess.IV

TL;DR: RARE-UNet is a resolution-aware segmentation model that dynamically adapts to input resolution, outperforming existing methods in accuracy and efficiency.

DetailsMotivation: Existing models degrade with lower-resolution inputs, limiting real-world clinical applications.

Method: Proposes RARE-UNet with multi-scale blocks, resolution-aware routing, and consistency-driven training.

Result: Achieves highest Dice scores (0.84 and 0.65) and reduced inference time at lower resolutions.

Conclusion: RARE-UNet is effective and scalable for resolution-robust segmentation.

Abstract: Accurate segmentation is crucial for clinical applications, but existing models often assume fixed, high-resolution inputs and degrade significantly when faced with lower-resolution data in real-world scenarios. To address this limitation, we propose RARE-UNet, a resolution-aware multi-scale segmentation architecture that dynamically adapts its inference path to the spatial resolution of the input. Central to our design are multi-scale blocks integrated at multiple encoder depths, a resolution-aware routing mechanism, and consistency-driven training that aligns multi-resolution features with full-resolution representations. We evaluate RARE-UNet on two benchmark brain imaging tasks for hippocampus and tumor segmentation. Compared to standard UNet, its multi-resolution augmented variant, and nnUNet, our model achieves the highest average Dice scores of 0.84 and 0.65 across resolutions, while maintaining consistent performance and significantly reduced inference time at lower resolutions. These results highlight the effectiveness and scalability of our architecture in achieving resolution-robust segmentation. The code is available at: https://github.com/simonsejse/RARE-UNet.

[748] Efficient onboard multi-task AI architecture based on self-supervised learning

Gabriele Inzerillo, Diego Valsesia, Enrico Magli

Main category: eess.IV

TL;DR: A blueprint for designing modular, efficient deep learning payloads for onboard satellite AI, featuring a self-supervised lightweight backbone and task-specific heads, achieving competitive results on embedded systems.

DetailsMotivation: Address the need for quick analysis and rapid response to critical events (e.g., natural disasters) using AI onboard satellites.

Method: Develop a self-supervised lightweight backbone for feature extraction, paired with efficient task-specific heads, reducing labeling requirements.

Result: Competitive performance on cloud segmentation, flood detection, and marine debris classification, with high throughput (8 Mpx/s) on a 7W embedded system.

Conclusion: The proposed modular design enables efficient, high-quality onboard inference for multiple tasks, suitable for satellite applications.

Abstract: There is growing interest in the use of AI directly onboard satellites for quick analysis and rapid response to critical events such as natural disasters. This paper presents a blueprint for mission designers developing a modular and efficient deep learning payload to address multiple onboard inference tasks. In particular, we design a self-supervised lightweight backbone that provides features to efficient task-specific heads. The latter can be developed independently and with reduced data labeling requirements thanks to the frozen backbone. Experiments on three sample tasks of cloud segmentation, flood detection, and marine debris classification on a 7W embedded system show competitive results, with inference quality close to high-complexity state-of-the-art models and throughput in excess of 8 Mpx/s.
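
The frozen-backbone design translates directly into code; a minimal sketch with stand-in shapes (the backbone and head architectures here are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

# One frozen, self-supervised backbone feeds several lightweight task
# heads that can be trained independently with few labels.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
for p in backbone.parameters():
    p.requires_grad = False                  # backbone stays frozen

cloud_head = nn.Linear(64, 2)                # e.g., cloud vs. clear
flood_head = nn.Linear(64, 2)                # e.g., flooded vs. dry

x = torch.randn(4, 3, 256, 256)
with torch.no_grad():
    feats = backbone(x)                      # shared features, computed once
cloud_logits = cloud_head(feats)             # only the heads receive gradients
flood_logits = flood_head(feats)
```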

[749] Coupling AI and Citizen Science in Creation of Enhanced Training Dataset for Medical Image Segmentation

Amir Syahmi, Xiangrong Lu, Yinxuan Li, Haoxuan Yao, Hanjun Jiang, Ishita Acharya, Shiyi Wang, Yang Nan, Xiaodan Xing, Guang Yang

Main category: eess.IV

TL;DR: A framework combining AI and crowdsourcing improves medical image annotation quality and quantity, enhancing DL model training.

DetailsMotivation: Overcome the limitations of manual medical image annotation by leveraging AI and crowdsourcing for scalable, high-quality datasets.

Method: Integrates MedSAM segmentation AI and crowdsourcing via an online platform, uses pix2pixGAN for synthetic data, and merges crowd labels for quality.

Result: Significantly boosts DL model performance, especially with limited training data.

Conclusion: The framework provides a scalable, efficient solution for enhancing medical image datasets and DL model training.

Abstract: Recent advancements in medical imaging and artificial intelligence (AI) have greatly enhanced diagnostic capabilities, but the development of effective deep learning (DL) models is still constrained by the lack of high-quality annotated datasets. The traditional manual annotation process by medical experts is time- and resource-intensive, limiting the scalability of these datasets. In this work, we introduce a robust and versatile framework that combines AI and crowdsourcing to improve both the quality and quantity of medical image datasets across different modalities. Our approach utilises a user-friendly online platform that enables a diverse group of crowd annotators to label medical images efficiently. By integrating the MedSAM segmentation AI with this platform, we accelerate the annotation process while maintaining expert-level quality through an algorithm that merges crowd-labelled images. Additionally, we employ pix2pixGAN, a generative AI model, to expand the training dataset with synthetic images that capture realistic morphological features. These methods are combined into a cohesive framework designed to produce an enhanced dataset, which can serve as a universal pre-processing pipeline to boost the training of any medical deep learning segmentation model. Our results demonstrate that this framework significantly improves model performance, especially when training data is limited.
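
The abstract does not spell out the label-merging algorithm, so as an illustrative baseline only, per-pixel majority voting over crowd masks looks like this:

```python
import numpy as np

def merge_crowd_masks(masks):
    """Merge binary masks from several annotators by majority vote.

    masks: (num_annotators, H, W) array with values in {0, 1}.
    """
    votes = np.asarray(masks).sum(axis=0)
    return (votes > len(masks) / 2).astype(np.uint8)
```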

[750] DualSwinUnet++: An Enhanced Swin-Unet Architecture With Dual Decoders For PTMC Segmentation

Maryam Dialameh, Hossein Rajabzadeh, Moslem Sadeghi-Goughari, Jung Suk Sim, Hyock Ju Kwon

Main category: eess.IV

TL;DR: DualSwinUnet++ is a transformer-based model for precise PTMC segmentation in ultrasound-guided RFA, leveraging thyroid gland context and achieving superior performance with real-time capability.

DetailsMotivation: Accurate PTMC segmentation is challenging due to artifacts, small lesion size, and anatomical variability, necessitating improved methods for effective treatment.

Method: DualSwinUnet++ uses a dual-decoder architecture with independent linear projection heads and residual information flow to incorporate thyroid gland context without gradient interference.

Result: The model outperforms state-of-the-art methods in Dice and Jaccard scores and maintains sub-200ms inference latency.

Conclusion: DualSwinUnet++ is effective for real-time surgical assistance and improves segmentation accuracy in challenging PTMC cases.

Abstract: Precise segmentation of papillary thyroid microcarcinoma (PTMC) during ultrasound-guided radiofrequency ablation (RFA) is critical for effective treatment but remains challenging due to acoustic artifacts, small lesion size, and anatomical variability. In this study, we propose DualSwinUnet++, a dual-decoder transformer-based architecture designed to enhance PTMC segmentation by incorporating thyroid gland context. DualSwinUnet++ employs independent linear projection heads for each decoder and a residual information flow mechanism that passes intermediate features from the first (thyroid) decoder to the second (PTMC) decoder via concatenation and transformation. These design choices allow the model to condition tumor prediction explicitly on gland morphology without shared gradient interference. Trained on a clinical ultrasound dataset with 691 annotated RFA images and evaluated against state-of-the-art models, DualSwinUnet++ achieves superior Dice and Jaccard scores while maintaining sub-200ms inference latency. The results demonstrate the model’s suitability for near real-time surgical assistance and its effectiveness in improving segmentation accuracy in challenging PTMC cases.

[751] CABLD: Contrast-Agnostic Brain Landmark Detection with Consistency-Based Regularization

Soorena Salari, Arash Harirpoush, Hassan Rivaz, Yiming Xiao

Main category: eess.IV

TL;DR: CABLD is a self-supervised DL framework for 3D brain landmark detection in unlabeled scans, requiring only one reference example. It outperforms state-of-the-art methods in accuracy and generalizes well across imaging contrasts.

DetailsMotivation: Manual landmark annotation is time-consuming and expertise-intensive, while existing DL methods need large annotated datasets. CABLD addresses these limitations by reducing annotation dependency.

Method: Uses inter-subject landmark consistency loss, image registration loss, 3D convolution-based contrast augmentation, and an adaptive mixed loss function for optimal performance.

Result: Outperforms state-of-the-art methods in mean radial errors (MREs) and success detection rates (SDRs) across diverse datasets.

Conclusion: CABLD provides a robust, accurate solution for anatomical landmark detection, reducing reliance on annotated datasets and generalizing well across imaging contrasts.

Abstract: Anatomical landmark detection in medical images is essential for various clinical and research applications, including disease diagnosis and surgical planning. However, manual landmark annotation is time-consuming and requires significant expertise. Existing deep learning (DL) methods often require large amounts of well-annotated data, which are costly to acquire. In this paper, we introduce CABLD, a novel self-supervised DL framework for 3D brain landmark detection in unlabeled scans with varying contrasts by using only a single reference example. To achieve this, we employed an inter-subject landmark consistency loss with an image registration loss while introducing a 3D convolution-based contrast augmentation strategy to promote model generalization to new contrasts. Additionally, we utilize an adaptive mixed loss function to schedule the contributions of different sub-tasks for optimal outcomes. We demonstrate the proposed method with the intricate task of MRI-based 3D brain landmark detection. With comprehensive experiments on four diverse clinical and public datasets, including both T1w and T2w MRI scans at different MRI field strengths, we demonstrate that CABLD outperforms the state-of-the-art methods in terms of mean radial errors (MREs) and success detection rates (SDRs). Our framework provides a robust and accurate solution for anatomical landmark detection, reducing the need for extensively annotated datasets and generalizing well across different imaging contrasts. Our code is publicly available at https://github.com/HealthX-Lab/CABLD.
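
A sketch of what an inter-subject landmark consistency loss can look like: landmarks predicted in subject A, mapped through the registration deformation into subject B, should agree with landmarks predicted in B directly. The `warp_points` helper is a hypothetical interface, not the paper's:

```python
import torch

def consistency_loss(landmarks_a, landmarks_b, warp_points):
    """landmarks_a/b: (K, 3) predicted landmark coordinates in each
    subject; warp_points maps A-space points into B-space via the
    registration deformation (hypothetical helper)."""
    mapped = warp_points(landmarks_a)
    return torch.mean(torch.norm(mapped - landmarks_b, dim=-1))
```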

[752] DGSSA: Domain generalization with structural and stylistic augmentation for retinal vessel segmentation

Bo Liu, Yudong Zhang, Shuihua Wang, Siyue Li, Jin Hong

Main category: eess.IV

TL;DR: The paper introduces DGSSA, a novel method for retinal vessel segmentation, combining structural and style augmentation to improve generalization across diverse datasets.

DetailsMotivation: Accurate retinal vessel segmentation is vital for diagnosing diseases, but traditional methods fail due to domain shifts from varying imaging devices and patient demographics.

Method: DGSSA uses a space colonization algorithm to generate vascular-like structures and Pix2Pix for pseudo-retinal images. PixMix adds photometric augmentations and uncertainty perturbations for style diversity.

Result: Evaluated on DRIVE, CHASEDB, HRF, and STARE datasets, DGSSA achieves state-of-the-art performance.

Conclusion: DGSSA’s effectiveness in handling domain shifts makes it promising for clinical retinal vessel analysis.

Abstract: Retinal vascular morphology is crucial for diagnosing diseases such as diabetes, glaucoma, and hypertension, making accurate segmentation of retinal vessels essential for early intervention. Traditional segmentation methods assume that training and testing data share similar distributions, which can lead to poor performance on unseen domains due to domain shifts caused by variations in imaging devices and patient demographics. This paper presents a novel approach, DGSSA, for retinal vessel image segmentation that enhances model generalization by combining structural and style augmentation strategies. We utilize a space colonization algorithm to generate diverse vascular-like structures that closely mimic actual retinal vessels, which are then used to generate pseudo-retinal images with an improved Pix2Pix model, allowing the segmentation model to learn a broader range of structure distributions. Additionally, we utilize PixMix to implement random photometric augmentations and introduce uncertainty perturbations, thereby enriching stylistic diversity and significantly enhancing the model’s adaptability to varying imaging conditions. Our framework has been rigorously evaluated on four challenging datasets-DRIVE, CHASEDB, HRF, and STARE-demonstrating state-of-the-art performance that surpasses existing methods. This validates the effectiveness of our proposed approach, highlighting its potential for clinical application in automated retinal vessel analysis.

[753] Influence of High-Performance Image-to-Image Translation Networks on Clinical Visual Assessment and Outcome Prediction: Utilizing Ultrasound to MRI Translation in Prostate Cancer

Mohammad R. Salmanpour, Amin Mousavi, Yixi Xu, William B Weeks, Ilker Hacihaliloglu

Main category: eess.IV

TL;DR: The study evaluates image-to-image translation (I2I) networks for clinical use, finding 2D-Pix2Pix superior in performance but needing improvement in low-level feature recognition. Synthetic MRI data improved classification accuracy.

DetailsMotivation: To assess the effectiveness and adaptability of I2I networks in clinical settings, particularly for prostate cancer diagnosis.

Method: Analyzed data from 794 prostate cancer patients using 10 I2I networks, introduced RF analysis, and evaluated synthetic images with physicians. Tested synthetic MRI data on ML/DL methods.

Result: 2D-Pix2Pix outperformed others (SSIM~0.855), identified 76/186 RFs, but lost half during translation. Synthetic image-based classification achieved ~0.93 accuracy/AUC.

Conclusion: 2D-Pix2Pix leads in performance but needs low-level feature improvement. Synthetic images enhance classification over original US images.

Abstract: Purpose: This study examines the core traits of image-to-image translation (I2I) networks, focusing on their effectiveness and adaptability in everyday clinical settings. Methods: We have analyzed data from 794 patients diagnosed with prostate cancer (PCa), using ten prominent 2D/3D I2I networks to convert ultrasound (US) images into MRI scans. We also introduced a new analysis of Radiomic features (RF) via the Spearman correlation coefficient to explore whether networks with high performance (SSIM>85%) could detect subtle RFs. Our study further examined synthetic images by 7 invited physicians. As a final evaluation study, we have investigated the improvement that is achieved using the synthetic MRI data on two traditional machine learning and one deep learning method. Results: In quantitative assessment, the 2D-Pix2Pix network substantially outperformed the other 7 networks, with an average SSIM of 0.855. The RF analysis revealed that 76 out of 186 RFs were identified using the 2D-Pix2Pix algorithm alone, although half of the RFs were lost during the translation process. A detailed qualitative review by 7 medical doctors noted a deficiency in low-level feature recognition in I2I tasks. Furthermore, the study found that synthesized image-based classification outperformed US image-based classification with an average accuracy and AUC of approximately 0.93. Conclusion: This study showed that while 2D-Pix2Pix outperformed cutting-edge networks in low-level feature discovery and overall error and similarity metrics, it still requires improvement in low-level feature performance, as highlighted by Group 3. Further, the study found that synthetic image-based classification outperformed original US image-based methods.
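
A sketch of the Spearman-based RF survival check described in the Methods (the 0.7 retention threshold is an illustrative assumption, not the paper's stated criterion):

```python
from scipy.stats import spearmanr

def surviving_features(rf_real, rf_synth, threshold=0.7):
    """rf_real, rf_synth: dicts mapping feature name -> per-patient
    values on real vs. synthetic images. A feature 'survives'
    translation if its patient ranking is preserved."""
    kept = []
    for name in rf_real:
        rho, _ = spearmanr(rf_real[name], rf_synth[name])
        if rho >= threshold:      # NaN compares False, so it is skipped
            kept.append(name)
    return kept
```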

[754] OncoReg: Medical Image Registration for Oncological Challenges

Wiebke Heyer, Yannic Elser, Lennart Berkel, Xinrui Song, Xuanang Xu, Pingkun Yan, Xi Jia, Jinming Duan, Zi Li, Tony C. W. Mok, BoWen LI, Tim Hable, Christian Staackmann, Christoph Großbröhmer, Lasse Hansen, Alessa Hering, Malte M. Sieren, Mattias P. Heinrich

Main category: eess.IV

TL;DR: The OncoReg Challenge tackles underutilized medical data in cancer research by enabling privacy-preserving AI model development for image registration, focusing on CBCT and FBCT alignment in radiotherapy.

DetailsMotivation: Patient privacy concerns limit the use of medical data in cancer research. The challenge aims to develop generalizable AI models while ensuring privacy.

Method: A two-phase framework: phase one uses public data, phase two trains models on private data within secure networks. Focuses on CBCT-FBCT registration.

Result: Feature extraction is key. A new versatile method emerged, while established techniques remain competitive. Combining deep learning and classical methods works best.

Conclusion: The OncoReg Challenge highlights the importance of feature extraction and hybrid approaches in image registration, advancing privacy-preserving AI in oncology.

Abstract: In modern cancer research, the vast volume of medical data generated is often underutilised due to challenges related to patient privacy. The OncoReg Challenge addresses this issue by enabling researchers to develop and validate image registration methods through a two-phase framework that ensures patient privacy while fostering the development of more generalisable AI models. Phase one involves working with a publicly available dataset, while phase two focuses on training models on a private dataset within secure hospital networks. OncoReg builds upon the foundation established by the Learn2Reg Challenge by incorporating the registration of interventional cone-beam computed tomography (CBCT) with standard planning fan-beam CT (FBCT) images in radiotherapy. Accurate image registration is crucial in oncology, particularly for dynamic treatment adjustments in image-guided radiotherapy, where precise alignment is necessary to minimise radiation exposure to healthy tissues while effectively targeting tumours. This work details the methodology and data behind the OncoReg Challenge and provides a comprehensive analysis of the competition entries and results. Findings reveal that feature extraction plays a pivotal role in this registration task. A new method emerging from this challenge demonstrated its versatility, while established approaches continue to perform comparably to newer techniques. Both deep learning and classical approaches still play significant roles in image registration, with the combination of methods, particularly in feature extraction, proving most effective.

[755] HER-Seg: Holistically Efficient Segmentation for High-Resolution Medical Images

Qing Xu, Zhenye Lou, Chenxin Li, Yue Li, Xiangjian He, Tesema Fiseha Berhanu, Rong Qu, Wenting Duan, Zhen Chen

Main category: eess.IV

TL;DR: HER-Seg is a computationally and memory-efficient framework for high-resolution medical image segmentation, outperforming state-of-the-art methods with reduced resource usage.

DetailsMotivation: Existing hierarchical encoder-decoder frameworks for medical segmentation are resource-intensive, limiting their use in foundation models and clinical settings.

Method: HER-Seg introduces a computation-efficient encoder (CE-Encoder) with dual-gated linear attention (DLA) and a memory-efficient decoder (ME-Decoder) for cross-scale decoding.

Result: HER-Seg achieves superior performance in 2D, 3D, and video segmentation tasks, using only 0.59GB GPU memory and 9.39G FLOPs per 1024x1024 image.

Conclusion: HER-Seg addresses efficiency limitations in high-resolution medical segmentation, offering a practical solution for real-world applications.

Abstract: High-resolution segmentation is critical for precise disease diagnosis by extracting fine-grained morphological details. Existing hierarchical encoder-decoder frameworks have demonstrated remarkable adaptability across diverse medical segmentation tasks. While beneficial, they usually incur huge computation and memory costs when handling large-size segmentation, which limits their applications in foundation model building and real-world clinical scenarios. To address this limitation, we propose a holistically efficient framework for high-resolution medical image segmentation, called HER-Seg. Specifically, we first devise a computation-efficient image encoder (CE-Encoder) to model long-range dependencies with linear complexity while maintaining sufficient representations. In particular, we introduce the dual-gated linear attention (DLA) mechanism to perform cascaded token filtering, selectively retaining important tokens while ignoring irrelevant ones to enhance attention computation efficiency. Then, we introduce a memory-efficient mask decoder (ME-Decoder) to eliminate the demand for the hierarchical structure by leveraging cross-scale segmentation decoding. Extensive experiments reveal that HER-Seg outperforms state-of-the-art methods in high-resolution medical 2D, 3D and video segmentation tasks. In particular, our HER-Seg requires only 0.59GB training GPU memory and 9.39G inference FLOPs per 1024$\times$1024 image, demonstrating superior memory and computation efficiency. The code is available at https://github.com/xq141839/HER-Seg.
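
For intuition, generic linear attention replaces the O(N²) softmax product with feature maps and an associativity trick that is linear in sequence length; adding token-filtering gates gives a rough, assumed approximation of the paper's dual-gated variant (the gate design and cascading details are not specified in the abstract):

```python
import torch
import torch.nn.functional as F

def dual_gated_linear_attention(q, k, v, gate1, gate2, eps=1e-6):
    """q, k, v: (B, N, D); gate1, gate2: (B, N, 1) in [0, 1].
    Gates downweight unimportant tokens before the linear-attention
    product, keeping the whole operation linear in sequence length N."""
    k = k * gate1 * gate2                      # cascaded token filtering
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0      # positive feature maps
    kv = torch.einsum("bnd,bne->bde", k, v)    # (B, D, D), O(N) cost
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)
```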

[756] Comparative Evaluation of Radiomics and Deep Learning Models for Disease Detection in Chest Radiography

Zhijin He, Alan B. McMillan

Main category: eess.IV

TL;DR: The study evaluates radiomics and deep learning models for disease detection in chest radiography, finding deep learning superior in performance and scalability, while radiomics is useful in low-data scenarios.

DetailsMotivation: To compare the effectiveness of radiomics-based and deep learning-based AI models for diagnosing diseases like COVID-19, lung opacity, and viral pneumonia in chest radiography.

Method: Systematic comparison of radiomics models (Decision Trees, Gradient Boosting, Random Forests, SVM, MLP) and deep learning models (InceptionV3, EfficientNetL, ConvNeXtXLarge) across varying sample sizes, with performance metrics like AUC evaluated.

Result: Deep learning models (e.g., InceptionV3, EfficientNetL) outperformed radiomics models, especially with larger datasets (AUC up to 0.996). Radiomics models (e.g., SVM, Random Forest) were competitive in low-data scenarios (AUC 0.762-0.885).

Conclusion: Deep learning models are recommended for high-data settings due to superior performance, while radiomics models remain viable for low-data environments, providing practical guidance for AI deployment in diagnostics.

Abstract: The application of artificial intelligence (AI) in medical imaging has revolutionized diagnostic practices, enabling advanced analysis and interpretation of radiological data. This study presents a comprehensive evaluation of radiomics-based and deep learning-based approaches for disease detection in chest radiography, focusing on COVID-19, lung opacity, and viral pneumonia. While deep learning models, particularly convolutional neural networks and vision transformers, learn directly from image data, radiomics-based models extract handcrafted features, offering potential advantages in data-limited scenarios. We systematically compared the diagnostic performance of various AI models, including Decision Trees, Gradient Boosting, Random Forests, Support Vector Machines, and Multi-Layer Perceptrons for radiomics, against state-of-the-art deep learning models such as InceptionV3, EfficientNetL, and ConvNeXtXLarge. Performance was evaluated across multiple sample sizes. At 24 samples, EfficientNetL achieved an AUC of 0.839, outperforming SVM (AUC = 0.762). At 4000 samples, InceptionV3 achieved the highest AUC of 0.996, compared to 0.885 for Random Forest. A Scheirer-Ray-Hare test confirmed significant main and interaction effects of model type and sample size on all metrics. Post hoc Mann-Whitney U tests with Bonferroni correction further revealed consistent performance advantages for deep learning models across most conditions. These findings provide statistically validated, data-driven recommendations for model selection in diagnostic AI. Deep learning models demonstrated higher performance and better scalability with increasing data availability, while radiomics-based models may remain useful in low-data contexts. This study addresses a critical gap in AI-based diagnostic research by offering practical guidance for deploying AI models across diverse clinical environments.
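
The post hoc procedure named in the abstract is straightforward to reproduce with SciPy; the AUC lists below are placeholders, not the study's data:

```python
from scipy.stats import mannwhitneyu

auc = {  # placeholder per-run AUCs for three of the compared models
    "InceptionV3":  [0.99, 0.98, 0.99, 0.97, 0.99],
    "RandomForest": [0.89, 0.88, 0.90, 0.87, 0.88],
    "SVM":          [0.76, 0.77, 0.75, 0.78, 0.76],
}
names = list(auc)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
for a, b in pairs:
    _, p = mannwhitneyu(auc[a], auc[b], alternative="two-sided")
    p_adj = min(1.0, p * len(pairs))   # Bonferroni correction
    print(f"{a} vs {b}: adjusted p = {p_adj:.4f}")
```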

[757] OSCAR: One-Step Diffusion Codec Across Multiple Bit-rates

Jinpei Guo, Yifei Ji, Zheng Chen, Kai Liu, Min Liu, Wang Rao, Wenbo Li, Yong Guo, Yulun Zhang

Main category: eess.IV

TL;DR: OSCAR is a one-step diffusion codec for image compression across multiple bit-rates, improving efficiency and reducing training costs.

DetailsMotivation: Existing diffusion-based methods are computationally intensive and require separate models for different bit-rates, leading to high costs.

Method: OSCAR models compressed latents as noisy variants of original latents, mapping bit-rates to pseudo diffusion timesteps for one-step denoising.

Result: OSCAR achieves superior performance in quantitative and visual quality metrics with improved inference efficiency.

Conclusion: OSCAR offers an efficient, high-quality solution for diffusion-based image compression with reduced computational and storage overhead.

Abstract: Pretrained latent diffusion models have shown strong potential for lossy image compression, owing to their powerful generative priors. Most existing diffusion-based methods reconstruct images by iteratively denoising from random noise, guided by compressed latent representations. While these approaches have achieved high reconstruction quality, their multi-step sampling process incurs substantial computational overhead. Moreover, they typically require training separate models for different compression bit-rates, leading to significant training and storage costs. To address these challenges, we propose a one-step diffusion codec across multiple bit-rates, termed OSCAR. Specifically, our method views compressed latents as noisy variants of the original latents, where the level of distortion depends on the bit-rate. This perspective allows them to be modeled as intermediate states along a diffusion trajectory. By establishing a mapping from the compression bit-rate to a pseudo diffusion timestep, we condition a single generative model to support reconstructions at multiple bit-rates. Meanwhile, we argue that the compressed latents retain rich structural information, thereby making one-step denoising feasible. Thus, OSCAR replaces iterative sampling with a single denoising pass, significantly improving inference efficiency. Extensive experiments demonstrate that OSCAR achieves superior performance in both quantitative and visual quality metrics. The code and models will be released at https://github.com/jp-guo/OSCAR.
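
In standard diffusion notation (a plausible formalization of the abstract, not a quotation), the compressed latent at bit-rate $r$ is treated as a noisy state at pseudo timestep $\tau(r)$:

$$\hat{z}_r \approx \sqrt{\bar{\alpha}_{\tau(r)}}\, z_0 + \sqrt{1-\bar{\alpha}_{\tau(r)}}\;\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$$

so that a single model conditioned on $\tau(r)$ can recover the clean latent in one pass, e.g. via $\hat{z}_0 = \big(\hat{z}_r - \sqrt{1-\bar{\alpha}_{\tau(r)}}\,\epsilon_\theta(\hat{z}_r, \tau(r))\big)/\sqrt{\bar{\alpha}_{\tau(r)}}$. Lower bit-rates (more distortion) map to larger $\tau$, higher bit-rates to smaller $\tau$.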

[758] Exposing and Mitigating Calibration Biases and Demographic Unfairness in MLLM Few-Shot In-Context Learning for Medical Image Classification

Xing Shen, Justin Szeto, Mingyang Li, Hengguan Huang, Tal Arbel

Main category: eess.IV

TL;DR: The paper investigates calibration biases and demographic unfairness in MLLMs for medical image classification, introducing CALIN, an inference-time calibration method to mitigate biases and improve accuracy.

DetailsMotivation: Safe deployment of MLLMs in clinical practice requires analyzing prediction accuracies and calibration errors across demographic subgroups.

Method: CALIN, a bi-level calibration method, estimates and applies calibration matrices from population to subgroup levels during inference.

Result: CALIN improves prediction accuracy and ensures fair confidence calibration across three medical imaging datasets with minimal fairness-utility trade-off.

Conclusion: CALIN effectively addresses calibration biases and demographic unfairness in MLLMs, enhancing their reliability for clinical use.

Abstract: Multimodal large language models (MLLMs) have enormous potential to perform few-shot in-context learning in the context of medical image analysis. However, safe deployment of these models into real-world clinical practice requires an in-depth analysis of the accuracies of their predictions, and their associated calibration errors, particularly across different demographic subgroups. In this work, we present the first investigation into the calibration biases and demographic unfairness of MLLMs’ predictions and confidence scores in few-shot in-context learning for medical image classification. We introduce CALIN, an inference-time calibration method designed to mitigate the associated biases. Specifically, CALIN estimates the amount of calibration needed, represented by calibration matrices, using a bi-level procedure: progressing from the population level to the subgroup level prior to inference. It then applies this estimation to calibrate the predicted confidence scores during inference. Experimental results on three medical imaging datasets: PAPILA for fundus image classification, HAM10000 for skin cancer classification, and MIMIC-CXR for chest X-ray classification demonstrate CALIN’s effectiveness at ensuring fair confidence calibration in its prediction, while improving its overall prediction accuracies and exhibiting minimum fairness-utility trade-off. Our codebase can be found at https://github.com/xingbpshen/medical-calibration-fairness-mllm.
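
At inference time the calibration step reduces to matrix products on the confidence vector; a sketch with illustrative placeholder matrices (estimating them at the population and subgroup levels is the paper's bi-level contribution and is not reproduced here):

```python
import numpy as np

def calibrate(conf, pop_matrix, subgroup_matrix):
    """conf: (num_classes,) predicted confidence vector.
    Apply population-level then subgroup-level calibration and
    renormalize to a probability distribution."""
    c = subgroup_matrix @ (pop_matrix @ conf)
    return c / c.sum()

conf = np.array([0.7, 0.3])
pop = np.array([[0.9, 0.1],
                [0.1, 0.9]])          # placeholder population matrix
sub = np.eye(2)                       # placeholder subgroup matrix
print(calibrate(conf, pop, sub))
```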

Last updated: 2025-08-22