Daily arXiv Papers - 2025-09-08

AI-enhanced summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] INSEva: A Comprehensive Chinese Benchmark for Large Language Models in Insurance

Shisong Chen, Qian Zhu, Wenyan Yang, Chengyi Yang, Zhong Wang, Ping Wang, Xuan Lin, Bo Xu, Daqian Li, Chao Yuan, Licai Qi, Wanqing Xu, sun zhenxing, Xin Lu, Shiqiang Xiong, Chao Chen, Haixiang Hu, Yanghua Xiao

Main category: cs.CL

TL;DR: INSEva is a comprehensive Chinese benchmark for evaluating AI systems’ insurance knowledge, featuring 38,704 evaluation examples across multiple dimensions and showing significant performance gaps in complex insurance scenarios.

Motivation: Existing AI benchmarks fail to capture the unique characteristics and requirements of the insurance domain, which demands high accuracy and reliability in financial applications.

Method: Developed a multi-dimensional evaluation taxonomy covering business areas, task formats, difficulty levels, and cognitive-knowledge dimensions with 38,704 high-quality examples from authoritative materials, implementing tailored evaluation methods for faithfulness and completeness.

Result: Evaluation of 8 state-of-the-art LLMs showed basic insurance competency with average scores above 80, but revealed substantial performance gaps in handling complex, real-world insurance scenarios.

Conclusion: INSEva addresses the critical need for domain-specific AI evaluation in insurance, demonstrating that while general LLMs have basic competency, significant improvements are needed for complex insurance applications.

Abstract: Insurance, as a critical component of the global financial system, demands high standards of accuracy and reliability in AI applications. While existing benchmarks evaluate AI capabilities across various domains, they often fail to capture the unique characteristics and requirements of the insurance domain. To address this gap, we present INSEva, a comprehensive Chinese benchmark specifically designed for evaluating AI systems’ knowledge and capabilities in insurance. INSEva features a multi-dimensional evaluation taxonomy covering business areas, task formats, difficulty levels, and cognitive-knowledge dimension, comprising 38,704 high-quality evaluation examples sourced from authoritative materials. Our benchmark implements tailored evaluation methods for assessing both faithfulness and completeness in open-ended responses. Through extensive evaluation of 8 state-of-the-art Large Language Models (LLMs), we identify significant performance variations across different dimensions. While general LLMs demonstrate basic insurance domain competency with average scores above 80, substantial gaps remain in handling complex, real-world insurance scenarios. The benchmark will be public soon.

[2] Mentalic Net: Development of RAG-based Conversational AI and Evaluation Framework for Mental Health Support

Anandi Dutta, Shivani Mruthyunjaya, Jessica Saddington, Kazi Sifatul Islam

Main category: cs.CL

TL;DR: Mental health support chatbot using RAG framework and fine-tuned LLM achieves high BERT Score (0.898) with focus on safety, accuracy, and responsible AI development.

Motivation: Address the challenges and opportunities of large language models in mental healthcare by developing a safe, meaningful chatbot to augment professional healthcare services.

Method: Employed retrieval-augmented generation (RAG) framework, integrated prompt engineering, and fine-tuned a pre-trained model on novel datasets with rigorous evaluation covering accuracy, empathy, trustworthiness, privacy, and bias.

Result: Developed Mentalic Net Conversational AI system achieving a BERT Score of 0.898, with other evaluation metrics falling within satisfactory ranges.
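
For readers unfamiliar with the metric, BERTScore compares candidate and reference texts through contextual embeddings rather than exact n-gram overlap. Below is a minimal sketch using the open-source bert-score package; the example sentences are illustrative, and the paper's model choice and settings are not given in this summary:

```python
# Minimal sketch, assuming the open-source bert-score package; the sentences
# are illustrative stand-ins for chatbot replies and reference responses.
from bert_score import score

candidates = ["I'm sorry you're feeling this way; a counselor may be able to help."]
references = ["I'm sorry to hear that. Speaking with a professional could help."]

# P, R, F1 are torch tensors with one entry per candidate/reference pair
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```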

Conclusion: Advocates for human-in-the-loop approach and long-term responsible strategy in developing transformative mental health technologies, recognizing both their life-changing potential and associated risks.

Abstract: The emergence of large language models (LLMs) has unlocked boundless possibilities, along with significant challenges. In response, we developed a mental health support chatbot designed to augment professional healthcare, with a strong emphasis on safe and meaningful application. Our approach involved rigorous evaluation, covering accuracy, empathy, trustworthiness, privacy, and bias. We employed a retrieval-augmented generation (RAG) framework, integrated prompt engineering, and fine-tuned a pre-trained model on novel datasets. The resulting system, Mentalic Net Conversational AI, achieved a BERT Score of 0.898, with other evaluation metrics falling within satisfactory ranges. We advocate for a human-in-the-loop approach and a long-term, responsible strategy in developing such transformative technologies, recognizing both their potential to change lives and the risks they may pose if not carefully managed.

[3] Do MLLMs Really Understand the Charts?

Xiao Zhang, Dongyuan Li, Liuyu Xiang, Yao Zhang, Cheng Zhong, Zhaofeng He

Main category: cs.CL

TL;DR: MLLMs show poor performance on non-annotated charts due to reliance on recognition rather than reasoning. ChartReasoner is proposed to mimic human visual reasoning and achieves superior performance even compared to GPT-4o and Gemini-2.5-Flash.

Motivation: MLLMs exhibit alarming hallucinations and performance degradation when handling non-annotated charts, raising questions about whether they truly understand charts like humans do through visual reasoning.

Method: Propose ChartReasoner that mimics human behavior by grounding estimation in chart understanding, and establish a comprehensive Chart Reasoning Benchmark CRBench to evaluate visual reasoning abilities.

Result: ChartReasoner-3B/7B achieves superior performance in chart reasoning compared to GPT-4o and Gemini-2.5-Flash, and demonstrates strong visual reasoning abilities in general chart comprehension with significant performance gains.

Conclusion: ChartReasoner enables MLLMs to rationally understand charts through visual reasoning rather than just recognition, addressing the fundamental limitation in current MLLM chart understanding capabilities.

Abstract: Although Multimodal Large Language Models (MLLMs) have demonstrated increasingly impressive performance in chart understanding, most of them exhibit alarming hallucinations and significant performance degradation when handling non-annotated charts. Therefore, a question arises: Do MLLMs really understand the charts? Since a human is capable of understanding charts and estimating the values by visual reasoning, we first carefully establish a comprehensive Chart Reasoning Benchmark CRBench to rigorously evaluate the visual reasoning abilities of MLLMs on non-annotated charts. We argue that MLLMs are primarily relying on recognition rather than reasoning to interpret the charts. To steer MLLMs to reasonable chart understanding, we propose ChartReasoner that mimics human behavior by grounding their estimation in chart understanding. Extensive results on the proposed CRBench show that ChartReasoner-3B/7B achieves superior performance in chart reasoning, even compared to GPT-4o and Gemini-2.5-Flash. More importantly, ChartReasoner also demonstrates the visual reasoning abilities in general chart comprehension on public benchmarks, leading to significant performance gains and enabling MLLMs to rationally understand the charts. The code and dataset will be publicly available upon publication.

Daniel B. Hier, Steven Keith Platt, Tayo Obafemi-Ajayi

Main category: cs.CL

TL;DR: LLMs struggle with ontology term-to-ID linking; identifier exposure is the strongest predictor of success.

Motivation: Large language models perform well on biomedical NLP tasks but often fail to correctly link ontology terms to their identifiers, requiring investigation into why these failures occur.

Method: Analyzed predictions across Human Phenotype Ontology and Gene Ontology using GPT-4o and LLaMa 3.1 405B models, evaluating nine features related to term familiarity, identifier usage, morphology, and ontology structure through univariate and multivariate analyses

Result: Exposure to ontology identifiers is the strongest predictor of linking success.

Conclusion: The models’ ability to correctly link ontology terms to identifiers depends heavily on their prior exposure to those specific identifiers during training

Abstract: Large language models often perform well on biomedical NLP tasks but may fail to link ontology terms to their correct identifiers. We investigate why these failures occur by analyzing predictions across two major ontologies, Human Phenotype Ontology and Gene Ontology, and two high-performing models, GPT-4o and LLaMa 3.1 405B. We evaluate nine candidate features related to term familiarity, identifier usage, morphology, and ontology structure. Univariate and multivariate analyses show that exposure to ontology identifiers is the strongest predictor of linking success.

[5] RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language

Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam

Main category: cs.CL

TL;DR: RAVEN is a multimodal QA architecture with QuART module that scores token relevance across modalities to handle modality disagreements, achieving significant accuracy gains over SOTA models.

Motivation: Address modality disagreements in multimodal QA, where off-camera speech, background noise, or motion outside the field of view mislead fusion models that weight all streams equally.

Method: Three-stage training pipeline: unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning. Uses QuART module to assign relevance scores to tokens across modalities before fusion.
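
The gating idea at QuART's core can be sketched in a few lines of PyTorch. The sketch below illustrates query-conditioned token relevance scoring under assumed tensor shapes; it is not the authors' implementation:

```python
# Illustrative sketch (not the authors' code): score each modality token's
# relevance to the query and gate it before fusion, in the spirit of QuART.
import torch
import torch.nn as nn

class QueryTokenGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)

    def forward(self, query: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # query: (B, D); tokens: (B, T, D) from one stream (audio/video/sensor)
        q = self.query_proj(query)                       # (B, D)
        scores = torch.bmm(tokens, q.unsqueeze(-1))      # (B, T, 1) dot products
        gate = torch.sigmoid(scores / tokens.size(-1) ** 0.5)
        return tokens * gate                             # amplify or suppress

gate = QueryTokenGate(dim=256)
gated = gate(torch.randn(2, 256), torch.randn(2, 50, 256))
```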

Result: Achieves up to 14.5% and 8.0% accuracy gains over SOTA multimodal LLMs. Sensor data provides additional 16.4% boost. Robust under modality corruption, outperforming baselines by 50.23%.

Conclusion: RAVEN effectively handles modality disagreements through relevance scoring and staged training, demonstrating superior performance and robustness across multiple multimodal QA benchmarks.

Abstract: Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning – each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio–Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multi-modal QA benchmarks – including egocentric and exocentric tasks – show that RAVEN achieves up to 14.5% and 8.0% gains in accuracy compared to state-of-the-art multi-modal large language models, respectively. Incorporating sensor data provides an additional 16.4% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23%. Our code and dataset are available at https://github.com/BASHLab/RAVEN.

[6] Uncertainty-Aware Collaborative System of Large and Small Models for Multimodal Sentiment Analysis

Shiqin Han, Manning Gao, Menghua Jiang, Yuncheng Jiang, Haifeng Hu, Sijie Mai

Main category: cs.CL

TL;DR: Proposes U-ACS, an uncertainty-aware collaborative system that combines a powerful MLLM with a lightweight model for multimodal sentiment analysis, using uncertainty to selectively escalate difficult samples to the MLLM for computational efficiency.

Motivation: Address the performance-efficiency trade-off in multimodal learning where large MLLMs have high computational demands while smaller models sacrifice performance.

Method: Uncertainty-driven cascade mechanism where lightweight model filters easy samples, and only high-uncertainty samples are escalated to MLLM. Includes weighted averaging and prompt-based cross-verification for ambiguous predictions.
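
A minimal sketch of the cascade logic, assuming both models are callables that return a label and a probability distribution (the interfaces and the entropy threshold are hypothetical):

```python
# Hedged sketch of an uncertainty-driven cascade: the small model handles
# confident cases; high-entropy samples are escalated to the MLLM.
import math

def entropy(probs):
    return -sum(p * math.log(p + 1e-12) for p in probs)

def cascade_predict(sample, small_model, mllm, threshold=0.5):
    label, probs = small_model(sample)
    if entropy(probs) < threshold:   # confident: keep the cheap prediction
        return label
    return mllm(sample)[0]           # uncertain: pay for the large model
```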

Result: Achieves state-of-the-art performance while requiring only a fraction of computational resources compared to standalone MLLM.

Conclusion: The proposed U-ACS system effectively balances performance and efficiency through intelligent sample-difficulty-aware resource allocation.

Abstract: The advent of Multimodal Large Language Models (MLLMs) has significantly advanced the state-of-the-art in multimodal machine learning, yet their substantial computational demands present a critical barrier to real-world deployment. Conversely, smaller, specialized models offer high efficiency but often at the cost of performance. To reconcile this performance-efficiency trade-off, we propose a novel Uncertainty-Aware Collaborative System (U-ACS) that synergistically orchestrates a powerful MLLM (e.g., HumanOmni) and a lightweight baseline model for multimodal sentiment analysis. The core of our system is an uncertainty-driven cascade mechanism, where the efficient small model first acts as a rapid filter for all input samples. Only those samples yielding high predictive uncertainty, thereby indicating greater difficulty, are selectively escalated to the MLLM for more sophisticated analysis. Furthermore, our system introduces advanced strategies to handle ambiguous or conflicting predictions, including weighted averaging for predictions of similar polarity and a prompt-based cross-verification to resolve conflicting predictions when both models exhibit high uncertainty. This sample-difficulty-aware approach allows for a dynamic allocation of computational resources, drastically reducing inference costs while retaining the high accuracy of MLLM. Extensive experiments on benchmark datasets demonstrate that our proposed method achieves state-of-the-art performance, while requiring only a fraction of the computational resources compared to using a standalone MLLM.

[7] Serialized Output Prompting for Large Language Model-based Multi-Talker Speech Recognition

Hao Shi, Yusuke Fujita, Tomoya Mizumoto, Lianbo Liu, Atsushi Kojima, Yui Sudo

Main category: cs.CL

TL;DR: Proposes serialized output prompts (SOP) to enhance LLM-based multi-talker ASR systems by explicitly guiding LLMs with structured prompts, improving performance in complex scenarios.

Motivation: Existing LLM-based multi-talker ASR systems either omit prompts or use simple task-definition prompts, with no prior work exploring prompt design to enhance performance in complex multi-talker scenarios.

Method: Extracts serialized output prompts (SOP) using a Separator and serialized CTC layers after speech encoder to separate mixed speech content. Uses three-stage training: SOT fine-tuning, serialized speech information extraction, and SOP-based adaptation.
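
The greedy decoding applied to the serialized CTC outputs is standard CTC practice: take the argmax token per frame, collapse consecutive repeats, then drop blanks. A toy sketch with illustrative token ids:

```python
# Toy sketch of greedy CTC decoding (blank id and token ids are illustrative).
def ctc_greedy_decode(frame_ids, blank_id=0):
    out, prev = [], None
    for t in frame_ids:              # frame_ids: argmax token id per frame
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    return out

print(ctc_greedy_decode([0, 7, 7, 0, 7, 3, 3, 0]))  # -> [7, 7, 3]
```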

Result: Experimental results on LibriMix dataset show SOP approach significantly improved performance under both two- and three-talker conditions, overcoming limitations of standard LLM-based SOT models in complex scenarios.

Conclusion: SOP-based prompting effectively guides LLMs for multi-talker ASR, demonstrating substantial performance gains in challenging multi-speaker environments where previous approaches failed to fully leverage LLM capabilities.

Abstract: Prompts are crucial for task definition and for improving the performance of large language models (LLM)-based systems. However, existing LLM-based multi-talker (MT) automatic speech recognition (ASR) systems either omit prompts or rely on simple task-definition prompts, with no prior work exploring the design of prompts to enhance performance. In this paper, we propose extracting serialized output prompts (SOP) and explicitly guiding the LLM using structured prompts to improve system performance (SOP-MT-ASR). A Separator and serialized Connectionist Temporal Classification (CTC) layers are inserted after the speech encoder to separate and extract MT content from the mixed speech encoding in a first-speaking-first-out manner. Subsequently, the SOP, which serves as a prompt for LLMs, is obtained by decoding the serialized CTC outputs using greedy search. To train the model effectively, we design a three-stage training strategy, consisting of serialized output training (SOT) fine-tuning, serialized speech information extraction, and SOP-based adaptation. Experimental results on the LibriMix dataset show that, although the LLM-based SOT model performs well in the two-talker scenario, it fails to fully leverage LLMs under more complex conditions, such as the three-talker scenario. The proposed SOP approach significantly improved performance under both two- and three-talker conditions.

[8] CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection

Yihan Chen, Jiawei Chen, Guozhao Mo, Xuanang Chen, Ben He, Xianpei Han, Le Sun

Main category: cs.CL

TL;DR: Proposes CoCoNUTS benchmark and CoCoDet detector for content-based AI-generated peer review detection, moving beyond style-based approaches to address fairness issues in scholarly evaluation.

Motivation: Address risks to fairness and reliability in peer review from LLM integration, as existing style-based detectors fail to distinguish between permissible language refinement and deceptive AI-generated content.

Method: Develop CoCoNUTS benchmark with fine-grained dataset covering 6 human-AI collaboration modes, and create CoCoDet detector using multi-task learning framework for content-based detection.

Result: Provides a practical foundation for evaluating LLM use in peer review and enables more accurate, robust detection of AI involvement in review content.

Conclusion: Shifts detection paradigm from style-based to content-based approaches, contributing to more precise, equitable, and reliable detection methods for scholarly applications.

Abstract: The growing integration of large language models (LLMs) into the peer review process presents potential risks to the fairness and reliability of scholarly evaluation. While LLMs offer valuable assistance for reviewers with language refinement, there is growing concern over their use to generate substantive review content. Existing general AI-generated text detectors are vulnerable to paraphrasing attacks and struggle to distinguish between surface language refinement and substantial content generation, suggesting that they primarily rely on stylistic cues. When applied to peer review, this limitation can result in unfairly suspecting reviews with permissible AI-assisted language enhancement, while failing to catch deceptively humanized AI-generated reviews. To address this, we propose a paradigm shift from style-based to content-based detection. Specifically, we introduce CoCoNUTS, a content-oriented benchmark built upon a fine-grained dataset of AI-generated peer reviews, covering six distinct modes of human-AI collaboration. Furthermore, we develop CoCoDet, an AI review detector via a multi-task learning framework, designed to achieve more accurate and robust detection of AI involvement in review content. Our work offers a practical foundation for evaluating the use of LLMs in peer review, and contributes to the development of more precise, equitable, and reliable detection methods for real-world scholarly applications. Our code and data will be publicly available at https://github.com/Y1hanChen/COCONUTS.

[9] From Post To Personality: Harnessing LLMs for MBTI Prediction in Social Media

Tian Ma, Kaiyu Feng, Yu Rong, Kangfei Zhao

Main category: cs.CL

TL;DR: PtoP is a novel LLM framework for MBTI prediction from social media posts that addresses hallucination and class imbalance through retrieval-augmented generation and synthetic oversampling.

Motivation: Personality prediction from social media has important applications in psychology and sociology, but existing approaches face challenges with LLM hallucination and imbalanced MBTI type distribution in the population.

Method: Proposes PostToPersonality (PtoP) framework that uses Retrieval Augmented Generation with in-context learning to reduce hallucination, and fine-tunes pretrained LLMs with synthetic minority oversampling to handle class imbalance.
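
The class-balancing step corresponds to standard synthetic minority oversampling. A sketch with imbalanced-learn's SMOTE on placeholder embeddings; the paper's exact feature space and oversampling variant may differ:

```python
# Sketch: balance one MBTI dimension with SMOTE from imbalanced-learn.
import numpy as np
from imblearn.over_sampling import SMOTE

X = np.random.rand(200, 768)            # stand-in post embeddings
y = np.array([0] * 180 + [1] * 20)      # imbalanced binary MBTI dimension
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y_bal))               # classes are now equal-sized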

Result: Experiments on real-world social media data show PtoP achieves state-of-the-art performance compared to 10 ML and DL baseline methods.

Conclusion: The proposed PtoP framework effectively addresses key challenges in MBTI prediction from social media using LLMs, demonstrating superior performance through innovative techniques for hallucination mitigation and class balancing.

Abstract: Personality prediction from social media posts is a critical task that implies diverse applications in psychology and sociology. The Myers Briggs Type Indicator (MBTI), a popular personality inventory, has been traditionally predicted by machine learning (ML) and deep learning (DL) techniques. Recently, the success of Large Language Models (LLMs) has revealed their huge potential in understanding and inferring personality traits from social media content. However, directly exploiting LLMs for MBTI prediction faces two key challenges: the hallucination problem inherent in LLMs and the naturally imbalanced distribution of MBTI types in the population. In this paper, we propose PostToPersonality (PtoP), a novel LLM based framework for MBTI prediction from social media posts of individuals. Specifically, PtoP leverages Retrieval Augmented Generation with in context learning to mitigate hallucination in LLMs. Furthermore, we fine tune a pretrained LLM to improve model specification in MBTI understanding with synthetic minority oversampling, which balances the class imbalance by generating synthetic samples. Experiments conducted on a real world social media dataset demonstrate that PtoP achieves state of the art performance compared with 10 ML and DL baselines.

[10] Benchmarking GPT-5 for biomedical natural language processing

Yu Hou, Zaifu Zhan, Rui Zhang

Main category: cs.CL

TL;DR: GPT-5 achieves state-of-the-art performance in biomedical NLP, particularly excelling in question answering tasks like MedQA (94.1% accuracy) and reaching parity with supervised systems on PubMedQA, while still lagging in summarization and some extraction tasks compared to domain-specific models.

Motivation: The rapid growth of biomedical literature requires scalable NLP solutions, and while GPT-4 narrowed the performance gap with task-specific systems, its performance across different biomedical domains remained uneven, necessitating evaluation of newer models like GPT-5 and GPT-4o.

Method: Updated a standardized BioNLP benchmark to evaluate GPT-5 and GPT-4o under zero-, one-, and five-shot prompting across 12 datasets spanning six task families using fixed prompt templates, identical decoding parameters, and batch inference, with comparison to GPT-4, GPT-3.5, and LLaMA-2-13B.

Result: GPT-5 achieved the strongest overall performance with macro-average scores of 0.557 under five-shot prompting vs 0.506 for GPT-4 and 0.508 for GPT-4o. It reached 94.1% accuracy on MedQA (exceeding previous SOTA by 50+ points) and parity with supervised systems on PubMedQA. Major gains in chemical NER (0.886 F1) and ChemProt relation extraction (0.616 F1), though summarization and disease NER still lagged behind domain-specific baselines.

Conclusion: GPT-5 establishes itself as a general-purpose model with deployment-ready performance for reasoning-oriented biomedical QA, while precision-critical extraction and evidence-dense summarization continue to favor fine-tuned or hybrid approaches, providing guidance for BioNLP system design as frontier models advance.

Abstract: The rapid expansion of biomedical literature has heightened the need for scalable natural language processing (NLP) solutions. While GPT-4 substantially narrowed the gap with task-specific systems, especially in question answering, its performance across other domains remained uneven. We updated a standardized BioNLP benchmark to evaluate GPT-5 and GPT-4o under zero-, one-, and five-shot prompting across 12 datasets spanning six task families: named entity recognition, relation extraction, multi-label document classification, question answering, text summarization, and text simplification. Using fixed prompt templates, identical decoding parameters, and batch inference, we report primary metrics per dataset and include prior results for GPT-4, GPT-3.5, and LLaMA-2-13B for comparison. GPT-5 achieved the strongest overall benchmark performance, with macro-average scores rising to 0.557 under five-shot prompting versus 0.506 for GPT-4 and 0.508 for GPT-4o. On MedQA, GPT-5 reached 94.1% accuracy, exceeding the previous supervised state of the art by over fifty points, and attained parity with supervised systems on PubMedQA (0.734). In extraction tasks, GPT-5 delivered major gains in chemical NER (0.886 F1) and ChemProt relation extraction (0.616 F1), outperforming GPT-4 and GPT-4o, though summarization and disease NER still lagged behind domain-specific baselines. These results establish GPT-5 as a general-purpose model now offering deployment-ready performance for reasoning-oriented biomedical QA, while precision-critical extraction and evidence-dense summarization continue to favor fine-tuned or hybrid approaches. The benchmark delineates where simple prompting suffices and where retrieval-augmented or planning-based scaffolds are likely required, providing actionable guidance for BioNLP system design as frontier models advance.

[11] Can Multiple Responses from an LLM Reveal the Sources of Its Uncertainty?

Yang Nan, Pengfei He, Ravi Tandon, Han Xu

Main category: cs.CL

TL;DR: LLMs produce unreliable outputs but uncertainty source diagnosis is understudied. This paper shows disagreement patterns in multiple LLM responses reveal uncertainty causes like input ambiguity or knowledge gaps, using an auxiliary LLM for diagnosis.

Motivation: Large language models often produce unreliable outputs, but existing research focuses mainly on quantifying uncertainty rather than diagnosing its underlying sources, which is crucial for improving LLM reliability in real-world applications.

Method: Collect multiple responses from a target LLM and employ an auxiliary LLM to analyze patterns of disagreement among these responses to diagnose uncertainty sources (input ambiguity, knowledge gaps, or both) and identify specific missing facts.
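
A schematic of the diagnosis loop; target_llm and auxiliary_llm are assumed callables, and the prompt wording is ours rather than the paper's:

```python
# Hypothetical sketch: sample several responses, then ask an auxiliary LLM
# why they disagree (ambiguity, missing knowledge, or both).
def diagnose_uncertainty(question, target_llm, auxiliary_llm, n=5):
    responses = [target_llm(question, temperature=1.0) for _ in range(n)]
    if len(set(responses)) == 1:
        return "consistent responses: low uncertainty"
    prompt = (
        "A model answered the same question several times.\n"
        f"Question: {question}\nAnswers: {responses}\n"
        "Do the answers disagree because the question is ambiguous, because "
        "the model lacks knowledge, or both? If knowledge is missing, name "
        "the missing facts."
    )
    return auxiliary_llm(prompt)
```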

Result: The framework was validated on AmbigQA, OpenBookQA, and MMLU-Pro datasets, demonstrating its generality in diagnosing distinct uncertainty sources across different domains and question types.

Conclusion: Diagnosing uncertainty sources through response disagreement patterns enables targeted manual interventions that can significantly improve LLM performance and reliability in practical applications.

Abstract: Large language models (LLMs) have delivered significant breakthroughs across diverse domains but can still produce unreliable or misleading outputs, posing critical challenges for real-world applications. While many recent studies focus on quantifying model uncertainty, relatively little work has been devoted to diagnosing the source of uncertainty. In this study, we show that, when an LLM is uncertain, the patterns of disagreement among its multiple generated responses contain rich clues about the underlying cause of uncertainty. To illustrate this point, we collect multiple responses from a target LLM and employ an auxiliary LLM to analyze their patterns of disagreement. The auxiliary model is tasked to reason about the likely source of uncertainty, such as whether it stems from ambiguity in the input question, a lack of relevant knowledge, or both. In cases involving knowledge gaps, the auxiliary model also identifies the specific missing facts or concepts contributing to the uncertainty. In our experiment, we validate our framework on AmbigQA, OpenBookQA, and MMLU-Pro, confirming its generality in diagnosing distinct uncertainty sources. Such diagnosis shows the potential for relevant manual interventions that improve LLM performance and reliability.

[12] Emotionally-Aware Agents for Dispute Resolution

Sushrita Rakshit, James Hale, Kushal Chawla, Jeanne M. Brett, Jonathan Gratch

Main category: cs.CL

TL;DR: Automatic text emotion recognition using large-language models provides better insight into how emotional expressions influence conflict escalation and resolution in dispute dialogues than previous methods.

Motivation: To understand how emotional expressions shape outcomes in conflict situations, particularly in buyer-seller disputes where emotions are stronger and social processes differ from negotiations.

Method: Used a large corpus of buyer-seller dispute dialogues and employed large-language models for emotion intensity annotation, comparing their performance with previous methods and human annotators.

Result: Large-language models showed considerably greater explanatory power than previous emotion annotation methods and better matched human annotator decisions. Findings supported existing theoretical models of emotional influence in conflict.

Conclusion: Agent-based systems could be useful for managing disputes by recognizing and potentially mitigating emotional escalation, as automatic emotion recognition provides valuable insights into conflict dynamics.

Abstract: In conflict, people use emotional expressions to shape their counterparts' thoughts, feelings, and actions. This paper explores whether automatic text emotion recognition offers insight into this influence in the context of dispute resolution. Prior work has shown the promise of such methods in negotiations; however, disputes evoke stronger emotions and different social processes. We use a large corpus of buyer-seller dispute dialogues to investigate how emotional expressions shape subjective and objective outcomes. We further demonstrate that large-language models yield considerably greater explanatory power than previous methods for emotion intensity annotation and better match the decisions of human annotators. Findings support existing theoretical models for how emotional expressions contribute to conflict escalation and resolution and suggest that agent-based systems could be useful in managing disputes by recognizing and potentially mitigating emotional escalation.

[13] Just-in-time and distributed task representations in language models

Yuxuan Li, Declan Campbell, Stephanie C. Y. Chan, Andrew Kyle Lampinen

Main category: cs.CL

TL;DR: Language models form transferable task representations in non-monotonic, sporadic patterns during in-context learning, with strong temporal and semantic locality rather than continuous development.

Motivation: To understand when and how language models form representations for new tasks during in-context learning, particularly focusing on transferable representations that can restore task context without the full prompt.

Method: Investigating the evolution of transferable task representations throughout context processing, analyzing how these representations change and when they become effective for task performance.

Result: Task representations evolve in non-monotonic, sporadic ways with strong locality along sequence dimensions. Models condense multiple evidence into transferable representations that align with performance improvements, but these representations only emerge at specific tokens despite task identity being decodable throughout context.

Conclusion: Language models employ a just-in-time computational process with two-fold locality (temporal and semantic) for task adaptation, using local transferable representations for minimal task scopes while relying on distributed representations for composite tasks.

Abstract: Many of language models’ impressive capabilities originate from their in-context learning: based on instructions or examples, they can infer and perform new tasks without weight updates. In this work, we investigate when representations for new tasks are formed in language models, and how these representations change over the course of context. We focus on “transferrable” task representations – vector representations that can restore task context in another instance of the model, even without the full prompt. We show that these representations evolve in non-monotonic and sporadic ways, and are distinct from a more inert representation of high-level task categories that persists throughout the context. Specifically, models often condense multiple pieces of evidence into these transferrable task representations, which align well with the performance improvement based on more examples in the context. However, this accrual process exhibits strong locality along the sequence dimension, coming online only at certain tokens – despite task identity being reliably decodable throughout the context. Moreover, these local but transferrable task representations tend to capture minimal “task scopes”, such as a semantically-independent subtask, and models rely on more temporally-distributed representations to support longer and composite tasks. This two-fold locality (temporal and semantic) underscores a kind of just-in-time computational process underlying language models’ ability to adapt to new evidence and learn new tasks on the fly.

[14] Enhancing LLM Efficiency: Targeted Pruning for Prefill-Decode Disaggregation in Inference

Hao Zhang, Mengsi Lyu, Yulong Ao, Yonghua Lin

Main category: cs.CL

TL;DR: A novel pruning method for LLMs that addresses prefill-decode disaggregation, enabling precise block pruning and token-aware KV cache optimization to reduce computational and memory costs while maintaining performance.

Motivation: LLMs face high computational and memory costs in deployment. Existing pruning methods ignore the practical characteristics of prefill-decode disaggregation, limiting their effectiveness in real-world scenarios.

Method: Proposes iterative block removal for prefill and decode stages separately using pruning and distillation sets. Introduces token-aware cache pruning that retains all KV Cache in prefill but selectively reuses entries for first/last token sequences in decode layers.
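
One way to picture the token-aware cache step: in selected decode layers, keep only the KV entries for the first and last token spans. A rough sketch under an assumed (batch, heads, seq_len, head_dim) layout; the window size k and the selection rule are illustrative, not the released code:

```python
# Rough sketch (our reading of the idea, not the released code): retain KV
# entries only for the first and last k positions in selected decode layers.
import torch

def prune_kv(keys, values, k=4):
    # keys/values: (batch, heads, seq_len, head_dim) produced during prefill
    seq_len = keys.size(2)
    if seq_len <= 2 * k:
        return keys, values
    idx = torch.cat([torch.arange(k), torch.arange(seq_len - k, seq_len)])
    return keys[:, :, idx, :], values[:, :, idx, :]
```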

Result: Achieves 20.56% inference speedup and 4.95x reduction in data transmission bandwidth consumption. Consistently strong performance in both PD disaggregation and unified settings.

Conclusion: The method effectively addresses PD disaggregation challenges, providing significant efficiency improvements without compromising model performance, making LLM deployment more practical and cost-effective.

Abstract: Large Language Models (LLMs) demonstrate exceptional capabilities across various tasks, but their deployment is constrained by high computational and memory costs. Model pruning provides an effective means to alleviate these demands. However, existing methods often ignore the characteristics of prefill-decode (PD) disaggregation in practice. In this paper, we propose a novel pruning method for PD disaggregation inference, enabling more precise and efficient block and KV Cache pruning. Our approach constructs pruning and distillation sets to perform iterative block removal independently for the prefill and decode stages, obtaining better pruning solutions. Moreover, we introduce a token-aware cache pruning mechanism that retains all KV Cache in the prefill stage but selectively reuses entries for the first and last token sequences in selected layers during decode, reducing communication costs with minimal overhead. Extensive experiments demonstrate that our approach consistently achieves strong performance in both PD disaggregation and PD unified settings without disaggregation. Under the default settings, our method achieves a 20.56% inference speedup and a 4.95 times reduction in data transmission bandwidth consumption.

[15] Evaluating Large Language Models for Financial Reasoning: A CFA-Based Benchmark Study

Xuan Yao, Qianteng Wang, Xinbo Liu, Ke-Wei Huang

Main category: cs.CL

TL;DR: Comprehensive evaluation of state-of-the-art LLMs on CFA exam questions shows reasoning-specialized models perform best in zero-shot settings, while RAG pipeline significantly improves accuracy for complex financial scenarios.

Motivation: The rapid advancement of large language models presents opportunities for financial applications, but systematic evaluation in specialized financial contexts remains limited. There's a need to understand how different LLM architectures perform on rigorous professional financial certification tasks.

Method: Used 1,560 multiple-choice questions from official CFA mock exams across Levels I-III. Compared models with different design priorities: multi-modal/computational, reasoning-specialized, and lightweight efficiency-optimized. Evaluated under zero-shot prompting and through a novel Retrieval-Augmented Generation pipeline that integrates official CFA curriculum content with hierarchical knowledge organization and structured query generation.
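
Schematically, the RAG pipeline retrieves curriculum passages and conditions the model on them. The sketch below uses placeholder embed/index/llm interfaces and a flat top-k search, standing in for the paper's hierarchical organization and structured query generation:

```python
# Schematic sketch with placeholder interfaces (embed, index, llm are
# assumptions, not the paper's components).
def answer_with_rag(question, embed, index, llm, k=3):
    passages = index.search(embed(question), k=k)   # top-k curriculum chunks
    context = "\n\n".join(passages)
    prompt = (f"Use the CFA curriculum excerpts below to answer.\n\n"
              f"{context}\n\nQuestion: {question}\nAnswer with A, B, or C.")
    return llm(prompt)
```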

Result: Reasoning-oriented models consistently outperformed others in zero-shot settings. The RAG pipeline provided substantial improvements, particularly for complex scenarios. Error analysis identified knowledge gaps as the primary failure mode, with minimal impact from text readability.

Conclusion: The findings provide actionable insights for LLM deployment in finance, offering practitioners evidence-based guidance for model selection and cost-performance optimization. The RAG approach demonstrates significant value for enhancing domain-specific financial reasoning accuracy.

Abstract: The rapid advancement of large language models presents significant opportunities for financial applications, yet systematic evaluation in specialized financial contexts remains limited. This study presents the first comprehensive evaluation of state-of-the-art LLMs using 1,560 multiple-choice questions from official mock exams across Levels I-III of the CFA, among the most rigorous professional certifications globally, which mirror real-world financial analysis complexity. We compare models distinguished by core design priorities: multi-modal and computationally powerful, reasoning-specialized and highly accurate, and lightweight efficiency-optimized. We assess models under zero-shot prompting and through a novel Retrieval-Augmented Generation pipeline that integrates official CFA curriculum content. The RAG system achieves precise domain-specific knowledge retrieval through hierarchical knowledge organization and structured query generation, significantly enhancing reasoning accuracy in professional financial certification evaluation. Results reveal that reasoning-oriented models consistently outperform others in zero-shot settings, while the RAG pipeline provides substantial improvements particularly for complex scenarios. Comprehensive error analysis identifies knowledge gaps as the primary failure mode, with minimal impact from text readability. These findings provide actionable insights for LLM deployment in finance, offering practitioners evidence-based guidance for model selection and cost-performance optimization.

[16] Multi-Modal Vision vs. Text-Based Parsing: Benchmarking LLM Strategies for Invoice Processing

David Berghaus, Armin Berger, Lars Hillebrand, Kostadin Cvejoski, Rafet Sifa

Main category: cs.CL

TL;DR: Benchmark of 8 multi-modal LLMs from GPT-5, Gemini 2.5, and Gemma 3 families on invoice document datasets using zero-shot prompting, comparing direct image processing vs structured markdown conversion approaches.

Motivation: To evaluate and compare the performance of different multi-modal large language models on document processing tasks, specifically invoice analysis, to provide guidance for automated document systems.

Method: Zero-shot prompting on three diverse invoice datasets, comparing two processing strategies: direct multi-modal image processing and structured parsing via markdown conversion.

Result: Native image processing generally outperforms structured approaches, with performance varying across model types and document characteristics.

Conclusion: The benchmark provides insights for selecting appropriate models and processing strategies, with direct image processing showing better overall performance for automated document systems.

Abstract: This paper benchmarks eight multi-modal large language models from three families (GPT-5, Gemini 2.5, and open-source Gemma 3) on three diverse openly available invoice document datasets using zero-shot prompting. We compare two processing strategies: direct image processing using multi-modal capabilities and a structured parsing approach converting documents to markdown first. Results show native image processing generally outperforms structured approaches, with performance varying across model types and document characteristics. This benchmark provides insights for selecting appropriate models and processing strategies for automated document systems. Our code is available online.

[17] COCORELI: Cooperative, Compositional Reconstitution & Execution of Language Instructions

Swarnadeep Bhar, Omar Naim, Eleni Metheniti, Bastien Navarri, Loïc Cabannes, Morteza Ezzabady, Nicholas Asher

Main category: cs.CL

TL;DR: COCORELI is a hybrid agent framework that combines medium-sized LLMs with abstraction mechanisms to improve complex instruction following, reduce hallucinations, and enhance spatial reasoning capabilities.

Motivation: Large language models struggle with complex instructions, hallucination issues, and spatial reasoning tasks. There's a need for systems that can better parse instructions, maintain accurate representations, and handle collaborative construction scenarios.

Method: Integrates medium-sized LLM agents with novel abstraction mechanisms and a discourse module to parse instructions and learn dynamic, high-level environment representations in-context.

Result: Outperforms single-LLM CoT and agentic LLM systems using larger LLMs. Successfully avoids hallucinations, identifies missing information, asks for clarifications, and updates learned objects. Also demonstrates abstraction abilities beyond environment tasks in ToolBench API completion.

Conclusion: COCORELI provides an effective hybrid approach that enables medium-sized LLMs to surpass larger models in complex collaborative tasks through better instruction parsing, abstraction capabilities, and reduced hallucination.

Abstract: We present COCORELI, a hybrid agent framework designed to tackle the limitations of large language models (LLMs) in tasks requiring: following complex instructions, minimizing hallucination, and spatial reasoning. COCORELI integrates medium-sized LLM agents with novel abstraction mechanisms and a discourse module to parse instructions to in-context learn dynamic, high-level representations of the environment. Experiments on natural collaborative construction tasks show that COCORELI outperforms single-LLM CoT and agentic LLM systems, all using larger LLMs. It manages to largely avoid hallucinations, identify missing information, ask for clarifications, and update its learned objects. COCORELI’s abstraction abilities extend beyond ENVIRONMENT, as shown in the ToolBench API completion task.

[18] MOSAIC: A Multilingual, Taxonomy-Agnostic, and Computationally Efficient Approach for Radiological Report Classification

Alice Schiavone, Marco Fraccaro, Lea Marie Pehrson, Silvia Ingala, Rasmus Bonnevie, Michael Bachmann Nielsen, Vincent Beliveau, Melanie Ganz, Desmond Elliott

Main category: cs.CL

TL;DR: MOSAIC is a multilingual, taxonomy-agnostic radiology report classification system using compact open-access language models that achieves expert-level performance with minimal computational resources and training data.

Motivation: Existing radiology report classification methods face limitations including linguistic variability challenges, need for large annotated datasets, dependency on closed-source/resource-intensive models, and restriction to English/single-modality datasets.

Method: Built on MedGemma-4B compact open-access language model, supports zero-/few-shot prompting and lightweight fine-tuning, enabling deployment on consumer-grade GPUs (24GB memory). Evaluated across 7 datasets in 4 languages spanning multiple imaging modalities and label taxonomies.

Result: Achieves mean macro F1 score of 88 across five chest X-ray datasets (approaching/exceeding expert-level performance). With data augmentation, only 80 annotated samples needed for 82 weighted F1 score on Danish reports vs 86 with full 1600-sample training set.

Conclusion: MOSAIC offers practical alternative to large/proprietary LLMs in clinical settings, is open-source, and invites community evaluation/extension to new languages, taxonomies, and modalities.

Abstract: Radiology reports contain rich clinical information that can be used to train imaging models without relying on costly manual annotation. However, existing approaches face critical limitations: rule-based methods struggle with linguistic variability, supervised models require large annotated datasets, and recent LLM-based systems depend on closed-source or resource-intensive models that are unsuitable for clinical use. Moreover, current solutions are largely restricted to English and single-modality, single-taxonomy datasets. We introduce MOSAIC, a multilingual, taxonomy-agnostic, and computationally efficient approach for radiological report classification. Built on a compact open-access language model (MedGemma-4B), MOSAIC supports both zero-/few-shot prompting and lightweight fine-tuning, enabling deployment on consumer-grade GPUs. We evaluate MOSAIC across seven datasets in English, Spanish, French, and Danish, spanning multiple imaging modalities and label taxonomies. The model achieves a mean macro F1 score of 88 across five chest X-ray datasets, approaching or exceeding expert-level performance, while requiring only 24 GB of GPU memory. With data augmentation, as few as 80 annotated samples are sufficient to reach a weighted F1 score of 82 on Danish reports, compared to 86 with the full 1600-sample training set. MOSAIC offers a practical alternative to large or proprietary LLMs in clinical settings. Code and models are open-source. We invite the community to evaluate and extend MOSAIC on new languages, taxonomies, and modalities.

[19] RECAP: REwriting Conversations for Intent Understanding in Agentic Planning

Kushan Mitra, Dan Zhang, Hannah Kim, Estevam Hruschka

Main category: cs.CL

TL;DR: RECAP is a new benchmark for evaluating intent rewriting in conversational AI, addressing challenges like ambiguity and intent drift in open-domain dialogues to improve agent planning.

Motivation: Real-world dialogues are often ambiguous and underspecified, making intent detection challenging for LLM-powered conversational assistants. Traditional classification approaches struggle in open-ended settings, leading to poor planning outcomes.

Method: Proposed RECAP benchmark with diverse dialogue challenges and an LLM-based evaluator for planning utility. Developed prompt-based rewriting approach and fine-tuned two DPO-based rewriters.
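
For reference, DPO fine-tunes directly on preference pairs without a separate reward model. Its objective (Rafailov et al., 2023) can be written in a few lines, taking sequence log-probabilities under the policy and a frozen reference model as inputs; this is the generic loss, not RECAP-specific code:

```python
# Standard DPO loss on tensors of sequence log-probabilities.
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # pi_*: log-probs under the policy; ref_*: log-probs under the frozen
    # reference model, for preferred (chosen) and dispreferred rewrites
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()
```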

Result: The prompt-based rewriting approach outperformed baselines, and fine-tuned DPO-based rewriters yielded additional utility gains in intent rewriting performance.

Conclusion: Intent rewriting is a critical and tractable component for improving agent planning in open-domain dialogue systems, with RECAP providing an effective evaluation framework.

Abstract: Understanding user intent is essential for effective planning in conversational assistants, particularly those powered by large language models (LLMs) coordinating multiple agents. However, real-world dialogues are often ambiguous, underspecified, or dynamic, making intent detection a persistent challenge. Traditional classification-based approaches struggle to generalize in open-ended settings, leading to brittle interpretations and poor downstream planning. We propose RECAP (REwriting Conversations for Agent Planning), a new benchmark designed to evaluate and advance intent rewriting, reframing user-agent dialogues into concise representations of user goals. RECAP captures diverse challenges such as ambiguity, intent drift, vagueness, and mixed-goal conversations. Alongside the dataset, we introduce an LLM-based evaluator that assesses planning utility given the rewritten intent. Using RECAP, we develop a prompt-based rewriting approach that outperforms baselines. We further demonstrate that fine-tuning two DPO-based rewriters yields additional utility gains. Our results highlight intent rewriting as a critical and tractable component for improving agent planning in open-domain dialogue systems.

[20] SpeechLLM: Unified Speech and Language Model for Enhanced Multi-Task Understanding in Low Resource Settings

Jaekwon Yoo, Kunal Chandiramani, Divya Tadimeti, Abenezer Girma, Chandra Dhir

Main category: cs.CL

TL;DR: Parameter-efficient adapter that converts speech embeddings to LLM tokens achieves significant performance gains with 7x fewer parameters across ASR, NER, and sentiment analysis tasks.

Motivation: Integrating speech encoders with LLMs requires substantial data and resources, but many use cases face limitations due to insufficient data availability.

Method: Proposes a parameter-efficient adapter to convert speech embeddings into LLM-compatible tokens, uses LLM-based synthetic dataset annotation to reduce labeling costs, and employs techniques like classifier regularizer and LoRA optimization.
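
The LoRA step corresponds to standard PEFT usage. A sketch with a stand-in base model; the paper's actual LLM, rank, and target modules are not specified in this summary:

```python
# Sketch under assumptions: attach LoRA adapters with the PEFT library.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")    # stand-in base LLM
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["c_attn"])          # GPT-2 attention proj
model = get_peft_model(model, config)
model.print_trainable_parameters()   # only a small fraction is trainable
```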

Result: 26% relative WER improvement on LibriSpeech ASR, 6.3% relative F1 increase on NER, 32% relative F1 boost on sentiment analysis, and SLUE score improvements of 6.6% and 9.5%.

Conclusion: The proposed adapter approach enables efficient speech-LLM integration with significantly reduced parameter requirements while achieving substantial performance gains across multiple speech processing tasks.

Abstract: While integrating a speech encoder with an LLM requires substantial data and resources, many use cases face limitations due to insufficient data availability. To address this, we propose a solution with a parameter-efficient adapter that converts speech embeddings into LLM-compatible tokens, focusing on end-to-end automatic speech recognition (ASR), named entity recognition (NER), and sentiment analysis (SA). To reduce labeling costs, we employ an LLM-based synthetic dataset annotation technique. The proposed adapter, using 7x fewer trainable parameters, achieves significant performance gains: a 26% relative Word Error Rate (WER) improvement on the LibriSpeech ASR task, a 6.3% relative F1 score increase on the NER task, and a 32% relative F1 score boost on the SA task. Moreover, using advanced techniques such as adding a classifier regularizer and optimizing the LLM with Low-Rank Adaptation (LoRA) yields notable performance gains, with Spoken Language Understanding Evaluation (SLUE) score improvements of 6.6% and 9.5%.

[21] Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling

Shengyin Sun, Yiming Li, Xing Li, Yingzhao Lian, Weizhe Lin, Hui-Ling Zhen, Zhiyuan Yang, Chen Chen, Xianzhi Yu, Mingxuan Yuan, Chen Ma

Main category: cs.CL

TL;DR: A comprehensive benchmark for evaluating speculative decoding methods to accelerate test-time scaling in LLMs, showing n-gram methods effectively capture repetitive patterns and can be combined with other approaches for better performance.

Motivation: Test-time scaling enhances LLM reasoning but generates redundant computational overhead through repetitive reasoning traces. Speculative decoding could mitigate this inefficiency, but its effectiveness in structured test-time scaling contexts remains unexplored.

Method: Created the first comprehensive benchmark with consistent experimental protocols across test-time scaling paradigms (Best-of-N sampling, multi-round thinking). Evaluated three speculative decoding categories: model-based, training-based, and n-gram-based methods.
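
N-gram drafting is simple enough to sketch: propose as the draft whatever tokens followed the most recent earlier occurrence of the current n-gram, then verify them in a single target-model pass. A prompt-lookup-style toy, not any of the benchmarked implementations:

```python
# Toy n-gram drafter: look up the last n tokens earlier in the context and
# draft the tokens that followed them there.
def ngram_draft(token_ids, n=3, max_draft=5):
    if len(token_ids) < n:
        return []
    key = tuple(token_ids[-n:])
    for start in range(len(token_ids) - n - 1, -1, -1):
        if tuple(token_ids[start:start + n]) == key:
            draft = token_ids[start + n:start + n + max_draft]
            if draft:
                return draft      # verified in one pass of the target model
    return []
```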

Result: N-gram-based methods effectively capture repetitive patterns and show unique potential in accelerating test-time scaling. Integration of n-gram methods with model-based or training-based approaches balances acceleration for both repetitive and diverse reasoning.

Conclusion: The benchmark demonstrates the value of combining different speculative decoding approaches and should spur further research to enable faster, more practical LLM reasoning through better handling of repetitive and diverse reasoning paths.

Abstract: Test-time scaling has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs) by allocating additional computational resources during inference. However, this paradigm is inherently inefficient due to the generation of redundant and repetitive reasoning traces, leading to significant computational overhead. Speculative decoding offers a promising avenue for mitigating this inefficiency, yet its efficacy in the structured, repetition-rich context of test-time scaling remains largely unexplored. To bridge this gap, we introduce the first comprehensive benchmark designed to evaluate speculative decoding methods for accelerating LLM test-time scaling. Our benchmark provides consistent experimental protocols across representative test-time scaling paradigms (e.g., Best-of-N sampling and multi-round thinking), enabling a fair comparison of three major categories of speculative decoding: model-based, training-based, and n-gram-based methods. Extensive experiments reveal that simple n-gram-based methods effectively capture repetitive patterns, demonstrating unique potential in accelerating test-time scaling. This phenomenon demonstrates the value of integrating n-gram-based methods with model-based or training-based approaches to balance acceleration for both repetitive and diverse reasoning in test-time scaling. We hope this benchmark spurs further research on speculative decoding for test-time scaling, enabling faster and more practical reasoning in LLMs through better handling of repetitive and diverse reasoning paths.

[22] ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute

Hao Wen, Yifan Su, Feifei Zhang, Yunxin Liu, Yunhao Liu, Ya-Qin Zhang, Yuanchun Li

Main category: cs.CL

TL;DR: ParaThinker introduces parallel thought generation to overcome sequential reasoning bottlenecks in LLMs, achieving significant accuracy improvements with minimal latency overhead.

Motivation: Current LLMs suffer from "Tunnel Vision" where imperfect initial reasoning steps lock them into suboptimal paths, limiting the benefits of increased computation.

Method: ParaThinker trains LLMs to generate multiple diverse reasoning paths in parallel and synthesize them into a superior final answer through native thought parallelism.
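
At the orchestration level, the idea reduces to sampling several diverse paths and synthesizing them into one answer. In the sketch below, generate and synthesize are assumed interfaces; ParaThinker itself learns this behavior natively within one model rather than through an outer loop:

```python
# Schematic outer-loop version of width-wise scaling (ParaThinker does this
# natively); generate() and synthesize() are assumed interfaces.
from concurrent.futures import ThreadPoolExecutor

def parallel_think(question, generate, synthesize, k=8):
    with ThreadPoolExecutor(max_workers=k) as pool:
        paths = list(pool.map(lambda i: generate(question, seed=i), range(k)))
    return synthesize(question, paths)   # fuse diverse paths into one answer
```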

Result: Achieves 12.3% accuracy improvement for 1.5B models and 7.5% for 7B models with 8 parallel paths, adding only 7.1% latency overhead.

Conclusion: Parallel compute scaling (width) is more effective than sequential scaling (depth), enabling smaller models to surpass larger counterparts and establishing parallel thinking as critical for future LLM scaling.

Abstract: Recent advances in Large Language Models (LLMs) have been driven by test-time compute scaling - a strategy that improves reasoning by generating longer, sequential thought processes. While effective, this approach encounters a significant bottleneck as computation increases, where further computation offers only marginal performance gains. We argue this ceiling is not an inherent limit of the model’s capability but a flaw in the scaling strategy itself, a phenomenon we term “Tunnel Vision”, where a model’s imperfect initial steps lock it into a suboptimal reasoning path. To overcome this, we introduce a new scaling paradigm: native thought parallelism. We present ParaThinker, an end-to-end framework that trains an LLM to generate multiple, diverse reasoning paths in parallel and synthesize them into a superior final answer. By exploring different lines of thoughts simultaneously, ParaThinker effectively sidesteps the Tunnel Vision issue and unlocks the model’s latent reasoning potential. Our approach demonstrates that scaling compute in parallel (width) is a more effective and efficient way to superior reasoning than simply scaling sequentially (depth). On challenging reasoning benchmarks, ParaThinker achieves substantial accuracy improvements over sequential LLMs (12.3% for 1.5B and 7.5% for 7B models on average with 8 parallel paths), while adding only negligible latency overhead (7.1%). This enables smaller models to surpass much larger counterparts and establishes parallel thinking as a critical, efficient dimension for scaling future LLMs.

[23] Training Text-to-Molecule Models with Context-Aware Tokenization

Seojin Kim, Hyeontae Song, Jaehyun Nam, Jinwoo Shin

Main category: cs.CL

TL;DR: CAMT5 introduces substructure-level tokenization and importance-based training to improve text-to-molecule generation by better capturing global structural context, achieving state-of-the-art performance with only 2% of training tokens.

DetailsMotivation: Existing text-to-molecule models use atom-level tokenizations that focus on local connectivity but fail to capture global structural context within molecules, limiting their understanding of molecular semantics.

Method: Proposes Context-Aware Molecular T5 (CAMT5) with substructure-level tokenization (e.g., ring systems) and importance-based training strategy that prioritizes key substructures. Also introduces a simple ensemble strategy to aggregate outputs from multiple models.

Result: CAMT5 outperforms state-of-the-art methods in various text-to-molecule generation tasks while using only 2% of training tokens. The ensemble strategy further boosts generation performance.

Conclusion: Substructure-level tokenization and importance-based training enable better capture of molecular semantics, demonstrating that global structural context is crucial for text-to-molecule generation tasks.

Abstract: Recently, text-to-molecule models have shown great potential across various chemical applications, e.g., drug-discovery. These models adapt language models to molecular data by representing molecules as sequences of atoms. However, they rely on atom-level tokenizations, which primarily focus on modeling local connectivity, thereby limiting the ability of models to capture the global structural context within molecules. To tackle this issue, we propose a novel text-to-molecule model, coined Context-Aware Molecular T5 (CAMT5). Inspired by the significance of the substructure-level contexts in understanding molecule structures, e.g., ring systems, we introduce substructure-level tokenization for text-to-molecule models. Building on our tokenization scheme, we develop an importance-based training strategy that prioritizes key substructures, enabling CAMT5 to better capture the molecular semantics. Extensive experiments verify the superiority of CAMT5 in various text-to-molecule generation tasks. Intriguingly, we find that CAMT5 outperforms the state-of-the-art methods using only 2% of training tokens. In addition, we propose a simple yet effective ensemble strategy that aggregates the outputs of text-to-molecule models to further boost the generation performance. Code is available at https://github.com/Songhyeontae/CAMT5.git.
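
A toy illustration of substructure-level tokenization on SMILES strings. The hand-written vocabulary and greedy matcher below are our simplification; CAMT5 mines substructures such as ring systems from data.

```python
# Toy vocabulary: benzene ring, carboxyl/ester fragment, amide fragment.
SUBSTRUCTURES = sorted(["c1ccccc1", "C(=O)O", "C(=O)N"], key=len, reverse=True)

def tokenize_smiles(smiles):
    """Greedy longest-match: emit a substructure token where possible,
    otherwise fall back to a character-level (atom-level) token."""
    tokens, i = [], 0
    while i < len(smiles):
        for sub in SUBSTRUCTURES:
            if smiles.startswith(sub, i):
                tokens.append(f"<{sub}>")
                i += len(sub)
                break
        else:
            tokens.append(smiles[i])
            i += 1
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
# ['C', '<C(=O)O>', '<c1ccccc1>', '<C(=O)O>']: 4 tokens versus 21 at atom level
```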

[24] No Clustering, No Routing: How Transformers Actually Process Rare Tokens

Jing Liu

Main category: cs.CL

TL;DR: Rare token processing in LLMs requires additional specialized “plateau” neurons beyond common token processing, forming dual computational regimes through distributed training-driven differentiation rather than architectural modularity.

DetailsMotivation: Large language models struggle with rare token prediction, and while prior work identified specialized plateau neurons for rare tokens, their functional organization and mechanisms remain unclear.

Method: Neuron influence analyses, graph-based clustering, and attention head ablations in GPT-2 XL and Pythia models.

Result: (1) Rare token processing requires additional plateau neurons beyond the power-law regime sufficient for common tokens; (2) Plateau neurons are spatially distributed rather than forming modular clusters; (3) Attention mechanisms show no preferential routing to specialists.

Conclusion: Rare token specialization arises through distributed, training-driven differentiation rather than architectural modularity, preserving context-sensitive flexibility while achieving adaptive capacity allocation.

Abstract: Large language models struggle with rare token prediction, yet the mechanisms driving their specialization remain unclear. Prior work identified specialized "plateau" neurons for rare tokens following distinctive three-regime influence patterns (Liu, 2025), but their functional organization is unknown. We investigate this through neuron influence analyses, graph-based clustering, and attention head ablations in GPT-2 XL and Pythia models. Our findings show that: (1) rare token processing requires additional plateau neurons beyond the power-law regime sufficient for common tokens, forming dual computational regimes; (2) plateau neurons are spatially distributed rather than forming modular clusters; and (3) attention mechanisms exhibit no preferential routing to specialists. These results demonstrate that rare token specialization arises through distributed, training-driven differentiation rather than architectural modularity, preserving context-sensitive flexibility while achieving adaptive capacity allocation.

[25] An End-to-End System for Culturally-Attuned Driving Feedback using a Dual-Component NLG Engine

Iniakpokeikiye Peter Thompson, Yi Dewei, Reiter Ehud

Main category: cs.CL

TL;DR: End-to-end mobile system for culturally-attuned safe driving feedback in Nigeria using dual-component NLG engine with safety tips and behavioral reports, designed for low-resource environments with intermittent connectivity.

DetailsMotivation: Address road safety challenges in low-resource environments like Nigeria with significant infrastructural issues, focusing on culturally-relevant feedback and alcohol-influenced driving detection.

Method: Complete system architecture with automatic trip detection, on-device behavior analysis, sophisticated NLG pipeline using two-step reflection process, and specialized ML model for alcohol detection, engineered for robustness against connectivity and sensor noise.

Result: Pilot deployment with 90 drivers demonstrates viability, with initial results on detected unsafe behaviors presented.

Conclusion: Provides a framework for applying data-to-text and AI systems to achieve social good in challenging environments.

Abstract: This paper presents an end-to-end mobile system that delivers culturally-attuned safe driving feedback to drivers in Nigeria, a low-resource environment with significant infrastructural challenges. The core of the system is a novel dual-component Natural Language Generation (NLG) engine that provides both legally-grounded safety tips and persuasive, theory-driven behavioural reports. We describe the complete system architecture, including an automatic trip detection service, on-device behaviour analysis, and a sophisticated NLG pipeline that leverages a two-step reflection process to ensure high-quality feedback. The system also integrates a specialized machine learning model for detecting alcohol-influenced driving, a key local safety issue. The architecture is engineered for robustness against intermittent connectivity and noisy sensor data. A pilot deployment with 90 drivers demonstrates the viability of our approach, and initial results on detected unsafe behaviours are presented. This work provides a framework for applying data-to-text and AI systems to achieve social good.

[26] Energy Landscapes Enable Reliable Abstention in Retrieval-Augmented Large Language Models for Healthcare

Ravi Shankar, Sheng Wong, Lin Li, Magdalena Bachmann, Alex Silverthorne, Beth Albert, Gabriel Davis Jones

Main category: cs.CL

TL;DR: Energy-based model outperforms softmax and kNN for reliable abstention in RAG systems, especially on hard semantic cases in safety-critical domains.

DetailsMotivation: Need for reliable abstention in retrieval-augmented generation systems, particularly in safety-critical domains like women's health where incorrect answers can cause harm.

Method: Energy-based model that learns smooth energy landscape over 2.6M guideline-derived questions, benchmarked against calibrated softmax baseline and kNN density heuristic across easy and hard abstention splits.

Result: EBM achieves superior abstention performance on hard cases (AUROC 0.961 vs 0.950 for softmax, FPR@95 0.235 vs 0.331). Robustness primarily from energy scoring head, not specific negative sampling strategies.

Conclusion: Energy-based abstention scoring provides more reliable confidence signal than probability-based softmax, offering scalable and interpretable foundation for safe RAG systems.

Abstract: Reliable abstention is critical for retrieval-augmented generation (RAG) systems, particularly in safety-critical domains such as women’s health, where incorrect answers can lead to harm. We present an energy-based model (EBM) that learns a smooth energy landscape over a dense semantic corpus of 2.6M guideline-derived questions, enabling the system to decide when to generate or abstain. We benchmark the EBM against a calibrated softmax baseline and a k-nearest neighbour (kNN) density heuristic across both easy and hard abstention splits, where hard cases are semantically challenging near-distribution queries. The EBM achieves superior abstention performance on semantically hard cases, reaching AUROC 0.961 versus 0.950 for softmax, while also reducing FPR@95 (0.235 vs 0.331). On easy negatives, performance is comparable across methods, but the EBM’s advantage becomes most pronounced in safety-critical hard distributions. A comprehensive ablation with controlled negative sampling and fair data exposure shows that robustness stems primarily from the energy scoring head, while the inclusion or exclusion of specific negative types (hard, easy, mixed) sharpens decision boundaries but is not essential for generalisation to hard cases. These results demonstrate that energy-based abstention scoring offers a more reliable confidence signal than probability-based softmax confidence, providing a scalable and interpretable foundation for safe RAG systems.
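
As a rough sketch of how an energy score can drive the generate/abstain decision, the snippet below uses a logsumexp similarity to the guideline-question embeddings as a density-style stand-in for the paper's trained energy head; the names, temperature, and threshold are assumptions.

```python
import numpy as np

def energy(query_emb, corpus_embs, t=0.1):
    """Low energy near well-supported regions of the corpus, high energy
    for out-of-scope queries. Assumes unit-normalised embeddings."""
    sims = corpus_embs @ query_emb
    return -t * np.logaddexp.reduce(sims / t)

def route(query_emb, corpus_embs, tau):
    # Abstain when the query sits in a high-energy (poorly supported) region.
    return "generate" if energy(query_emb, corpus_embs) < tau else "abstain"
```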

[27] Discrete Prompt Tuning via Recursive Utilization of Black-box Multimodal Large Language Model for Personalized Visual Emotion Recognition

Ryo Takahashi, Naoki Saito, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

Main category: cs.CL

TL;DR: Personalized Visual Emotion Recognition using discrete prompt tuning to adapt MLLMs for individual-level emotion recognition, overcoming majority bias in general models.

DetailsMotivation: MLLMs favor majority viewpoints and familiar patterns from general training data, limiting performance in personalized VER needed for real-world applications like opinion mining and advertisement design.

Method: Discrete prompt tuning inspired by human prompt engineering, selecting optimal natural language representations from generated prompts to update prompts for accurate personalized emotion recognition.

Result: The method enables adaptation of VER tasks to individual users, improving accuracy in personalized emotion recognition compared to standard MLLMs.

Conclusion: Discrete prompt tuning effectively addresses the limitation of MLLMs in personalized VER, making them more suitable for practical applications requiring individual-level emotion understanding.

Abstract: Visual Emotion Recognition (VER) is an important research topic due to its wide range of applications, including opinion mining and advertisement design. Extending this capability to recognize emotions at the individual level further broadens its potential applications. Recently, Multimodal Large Language Models (MLLMs) have attracted increasing attention and demonstrated performance comparable to that of conventional VER methods. However, MLLMs are trained on large and diverse datasets containing general opinions, which causes them to favor majority viewpoints and familiar patterns. This tendency limits their performance in personalized VER, which is crucial for practical and real-world applications, and indicates a key area for improvement. To address this limitation, the proposed method employs discrete prompt tuning, inspired by human prompt engineering, to adapt the VER task to each individual. Our method selects the best natural language representation from the generated prompts and uses it to update the prompt, enabling accurate personalized VER.
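
Our reading of the recursive black-box loop, sketched below; the prompt wording, scorer, and loop budget are hypothetical stand-ins, not the paper's.

```python
def discrete_prompt_tune(mllm, score_fn, seed_prompt, rounds=5, k=4):
    """The MLLM itself proposes k natural-language rewrites of the current
    prompt; the candidate scoring best on the user's labelled emotion
    examples becomes the prompt for the next round."""
    best = seed_prompt
    for _ in range(rounds):
        candidates = [
            mllm("Rewrite this instruction to better fit this user's way "
                 f"of perceiving emotions:\n{best}")
            for _ in range(k)
        ]
        best = max(candidates + [best], key=score_fn)
    return best
```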

[28] DecMetrics: Structured Claim Decomposition Scoring for Factually Consistent LLM Outputs

Minghui Huang

Main category: cs.CL

TL;DR: DecMetrics introduces three new metrics (COMPLETENESS, CORRECTNESS, SEMANTIC ENTROPY) to automatically evaluate claim decomposition quality and uses them to optimize a lightweight decomposition model for fact-checking.

DetailsMotivation: Current research focuses on generative methods for claim decomposition but lacks proper evaluation metrics for assessing the quality of decomposed atomic claims, creating a gap in reliable fact-checking systems.

Method: Developed three automatic evaluation metrics (COMPLETENESS, CORRECTNESS, SEMANTIC ENTROPY) and integrated them as a reward function to optimize a lightweight claim decomposition model.

Result: The approach aims to set a benchmark for claim decomposition through automatic evaluation, enhancing both reliability and effectiveness of fact-checking systems.

Conclusion: DecMetrics bridges the evaluation gap in claim decomposition research by providing automated quality assessment metrics that can optimize decomposition models and improve fact-checking system performance.

Abstract: Claim decomposition plays a crucial role in the fact-checking process by breaking down complex claims into simpler atomic components and identifying their unfactual elements. Despite its importance, current research primarily focuses on generative methods for decomposition, with insufficient emphasis on evaluating the quality of these decomposed atomic claims. To bridge this gap, we introduce DecMetrics, which comprises three new metrics, COMPLETENESS, CORRECTNESS, and SEMANTIC ENTROPY, designed to automatically assess the quality of claims produced by decomposition models. Utilizing these metrics, we develop a lightweight claim decomposition model, optimizing its performance through the integration of these metrics as a reward function. Through automatic evaluation, our approach aims to set a benchmark for claim decomposition, enhancing both the reliability and effectiveness of fact-checking systems.

[29] The Good, the Bad and the Constructive: Automatically Measuring Peer Review’s Utility for Authors

Abdelrahman Sadallah, Tim Baumgärtner, Iryna Gurevych, Ted Briscoe

Main category: cs.CL

TL;DR: The paper introduces RevUtil dataset for evaluating review comment quality on four aspects (Actionability, Grounding & Specificity, Verifiability, Helpfulness) and shows fine-tuned models can match GPT-4o performance in assessing review utility.

DetailsMotivation: With reviewers having less time for peer review, automated systems are needed to ensure high-quality feedback that is useful for paper authors.

Method: Created RevUtil dataset with 1,430 human-labeled and 10k synthetically labeled review comments with rationales. Benchmarked fine-tuned models for assessing review comments on four utility aspects and generating explanations.

Result: Fine-tuned models achieved human agreement levels comparable to or exceeding GPT-4o. Machine-generated reviews generally underperform human reviews on the four utility aspects.

Conclusion: The RevUtil dataset enables effective evaluation of review comment utility, and fine-tuned models can effectively assess review quality, though human reviews still outperform automated ones on key utility dimensions.

Abstract: Providing constructive feedback to paper authors is a core component of peer review. With reviewers increasingly having less time to perform reviews, automated support systems are required to ensure high reviewing quality, thus making the feedback in reviews useful for authors. To this end, we identify four key aspects of review comments (individual points in weakness sections of reviews) that drive the utility for authors: Actionability, Grounding & Specificity, Verifiability, and Helpfulness. To enable evaluation and development of models assessing review comments, we introduce the RevUtil dataset. We collect 1,430 human-labeled review comments and scale our data with 10k synthetically labeled comments for training purposes. The synthetic data additionally contains rationales, i.e., explanations for the aspect score of a review comment. Employing the RevUtil dataset, we benchmark fine-tuned models for assessing review comments on these aspects and generating rationales. Our experiments demonstrate that these fine-tuned models achieve agreement levels with humans comparable to, and in some cases exceeding, those of powerful closed models like GPT-4o. Our analysis further reveals that machine-generated reviews generally underperform human reviews on our four aspects.

[30] ASCENDgpt: A Phenotype-Aware Transformer Model for Cardiovascular Risk Prediction from Electronic Health Records

Chris Sainsbury, Andreas Karwath

Main category: cs.CL

TL;DR: ASCENDgpt is a transformer model for cardiovascular risk prediction that uses phenotype-aware tokenization to reduce EHR complexity while maintaining clinical interpretability and achieving strong predictive performance.

DetailsMotivation: To improve cardiovascular risk prediction from electronic health records by addressing the complexity of raw ICD codes and enabling clinically interpretable predictions while maintaining computational efficiency.

Method: Developed a phenotype-aware tokenization scheme mapping 47,155 ICD codes to 176 clinically meaningful phenotype tokens, pretrained using masked language modeling on 19,402 individuals, then fine-tuned for time-to-event prediction of five cardiovascular outcomes.

Result: Achieved excellent discrimination with average C-index of 0.816 across all outcomes: MI (0.792), stroke (0.824), MACE (0.800), cardiovascular death (0.842), and all-cause mortality (0.824), with 77.9% vocabulary reduction.

Conclusion: Domain-specific tokenization and pretraining are effective for EHR-based risk prediction, enabling interpretable predictions while maintaining computational efficiency and strong performance across multiple cardiovascular outcomes.

Abstract: We present ASCENDgpt, a transformer-based model specifically designed for cardiovascular risk prediction from longitudinal electronic health records (EHRs). Our approach introduces a novel phenotype-aware tokenization scheme that maps 47,155 raw ICD codes to 176 clinically meaningful phenotype tokens, achieving 99.6% consolidation of diagnosis codes while preserving semantic information. This phenotype mapping contributes to a total vocabulary of 10,442 tokens, a 77.9% reduction when compared with using raw ICD codes directly. We pretrain ASCENDgpt on sequences derived from 19,402 unique individuals using a masked language modeling objective, then fine-tune for time-to-event prediction of five cardiovascular outcomes: myocardial infarction (MI), stroke, major adverse cardiovascular events (MACE), cardiovascular death, and all-cause mortality. Our model achieves excellent discrimination on the held-out test set with an average C-index of 0.816, demonstrating strong performance across all outcomes (MI: 0.792, stroke: 0.824, MACE: 0.800, cardiovascular death: 0.842, all-cause mortality: 0.824). The phenotype-based approach enables clinically interpretable predictions while maintaining computational efficiency. Our work demonstrates the effectiveness of domain-specific tokenization and pretraining for EHR-based risk prediction tasks.
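
The tokenization step reduces to a many-to-one lookup. Below is a sketch with a hypothetical fragment of the map; the real scheme covers 47,155 ICD codes and 176 phenotype tokens.

```python
# Hypothetical fragment of the ICD-to-phenotype map.
ICD_TO_PHENOTYPE = {
    "I21.0": "<PHE_MI>", "I21.1": "<PHE_MI>", "I21.9": "<PHE_MI>",
    "I63.9": "<PHE_STROKE>", "E11.9": "<PHE_T2DM>",
}

def tokenize_record(icd_codes):
    """Collapse raw ICD codes to phenotype tokens; unmapped codes fall back
    to a catch-all token (our assumption)."""
    return [ICD_TO_PHENOTYPE.get(c, "<PHE_OTHER>") for c in icd_codes]

print(tokenize_record(["I21.0", "E11.9", "Z99.9"]))
# ['<PHE_MI>', '<PHE_T2DM>', '<PHE_OTHER>']
```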

[31] Refining Transcripts With TV Subtitles by Prompt-Based Weakly Supervised Training of ASR

Xinnian Zhao, Hugo Van Hamme

Main category: cs.CL

TL;DR: Novel approach using TV subtitles as context-rich prompts instead of direct supervision in weakly supervised ASR, with weighted attention mechanism for improved transcription accuracy.

DetailsMotivation: TV subtitles are readily available but have imprecise alignment with audio, limiting their use as direct supervision for verbatim transcription in ASR systems.

Method: Reimagines subtitles as context-rich prompts rather than direct supervision, uses generated pseudo transcripts as primary targets with subtitles as guiding cues for iterative refinement, and introduces weighted attention mechanism to emphasize relevant subtitle tokens.

Result: Experiments demonstrate significant improvements in transcription accuracy, showing effectiveness in refining transcripts.

Conclusion: Enhanced pseudo-labeled datasets provide high-quality foundational resources for training robust ASR systems through this novel subtitle utilization approach.

Abstract: This study proposes a novel approach to using TV subtitles within a weakly supervised (WS) Automatic Speech Recognition (ASR) framework. Although TV subtitles are readily available, their imprecise alignment with corresponding audio limits their applicability as supervised targets for verbatim transcription. Rather than using subtitles as direct supervision signals, our method reimagines them as context-rich prompts. This design enables the model to handle discrepancies between spoken audio and subtitle text: generated pseudo transcripts become the primary targets, with subtitles acting as guiding cues for iterative refinement. To further enhance the process, we introduce a weighted attention mechanism that emphasizes relevant subtitle tokens during inference. Our experiments demonstrate significant improvements in transcription accuracy, highlighting the effectiveness of the proposed method in refining transcripts. These enhanced pseudo-labeled datasets provide high-quality foundational resources for training robust ASR systems.

[32] Learned Hallucination Detection in Black-Box LLMs using Token-level Entropy Production Rate

Charles Moslonka, Hicham Randrianarivo, Arthur Garnier, Emmanuel Malherbe

Main category: cs.CL

TL;DR: One-shot hallucination detection method using limited log-probability data from black-box LLM APIs, achieving improved performance over baseline entropy metrics without requiring multiple query runs.

DetailsMotivation: Hallucinations in LLM outputs for QA tasks undermine reliability, especially in API-constrained scenarios with limited access to token probabilities.

Method: Derives uncertainty indicators from available top candidate log-probabilities during non-greedy decoding, using Entropy Production Rate metric and supervised learning with entropic contribution features from single generated sequences.

Result: Significantly improves hallucination detection over EPR alone across diverse QA datasets and multiple LLMs, demonstrating high performance with only small sets of available log-probabilities (e.g., top <10 per token).

Conclusion: Provides a readily deployable technique to enhance LLM trustworthiness in QA and RAG systems from single generation passes, with demonstrated utility in financial analysis applications.

Abstract: Hallucinations in Large Language Model (LLM) outputs for Question Answering (QA) tasks critically undermine their real-world reliability. This paper introduces an applied methodology for robust, one-shot hallucination detection, specifically designed for scenarios with limited data access, such as interacting with black-box LLM APIs that typically expose only a few top candidate log-probabilities per token. Our approach derives uncertainty indicators directly from these readily available log-probabilities generated during non-greedy decoding. We first derive an Entropy Production Rate (EPR) metric that offers baseline performance, later augmented with supervised learning. Our learned model uses features representing the entropic contributions of the accessible top-ranked tokens within a single generated sequence, requiring no multiple query re-runs. Evaluated across diverse QA datasets and multiple LLMs, this estimator significantly improves hallucination detection over using EPR alone. Crucially, high performance is demonstrated using only the typically small set of available log-probabilities (e.g., top <10 per token), confirming its practical efficiency and suitability for these API-constrained deployments. This work provides a readily deployable technique to enhance the trustworthiness of LLM responses from a single generation pass in QA and Retrieval-Augmented Generation (RAG) systems, with its utility further demonstrated in a finance framework analyzing responses to queries on annual reports from an industrial dataset.
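
One plausible reading of the computation, given only the few log-probabilities a black-box API exposes per token: renormalise the top-k candidates, take each token's entropy, and average over the sequence. The renormalisation and mean aggregation are our assumptions; the paper's exact EPR definition may differ.

```python
import math

def token_entropy(top_logprobs):
    """Entropy of the renormalised top-k candidate distribution (the API
    exposes only the top-k mass, so we renormalise over it)."""
    probs = [math.exp(lp) for lp in top_logprobs]
    z = sum(probs)
    return -sum((p / z) * math.log(p / z) for p in probs)

def entropy_production_rate(seq_top_logprobs):
    """Mean per-token entropy over one generated sequence."""
    ents = [token_entropy(t) for t in seq_top_logprobs]
    return sum(ents) / max(len(ents), 1)

# The learned detector then uses the per-token entropy contributions
# (not just the mean) as features for a supervised classifier.
```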

[33] A Narrative-Driven Computational Framework for Clinician Burnout Surveillance

Syed Ahmad Chan Bukhari, Fazel Keshtkar, Alyssa Meczkowska

Main category: cs.CL

TL;DR: A hybrid NLP pipeline using BioBERT sentiment embeddings, lexical stress analysis, and LDA topic modeling on ICU clinical notes achieves high performance (F1=0.84) in detecting clinician burnout, outperforming metadata-only approaches.

DetailsMotivation: Clinician burnout threatens patient safety in ICUs, but existing research relies on retrospective surveys or broad EHR metadata, missing valuable narrative information in clinical notes.

Method: Analyzed 10,000 ICU discharge summaries from MIMIC-IV using a hybrid pipeline combining BioBERT sentiment embeddings fine-tuned for clinical narratives, lexical stress lexicon for burnout surveillance, and five-topic LDA with workload proxies.

Result: Provider-level logistic regression classifier achieved precision=0.80, recall=0.89, F1=0.84, surpassing metadata-only baselines by ≥0.17 F1 score. Elevated burnout risk found in Radiology, Psychiatry, and Neurology specialties.

Conclusion: ICU clinical narratives contain actionable signals for proactive well-being monitoring, demonstrating the value of text analysis for burnout detection beyond traditional metadata approaches.

Abstract: Clinician burnout poses a substantial threat to patient safety, particularly in high-acuity intensive care units (ICUs). Existing research predominantly relies on retrospective survey tools or broad electronic health record (EHR) metadata, often overlooking the valuable narrative information embedded in clinical notes. In this study, we analyze 10,000 ICU discharge summaries from MIMIC-IV, a publicly available database derived from the electronic health records of Beth Israel Deaconess Medical Center. The dataset encompasses diverse patient data, including vital signs, medical orders, diagnoses, procedures, treatments, and deidentified free-text clinical notes. We introduce a hybrid pipeline that combines BioBERT sentiment embeddings fine-tuned for clinical narratives, a lexical stress lexicon tailored for clinician burnout surveillance, and five-topic latent Dirichlet allocation (LDA) with workload proxies. A provider-level logistic regression classifier achieves a precision of 0.80, a recall of 0.89, and an F1 score of 0.84 on a stratified hold-out set, surpassing metadata-only baselines by greater than or equal to 0.17 F1 score. Specialty-specific analysis indicates elevated burnout risk among providers in Radiology, Psychiatry, and Neurology. Our findings demonstrate that ICU clinical narratives contain actionable signals for proactive well-being monitoring.
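
At provider level the pipeline bottoms out in a standard feature-concatenation plus logistic-regression step, sketched here with random stand-in features; the paper derives the blocks from BioBERT sentiment embeddings, the stress lexicon, LDA topic proportions, and workload proxies.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def provider_features(sent_emb, stress_score, topic_props, workload):
    """Concatenate the pipeline's per-provider feature blocks."""
    return np.concatenate([sent_emb, [stress_score], topic_props, [workload]])

rng = np.random.default_rng(0)  # illustrative random data only
X = np.stack([
    provider_features(rng.normal(size=8), rng.random(),
                      rng.dirichlet(np.ones(5)), rng.random())
    for _ in range(200)
])
y = rng.integers(0, 2, size=200)  # burnout label placeholder
clf = LogisticRegression(max_iter=1000).fit(X, y)
```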

[34] Where Should I Study? Biased Language Models Decide! Evaluating Fairness in LMs for Academic Recommendations

Krithi Shailya, Akhilesh Kumar Mishra, Gokul S Krishnan, Balaraman Ravindran

Main category: cs.CL

TL;DR: LLMs show significant geographic, demographic, and economic biases in university recommendations, favoring Global North institutions and reinforcing gender stereotypes despite some diversity in suggestions.

DetailsMotivation: To examine and quantify biases in LLM-based educational recommendation systems that risk perpetuating societal inequalities in higher education access.

Method: Empirical analysis using 360 simulated user profiles varying by gender, nationality, and economic status to generate over 25,000 recommendations from three open-source LLMs (LLaMA-3.1-8B, Gemma-7B, Mistral-7B), with a novel multi-dimensional evaluation framework.

Result: Strong biases observed: Global North institutions disproportionately favored, gender stereotypes reinforced, institutional repetition prevalent. LLaMA-3.1 showed highest diversity (481 unique universities across 58 countries) but systemic disparities persisted.

Conclusion: Urgent need for bias consideration in educational LLMs to ensure equitable global access to higher education, requiring improved evaluation frameworks beyond accuracy metrics.

Abstract: Large Language Models (LLMs) are increasingly used as daily recommendation systems for tasks like education planning, yet their recommendations risk perpetuating societal biases. This paper empirically examines geographic, demographic, and economic biases in university and program suggestions from three open-source LLMs: LLaMA-3.1-8B, Gemma-7B, and Mistral-7B. Using 360 simulated user profiles varying by gender, nationality, and economic status, we analyze over 25,000 recommendations. Results show strong biases: institutions in the Global North are disproportionately favored, recommendations often reinforce gender stereotypes, and institutional repetition is prevalent. While LLaMA-3.1 achieves the highest diversity, recommending 481 unique universities across 58 countries, systemic disparities persist. To quantify these issues, we propose a novel, multi-dimensional evaluation framework that goes beyond accuracy by measuring demographic and geographic representation. Our findings highlight the urgent need for bias consideration in educational LMs to ensure equitable global access to higher education.

[35] DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence

Pranav Narayanan Venkit, Philippe Laban, Yilun Zhou, Kung-Hsiang Huang, Yixin Mao, Chien-Sheng Wu

Main category: cs.CL

TL;DR: DeepTRACE audit framework reveals generative search engines and deep research agents frequently produce one-sided, highly confident responses with poor citation accuracy and unsupported statements.

DetailsMotivation: Users regularly encounter overconfidence, weak sourcing, and confusing citation practices in generative search engines and deep research LLM agents, despite promises of trustworthy synthesis.

Method: Developed DeepTRACE framework with eight measurable dimensions using statement-level analysis, decomposition, confidence scoring, and citation/factual-support matrices to audit evidence attribution end-to-end across popular models.

Result: Systems frequently produce one-sided responses on debate queries with large fractions of unsupported statements. Deep-research configurations reduce overconfidence but remain one-sided, with citation accuracy ranging from 40% to 80%.

Conclusion: Current generative search and deep research systems exhibit significant limitations in source attribution and balanced reasoning, highlighting the need for improved auditing frameworks and system designs.

Abstract: Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations. DeepTRACE uses statement-level analysis (decomposition, confidence scoring) and builds citation and factual-support matrices to audit how systems reason with and attribute evidence end-to-end. Using automated extraction pipelines for popular public models (e.g., GPT-4.5/5, You.com, Perplexity, Copilot/Bing, Gemini) and an LLM-judge with validated agreement to human raters, we evaluate both web-search engines and deep-research configurations. Our findings show that generative search engines and deep research agents frequently produce one-sided, highly confident responses on debate queries and include large fractions of statements unsupported by their own listed sources. Deep-research configurations reduce overconfidence and can attain high citation thoroughness, but they remain highly one-sided on debate queries and still exhibit large fractions of unsupported statements, with citation accuracy ranging from 40% to 80% across systems.

[36] Context Engineering for Trustworthiness: Rescorla Wagner Steering Under Mixed and Inappropriate Contexts

Rushi Wang, Jiateng Liu, Cheng Qian, Yifan Shen, Yanzhou Pan, Zhaozhuo Xu, Ahmed Abbasi, Heng Ji, Denghui Zhang

Main category: cs.CL

TL;DR: LLMs show vulnerability to inappropriate content in mixed contexts, exhibiting a counterintuitive tendency to prioritize less prevalent information. The paper introduces a neuroscience-inspired approach (RW-Steering) that improves response quality by 39.8% through fine-tuning to identify and ignore harmful signals.

DetailsMotivation: Real-world contexts often contain mixed relevant and inappropriate content, posing reliability risks to LLMs. Understanding how LLMs process and prioritize conflicting contextual signals is crucial for improving their safety and reliability.

Method: Introduced Poisoned Context Testbed with mixed relevant/inappropriate content. Adapted Rescorla-Wagner model from neuroscience to quantify contextual influence. Developed RW-Steering, a two-stage fine-tuning approach that teaches models to internally identify and ignore inappropriate signals.

Result: LLMs exhibit consistent behavioral pattern: strong tendency to incorporate less prevalent (often inappropriate) content. RW-Steering improved response quality by 39.8% and reversed the undesirable behavior curve, demonstrating robust generalization across varying proportions of inappropriate content.

Conclusion: RW-Steering provides a robust, generalizable context engineering solution that significantly enhances LLM safety by enabling models to internally filter out inappropriate contextual signals without extensive supervision across diverse context mixtures.

Abstract: Incorporating external context can significantly enhance the response quality of Large Language Models (LLMs). However, real-world contexts often mix relevant information with disproportionate inappropriate content, posing reliability risks. How do LLMs process and prioritize mixed context? To study this, we introduce the Poisoned Context Testbed, pairing queries with real-world contexts containing relevant and inappropriate content. Inspired by associative learning in animals, we adapt the Rescorla-Wagner (RW) model from neuroscience to quantify how competing contextual signals influence LLM outputs. Our adapted model reveals a consistent behavioral pattern: LLMs exhibit a strong tendency to incorporate information that is less prevalent in the context. This susceptibility is harmful in real-world settings, where small amounts of inappropriate content can substantially degrade response quality. Empirical evaluations on our testbed further confirm this vulnerability. To tackle this, we introduce RW-Steering, a two-stage finetuning-based approach that enables the model to internally identify and ignore inappropriate signals. Unlike prior methods that rely on extensive supervision across diverse context mixtures, RW-Steering generalizes robustly across varying proportions of inappropriate content. Experiments show that our best fine-tuned model improves response quality by 39.8% and reverses the undesirable behavior curve, establishing RW-Steering as a robust, generalizable context engineering solution for improving LLM safety in real-world use.
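
For reference, the textbook Rescorla-Wagner update the paper adapts: each cue's associative strength moves toward the outcome with a prediction error shared across co-present cues. The cue names and salience values below are illustrative; a rare but salient cue can dominate the update, mirroring the observed tendency to over-weight less prevalent context.

```python
def rescorla_wagner_step(V, alpha, beta, lam):
    """One RW step: V[cue] += alpha[cue] * beta * (lam - sum(V))."""
    error = lam - sum(V.values())  # shared prediction error
    return {cue: v + alpha[cue] * beta * error for cue, v in V.items()}

V = {"relevant": 0.2, "inappropriate": 0.2}
alpha = {"relevant": 0.1, "inappropriate": 0.6}  # rarer cue, higher salience
print(rescorla_wagner_step(V, alpha, beta=0.5, lam=1.0))
# {'relevant': 0.23, 'inappropriate': 0.38}
```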

[37] Understanding Reinforcement Learning for Model Training, and future directions with GRAPE

Rohit Patel

Main category: cs.CL

TL;DR: A comprehensive tutorial-style paper that explains key instruction tuning algorithms (SFT, Rejection Sampling, REINFORCE, TRPO, PPO, GRPO, DPO) with simplified notation focused on LLMs, provides literature review of newer techniques, and introduces new research ideas with GRAPE.

DetailsMotivation: Existing explanations of instruction tuning algorithms often assume prior knowledge, lack critical details, or are overly complex and generalized, making them difficult to understand for those new to the field.

Method: Step-by-step development of each algorithm using simplified and explicit notation specifically focused on LLMs, minimizing detours into broader RL literature to reduce cognitive overhead and eliminate ambiguity.

Result: Provides clear and intuitive understanding of key instruction tuning concepts and algorithms, making them accessible without requiring extensive prior knowledge in reinforcement learning.

Conclusion: The paper successfully demystifies complex instruction tuning algorithms through simplified explanations, provides comprehensive literature review of newer techniques, and introduces GRAPE as a new research direction for generalized relative advantage policy evolution.

Abstract: This paper provides a self-contained, from-scratch, exposition of key algorithms for instruction tuning of models: SFT, Rejection Sampling, REINFORCE, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Explanations of these algorithms often assume prior knowledge, lack critical details, and/or are overly generalized and complex. Here, each method is discussed and developed step by step using simplified and explicit notation focused on LLMs, aiming to eliminate ambiguity and provide a clear and intuitive understanding of the concepts. By minimizing detours into the broader RL literature and connecting concepts to LLMs, we eliminate superfluous abstractions and reduce cognitive overhead. Following this exposition, we provide a literature review of new techniques and approaches beyond those detailed. Finally, new ideas for research and exploration in the form of GRAPE (Generalized Relative Advantage Policy Evolution) are presented.
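
Two of the surveyed algorithms compress to a few lines each, which conveys the paper's simplified-notation spirit; these are the standard formulations, not excerpts from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: push the policy's chosen-minus-rejected log-prob margin above the
    reference model's. Inputs are per-sequence summed token log-probs."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

def grpo_advantages(rewards):
    """GRPO: standardise rewards within a group of samples for one prompt,
    removing the need for a learned value function (critic)."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + 1e-6)

print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
# tensor([ 0.8660, -0.8660, -0.8660,  0.8660])
```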

[38] VaccineRAG: Boosting Multimodal Large Language Models’ Immunity to Harmful RAG Samples

Qixin Sun, Ziqin Wang, Hengyuan Zhao, Yilin Li, Kaiyou Song, Linjiang Huang, Xiaolin Hu, Qingpei Guo, Si Liu

Main category: cs.CL

TL;DR: VaccineRAG is a novel Chain-of-Thought-based RAG dataset that addresses retrieval precision issues by evaluating models with varying positive/negative sample ratios and enhancing discrimination through explicit CoT analysis, with Partial-GRPO for better complex sequence learning.

DetailsMotivation: RAG effectiveness is hindered by poor retrieval precision where many irrelevant/misleading samples are fed to LLMs, creating a critical performance bottleneck.

Method: Introduces VaccineRAG dataset with varying positive/negative sample ratios, prompts LLMs to generate explicit Chain-of-Thought analysis for each sample, and proposes Partial-GRPO to model LLM outputs as multiple components for better complex sequence learning.

Result: Comprehensive evaluations and ablation studies validate the effectiveness of the proposed scheme in enhancing models’ sample-discrimination capabilities and complex CoT learning.

Conclusion: VaccineRAG successfully addresses RAG retrieval precision issues and enhances LLM performance through systematic evaluation and improved Chain-of-Thought learning capabilities.

Abstract: Retrieval Augmented Generation enhances the response accuracy of Large Language Models (LLMs) by integrating retrieval and generation modules with external knowledge, demonstrating particular strength in real-time queries and Visual Question Answering tasks. However, the effectiveness of RAG is frequently hindered by the precision of the retriever: many retrieved samples fed into the generation phase are irrelevant or misleading, posing a critical bottleneck to LLMs’ performance. To address this challenge, we introduce VaccineRAG, a novel Chain-of-Thought-based retrieval-augmented generation dataset. On one hand, VaccineRAG employs a benchmark to evaluate models using data with varying positive/negative sample ratios, systematically exposing inherent weaknesses in current LLMs. On the other hand, it enhances models’ sample-discrimination capabilities by prompting LLMs to generate explicit Chain-of-Thought (CoT) analysis for each sample before producing final answers. Furthermore, to enhance the model’s ability to learn long-sequence complex CoT content, we propose Partial-GRPO. By modeling the outputs of LLMs as multiple components rather than a single whole, our model can make more informed preference selections for complex sequences, thereby enhancing its capacity to learn complex CoT. Comprehensive evaluations and ablation studies on VaccineRAG validate the effectiveness of the proposed scheme. The code and dataset will be publicly released soon.

[39] Behavioral Fingerprinting of Large Language Models

Zehua Pei, Hui-Ling Zhen, Ying Zhang, Zhiyuan Yang, Xing Li, Xianzhi Yu, Mingxuan Yuan, Bei Yu

Main category: cs.CL

TL;DR: Behavioral Fingerprinting framework moves beyond traditional LLM benchmarks to analyze nuanced behavioral characteristics, revealing that while reasoning capabilities converge among top models, alignment-related behaviors vary significantly based on developer strategies.

DetailsMotivation: Current LLM benchmarks focus primarily on performance metrics but fail to capture the nuanced behavioral characteristics that differentiate models, creating a need for a more comprehensive evaluation framework.

Method: Uses a Diagnostic Prompt Suite and automated evaluation pipeline where a powerful LLM acts as an impartial judge to analyze eighteen models across capability tiers, creating multi-faceted behavioral profiles.

Result: Reveals critical divergence: core capabilities like abstract and causal reasoning are converging among top models, but alignment-related behaviors (sycophancy, semantic robustness) vary dramatically. Documents cross-model default persona clustering (ISTJ/ESTJ) reflecting common alignment incentives.

Conclusion: A model’s interactive nature is not emergent from scale or reasoning power, but a direct consequence of specific and highly variable developer alignment strategies. The framework provides reproducible methodology for uncovering deep behavioral differences.

Abstract: Current benchmarks for Large Language Models (LLMs) primarily focus on performance metrics, often failing to capture the nuanced behavioral characteristics that differentiate them. This paper introduces a novel "Behavioral Fingerprinting" framework designed to move beyond traditional evaluation by creating a multi-faceted profile of a model’s intrinsic cognitive and interactive styles. Using a curated Diagnostic Prompt Suite and an innovative, automated evaluation pipeline where a powerful LLM acts as an impartial judge, we analyze eighteen models across capability tiers. Our results reveal a critical divergence in the LLM landscape: while core capabilities like abstract and causal reasoning are converging among top models, alignment-related behaviors such as sycophancy and semantic robustness vary dramatically. We further document a cross-model default persona clustering (ISTJ/ESTJ) that likely reflects common alignment incentives. Taken together, this suggests that a model’s interactive nature is not an emergent property of its scale or reasoning power, but a direct consequence of specific, and highly variable, developer alignment strategies. Our framework provides a reproducible and scalable methodology for uncovering these deep behavioral differences. Project: https://github.com/JarvisPei/Behavioral-Fingerprinting

[40] From Silent Signals to Natural Language: A Dual-Stage Transformer-LLM Approach

Nithyashree Sivasubramaniam

Main category: cs.CL

TL;DR: Enhanced ASR framework combining transformer acoustic model with LLM post-processing for silent speech interfaces, achieving 16% relative WER reduction.

DetailsMotivation: Silent speech interfaces generate speech with phonetic ambiguity and noise, requiring improved recognition and downstream processing.

Method: Transformer-based acoustic model captures full utterance context combined with large language model for linguistic post-processing.

Result: 16% relative and 6% absolute WER reduction over 36% baseline, significantly improving intelligibility.

Conclusion: The combined transformer-LLM framework effectively addresses phonetic ambiguity in silent speech, substantially enhancing recognition accuracy.

Abstract: Silent Speech Interfaces (SSIs) have gained attention for their ability to generate intelligible speech from non-acoustic signals. While significant progress has been made in advancing speech generation pipelines, limited work has addressed the recognition and downstream processing of synthesized speech, which often suffers from phonetic ambiguity and noise. To overcome these challenges, we propose an enhanced automatic speech recognition framework that combines a transformer-based acoustic model with a large language model (LLM) for post-processing. The transformer captures full utterance context, while the LLM ensures linguistic consistency. Experimental results show a 16% relative and 6% absolute reduction in word error rate (WER) over a 36% baseline, demonstrating substantial improvements in intelligibility for silent speech interfaces.

[41] Mitigation of Gender and Ethnicity Bias in AI-Generated Stories through Model Explanations

Martha O. Dimgba, Sharon Oba, Ameeta Agrawal, Philippe J. Giabbanelli

Main category: cs.CL

TL;DR: BAME method uses model-generated explanations to reduce gender and ethnicity bias in AI-generated occupational stories by 2-20% through targeted prompt engineering without modifying model parameters.

DetailsMotivation: Language models propagate social bias in gender and ethnicity representation, particularly in occupational stories, which perpetuates training data stereotypes and requires effective mitigation strategies.

Method: BAME (Bias Analysis and Mitigation through Explanation) leverages model-generated explanations to inform targeted prompt engineering. Analyzed stories across 25 occupational groups using three LLMs (Claude 3.5 Sonnet, Llama 3.1 70B Instruct, GPT-4 Turbo) and multiple demographic dimensions.

Result: Achieved 2-20% improvements in demographic representation. Identified persistent patterns of overrepresentation and underrepresentation linked to training data stereotypes across all tested models.

Conclusion: Guiding models with their own internal reasoning mechanisms significantly enhances demographic parity and contributes to developing more transparent generative AI systems without requiring model parameter modifications.

Abstract: Language models have been shown to propagate social bias through their output, particularly in the representation of gender and ethnicity. This paper investigates gender and ethnicity biases in AI-generated occupational stories. Representation biases are measured before and after applying our proposed mitigation strategy, Bias Analysis and Mitigation through Explanation (BAME), revealing improvements in demographic representation ranging from 2% to 20%. BAME leverages model-generated explanations to inform targeted prompt engineering, effectively reducing biases without modifying model parameters. By analyzing stories generated across 25 occupational groups, three large language models (Claude 3.5 Sonnet, Llama 3.1 70B Instruct, and GPT-4 Turbo), and multiple demographic dimensions, we identify persistent patterns of overrepresentation and underrepresentation linked to training data stereotypes. Our findings demonstrate that guiding models with their own internal reasoning mechanisms can significantly enhance demographic parity, thereby contributing to the development of more transparent generative AI systems.

[42] ProST: Progressive Sub-task Training for Pareto-Optimal Multi-agent Systems Using Small Language Models

Biddut Sarker Bijoy, Mohammad Saqib Hasan, Pegah Alipoormolabashi, Avirup Sil, Aruna Balasubramanian, Niranjan Balasubramanian

Main category: cs.CL

TL;DR: Multi-agent systems with smaller language models can outperform single large language model systems through progressive sub-task training, achieving better effectiveness-efficiency trade-offs.

DetailsMotivation: To compare single vs multi-agent systems using different sized language models for complex problems, addressing the limitations of smaller language models in long-trajectory learning.

Method: Instantiated single and multi-agent systems in AppWorld environment, introduced progressive sub-task training strategy that introduces new sub-tasks progressively each epoch (similar to curriculum learning).

Result: Progressive training consistently improved multi-agent effectiveness across all configurations. Fine-tuned multi-agent systems showed better effectiveness-efficiency trade-offs in Pareto analysis, with reduced subtask error rates.

Conclusion: Multi-agent systems with smaller language models, when trained with progressive sub-task strategy, provide a viable and efficient alternative to single large language model systems for complex problem solving.

Abstract: Multi-agent systems with smaller language models (SLMs) present a viable alternative to single-agent systems powered by large language models (LLMs) for addressing complex problems. In this work, we study how these alternatives compare in terms of both effectiveness and efficiency. To study this trade-off, we instantiate single and multi-agent systems for the complex problems in the AppWorld environment using different-sized language models. We find that difficulties with long-trajectory learning in SLMs limit their performance. Even when trained for specialized roles, SLMs fail to learn all subtasks effectively. To address this issue, we introduce a simple progressive sub-task training strategy, which introduces new sub-tasks progressively in each training epoch. We find that this novel strategy, analogous to instance-level curriculum learning, consistently improves the effectiveness of multi-agent systems across all configurations. Our Pareto analysis shows that fine-tuned multi-agent systems yield better effectiveness-efficiency trade-offs. Additional ablations and analyses show the importance of our progressive training strategy and its ability to reduce subtask error rates.
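
A minimal sketch of the schedule as we read it, with one new sub-task introduced per epoch; the sub-task names are hypothetical, and the paper may pace introductions differently.

```python
def progressive_subtask_schedule(subtasks, n_epochs):
    """Yield the active sub-task set for each epoch."""
    for epoch in range(n_epochs):
        yield epoch, subtasks[: min(epoch + 1, len(subtasks))]

subtasks = ["api_lookup", "argument_filling", "call_execution", "result_parsing"]
for epoch, active in progressive_subtask_schedule(subtasks, n_epochs=6):
    # Each epoch trains only on trajectories exercising the active sub-tasks.
    print(epoch, active)
```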

[43] Quantized Large Language Models in Biomedical Natural Language Processing: Evaluation and Recommendation

Zaifu Zhan, Shuang Zhou, Min Zeng, Kai Yu, Meijia Song, Xiaoyi Chen, Jun Wang, Yu Hou, Rui Zhang

Main category: cs.CL

TL;DR: Quantization enables deployment of large language models on consumer GPUs with 75% memory reduction while maintaining performance across biomedical NLP tasks.

DetailsMotivation: Large language models face adoption barriers in healthcare due to size, computational requirements, and data privacy concerns that preclude cloud deployment.

Method: Systematically evaluated quantization impact on 12 state-of-the-art LLMs (general and biomedical-specific) across 8 benchmark datasets covering 4 key biomedical NLP tasks: named entity recognition, relation extraction, multi-label classification, and question answering.

Result: Quantization reduces GPU memory requirements by up to 75% while preserving model performance, enabling 70B-parameter models on 40GB consumer GPUs. Domain knowledge and prompt responsiveness are largely maintained.

Conclusion: Quantization is a practical and effective strategy for secure local deployment of large language models in biomedical contexts, bridging the gap between AI advances and clinical translation.

Abstract: Large language models have demonstrated remarkable capabilities in biomedical natural language processing, yet their rapid growth in size and computational requirements presents a major barrier to adoption in healthcare settings where data privacy precludes cloud deployment and resources are limited. In this study, we systematically evaluated the impact of quantization on 12 state-of-the-art large language models, including both general-purpose and biomedical-specific models, across eight benchmark datasets covering four key tasks: named entity recognition, relation extraction, multi-label classification, and question answering. We show that quantization substantially reduces GPU memory requirements, by up to 75%, while preserving model performance across diverse tasks, enabling the deployment of 70B-parameter models on 40GB consumer-grade GPUs. In addition, domain-specific knowledge and responsiveness to advanced prompting methods are largely maintained. These findings provide significant practical and guiding value, highlighting quantization as a practical and effective strategy for enabling the secure, local deployment of large yet high-capacity language models in biomedical contexts, bridging the gap between technical advances in AI and real-world clinical translation.
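
As a concrete deployment sketch, 4-bit NF4 loading via Hugging Face transformers and bitsandbytes is one way to realize the reported footprint; the checkpoint name is a placeholder, and the paper does not prescribe this exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit weights are ~0.5 bytes/parameter, so a 70B model needs roughly 35 GB
# (versus ~140 GB in fp16), in line with the reported ~75% reduction.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_id = "meta-llama/Llama-3.1-70B-Instruct"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```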

[44] Combine Virtual Reality and Machine-Learning to Identify the Presence of Dyslexia: A Cross-Linguistic Approach

Michele Materazzini, Gianluca Morciano, Jose Manuel Alcalde-Llergo, Enrique Yeguas-Bolivar, Giuseppe Calabro, Andrea Zingoni, Juri Taborri

Main category: cs.CL

TL;DR: VR and AI used to predict dyslexia in Italian/Spanish students through silent reading tests and self-esteem assessments, achieving 87.5% accuracy for Italian and 66.6% for Spanish using machine learning.

DetailsMotivation: To investigate whether VR-derived data from reading tests and self-esteem assessments can differentiate between students with and without dyslexia using machine learning algorithms.

Method: Participants completed VR-based tasks measuring reading performance and self-esteem. Preliminary statistical analysis (t-tests and Mann-Whitney tests) compared scores, followed by training and testing supervised ML models on the data.

Result: Significant differences found in completion time for silent reading tests (but not accuracy or self-esteem). ML models achieved 87.5% accuracy for Italian, 66.6% for Spanish, and 75.0% for pooled group classification.

Conclusion: VR and ML can be effective supporting tools for dyslexia assessment, particularly capturing differences in task completion speed, though language-specific factors influence classification accuracy.

Abstract: This study explores the use of virtual reality (VR) and artificial intelligence (AI) to predict the presence of dyslexia in Italian and Spanish university students. In particular, the research investigates whether VR-derived data from Silent Reading (SR) tests and self-esteem assessments can differentiate between students with and without dyslexia, employing machine learning (ML) algorithms. Participants completed VR-based tasks measuring reading performance and self-esteem. A preliminary statistical analysis (t-tests and Mann-Whitney tests) was performed on these data to compare scores between individuals with and without dyslexia, revealing significant differences in completion time for the SR test, but not in accuracy or self-esteem. Supervised ML models were then trained and tested, classifying the presence or absence of dyslexia with an accuracy of 87.5% for Italian, 66.6% for Spanish, and 75.0% for the pooled group. These findings suggest that VR and ML can be effective supporting tools for assessing dyslexia, particularly by capturing differences in task completion speed, though language-specific factors may influence classification accuracy.

[45] Scaling behavior of large language models in emotional safety classification across sizes and tasks

Edoardo Pinzuti, Oliver Tüscher, André Ferreira Castro

Main category: cs.CL

TL;DR: LLMs show better emotional safety classification with scale, but small models can match performance with fine-tuning, enabling privacy-preserving on-device applications.

DetailsMotivation: Understanding how LLMs handle emotionally sensitive content is crucial for building safe systems, especially in mental health contexts where safety and reliability are paramount.

Method: Constructed novel dataset from mental health datasets (>15K samples) with ChatGPT-generated emotion reinterpretation prompts. Evaluated four LLaMA models (1B-70B) across zero-shot, few-shot, and fine-tuning settings on trinary and multi-label safety classification tasks.

Result: Larger LLMs perform better on average, especially in nuanced multi-label classification and zero-shot settings. However, lightweight fine-tuning enabled the 1B model to achieve comparable performance to larger models and BERT in high-data categories while requiring <2GB VRAM.

Conclusion: Smaller on-device models can serve as viable, privacy-preserving alternatives for sensitive applications, offering emotional context interpretation and safe conversational boundaries, with implications for therapeutic LLM applications and scalable safety alignment.

Abstract: Understanding how large language models (LLMs) process emotionally sensitive content is critical for building safe and reliable systems, particularly in mental health contexts. We investigate the scaling behavior of LLMs on two key tasks: trinary classification of emotional safety (safe vs. unsafe vs. borderline) and multi-label classification using a six-category safety risk taxonomy. To support this, we construct a novel dataset by merging several human-authored mental health datasets (> 15K samples) and augmenting them with emotion re-interpretation prompts generated via ChatGPT. We evaluate four LLaMA models (1B, 3B, 8B, 70B) across zero-shot, few-shot, and fine-tuning settings. Our results show that larger LLMs achieve stronger average performance, particularly in nuanced multi-label classification and in zero-shot settings. However, lightweight fine-tuning allowed the 1B model to achieve performance comparable to larger models and BERT in several high-data categories, while requiring <2GB VRAM at inference. These findings suggest that smaller, on-device models can serve as viable, privacy-preserving alternatives for sensitive applications, offering the ability to interpret emotional context and maintain safe conversational boundaries. This work highlights key implications for therapeutic LLM applications and the scalable alignment of safety-critical systems.

[46] Manipulating Transformer-Based Models: Controllability, Steerability, and Robust Interventions

Faruk Alpay, Taylan Alpay

Main category: cs.CL

TL;DR: This paper presents a unified framework for controlling transformer language models through interventions at prompt, activation, and weight levels, achieving high success rates in sentiment control and factual edits while maintaining base performance.

DetailsMotivation: Transformer models excel at NLP tasks but lack fine-grained control, making it challenging to precisely manipulate their behavior for specific applications.

Method: The authors formalize controllable text generation as an optimization problem and introduce interventions at three levels: prompt engineering, activation interventions, and weight-space edits, using techniques like parameter-efficient fine-tuning and reinforcement learning.

Result: Empirical results show >90% success in sentiment control and factual edits while preserving base model performance, though trade-offs between generalization and specificity exist. Theoretically, minimal weight updates can achieve targeted behavior changes.

Conclusion: The work establishes a foundation for designing controllable and robust language models, but highlights ethical dual-use risks and the need for rigorous evaluation frameworks to ensure safety.

Abstract: Transformer-based language models excel in NLP tasks, but fine-grained control remains challenging. This paper explores methods for manipulating transformer models through principled interventions at three levels: prompts, activations, and weights. We formalize controllable text generation as an optimization problem addressable via prompt engineering, parameter-efficient fine-tuning, model editing, and reinforcement learning. We introduce a unified framework encompassing prompt-level steering, activation interventions, and weight-space edits. We analyze robustness and safety implications, including adversarial attacks and alignment mitigations. Theoretically, we show minimal weight updates can achieve targeted behavior changes with limited side-effects. Empirically, we demonstrate >90% success in sentiment control and factual edits while preserving base performance, though generalization-specificity trade-offs exist. We discuss ethical dual-use risks and the need for rigorous evaluation. This work lays groundwork for designing controllable and robust language models.
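
Of the three intervention levels, activation-level steering is the easiest to show in isolation. Below is a toy PyTorch sketch, with a made-up model and steering vector, of how a forward hook can shift one layer's activations at inference time; it illustrates the general technique, not the paper's specific edits.

```python
# A forward hook that adds a fixed "steering" vector to one layer's output.
# Model, dimensions, and vector are placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
steer = 0.5 * torch.randn(8)              # hypothetical steering direction

def add_steering(module, inputs, output):
    return output + steer                  # returned value replaces the output

handle = model[0].register_forward_hook(add_steering)
x = torch.randn(1, 8)
steered_logits = model(x)
handle.remove()                            # detach hook to restore base model
baseline_logits = model(x)
print(steered_logits, baseline_logits)
```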

[47] Artificially Fluent: Swahili AI Performance Benchmarks Between English-Trained and Natively-Trained Datasets

Sophie Jaffer, Simeon Sayer

Main category: cs.CL

TL;DR: A natively trained Swahili BERT model produced nearly 4x fewer errors than an English-trained model evaluated on translated Swahili inputs, showing translation alone doesn’t bridge language representation gaps.

DetailsMotivation: To test whether data disparities in multilingual LLMs disadvantage non-English speakers and examine if translation can bridge performance gaps between languages.

Method: Compared two monolingual BERT models: one trained/tested on native Swahili news data, and another on English data with translated Swahili inputs to simulate multilingual model processing.

Result: Native Swahili model performed significantly better with 0.36% error rate vs 1.47% for translated inputs on English model, despite high-quality translation.

Conclusion: Native-language training remains crucial for reliable outcomes; translation alone cannot overcome representational differences, highlighting need for better multilingual dataset development and evaluation.

Abstract: As large language models (LLMs) expand multilingual capabilities, questions remain about the equity of their performance across languages. While many communities stand to benefit from AI systems, the dominance of English in training data risks disadvantaging non-English speakers. To test the hypothesis that such data disparities may affect model performance, this study compares two monolingual BERT models: one trained and tested entirely on Swahili data, and another on comparable English news data. To simulate how multilingual LLMs process non-English queries through internal translation and abstraction, we translated the Swahili news data into English and evaluated it using the English-trained model. This approach tests the hypothesis by evaluating whether translating Swahili inputs for evaluation on an English model yields better or worse performance compared to training and testing a model entirely in Swahili, thus isolating the effect of language consistency versus cross-lingual abstraction. The results show that, despite high-quality translation, the native Swahili-trained model performed better than the Swahili-to-English translated model, producing nearly four times fewer errors: 0.36% vs. 1.47% respectively. This gap suggests that translation alone does not bridge representational differences between languages and that models trained in one language may struggle to accurately interpret translated inputs due to imperfect internal knowledge representation, suggesting that native-language training remains important for reliable outcomes. In educational and informational contexts, even small performance gaps may compound inequality. Future research should focus on broader dataset development for underrepresented languages and on renewed attention to multilingual model evaluation, so that global AI deployment does not reinforce existing digital divides.

[48] Sample-efficient Integration of New Modalities into Large Language Models

Osman Batur İnce, André F. T. Martins, Oisin Mac Aodha, Edoardo M. Ponti

Main category: cs.CL

TL;DR: SEMI enables sample-efficient integration of new modalities into LLMs using a hypernetwork trained on high-resource modalities, achieving 64x data efficiency compared to training from scratch.

DetailsMotivation: Current multimodal foundation models struggle with integrating new modalities due to the large and evolving modality space and the need for extensive paired data, especially for low-resource modalities.

Method: Uses a hypernetwork trained on high-resource modalities (text, speech, audio, video) that generates adapters for any modality using few samples at inference. Increases training diversity through isometric transformations of encoders.

Result: Achieves significant sample efficiency improvements - 32-shot SEMI performs as well as training from scratch with 64x more data. Successfully integrated satellite images, astronomical images, inertial measurements, and molecules.

Conclusion: SEMI provides an effective approach to extend modality coverage of foundation models with minimal data requirements, making multimodal integration more feasible for low-resource modalities.

Abstract: Multimodal foundation models can process several modalities. However, since the space of possible modalities is large and evolving over time, training a model from scratch to encompass all modalities is infeasible. Moreover, integrating a modality into a pre-existing foundation model currently requires a significant amount of paired data, which is often not available for low-resource modalities. In this paper, we introduce a method for sample-efficient modality integration (SEMI) into Large Language Models (LLMs). To this end, we devise a hypernetwork that can adapt a shared projector – placed between modality-specific encoders and an LLM – to any modality. The hypernetwork, trained on high-resource modalities (i.e., text, speech, audio, video), is conditioned on a few samples from any arbitrary modality at inference time to generate a suitable adapter. To increase the diversity of training modalities, we artificially multiply the number of encoders through isometric transformations. We find that SEMI achieves a significant boost in sample efficiency during few-shot integration of new modalities (i.e., satellite images, astronomical images, inertial measurements, and molecules) with encoders of arbitrary embedding dimensionality. For instance, to reach the same accuracy as 32-shot SEMI, training the projector from scratch needs 64$\times$ more data. As a result, SEMI holds promise to extend the modality coverage of foundation models.
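
A toy rendering of the core mechanism, assuming made-up dimensions: a hypernetwork conditioned on the mean of a few new-modality embeddings emits the weights of a projector into the LLM's embedding space. This is a sketch of the idea, not SEMI's implementation.

```python
# Hypernetwork-generated adapter: condition on few-shot encoder embeddings,
# output a projector weight matrix. All sizes are illustrative.
import torch
import torch.nn as nn

enc_dim, llm_dim, hyper_hidden = 16, 32, 64

class HyperAdapter(nn.Module):
    def __init__(self):
        super().__init__()
        self.hyper = nn.Sequential(
            nn.Linear(enc_dim, hyper_hidden), nn.ReLU(),
            nn.Linear(hyper_hidden, enc_dim * llm_dim),
        )

    def forward(self, shots, x):
        # Condition on the mean of the few-shot embeddings, then
        # project new inputs x with the generated weight matrix.
        w = self.hyper(shots.mean(dim=0)).view(llm_dim, enc_dim)
        return x @ w.T

adapter = HyperAdapter()
shots = torch.randn(32, enc_dim)   # a 32-shot sample from a new modality
x = torch.randn(4, enc_dim)        # new inputs to project into the LLM
print(adapter(shots, x).shape)     # torch.Size([4, 32])
```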

[49] Analysis of Voluntarily Reported Data Post Mesh Implantation for Detecting Public Emotion and Identifying Concern Reports

Indu Bala, Lewis Mitchell, Marianne H Gillam

Main category: cs.CL

TL;DR: NLP analysis of MAUDE database patient reports reveals emotional patterns and increased concern reports during 2011-2012 and 2017-2018 periods following mesh implant surgeries.

DetailsMotivation: To investigate emotional aspects and sentiment patterns in patient experiences following mesh implantation surgeries, as postoperative complications remain a significant concern in hernia repair.

Method: Used Natural Language Processing (NLP) with NRC Emotion Lexicon and TextBlob to analyze patient reports from MAUDE database (2000-2021), categorizing narratives into eight emotions and assessing sentiment polarity.

Result: Detected increase in Concern Reports and higher emotional intensity during 2011-2012 and 2017-2018 periods, revealing temporal patterns in patient sentiment.

Conclusion: Emotional considerations are crucial in medical practices, and sentiment analysis can enhance preoperative counseling, postoperative care, and patient preparation for mesh implant surgeries.

Abstract: Mesh implants are widely utilized in hernia repair surgeries, but postoperative complications present a significant concern. This study analyzes patient reports from the Manufacturer and User Facility Device Experience (MAUDE) database spanning 2000 to 2021 to investigate the emotional aspects of patients following mesh implantation using Natural Language Processing (NLP). Employing the National Research Council Canada (NRC) Emotion Lexicon and TextBlob for sentiment analysis, the research categorizes patient narratives into eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and assesses sentiment polarity. The goal is to discern patterns in patient sentiment over time and to identify reports signaling urgent concerns, referred to as “Concern Reports,” thereby understanding shifts in patient experiences in relation to changes in medical device regulation and technological advancements in healthcare. The study detected an increase in Concern Reports and higher emotional intensity during the periods of 2011-2012 and 2017-2018. Through temporal analysis of Concern Reports and overall sentiment, this research provides valuable insights for healthcare practitioners, enhancing their understanding of patient experiences post-surgery, which is critical for improving preoperative counselling, postoperative care, and preparing patients for mesh implant surgeries. The study underscores the importance of emotional considerations in medical practices and the potential for sentiment analysis to inform and enhance patient care.
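
The scoring step is straightforward to sketch. TextBlob's `sentiment.polarity` is a real API; the tiny emotion lexicon below is a hypothetical, heavily abridged stand-in for the NRC Emotion Lexicon the study actually uses.

```python
# Score a patient narrative: count lexicon emotion hits, compute polarity.
from collections import Counter
from textblob import TextBlob

TOY_NRC = {                     # hypothetical mini-lexicon, not the real NRC
    "pain": ["fear", "sadness"],
    "trust": ["trust"],
    "failed": ["anger", "sadness", "disgust"],
    "relief": ["joy", "anticipation"],
}

def score_report(text):
    emotions = Counter()
    for word in text.lower().split():
        emotions.update(TOY_NRC.get(word.strip(".,"), []))
    polarity = TextBlob(text).sentiment.polarity   # in [-1, 1]
    return dict(emotions), polarity

print(score_report("The mesh failed and the pain returned."))
```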

[50] Advancing SLM Tool-Use Capability using Reinforcement Learning

Dhruvi Paprunia, Vansh Kharidia, Pankti Doshi

Main category: cs.CL

TL;DR: Using Reinforcement Learning (GRPO) to improve tool-use capabilities in Small Language Models, making them more efficient and practical for real-world applications.

DetailsMotivation: Small Language Models struggle with tool use compared to Large Language Models due to limited training data and contextual understanding, creating a need for more efficient solutions.

Method: Employed Reinforcement Learning, specifically Group Relative Policy Optimization (GRPO), to enhance tool-use proficiency in SLMs without heavy computation requirements.

Result: Significantly boosted SLM tool-use accuracy, providing an efficient and effective solution that increases their practical utility.

Conclusion: The RL-based approach offers a superior alternative to conventional fine-tuning, making SLMs more capable of handling complex tool-use tasks in resource-constrained environments.

Abstract: Large Language Models (LLMs) have progressed beyond simple text creation, and tool use has become increasingly important for complex, real-world tasks. Tool use in LLMs refers to their ability to utilize external resources such as APIs, databases, or software functions to extend their functionality beyond generating text. Tools are used for tasks such as performing calculations, making API calls to retrieve the current time and date, and more. This capability enables models to fetch real-time data, execute commands, or solve problems requiring dynamic interaction, making it indispensable for applications like AI agents in virtual assistants, robotic control, or automated workflows. However, while LLMs are usually adept at tool use, their vast resource requirements and computational complexity restrict their use in many settings. As a result, there is an increasing need for more compact and efficient Small Language Models (SLMs). Yet SLMs struggle with tool use compared to LLMs, as shown in Table 1: they are typically trained on smaller, more specific datasets, resulting in a narrower knowledge base and limited contextual understanding. This research addresses these challenges by using Reinforcement Learning (RL), specifically Group Relative Policy Optimization (GRPO), to enhance tool-use proficiency in SLMs. Unlike conventional fine-tuning approaches that require heavy computation and often lack adaptability, our method provides an efficient, effective solution that significantly boosts SLM tool-use accuracy, increasing their practical utility.
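
The group-relative step that gives GRPO its name is compact enough to show directly: rewards for completions sampled from the same prompt are normalized within the group, removing the need for a learned value baseline. The reward values below are invented tool-use scores, not the paper's data.

```python
# Group-relative advantages, the core quantity in GRPO.
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# e.g., 1.0 = correct tool call, 0.5 = right tool / wrong args, 0.0 = miss
rewards = [1.0, 0.0, 0.5, 1.0]
print(group_relative_advantages(rewards))
# Completions with positive advantages are upweighted in the policy update.
```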

[51] Hierarchical Section Matching Prediction (HSMP) BERT for Fine-Grained Extraction of Structured Data from Hebrew Free-Text Radiology Reports in Crohn’s Disease

Zvi Badash, Hadas Ben-Atya, Naama Gavrielov, Liam Hazan, Gili Focht, Ruth Cytter-Kuint, Talar Hagopian, Dan Turner, Moti Freiman

Main category: cs.CL

TL;DR: HSMP-BERT is a prompt-based model for extracting structured clinical information from Hebrew radiology reports, specifically for Crohn’s disease analysis. It significantly outperforms baseline methods with mean F1 score of 0.83 and enables efficient population-level analysis.

DetailsMotivation: Extracting structured clinical information from radiology reports is challenging, especially in low-resource languages like Hebrew and for complex conditions like Crohn's disease with multi-organ findings.

Method: Developed Hierarchical Structured Matching Prediction BERT (HSMP-BERT), a prompt-based model trained on 512 radiologist-annotated Hebrew radiology reports with 90 structured labels per subject, using multilabel-stratified split and hierarchical inference.

Result: HSMP-BERT achieved mean F1 0.83±0.08 and κ 0.65±0.17, significantly outperforming baseline methods (p < 10⁻⁷). Hierarchical inference reduced runtime 5.1×. Applied to all 9,683 reports, it revealed clinical associations and trends.

Conclusion: HSMP-BERT provides a scalable solution for structured extraction in radiology, enabling population-level Crohn’s disease analysis and demonstrating AI’s potential in low-resource language settings.

Abstract: Extracting structured clinical information from radiology reports is challenging, especially in low-resource languages. This is pronounced in Crohn’s disease, with sparsely represented multi-organ findings. We developed Hierarchical Structured Matching Prediction BERT (HSMP-BERT), a prompt-based model for extraction from Hebrew radiology text. In an administrative database study, we analyzed 9,683 reports from Crohn’s patients imaged 2010-2023 across Israeli providers. A subset of 512 reports was radiologist-annotated for findings across six gastrointestinal organs and 15 pathologies, yielding 90 structured labels per subject. A multilabel-stratified split (66% train+validation; 33% test) preserved label prevalence. Performance was evaluated with accuracy, F1, Cohen’s $\kappa$, AUC, PPV, NPV, and recall. On 24 organ-finding combinations with $>$15 positives, HSMP-BERT achieved mean F1 0.83$\pm$0.08 and $\kappa$ 0.65$\pm$0.17, outperforming the SMP zero-shot baseline (F1 0.49$\pm$0.07, $\kappa$ 0.06$\pm$0.07) and standard fine-tuning (F1 0.30$\pm$0.27, $\kappa$ 0.27$\pm$0.34; paired t-test $p < 10^{-7}$). Hierarchical inference cuts runtime 5.1$\times$ vs. traditional inference. Applied to all reports, it revealed associations among ileal wall thickening, stenosis, and pre-stenotic dilatation, plus age- and sex-specific trends in inflammatory findings. HSMP-BERT offers a scalable solution for structured extraction in radiology, enabling population-level analysis of Crohn’s disease and demonstrating AI’s potential in low-resource settings.
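
A rough sketch of why hierarchical inference saves compute: query the model once per organ, then ask about individual pathologies only for organs flagged positive. A keyword heuristic stands in for HSMP-BERT's prompt-based predictions, and only 3 of the 15 pathologies are listed.

```python
ORGANS = ["ileum", "colon", "jejunum", "duodenum", "stomach", "rectum"]
PATHOLOGIES = ["wall thickening", "stenosis", "dilatation"]  # abridged

def classify(report, organ, finding=None):
    # Toy stand-in for an HSMP-BERT prompt call: keyword match on the report.
    target = organ if finding is None else finding
    return target in report.lower()

def hierarchical_extract(report):
    labels = {}
    for organ in ORGANS:                      # one organ-level call each
        if classify(report, organ):           # expand positive organs only
            for finding in PATHOLOGIES:
                labels[(organ, finding)] = classify(report, organ, finding)
    return labels

print(hierarchical_extract("Ileum shows wall thickening and mild stenosis."))
# Flat inference needs len(ORGANS) * len(PATHOLOGIES) calls per report; since
# most organs are negative, the hierarchy yields the reported ~5x speedup.
```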

[52] Using LLMs to create analytical datasets: A case study of reconstructing the historical memory of Colombia

David Anderson, Galia Benitez, Margret Bjarnadottir, Shriyan Reyya

Main category: cs.CL

TL;DR: Using GPT to analyze 200,000+ Spanish newspaper articles about Colombian armed conflict to build historical memory and study violence-coca eradication relationships.

DetailsMotivation: Colombia lacks systematic documentation of decades of armed conflict, resulting in missing historical accounts and publicly available conflict information.

Method: Utilize GPT large language model to read and answer questions about violence-related newspaper articles in Spanish, then conduct descriptive analysis and study violence-coca crop eradication relationships.

Result: Created a comprehensive dataset from newspaper analysis, enabling policy-relevant research on conflict dynamics that was previously infeasible.

Conclusion: LLMs enable new research opportunities by allowing deep examination of large text corpora, contributing to historical memory and supporting policy analysis in conflict studies.

Abstract: Colombia has been submerged in decades of armed conflict, yet until recently, the systematic documentation of violence was not a priority for the Colombian government. This has resulted in a lack of publicly available conflict information and, consequently, a lack of historical accounts. This study contributes to Colombia’s historical memory by utilizing GPT, a large language model (LLM), to read and answer questions about over 200,000 violence-related newspaper articles in Spanish. We use the resulting dataset to conduct both descriptive analysis and a study of the relationship between violence and the eradication of coca crops, offering an example of policy analyses that such data can support. Our study demonstrates how LLMs have opened new research opportunities by enabling examinations of large text corpora at a previously infeasible depth.

[53] Comparative Analysis of Transformer Models in Disaster Tweet Classification for Public Safety

Sharif Noor Zisad, Ragib Hasan

Main category: cs.CL

TL;DR: Transformer models (BERT, DistilBERT, RoBERTa, DeBERTa) significantly outperform traditional ML models for disaster tweet classification, with BERT achieving 91% accuracy vs 82% for Logistic Regression/Naive Bayes.

DetailsMotivation: Social media platforms provide real-time disaster information, but traditional ML models fail to understand context and informal language in tweets, limiting emergency response effectiveness.

Method: Evaluated transformer-based models (BERT, DistilBERT, RoBERTa, DeBERTa) against traditional ML approaches (Logistic Regression, Naive Bayes, SVM) for classifying disaster-related tweets.

Result: BERT achieved highest accuracy at 91%, significantly outperforming traditional models (82% for Logistic Regression/Naive Bayes). Transformer models better understand subtle language through contextual embeddings and attention mechanisms.

Conclusion: Transformer architectures are far more suitable for public safety applications, offering improved accuracy, deeper language understanding, and better generalization across real-world social media text.

Abstract: Twitter and other social media platforms have become vital sources of real-time information during disasters and public safety emergencies. Automatically classifying disaster-related tweets can help emergency services respond faster and more effectively. Traditional Machine Learning (ML) models such as Logistic Regression, Naive Bayes, and Support Vector Machines have been widely used for this task, but they often fail to understand the context or deeper meaning of words, especially when the language is informal, metaphorical, or ambiguous. We posit that, in this context, transformer-based models can perform better than traditional ML models. In this paper, we evaluate the effectiveness of transformer-based models, including BERT, DistilBERT, RoBERTa, and DeBERTa, for classifying disaster-related tweets. These models are compared with traditional ML approaches to highlight the performance gap. Experimental results show that BERT achieved the highest accuracy (91%), significantly outperforming traditional models like Logistic Regression and Naive Bayes (both at 82%). The use of contextual embeddings and attention mechanisms allows transformer models to better understand subtle language in tweets, where traditional ML models fall short. This research demonstrates that transformer architectures are far more suitable for public safety applications, offering improved accuracy, deeper language understanding, and better generalization across real-world social media text.

[54] Spoken in Jest, Detected in Earnest: A Systematic Review of Sarcasm Recognition – Multimodal Fusion, Challenges, and Future Prospects

Xiyuan Gao, Shekhar Nayak, Matt Coler

Main category: cs.CL

TL;DR: This systematic review analyzes speech-based sarcasm recognition, covering datasets, feature extraction, and classification methods from unimodal to multimodal approaches.

DetailsMotivation: Sarcasm poses challenges in human-machine interactions, and while text-based detection has been studied, speech data for sarcasm recognition remains underexplored despite prosodic cues being crucial for conveying sarcastic intent.

Method: The paper conducts a systematic review focusing on speech-based sarcasm recognition, examining the evolution from unimodal to multimodal approaches, including datasets, feature extraction techniques (from traditional acoustic features to deep learning representations), and classification methods.

Result: Findings reveal limitations in sarcasm recognition datasets, evolution of feature extraction techniques, and progression from unimodal to multimodal fusion classification methods.

Conclusion: The review identifies the need for greater emphasis on cross-cultural and multilingual sarcasm recognition and highlights the importance of treating sarcasm as a multimodal phenomenon rather than just a text-based challenge.

Abstract: Sarcasm, a common feature of human communication, poses challenges in interpersonal interactions and human-machine interactions. Linguistic research has highlighted the importance of prosodic cues, such as variations in pitch, speaking rate, and intonation, in conveying sarcastic intent. Although previous work has focused on text-based sarcasm detection, the role of speech data in recognizing sarcasm has been underexplored. Recent advancements in speech technology emphasize the growing importance of leveraging speech data for automatic sarcasm recognition, which can enhance social interactions for individuals with neurodegenerative conditions and improve machine understanding of complex human language use, leading to more nuanced interactions. This systematic review is the first to focus on speech-based sarcasm recognition, charting the evolution from unimodal to multimodal approaches. It covers datasets, feature extraction, and classification methods, and aims to bridge gaps across diverse research domains. The findings include limitations in datasets for sarcasm recognition in speech, the evolution of feature extraction techniques from traditional acoustic features to deep learning-based representations, and the progression of classification methods from unimodal approaches to multimodal fusion techniques. In so doing, we identify the need for greater emphasis on cross-cultural and multilingual sarcasm recognition, as well as the importance of addressing sarcasm as a multimodal phenomenon, rather than a text-based challenge.

[55] Polysemantic Dropout: Conformal OOD Detection for Specialized LLMs

Ayush Gupta, Ramneet Kaur, Anirban Roy, Adam D. Cobb, Rama Chellappa, Susmit Jha

Main category: cs.CL

TL;DR: Novel inference-time OOD detection method for specialized LLMs using dropout tolerance and conformal anomaly detection framework, achieving significant AUROC improvements over baselines.

DetailsMotivation: Specialized LLMs fine-tuned for specific domains remain vulnerable to incorrect outputs when presented with out-of-domain inputs, posing risks in critical applications like healthcare.

Method: Leverages Inductive Conformal Anomaly Detection (ICAD) with a new non-conformity measure based on model’s dropout tolerance, aggregating across multiple layers via valid ensemble approach while maintaining theoretical false alarm bounds.

Result: Experiments with medical-specialized LLMs show AUROC improvements of 2% to 37% over baseline methods when detecting OOD inputs.

Conclusion: The proposed dropout tolerance-based approach effectively detects out-of-domain inputs for specialized LLMs, providing reliable OOD detection with theoretical guarantees while significantly outperforming existing methods.

Abstract: We propose a novel inference-time out-of-domain (OOD) detection algorithm for specialized large language models (LLMs). Despite achieving state-of-the-art performance on in-domain tasks through fine-tuning, specialized LLMs remain vulnerable to incorrect or unreliable outputs when presented with OOD inputs, posing risks in critical applications. Our method leverages the Inductive Conformal Anomaly Detection (ICAD) framework, using a new non-conformity measure based on the model’s dropout tolerance. Motivated by recent findings on polysemanticity and redundancy in LLMs, we hypothesize that in-domain inputs exhibit higher dropout tolerance than OOD inputs. We aggregate dropout tolerance across multiple layers via a valid ensemble approach, improving detection while maintaining theoretical false alarm bounds from ICAD. Experiments with medical-specialized LLMs show that our approach detects OOD inputs better than baseline methods, with AUROC improvements of 2% to 37% when treating OOD datapoints as positives and in-domain test datapoints as negatives.
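
The ICAD decision rule itself is a one-liner once a non-conformity score is chosen. Below is a sketch that treats (negated) dropout tolerance as that score, with invented calibration values; the paper's actual tolerance measure and layer ensemble are not reproduced.

```python
# Conformal p-value for a test input given in-domain calibration scores.
import numpy as np

def icad_p_value(calibration_tolerances, test_tolerance):
    # Lower dropout tolerance = more anomalous, so negate tolerance to get
    # a non-conformity score where larger means more OOD-like.
    cal = -np.asarray(calibration_tolerances, dtype=float)
    nonconformity = -test_tolerance
    return (np.sum(cal >= nonconformity) + 1) / (len(cal) + 1)

calibration = [0.82, 0.91, 0.77, 0.88, 0.85, 0.80]   # made-up tolerances
print(icad_p_value(calibration, test_tolerance=0.86))  # high p: in-domain
print(icad_p_value(calibration, test_tolerance=0.35))  # low p: likely OOD
# Flag OOD when the p-value falls below alpha; ICAD bounds the false-alarm
# rate at alpha by construction.
```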

[56] Breaking to Build: A Threat Model of Prompt-Based Attacks for Securing LLMs

Brennen Hill, Surendra Parla, Venkata Abhijeeth Balabhadruni, Atharv Prajod Padmalayam, Sujay Chandra Shekara Sharma

Main category: cs.CL

TL;DR: Survey paper on prompt-based attacks against LLMs, categorizing attack methodologies and providing threat models to help develop secure LLM defenses.

DetailsMotivation: The proliferation of LLMs has created security vulnerabilities where adversarial prompts can bypass safety alignments, leading to IP theft, misinformation, and erosion of trust. A systematic understanding is needed to develop countermeasures.

Method: Comprehensive literature survey that categorizes prompt-based attack methodologies and provides clear threat models detailing mechanisms and impacts of these exploits.

Result: The survey systematically organizes and classifies various prompt-based attack vectors against LLMs, enabling better understanding of vulnerabilities.

Conclusion: This foundational work informs research efforts to build next-generation secure LLMs that are inherently resistant to unauthorized distillation, fine-tuning, and editing through prompt-based attacks.

Abstract: The proliferation of Large Language Models (LLMs) has introduced critical security challenges, where adversarial actors can manipulate input prompts to cause significant harm and circumvent safety alignments. These prompt-based attacks exploit vulnerabilities in a model’s design, training, and contextual understanding, leading to intellectual property theft, misinformation generation, and erosion of user trust. A systematic understanding of these attack vectors is the foundational step toward developing robust countermeasures. This paper presents a comprehensive literature survey of prompt-based attack methodologies, categorizing them to provide a clear threat model. By detailing the mechanisms and impacts of these exploits, this survey aims to inform the research community’s efforts in building the next generation of secure LLMs that are inherently resistant to unauthorized distillation, fine-tuning, and editing.

[57] Evaluating NL2SQL via SQL2NL

Mohammadtaher Safarzadeh, Afshin Oroojlooyjadid, Dan Roth

Main category: cs.CL

TL;DR: Proposes schema-aligned paraphrasing framework using SQL2NL to evaluate NL2SQL model robustness to linguistic variation, revealing significant performance drops in state-of-the-art models.

DetailsMotivation: Existing NL2SQL benchmarks lack systematic evaluation of linguistic variation robustness, which is crucial for understanding model generalization capabilities in real-world scenarios.

Method: Developed a novel schema-aligned paraphrasing framework that leverages SQL-to-NL generation to automatically create semantically equivalent but lexically diverse queries while maintaining schema and intent alignment.

Result: State-of-the-art models show significant brittleness - LLaMa3.3-70B drops 10.23% in execution accuracy, LLaMa3.1-8B drops nearly 20%, with smaller models disproportionately affected. Performance degradation varies by query complexity, dataset, and domain.

Conclusion: NL2SQL models are more fragile than standard benchmarks suggest, highlighting the need for evaluation frameworks that explicitly measure linguistic generalization to ensure reliable real-world performance.

Abstract: Robust evaluation in the presence of linguistic variation is key to understanding the generalization capabilities of Natural Language to SQL (NL2SQL) models, yet existing benchmarks rarely address this factor in a systematic or controlled manner. We propose a novel schema-aligned paraphrasing framework that leverages SQL-to-NL (SQL2NL) to automatically generate semantically equivalent, lexically diverse queries while maintaining alignment with the original schema and intent. This enables the first targeted evaluation of NL2SQL robustness to linguistic variation in isolation, distinct from prior work that primarily investigates ambiguity or schema perturbations. Our analysis reveals that state-of-the-art models are far more brittle than standard benchmarks suggest. For example, LLaMa3.3-70B exhibits a 10.23% drop in execution accuracy (from 77.11% to 66.9%) on paraphrased Spider queries, while LLaMa3.1-8B suffers an even larger drop of nearly 20% (from 62.9% to 42.5%). Smaller models (e.g., GPT-4o mini) are disproportionately affected. We also find that robustness degradation varies significantly with query complexity, dataset, and domain, highlighting the need for evaluation frameworks that explicitly measure linguistic generalization to ensure reliable performance in real-world settings.
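
Execution accuracy, the metric behind the reported drops, compares result sets rather than SQL strings. A self-contained sketch with a toy SQLite schema (not the Spider data):

```python
# A prediction counts as correct if it returns the same rows as the gold SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (name TEXT, age INTEGER);
    INSERT INTO singer VALUES ('Ana', 31), ('Bo', 24), ('Cy', 45);
""")

def execution_match(gold_sql, pred_sql):
    gold = sorted(conn.execute(gold_sql).fetchall())
    try:
        pred = sorted(conn.execute(pred_sql).fetchall())
    except sqlite3.Error:          # malformed prediction counts as a miss
        return False
    return gold == pred

gold = "SELECT name FROM singer WHERE age > 30"
# Hypothetical predictions for the paraphrase "Which performers are over thirty?"
print(execution_match(gold, "SELECT name FROM singer WHERE age >= 31"))  # True
print(execution_match(gold, "SELECT name FROM singer WHERE age > 40"))   # False
```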

[58] AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs

Aisha Alansari, Hamzah Luqman

Main category: cs.CL

TL;DR: First comprehensive evaluation of hallucination in Arabic and multilingual LLMs on Arabic question answering and summarization tasks, revealing Arabic pre-trained models outperform multilingual ones.

DetailsMotivation: Research on LLM hallucination has focused mainly on English, leaving Arabic underexplored despite its widespread use and importance in global communication.

Method: Evaluated 12 LLMs (4 Arabic pre-trained, 4 multilingual, 4 reasoning-based) using a fine-grained framework with 12 hallucination indicators on generative question answering and summarization tasks.

Result: Factual hallucinations are more prevalent than faithfulness errors across all models. Arabic pre-trained model Allam consistently shows lower hallucination rates than multilingual models and comparable performance to reasoning-based models.

Conclusion: Arabic-specific pre-trained models demonstrate superior performance in reducing hallucinations compared to multilingual models, highlighting the importance of language-specific training for Arabic NLP tasks.

Abstract: Recently, extensive research on the hallucination of the large language models (LLMs) has mainly focused on the English language. Despite the growing number of multilingual and Arabic-specific LLMs, evaluating LLMs’ hallucination in the Arabic context remains relatively underexplored. The knowledge gap is particularly pressing given Arabic’s widespread use across many regions and its importance in global communication and media. This paper presents the first comprehensive hallucination evaluation of Arabic and multilingual LLMs on two critical Arabic natural language generation tasks: generative question answering (GQA) and summarization. This study evaluates a total of 12 LLMs, including 4 Arabic pre-trained models, 4 multilingual models, and 4 reasoning-based models. To assess the factual consistency and faithfulness of LLMs’ outputs, we developed a fine-grained hallucination evaluation framework consisting of 12 fine-grained hallucination indicators that represent the varying characteristics of each task. The results reveal that factual hallucinations are more prevalent than faithfulness errors across all models and tasks. Notably, the Arabic pre-trained model Allam consistently demonstrates lower hallucination rates than multilingual models and comparable performance to reasoning-based models. The code is available at: https://github.com/aishaalansari57/AraHalluEval

[59] ODKE+: Ontology-Guided Open-Domain Knowledge Extraction with LLMs

Samira Khorshidi, Azadeh Nikfarjam, Suprita Shankar, Yisi Sang, Yash Govind, Hyun Jang, Ali Kasgari, Alexis McClimans, Mohamed Soliman, Vishnu Konda, Ahmed Fakhry, Xiaoguang Qi

Main category: cs.CL

TL;DR: ODKE+ is a production-grade system that automatically extracts millions of high-precision facts from web sources using a modular pipeline combining pattern-based rules and LLMs with ontological guidance and verification.

DetailsMotivation: Maintaining knowledge graph freshness and completeness is costly, requiring automated systems for scalable fact extraction from web sources with high precision.

Method: Five-component pipeline: (1) Extraction Initiator detects missing/stale facts, (2) Evidence Retriever collects documents, (3) hybrid Knowledge Extractors (pattern rules + ontology-guided LLMs), (4) lightweight Grounder validates with second LLM, (5) Corroborator ranks and normalizes facts. Uses dynamic ontology snippets for type-consistent extraction.

Result: Processed over 9M Wikipedia pages, ingested 19M high-confidence facts with 98.8% precision. Achieved 48% overlap with third-party KGs, reduced update lag by 50 days on average.

Conclusion: LLM-based extraction grounded in ontological structure and verification workflows enables trustworthy, production-scale knowledge ingestion with broad real-world applicability.

Abstract: Knowledge graphs (KGs) are foundational to many AI applications, but maintaining their freshness and completeness remains costly. We present ODKE+, a production-grade system that automatically extracts and ingests millions of open-domain facts from web sources with high precision. ODKE+ combines modular components into a scalable pipeline: (1) the Extraction Initiator detects missing or stale facts, (2) the Evidence Retriever collects supporting documents, (3) hybrid Knowledge Extractors apply both pattern-based rules and ontology-guided prompting for large language models (LLMs), (4) a lightweight Grounder validates extracted facts using a second LLM, and (5) the Corroborator ranks and normalizes candidate facts for ingestion. ODKE+ dynamically generates ontology snippets tailored to each entity type to align extractions with schema constraints, enabling scalable, type-consistent fact extraction across 195 predicates. The system supports batch and streaming modes, processing over 9 million Wikipedia pages and ingesting 19 million high-confidence facts with 98.8% precision. ODKE+ significantly improves coverage over traditional methods, achieving up to 48% overlap with third-party KGs and reducing update lag by 50 days on average. Our deployment demonstrates that LLM-based extraction, grounded in ontological structure and verification workflows, can deliver trustworthy, production-scale knowledge ingestion with broad real-world applicability. A recording of the system demonstration is included with the submission and is also available at https://youtu.be/UcnE3_GsTWs.

[60] Why Language Models Hallucinate

Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, Edwin Zhang

Main category: cs.CL

TL;DR: Language models hallucinate because training and evaluation reward guessing over admitting uncertainty, and hallucinations originate from statistical classification errors that persist due to misaligned benchmark scoring.

DetailsMotivation: Hallucinations in large language models undermine trust and persist even in state-of-the-art systems, requiring understanding of their root causes to build more trustworthy AI.

Method: Analyzing statistical causes of hallucinations in modern training pipelines, showing they originate as errors in binary classification where incorrect statements cannot be distinguished from facts.

Result: Hallucinations arise through natural statistical pressures and persist because evaluations penalize uncertain responses, optimizing models to be good test-takers through guessing.

Conclusion: The epidemic of penalizing uncertainty requires socio-technical mitigation by modifying scoring of existing misaligned benchmarks rather than adding new hallucination evaluations, steering toward more trustworthy AI systems.

Abstract: Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such “hallucinations” persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious – they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded – language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This “epidemic” of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.
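
The incentive argument reduces to simple expected-value arithmetic, sketched below: under 0/1 grading a guess with any nonzero chance of being right beats abstaining, while a wrong-answer penalty makes abstention optimal below a confidence threshold.

```python
# Expected benchmark score for guessing with confidence p vs. abstaining (0).
def expected_score(p_correct, wrong_penalty=0.0):
    return p_correct * 1.0 - (1 - p_correct) * wrong_penalty

for p in (0.9, 0.5, 0.2):
    binary = expected_score(p)                      # current 0/1 grading
    penalized = expected_score(p, wrong_penalty=1.0)  # +1 right, -1 wrong
    print(f"p={p}: guess(0/1)={binary:+.2f}, "
          f"guess(+1/-1)={penalized:+.2f}, abstain=+0.00")
# Under 0/1 grading guessing always scores >= abstaining, so optimizing for
# the benchmark rewards confident fabrication; with a -1 penalty, abstaining
# becomes optimal whenever p < 0.5.
```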

[61] OleSpeech-IV: A Large-Scale Multispeaker and Multilingual Conversational Speech Dataset with Diverse Topics

Wei Chu, Yuanzhe Dong, Ke Tan, Dong Han, Xavier Menendez-Pidal, Ruchao Fan, Chenfeng Miao, Chanwoo Kim, Bhiksha Raj, Rita Singh

Main category: cs.CL

TL;DR: OleSpeech-IV is a large-scale multispeaker multilingual conversational speech dataset from public English audio sources with human-sourced speaker and transcript data, plus a released subset for research.

DetailsMotivation: To create a comprehensive conversational speech dataset with diverse topics and speakers for speech processing research, addressing the need for high-quality multilingual conversational data.

Method: Collected audio from publicly available English podcasts, talk shows, teleconferences; used human-sourcing for speaker names, turns, and transcripts; employed proprietary pipeline for refinement and additional metadata like timestamps and confidence scores.

Result: Created OleSpeech-IV dataset as Tier IV in Olewave series, and released OleSpeech-IV-2025-EN-AR-100 subset for non-commercial research use.

Conclusion: The dataset provides valuable conversational speech resources with human-verified quality and diverse content, supporting advancement in speech processing research through open-sourced subset availability.

Abstract: OleSpeech-IV dataset is a large-scale multispeaker and multilingual conversational speech dataset with diverse topics. The audio content comes from publicly-available English podcasts, talk shows, teleconferences, and other conversations. Speaker names, turns, and transcripts are human-sourced and refined by a proprietary pipeline, while additional information such as timestamps and confidence scores is derived from the pipeline. The IV denotes its position as Tier IV in the Olewave dataset series. In addition, we have open-sourced a subset, OleSpeech-IV-2025-EN-AR-100, for non-commercial research use.

[62] KERAG: Knowledge-Enhanced Retrieval-Augmented Generation for Advanced Question Answering

Yushi Sun, Kai Sun, Yifan Ethan Xu, Xiao Yang, Xin Luna Dong, Nan Tang, Lei Chen

Main category: cs.CL

TL;DR: KERAG is a novel KG-based RAG pipeline that enhances question answering coverage by retrieving broader knowledge subgraphs and using fine-tuned LLMs for reasoning, outperforming state-of-the-art methods by 7% and GPT-4o by 10-21%.

DetailsMotivation: Traditional KGQA methods suffer from low coverage due to rigid schema requirements and semantic ambiguity in semantic parsing approaches, limiting their effectiveness in question answering.

Method: A retrieval-filtering-summarization approach that retrieves broader subgraphs likely to contain relevant information, combined with fine-tuned LLMs for Chain-of-Thought reasoning on knowledge sub-graphs to reduce noise.

Result: KERAG surpasses state-of-the-art solutions by about 7% in quality and exceeds GPT-4o (Tool) by 10-21% in experiments.

Conclusion: The proposed KERAG pipeline effectively enhances QA coverage and performance by leveraging broader knowledge retrieval and advanced reasoning techniques, demonstrating significant improvements over existing methods.

Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucination in Large Language Models (LLMs) by incorporating external data, with Knowledge Graphs (KGs) offering crucial information for question answering. Traditional Knowledge Graph Question Answering (KGQA) methods rely on semantic parsing, which typically retrieves only the knowledge strictly necessary for answer generation, and thus often suffer from low coverage due to rigid schema requirements and semantic ambiguity. We present KERAG, a novel KG-based RAG pipeline that enhances QA coverage by retrieving a broader subgraph likely to contain relevant information. Our retrieval-filtering-summarization approach, combined with fine-tuned LLMs for Chain-of-Thought reasoning on knowledge sub-graphs, reduces noise and improves QA for both simple and complex questions. Experiments demonstrate that KERAG surpasses state-of-the-art solutions by about 7% in quality and exceeds GPT-4o (Tool) by 10-21%.
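
A skeleton of the retrieval-filtering-summarization flow on a toy knowledge graph; the triples, the two-hop neighborhood, and the stub LLM are all invented for illustration, not KERAG's components.

```python
# Broad subgraph retrieval, LLM filtering, then reasoning over survivors.
TOY_KG = [  # (subject, relation, object) triples
    ("Marie Curie", "field", "physics"),
    ("Marie Curie", "award", "Nobel Prize in Physics"),
    ("Pierre Curie", "spouse", "Marie Curie"),
    ("physics", "studies", "matter"),
]

def neighborhood(entity, hops=2):
    frontier, triples = {entity}, set()
    for _ in range(hops):                      # broad, high-recall retrieval
        new = {t for t in TOY_KG if t[0] in frontier or t[2] in frontier}
        triples |= new
        frontier |= {t[0] for t in new} | {t[2] for t in new}
    return triples

def answer(question, entity, llm):
    facts = [t for t in neighborhood(entity)   # retrieval
             if llm.is_relevant(question, t)]  # filtering
    return llm.reason(question, facts)         # CoT summarization/answer

class StubLLM:                                 # placeholder for the fine-tuned LLM
    def is_relevant(self, q, t): return any(part in q for part in t)
    def reason(self, q, facts): return f"answer derived from {len(facts)} facts"

print(answer("What award did Marie Curie win?", "Marie Curie", StubLLM()))
```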

[63] Phonological Representation Learning for Isolated Signs Improves Out-of-Vocabulary Generalization

Lee Kezar, Zed Sehyr, Jesse Thomason

Main category: cs.CL

TL;DR: This paper investigates phonological inductive biases to improve vector-quantized autoencoder performance for sign language recognition and reconstruction, particularly for unseen signs.

DetailsMotivation: Sign language datasets often lack vocabulary representativeness, creating a need for models that generalize to unseen signs. Vector quantization shows promise but may learn spurious correlations that hinder out-of-vocabulary performance.

Method: Proposes two phonological inductive biases: Parameter Disentanglement (architectural bias) and Phonological Semi-Supervision (regularization technique) within a vector-quantized autoencoder framework for isolated sign recognition.

Result: The learned representations are more effective for one-shot reconstruction of unseen signs and more discriminative for sign identification compared to baseline models.

Conclusion: Explicit, linguistically-motivated biases can significantly improve the generalization capabilities of learned sign language representations, providing better performance on both known and unseen signs.

Abstract: Sign language datasets are often not representative in terms of vocabulary, underscoring the need for models that generalize to unseen signs. Vector quantization is a promising approach for learning discrete, token-like representations, but it has not been evaluated whether the learned units capture spurious correlations that hinder out-of-vocabulary performance. This work investigates two phonological inductive biases: Parameter Disentanglement, an architectural bias, and Phonological Semi-Supervision, a regularization technique, to improve isolated sign recognition of known signs and reconstruction quality of unseen signs with a vector-quantized autoencoder. The primary finding is that the learned representations from the proposed model are more effective for one-shot reconstruction of unseen signs and more discriminative for sign identification compared to a controlled baseline. This work provides a quantitative analysis of how explicit, linguistically-motivated biases can improve the generalization of learned representations of sign language.
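
The vector-quantization step underlying the autoencoder is compact: each continuous encoding snaps to its nearest codebook entry, with a straight-through estimator so gradients still reach the encoder. A toy PyTorch sketch with made-up sizes follows; the phonological biases themselves are not shown.

```python
# Nearest-codebook quantization with a straight-through gradient.
import torch

torch.manual_seed(0)
codebook = torch.randn(64, 16)            # 64 discrete, token-like units

def quantize(z):
    d = torch.cdist(z, codebook)          # (batch, 64) pairwise distances
    codes = d.argmin(dim=1)               # discrete unit index per input
    z_q = codebook[codes]
    # Straight-through: forward pass uses z_q, backward passes grads to z.
    return z + (z_q - z).detach(), codes

z = torch.randn(8, 16, requires_grad=True)  # encoder outputs for 8 signs
z_q, codes = quantize(z)
print(codes.tolist())
```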

[64] A Study of Large Language Models for Patient Information Extraction: Model Architecture, Fine-Tuning Strategy, and Multi-task Instruction Tuning

Cheng Peng, Xinyu Dong, Mengxian Lyu, Daniel Paredes, Yaoyun Zhang, Yonghui Wu

Main category: cs.CL

TL;DR: This study explores optimal use of large language models (LLMs) for clinical patient information extraction, comparing encoder vs decoder architectures, fine-tuning strategies, and multi-task instruction tuning across multiple datasets.

DetailsMotivation: LLMs have revolutionized clinical NLP tasks but their optimal use for patient information extraction requires further exploration to develop robust and generalizable systems.

Method: Benchmarked encoder-based (BERT, GatorTron) and decoder-based (GatorTronGPT, Llama 3.1, GatorTronLlama) LLMs across five datasets, comparing full fine-tuning vs prompt-based PEFT, and evaluated multi-task instruction tuning using leave-one-dataset-out strategy.

Result: The study systematically evaluated different LLM architectures and training approaches, though specific performance metrics are not provided in the abstract.

Conclusion: The research provides comprehensive benchmarking and methodological exploration to optimize LLM usage for clinical concept and relation extraction tasks in healthcare applications.

Abstract: Natural language processing (NLP) is a key technology to extract important patient information from clinical narratives to support healthcare applications. The rapid development of large language models (LLMs) has revolutionized many NLP tasks in the clinical domain, yet their optimal use in patient information extraction tasks requires further exploration. This study examines LLMs’ effectiveness in patient information extraction, focusing on LLM architectures, fine-tuning strategies, and multi-task instruction tuning techniques for developing robust and generalizable patient information extraction systems. This study aims to explore key concepts of using LLMs for clinical concept and relation extraction tasks, including: (1) encoder-only or decoder-only LLMs, (2) prompt-based parameter-efficient fine-tuning (PEFT) algorithms, and (3) multi-task instruction tuning on few-shot learning performance. We benchmarked a suite of LLMs, including encoder-based LLMs (BERT, GatorTron) and decoder-based LLMs (GatorTronGPT, Llama 3.1, GatorTronLlama), across five datasets. We compared traditional full-size fine-tuning and prompt-based PEFT. We explored a multi-task instruction tuning framework that combines both tasks across four datasets to evaluate the zero-shot and few-shot learning performance using the leave-one-dataset-out strategy.
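
One common form of the prompt-based PEFT compared here is soft prompt tuning: a handful of learnable prompt embeddings are prepended to the input while all base-model weights stay frozen. The dimensions below are illustrative, not the paper's configuration.

```python
# Soft prompt tuning in miniature: only the prompt embeddings are trainable.
import torch
import torch.nn as nn

hidden, n_prompt, vocab = 32, 8, 100
embed = nn.Embedding(vocab, hidden)                  # stands in for the LLM
soft_prompt = nn.Parameter(torch.randn(n_prompt, hidden) * 0.02)

for p in embed.parameters():
    p.requires_grad = False                          # base model frozen

token_ids = torch.randint(0, vocab, (1, 12))         # toy clinical note tokens
inputs = torch.cat([soft_prompt.unsqueeze(0).expand(1, -1, -1),
                    embed(token_ids)], dim=1)        # (1, 8+12, 32)
print(inputs.shape)
# Only `soft_prompt` (8 x 32 values here) receives gradients during tuning.
```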

[65] Decoders Laugh as Loud as Encoders

Eli Borodach, Raj Dandekar, Rajat Dandekar, Sreedath Panat

Main category: cs.CL

TL;DR: GPT-4o performs nearly as well as RoBERTa in humor understanding tasks, achieving F1-macro scores of 0.85 vs 0.86 respectively, challenging the notion that computers cannot understand nuanced humor.

DetailsMotivation: To investigate whether advanced LLMs like GPT-4o truly understand nuanced language concepts such as humor, building on Turing's vision of human-like machine communication and addressing the open question of computer humor comprehension.

Method: Fine-tuned GPT-4o decoder model and compared its performance against fine-tuned RoBERTa encoder model on humor understanding tasks, using F1-macro score as the evaluation metric.

Result: GPT-4o achieved a Mean F1-macro score of 0.85, performing nearly identically to the best fine-tuned encoder model RoBERTa which scored 0.86, demonstrating comparable humor understanding capabilities.

Conclusion: Modern decoder-based LLMs like GPT-4o can understand nuanced humor nearly as effectively as specialized encoder models, suggesting significant progress in machines’ ability to comprehend complex linguistic phenomena.

Abstract: From the dawn of the computer, Alan Turing dreamed of a robot that could communicate using language as a human being does. The recent advances in the field of Large Language Models (LLMs) shocked the scientific community when a single model could be applied to various natural language processing (NLP) tasks, with outputs that sometimes rival human communication skills. Models such as GPT, Claude, Grok, etc. have left their mark on the scientific community. However, it is unclear how much these models understand what they produce, especially in a nuanced theme such as humor. The question of whether computers understand humor is still open (among the decoders, the latest to be checked was GPT-2). We addressed this issue in this paper, showing that a fine-tuned decoder (GPT-4o) performed as well (mean F1-macro score of 0.85) as the best fine-tuned encoder (RoBERTa, with a mean F1-macro score of 0.86).

[66] Research on Multi-hop Inference Optimization of LLM Based on MQUAKE Framework

Zucheng Liang, Wenxin Wei, Kaijie Zhang, Hongyi Chen

Main category: cs.CL

TL;DR: Multi-hop question decomposition method significantly improves LLM performance on complex questions, outperforming direct answering approaches both before and after fine-tuning.

DetailsMotivation: Addressing the challenge of accurately answering complex questions with Large Language Models by leveraging multi-hop decomposition within knowledge graphs.

Method: Used LLAMA3 model with MQUAKE-T dataset converted to single-hop and multi-hop formats. Applied LoRA fine-tuning and compared performance of direct answering vs multi-hop decomposition approaches.

Result: Multi-hop decomposition significantly outperformed direct answering without fine-tuning. After LoRA fine-tuning, both methods improved but multi-hop maintained superiority, validating its effectiveness.

Conclusion: Multi-hop question decomposition effectively enhances LLM’s complex question answering capability both before and after training, demonstrating consistent performance advantages.

Abstract: Accurately answering complex questions has consistently been a significant challenge for Large Language Models (LLMs). To address this, this paper proposes a multi-hop question decomposition method for complex questions, building upon research within the MQUAKE framework. Utilizing the LLAMA3 model, we systematically investigate the impact of multi-hop question decomposition within knowledge graphs on model comprehension and reasoning accuracy, both before and after model training. In our experiments, we systematically partitioned and converted the MQUAKE-T dataset into two distinct formats: a single-hop dataset designed for directly answering complex questions, and a multi-hop dataset constructed using the multi-hop question decomposition method. We then fine-tuned the LLAMA3 model on these datasets and conducted inference tests. Our results demonstrate that, without fine-tuning the LLM, the prediction performance based on the multi-hop question decomposition method significantly outperforms the method of directly answering complex questions. After fine-tuning using the LoRA (Low-Rank Adaptation) method, the performance of both approaches improved compared to the untrained baseline. Crucially, the method utilizing multi-hop decomposition consistently maintained its superiority. These findings validate the effectiveness of the multi-hop decomposition method both before and after training, demonstrating its capability to effectively enhance the LLM’s ability to answer complex questions.
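
The decomposition idea itself fits in a few lines: each single-hop question is answered in turn, with the previous answer substituted into the next hop. The lookup table below is a hypothetical stand-in for the fine-tuned LLAMA3 model.

```python
# Answer a complex question as a chain of single-hop questions.
TOY_FACTS = {                       # placeholder for an LLM / knowledge graph
    "Who is the CEO of OpenAI?": "Sam Altman",
    "Where was Sam Altman born?": "Chicago",
}

def answer_hop(question):           # placeholder for a model call
    return TOY_FACTS[question]

def multi_hop(hops):
    answer = None
    for hop in hops:
        question = hop.replace("{prev}", answer) if answer else hop
        answer = answer_hop(question)
    return answer

# "Where was the CEO of OpenAI born?" decomposed into two single hops:
print(multi_hop(["Who is the CEO of OpenAI?", "Where was {prev} born?"]))
```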

[67] Enhancing Diversity in Large Language Models via Determinantal Point Processes

Yilei Chen, Souradip Chakraborty, Lorenz Wolf, Ioannis Ch. Paschalidis, Aldo Pacchiano

Main category: cs.CL

TL;DR: DQO is a novel training method that uses determinantal point processes to optimize LLMs for both quality and semantic diversity, addressing the diversity loss issue in supervised fine-tuning and reinforcement learning.

DetailsMotivation: Existing post-training methods like supervised fine-tuning and reinforcement learning often reduce output diversity, leading to narrow, canonical responses. Current diversity enhancement methods are limited to inference-time operations or focus only on lexical differences.

Method: Proposes DQO method based on determinantal point processes (DPPs) that samples multiple responses per prompt, embeds them, and uses the determinant of a kernel-based similarity matrix to measure diversity as the volume spanned by response embeddings.

Result: Experiments across instruction-following, summarization, story generation, and reasoning tasks show substantial improvement in semantic diversity without sacrificing model quality.

Conclusion: DQO effectively addresses the diversity loss problem in LLM post-training by jointly optimizing for both quality and semantic diversity through a novel DPP-based training approach.

Abstract: Supervised fine-tuning and reinforcement learning are two popular methods for post-training large language models (LLMs). While improving the model’s performance on downstream tasks, they often reduce the model’s output diversity, leading to narrow, canonical responses. Existing methods to enhance diversity are limited, either by operating at inference time or by focusing on lexical differences. We propose a novel training method named DQO based on determinantal point processes (DPPs) to jointly optimize LLMs for quality and semantic diversity. Our approach samples and embeds a group of responses for each prompt, then uses the determinant of a kernel-based similarity matrix to measure diversity as the volume spanned by the embeddings of these responses. Experiments across instruction-following, summarization, story generation, and reasoning tasks demonstrate that our method substantially improves semantic diversity without sacrificing model quality.
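
The diversity objective has a clean closed form: embed the sampled responses, build a kernel similarity matrix, and take its (log-)determinant, which grows with the volume the embeddings span. A NumPy sketch on toy embeddings:

```python
# DPP-style diversity: log-determinant of a cosine-similarity kernel.
import numpy as np

def dpp_log_volume(embeddings):
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    k = e @ e.T                                   # kernel similarity matrix
    sign, logdet = np.linalg.slogdet(k + 1e-6 * np.eye(len(k)))
    return logdet

rng = np.random.default_rng(0)
diverse = rng.normal(size=(4, 8))                 # spread-out responses
redundant = (np.tile(rng.normal(size=(1, 8)), (4, 1))
             + 0.01 * rng.normal(size=(4, 8)))    # near-duplicate responses
print(dpp_log_volume(diverse) > dpp_log_volume(redundant))  # True
```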

[68] Personality as a Probe for LLM Evaluation: Method Trade-offs and Downstream Effects

Gunmay Handa, Zekun Wu, Adriano Koshiyama, Philip Treleaven

Main category: cs.CL

TL;DR: Systematic study of personality manipulation in LLMs using Big Five traits, comparing three methods (ICL, PEFT, MS) with trade-offs in alignment, capability, and deployment efficiency.

DetailsMotivation: Personality manipulation is increasingly used in customer service and agentic scenarios, but the mechanisms and trade-offs remain unclear, requiring systematic investigation.

Method: Constructed contrastive dataset, developed unified evaluation framework with Δ analysis, trait purification techniques, and three-level stability framework. Compared in-context learning (ICL), parameter-efficient fine-tuning (PEFT), and mechanistic steering (MS) on Gemma-2-2B-IT and LLaMA-3-8B-Instruct models.

Result: ICL achieves strong alignment with minimal capability loss, PEFT delivers highest alignment but degrades task performance, MS provides lightweight runtime control with competitive effectiveness. Openness is uniquely challenging, agreeableness most resistant to ICL, personality encoding consolidates around intermediate layers.

Conclusion: Personality manipulation serves as multi-level probe into behavioral representation, linking surface conditioning, parameter encoding, and activation-level steering. Mechanistic steering emerges as lightweight alternative to fine-tuning for both deployment and interpretability.

Abstract: Personality manipulation in large language models (LLMs) is increasingly applied in customer service and agentic scenarios, yet its mechanisms and trade-offs remain unclear. We present a systematic study of personality control using the Big Five traits, comparing in-context learning (ICL), parameter-efficient fine-tuning (PEFT), and mechanistic steering (MS). Our contributions are fourfold. First, we construct a contrastive dataset with balanced high/low trait responses, enabling effective steering vector computation and fair cross-method evaluation. Second, we introduce a unified evaluation framework based on within-run $\Delta$ analysis that disentangles, reasoning capability, agent performance, and demographic bias across MMLU, GAIA, and BBQ benchmarks. Third, we develop trait purification techniques to separate openness from conscientiousness, addressing representational overlap in trait encoding. Fourth, we propose a three-level stability framework that quantifies method-, trait-, and combination-level robustness, offering practical guidance under deployment constraints. Experiments on Gemma-2-2B-IT and LLaMA-3-8B-Instruct reveal clear trade-offs: ICL achieves strong alignment with minimal capability loss, PEFT delivers the highest alignment at the cost of degraded task performance, and MS provides lightweight runtime control with competitive effectiveness. Trait-level analysis shows openness as uniquely challenging, agreeableness as most resistant to ICL, and personality encoding consolidating around intermediate layers. Taken together, these results establish personality manipulation as a multi-level probe into behavioral representation, linking surface conditioning, parameter encoding, and activation-level steering, and positioning mechanistic steering as a lightweight alternative to fine-tuning for both deployment and interpretability.
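A hedged sketch of the mechanistic-steering (MS) ingredient: adding a precomputed trait vector to one layer's residual stream at inference time. The LLaMA-style module path, layer index, and scale are assumptions for illustration:

```python
import torch

def add_steering_hook(model, layer_idx: int, vector: torch.Tensor, scale: float = 4.0):
    """Add `scale * vector` to the hidden states of one decoder layer."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    # Assumes a LLaMA-style module layout, e.g. model.model.layers[i].
    return model.model.layers[layer_idx].register_forward_hook(hook)
```

The steering vector itself is typically the difference of mean activations collected on high-trait versus low-trait contrastive prompts, which is what the contrastive dataset above enables.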

[69] Knowledge Collapse in LLMs: When Fluency Survives but Facts Fail under Recursive Synthetic Training

Figarri Keisha, Zekun Wu, Ze Wang, Adriano Koshiyama, Philip Treleaven

Main category: cs.CL

TL;DR: Recursive training on synthetic data causes knowledge collapse where models become confidently wrong - factual accuracy deteriorates while surface fluency persists, posing critical risks in accuracy-dependent domains.

DetailsMotivation: Address the problem of model collapse in large language models that increasingly rely on synthetic data due to human-written content scarcity, particularly focusing on how this threatens factual reliability in knowledge-intensive applications.

Method: Conducted controlled experiments with recursive synthetic training, analyzed collapse trajectory and timing based on instruction format, and proposed domain-specific synthetic training as mitigation strategy. Used evaluation framework combining model-centric indicators with task-centric metrics.

Result: Found that collapse depends critically on instruction format, distinguishing instruction-following collapse from traditional model collapse. Domain-specific synthetic training achieved substantial improvements in collapse resistance while maintaining computational efficiency.

Conclusion: Provides theoretical insights into collapse dynamics and practical guidance for sustainable AI training in knowledge-intensive applications where accuracy is paramount, with proposed framework enabling reproducible assessment of epistemic deterioration.

Abstract: Large language models increasingly rely on synthetic data due to human-written content scarcity, yet recursive training on model-generated outputs leads to model collapse, a degenerative process threatening factual reliability. We define knowledge collapse as a distinct three-stage phenomenon where factual accuracy deteriorates while surface fluency persists, creating “confidently wrong” outputs that pose critical risks in accuracy-dependent domains. Through controlled experiments with recursive synthetic training, we demonstrate that collapse trajectory and timing depend critically on instruction format, distinguishing instruction-following collapse from traditional model collapse through its conditional, prompt-dependent nature. We propose domain-specific synthetic training as a targeted mitigation strategy that achieves substantial improvements in collapse resistance while maintaining computational efficiency. Our evaluation framework combines model-centric indicators with task-centric metrics to detect distinct degradation phases, enabling reproducible assessment of epistemic deterioration across different language models. These findings provide both theoretical insights into collapse dynamics and practical guidance for sustainable AI training in knowledge-intensive applications where accuracy is paramount.
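A schematic of the recursive synthetic-training loop being studied; every helper here (`generate`, `finetune`, the `probe`) is a hypothetical placeholder, not the paper's code:

```python
def recursive_training(model, prompts, probe, generations: int = 5):
    """Repeatedly retrain a model on its own outputs, tracking factual
    accuracy and fluency separately at each generation."""
    for g in range(generations):
        synthetic = [model.generate(p) for p in prompts]  # self-generated corpus
        model = model.finetune(synthetic)                 # the recursive step
        # Knowledge collapse: facts degrade while fluency holds up,
        # which is why both indicators are tracked independently.
        print(g, probe.factual_accuracy(model), probe.fluency(model))
    return model
```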

[70] Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

Ilham Wicaksono, Zekun Wu, Theo King, Adriano Koshiyama, Philip Treleaven

Main category: cs.CL

TL;DR: AgentSeer is a new framework for evaluating safety risks in AI agents, revealing that traditional model-level safety tests miss critical vulnerabilities that only appear when models function as agents with tool-calling capabilities.

DetailsMotivation: Current safety evaluation frameworks have critical gaps in assessing deployment-specific risks as large language models transition to agentic systems, requiring specialized evaluation methods for agentic contexts.

Method: AgentSeer decomposes agentic executions into granular action and component graphs, enabling systematic agentic-situational assessment. Cross-model validation was conducted on GPT-OSS-20B and Gemini-2.0-flash using HarmBench single turn and iterative refinement attacks.

Result: Agentic-level assessment exposed agent-specific risks invisible to traditional evaluation, with tool-calling showing 24-60% higher attack success rates. Cross-model analysis revealed universal agentic patterns, agent transfer operations as highest-risk tools, and context-dependent attack effectiveness.

Conclusion: The findings establish the urgent need for agentic-situation evaluation paradigms, with AgentSeer providing the standardized methodology and empirical validation for assessing agent-specific vulnerabilities that traditional model-level testing cannot detect.

Abstract: As large language models transition to agentic systems, current safety evaluation frameworks face critical gaps in assessing deployment-specific risks. We introduce AgentSeer, an observability-based evaluation framework that decomposes agentic executions into granular action and component graphs, enabling systematic agentic-situational assessment. Through cross-model validation on GPT-OSS-20B and Gemini-2.0-flash using HarmBench single turn and iterative refinement attacks, we demonstrate fundamental differences between model-level and agentic-level vulnerability profiles. Model-level evaluation reveals baseline differences: GPT-OSS-20B (39.47% ASR) versus Gemini-2.0-flash (50.00% ASR), with both models showing susceptibility to social engineering while maintaining logic-based attack resistance. However, agentic-level assessment exposes agent-specific risks invisible to traditional evaluation. We discover “agentic-only” vulnerabilities that emerge exclusively in agentic contexts, with tool-calling showing 24-60% higher ASR across both models. Cross-model analysis reveals universal agentic patterns, agent transfer operations as highest-risk tools, semantic rather than syntactic vulnerability mechanisms, and context-dependent attack effectiveness, alongside model-specific security profiles in absolute ASR levels and optimal injection strategies. Direct attack transfer from model-level to agentic contexts shows degraded performance (GPT-OSS-20B: 57% human injection ASR; Gemini-2.0-flash: 28%), while context-aware iterative attacks successfully compromise objectives that failed at model-level, confirming systematic evaluation gaps. These findings establish the urgent need for agentic-situation evaluation paradigms, with AgentSeer providing the standardized methodology and empirical validation.

[71] PLaMo 2 Technical Report

Preferred Networks: Kaizaburo Chubachi, Yasuhiro Fujita, Shinichi Hemmi, Yuta Hirokawa, Toshiki Kataoka, Goro Kobayashi, Kenichi Maehashi, Calvin Metzger, Hiroaki Mikami, Shogo Murai, Daisuke Nishino, Kento Nozawa, Shintarou Okada, Daisuke Okanohara, Shunta Saito, Shotaro Sano, Shuji Suzuki, Daisuke Tanaka, Avinash Ummadisingu, Hanqin Wang, Sixue Wang, Tianqi Xu

Main category: cs.CL

TL;DR: PLaMo 2 is a series of Japanese-focused LLMs using hybrid Samba architecture with 32K context, achieving 100B model performance in an 8B model through efficient pruning and synthetic data training.

DetailsMotivation: To develop Japanese-focused language models that overcome data scarcity issues while maintaining computational efficiency and achieving state-of-the-art performance on Japanese benchmarks.

Method: Hybrid Samba-based architecture transitioning to full attention, extensive synthetic corpora training, weight reuse, structured pruning, supervised fine-tuning (SFT), direct preference optimization (DPO), model merging, and inference optimization with vLLM and quantization.

Result: Produced an 8B model with performance comparable to previous 100B model, achieving state-of-the-art results on Japanese benchmarks in instruction-following, language fluency, and Japanese-specific knowledge.

Conclusion: PLaMo 2 demonstrates that efficient architecture design, synthetic data utilization, and optimization techniques can create highly performant Japanese language models that outperform similarly-sized open models while maintaining computational efficiency.

Abstract: In this report, we introduce PLaMo 2, a series of Japanese-focused large language models featuring a hybrid Samba-based architecture that transitions to full attention via continual pre-training to support 32K token contexts. Training leverages extensive synthetic corpora to overcome data scarcity, while computational efficiency is achieved through weight reuse and structured pruning. This efficient pruning methodology produces an 8B model that achieves performance comparable to our previous 100B model. Post-training further refines the models using a pipeline of supervised fine-tuning (SFT) and direct preference optimization (DPO), enhanced by synthetic Japanese instruction data and model merging techniques. Optimized for inference using vLLM and quantization with minimal accuracy loss, the PLaMo 2 models achieve state-of-the-art results on Japanese benchmarks, outperforming similarly-sized open models in instruction-following, language fluency, and Japanese-specific knowledge.

[72] Analyzing Finnish Inflectional Classes through Discriminative Lexicon and Deep Learning Models

Alexandre Nikolaev, Yu-Ying Chuang, R. Harald Baayen

Main category: cs.CL

TL;DR: The study tests whether the Discriminative Lexicon Model can handle Finnish noun inflection without using traditional inflectional classes, achieving good accuracy on both training and test data while showing sensitivity to class productivity.

DetailsMotivation: To investigate whether inflectional classes are cognitively necessary for language learning by testing if a computational model can learn Finnish noun inflection without predefined classes.

Method: Used Discriminative Lexicon Model with 55,271 inflected nouns from 2000 high-frequency Finnish nouns across 49 inflectional classes. Tested both endstate learning (infinite exposure) and frequency-informed learning (usage-based).

Result: Models achieved high accuracy on training data and acceptable performance on test data. Performance correlated with inflectional class productivity - better for productive classes with more types and lower-frequency words, worse for unproductive classes. Frequency was dominant predictor for usage-based models.

Conclusion: The DLM can successfully learn Finnish noun inflection without explicit inflectional classes, suggesting such classes may not be cognitively necessary for language acquisition, though performance is influenced by class productivity and word frequency.

Abstract: Descriptions of complex nominal or verbal systems make use of inflectional classes. Inflectional classes bring together nouns which have similar stem changes and use similar exponents in their paradigms. Although inflectional classes can be very useful for language teaching as well as for setting up finite state morphological systems, it is unclear whether inflectional classes are cognitively real, in the sense that native speakers would need to discover these classes in order to learn how to properly inflect the nouns of their language. This study investigates whether the Discriminative Lexicon Model (DLM) can understand and produce Finnish inflected nouns without setting up inflectional classes, using a dataset with 55,271 inflected nouns of 2000 high-frequency Finnish nouns from 49 inflectional classes. Several DLM comprehension and production models were set up. Some models were not informed about frequency of use, and provide insight into learnability with infinite exposure (endstate learning). Other models were set up from a usage based perspective, and were trained with token frequencies being taken into consideration (frequency-informed learning). On training data, models performed with very high accuracies. For held-out test data, accuracies decreased, as expected, but remained acceptable. Across most models, performance increased for inflectional classes with more types, more lower-frequency words, and more hapax legomena, mirroring the productivity of the inflectional classes. The model struggles more with novel forms of unproductive and less productive classes, and performs far better for unseen forms belonging to productive classes. However, for usage-based production models, frequency was the dominant predictor of model performance, and correlations with measures of productivity were tenuous or absent.
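For readers unfamiliar with the DLM: its comprehension mappings are linear, so the class-free learning it tests reduces to one least-squares problem from form vectors to semantic vectors. A toy sketch with random stand-in matrices (the real cue and semantic matrices come from the 55,271-form dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(500, 300))   # word-form (cue) vectors, one row per form
S = rng.normal(size=(500, 300))   # corresponding semantic vectors

# Endstate ("infinite exposure") comprehension mapping: find F with C @ F ≈ S,
# with no inflectional-class labels anywhere in the problem.
F, *_ = np.linalg.lstsq(C, S, rcond=None)
S_hat = C @ F   # predicted meanings; accuracy = fraction of rows whose
                # nearest semantic neighbor is the correct one
```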

[73] AFD-SLU: Adaptive Feature Distillation for Spoken Language Understanding

Yan Xie, Yibo Cui, Liang Xie, Erwei Yin

Main category: cs.CL

TL;DR: Proposes Adaptive Feature Distillation framework for SLU that transfers knowledge from GTE teacher to lightweight student using dynamic adapter and adaptive distillation coefficient, achieving SOTA results on Chinese ProSLU benchmark.

DetailsMotivation: Address challenges in SLU development including scarcity of labeled training data and computational burden of deploying LLMs in real-world applications.

Method: Adaptive Feature Distillation framework with dynamic adapter (Residual Projection Neural Network) to align heterogeneous feature spaces and Dynamic Distillation Coefficient that modulates distillation strength based on real-time intent/slot prediction feedback.

Result: Achieves state-of-the-art results on Chinese ProSLU benchmark: 95.67% intent accuracy, 92.02% slot F1 score, and 85.50% overall accuracy.

Conclusion: The proposed AFD-SLU framework effectively addresses data scarcity and computational efficiency challenges in SLU through adaptive knowledge distillation, demonstrating superior performance on Chinese conversational understanding tasks.

Abstract: Spoken Language Understanding (SLU) is a core component of conversational systems, enabling machines to interpret user utterances. Despite its importance, developing effective SLU systems remains challenging due to the scarcity of labeled training data and the computational burden of deploying Large Language Models (LLMs) in real-world applications. To further alleviate these issues, we propose an Adaptive Feature Distillation framework that transfers rich semantic representations from a General Text Embeddings (GTE)-based teacher model to a lightweight student model. Our method introduces a dynamic adapter equipped with a Residual Projection Neural Network (RPNN) to align heterogeneous feature spaces, and a Dynamic Distillation Coefficient (DDC) that adaptively modulates the distillation strength based on real-time feedback from intent and slot prediction performance. Experiments on the Chinese profile-based ProSLU benchmark demonstrate that AFD-SLU achieves state-of-the-art results, with 95.67% intent accuracy, 92.02% slot F1 score, and 85.50% overall accuracy.
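A hedged sketch of a feature-distillation loss with an adaptive coefficient in the spirit of AFD-SLU; the sigmoid modulation rule below is an assumption, not the paper's exact DDC formula:

```python
import torch
import torch.nn.functional as F

def afd_loss(student_feat, teacher_feat, projector, task_loss, alpha=1.0):
    """Distillation term whose weight reacts to task performance."""
    aligned = projector(student_feat)                 # RPNN-style adapter
    distill = F.mse_loss(aligned, teacher_feat.detach())
    # Distill harder while intent/slot losses are high, ease off as they drop.
    coeff = alpha * torch.sigmoid(task_loss.detach())
    return task_loss + coeff * distill
```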

[74] Memorization $\neq$ Understanding: Do Large Language Models Have the Ability of Scenario Cognition?

Boxiang Ma, Ru Li, Yuanlong Wang, Hongye Tan, Xiaoli Li

Main category: cs.CL

TL;DR: LLMs rely more on memorization than deep semantic understanding for scenario cognition, as shown through a bi-perspective evaluation framework testing their ability to link scenario elements with arguments.

DetailsMotivation: To determine whether LLMs' generalization capabilities stem from genuine semantic understanding or mere memorization of training data, particularly focusing on scenario cognition - the ability to connect semantic scenario elements with their contextual arguments.

Method: Proposed a bi-perspective evaluation framework using a novel scenario-based dataset with fictional facts. Evaluated LLMs through: 1) scenario-related question answering (output perspective), and 2) probing internal representations for scenario element-argument associations (internal representation perspective).

Result: Current LLMs predominantly rely on superficial memorization rather than robust semantic scenario cognition, failing even in simple cases. They show limitations in encoding and processing scenario element-argument relationships.

Conclusion: The findings expose critical limitations in LLMs’ semantic understanding capabilities and provide cognitive insights that can guide future advancements in developing more semantically-aware language models.

Abstract: Driven by vast and diverse textual data, large language models (LLMs) have demonstrated impressive performance across numerous natural language processing (NLP) tasks. Yet, a critical question persists: does their generalization arise from mere memorization of training data or from deep semantic understanding? To investigate this, we propose a bi-perspective evaluation framework to assess LLMs' scenario cognition - the ability to link semantic scenario elements with their arguments in context. Specifically, we introduce a novel scenario-based dataset comprising diverse textual descriptions of fictional facts, annotated with scenario elements. LLMs are evaluated through their capacity to answer scenario-related questions (model output perspective) and via probing their internal representations for encoded scenario element-argument associations (internal representation perspective). Our experiments reveal that current LLMs predominantly rely on superficial memorization, failing to achieve robust semantic scenario cognition, even in simple cases. These findings expose critical limitations in LLMs' semantic understanding and offer cognitive insights for advancing their capabilities.

[75] Using LLMs for Multilingual Clinical Entity Linking to ICD-10

Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva

Main category: cs.CL

TL;DR: Automated ICD-10 coding system using LLMs with dictionary matching and GPT-4.1 for clinical text analysis in multiple languages

DetailsMotivation: Simplify healthcare professionals' work and ensure consistent disease coding in hospitals by automating ICD-10 code assignment from clinical texts

Method: Multistage pipeline combining clinical dictionary matching for unambiguous terms and in-context learning with GPT-4.1 for complex cases

Result: High performance across languages: 0.89 F1 for categories and 0.78 F1 for subcategories in Spanish (CodiEsp), 0.85 F1 in Greek (ElCardioCC)

Conclusion: LLM-based approach effectively automates clinical entity linking to ICD-10 codes across different languages with promising accuracy

Abstract: The linking of clinical entities is a crucial part of extracting structured information from clinical texts. It is the process of assigning a code from a medical ontology or classification to a phrase in the text. The International Classification of Diseases - 10th revision (ICD-10) is an international standard for classifying diseases for statistical and insurance purposes. Automatically assigning the correct ICD-10 code to terms in discharge summaries will simplify the work of healthcare professionals and ensure consistent coding in hospitals. Our paper proposes an approach for linking clinical terms to ICD-10 codes in different languages using Large Language Models (LLMs). The approach consists of a multistage pipeline that uses clinical dictionaries to match unambiguous terms in the text and then applies in-context learning with GPT-4.1 to predict the ICD-10 code for the terms that do not match the dictionary. Our system shows promising results in predicting ICD-10 codes on different benchmark datasets in Spanish - 0.89 F1 for categories and 0.78 F1 on subcategories on CodiEsp, and Greek - 0.85 F1 on ElCardioCC.
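A simplified sketch of the dictionary-first, LLM-fallback pipeline; the dictionary format and the `ask_gpt41` helper are assumptions:

```python
def link_icd10(term: str, dictionary: dict, ask_gpt41) -> str:
    """Return an ICD-10 code: exact dictionary hit first, else the LLM."""
    code = dictionary.get(term.lower())
    if code is not None:                  # unambiguous dictionary match
        return code
    # Fall back to in-context learning for terms the dictionary misses;
    # the real prompt would carry few-shot examples in the target language.
    return ask_gpt41(f"Assign the ICD-10 code for the clinical term: '{term}'.")
```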

[76] PRIM: Towards Practical In-Image Multilingual Machine Translation

Yanzhi Tian, Zeming Liu, Zhengyang Liu, Chong Feng, Xin Li, Heyan Huang, Yuhang Guo

Main category: cs.CL

TL;DR: This paper introduces Practical In-Image Multilingual Machine Translation (IIMMT) to address real-world challenges in translating text within images, presents the PRIM dataset with complex real-world conditions, and proposes the VisTrans model that achieves better translation quality and visual effects.

DetailsMotivation: Current In-Image Machine Translation research mainly uses synthetic data with simple backgrounds, single fonts, fixed text positions, and bilingual translation, which creates a significant gap between research and practical real-world conditions.

Method: The authors annotated the PRIM dataset containing real-world captured one-line text images with complex backgrounds, various fonts, and diverse text positions. They proposed an end-to-end model called VisTrans that processes visual text and background information separately to handle practical conditions while supporting multilingual translation.

Result: Experimental results show that VisTrans achieves better translation quality and visual effects compared to other models, demonstrating improved performance in practical in-image multilingual translation scenarios.

Conclusion: The PRIM dataset and VisTrans model facilitate research in practical in-image multilingual machine translation, bridging the gap between synthetic research environments and real-world applications, with the code and dataset made publicly available.

Abstract: In-Image Machine Translation (IIMT) aims to translate images containing texts from one language to another. Current research on end-to-end IIMT is mainly conducted on synthetic data with simple backgrounds, a single font, fixed text positions, and bilingual translation, which cannot fully reflect the real world, causing a significant gap between research and practical conditions. To facilitate research on IIMT in real-world scenarios, we explore Practical In-Image Multilingual Machine Translation (IIMMT). To address the lack of publicly available data, we annotate the PRIM dataset, which contains real-world captured one-line text images with complex backgrounds, various fonts, and diverse text positions, and supports multilingual translation directions. We propose an end-to-end model, VisTrans, to handle the challenge of practical conditions in PRIM; it processes the visual text and background information in the image separately, ensuring the capability of multilingual translation while improving the visual quality. Experimental results indicate that VisTrans achieves better translation quality and visual effects compared to other models. The code and dataset are available at: https://github.com/BITHLP/PRIM.

[77] L1RA: Dynamic Rank Assignment in LoRA Fine-Tuning

Raul Singh, Nicolo Brunello, Vincenzo Scotti, Mark James Carman

Main category: cs.CL

TL;DR: L1RA is a novel technique that dynamically distributes rank budgets across low-rank adapters during LLM fine-tuning using L1 regularization to prune redundant ranks and optimize resource utilization.

DetailsMotivation: High computational requirements for fine-tuning Large Language Models pose significant challenges, especially when resources are limited, creating a need for more efficient adaptation methods.

Method: L1RA leverages L1 regularization to prune redundant ranks and redistribute them across adapters within a given rank budget, using LoRA for low-rank adaptation.

Result: L1RA maintains comparable or reduced computational overhead compared to other LoRA variants while achieving same or better performance, with analysis revealing feed-forward layers and attention output projection require most adaptation.

Conclusion: L1RA is an effective technique for enhancing LLM fine-tuning efficiency and providing diagnostic insights for model refinement, particularly valuable in resource-constrained scenarios.

Abstract: The ability of Large Language Models (LLMs) to solve complex tasks has made them crucial in the development of AI-based applications. However, the high computational requirements to fine-tune these LLMs on downstream tasks pose significant challenges, particularly when resources are limited. In response to this challenge, we introduce L1RA, a novel technique aimed at dynamically distributing the rank of low-rank adapters during fine-tuning using LoRA. Given a rank budget (i.e., total sum of adapters rank), L1RA leverages L1 regularisation to prune redundant ranks and redistribute them across adapters, thereby optimising resource utilisation. Through a series of comprehensive experiments, we empirically demonstrate that L1RA maintains comparable or even reduced computational overhead compared to other LoRA variants, including the vanilla approach, while achieving same or better performances. Moreover, the post-training analysis of rank distribution unveiled insights into the specific model components requiring the most adaptation to align with the task objective: the feed-forward layers and the attention output projection. These results highlight the efficacy of L1RA in not only enhancing the efficiency of LLM fine-tuning, but also in providing valuable diagnostic information for model refinement and customisation. In conclusion, L1RA stands as a promising technique for advancing the performance and interpretability of LLM adaptation, particularly in scenarios where computational resources are constrained.
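An illustrative sketch of the mechanism L1RA builds on: a LoRA adapter with one learnable gate per rank and an L1 penalty that lets unneeded ranks shrink to zero so their budget can be reassigned. Where exactly the gates sit is an assumption here:

```python
import torch
import torch.nn as nn

class GatedLoRA(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int = 16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.gate = nn.Parameter(torch.ones(rank))   # one gate per rank

    def forward(self, x):                 # x: (batch, d_in)
        return (x @ self.A.T * self.gate) @ self.B.T

    def l1_penalty(self):
        # Added to the training loss; gates driven to zero mark
        # redundant ranks that can be pruned and redistributed.
        return self.gate.abs().sum()
```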

[78] ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning

Jianghao Chen, Wei Sun, Qixiang Yin, Lingxing Kong, Zhixing Tan, Jiajun Zhang

Main category: cs.CL

TL;DR: ACE-RL framework uses adaptive constraint-enhanced reinforcement learning to improve LLM long-form generation by automatically deconstructing instructions into fine-grained constraints and converting subjective quality evaluation into constraint verification.

DetailsMotivation: Current LLMs struggle with high-quality long-form generation due to reliance on scarce high-quality training data and focus on coarse-grained optimization dimensions, overlooking fine-grained specifics of diverse generation scenarios.

Method: Proposes ACE-RL framework that: 1) automatically deconstructs instructions into fine-grained adaptive constraint criteria, 2) designs reward mechanism quantifying response quality based on constraint satisfaction, 3) uses reinforcement learning to guide models toward better generation capabilities.

Result: ACE-RL outperforms existing SFT and RL baselines by 20.70% and 7.32% on WritingBench, with top-performing model surpassing GPT-4o by 7.10%.

Conclusion: ACE-RL provides an effective training paradigm for LLMs to generate high-quality content across diverse long-form generation scenarios by converting subjective quality evaluation into objective constraint verification.

Abstract: Large Language Models (LLMs) have demonstrated remarkable progress in long-context understanding, yet they face significant challenges in high-quality long-form generation. Existing studies primarily suffer from two limitations: (1) A heavy reliance on scarce, high-quality long-form response data for supervised fine-tuning (SFT) or for pairwise preference reward in reinforcement learning (RL). (2) Focus on coarse-grained quality optimization dimensions, such as relevance, coherence, and helpfulness, overlooking the fine-grained specifics inherent to diverse long-form generation scenarios. To address these issues, we propose a framework using Adaptive Constraint-Enhanced reward for long-form generation Reinforcement Learning (ACE-RL). ACE-RL first automatically deconstructs each instruction into a set of fine-grained, adaptive constraint criteria by identifying its underlying intents and demands. Subsequently, we design a reward mechanism that quantifies the quality of long-form responses based on their satisfaction of the corresponding constraints, converting subjective quality evaluation into constraint verification. Finally, we utilize reinforcement learning to guide models toward superior long-form generation capabilities. Experimental results demonstrate that our ACE-RL framework significantly outperforms existing SFT and RL baselines by 20.70% and 7.32% on WritingBench, and our top-performing model even surpasses proprietary systems like GPT-4o by 7.10%, providing a more effective training paradigm for LLMs to generate high-quality content across diverse long-form generation scenarios.
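A sketch of the constraint-verification reward idea; the `verify` call and the constraint strings are hypothetical stand-ins for the framework's components:

```python
def ace_reward(response: str, constraints: list[str], verify) -> float:
    """Reward = fraction of fine-grained constraints the response satisfies,
    turning subjective quality judgment into checkable items."""
    satisfied = sum(bool(verify(response, c)) for c in constraints)
    return satisfied / max(len(constraints), 1)

# e.g. constraints = ["uses a formal register",
#                     "covers all three requested subtopics",
#                     "runs to at least 800 words"]
```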

[79] ToM-SSI: Evaluating Theory of Mind in Situated Social Interactions

Matteo Bortoletto, Constantin Ruhdorfer, Andreas Bulling

Main category: cs.CL

TL;DR: ToM-SSI is a new multimodal benchmark that tests Theory of Mind capabilities in complex social interactions with up to 4 agents, moving beyond traditional text-only Sally-Anne tests to capture richer social cognition.

DetailsMotivation: Existing ToM benchmarks are limited to simple text-only or dyadic interactions using Sally-Anne test variations, which fail to capture the complexity of real human social interactions and spatial dynamics.

Method: Developed ToM-SSI benchmark featuring multimodal environments with group interactions of up to four agents that communicate and move in situated environments, enabling study of mixed cooperative-obstructive settings and parallel mental state reasoning.

Result: Evaluations show current models’ performance is severely limited, especially in these new complex tasks involving multiple agents and spatial dynamics.

Conclusion: The benchmark reveals critical gaps in foundation models’ Theory of Mind capabilities and highlights the need for future research to address these limitations in complex social cognition tasks.

Abstract: Most existing Theory of Mind (ToM) benchmarks for foundation models rely on variations of the Sally-Anne test, offering only a very limited perspective on ToM and neglecting the complexity of human social interactions. To address this gap, we propose ToM-SSI: a new benchmark specifically designed to test ToM capabilities in environments rich with social interactions and spatial dynamics. While current ToM benchmarks are limited to text-only or dyadic interactions, ToM-SSI is multimodal and includes group interactions of up to four agents that communicate and move in situated environments. This unique design allows us to study, for the first time, mixed cooperative-obstructive settings and reasoning about multiple agents’ mental state in parallel, thus capturing a wider range of social cognition than existing benchmarks. Our evaluations reveal that the current models’ performance is still severely limited, especially in these new tasks, highlighting critical gaps for future research.

Midhun Shyam, Jim Basilakis, Kieran Luken, Steven Thomas, John Crozier, Paul M. Middleton, X. Rosalind Wang

Main category: cs.CL

TL;DR: A pipeline for classifying sensitive triage data using LLMs with limited compute resources, combining GPU fine-tuning on open data and CPU fine-tuning on hospital-specific data.

DetailsMotivation: Triage notes contain valuable medical information but face challenges: privacy regulations requiring on-site analysis, lack of hardware for LLM training, and costly manual labeling by experts.

Method: Fine-tuned pre-trained LLM with classifier using 2k open-source dataset on GPU, then further fine-tuned with 1000 hospital-specific samples on CPU.

Result: Successfully classified triage data with limited compute resources through careful dataset curation and leveraging existing models.

Conclusion: The pipeline enables effective triage data classification despite hardware limitations and privacy constraints, making LLM analysis accessible to hospitals with limited resources.

Abstract: Triage notes, created at the start of a patient's hospital visit, contain a wealth of information that can help medical staff and researchers understand Emergency Department patient epidemiology and the degree of time-dependent illness or injury. Unfortunately, applying modern Natural Language Processing and Machine Learning techniques to analyse triage data faces some challenges: firstly, hospital data contains highly sensitive information that is subject to privacy regulation and thus needs to be analysed on site; secondly, most hospitals and medical facilities lack the necessary hardware to fine-tune a Large Language Model (LLM), much less train one from scratch; lastly, to identify the records of interest, expert input is needed to manually label the datasets, which can be time-consuming and costly. We present in this paper a pipeline that enables the classification of triage data using an LLM and limited compute resources. We first fine-tuned a pre-trained LLM with a classifier using a small (2k) open-source dataset on a GPU, and then further fine-tuned the model with a hospital-specific dataset of 1000 samples on a CPU. We demonstrate that by carefully curating the datasets and leveraging existing models and open-source data, we can successfully classify triage data with limited compute resources.

[81] ICR: Iterative Clarification and Rewriting for Conversational Search

Zhiyu Cao, Peifeng Li, Qiaoming Zhu

Main category: cs.CL

TL;DR: ICR framework uses iterative clarification questions and rewriting to handle multiple fuzzy expressions in conversational queries, achieving SOTA performance.

DetailsMotivation: End-to-end query rewriting struggles with multiple fuzzy expressions that require simultaneous identification and rewriting at multiple positions.

Method: Proposed ICR framework with iterative process alternating between generating clarification questions and rewritten queries.

Result: ICR continuously improves retrieval performance through iterative process and achieves state-of-the-art performance on two datasets.

Conclusion: Iterative clarification-rewriting approach effectively addresses limitations of end-to-end rewriting for conversational queries with multiple fuzzy expressions.

Abstract: Most previous work on Conversational Query Rewriting employs an end-to-end rewriting paradigm. However, this approach is hindered by the issue of multiple fuzzy expressions within the query, which complicates the simultaneous identification and rewriting of multiple positions. To address this issue, we propose a novel framework ICR (Iterative Clarification and Rewriting), an iterative rewriting scheme that pivots on clarification questions. Within this framework, the model alternates between generating clarification questions and rewritten queries. The experimental results show that our ICR can continuously improve retrieval performance in the clarification-rewriting iterative process, thereby achieving state-of-the-art performance on two popular datasets.
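A schematic of the clarify-then-rewrite loop; the two model calls are placeholders for the framework's trained components:

```python
def icr_rewrite(query: str, history: list[str], model, max_iters: int = 3) -> str:
    rewritten = query
    for _ in range(max_iters):
        # 1) Surface one remaining fuzzy expression as a clarification question.
        clarification = model.ask_clarification(rewritten, history)
        if clarification is None:          # nothing left to resolve
            break
        # 2) Rewrite the query with that single ambiguity resolved.
        rewritten = model.rewrite(rewritten, clarification, history)
    return rewritten
```

Handling one fuzzy expression per iteration is what lets the scheme sidestep the multi-position problem that trips up end-to-end rewriting.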

[82] Optimizing Small Transformer-Based Language Models for Multi-Label Sentiment Analysis in Short Texts

Julius Neumann, Robert Lange, Yuni Susanti, Michael Färber

Main category: cs.CL

TL;DR: Small Transformer models (BERT/RoBERTa) evaluated for multi-label sentiment classification on short texts. Data augmentation helps, continued pre-training on augmented data adds noise, and classification head modifications provide minimal benefits.

DetailsMotivation: Short text sentiment classification faces challenges like class imbalance, limited training samples, subjectivity, and data sparsity due to limited context, making effective learning difficult.

Method: Evaluated three factors: (1) continued domain-specific pre-training, (2) generative data augmentation using automatically generated examples, and (3) architectural variations of the classification head using BERT and RoBERTa models (<1B parameters).

Result: Data augmentation improves classification performance, continued pre-training on augmented datasets introduces noise instead of boosting accuracy, and classification head modifications yield only marginal benefits.

Conclusion: Provides practical guidance for optimizing BERT-based models in resource-constrained settings and refining strategies for short-text sentiment classification, emphasizing the value of data augmentation while cautioning against continued pre-training on augmented data.

Abstract: Sentiment classification in short text datasets faces significant challenges such as class imbalance, limited training samples, and the inherent subjectivity of sentiment labels – issues that are further intensified by the limited context in short texts. These factors make it difficult to resolve ambiguity and exacerbate data sparsity, hindering effective learning. In this paper, we evaluate the effectiveness of small Transformer-based models (i.e., BERT and RoBERTa, with fewer than 1 billion parameters) for multi-label sentiment classification, with a particular focus on short-text settings. Specifically, we evaluated three key factors influencing model performance: (1) continued domain-specific pre-training, (2) data augmentation using automatically generated examples, specifically generative data augmentation, and (3) architectural variations of the classification head. Our experiment results show that data augmentation improves classification performance, while continued pre-training on augmented datasets can introduce noise rather than boost accuracy. Furthermore, we confirm that modifications to the classification head yield only marginal benefits. These findings provide practical guidance for optimizing BERT-based models in resource-constrained settings and refining strategies for sentiment classification in short-text datasets.

[83] Do Large Language Models Need Intent? Revisiting Response Generation Strategies for Service Assistant

Inbal Bolshinsky, Shani Kupiec, Almog Sasson, Yehudit Aperstein, Alexander Apartsin

Main category: cs.CL

TL;DR: Comparative study examines whether explicit intent recognition is necessary for quality service responses or if models can generate effective replies directly, challenging conventional AI pipeline assumptions.

DetailsMotivation: Address the fundamental design dilemma in conversational AI about whether explicit intent recognition is a prerequisite for generating high-quality service responses or if models can bypass this step.

Method: Benchmark several state-of-the-art language models (including fine-tuned T5 variant) on two public service interaction datasets across Intent-First Response Generation and Direct Response Generation paradigms.

Result: Evaluation using linguistic quality and task success metrics reveals surprising insights about the necessity or redundancy of explicit intent modeling.

Conclusion: Findings challenge conventional assumptions in conversational AI pipelines and offer actionable guidelines for designing more efficient and effective response generation systems.

Abstract: In the era of conversational AI, generating accurate and contextually appropriate service responses remains a critical challenge. A central question remains: Is explicit intent recognition a prerequisite for generating high-quality service responses, or can models bypass this step and produce effective replies directly? This paper conducts a rigorous comparative study to address this fundamental design dilemma. Leveraging two publicly available service interaction datasets, we benchmark several state-of-the-art language models, including a fine-tuned T5 variant, across both paradigms: Intent-First Response Generation and Direct Response Generation. Evaluation metrics encompass both linguistic quality and task success rates, revealing surprising insights into the necessity or redundancy of explicit intent modelling. Our findings challenge conventional assumptions in conversational AI pipelines, offering actionable guidelines for designing more efficient and effective response generation systems.

[84] Masked Diffusion Language Models with Frequency-Informed Training

Despoina Kosmopoulou, Efthymios Georgiou, Vaggelis Dorovatas, Georgios Paraskevopoulos, Alexandros Potamianos

Main category: cs.CL

TL;DR: Masked diffusion language modeling framework for data-efficient training in BabyLM 2025 Challenge, showing competitive performance to hybrid baselines.

DetailsMotivation: To develop data-efficient language modeling under strict data constraints using diffusion training objectives.

Method: Applies diffusion training with frequency-informed masking that prioritizes rare tokens, explores multiple noise scheduling strategies and noise weighting schemes within NELBO objective.

Result: Performance competitive to hybrid autoregressive-masked baselines on BabyLM benchmark suite (linguistic competence, world knowledge, human-likeness).

Conclusion: Diffusion-based training offers a viable alternative for data-restricted language learning.

Abstract: We present a masked diffusion language modeling framework for data-efficient training for the BabyLM 2025 Challenge. Our approach applies diffusion training objectives to language modeling under strict data constraints, incorporating frequency-informed masking that prioritizes learning from rare tokens while maintaining theoretical validity. We explore multiple noise scheduling strategies, including two-mode approaches, and investigate different noise weighting schemes within the NELBO objective. We evaluate our method on the BabyLM benchmark suite, measuring linguistic competence, world knowledge, and human-likeness. Results show performance competitive to hybrid autoregressive-masked baselines, demonstrating that diffusion-based training offers a viable alternative for data-restricted language learning.
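A minimal sketch of frequency-informed masking: rarer tokens receive a higher masking probability. The inverse-frequency weighting below is an assumed form, not necessarily the paper's schedule:

```python
import numpy as np

def masking_probs(token_ids, counts, base_rate=0.15, smoothing=1.0):
    """Per-token mask probabilities that up-weight rare tokens while
    keeping the average masking rate near `base_rate`."""
    freqs = np.array([counts[t] for t in token_ids], dtype=float)
    weights = 1.0 / (freqs + smoothing)   # rare tokens -> large weight
    weights /= weights.mean()             # preserve the overall rate
    return np.clip(base_rate * weights, 0.0, 1.0)

# Draw the mask: rng.random(len(token_ids)) < masking_probs(...)
```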

[85] HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models

Chang Dai, Hongyu Shan, Mingyang Song, Di Liang

Main category: cs.CL

TL;DR: HoPE is a new positional encoding method that uses hyperbolic geometry to fix RoPE’s oscillation issues, enabling better long-range dependency modeling and outperforming existing methods on long sequences.

DetailsMotivation: Existing positional encodings struggle with long sequences - absolute encodings can't extrapolate, relative methods degrade on extreme lengths, and RoPE has oscillatory attention patterns that hinder stable long-distance modeling.

Method: Proposes Hyperbolic Rotary Positional Encoding (HoPE) inspired by Lorentz transformations in hyperbolic geometry, using hyperbolic functions to implement Lorentz rotations on token representations.

Result: HoPE fundamentally resolves RoPE’s oscillation issues by enforcing monotonic decay of attention weights with distance. Extensive experiments show it consistently exceeds existing positional encoding methods on extended sequence benchmarks.

Conclusion: HoPE demonstrates enhanced capacity for representing and generalizing long-range dependencies, with theoretical analysis showing RoPE is a special case of this generalized formulation.

Abstract: Positional encoding mechanisms enable Transformers to model sequential structure and long-range dependencies in text. While absolute positional encodings struggle with extrapolation to longer sequences due to fixed positional representations, and relative approaches like Alibi exhibit performance degradation on extremely long contexts, the widely-used Rotary Positional Encoding (RoPE) introduces oscillatory attention patterns that hinder stable long-distance dependency modelling. We address these limitations through a geometric reformulation of positional encoding. Drawing inspiration from Lorentz transformations in hyperbolic geometry, we propose Hyperbolic Rotary Positional Encoding (HoPE), which leverages hyperbolic functions to implement Lorentz rotations on token representations. Theoretical analysis demonstrates that RoPE is a special case of our generalized formulation. HoPE fundamentally resolves RoPE's oscillation issues by enforcing monotonic decay of attention weights with increasing token distances. Extensive experimental results, including perplexity evaluations on several extended-sequence benchmarks, show that HoPE consistently exceeds existing positional encoding methods. These findings underscore HoPE's enhanced capacity for representing and generalizing long-range dependencies. Data and code will be available.
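A speculative sketch of what a hyperbolic analogue of the RoPE rotation might look like on a single 2-D feature pair, swapping the circular rotation for a Lorentz boost; the frequency schedule and whatever normalization keeps cosh/sinh from overflowing at long range are not reproduced here:

```python
import numpy as np

def lorentz_boost(pair: np.ndarray, pos: int, theta: float) -> np.ndarray:
    """RoPE applies [[cos, -sin], [sin, cos]]; a Lorentz boost uses
    [[cosh, sinh], [sinh, cosh]], whose induced attention profile
    decays monotonically with distance instead of oscillating."""
    a = pos * theta
    boost = np.array([[np.cosh(a), np.sinh(a)],
                      [np.sinh(a), np.cosh(a)]])
    return boost @ pair
```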

[86] Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations

Patrick Amadeus Irawan, Ryandito Diandaru, Belati Jagad Bintang Syuhada, Randy Zakya Suchrady, Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya

Main category: cs.CL

TL;DR: Entropy2Vec uses language model entropy to create cross-lingual embeddings that capture typological relationships between languages, outperforming traditional sparse typological features.

DetailsMotivation: Traditional typological inventories suffer from feature sparsity and static snapshots, limiting their effectiveness for capturing dynamic language relationships.

Method: Train language models on individual languages and use the entropy of their predictions as a measure of structural similarity - low entropy indicates high similarity, high entropy indicates divergence.

Result: Produces dense, non-sparse language embeddings that align with established typological categories and achieve competitive performance in multilingual NLP tasks.

Conclusion: Entropy2Vec provides an effective framework for deriving cross-lingual representations that overcome limitations of traditional typological approaches and perform well in downstream applications.

Abstract: We introduce Entropy2Vec, a novel framework for deriving cross-lingual language representations by leveraging the entropy of monolingual language models. Unlike traditional typological inventories that suffer from feature sparsity and static snapshots, Entropy2Vec uses the inherent uncertainty in language models to capture typological relationships between languages. By training a language model on a single language, we hypothesize that the entropy of its predictions reflects its structural similarity to other languages: Low entropy indicates high similarity, while high entropy suggests greater divergence. This approach yields dense, non-sparse language embeddings that are adaptable to different timeframes and free from missing values. Empirical evaluations demonstrate that Entropy2Vec embeddings align with established typological categories and achieve competitive performance in downstream multilingual NLP tasks, such as those addressed by the LinguAlchemy framework.
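A sketch of the core Entropy2Vec measurement: mean predictive entropy of one monolingual LM over text in some other language (tokenization and batching are elided):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_entropy(logits: torch.Tensor) -> float:
    """logits: (seq_len, vocab_size). Low entropy = the LM finds the
    text structurally familiar; high entropy = greater divergence."""
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)   # per-token entropy
    return entropy.mean().item()
```

Stacking one such score per (model language, target language) pair yields the dense, missing-value-free embeddings the abstract describes.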

[87] CURE: Controlled Unlearning for Robust Embeddings – Mitigating Conceptual Shortcuts in Pre-Trained Language Models

Aysenur Kocak, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci

Main category: cs.CL

TL;DR: CURE is a lightweight framework that disentangles and suppresses conceptual shortcuts in pre-trained language models while preserving essential content information, achieving significant performance improvements with minimal computational overhead.

DetailsMotivation: Pre-trained language models are susceptible to spurious, concept-driven correlations that impair robustness and fairness, requiring methods to combat these conceptual biases.

Method: Uses a content extractor with reversal network to get concept-irrelevant representations, followed by controllable debiasing module with contrastive learning to adjust residual conceptual cues.

Result: Achieves +10 points F1 improvement on IMDB and +2 points on Yelp across three pre-trained architectures with minimal computational overhead.

Conclusion: Provides a flexible, unsupervised blueprint for combating conceptual biases, enabling more reliable and fair language understanding systems.

Abstract: Pre-trained language models have achieved remarkable success across diverse applications but remain susceptible to spurious, concept-driven correlations that impair robustness and fairness. In this work, we introduce CURE, a novel and lightweight framework that systematically disentangles and suppresses conceptual shortcuts while preserving essential content information. Our method first extracts concept-irrelevant representations via a dedicated content extractor reinforced by a reversal network, ensuring minimal loss of task-relevant information. A subsequent controllable debiasing module employs contrastive learning to finely adjust the influence of residual conceptual cues, enabling the model to either diminish harmful biases or harness beneficial correlations as appropriate for the target task. Evaluated on the IMDB and Yelp datasets using three pre-trained architectures, CURE achieves an absolute improvement of +10 points in F1 score on IMDB and +2 points on Yelp, while introducing minimal computational overhead. Our approach establishes a flexible, unsupervised blueprint for combating conceptual biases, paving the way for more reliable and fair language understanding systems.
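One standard way to implement a "reversal network" for this kind of disentanglement is a gradient reversal layer; this common sketch is an assumption about the mechanism, not the authors' code:

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam: float = 1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # Identity on the forward pass, negated gradient on the way back:
        # the content extractor is pushed to *discard* concept information.
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam: float = 1.0):
    return GradReverse.apply(x, lam)
```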

[88] Triadic Fusion of Cognitive, Functional, and Causal Dimensions for Explainable LLMs: The TAXAL Framework

David Herrera-Poyatos, Carlos Peláez-González, Cristina Zuheros, Virilo Tejedor, Rosana Montes, Francisco Herrera

Main category: cs.CL

TL;DR: TAXAL is a triadic framework that unites cognitive, functional, and causal dimensions to provide comprehensive explanations for agentic LLMs, addressing opacity and bias issues in high-risk domains.

DetailsMotivation: Traditional explainability methods fail to capture the reasoning pathways and systemic impacts of agentic LLMs, undermining trust and accountability in high-risk deployment domains.

Method: Triadic fusion framework combining cognitive (user understanding), functional (practical utility), and causal (faithful reasoning) dimensions, with analysis of existing methods and case studies across multiple domains.

Result: Provides a unified foundation for designing, evaluating, and deploying explanations that adapt to institutional constraints and stakeholder roles in various sociotechnical settings.

Conclusion: TAXAL advances explainability as both technical and sociotechnical practice, supporting trustworthy and context-sensitive LLM applications in the era of agentic AI.

Abstract: Large Language Models (LLMs) are increasingly being deployed in high-risk domains where opacity, bias, and instability undermine trust and accountability. Traditional explainability methods, focused on surface outputs, do not capture the reasoning pathways, planning logic, and systemic impacts of agentic LLMs. We introduce TAXAL (Triadic Alignment for eXplainability in Agentic LLMs), a triadic fusion framework that unites three complementary dimensions: cognitive (user understanding), functional (practical utility), and causal (faithful reasoning). TAXAL provides a unified, role-sensitive foundation for designing, evaluating, and deploying explanations in diverse sociotechnical settings. Our analysis synthesizes existing methods, ranging from post-hoc attribution and dialogic interfaces to explanation-aware prompting, and situates them within the TAXAL triadic fusion model. We further demonstrate its applicability through case studies in law, education, healthcare, and public services, showing how explanation strategies adapt to institutional constraints and stakeholder roles. By combining conceptual clarity with design patterns and deployment pathways, TAXAL advances explainability as a technical and sociotechnical practice, supporting trustworthy and context-sensitive LLM applications in the era of agentic AI.

[89] Hunyuan-MT Technical Report

Mao Zheng, Zheng Li, Bingxin Qu, Mingyang Song, Yang Du, Mingrui Sun, Di Wang

Main category: cs.CL

TL;DR: Hunyuan-MT-7B is a multilingual translation model supporting 33 languages with special focus on Mandarin-minority language translation, and Hunyuan-MT-Chimera-7B enhances performance through slow-thinking integration of multiple outputs.

DetailsMotivation: To address the need for high-quality multilingual translation, particularly between Mandarin and ethnic minority languages/dialects, and to improve translation performance through innovative slow-thinking approaches.

Method: Holistic training process: general and MT-oriented pre-training, Supervised Fine-Tuning (SFT), Reinforcement Learning (RL) and weak-to-strong RL alignment. Chimera model integrates multiple outputs from base model under varying parameters.

Result: Significantly outperforms comparable translation-specific models and most SOTA large models. Ranked first in 30 out of 31 language pairs in WMT2025 shared task, demonstrating robustness across high-resource and low-resource languages.

Conclusion: The models achieve state-of-the-art performance in multilingual translation, particularly excelling in Mandarin-minority language translation tasks, showcasing effectiveness of the holistic training approach and slow-thinking integration method.

Abstract: In this report, we introduce Hunyuan-MT-7B, our first open-source multilingual translation model, which supports bidirectional translation across 33 major languages and places a special emphasis on translation between Mandarin and several ethnic minority languages as well as dialects. Furthermore, to serve and address diverse translation scenarios and enhance model performance at test time, we introduce Hunyuan-MT-Chimera-7B, a translation model inspired by the slow thinking mode. This model integrates multiple outputs generated by the Hunyuan-MT-7B model under varying parameter settings, thereby achieving performance superior to that of conventional slow-thinking models based on Chain-of-Thought (CoT). The development of our models follows a holistic training process specifically engineered for multilingual translation, which begins with general and MT-oriented pre-training to build foundational capabilities, proceeds to Supervised Fine-Tuning (SFT) for task-specific adaptation, and culminates in advanced alignment through Reinforcement Learning (RL) and weak-to-strong RL. Through comprehensive experimentation, we demonstrate that both Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B significantly outperform all translation-specific models of comparable parameter size and most of the SOTA large models, particularly on the task of translation between Mandarin and minority languages as well as dialects. In the WMT2025 shared task (General Machine Translation), our models demonstrate state-of-the-art performance, ranking first in 30 out of 31 language pairs. This result highlights the robustness of our models across a diverse linguistic spectrum, encompassing high-resource languages such as Chinese, English, and Japanese, as well as low-resource languages including Czech, Marathi, Estonian, and Icelandic.

[90] BEDTime: A Unified Benchmark for Automatically Describing Time Series

Medhasweta Sen, Zachary Gottesman, Jiaxing Qiu, C. Bayan Bruss, Nam Nguyen, Tom Hartvigsen

Main category: cs.CL

TL;DR: The paper introduces a standardized benchmark for evaluating time series-language models across 3 tasks (recognition, differentiation, generation) using unified datasets, revealing that language-only models underperform while vision-language models excel, with multimodal time series models showing promise but needing improvement.

DetailsMotivation: Address the lack of standardized evaluation for time series foundation models, as previous works often introduce new datasets with their models, limiting direct comparisons and obscuring which capabilities drive performance.

Method: Formalize 3 time series description tasks using natural language: recognition (True/False QA), differentiation (multiple choice QA), and generation (open-ended description). Unify 4 recent datasets for head-to-head comparisons and evaluate 13 state-of-the-art models across language, vision-language, and time series-language categories.
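
The three task formats are easy to picture as prompt templates. Below is a hedged sketch; the benchmark's exact wording and answer formats may differ:

```python
# Illustrative prompt templates for the three BEDTime-style tasks;
# the benchmark's actual phrasing is not reproduced here.
def recognition_prompt(series, description):
    # Task 1 -- recognition: True/False question answering.
    return (f"Series: {series}\nDescription: {description}\n"
            "Does the description match the series? Answer True or False.")

def differentiation_prompt(series, options):
    # Task 2 -- differentiation: multiple-choice question answering.
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return (f"Series: {series}\nWhich description fits the series?\n"
            f"{opts}\nAnswer with a single letter.")

def generation_prompt(series):
    # Task 3 -- generation: open-ended natural-language description.
    return f"Series: {series}\nDescribe this time series in plain English."
```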

Result: Language-only methods largely underperform, vision-language models perform well (demonstrating value of vision models), and pretrained multimodal time series-language models outperform LLMs but still have significant room for improvement. All approaches show fragility in robustness tests.

Conclusion: The benchmark provides standardized evaluation for time series reasoning systems, highlighting the need for time series-specific architectures and showing that current models, while promising, require further development to achieve robust performance.

Abstract: Many recent studies have proposed general-purpose foundation models designed for a variety of time series analysis tasks. While several established datasets already exist for evaluating these models, previous works frequently introduce their models in conjunction with new datasets, limiting opportunities for direct, independent comparisons and obscuring insights into the relative strengths of different methods. Additionally, prior evaluations often cover numerous tasks simultaneously, assessing a broad range of model abilities without clearly pinpointing which capabilities contribute to overall performance. To address these gaps, we formalize and evaluate 3 tasks that test a model’s ability to describe time series using generic natural language: (1) recognition (True/False question-answering), (2) differentiation (multiple choice question-answering), and (3) generation (open-ended natural language description). We then unify 4 recent datasets to enable head-to-head model comparisons on each task. Experimentally, in evaluating 13 state-of-the-art language, vision–language, and time series–language models, we find that (1) popular language-only methods largely underperform, indicating a need for time series-specific architectures, (2) VLMs are quite successful, as expected, identifying the value of vision models for these tasks and (3) pretrained multimodal time series–language models successfully outperform LLMs, but still have significant room for improvement. We also find that all approaches exhibit clear fragility in a range of robustness tests. Overall, our benchmark provides a standardized evaluation on a task necessary for time series reasoning systems.

[91] Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

Deniz Bayazit, Aaron Mueller, Antoine Bosselut

Main category: cs.CL

TL;DR: Using sparse crosscoders to track linguistic feature evolution during LLM pretraining, introducing RelIE metric to identify when features become causally important for task performance.

DetailsMotivation: Traditional benchmarking fails to reveal how language models acquire specific linguistic concepts and capabilities during pretraining, creating a gap in understanding model training at the concept level.

Method: Train sparse crosscoders between open-sourced checkpoint triplets with significant performance shifts, and introduce Relative Indirect Effects (RelIE) metric to trace when individual features become causally important.
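
The summary does not spell out the RelIE formula, but one plausible reading, normalizing a feature's indirect effect at each checkpoint by its total across checkpoints, can be sketched as follows (treat this definition as an assumption, not the paper's):

```python
# Hedged sketch: RelIE as each checkpoint's share of a feature's total
# indirect effect (IE), where IE is measured by ablating the feature.
import numpy as np

def relie(indirect_effects: np.ndarray) -> np.ndarray:
    """indirect_effects: shape (n_checkpoints,) -- IE of one crosscoder
    feature on task performance at each checkpoint."""
    total = indirect_effects.sum()
    if total == 0:
        return np.zeros_like(indirect_effects)
    return indirect_effects / total

# A feature "emerges" around the first checkpoint where its share is large.
print(relie(np.array([0.0, 0.1, 0.9])))  # -> [0.  0.1 0.9]
```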

Result: Crosscoders can detect feature emergence, maintenance, and discontinuation during pretraining, providing fine-grained analysis of representation learning.

Conclusion: The approach is architecture-agnostic and scalable, offering a promising path toward more interpretable analysis of representation learning throughout pretraining.

Abstract: Large language models (LLMs) learn non-trivial abstractions during pretraining, like detecting irregular plural noun subjects. However, it is not well understood when and how specific linguistic abilities emerge as traditional evaluation methods such as benchmarking fail to reveal how models acquire concepts and capabilities. To bridge this gap and better understand model training at the concept level, we use sparse crosscoders to discover and align features across model checkpoints. Using this approach, we track the evolution of linguistic features during pretraining. We train crosscoders between open-sourced checkpoint triplets with significant performance and representation shifts, and introduce a novel metric, Relative Indirect Effects (RelIE), to trace training stages at which individual features become causally important for task performance. We show that crosscoders can detect feature emergence, maintenance, and discontinuation during pretraining. Our approach is architecture-agnostic and scalable, offering a promising path toward more interpretable and fine-grained analysis of representation learning throughout pretraining.

[92] Less is More Tokens: Efficient Math Reasoning via Difficulty-Aware Chain-of-Thought Distillation

Abdul Waheed, Chancharik Mitra, Laurie Z. Wang, Deva Ramanan, Bhiksha Raj

Main category: cs.CL

TL;DR: A framework for difficulty-aware reasoning that teaches models to dynamically adjust reasoning depth based on problem complexity without architectural changes, using curated data with proportional chain-of-thought traces.

DetailsMotivation: Chain-of-thought reasoning produces unnecessarily verbose output for simpler problems, creating inefficiency in model reasoning processes.

Method: Post-training on carefully curated data that includes chain-of-thought traces proportional to problem difficulty, using supervised fine-tuning (SFT) and direct preference optimization (DPO) techniques.
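
A hedged sketch of what difficulty-proportional curation might look like, assuming each problem carries a difficulty score in [0, 1] and several candidate traces; the budget constants are illustrative:

```python
# Keep, for each problem, a correct chain-of-thought trace whose length
# scales with the problem's difficulty score. Constants are placeholders.
def curate(problems, max_tokens=2048, min_tokens=64):
    curated = []
    for p in problems:  # p: {"question": str, "difficulty": float, "traces": [str]}
        budget = int(min_tokens + p["difficulty"] * (max_tokens - min_tokens))
        # choose the longest trace that still fits the difficulty budget
        fits = [t for t in p["traces"] if len(t.split()) <= budget]
        if fits:
            curated.append({"question": p["question"],
                            "trace": max(fits, key=lambda t: len(t.split()))})
    return curated
```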

Result: Models can learn to dynamically adjust reasoning depth, with SFT capturing reasoning length/format patterns and DPO preserving reasoning accuracy. Combination reduces length while maintaining or improving performance.

Conclusion: Models can successfully learn to “think proportionally” - reasoning minimally on simple problems while maintaining depth for complex ones, achieving more efficient reasoning without architectural modifications.

Abstract: Chain-of-thought reasoning, while powerful, can produce unnecessarily verbose output for simpler problems. We present a framework for difficulty-aware reasoning that teaches models to dynamically adjust reasoning depth based on problem complexity. Remarkably, we show that models can be endowed with such dynamic inference pathways without any architectural modifications; we simply post-train on data that is carefully curated to include chain-of-thought traces that are proportional in length to problem difficulty. Our analysis reveals that post-training via supervised fine-tuning (SFT) primarily captures patterns like reasoning length and format, while direct preference optimization (DPO) preserves reasoning accuracy, with their combination reducing length and maintaining or improving performance. Both quantitative metrics and qualitative assessments confirm that models can learn to “think proportionally”, reasoning minimally on simple problems while maintaining depth for complex ones.

[93] Uniform Information Density and Syntactic Reduction: Revisiting “that”-Mentioning in English Complement Clauses

Hailin Hao, Elsi Kaiser

Main category: cs.CL

TL;DR: Replication study confirms that omission of the English complementizer ’that’ follows the Uniform Information Density principle, with neural language models providing better predictive power than traditional measures.

DetailsMotivation: To revisit and advance research on the Uniform Information Density hypothesis by examining complementizer 'that' usage patterns using modern computational methods and large conversational data.

Method: Analyzed a large-scale contemporary conversational corpus using machine learning and neural language models to refine information density estimates, comparing traditional subcategorization probability measures with contextual word embeddings.
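
As a concrete illustration of refining information-density estimates with a neural language model, the sketch below computes a clause's surprisal with GPT-2; the model and estimation details here are stand-ins, not the paper's setup:

```python
# Surprisal (in bits) of a complement clause given its context, using a
# causal LM. Assumes the context/clause boundary falls on a token boundary.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def clause_surprisal_bits(context: str, clause: str) -> float:
    ids = tok(context + clause, return_tensors="pt").input_ids
    n_ctx = len(tok(context).input_ids)
    with torch.no_grad():
        logprobs = torch.log_softmax(lm(ids).logits[0, :-1], dim=-1)
    targets = ids[0, 1:]                      # token t is predicted at step t-1
    nll = -logprobs[torch.arange(targets.numel()), targets] / math.log(2)
    return nll[n_ctx - 1:].sum().item()       # sum over clause tokens only

print(clause_surprisal_bits("I think", " that he left."))
```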

Result: Replicated the established relationship between information density and ’that’-mentioning, and found that neural language models based on contextual word embeddings account for additional variance in complementizer usage patterns compared to traditional measures.

Conclusion: Contextual word embeddings provide superior estimates of information density for predicting complementizer usage, capturing more nuanced linguistic patterns than previous lexical-based measures.

Abstract: Speakers often have multiple ways to express the same meaning. The Uniform Information Density (UID) hypothesis suggests that speakers exploit this variability to maintain a consistent rate of information transmission during language production. Building on prior work linking UID to syntactic reduction, we revisit the finding that the optional complementizer “that” in English complement clauses is more likely to be omitted when the clause has low information density (i.e., more predictable). We advance this line of research by analyzing a large-scale, contemporary conversational corpus and using machine learning and neural language models to refine estimates of information density. Our results replicated the established relationship between information density and “that”-mentioning. However, we found that previous measures of information density based on matrix verbs’ subcategorization probability capture substantial idiosyncratic lexical variation. By contrast, estimates derived from contextual word embeddings account for additional variance in patterns of complementizer usage.

[94] Elucidating the Design Space of Decay in Linear Attention

Zhen Qin, Xuyang Shen, Yiran Zhong

Main category: cs.CL

TL;DR: Comprehensive analysis of decay mechanisms in linear complexity sequence models, examining parameterization strategies, parameter sharing, decay granularity, and RoPE compatibility, with key findings on optimal configurations and limitations.

DetailsMotivation: To systematically investigate and understand the decay mechanisms in linear complexity sequence models across multiple dimensions to identify optimal design choices and limitations.

Method: Extensive experiments on diverse language modeling tasks, analyzing four key dimensions: parameterization strategy, parameter sharing, decay granularity (scalar vs vector), and compatibility with RoPE positional encoding.
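
The design space is easiest to see in the underlying recurrence: a decayed running state updated with each key-value outer product. The sketch below contrasts scalar and vector decay granularity; shapes and parameterization are illustrative:

```python
# Linear attention with per-step decay. A scalar decay shrinks the whole
# state uniformly; a vector decay applies one rate per key dimension.
import torch

def linear_attention(q, k, v, decay):
    """q, k: (T, d_k); v: (T, d_v); decay: 0-dim scalar or (d_k,) vector in (0,1)."""
    T, d_k = q.shape
    S = torch.zeros(d_k, v.shape[1])           # running state: decayed sum of k v^T
    out = []
    for t in range(T):
        S = decay.unsqueeze(-1) * S if decay.dim() == 1 else decay * S
        S = S + torch.outer(k[t], v[t])
        out.append(q[t] @ S)
    return torch.stack(out)

T, dk, dv = 8, 16, 16
q, k, v = (torch.randn(T, dk) for _ in range(3))
y_scalar = linear_attention(q, k, v, torch.tensor(0.9))               # scalar decay
y_vector = linear_attention(q, k, v, torch.sigmoid(torch.randn(dk)))  # vector decay
```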

Result: Found that effective decay parameterization requires specific parameter ranges; parameter sharing cannot be arbitrary as it affects decay values; scalar decay generally underperforms vector decay but can outperform in certain scenarios; RoPE typically doesn’t benefit most linear attention mechanisms.

Conclusion: Decay mechanism design requires careful consideration across multiple dimensions, with specific optimal configurations and limitations identified, particularly regarding parameter sharing strategies and the limited utility of RoPE for linear attention.

Abstract: This paper presents a comprehensive investigation into the decay mechanisms inherent in linear complexity sequence models. We systematically delineate the design space of decay mechanisms across four pivotal dimensions: parameterization strategy, which refers to the computational methodology for decay; parameter sharing, which involves the utilization of supplementary parameters for decay computation; decay granularity, comparing scalar versus vector-based decay; and compatibility with relative positional encoding methods, such as Rotary Position Embedding (RoPE). Through an extensive series of experiments conducted on diverse language modeling tasks, we uncovered several critical insights. Firstly, the design of the parameterization strategy for decay requires meticulous consideration. Our findings indicate that effective configurations are typically confined to a specific range of parameters. Secondly, parameter sharing cannot be used arbitrarily, as it may cause decay values to be too large or too small, thereby significantly impacting performance. Thirdly, under identical parameterization strategies, scalar decay generally underperforms compared to its vector-based counterpart. However, in certain scenarios with alternative parameterization strategies, scalar decay may unexpectedly surpass vector decay in efficacy. Lastly, our analysis reveals that RoPE, a commonly employed relative positional encoding method, typically fails to provide tangible benefits to the majority of linear attention mechanisms.

[95] Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval

Shunyu Zhang, Yaobo Liang, Ming Gong, Daxin Jiang, Nan Duan

Main category: cs.CL

TL;DR: MSM is a multilingual pre-trained language model designed for cross-lingual retrieval that models universal sequential sentence relations across languages using masked sentence prediction with hierarchical contrastive loss.

DetailsMotivation: Existing multilingual PLMs like mBERT and XLM-R are general-purpose, but there's a need for models specifically tailored for cross-lingual retrieval. The observation that parallel documents maintain similar sentence order across languages provides a universal structure that can be leveraged for better cross-lingual representation learning.

Method: Proposed Masked Sentence Model (MSM) with sentence encoder for sentence representations and shared document encoder for sequential sentence relations. Uses masked sentence prediction task with hierarchical contrastive loss and negative sampling for training.
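
A hedged sketch of the contrastive core of masked sentence prediction: the document encoder's prediction for a masked position is scored against the true sentence vector and sampled negatives (the paper's loss is hierarchical; this shows only a plain InfoNCE form):

```python
# InfoNCE over sentence vectors: the predicted vector for a masked
# sentence should be closest to the true vector among sampled negatives.
import torch
import torch.nn.functional as F

def masked_sentence_loss(pred, target, negatives, tau=0.05):
    """pred, target: (d,); negatives: (n, d) sentence vectors from other docs."""
    cands = torch.cat([target.unsqueeze(0), negatives], dim=0)   # (n+1, d)
    sims = F.cosine_similarity(pred.unsqueeze(0), cands) / tau   # (n+1,)
    # index 0 (the true sentence vector) is the positive class
    return F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long))

loss = masked_sentence_loss(torch.randn(128), torch.randn(128), torch.randn(15, 128))
```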

Result: MSM significantly outperforms existing advanced pre-training models on four cross-lingual retrieval tasks, demonstrating stronger cross-lingual retrieval capabilities.

Conclusion: Modeling universal sequential sentence relations across languages through masked sentence prediction is an effective approach for cross-lingual representation learning and retrieval, with MSM showing superior performance over general-purpose multilingual PLMs.

Abstract: Recently, multilingual pre-trained language models (PLMs) such as mBERT and XLM-R have achieved impressive strides in cross-lingual dense retrieval. Despite these successes, they are general-purpose PLMs, while multilingual PLMs tailored for cross-lingual retrieval remain unexplored. Motivated by an observation that the sentences in parallel documents are approximately in the same order, which is universal across languages, we propose to model this sequential sentence relation to facilitate cross-lingual representation learning. Specifically, we propose a multilingual PLM called masked sentence model (MSM), which consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document. The document encoder is shared for all languages to model the universal sequential sentence relation across languages. To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives. Comprehensive experiments on four cross-lingual retrieval tasks show MSM significantly outperforms existing advanced pre-training models, demonstrating the effectiveness and stronger cross-lingual retrieval capabilities of our approach. Code and model are available at https://github.com/shunyuzh/MSM.

[96] Demystifying Chains, Trees, and Graphs of Thoughts

Maciej Besta, Florim Memedi, Zhenyu Zhang, Robert Gerstenberger, Guangyuan Piao, Nils Blach, Piotr Nyczyk, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Lukas Gianinazzi, Ales Kubicek, Hubert Niewiadomski, Aidan O’Mahony, Onur Mutlu, Torsten Hoefler

Main category: cs.CL

TL;DR: This paper provides a comprehensive taxonomy and analysis of structure-enhanced LLM reasoning techniques like Chain-of-Thought, Tree of Thoughts, and Graph of Thoughts, proposing a general blueprint for effective prompt engineering.

DetailsMotivation: To facilitate understanding of the growing field of structure-enhanced LLM reasoning and pave the way for future developments by creating a systematic framework for analyzing and comparing different prompting techniques.

Method: Conducted in-depth analysis of prompt execution pipeline, defined key concepts, built the first taxonomy of structure-enhanced LLM reasoning schemes, identified fundamental classes of harnessed structures (reasoning topologies), and analyzed representations and algorithms.
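
As a toy illustration of the "reasoning topology" view (not the paper's blueprint): chains, trees, and graphs of thoughts are all graphs over intermediate LLM outputs, differing only in whether paths branch and merge:

```python
# A minimal data structure for reasoning topologies: each thought is a
# node pointing back at the thoughts it was derived from.
from dataclasses import dataclass, field

@dataclass
class Thought:
    text: str
    parents: list = field(default_factory=list)  # [] for the root prompt

root = Thought("problem statement")
step = Thought("next step", [root])          # single parent chain -> Chain-of-Thought
a = Thought("partial solution A", [root])
b = Thought("partial solution B", [root])    # branching -> Tree of Thoughts
merged = Thought("merge A and B", [a, b])    # merging -> Graph of Thoughts
```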

Result: Developed a comprehensive taxonomy that compares existing prompting schemes, revealing how different design choices affect performance and cost. Established theoretical foundations and relationships between prompting and other LLM ecosystem components.

Conclusion: The proposed taxonomy and analysis framework will help advance future prompt engineering techniques by providing a systematic way to understand, compare, and develop structure-enhanced LLM reasoning methods.

Abstract: The field of natural language processing (NLP) has witnessed significant progress in recent years, with a notable focus on improving large language models’ (LLM) performance through innovative prompting techniques. Among these, prompt engineering coupled with structures has emerged as a promising paradigm, with designs such as Chain-of-Thought, Tree of Thoughts, or Graph of Thoughts, in which the overall LLM reasoning is guided by a structure such as a graph. As illustrated with numerous examples, this paradigm significantly enhances the LLM’s capability to solve numerous tasks, ranging from logical or mathematical reasoning to planning or creative writing. To facilitate the understanding of this growing field and pave the way for future developments, we devise a general blueprint for effective and efficient LLM reasoning schemes. For this, we conduct an in-depth analysis of the prompt execution pipeline, clarifying and clearly defining different concepts. We then build the first taxonomy of structure-enhanced LLM reasoning schemes. We focus on identifying fundamental classes of harnessed structures, and we analyze the representations of these structures, algorithms executed with these structures, and many others. We refer to these structures as reasoning topologies, because their representation becomes to a degree spatial, as they are contained within the LLM context. Our study compares existing prompting schemes using the proposed taxonomy, discussing how certain design choices lead to different patterns in performance and cost. We also outline theoretical underpinnings, relationships between prompting and other parts of the LLM ecosystem such as knowledge bases, and the associated research challenges. Our work will help to advance future prompt engineering techniques.

[97] AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yugang Jiang, Xipeng Qiu

Main category: cs.CL

TL;DR: AnyGPT is an any-to-any multimodal language model that uses discrete representations to unify speech, text, images, and music processing without changing LLM architecture, achieving performance comparable to specialized models.

DetailsMotivation: To create a unified multimodal language model that can handle any combination of inputs and outputs across different modalities (speech, text, images, music) without requiring architectural changes to existing LLMs.

Method: Uses discrete representations for multimodal processing, data-level preprocessing, builds multimodal text-centric dataset for alignment pre-training, and synthesizes a large-scale any-to-any multimodal instruction dataset with 108k multi-turn conversation samples.
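
The discrete-sequence idea can be sketched in a few lines: each modality's tokenizer emits codes in its own vocabulary range, and the ranges are concatenated into one token stream. The offsets and sizes below are assumptions for illustration only:

```python
# Illustrative unification: shift each modality's discrete codes into a
# disjoint id range so a single LLM vocabulary covers all of them.
TEXT_VOCAB, IMG_CODES, AUDIO_CODES = 50_000, 8_192, 4_096
IMG_OFFSET = TEXT_VOCAB                  # image codes live after text ids
AUDIO_OFFSET = TEXT_VOCAB + IMG_CODES    # then audio/speech codes

def to_sequence(text_ids, image_codes, audio_codes):
    seq = list(text_ids)
    seq += [IMG_OFFSET + c for c in image_codes]    # e.g. from a VQ image tokenizer
    seq += [AUDIO_OFFSET + c for c in audio_codes]  # e.g. from a speech codec
    return seq  # one token stream an unmodified LLM can model

print(to_sequence([11, 42], [7], [3]))  # [11, 42, 50007, 58195]
```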

Result: AnyGPT successfully facilitates any-to-any multimodal conversations and achieves performance comparable to specialized models across all modalities, demonstrating that discrete representations can effectively unify multiple modalities.

Conclusion: Discrete representations provide an effective and convenient way to unify multiple modalities within a language model, enabling seamless integration of new modalities similar to adding new languages to LLMs.

Abstract: We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages. We build a multimodal text-centric dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, proving that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are shown in https://junzhan2000.github.io/AnyGPT.github.io/

[98] PersonaGym: Evaluating Persona Agents and LLMs

Vinay Samuel, Henry Peng Zou, Yue Zhou, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Ameet Deshpande, Karthik Narasimhan, Vishvak Murahari

Main category: cs.CL

TL;DR: PersonaGym is a new framework for evaluating how faithfully LLM persona agents adhere to their assigned personas, with PersonaScore metric showing that model size doesn’t guarantee better persona consistency.

DetailsMotivation: Evaluating persona agents' faithfulness to their assigned personas is challenging in free-form settings, requiring consistent performance across diverse persona-relevant environments.

Method: Introduced PersonaGym dynamic evaluation framework and PersonaScore metric based on decision theory for comprehensive large-scale evaluation of persona agents.

Result: Evaluation of 10 leading LLMs across 200 personas and 10,000 questions showed that GPT-4.1 had the same PersonaScore as LLaMA-3-8b despite being a more recent and advanced model, indicating that model size and complexity do not guarantee better persona capabilities.

Conclusion: Increased model size and complexity don’t necessarily enhance persona agent performance, highlighting need for algorithmic and architectural innovation for faithful persona agents.

Abstract: Persona agents, which are LLM agents conditioned to act according to an assigned persona, enable contextually rich and user-aligned interactions across domains like education and healthcare. However, evaluating how faithfully these agents adhere to their personas remains a significant challenge, particularly in free-form settings that demand consistency across diverse, persona-relevant environments. We introduce PersonaGym, the first dynamic evaluation framework for persona agents, and PersonaScore, a human-aligned automatic metric grounded in decision theory that enables comprehensive large-scale evaluation. Our evaluation of 10 leading LLMs across 200 personas and 10,000 questions reveals significant advancement opportunities. For example, GPT-4.1 had the exact same PersonaScore as LLaMA-3-8b despite being a more recent and advanced closed-source model. Importantly, increased model size and complexity do not necessarily enhance persona agent capabilities, underscoring the need for algorithmic and architectural innovation toward faithful, performant persona agents.

[99] ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation

Mohammed Khalil, Mohammed Sabry

Main category: cs.CL

TL;DR: The paper introduces ATHAR, a new 66,000-sample Classical Arabic to English translation dataset covering science, culture, and philosophy, addressing the scarcity of such resources and showing current LLMs need this data for better performance.

DetailsMotivation: Classical Arabic literature from the golden age is important for knowledge dissemination, but there's a scarcity of high-quality translation datasets, limiting the development of effective translation systems for this domain.

Method: Created the ATHAR dataset with 66,000 high-quality Classical Arabic to English translation samples across diverse topics, then evaluated state-of-the-art LLMs under various settings to assess their performance.

Result: Current LLMs show limitations in Classical Arabic translation, demonstrating the need for specialized datasets like ATHAR. Models benefit significantly from fine-tuning or incorporating this dataset into pretraining pipelines.

Conclusion: The ATHAR dataset fills a critical gap in Classical Arabic translation resources and enables improved performance of translation systems through fine-tuning or pretraining incorporation, with the dataset being publicly available.

Abstract: Classical Arabic represents a significant era that encompasses the golden age of Arab culture, philosophy, and scientific literature. With a broad consensus on the importance of translating these literatures to enrich knowledge dissemination across communities, the advent of large language models (LLMs) and translation systems offers promising tools to facilitate this goal. However, we have identified a scarcity of translation datasets in Classical Arabic, which are often limited in scope and topics, hindering the development of high-quality translation systems. In response, we present the ATHAR dataset, which comprises 66,000 high-quality classical Arabic to English translation samples that cover a wide array of topics including science, culture, and philosophy. Furthermore, we assess the performance of current state-of-the-art LLMs under various settings, concluding that there is a need for such datasets in current systems. Our findings highlight how models can benefit from fine-tuning or incorporating this dataset into their pretraining pipelines. The dataset is publicly available on the HuggingFace Data Hub: https://huggingface.co/datasets/mohamed-khalil/ATHAR.
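
Since the abstract gives the Hugging Face dataset ID, loading it is a one-liner; the split and column names below are assumptions until checked on the Hub:

```python
# Quick look at the released dataset (ID taken from the abstract).
from datasets import load_dataset

athar = load_dataset("mohamed-khalil/ATHAR")
print(athar)              # expected: ~66,000 Classical Arabic-English pairs
print(athar["train"][0])  # e.g. {"arabic": "...", "english": "..."} (assumed columns)
```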

[100] Caution for the Environment: Multimodal LLM Agents are Susceptible to Environmental Distractions

Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao

Main category: cs.CL

TL;DR: MLLM agents in GUI environments are susceptible to distractions from unrelated environmental content, even with benign users and non-malicious environments.

DetailsMotivation: To investigate whether multimodal GUI agents can be distracted by environmental context and address the faithfulness of MLLM agents in GUI environments.

Method: Evaluated a wide range of MLLMs as GUI agents using a simulated dataset with three working patterns at different perception levels, and implemented adversarial environment injection to analyze improvement approaches.

Result: Even the most powerful MLLMs, including both generalist and specialist GUI agents, were found to be susceptible to environmental distractions.

Conclusion: The study reveals a critical vulnerability in MLLM agents’ faithfulness to task objectives when exposed to environmental context, calling for collective focus on improving agent robustness against distractions.

Abstract: This paper investigates the faithfulness of multimodal large language model (MLLM) agents in a graphical user interface (GUI) environment, aiming to address the research question of whether multimodal GUI agents can be distracted by environmental context. A general scenario is proposed where both the user and the agent are benign, and the environment, while not malicious, contains unrelated content. A wide range of MLLMs are evaluated as GUI agents using a simulated dataset, following three working patterns with different levels of perception. Experimental results reveal that even the most powerful models, whether generalist agents or specialist GUI agents, are susceptible to distractions. While recent studies predominantly focus on the helpfulness of agents, our findings first indicate that these agents are prone to environmental distractions. Furthermore, we implement an adversarial environment injection and analyze the approach to improve faithfulness, calling for a collective focus on this important topic.

[101] Selective Preference Optimization via Token-Level Reward Function Estimation

Kailai Yang, Zhiwei Liu, Qianqian Xie, Jimin Huang, Erxue Min, Sophia Ananiadou

Main category: cs.CL

TL;DR: SePO is a selective token-level alignment method that uses DPO-based token selection with an oracle model to identify key tokens, achieving superior performance while optimizing only 30% of tokens and enabling weak-to-strong generalization.

DetailsMotivation: Existing token-level alignment methods are either noisy and inefficient by optimizing all tokens, or require complex and expensive key token selection strategies.

Method: Proposes Selective Preference Optimization (SePO) with DPO-based token selection using an oracle model to estimate token-level rewards, then selectively trains only on key tokens with a reference model-free contrastive objective.
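
A hedged sketch of the selection step, assuming the oracle's token-level reward takes the usual DPO implicit form (beta times the log-probability ratio against a reference model); SePO's exact scoring may differ:

```python
# Score each response token with a DPO-style implicit reward, then keep
# only the top 30% as "key tokens" for the training loss.
import torch

def select_key_tokens(logp_oracle, logp_ref, beta=0.1, keep=0.3):
    """logp_*: (T,) per-token log-probs of the response under each model."""
    rewards = beta * (logp_oracle - logp_ref)      # DPO implicit token reward
    k = max(1, int(keep * rewards.numel()))
    mask = torch.zeros_like(rewards, dtype=torch.bool)
    mask[torch.topk(rewards, k).indices] = True
    return mask  # only these positions contribute to the objective

mask = select_key_tokens(torch.randn(10), torch.randn(10))
```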

Result: Outperforms competitive baselines by optimizing only 30% key tokens, enables weak-to-strong generalization (16.8x parameter scaling), and effectively handles out-of-distribution data while reducing over-optimization.

Conclusion: SePO provides an efficient and effective selective alignment strategy that reduces computational costs while maintaining or improving performance through intelligent token selection.

Abstract: Recent advancements in large language model alignment leverage token-level supervisions to perform fine-grained preference optimization. However, existing token-level alignment methods either optimize on all available tokens, which can be noisy and inefficient, or perform selective training with complex and expensive key token selection strategies. In this work, we propose Selective Preference Optimization (SePO), a novel selective alignment strategy that centers on efficient key token selection. SePO proposes the first token selection method based on Direct Preference Optimization (DPO), which trains an oracle model to estimate a token-level reward function on the target data. This method applies to any existing alignment datasets with response-level annotations and enables cost-efficient token selection with small-scale oracle models and training data. The estimated reward function is then utilized to score all tokens within the target dataset, where only the key tokens are selected to supervise the target policy model with a reference model-free contrastive objective function. Extensive experiments on three public evaluation benchmarks show that SePO significantly outperforms competitive baseline methods by only optimizing 30% key tokens on the target dataset. SePO applications on weak-to-strong generalization show that weak oracle models effectively supervise strong policy models with up to 16.8x more parameters. SePO also effectively selects key tokens from out-of-distribution data to enhance strong policy models and alleviate the over-optimization problem.

[102] LogicPro: Improving Complex Logical Reasoning via Program-Guided Learning

Jin Jiang, Yuchen Yan, Yang Liu, Jianing Wang, Shuai Peng, Xunliang Cai, Yixin Cao, Mengdi Zhang, Liangcai Gao

Main category: cs.CL

TL;DR: LogicPro is a novel data synthesis method that uses LeetCode-style algorithm problems to generate complex logical reasoning data with golden standard answers and high-quality reasoning processes.

DetailsMotivation: To create scalable, effective reasoning data that is difficult to obtain through traditional methods, leveraging algorithm problems as a rich source of complex logical reasoning content.

Method: Synthesizes reasoning problems from algorithm problems and test cases, obtains standard answers and intermediate variable outputs from Python solutions, then generates text reasoning processes guided by code intermediate variables.
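
One way to picture the intermediate-variable signal: execute a reference solution under a tracer and snapshot its local variables at each line. The tracing approach below is our illustration, not the paper's tooling:

```python
# Capture intermediate variables from a LeetCode-style reference solution;
# these snapshots can then guide a text reasoning trace.
import sys

def run_with_trace(fn, *args):
    steps = []
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            steps.append(dict(frame.f_locals))  # snapshot locals at each line
        return tracer
    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, steps

def max_subarray(nums):                     # example reference solution
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

ans, trace = run_with_trace(max_subarray, [2, -1, 3])  # ans == 4, trace holds locals
```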

Result: Generated 540K synthesized dataset from just 2,360 algorithm problems, achieving significant improvements on multiple reasoning benchmarks including BBH, LogicBench, DROP, AR-LSAT, and GSM8K.

Conclusion: LogicPro effectively synthesizes high-quality complex reasoning data that outperforms existing reasoning datasets, demonstrating the value of leveraging algorithm problems for reasoning data generation.

Abstract: In this paper, we propose a new data synthesis method called LogicPro, which leverages LeetCode-style algorithm problems and their corresponding program solutions to synthesize Complex Logical Reasoning data in text format. First, we synthesize complex reasoning problems from source algorithm problems and test cases. Then, standard answers and intermediate variable outputs are obtained for each problem based on standard Python solutions and test cases. Finally, with the guidance of code intermediate variables, we synthesize the text reasoning process for each reasoning problem. Through this method, we can synthesize data that is difficult, scalable, effective, and comes with golden standard answers and high-quality reasoning processes. As a result, with our 540K synthesized dataset constructed solely from 2,360 algorithm problems, our approach (code and data are publicly available at https://github.com/jiangjin1999/LogicPro) achieves significant improvements in multiple models on the BBH, LogicBench, DROP, AR-LSAT, and GSM8K datasets, among others, outperforming a wide range of existing reasoning datasets.

[103] What fifty-one years of Linguistics and Artificial Intelligence research tell us about their correlation: A scientometric analysis

Mohammed Q. Shormani, Yehia A. AlSohbani

Main category: cs.CL

TL;DR: Scientometric analysis of 51 years (1974-2024) shows that the linguistics-AI correlation has evolved from unstable research output in the 1980s-1990s to rapid growth, reaching 1478 articles in 2023, driven by NLP, ChatGPT, and deep learning models.

DetailsMotivation: To provide a comprehensive scientometric analysis of the correlation between linguistics and artificial intelligence over 51 years, examining the intellectual production landscape and emerging trends.

Method: Used Web of Science Core Collection database, analyzed with CiteSpace and VOSviewer software to generate mapping visualizations of intellectual landscape, trending issues, and emerging hotspots from 1974-2024.

Result: Research was unstable in 1980s-1990s but showed remarkable growth reaching 1478 articles in 2023 and 546 in Q1 2024. Emerging issues include Natural language processing, Cross-sectional study, bidirectional encoder representation, and ChatGPT. Hotspots include Novice programmer, Prioritization, and Artificial intelligence.

Conclusion: Linguistics and AI correlation is established at multiple levels, with research centers, journals, and countries shaping AIL knowledge production and reshaping future frontiers through new applications and powerful deep learning language models like ChatGPT.

Abstract: There is a strong correlation between linguistics and artificial intelligence (AI), best manifested by deep learning language models. This study provides a thorough scientometric analysis of this correlation, synthesizing the intellectual production over 51 years, from 1974 to 2024. The Web of Science Core Collection (WoSCC) database was the data source. The collected data were analyzed with two software packages, CiteSpace and VOSviewer, which generated mapping visualizations of the intellectual landscape, trending issues, and (re)emerging hotspots. The results indicate that in the 1980s and 1990s, linguistics and AI (AIL) research was not robust, characterized by unstable publication over time. It has, however, witnessed a remarkable increase in publications since then, reaching 1478 articles in 2023 and 546 articles in the January-March timespan of 2024, involving emerging issues including Natural language processing, Cross-sectional study, Bidirectional encoder representation, and ChatGPT, and hotspots such as Novice programmer, Prioritization, and Artificial intelligence, addressing new horizons and new topics, and launching new applications and powerful deep learning language models including ChatGPT. The study concludes that the linguistics-AI correlation is established at several levels, with research centers, journals, and countries shaping AIL knowledge production and reshaping its future frontiers.

[104] Assessing the Sensitivity and Alignment of FOL Closeness Metrics

Ramya Keerthy Thatikonda, Wray Buntine, Ehsan Shareghi

Main category: cs.CL

TL;DR: Study evaluates existing metrics for comparing generated vs ground-truth First-Order Logic statements, finding BLEU oversensitive to text changes and Smatch++/FOL metrics sensitive to operator changes, with BertScore aligning best with LLM judgments.

DetailsMotivation: Current tool-augmented LLMs for logical reasoning lack reliable evaluation metrics to verify correctness of generated FOL statements compared to ground-truth, as existing metrics have unverified sensitivity and robustness.

Method: Applied operator and text-based perturbations to ground-truth FOL statements to assess metric sensitivity, then compared metric rankings against LLM-as-a-judge evaluations to measure robustness and alignment.
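
The operator-vs-text sensitivity contrast is easy to reproduce in miniature; the sketch below (assuming sacrebleu is installed) perturbs one operator and one variable name and compares BLEU's reaction:

```python
# BLEU barely reacts to a meaning-flipping operator swap but punishes a
# meaning-preserving variable rename -- the oversensitivity noted above.
import sacrebleu

gold = "forall x (Cat(x) -> Animal(x))"
operator_pert = gold.replace("forall", "exists")  # changes the logic
text_pert = gold.replace("x", "y")                # preserves the logic

print(sacrebleu.sentence_bleu(operator_pert, [gold]).score)  # stays high
print(sacrebleu.sentence_bleu(text_pert, [gold]).score)      # drops sharply
```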

Result: BLEU shows oversensitivity to text perturbations, Smatch++ is affected by structural operator changes, FOL metric is sensitive to specific operator changes. BertScore aligns closest with LLM judgments, and combining metrics improves both robustness and sensitivity.

Conclusion: Semantic evaluation (BertScore) aligns best with LLM judgment, and metric combination outperforms individual metrics for evaluating FOL statement correctness, highlighting the importance of comprehensive evaluation approaches.

Abstract: The recent successful paradigm of solving logical reasoning problems with tool-augmented large language models (LLMs) leverages translation of natural language (NL) statements into First-Order Logic (FOL) and external theorem provers. However, the correctness of FOL statements, comprising operators and text, often goes unverified due to the lack of a reliable evaluation metric for comparing generated and ground-truth FOLs. In this paper, we conduct a comprehensive study on the sensitivity of existing NL-, FOL-, and graph-based metrics to capture differences between a sampled FOL and its corresponding ground-truth. We then measure the alignment between a metric-based ranking of FOL outputs and a strong LLM-as-a-judge. To do this, we first apply operator and text-based perturbations to ground-truth FOL statements to assess metric sensitivity. We then evaluate metric robustness by comparing the metrics against LLM judgment. Our empirical findings highlight a clear oversensitivity in the n-gram metric BLEU for text perturbations. The operator perturbation affects the semantic graph metric Smatch++ for structural changes, and the FOL metric for specific operator changes. We observe a closer alignment between BertScore and LLM judgment, proving the importance of semantic evaluation. Additionally, we show that combining metrics enhances both robustness and sensitivity compared to using individual metrics.

[105] Large Language Models with Temporal Reasoning for Longitudinal Clinical Summarization and Prediction

Maya Kruse, Shiyue Hu, Nicholas Derby, Yifu Wu, Samantha Stonbraker, Bingsheng Yao, Dakuo Wang, Elizabeth Goldberg, Yanjun Gao

Main category: cs.CL

TL;DR: LLMs show promise in clinical text summarization but struggle with long patient trajectories, temporal reasoning, and rare disease prediction despite long context windows and RAG improvements.

DetailsMotivation: To evaluate LLMs' ability to handle long patient trajectories with multi-modal data across time in clinical settings, as this remains underexplored despite recent advances.

Method: Systematic evaluation of open-source LLMs, their RAG variants, and chain-of-thought prompting on long-context clinical summarization and prediction tasks using two public EHR datasets, re-engineering discharge summarization and diagnosis prediction tasks.

Result: Long context windows improve input integration but don’t consistently enhance clinical reasoning; LLMs struggle with temporal progression and rare disease prediction; RAG reduces hallucinations but doesn’t fully address limitations.

Conclusion: The study fills the gap in long clinical text summarization evaluation and establishes a foundation for assessing LLMs with multi-modal data and temporal reasoning capabilities.

Abstract: Recent advances in large language models (LLMs) have shown potential in clinical text summarization, but their ability to handle long patient trajectories with multi-modal data spread across time remains underexplored. This study systematically evaluates several state-of-the-art open-source LLMs, their Retrieval Augmented Generation (RAG) variants, and chain-of-thought (CoT) prompting on long-context clinical summarization and prediction. We examine their ability to synthesize structured and unstructured Electronic Health Records (EHR) data while reasoning over temporal coherence, by re-engineering existing tasks, including discharge summarization and diagnosis prediction from two publicly available EHR datasets. Our results indicate that long context windows improve input integration but do not consistently enhance clinical reasoning, and LLMs still struggle with temporal progression and rare disease prediction. While RAG reduces hallucinations in some cases, it does not fully address these limitations. Our work fills the gap in long clinical text summarization, establishing a foundation for evaluating LLMs with multi-modal data and temporal reasoning.

[106] All That Glitters is Not Novel: Plagiarism in AI Generated Research

Tarun Gupta, Danish Pruthi

Main category: cs.CL

TL;DR: Study finds 24% of LLM-generated research documents are plagiarized from existing work, with automated detectors failing to catch these instances. Experts identified significant similarities despite documents bypassing standard plagiarism checks.

DetailsMotivation: To investigate claims about autonomous research agents generating novel ideas by examining whether LLM-generated research documents actually plagiarize existing work rather than creating truly novel content.

Method: 13 experts evaluated 50 LLM-generated research documents to identify similarities with existing work, with findings cross-verified by original authors. Controlled experiments tested automated plagiarism detectors.

Result: 24% of documents were either paraphrased with one-to-one methodological mapping or significantly borrowed from existing work. The remaining 76% showed varying degrees of similarity, with only a small fraction appearing completely novel. Automated detectors failed to catch the plagiarized ideas.

Conclusion: LLM-generated research requires careful assessment as it often plagiarizes existing work without acknowledgment and bypasses standard plagiarism detection. Academic publishing needs to address these emerging challenges.

Abstract: Automating scientific research is considered the final frontier of science. Recently, several papers claim autonomous research agents can generate novel research ideas. Amidst the prevailing optimism, we document a critical concern: a considerable fraction of such research documents are smartly plagiarized. Unlike past efforts where experts evaluate the novelty and feasibility of research ideas, we request 13 experts to operate under a different situational logic: to identify similarities between LLM-generated research documents and existing work. Concerningly, the experts identify 24% of the 50 evaluated research documents to be either paraphrased (with one-to-one methodological mapping), or significantly borrowed from existing work. These reported instances are cross-verified by authors of the source papers. The remaining 76% of documents show varying degrees of similarity with existing work, with only a small fraction appearing completely novel. Problematically, these LLM-generated research documents do not acknowledge original sources, and bypass inbuilt plagiarism detectors. Lastly, through controlled experiments we show that automated plagiarism detectors are inadequate at catching plagiarized ideas from such systems. We recommend a careful assessment of LLM-generated research, and discuss the implications of our findings on academic publishing.

[107] Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad

Ivo Petrov, Jasper Dekoninck, Lyuben Baltadzhiev, Maria Drencheva, Kristian Minchev, Mislav Balunović, Nikola Jovanović, Martin Vechev

Main category: cs.CL

TL;DR: Current LLMs perform well on math benchmarks that only evaluate final answers, but struggle significantly with rigorous reasoning and proof generation when evaluated on full-solution mathematical problems from USAMO 2025.

DetailsMotivation: Existing math benchmarks evaluate models based solely on final numerical answers, neglecting rigorous reasoning and proof generation which are essential for real-world mathematical tasks.

Method: Used expert human annotators to evaluate state-of-the-art reasoning models on six problems from the 2025 USAMO within hours of their release, analyzing full-solution reasoning and identifying failure modes.

Result: All tested models struggled significantly: only Gemini-2.5-Pro achieved a non-trivial score of 25%, while the others scored less than 5%. Analysis revealed common failure modes and unwanted artifacts arising from training optimization strategies.

Conclusion: Current LLMs are inadequate for rigorous mathematical reasoning tasks, highlighting the need for substantial improvements in reasoning and proof generation capabilities.

Abstract: Recent math benchmarks for large language models (LLMs) such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, Gemini-2.5-Pro, achieving scores comparable to top human competitors. However, these benchmarks evaluate models solely based on final numerical answers, neglecting rigorous reasoning and proof generation which are essential for real-world mathematical tasks. To address this, we introduce a comprehensive evaluation of full-solution reasoning for challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems from the 2025 USAMO within hours of their release. Our results reveal that all tested models struggled significantly: only Gemini-2.5-Pro achieves a non-trivial score of 25%, while all other models achieve less than 5%. Through detailed analysis of reasoning traces, we identify the most common failure modes and find several unwanted artifacts arising from the optimization strategies employed during model training. Overall, our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks, highlighting the need for substantial improvements in reasoning and proof generation capabilities.

[108] StereoDetect: Detecting Stereotypes and Anti-stereotypes the Correct Way Using Social Psychological Underpinnings

Kaustubh Shivshankar Shejole, Pushpak Bhattacharyya

Main category: cs.CL

TL;DR: This paper introduces StereoDetect, a new benchmark dataset for stereotype and anti-stereotype detection, addressing shortcomings in existing benchmarks and providing clear definitions to distinguish stereotypes from stereotypical biases.

DetailsMotivation: Current research fails to clearly distinguish between stereotypes and stereotypical biases, slowing progress in stereotype detection. The task requires social knowledge and is one of the most difficult areas in Responsible AI.

Method: Proposed a five-tuple definition with precise terminologies, developed a conceptual framework grounded in social psychology, and created StereoDetect - a curated benchmark dataset aligned with their definitions.

Result: Sub-10B language models and GPT-4o frequently misclassify anti-stereotypes and fail to recognize neutral overgeneralizations. StereoDetect demonstrates effectiveness through qualitative and quantitative comparisons with existing benchmarks.

Conclusion: The work provides clear definitions and a reliable benchmark (StereoDetect) that addresses key shortcomings in existing stereotype detection research, enabling more accurate detection of stereotypes and anti-stereotypes.

Abstract: Stereotypes are known to have very harmful effects, making their detection critically important. However, current research predominantly focuses on detecting and evaluating stereotypical biases, thereby leaving the study of stereotypes in its early stages. Our study revealed that many works have failed to clearly distinguish between stereotypes and stereotypical biases, which has significantly slowed progress in advancing research in this area. Stereotype and anti-stereotype detection is a problem that requires social knowledge; hence, it is one of the most difficult areas in Responsible AI. This work investigates this task, where we propose a five-tuple definition and provide precise terminologies disentangling stereotypes, anti-stereotypes, stereotypical bias, and general bias. We provide a conceptual framework grounded in social psychology for reliable detection. We identify key shortcomings in existing benchmarks for this task of stereotype and anti-stereotype detection. To address these gaps, we developed StereoDetect, a well-curated, definition-aligned benchmark dataset designed for this task. We show that sub-10B language models and GPT-4o frequently misclassify anti-stereotypes and fail to recognize neutral overgeneralizations. We demonstrate StereoDetect’s effectiveness through multiple qualitative and quantitative comparisons with existing benchmarks and models fine-tuned on them. The dataset and code are available at https://github.com/KaustubhShejole/StereoDetect.

[109] Can LLMs Simulate Personas with Reversed Performance? A Benchmark for Counterfactual Instruction Following

Sai Adith Senthil Kumar, Hao Yan, Saipavan Perepa, Murong Yue, Ziyu Yao

Main category: cs.CL

TL;DR: LLMs struggle to simulate personas with reversed performance (e.g., low-proficiency students), limiting simulation diversity and practical applications.

DetailsMotivation: Current state-of-the-art LLMs cannot effectively simulate personas with counterfactual performance levels, which restricts the diversity and utility of simulated environments.

Method: Proposed a benchmark dataset for evaluating LLMs on counterfactual instruction following, using mathematical reasoning as a representative scenario. Evaluated both open-weight and closed-source LLMs including OpenAI o1.

Result: All tested LLMs struggled to follow counterfactual instructions for simulating reversely performing personas. Intersectional simulation of performance level and race population worsened the effect.

Conclusion: The study highlights significant challenges in counterfactual instruction following and demonstrates the need for further research to improve LLMs’ ability to simulate diverse personas.

Abstract: Large Language Models (LLMs) are now increasingly widely used to simulate personas in virtual environments, leveraging their instruction-following capability. However, we discovered that even state-of-the-art LLMs cannot simulate personas with reversed performance (e.g., student personas with low proficiency in educational settings), which impairs the simulation diversity and limits the practical applications of the simulated environments. In this work, using mathematical reasoning as a representative scenario, we propose the first benchmark dataset for evaluating LLMs on simulating personas with reversed performance, a capability that we dub “counterfactual instruction following”. We evaluate both open-weight and closed-source LLMs on this task and find that LLMs, including the OpenAI o1 reasoning model, all struggle to follow counterfactual instructions for simulating reversedly performing personas. Intersectionally simulating both the performance level and the race population of a persona worsens the effect even further. These results highlight the challenges of counterfactual instruction following and the need for further research.

[110] ViClaim: A Multilingual Multilabel Dataset for Automatic Claim Detection in Videos

Patrick Giedemann, Pius von Däniken, Jan Deriu, Alvaro Rodrigo, Anselmo Peñas, Mark Cieliebak

Main category: cs.CL

TL;DR: ViClaim dataset with 1,798 multilingual video transcripts annotated for claim detection, showing strong cross-validation performance but challenges with topic generalization.

DetailsMotivation: Address the gap in misinformation detection for video content, particularly spoken text in transcripts, as existing methods focus mainly on written text.

Method: Created multilingual dataset (English, German, Spanish) across 6 topics with sentence-level annotations (fact-check-worthy, non-check-worthy, opinion). Developed custom annotation tool and tested state-of-the-art multilingual language models.

Result: Models achieved strong cross-validation performance (macro F1 up to 0.896) but struggled to generalize to unseen topics, especially in distinct domains.

Conclusion: Claim detection in video transcripts is complex. ViClaim provides a foundation for advancing video misinformation detection and addresses critical gaps in multimodal analysis.

Abstract: The growing influence of video content as a medium for communication and misinformation underscores the urgent need for effective tools to analyze claims in multilingual and multi-topic settings. Existing efforts in misinformation detection largely focus on written text, leaving a significant gap in addressing the complexity of spoken text in video transcripts. We introduce ViClaim, a dataset of 1,798 annotated video transcripts across three languages (English, German, Spanish) and six topics. Each sentence in the transcripts is labeled with three claim-related categories: fact-check-worthy, fact-non-check-worthy, or opinion. We developed a custom annotation tool to facilitate the highly complex annotation process. Experiments with state-of-the-art multilingual language models demonstrate strong performance in cross-validation (macro F1 up to 0.896) but reveal challenges in generalization to unseen topics, particularly for distinct domains. Our findings highlight the complexity of claim detection in video transcripts. ViClaim offers a robust foundation for advancing misinformation detection in video-based communication, addressing a critical gap in multimodal analysis.

[111] First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay

Andrew Zhu, Evan Osgood, Chris Callison-Burch

Main category: cs.CL

TL;DR: Overhearing agents are LLM agents that passively listen to human conversations and provide background assistance, demonstrated through Dungeons & Dragons gameplay using multimodal audio-language models.

DetailsMotivation: To explore an alternative paradigm for LLM agents that don't actively participate in conversations but instead assist by listening to human-to-human interactions.

Method: Used large multimodal audio-language models as overhearing agents in Dungeons & Dragons gameplay, conducting human evaluation to assess their helpfulness.

Result: Found that some large audio-language models have emergent ability to perform overhearing agent tasks using implicit audio cues from conversations.

Conclusion: The overhearing agents paradigm shows promise for passive AI assistance, with released Python libraries and code to support further research in this area.

Abstract: Much work has been done on conversational LLM agents which directly assist human users with tasks. We present an alternative paradigm for interacting with LLM agents, which we call “overhearing agents”. These overhearing agents do not actively participate in conversation – instead, they “listen in” on human-to-human conversations and perform background tasks or provide suggestions to assist the user. In this work, we explore the overhearing agents paradigm through the lens of Dungeons & Dragons gameplay. We present an in-depth study using large multimodal audio-language models as overhearing agents to assist a Dungeon Master. We perform a human evaluation to examine the helpfulness of such agents and find that some large audio-language models have the emergent ability to perform overhearing agent tasks using implicit audio cues. Finally, we release Python libraries and our project code to support further research into the overhearing agents paradigm at https://github.com/zhudotexe/overhearing_agents.

[112] Text2Cypher Across Languages: Evaluating and Finetuning LLMs

Makbule Gulcin Ozsoy, William Tai

Main category: cs.CL

TL;DR: Multilingual evaluation of Text2Cypher models shows performance gaps across languages, with English highest, Spanish middle, Turkish lowest. Prompt translation has minimal impact, while multilingual finetuning reduces performance disparities compared to English-only finetuning.

DetailsMotivation: Most Text2SQL/SPARQL/Cypher research focuses on English, with limited evaluation in other languages. The paper aims to investigate LLM performance on Text2Cypher across multiple languages to address this gap and promote more inclusive database query systems.

Method: Created multilingual dataset by translating English questions to Spanish and Turkish while preserving original Cypher queries. Evaluated foundational and finetuned LLMs using standardized prompts and metrics. Examined prompt translation impact and compared English-only vs multilingual finetuning approaches.

Result: Consistent performance pattern: English > Spanish > Turkish. Prompt translation showed little to no change in metrics. English-only finetuning improved overall accuracy but widened language performance gap. Multilingual finetuning narrowed the gap and produced more balanced performance across languages.

Conclusion: Multilingual evaluation and training are crucial for building inclusive and robust query generation systems. Performance disparities exist due to training data availability and linguistic features, but multilingual approaches can help mitigate these gaps.

Abstract: Recent advances in large language models (LLMs) have enabled natural language interfaces that translate user questions into database queries, such as Text2SQL, Text2SPARQL, and Text2Cypher. While these interfaces enhance database accessibility, most research today focuses on English, with limited evaluation in other languages. This paper investigates the performance of both foundational and finetuned LLMs on the Text2Cypher task across multiple languages. We create and release a multilingual dataset by translating English questions into Spanish and Turkish while preserving the original Cypher queries, enabling fair cross-lingual comparison. Using standardized prompts and metrics, we evaluate several foundational models and observe a consistent performance pattern: highest on English, followed by Spanish, and lowest on Turkish. We attribute this to differences in training data availability and linguistic features. We also examine the impact of translating task prompts into Spanish and Turkish. Results show little to no change in evaluation metrics, suggesting prompt translation has minor impact. Furthermore, we finetune a foundational model on two datasets: one in English only, and one multilingual. Finetuning on English improves overall accuracy but widens the performance gap between languages. In contrast, multilingual finetuning narrows the gap, resulting in more balanced performance. Our findings highlight the importance of multilingual evaluation and training to build more inclusive and robust query generation systems.
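
To make the evaluation setup concrete, here is a minimal sketch of the cross-lingual comparison: the gold Cypher query stays fixed while the question is swapped across languages. `generate_cypher`, the schema string, and the example are hypothetical stand-ins, and exact match is only one of several plausible metrics.

```python
# Minimal sketch of the cross-lingual evaluation loop described above.
# `generate_cypher` is a hypothetical stand-in for any Text2Cypher model.

def generate_cypher(question: str, schema: str) -> str:
    # Replace with a real model call (foundational or finetuned LLM).
    return "MATCH (n) RETURN n"

def normalize(q: str) -> str:
    return " ".join(q.strip().rstrip(";").split()).lower()

# One gold Cypher query paired with translations of the same question.
example = {
    "cypher": "MATCH (m:Movie) RETURN m.title",
    "questions": {
        "en": "List all movie titles.",
        "es": "Enumera todos los títulos de películas.",
        "tr": "Tüm film adlarını listele.",
    },
}

schema = "(:Movie {title: STRING})"
for lang, question in example["questions"].items():
    pred = generate_cypher(question, schema)
    exact = normalize(pred) == normalize(example["cypher"])
    print(f"{lang}: exact_match={exact}")
```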

[113] HuggingGraph: Understanding the Supply Chain of LLM Ecosystem

Mohammad Shahedur Rahman, Peng Gao, Yuede Ji

Main category: cs.CL

TL;DR: This paper analyzes the supply chain relationships between large language models (LLMs) and datasets to identify inherited vulnerabilities, biases, and malicious components.

DetailsMotivation: LLMs inherit vulnerabilities, biases, and malicious components from base models and external datasets, making it critical to understand their origin and development process for risk detection, fairness improvement, and regulatory compliance.

Method: The researchers designed a methodology to systematically collect LLM supply chain information and created a directed heterogeneous graph with 402,654 nodes and 462,524 edges to model relationships between models and datasets.

Result: The study produced a comprehensive graph representation of LLM supply chains and yielded multiple interesting findings through different types of analysis.

Conclusion: Understanding model-dataset relationships in the LLM supply chain is essential for identifying risks, improving fairness, and ensuring regulatory compliance in language model development.

Abstract: Large language models (LLMs) leverage deep learning architectures to process and predict sequences of words, enabling them to perform a wide range of natural language processing tasks, such as translation, summarization, question answering, and content generation. As existing LLMs are often built from base models or other pre-trained models and use external datasets, they can inevitably inherit vulnerabilities, biases, or malicious components that exist in previous models or datasets. Therefore, it is critical to understand these components’ origin and development process to detect potential risks, improve model fairness, and ensure compliance with regulatory frameworks. Motivated by this, we study the relationships between models and datasets, which are the central parts of the LLM supply chain. First, we design a methodology to systematically collect LLMs’ supply chain information. Then, we model the relationships between models and datasets as a directed heterogeneous graph with 402,654 nodes and 462,524 edges. Lastly, we perform different types of analysis on this graph and report multiple findings.
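
A toy sketch of the underlying data structure, a directed heterogeneous graph over models and datasets; the node names and edge relations below are illustrative, not taken from the paper.

```python
# Toy sketch of a directed heterogeneous supply-chain graph
# (node and edge names here are illustrative, not from the paper).
import networkx as nx

G = nx.DiGraph()
G.add_node("base-model", kind="model")
G.add_node("finetuned-model", kind="model")
G.add_node("pretrain-corpus", kind="dataset")
G.add_node("sft-dataset", kind="dataset")

G.add_edge("pretrain-corpus", "base-model", relation="trained_on")
G.add_edge("base-model", "finetuned-model", relation="finetuned_from")
G.add_edge("sft-dataset", "finetuned-model", relation="trained_on")

# Provenance query: everything a model transitively depends on.
print(nx.ancestors(G, "finetuned-model"))
```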

[114] Arg-LLaDA: Argument Summarization via Large Language Diffusion Models and Sufficiency-Aware Refinement

Hao Li, Yizheng Sun, Viktor Schlegel, Kailai Yang, Riza Batista-Navarro, Goran Nenadic

Main category: cs.CL

TL;DR: Arg-LLaDA is a novel LLM diffusion framework that iteratively improves argument summaries through sufficiency-guided remasking and regeneration, outperforming state-of-the-art methods.

DetailsMotivation: Argument summarization generation stage remains underexplored, with existing single-pass approaches offering limited support for factual correction or structural refinement.

Method: Combines flexible masking controller with sufficiency-checking module to iteratively identify and revise unsupported, redundant, or incomplete spans in summaries.

Result: Surpasses state-of-the-art baselines in 7/10 automatic metrics and shows substantial human evaluation improvements in coverage, faithfulness, and conciseness.

Conclusion: The iterative, sufficiency-aware generation strategy effectively produces more faithful, concise, and coherent argument summaries.

Abstract: Argument summarization aims to generate concise, structured representations of complex, multi-perspective debates. While recent work has advanced the identification and clustering of argumentative components, the generation stage remains underexplored. Existing approaches typically rely on single-pass generation, offering limited support for factual correction or structural refinement. To address this gap, we introduce Arg-LLaDA, a novel large language diffusion framework that iteratively improves summaries via sufficiency-guided remasking and regeneration. Our method combines a flexible masking controller with a sufficiency-checking module to identify and revise unsupported, redundant, or incomplete spans, yielding more faithful, concise, and coherent outputs. Empirical results on two benchmark datasets demonstrate that Arg-LLaDA surpasses state-of-the-art baselines in 7 out of 10 automatic evaluation metrics. In addition, human evaluations reveal substantial improvements across core dimensions (coverage, faithfulness, and conciseness), validating the effectiveness of our iterative, sufficiency-aware generation strategy.
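
The iterative remask-and-regenerate loop can be sketched as a small control skeleton; the three callables are hypothetical stand-ins for the paper's sufficiency checker, masking controller, and diffusion-based generator.

```python
# Skeleton of sufficiency-guided remasking and regeneration; the three
# callables are hypothetical stand-ins for the diffusion LM components.

def check_sufficiency(summary, evidence):
    # Return spans judged unsupported, redundant, or incomplete.
    return []  # stub

def remask(summary, spans):
    return summary  # stub: replace flagged spans with mask tokens

def regenerate(masked_summary, evidence):
    return masked_summary  # stub: diffusion LM fills the masked spans

def refine_summary(summary, evidence, max_rounds=3):
    for _ in range(max_rounds):
        spans = check_sufficiency(summary, evidence)
        if not spans:
            break  # nothing left to revise
        summary = regenerate(remask(summary, spans), evidence)
    return summary

print(refine_summary("draft summary", ["argument 1", "argument 2"]))
```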

[115] Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey

Main category: cs.CL

TL;DR: Researchers identify persona vectors in LLM activation space that control traits like evil, sycophancy, and hallucination, enabling monitoring and control of personality shifts during training and deployment.

DetailsMotivation: Large language models sometimes deviate from their intended helpful, harmless, and honest Assistant persona, requiring methods to monitor and control these undesirable personality traits.

Method: Extract automated persona vectors from model activation space using natural-language descriptions, apply these vectors to predict personality shifts during training, and develop post-hoc interventions and preventative steering methods.

Result: Persona vectors strongly correlate with both intended and unintended personality changes after finetuning, and can be used to mitigate shifts or flag problematic training data at dataset and individual sample levels.

Conclusion: Persona vectors provide an automated, scalable approach to monitor and control LLM personality traits, enabling better alignment with desired Assistant behaviors throughout training and deployment.

Abstract: Large language models interact with users through a simulated ‘Assistant’ persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model’s activation space-persona vectors-underlying several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant’s personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.
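
The paper's extraction pipeline is automated from natural-language trait descriptions; a common recipe for obtaining such a direction, shown below as a sketch on synthetic activations, is the difference of mean activations between trait-eliciting and neutral contexts, which can then serve both monitoring (projection) and steering (addition).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden size (toy)

# Stand-ins for residual-stream activations collected while the model
# exhibits vs. does not exhibit a trait (e.g., sycophancy).
acts_trait = rng.normal(0.5, 1.0, size=(200, d))
acts_neutral = rng.normal(0.0, 1.0, size=(200, d))

# Persona vector as a difference of means, normalized to unit length.
v = acts_trait.mean(axis=0) - acts_neutral.mean(axis=0)
v /= np.linalg.norm(v)

# Monitoring: project new activations onto the vector.
h = rng.normal(0.0, 1.0, size=d)
score = h @ v

# Steering: shift the activation along (or against) the direction.
alpha = -2.0  # negative alpha pushes away from the trait
h_steered = h + alpha * v
print(round(score, 3), round(h_steered @ v, 3))
```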

[116] Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD

Bryan Chen Zhengyu Tan, Daniel Wai Kit Chin, Zhengyuan Liu, Nancy F. Chen, Roy Ka-Wei Lee

Main category: cs.CL

TL;DR: DuET-PD framework evaluates LLM trust in persuasive dialogues, revealing GPT-4o’s 27.32% accuracy under misleading persuasion and increasing sycophancy in newer models. Holistic DPO training improves robustness to misinformation and receptiveness to corrections.

DetailsMotivation: LLMs struggle to balance gullibility to misinformation and resistance to valid corrections in persuasive dialogues, which is critical for reliable deployment.

Method: Introduces DuET-PD framework evaluating multi-turn stance-change across persuasion type (corrective/misleading) and domain (knowledge/safety). Proposes Holistic DPO training approach balancing positive and negative persuasion examples.

Result: GPT-4o achieves only 27.32% accuracy in MMLU-Pro under sustained misleading persuasion. Holistic DPO improves Llama-3.1-8B-Instruct’s accuracy from 4.21% to 76.54% under misleading persuasion in safety contexts.

Conclusion: The contributions offer a pathway to developing more reliable and adaptable LLMs for multi-turn dialogue, addressing critical trust issues in persuasive interactions.

Abstract: Large Language Models (LLMs) can struggle to balance gullibility to misinformation and resistance to valid corrections in persuasive dialogues, a critical challenge for reliable deployment. We introduce DuET-PD (Dual Evaluation for Trust in Persuasive Dialogues), a framework evaluating multi-turn stance-change dynamics across dual dimensions: persuasion type (corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves only 27.32% accuracy in MMLU-Pro under sustained misleading persuasions. Moreover, results reveal a concerning trend of increasing sycophancy in newer open-source models. To address this, we introduce Holistic DPO, a training approach balancing positive and negative persuasion examples. Unlike prompting or resist-only training, Holistic DPO enhances both robustness to misinformation and receptiveness to corrections, improving Llama-3.1-8B-Instruct’s accuracy under misleading persuasion in safety contexts from 4.21% to 76.54%. These contributions offer a pathway to developing more reliable and adaptable LLMs for multi-turn dialogue. Code is available at https://github.com/Social-AI-Studio/DuET-PD.
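
Holistic DPO concerns the composition of the preference data (balancing accept-correction and resist-misinformation pairs); the objective itself is standard DPO. A toy sketch of that loss on precomputed sequence log-probabilities:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO objective; "Holistic" refers to the preference data mix
    # (both accept-correction and resist-misinformation pairs), not the loss.
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Toy sequence log-probabilities under the policy and a frozen reference.
logp_c = torch.tensor([-12.3, -9.8])
logp_r = torch.tensor([-11.1, -13.0])
ref_c = torch.tensor([-12.0, -10.5])
ref_r = torch.tensor([-10.9, -12.2])
print(dpo_loss(logp_c, logp_r, ref_c, ref_r))
```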

[117] Social Bias in Multilingual Language Models: A Survey

Lance Calvin Lim Gamboa, Yue Feng, Mark Lee

Main category: cs.CL

TL;DR: Systematic review of multilingual bias in NLP models, examining evaluation and mitigation approaches across languages and cultures, identifying methodological gaps and proposing future research directions.

DetailsMotivation: Pretrained multilingual models exhibit the same social biases as English models, but research on bias evaluation and mitigation in multilingual contexts is fragmented and lacks comprehensive analysis.

Method: Systematic review of emerging research on multilingual bias, analyzing studies with respect to linguistic diversity, cultural awareness, evaluation metrics, and mitigation techniques across different languages.

Result: Identified gaps in methodological design choices (language preferences, scarce multilingual mitigation experiments) and cataloged common issues and solutions in adapting bias benchmarks across languages and cultures.

Conclusion: Proposes future research directions to enhance inclusivity, cross-cultural appropriateness, and alignment with state-of-the-art NLP advancements in multilingual bias literature.

Abstract: Pretrained multilingual models exhibit the same social bias as models processing English texts. This systematic review analyzes emerging research that extends bias evaluation and mitigation approaches into multilingual and non-English contexts. We examine these studies with respect to linguistic diversity, cultural awareness, and their choice of evaluation metrics and mitigation techniques. Our survey illuminates gaps in the field’s dominant methodological design choices (e.g., preference for certain languages, scarcity of multilingual mitigation experiments) while cataloging common issues encountered and solutions implemented in adapting bias benchmarks across languages and cultures. Drawing from the implications of our findings, we chart directions for future research that can reinforce the multilingual bias literature’s inclusivity, cross-cultural appropriateness, and alignment with state-of-the-art NLP advancements.

[118] MultiStream-LLM: Bridging Modalities for Robust Sign Language Translation

Marshall Thomas, Edward Fish, Richard Bowden

Main category: cs.CL

TL;DR: MultiStream-LLM introduces a modular framework for sign language translation that uses separate specialized predictors for continuous signing, fingerspelling, and lipreading, achieving state-of-the-art results by fusing these streams before LLM processing.

DetailsMotivation: Existing monolithic end-to-end SLT models fail at precise fingerspelling recognition and integration of non-manual facial cues, leading to poor performance on translating names, places, and technical terms.

Method: Separate expert networks decode continuous signing, fingerspelling, and lipreading into token sequences. A lightweight transformer fuses these parallel streams to resolve temporal misalignments, then passes to an LLM for final sentence generation.

Result: Achieves new SOTA on How2Sign benchmark with BLEU-4 score of 23.5 and 73.2% letter accuracy on ChicagoFSWildPlus fingerspelling dataset.

Conclusion: Isolating and solving distinct recognition tasks before fusion provides a more powerful pathway to robust, high-fidelity sign language translation compared to monolithic approaches.

Abstract: Despite progress in gloss-free Sign Language Translation (SLT), monolithic end-to-end models consistently fail on two critical components of natural signing: the precise recognition of high-speed fingerspelling and the integration of asynchronous non-manual cues from the face. Recent progress in Automated Sign Language Translation with Large Language Models has sidestepped this challenge, forcing a single network to learn these simultaneously, resulting in poor performance when tasked with translating crucial information such as names, places, and technical terms. We introduce MultiStream-LLM, a modular framework designed to overcome these limitations. Our approach employs separate, specialized predictors for continuous signing, fingerspelling, and lipreading. Each expert network first decodes its specific modality into a sequence of tokens. These parallel streams are then fused by a lightweight transformer that resolves temporal misalignments before passing the combined representation to a Large Language Model (LLM) for final sentence generation. Our method establishes a new state-of-the-art on the How2Sign benchmark with a BLEU-4 score of 23.5 and achieves 73.2% letter accuracy on the challenging ChicagoFSWildPlus fingerspelling dataset. These results validate our core hypothesis: by isolating and solving distinct recognition tasks before fusion, our multi-expert approach provides a more powerful and effective pathway to robust, high-fidelity sign language translation.

[119] TECP: Token-Entropy Conformal Prediction for LLMs

Beining Xu, Yongming Lu

Main category: cs.CL

TL;DR: TECP is a novel framework that uses token-level entropy as an uncertainty measure in conformal prediction to provide formal coverage guarantees for black-box language generation models.

DetailsMotivation: Uncertainty quantification for open-ended language generation remains challenging, especially under black-box constraints where internal model signals are inaccessible.

Method: Token-Entropy Conformal Prediction (TECP) leverages token-level entropy as a logit-free, reference-free uncertainty measure and integrates it into a split conformal prediction pipeline to construct prediction sets with formal coverage guarantees.

Result: Empirical evaluations across six large language models and two benchmarks demonstrate that TECP consistently achieves reliable coverage and compact prediction sets, outperforming prior self-consistency-based UQ methods.

Conclusion: TECP provides a principled and efficient solution for trustworthy generation in black-box LLM settings, offering provable error control without requiring white-box access or semantic consistency heuristics.

Abstract: Uncertainty quantification (UQ) for open-ended language generation remains a critical yet underexplored challenge, especially under black-box constraints where internal model signals are inaccessible. In this paper, we introduce Token-Entropy Conformal Prediction (TECP), a novel framework that leverages token-level entropy as a logit-free, reference-free uncertainty measure and integrates it into a split conformal prediction (CP) pipeline to construct prediction sets with formal coverage guarantees. Unlike existing approaches that rely on semantic consistency heuristics or white-box features, TECP directly estimates epistemic uncertainty from the token entropy structure of sampled generations and calibrates uncertainty thresholds via CP quantiles to ensure provable error control. Empirical evaluations across six large language models and two benchmarks (CoQA and TriviaQA) demonstrate that TECP consistently achieves reliable coverage and compact prediction sets, outperforming prior self-consistency-based UQ methods. Our method provides a principled and efficient solution for trustworthy generation in black-box LLM settings.
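
The split conformal step is compact enough to sketch: token entropy plays the role of the nonconformity score, and the calibration quantile (with the usual finite-sample correction) gives the coverage guarantee. The gamma-distributed scores below are synthetic placeholders for real calibration data.

```python
import numpy as np

def mean_token_entropy(token_probs):
    # token_probs: list of per-step probability vectors for one generation.
    ents = [-(p * np.log(p + 1e-12)).sum() for p in token_probs]
    return float(np.mean(ents))

rng = np.random.default_rng(0)

# Calibration scores: entropies of generations with known-correct answers
# (synthetic values here stand in for real calibration data).
cal_scores = rng.gamma(2.0, 1.0, size=500)

alpha = 0.1  # target miscoverage
n = len(cal_scores)
# Split-CP quantile with the finite-sample correction.
q = np.quantile(cal_scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# At test time, keep sampled generations whose entropy falls below q.
test_scores = rng.gamma(2.0, 1.0, size=10)
prediction_set = [i for i, s in enumerate(test_scores) if s <= q]
print(q, prediction_set)
```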

[120] MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages

Dan Saattrup Smart

Main category: cs.CL

TL;DR: MultiWikiQA is a new multilingual reading comprehension dataset covering 306 languages, with questions generated by LLMs from Wikipedia articles and human-evaluated for quality.

DetailsMotivation: To create a comprehensive multilingual reading comprehension benchmark that spans a wide range of languages (306) using Wikipedia as source material, addressing the need for diverse language evaluation in NLP.

Method: Used Wikipedia articles as context data, generated questions using a large language model (LLM), and conducted crowdsourced human evaluation of question fluency across 30 languages to ensure quality.

Result: Human evaluation showed the generated questions are of good quality. Evaluation of 6 different language models (both decoder and encoder models of varying sizes) revealed the benchmark is sufficiently difficult with large performance discrepancies across languages.

Conclusion: MultiWikiQA provides a challenging and high-quality multilingual reading comprehension dataset that highlights significant performance variations across languages, serving as a valuable resource for evaluating multilingual NLP models.

Abstract: We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which covers 306 languages. The context data comes from Wikipedia articles, with questions generated by an LLM and the answers appearing verbatim in the Wikipedia articles. We conduct a crowdsourced human evaluation of the fluency of the generated questions across 30 of the languages, providing evidence that the questions are of good quality. We evaluate 6 different language models, both decoder and encoder models of varying sizes, showing that the benchmark is sufficiently difficult and that there is a large performance discrepancy amongst the languages. The dataset and survey evaluations are freely available.
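
The verbatim-answer constraint suggests a simple validity filter, sketched here on a toy article; the generation step itself (an LLM writing the questions) is omitted.

```python
# Sketch of the verbatim-answer constraint: keep only generated QA pairs
# whose answer string appears exactly in the source article.
def keep_verbatim(qa_pairs, article: str):
    return [qa for qa in qa_pairs if qa["answer"] in article]

article = "Copenhagen is the capital of Denmark. It lies on Zealand."
qa_pairs = [
    {"question": "What is the capital of Denmark?", "answer": "Copenhagen"},
    {"question": "Where does the city lie?", "answer": "Jutland"},  # dropped
]
print(keep_verbatim(qa_pairs, article))
```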

cs.CV

[121] Facial Emotion Recognition does not detect feeling unsafe in automated driving

Abel van Elburg, Konstantinos Gkentsidis, Mathieu Sarrazin, Sarah Barendswaard, Varun Kotian, Riender Happee

Main category: cs.CV

TL;DR: Study examines perceived risk in automated vehicles using driving simulator with different driving styles and pedestrian interactions. Facial expression analysis proved unreliable for risk assessment, but neural network using motion and physiological data effectively predicted perceived risk.

DetailsMotivation: To understand factors affecting public acceptance of automated vehicles by examining how different driving styles and critical interactions influence perceived risk and safety.

Method: Driving simulator experiment with 32 participants, testing two automated driving styles (calm vs dynamic) with optional pedestrian crossing. Collected continuous subjective ratings, motion data, facial expressions, skin conductance, heart rate, and eye tracking.

Result: Dynamic driving style caused stronger discomfort; pedestrian crossing doubled comfort decrement with dynamic style. Facial expression analysis unreliable (only 9/24 showed reactions, mostly Happy not Fear). Neural network using motion and skin conductance successfully predicted perceived risk.

Conclusion: Facial expression recognition is not reliable for assessing perceived risk in automated vehicles, but objective measures like vehicle motion and physiological data can effectively predict risk perception, reducing subjective bias.

Abstract: Trust and perceived safety play a crucial role in the public acceptance of automated vehicles. To understand perceived risk, an experiment was conducted using a driving simulator under two automated driving styles and optionally introducing a crossing pedestrian. Data was collected from 32 participants, consisting of continuous subjective comfort ratings, motion, webcam footage for facial expression, skin conductance, heart rate, and eye tracking. The continuous subjective perceived risk ratings showed significant discomfort associated with perceived risk during cornering and braking, followed by relief or even positive comfort on continuing the ride. The dynamic driving style induced stronger discomfort than the calm driving style. The crossing pedestrian did not affect discomfort with the calm driving style but doubled the comfort decrement with the dynamic driving style. This illustrates the importance of the consequences of critical interactions in risk perception. Facial expression was successfully analyzed for 24 participants, but most (15/24) did not show any detectable facial reaction to the critical event. Among the 9 participants who did, 8 showed a Happy expression, and only 4 showed a Surprise expression. Fear was never dominant. This indicates that facial expression recognition is not a reliable method for assessing perceived risk in automated vehicles. To predict perceived risk, a neural network model was implemented using vehicle motion and skin conductance. The model correlated well with reported perceived risk, demonstrating its potential for objective perceived risk assessment in automated vehicles, reducing subjective bias and highlighting areas for future research.

[122] PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting

Linqing Wang, Ximing Xing, Yiji Cheng, Zhiyuan Zhao, Jiale Tao, Qixun Wang, Ruihuang Li, Xin Li, Mingrui Wu, Xinchi Deng, Chunyu Wang, Qinglin Lu

Main category: cs.CV

TL;DR: PromptEnhancer is a universal prompt rewriting framework that improves text-to-image model performance by using reinforcement learning to optimize prompts based on fine-grained alignment evaluation.

DetailsMotivation: Text-to-image diffusion models often fail to accurately render complex user prompts, leading to mismatches between user intent and generated images, particularly in areas like attribute binding, negation, and compositional relationships.

Method: A Chain-of-Thought rewriter trained through reinforcement learning, guided by an AlignEvaluator reward model that provides explicit feedback based on 24 key points derived from common T2I failure modes. The framework works with any pretrained T2I model without weight modifications.

Result: Extensive experiments on HunyuanImage 2.1 model show significant improvements in image-text alignment across various semantic and compositional challenges. The method also introduces a new high-quality human preference benchmark.

Conclusion: PromptEnhancer effectively addresses T2I model limitations by decoupling prompt rewriting from generation, providing a universal solution that enhances prompt interpretation without requiring model-specific fine-tuning.

Abstract: Recent advancements in text-to-image (T2I) diffusion models have demonstrated remarkable capabilities in generating high-fidelity images. However, these models often struggle to faithfully render complex user prompts, particularly in aspects like attribute binding, negation, and compositional relationships. This leads to a significant mismatch between user intent and the generated output. To address this challenge, we introduce PromptEnhancer, a novel and universal prompt rewriting framework that enhances any pretrained T2I model without requiring modifications to its weights. Unlike prior methods that rely on model-specific fine-tuning or implicit reward signals like image-reward scores, our framework decouples the rewriter from the generator. We achieve this by training a Chain-of-Thought (CoT) rewriter through reinforcement learning, guided by a dedicated reward model we term the AlignEvaluator. The AlignEvaluator is trained to provide explicit and fine-grained feedback based on a systematic taxonomy of 24 key points, which are derived from a comprehensive analysis of common T2I failure modes. By optimizing the CoT rewriter to maximize the reward from our AlignEvaluator, our framework learns to generate prompts that are more precisely interpreted by T2I models. Extensive experiments on the HunyuanImage 2.1 model demonstrate that PromptEnhancer significantly improves image-text alignment across a wide range of semantic and compositional challenges. Furthermore, we introduce a new, high-quality human preference benchmark to facilitate future research in this direction.

[123] Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model

Hongyang Wei, Baixin Xu, Hongbo Liu, Cyrus Wu, Jie Liu, Yi Peng, Peiyu Wang, Zexiang Liu, Jingwen He, Yidan Xietian, Chuanxin Tang, Zidong Wang, Yichen Wei, Liang Hu, Boyi Jiang, William Li, Ying He, Yang Liu, Xuchen Song, Eric Li, Yahui Zhou

Main category: cs.CV

TL;DR: UniPic2-SD3.5M-Kontext is a 2B-parameter DiT model that achieves state-of-the-art image generation and editing with a novel Progressive Dual-Task Reinforcement strategy, outperforming larger models and extending to unified multimodal capabilities.

DetailsMotivation: Many open-source multimodal models prioritize parameter scaling over training optimization, limiting efficiency and performance. The authors aim to develop a more efficient model with better training strategies.

Method: Architectural modifications to SD3.5-Medium, large-scale pre-training on high-quality data, and a novel Progressive Dual-Task Reinforcement (PDTR) strategy that strengthens text-to-image generation and editing capabilities in a staged manner without negative interference.

Result: UniPic2-SD3.5M-Kontext demonstrates stronger image generation and editing than larger models (BAGEL 7B and Flux-Kontext 12B). The unified multimodal model UniPic2-Metaquery achieves top-tier performance across diverse tasks with simple training.

Conclusion: The proposed training paradigm (Skywork UniPic 2.0) is effective and generalizable, enabling state-of-the-art performance with fewer parameters through optimized training strategies rather than parameter scaling.

Abstract: Recent advances in multimodal models have demonstrated impressive capabilities in unified image generation and editing. However, many prominent open-source models prioritize scaling model parameters over optimizing training strategies, limiting their efficiency and performance. In this work, we present UniPic2-SD3.5M-Kontext, a 2B-parameter DiT model based on SD3.5-Medium, which achieves state-of-the-art image generation and editing while extending seamlessly into a unified multimodal framework. Our approach begins with architectural modifications to SD3.5-Medium and large-scale pre-training on high-quality data, enabling joint text-to-image generation and editing capabilities. To enhance instruction following and editing consistency, we propose a novel Progressive Dual-Task Reinforcement strategy (PDTR), which effectively strengthens both tasks in a staged manner. We empirically validate that the reinforcement phases for different tasks are mutually beneficial and do not induce negative interference. After pre-training and reinforcement strategies, UniPic2-SD3.5M-Kontext demonstrates stronger image generation and editing capabilities than models with significantly larger generation parameters, including BAGEL (7B) and Flux-Kontext (12B). Furthermore, following the MetaQuery approach, we connect the UniPic2-SD3.5M-Kontext and Qwen2.5-VL-7B via a connector and perform joint training to launch a unified multimodal model UniPic2-Metaquery. UniPic2-Metaquery integrates understanding, generation, and editing, achieving top-tier performance across diverse tasks with a simple and scalable training paradigm. This consistently validates the effectiveness and generalizability of our proposed training paradigm, which we formalize as Skywork UniPic 2.0.

[124] Efficient Video-to-Audio Generation via Multiple Foundation Models Mapper

Gehui Chen, Guan’an Wang, Xiaowen Huang, Jitao Sang

Main category: cs.CV

TL;DR: MFM-Mapper uses dual visual encoders and GPT-2 to efficiently map video features to audio generation, achieving better performance with only 16% of previous training scale.

DetailsMotivation: Training Video-to-Audio models from scratch is resource-intensive, so leveraging foundation models for cross-modal knowledge transfer is more efficient.

Method: Uses dual visual encoders for richer semantic/temporal features and replaces linear mapper with GPT-2 for better feature alignment, treating it as autoregressive translation.

Result: Achieves better semantic and temporal consistency with only 16% of previous training scale, competitive with larger-scale models.

Conclusion: MFM-Mapper demonstrates efficient cross-modal feature mapping with improved performance using fewer resources through foundation model fusion.

Abstract: Recent Video-to-Audio (V2A) generation relies on extracting semantic and temporal features from video to condition generative models. Training these models from scratch is resource-intensive. Consequently, leveraging foundation models (FMs) has gained traction due to their cross-modal knowledge transfer and generalization capabilities. One prior work has explored fine-tuning a lightweight mapper network to connect a pre-trained visual encoder with a text-to-audio generation model for V2A. Inspired by this, we introduce the Multiple Foundation Model Mapper (MFM-Mapper). Compared to the previous mapper approach, MFM-Mapper benefits from richer semantic and temporal information by fusing features from dual visual encoders. Furthermore, by replacing a linear mapper with GPT-2, MFM-Mapper improves feature alignment, drawing parallels between cross-modal feature mapping and autoregressive translation tasks. Our MFM-Mapper exhibits remarkable training efficiency: it achieves better semantic and temporal consistency while requiring only 16% of the training scale of previous mapper-based work, and remains competitive with models trained at a much larger scale.
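
A hedged sketch of the mapper idea: features from two visual streams are fused, projected into GPT-2's embedding space, and decoded toward audio tokens. The feature dimensions, sequence length, and audio vocabulary size below are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
from transformers import GPT2Model

# GPT-2 as the cross-modal mapper; dimensions below are illustrative.
gpt2 = GPT2Model.from_pretrained("gpt2")
d_model = gpt2.config.n_embd  # 768

sem_feats = torch.randn(1, 32, 512)   # semantic visual stream (B, T, d1)
tmp_feats = torch.randn(1, 32, 256)   # temporal visual stream (B, T, d2)

fuse = nn.Linear(512 + 256, d_model)  # project fused streams into GPT-2
to_audio = nn.Linear(d_model, 1024)   # logits over audio codec tokens

inputs_embeds = fuse(torch.cat([sem_feats, tmp_feats], dim=-1))
hidden = gpt2(inputs_embeds=inputs_embeds).last_hidden_state
audio_logits = to_audio(hidden)       # decoded autoregressively in practice
print(audio_logits.shape)             # torch.Size([1, 32, 1024])
```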

[125] Inpaint4Drag: Repurposing Inpainting Models for Drag-Based Image Editing via Bidirectional Warping

Jingyi Lu, Kai Han

Main category: cs.CV

TL;DR: Inpaint4Drag is a real-time drag-based image editing framework that uses pixel-space bidirectional warping and inpainting instead of latent space manipulation, achieving 0.01s warping previews and 0.3s inpainting at 512x512 resolution.

DetailsMotivation: Existing drag-based editing methods rely on latent space manipulation of generative models, which leads to limited precision, delayed feedback, and model-specific constraints that hinder user experience.

Method: The framework decomposes drag-based editing into pixel-space bidirectional warping (treating image regions as deformable materials) and image inpainting, transforming drag inputs directly into standard inpainting formats.

Result: Achieves real-time warping previews (0.01s) and efficient inpainting (0.3s), significantly faster than existing methods that require minutes per edit, while maintaining superior visual quality and precise control.

Conclusion: Inpaint4Drag serves as a universal adapter for any inpainting model without architecture modification, automatically benefiting from future inpainting technology improvements while providing real-time performance and precise editing capabilities.

Abstract: Drag-based image editing has emerged as a powerful paradigm for intuitive image manipulation. However, existing approaches predominantly rely on manipulating the latent space of generative models, leading to limited precision, delayed feedback, and model-specific constraints. Accordingly, we present Inpaint4Drag, a novel framework that decomposes drag-based editing into pixel-space bidirectional warping and image inpainting. Inspired by elastic object deformation in the physical world, we treat image regions as deformable materials that maintain natural shape under user manipulation. Our method achieves real-time warping previews (0.01s) and efficient inpainting (0.3s) at 512x512 resolution, significantly improving the interaction experience compared to existing methods that require minutes per edit. By transforming drag inputs directly into standard inpainting formats, our approach serves as a universal adapter for any inpainting model without architecture modification, automatically inheriting all future improvements in inpainting technology. Extensive experiments demonstrate that our method achieves superior visual quality and precise control while maintaining real-time performance. Project page: https://visual-ai.github.io/inpaint4drag/
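
The decomposition can be imitated with classical OpenCV operations, with `cv2.remap` standing in for the paper's bidirectional warping and `cv2.inpaint` for its learned inpainter; the drag vector and toy image are illustrative.

```python
import cv2
import numpy as np

# Classical stand-ins: cv2.remap for the warp and cv2.inpaint for hole
# filling; the paper uses bidirectional warping plus a learned inpainter.
img = np.full((256, 256, 3), 200, np.uint8)
cv2.circle(img, (100, 128), 30, (0, 0, 255), -1)  # object to drag

dx, dy = 60, 0  # drag vector for the selected region
mask = np.zeros((256, 256), np.uint8)
cv2.circle(mask, (100, 128), 30, 255, -1)

# Backward warp: sample the moved region from its source location.
ys, xs = np.mgrid[0:256, 0:256].astype(np.float32)
map_x, map_y = xs - dx, ys - dy
moved = cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR)
moved_mask = cv2.remap(mask, map_x, map_y, cv2.INTER_NEAREST)

out = img.copy()
out[moved_mask > 0] = moved[moved_mask > 0]          # place dragged region
hole = cv2.bitwise_and(mask, cv2.bitwise_not(moved_mask))
out = cv2.inpaint(out, hole, 3, cv2.INPAINT_TELEA)   # fill vacated pixels
cv2.imwrite("drag_edit.png", out)
```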

[126] DisPatch: Disarming Adversarial Patches in Object Detection with Diffusion Models

Jin Ma, Mohammed Aldeen, Christopher Salas, Feng Luo, Mashrur Chowdhury, Mert Pesé, Long Cheng

Main category: cs.CV

TL;DR: DISPATCH is a diffusion-based defense framework that regenerates and rectifies images to protect object detectors from adversarial patch attacks, achieving state-of-the-art performance without requiring prior knowledge of attacks.

DetailsMotivation: Object detectors are vulnerable to adversarial patch attacks that can conceal or create objects, leading to severe real-world consequences. Current defenses lack effectiveness, generalizability, and robustness against diverse and unknown attacks.

Method: Uses a ‘regenerate and rectify’ strategy with diffusion models to regenerate entire images to align with benign data distribution, then identifies and replaces adversarial regions with their regenerated benign counterparts. Attack-agnostic and requires no prior knowledge of patches.

Result: Achieves best overall mAP.5 score of 89.3% on hiding attacks, lowers attack success rate to 24.8% on untargeted creating attacks, and maintains strong robustness across multiple detectors and adaptive attacks.

Conclusion: DISPATCH provides an effective, generalizable, and robust defense framework that outperforms state-of-the-art methods, making it practical and reliable for real-world object detection systems.

Abstract: Object detection is fundamental to various real-world applications, such as security monitoring and surveillance video analysis. Despite their advancements, state-of-the-art object detectors are still vulnerable to adversarial patch attacks, which can be easily applied to real-world objects to either conceal actual items or create non-existent ones, leading to severe consequences. Given the current diversity of adversarial patch attacks and potential unknown threats, an ideal defense method should be effective, generalizable, and robust against adaptive attacks. In this work, we introduce DISPATCH, the first diffusion-based defense framework for object detection. Unlike previous works that aim to “detect and remove” adversarial patches, DISPATCH adopts a “regenerate and rectify” strategy, leveraging generative models to disarm attack effects while preserving the integrity of the input image. Specifically, we utilize the in-distribution generative power of diffusion models to regenerate the entire image, aligning it with benign data. A rectification process is then employed to identify and replace adversarial regions with their regenerated benign counterparts. DISPATCH is attack-agnostic and requires no prior knowledge of the existing patches. Extensive experiments across multiple detectors and attacks demonstrate that DISPATCH consistently outperforms state-of-the-art defenses on both hiding attacks and creating attacks, achieving the best overall mAP.5 score of 89.3% on hiding attacks, and lowering the attack success rate to 24.8% on untargeted creating attacks. Moreover, it maintains strong robustness against adaptive attacks, making it a practical and reliable defense for object detection systems.
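
A minimal sketch of the "regenerate and rectify" strategy: a stub `regenerate` stands in for the diffusion model, and rectification replaces pixels that deviate strongly from the regeneration. The threshold and toy data are illustrative.

```python
import numpy as np

# "Regenerate and rectify" sketch. `regenerate` is a stub standing in for
# a diffusion model that re-synthesizes the image toward benign data.
def regenerate(img):
    return np.full_like(img, 0.5)  # stub: pretend benign reconstruction

def rectify(img, regen, thresh=0.25):
    # Flag pixels that deviate strongly from the regeneration and replace
    # them with the regenerated (benign) content.
    diff = np.abs(img - regen).mean(axis=-1)
    mask = diff > thresh
    out = img.copy()
    out[mask] = regen[mask]
    return out, mask

img = np.full((64, 64, 3), 0.5, np.float32)
img[10:26, 10:26] = 1.0            # toy "adversarial patch"
out, mask = rectify(img, regenerate(img))
print(mask.sum())                   # 256 patch pixels replaced
```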

[127] WATCH: World-aware Allied Trajectory and pose reconstruction for Camera and Human

Qijun Ying, Zhongyuan Hu, Rui Zhang, Ronghui Li, Yu Lu, Zijiao Zeng

Main category: cs.CV

TL;DR: WATCH is a unified framework for global human motion reconstruction from monocular videos that addresses camera orientation and translation integration challenges through analytical heading angle decomposition and camera trajectory integration.

DetailsMotivation: Global human motion reconstruction from monocular videos faces challenges with depth ambiguity, motion ambiguity, and camera-human movement entanglement. Existing human-motion-centric approaches insufficiently exploit camera orientation information and ineffectively integrate camera translation cues.

Method: Introduces analytical heading angle decomposition technique for superior efficiency and extensibility, plus a camera trajectory integration mechanism inspired by world models to leverage camera translation information beyond naive hard-decoding approaches.

Result: Achieves state-of-the-art performance in end-to-end trajectory reconstruction on in-the-wild benchmarks, demonstrating effectiveness of jointly modeling camera-human motion relationships.

Conclusion: WATCH provides new insights for addressing the long-standing challenge of camera translation integration in global human motion reconstruction, with code to be made publicly available.

Abstract: Global human motion reconstruction from in-the-wild monocular videos is increasingly demanded across VR, graphics, and robotics applications, yet requires accurate mapping of human poses from camera to world coordinates, a task challenged by depth ambiguity, motion ambiguity, and the entanglement between camera and human movements. While human-motion-centric approaches excel in preserving motion details and physical plausibility, they suffer from two critical limitations: insufficient exploitation of camera orientation information and ineffective integration of camera translation cues. We present WATCH (World-aware Allied Trajectory and pose reconstruction for Camera and Human), a unified framework addressing both challenges. Our approach introduces an analytical heading angle decomposition technique that offers superior efficiency and extensibility compared to existing geometric methods. Additionally, we design a camera trajectory integration mechanism inspired by world models, providing an effective pathway for leveraging camera translation information beyond naive hard-decoding approaches. Through experiments on in-the-wild benchmarks, WATCH achieves state-of-the-art performance in end-to-end trajectory reconstruction. Our work demonstrates the effectiveness of jointly modeling camera-human motion relationships and offers new insights for addressing the long-standing challenge of camera translation integration in global human motion reconstruction. The code will be available publicly.
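
The paper's analytical heading-angle decomposition is not reproduced here, but the basic idea of isolating heading can be illustrated with a standard yaw extraction from a rotation matrix, assuming z-up world coordinates:

```python
import numpy as np

def yaw_from_rotation(R: np.ndarray) -> float:
    # Heading about the vertical (z) axis of a 3x3 rotation matrix,
    # assuming z-up world coordinates (a common convention; the paper's
    # exact decomposition is not reproduced here).
    return float(np.arctan2(R[1, 0], R[0, 0]))

def split_heading(R: np.ndarray):
    # Factor R into a pure heading rotation Rz and a residual tilt.
    yaw = yaw_from_rotation(R)
    c, s = np.cos(yaw), np.sin(yaw)
    Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return Rz, Rz.T @ R  # so that R = Rz @ residual

theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0, 0, 1.0]])
Rz, residual = split_heading(R)
print(np.rad2deg(yaw_from_rotation(R)), np.allclose(residual, np.eye(3)))
```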

[128] Sali4Vid: Saliency-Aware Video Reweighting and Adaptive Caption Retrieval for Dense Video Captioning

MinJu Jeon, Si-Woo Kim, Ye-Chan Kim, HyunGee Kim, Dong-Jin Kim

Main category: cs.CV

TL;DR: Sali4Vid is a saliency-aware framework that improves dense video captioning by converting timestamp annotations into frame importance weights and using semantic-based adaptive caption retrieval to handle scene transitions.

DetailsMotivation: Current end-to-end dense video captioning models have two main limitations: they apply timestamp supervision only to text while treating all video frames equally, and they retrieve captions from fixed-size video chunks which overlooks scene transitions.

Method: Proposes Saliency-aware Video Reweighting to convert timestamp annotations into sigmoid-based frame importance weights, and Semantic-based Adaptive Caption Retrieval which segments videos by frame similarity to capture scene transitions and improve caption retrieval.

Result: Sali4Vid achieves state-of-the-art results on YouCook2 and ViTT datasets.

Conclusion: The framework demonstrates the benefit of jointly improving video weighting and retrieval for dense video captioning, showing that saliency-aware approaches can significantly enhance performance.

Abstract: Dense video captioning aims to temporally localize events in video and generate captions for each event. While recent works propose end-to-end models, they suffer from two limitations: (1) applying timestamp supervision only to text while treating all video frames equally, and (2) retrieving captions from fixed-size video chunks, overlooking scene transitions. To address these, we propose Sali4Vid, a simple yet effective saliency-aware framework. We introduce Saliency-aware Video Reweighting, which converts timestamp annotations into sigmoid-based frame importance weights, and Semantic-based Adaptive Caption Retrieval, which segments videos by frame similarity to capture scene transitions and improve caption retrieval. Sali4Vid achieves state-of-the-art results on YouCook2 and ViTT, demonstrating the benefit of jointly improving video weighting and retrieval for dense video captioning.
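
One plausible form of sigmoid-based frame weights derived from a timestamp annotation, shown purely as an illustration (the paper's exact parameterization may differ): weights are high inside the annotated event and decay smoothly outside it.

```python
import numpy as np

def frame_weights(t, start, end, k=4.0):
    # Soft window: high inside [start, end], smoothly decaying outside.
    # k controls the sharpness of the event boundaries (illustrative form;
    # the paper's exact parameterization may differ).
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    return sig(k * (t - start)) * sig(k * (end - t))

t = np.linspace(0, 10, 11)          # frame timestamps in seconds
w = frame_weights(t, start=3.0, end=6.0)
print(np.round(w, 3))
```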

[129] UAV-Based Intelligent Traffic Surveillance System: Real-Time Vehicle Detection, Classification, Tracking, and Behavioral Analysis

Ali Khanpour, Tianyi Wang, Afra Vahidi-Shams, Wim Ectors, Farzam Nakhaie, Amirhossein Taheri, Christian Claudel

Main category: cs.CV

TL;DR: UAV-based traffic surveillance system using aerial video from 200m altitude achieves 91.8% detection precision and 90.5% F1-score, with advanced tracking and violation detection capabilities for smart city applications.

DetailsMotivation: Address limitations of traditional traffic monitoring systems (fixed cameras, sensors) which have limited coverage, low adaptability, and poor scalability in urban environments.

Method: Leverages multi-scale/multi-angle template matching, Kalman filtering, homography-based calibration, geofencing, motion filtering, and trajectory deviation analysis for processing aerial video data.

Result: Achieved 91.8% detection precision, 90.5% F1-score, 92.1% MOTA, 93.7% MOTP tracking metrics. Successfully classified 5 vehicle types and detected traffic violations (unsafe lane changes, illegal parking, crosswalk obstructions).

Conclusion: System demonstrates scalability, accuracy, and practical relevance as an enforcement-aware, infrastructure-independent traffic monitoring solution for next-generation smart cities.

Abstract: Traffic congestion and violations pose significant challenges for urban mobility and road safety. Traditional traffic monitoring systems, such as fixed cameras and sensor-based methods, are often constrained by limited coverage, low adaptability, and poor scalability. To address these challenges, this paper introduces an advanced unmanned aerial vehicle (UAV)-based traffic surveillance system capable of accurate vehicle detection, classification, tracking, and behavioral analysis in real-world, unconstrained urban environments. The system leverages multi-scale and multi-angle template matching, Kalman filtering, and homography-based calibration to process aerial video data collected from altitudes of approximately 200 meters. A case study in an urban area demonstrates robust performance, achieving a detection precision of 91.8%, an F1-score of 90.5%, and tracking metrics (MOTA/MOTP) of 92.1% and 93.7%, respectively. Beyond precise detection, the system classifies five vehicle types and automatically detects critical traffic violations, including unsafe lane changes, illegal double parking, and crosswalk obstructions, through the fusion of geofencing, motion filtering, and trajectory deviation analysis. The integrated analytics module supports origin-destination tracking, vehicle count visualization, inter-class correlation analysis, and heatmap-based congestion modeling. Additionally, the system enables entry-exit trajectory profiling, vehicle density estimation across road segments, and movement direction logging, supporting comprehensive multi-scale urban mobility analytics. Experimental results confirm the system’s scalability, accuracy, and practical relevance, highlighting its potential as an enforcement-aware, infrastructure-independent traffic monitoring solution for next-generation smart cities.
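
The homography-based calibration step can be sketched directly with OpenCV: four or more surveyed pixel-to-ground correspondences (illustrative values below) yield a mapping from detector outputs to road-plane coordinates in meters.

```python
import cv2
import numpy as np

# Homography-based calibration sketch: map image pixels to ground-plane
# coordinates from >= 4 surveyed correspondences (values are illustrative).
px = np.array([[100, 200], [500, 210], [480, 600], [120, 580]], np.float32)
world = np.array([[0, 0], [20, 0], [20, 30], [0, 30]], np.float32)  # meters

H, _ = cv2.findHomography(px, world)

def pixel_to_world(u, v):
    p = H @ np.array([u, v, 1.0])
    return p[:2] / p[2]  # dehomogenize

# Vehicle centroid from the detector -> position on the road plane.
print(pixel_to_world(300, 400))
```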

[130] VCMamba: Bridging Convolutions with Multi-Directional Mamba for Efficient Visual Representation

Mustafa Munir, Alex Zhang, Radu Marculescu

Main category: cs.CV

TL;DR: VCMamba is a hybrid vision backbone that combines CNNs for local feature extraction with multi-directional Mamba SSMs for global context modeling, achieving state-of-the-art performance with fewer parameters.

DetailsMotivation: To bridge the gap between CNNs (good at local features but poor global reasoning) and modern architectures like ViTs/SSMs (good global context but poor local feature capture), creating a model that leverages the strengths of both approaches.

Method: Uses a convolutional stem and hierarchical structure with convolutional blocks in early stages for local feature extraction, followed by multi-directional Mamba blocks in later stages for efficient long-range dependency modeling with linear complexity.

Result: VCMamba-B achieves 82.6% top-1 accuracy on ImageNet-1K (surpassing PlainMamba-L3 by 0.3% with 37% fewer parameters) and 47.1 mIoU on ADE20K (exceeding EfficientFormer-L7 by 2.0 mIoU with 62% fewer parameters).

Conclusion: The hybrid CNN-Mamba architecture successfully combines the complementary strengths of both approaches, delivering superior performance with significantly reduced parameter counts across image classification and semantic segmentation tasks.

Abstract: Recent advances in Vision Transformers (ViTs) and State Space Models (SSMs) have challenged the dominance of Convolutional Neural Networks (CNNs) in computer vision. ViTs excel at capturing global context, and SSMs like Mamba offer linear complexity for long sequences, yet they do not capture fine-grained local features as effectively as CNNs. Conversely, CNNs possess strong inductive biases for local features but lack the global reasoning capabilities of transformers and Mamba. To bridge this gap, we introduce VCMamba, a novel vision backbone that integrates the strengths of CNNs and multi-directional Mamba SSMs. VCMamba employs a convolutional stem and a hierarchical structure with convolutional blocks in its early stages to extract rich local features. These convolutional blocks are then processed by later stages incorporating multi-directional Mamba blocks designed to efficiently model long-range dependencies and global context. This hybrid design allows for superior feature representation while maintaining linear complexity with respect to image resolution. We demonstrate VCMamba’s effectiveness through extensive experiments on ImageNet-1K classification and ADE20K semantic segmentation. Our VCMamba-B achieves 82.6% top-1 accuracy on ImageNet-1K, surpassing PlainMamba-L3 by 0.3% with 37% fewer parameters, and outperforming Vision GNN-B by 0.3% with 64% fewer parameters. Furthermore, VCMamba-B obtains 47.1 mIoU on ADE20K, exceeding EfficientFormer-L7 by 2.0 mIoU while utilizing 62% fewer parameters. Code is available at https://github.com/Wertyuui345/VCMamba.
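
A hedged structural sketch of the hybrid design: convolutional blocks in early stages, then a global mixer that scans the flattened feature map in two directions. The GRU below is purely a placeholder for a real multi-directional Mamba (SSM) block.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    # Early-stage local feature extractor.
    def __init__(self, c):
        super().__init__()
        self.dw = nn.Conv2d(c, c, 3, padding=1, groups=c)
        self.pw = nn.Conv2d(c, c, 1)
        self.act = nn.GELU()
    def forward(self, x):
        return x + self.pw(self.act(self.dw(x)))

class GlobalMixer(nn.Module):
    # Placeholder for a multi-directional Mamba block: scans the flattened
    # feature map in forward and reverse order (a real SSM layer would
    # replace the GRUs, which are used here purely for illustration).
    def __init__(self, c):
        super().__init__()
        self.fwd = nn.GRU(c, c, batch_first=True)
        self.bwd = nn.GRU(c, c, batch_first=True)
    def forward(self, x):
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)            # (B, HW, C)
        f, _ = self.fwd(seq)
        r, _ = self.bwd(seq.flip(1))
        out = f + r.flip(1)
        return out.transpose(1, 2).reshape(b, c, h, w)

stem = nn.Conv2d(3, 32, 4, stride=4)
net = nn.Sequential(stem, ConvBlock(32), nn.Conv2d(32, 64, 2, stride=2),
                    GlobalMixer(64))
print(net(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 64, 8, 8])
```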

[131] Guideline-Consistent Segmentation via Multi-Agent Refinement

Vanshika Vats, Ashwani Rathee, James Davis

Main category: cs.CV

TL;DR: A multi-agent framework that uses vision-language models to ensure semantic segmentation follows complex textual guidelines without retraining, outperforming state-of-the-art methods on Waymo and ReasonSeg datasets.

DetailsMotivation: Real-world semantic segmentation requires strict adherence to complex, paragraph-length labeling guidelines that both humans and automated systems often fail to follow, and traditional approaches require expensive task-specific retraining when guidelines evolve.

Method: Multi-agent training-free framework with Worker-Supervisor refinement architecture: Worker performs segmentation, Supervisor critiques against retrieved guidelines, and reinforcement learning stop policy terminates the loop to balance accuracy and resource use.

Result: Notably outperforms state-of-the-art baselines on Waymo and ReasonSeg datasets, demonstrating strong generalization and instruction adherence.

Conclusion: The framework successfully addresses the challenge of following complex segmentation guidelines without retraining, providing guideline-consistent masks while efficiently managing computational resources.

Abstract: Semantic segmentation in real-world applications often requires not only accurate masks but also strict adherence to textual labeling guidelines. These guidelines are typically complex and long, and both human and automated labeling often fail to follow them faithfully. Traditional approaches depend on expensive task-specific retraining that must be repeated as the guidelines evolve. Although recent open-vocabulary segmentation methods excel with simple prompts, they often fail when confronted with sets of paragraph-length guidelines that specify intricate segmentation rules. To address this, we introduce a multi-agent, training-free framework that coordinates general-purpose vision-language models within an iterative Worker-Supervisor refinement architecture. The Worker performs the segmentation, the Supervisor critiques it against the retrieved guidelines, and a lightweight reinforcement learning stop policy decides when to terminate the loop, ensuring guideline-consistent masks while balancing resource use. Evaluated on the Waymo and ReasonSeg datasets, our method notably outperforms state-of-the-art baselines, demonstrating strong generalization and instruction adherence.
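
The Worker-Supervisor loop reduces to a small control skeleton; the three callables are hypothetical stand-ins for the VLM worker, the guideline-grounded supervisor, and the learned RL stop policy (replaced here by a simple rule).

```python
# Skeleton of the Worker-Supervisor refinement loop; the three callables
# are hypothetical stand-ins for VLM calls and the learned stop policy.

def worker_segment(image, guidelines, feedback=None):
    return {"mask": "...", "feedback_used": feedback}  # stub

def supervisor_critique(image, mask, guidelines):
    return {"violations": [], "feedback": "tighten the boundary"}  # stub

def stop_policy(critique, step, max_steps=5):
    # The paper learns this with RL; here: stop when no violations remain
    # or the step budget is exhausted.
    return not critique["violations"] or step >= max_steps

def refine(image, guidelines, max_steps=5):
    feedback, mask = None, None
    for step in range(1, max_steps + 1):
        mask = worker_segment(image, guidelines, feedback)
        critique = supervisor_critique(image, mask, guidelines)
        if stop_policy(critique, step, max_steps):
            break
        feedback = critique["feedback"]
    return mask

print(refine("image.png", "paragraph-length labeling rules"))
```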

[132] Domain Adaptation for Different Sensor Configurations in 3D Object Detection

Satoshi Tanaka, Kok Seang Tan, Isamu Yamashita

Main category: cs.CV

TL;DR: Proposes two techniques for adapting 3D object detection models across different LiDAR sensor configurations: Downstream Fine-tuning and Partial Layer Fine-tuning to address performance degradation from distribution shifts.

DetailsMotivation: Different vehicle platforms use distinct LiDAR sensor configurations, causing performance degradation when models trained on one configuration are applied to another due to point cloud distribution shifts. This domain gap for different sensor configurations remains largely unexplored compared to environmental domain gaps.

Method: Two proposed techniques: 1) Downstream Fine-tuning - dataset-specific fine-tuning after multi-dataset training, and 2) Partial Layer Fine-tuning - updating only a subset of layers to improve cross-configuration generalization. Used paired datasets from same geographic region with multiple sensor configurations.

Result: Joint training with Downstream Fine-tuning and Partial Layer Fine-tuning consistently outperforms naive joint training for each sensor configuration.

Conclusion: Provides a practical and scalable solution for adapting 3D object detection models to diverse vehicle platforms with different LiDAR sensor configurations.

Abstract: Recent advances in autonomous driving have underscored the importance of accurate 3D object detection, with LiDAR playing a central role due to its robustness under diverse visibility conditions. However, different vehicle platforms often deploy distinct sensor configurations, causing performance degradation when models trained on one configuration are applied to another because of shifts in the point cloud distribution. Prior work on multi-dataset training and domain adaptation for 3D object detection has largely addressed environmental domain gaps and density variation within a single LiDAR; in contrast, the domain gap for different sensor configurations remains largely unexplored. In this work, we address domain adaptation across different sensor configurations in 3D object detection. We propose two techniques: Downstream Fine-tuning (dataset-specific fine-tuning after multi-dataset training) and Partial Layer Fine-tuning (updating only a subset of layers to improve cross-configuration generalization). Using paired datasets collected in the same geographic region with multiple sensor configurations, we show that joint training with Downstream Fine-tuning and Partial Layer Fine-tuning consistently outperforms naive joint training for each configuration. Our findings provide a practical and scalable solution for adapting 3D object detection models to diverse vehicle platforms.
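
Partial Layer Fine-tuning is straightforward to express in PyTorch: freeze everything, then unfreeze a chosen subset of layers. Which layers to update is a design choice; the toy model and selection below are illustrative.

```python
import torch.nn as nn

# Partial Layer Fine-tuning sketch: freeze everything, then unfreeze only
# a chosen subset (which layers to update is a design choice; the
# selection below is illustrative).
model = nn.Sequential(
    nn.Linear(64, 64),   # e.g., geometry-sensitive input layers
    nn.Linear(64, 64),   # shared backbone
    nn.Linear(64, 10),   # detection head
)

trainable_prefixes = ("0.", "2.")  # input layers + head; backbone frozen
for name, p in model.named_parameters():
    p.requires_grad = name.startswith(trainable_prefixes)

for name, p in model.named_parameters():
    print(name, p.requires_grad)
```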

[133] CD-Mamba: Cloud detection with long-range spatial dependency modeling

Tianxiang Xue, Jiayi Zhao, Jingsheng Li, Changlu Chen, Kun Zhan

Main category: cs.CV

TL;DR: CD-Mamba: Hybrid CNN-Mamba model for cloud detection in remote sensing images that combines local spatial feature extraction with long-range dependency modeling

DetailsMotivation: Cloud cover obscures remote sensing images, requiring detection methods that address both short-range spatial redundancies and long-range atmospheric similarities among cloud patches

Method: Integrates convolutional neural networks for local spatial dependencies and Mamba’s state-space modeling for long-range dependencies into a unified cloud detection network
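
The core idea pairs a convolutional branch for local detail with a state-space branch for long-range context. Below is a hedged sketch of one such hybrid block, assuming the open-source `mamba_ssm` package (which requires CUDA); the block layout is an illustration, not CD-Mamba's actual architecture.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # reference Mamba implementation (CUDA required)

class HybridConvMambaBlock(nn.Module):
    """Hypothetical block: conv branch for local spatial detail, Mamba
    branch over the flattened patch sequence for long-range context."""
    def __init__(self, dim: int):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        self.norm = nn.LayerNorm(dim)
        self.ssm = Mamba(d_model=dim, d_state=16, d_conv=4, expand=2)

    def forward(self, x):            # x: (B, C, H, W)
        x = x + self.local(x)        # short-range spatial dependencies
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)    # (B, H*W, C)
        seq = seq + self.ssm(self.norm(seq))  # long-range dependencies
        return seq.transpose(1, 2).reshape(b, c, h, w)
```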

Result: Extensive experiments validate effectiveness and demonstrate superior performance over existing methods

Conclusion: CD-Mamba successfully manages both pixel-wise interactions and extensive patch-wise dependencies simultaneously, improving detection accuracy across diverse spatial scales

Abstract: Remote sensing images are frequently obscured by cloud cover, posing significant challenges to data integrity and reliability. Effective cloud detection requires addressing both short-range spatial redundancies and long-range atmospheric similarities among cloud patches. Convolutional neural networks are effective at capturing local spatial dependencies, while Mamba has strong capabilities in modeling long-range dependencies. To fully leverage both local spatial relations and long-range dependencies, we propose CD-Mamba, a hybrid model that integrates convolution and Mamba’s state-space modeling into a unified cloud detection network. CD-Mamba is designed to comprehensively capture pixel-wise textural details and long-term patch-wise dependencies for cloud detection. This design enables CD-Mamba to manage both pixel-wise interactions and extensive patch-wise dependencies simultaneously, improving detection accuracy across diverse spatial scales. Extensive experiments validate the effectiveness of CD-Mamba and demonstrate its superior performance over existing methods.

[134] Exploiting Unlabeled Structures through Task Consistency Training for Versatile Medical Image Segmentation

Shengqian Zhu, Jiafei Wu, Xiaogang Xu, Chengrong Yu, Ying Song, Zhang Yi, Guangjun Li, Junjie Hu

Main category: cs.CV

TL;DR: TCT framework addresses class imbalance in medical image segmentation using partially labeled datasets without extra models, employing consistency constraints and uncertainty-weighted loss.

DetailsMotivation: Current VMIS approaches suffer from class imbalance due to unequal category distribution in partially labeled datasets, and existing pseudo-label methods require additional models and suffer from label noise degradation.

Method: Task Consistency Training (TCT) with main segmentation head and multiple auxiliary task heads, consistency constraints between predictions, filtering strategy for low-consistency data, and unified auxiliary uncertainty-weighted loss.
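
A minimal PyTorch sketch of the consistency constraint plus low-consistency filtering might look like the following; the threshold and the MSE formulation are assumptions for illustration, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def task_consistency_loss(main_logits, aux_logits, class_idx, keep_thresh=0.7):
    """Hypothetical TCT-style constraint: the main head's probability for
    one class should agree with the matching auxiliary (binary) head.
    Low-consistency samples are filtered out to limit noise propagation."""
    main_prob = main_logits.softmax(dim=1)[:, class_idx]   # (B, H, W)
    aux_prob = aux_logits.sigmoid().squeeze(1)             # (B, H, W)
    # Per-sample consistency score; drop samples that disagree too much
    consistency = 1 - (main_prob - aux_prob).abs().mean(dim=(1, 2))
    keep = consistency > keep_thresh
    if not keep.any():
        return main_logits.new_zeros(())
    return F.mse_loss(main_prob[keep], aux_prob[keep])
```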

Result: Extensive experiments on eight abdominal datasets from diverse clinical sites demonstrate the approach’s effectiveness in handling class imbalance.

Conclusion: TCT framework successfully addresses class imbalance in versatile medical image segmentation without requiring additional models, effectively utilizing unlabeled anatomical structures while mitigating noise and task dominance issues.

Abstract: Versatile medical image segmentation (VMIS) targets the segmentation of multiple classes, while obtaining full annotations for all classes is often impractical due to the time and labor required. Leveraging partially labeled datasets (PLDs) presents a promising alternative; however, current VMIS approaches face significant class imbalance due to the unequal category distribution in PLDs. Existing methods attempt to address this by generating pseudo-full labels. Nevertheless, these typically require additional models and often result in potential performance degradation from label noise. In this work, we introduce a Task Consistency Training (TCT) framework to address class imbalance without requiring extra models. TCT includes a backbone network with a main segmentation head (MSH) for multi-channel predictions and multiple auxiliary task heads (ATHs) for task-specific predictions. By enforcing a consistency constraint between the MSH and ATH predictions, TCT effectively utilizes unlabeled anatomical structures. To avoid error propagation from low-consistency, potentially noisy data, we propose a filtering strategy to exclude such data. Additionally, we introduce a unified auxiliary uncertainty-weighted loss (UAUWL) to mitigate segmentation quality declines caused by the dominance of specific tasks. Extensive experiments on eight abdominal datasets from diverse clinical sites demonstrate our approach’s effectiveness.

[135] Enhancing Self-Driving Segmentation in Adverse Weather Conditions: A Dual Uncertainty-Aware Training Approach to SAM Optimization

Dharsan Ravindran, Kevin Wang, Zhuoyuan Cao, Saleh Abdelrahman, Jeffery Wu

Main category: cs.CV

TL;DR: Enhancing SAM/SAM2 segmentation for autonomous driving in adverse weather by incorporating uncertainty-aware training methods from medical imaging, achieving improved robustness.

DetailsMotivation: Vision foundation models like SAM/SAM2 struggle with visual ambiguity in adverse weather conditions due to lack of uncertainty quantification, which is critical for safety in autonomous driving.

Method: Two approaches: 1) Multi-step finetuning of SAM2 with uncertainty metrics in loss function, 2) Adapting Uncertainty-Aware Adapter (UAT) from medical imaging to driving contexts.
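
As a rough illustration of the first approach, the sketch below adds a predictive-entropy term to a standard segmentation loss; the exact uncertainty metric and weighting used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def uncertainty_aware_loss(logits, target, lam=0.1):
    """Cross-entropy plus a predictive-entropy penalty (a sketch; lam is
    an assumed weighting hyperparameter)."""
    ce = F.cross_entropy(logits, target)
    prob = logits.softmax(dim=1).clamp_min(1e-8)
    entropy = -(prob * prob.log()).sum(dim=1).mean()  # mean pixel entropy
    return ce + lam * entropy
```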

Result: UAT-SAM outperforms standard SAM in extreme weather, while SAM2 with uncertainty-aware loss achieves improved performance across diverse driving scenes on CamVid, BDD100K, and GTA datasets.

Conclusion: Explicit uncertainty modeling is valuable for safety-critical autonomous driving in challenging environments, with both proposed methods showing enhanced segmentation robustness.

Abstract: Recent advances in vision foundation models, such as the Segment Anything Model (SAM) and its successor SAM2, have achieved state-of-the-art performance on general image segmentation benchmarks. However, these models struggle in adverse weather conditions where visual ambiguity is high, largely due to their lack of uncertainty quantification. Inspired by progress in medical imaging, where uncertainty-aware training has improved reliability in ambiguous cases, we investigate two approaches to enhance segmentation robustness for autonomous driving. First, we introduce a multi-step finetuning procedure for SAM2 that incorporates uncertainty metrics directly into the loss function, improving overall scene recognition. Second, we adapt the Uncertainty-Aware Adapter (UAT), originally designed for medical image segmentation, to driving contexts. We evaluate both methods on CamVid, BDD100K, and GTA driving datasets. Experiments show that UAT-SAM outperforms standard SAM in extreme weather, while SAM2 with uncertainty-aware loss achieves improved performance across diverse driving scenes. These findings underscore the value of explicit uncertainty modeling for safety-critical autonomous driving in challenging environments.

[136] WatchHAR: Real-time On-device Human Activity Recognition System for Smartwatches

Taeyoung Yeon, Vasco Xu, Henry Hoffmann, Karan Ahuja

Main category: cs.CV

TL;DR: WatchHAR is a smartwatch-based human activity recognition system that processes audio and inertial data entirely on-device, achieving 5x faster processing with over 90% accuracy across 25+ activities while addressing privacy and latency concerns.

DetailsMotivation: To create a privacy-preserving, low-latency HAR system that runs fully on smartwatches without external data processing, overcoming limitations of current systems that require off-device computation.

Method: Developed an optimized end-to-end trainable architecture that unifies sensor data preprocessing and inference, specifically designed for on-device operation with audio and inertial sensors.

Result: Achieved 9.3 ms processing time for activity event detection and 11.8 ms for multimodal activity classification, with over 90% accuracy across more than 25 activity classes while running directly on smartwatch hardware.

Conclusion: WatchHAR successfully demonstrates that smartwatches can serve as standalone, privacy-aware continuous activity tracking devices with minimal latency and high accuracy, advancing on-device activity recognition capabilities.

Abstract: Despite advances in practical and multimodal fine-grained Human Activity Recognition (HAR), a system that runs entirely on smartwatches in unconstrained environments remains elusive. We present WatchHAR, an audio and inertial-based HAR system that operates fully on smartwatches, addressing privacy and latency issues associated with external data processing. By optimizing each component of the pipeline, WatchHAR achieves compounding performance gains. We introduce a novel architecture that unifies sensor data preprocessing and inference into an end-to-end trainable module, achieving 5x faster processing while maintaining over 90% accuracy across more than 25 activity classes. WatchHAR outperforms state-of-the-art models for event detection and activity classification while running directly on the smartwatch, achieving 9.3 ms processing time for activity event detection and 11.8 ms for multimodal activity classification. This research advances on-device activity recognition, realizing smartwatches’ potential as standalone, privacy-aware, and minimally-invasive continuous activity tracking devices.

[137] MCANet: A Multi-Scale Class-Specific Attention Network for Multi-Label Post-Hurricane Damage Assessment using UAV Imagery

Zhangding Liu, Neda Mohammadi, John E. Taylor

Main category: cs.CV

TL;DR: MCANet is a multi-label classification framework for hurricane damage assessment that uses multi-scale feature learning and class-specific attention to outperform existing CNN methods, achieving 91.75% mAP on UAV imagery.

DetailsMotivation: Existing CNN-based methods struggle with capturing multi-scale spatial features and distinguishing visually similar damage types in post-hurricane damage assessment, which is critical for disaster response.

Method: Proposes MCANet with Res2Net-based hierarchical backbone for multi-scale context and multi-head class-specific residual attention module that focuses on different spatial granularities for each damage category.
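
The attention module appears to follow the class-specific residual attention (CSRA) pattern; a single head in that spirit could be sketched as follows, with illustrative hyperparameters.

```python
import torch
import torch.nn as nn

class ClassSpecificResidualAttention(nn.Module):
    """One attention head in the CSRA spirit: per-class spatial pooling
    added as a residual to global average pooling (a sketch, not the
    paper's exact module)."""
    def __init__(self, in_dim: int, num_classes: int,
                 lam: float = 0.2, temp: float = 1.0):
        super().__init__()
        self.fc = nn.Conv2d(in_dim, num_classes, kernel_size=1, bias=False)
        self.lam, self.temp = lam, temp

    def forward(self, feat):                  # feat: (B, C, H, W)
        score = self.fc(feat).flatten(2)      # (B, K, H*W) class score maps
        base = score.mean(dim=2)              # global average pooling
        attn = torch.softmax(score * self.temp, dim=2)
        spatial = (score * attn).sum(dim=2)   # class-specific spatial pooling
        return base + self.lam * spatial      # (B, K) logits
```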

Result: Achieves 91.75% mAP on RescueNet dataset (4,494 UAV images), outperforming ResNet, Res2Net, VGG, MobileNet, EfficientNet, and ViT. With 8 attention heads, improves to 92.35% mAP with over 6% boost for challenging classes like Road Blocked.

Conclusion: MCANet effectively localizes damage-relevant regions and provides interpretable results for disaster response applications. Future work could integrate knowledge graphs and multimodal LLMs for better adaptability and semantic understanding.

Abstract: Rapid and accurate post-hurricane damage assessment is vital for disaster response and recovery. Yet existing CNN-based methods struggle to capture multi-scale spatial features and to distinguish visually similar or co-occurring damage types. To address these issues, we propose MCANet, a multi-label classification framework that learns multi-scale representations and adaptively attends to spatially relevant regions for each damage category. MCANet employs a Res2Net-based hierarchical backbone to enrich spatial context across scales and a multi-head class-specific residual attention module to enhance discrimination. Each attention branch focuses on different spatial granularities, balancing local detail with global context. We evaluate MCANet on the RescueNet dataset of 4,494 UAV images collected after Hurricane Michael. MCANet achieves a mean average precision (mAP) of 91.75%, outperforming ResNet, Res2Net, VGG, MobileNet, EfficientNet, and ViT. With eight attention heads, performance further improves to 92.35%, boosting average precision for challenging classes such as Road Blocked by over 6%. Class activation mapping confirms MCANet’s ability to localize damage-relevant regions, supporting interpretability. Outputs from MCANet can inform post-disaster risk mapping, emergency routing, and digital twin-based disaster response. Future work could integrate disaster-specific knowledge graphs and multimodal large language models to improve adaptability to unseen disasters and enrich semantic understanding for real-world decision-making.

[138] Dynamic Group Detection using VLM-augmented Temporal Groupness Graph

Kaname Yokoyama, Chihiro Nakatani, Norimichi Ukita

Main category: cs.CV

TL;DR: Proposes dynamic human group detection in videos using Vision-Language Model features and global optimization across frames to handle changing groups, outperforming state-of-the-art methods.

DetailsMotivation: Existing group detection methods assume static groups, but real-world groups change dynamically. Both local appearance features and global context are needed to detect complex groups accurately.

Method: Uses Vision-Language Model (VLM) augmented for group detection to extract local and global features. Employs global optimization with graph-based approach using groupness probabilities from all frames to detect dynamically changing groups.

Result: Outperforms state-of-the-art group detection methods on public datasets.

Conclusion: The proposed method effectively detects dynamically changing human groups in videos by combining VLM features with temporal consistency through global optimization.

Abstract: This paper proposes dynamic human group detection in videos. For detecting complex groups, not only the local appearance features of in-group members but also the global context of the scene are important. Such local and global appearance features in each frame are extracted using a Vision-Language Model (VLM) augmented for group detection in our method. For further improvement, the group structure should be consistent over time. While previous methods gain stability by assuming that groups do not change within a video, our method detects dynamically changing groups by global optimization using a graph with all frames’ groupness probabilities estimated by our groupness-augmented CLIP features. Our experimental results demonstrate that our method outperforms state-of-the-art group detection methods on public datasets. Code: https://github.com/irajisamurai/VLM-GroupDetection.git

[139] MLP-SRGAN: A Single-Dimension Super Resolution GAN using MLP-Mixer

Samir Mitha, Seungho Choe, Pejman Jahbedar Maralani, Alan R. Moody, April Khademi

Main category: cs.CV

TL;DR: MLP-SRGAN is a novel single-dimension super resolution GAN that combines MLP-Mixers with convolutional layers for slice-direction upsampling in MRI, outperforming existing methods with sharper edges, less blurring, and better texture preservation while being more efficient.

DetailsMotivation: To address the challenge of low spatial resolution in the slice dimension of FLAIR MRI images from multicentre datasets, and to improve super resolution performance with better computational efficiency.

Method: Uses a hybrid architecture combining Multi-Layer Perceptron Mixers (MLP-Mixers) with convolutional layers for single-dimension upsampling. Trained on high-resolution FLAIR MRI from MSSEG2 dataset and validated on three multicentre datasets (CAIN, ADNI, CCNA).
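
For reference, a standard MLP-Mixer block of the kind such a generator could interleave with convolutions is sketched below; the dimensions are placeholders.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Standard MLP-Mixer block: token-mixing MLP across the slice/patch
    axis, then channel-mixing MLP (a sketch of the building block only)."""
    def __init__(self, num_tokens: int, dim: int, expansion: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, num_tokens * expansion),
            nn.GELU(),
            nn.Linear(num_tokens * expansion, num_tokens),
        )
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, dim * expansion),
            nn.GELU(),
            nn.Linear(dim * expansion, dim),
        )

    def forward(self, x):                     # x: (B, tokens, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))
```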

Result: MLP-SRGAN produces sharper edges, less blurring, preserves more texture and fine anatomical details, with fewer parameters, faster training/evaluation time, and smaller model size compared to state-of-the-art SR networks.

Conclusion: The proposed MLP-SRGAN architecture effectively improves super resolution performance for MRI slice upsampling while being more computationally efficient than existing methods, making it suitable for clinical applications with limited ground truth data.

Abstract: We propose a novel architecture called MLP-SRGAN, which is a single-dimension Super Resolution Generative Adversarial Network (SRGAN) that utilizes Multi-Layer Perceptron Mixers (MLP-Mixers) along with convolutional layers to upsample in the slice direction. MLP-SRGAN is trained and validated using high resolution (HR) FLAIR MRI from the MSSEG2 challenge dataset. The method was applied to three multicentre FLAIR datasets (CAIN, ADNI, CCNA) of images with low spatial resolution in the slice dimension to examine performance on held-out (unseen) clinical data. Upsampled results are compared to several state-of-the-art SR networks. For images with high resolution (HR) ground truths, peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) are used to measure upsampling performance. Several new structural, no-reference image quality metrics were proposed to quantify sharpness (edge strength), noise (entropy), and blurriness (low frequency information) in the absence of ground truths. Results show that MLP-SRGAN produces sharper edges, less blurring, and more texture and fine anatomical detail, with fewer parameters, faster training/evaluation time, and a smaller model size than existing methods. Code for MLP-SRGAN training and inference, data generators, models and no-reference image quality metrics will be available at https://github.com/IAMLAB-Ryerson/MLP-SRGAN.

[140] FloodVision: Urban Flood Depth Estimation Using Foundation Vision-Language Models and Domain Knowledge Graph

Zhangding Liu, Neda Mohammadi, John E. Taylor

Main category: cs.CV

TL;DR: FloodVision is a zero-shot framework that combines GPT-4o’s semantic reasoning with a domain knowledge graph to estimate floodwater depth from RGB images, achieving 8.17 cm MAE and outperforming previous methods.

DetailsMotivation: Existing computer vision methods for flood detection suffer from accuracy limitations and poor generalization due to dependence on fixed object detectors and task-specific training, making them unreliable for emergency response scenarios.

Method: Combines GPT-4o’s vision-language capabilities with a structured knowledge graph containing canonical dimensions of urban objects. Dynamically identifies reference objects, retrieves verified heights, estimates submergence ratios, and applies statistical outlier filtering.
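
The arithmetic core is simple: depth = canonical object height × submerged fraction, averaged over references after outlier filtering. A toy version, with hypothetical knowledge-graph entries and a z-score filter standing in for the paper's statistical filtering:

```python
import numpy as np

# Hypothetical knowledge-graph entries: canonical object heights in cm
CANONICAL_HEIGHT_CM = {"sedan_wheel": 66.0, "fire_hydrant": 75.0}

def flood_depth_cm(observations):
    """observations: list of (object_name, submerged_fraction in [0, 1]).
    Depth per reference = canonical height * submerged fraction;
    a z-score filter discards outlier references before averaging."""
    depths = np.array([CANONICAL_HEIGHT_CM[name] * frac
                       for name, frac in observations])
    if len(depths) > 2:                      # filter only when possible
        z = (depths - depths.mean()) / (depths.std() + 1e-6)
        depths = depths[np.abs(z) < 2.0]
    return float(depths.mean())

# e.g. a hydrant 40% submerged and a wheel 45% submerged -> ~29.9 cm
print(flood_depth_cm([("fire_hydrant", 0.4), ("sedan_wheel", 0.45)]))
```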

Result: Achieves mean absolute error of 8.17 cm on 110 crowdsourced images, reducing GPT-4o baseline error by 20.5% and surpassing prior CNN-based methods. Generalizes well across varying scenes and operates in near real-time.

Conclusion: FloodVision provides accurate, generalizable flood depth estimation suitable for integration into digital twin platforms and citizen-reporting apps, enhancing smart city flood resilience capabilities.

Abstract: Timely and accurate floodwater depth estimation is critical for road accessibility and emergency response. While recent computer vision methods have enabled flood detection, they suffer from both accuracy limitations and poor generalization due to dependence on fixed object detectors and task-specific training. To enable accurate depth estimation that can generalize across diverse flood scenarios, this paper presents FloodVision, a zero-shot framework that combines the semantic reasoning abilities of the foundation vision-language model GPT-4o with a structured domain knowledge graph. The knowledge graph encodes canonical real-world dimensions for common urban objects including vehicles, people, and infrastructure elements to ground the model’s reasoning in physical reality. FloodVision dynamically identifies visible reference objects in RGB images, retrieves verified heights from the knowledge graph to mitigate hallucination, estimates submergence ratios, and applies statistical outlier filtering to compute final depth values. Evaluated on 110 crowdsourced images from MyCoast New York, FloodVision achieves a mean absolute error of 8.17 cm, reducing the GPT-4o baseline of 10.28 cm by 20.5% and surpassing prior CNN-based methods. The system generalizes well across varying scenes and operates in near real-time, making it suitable for future integration into digital twin platforms and citizen-reporting apps for smart city flood resilience.

[141] Hybrid-Tower: Fine-grained Pseudo-query Interaction and Generation for Text-to-Video Retrieval

Bangxiang Lan, Ruobing Xie, Ruixiang Zhao, Xingwu Sun, Zhanhui Kang, Gang Yang, Xirong Li

Main category: cs.CV

TL;DR: Proposes PIG, a Hybrid-Tower framework for text-to-video retrieval that combines advantages of Two-Tower (efficiency) and Single-Tower (effectiveness) approaches through pseudo-query generation and fine-grained interaction.

DetailsMotivation: Existing CLIP-based T2VR approaches face trade-offs: Two-Tower framework has low effectiveness while Single-Tower framework suffers from low efficiency. Need a solution that achieves both high effectiveness and efficiency simultaneously.

Method: PIG (Fine-grained Pseudo-query Interaction and Generation) generates pseudo-queries for each video, enabling fine-grained video-text feature interaction before receiving real queries. Maintains Two-Tower efficiency during inference with no additional overhead.
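
One plausible way to realize pseudo-query interaction is cross-attention in both directions: learned tokens attend to video patches to form pseudo-queries, which then enrich the video features. The sketch below is an assumption about the mechanism, not the paper's exact module.

```python
import torch
import torch.nn as nn

class PseudoQueryInteraction(nn.Module):
    """Hypothetical sketch: pseudo-query tokens generated from the video
    attend back over video patch features, so fine-grained interaction
    happens offline, before any real text query arrives."""
    def __init__(self, dim: int, num_pseudo: int = 8, heads: int = 8):
        super().__init__()
        self.pseudo_tokens = nn.Parameter(torch.randn(num_pseudo, dim) * 0.02)
        self.gen = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.interact = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_feats):           # (B, N_patches, dim)
        b = video_feats.size(0)
        q = self.pseudo_tokens.unsqueeze(0).expand(b, -1, -1)
        pseudo_q, _ = self.gen(q, video_feats, video_feats)        # generate
        fused, _ = self.interact(video_feats, pseudo_q, pseudo_q)  # interact
        return fused                           # enriched video features
```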

Result: Achieves 1.6% to 3.9% improvement in R@1 across five text-video retrieval benchmarks. Matches Two-Tower efficiency while achieving near state-of-the-art performance.

Conclusion: The Hybrid-Tower framework successfully combines effectiveness and efficiency advantages, demonstrating PIG as a promising approach for text-to-video retrieval tasks.

Abstract: The Text-to-Video Retrieval (T2VR) task aims to retrieve unlabeled videos by textual queries with the same semantic meanings. Recent CLIP-based approaches have explored two frameworks: Two-Tower versus Single-Tower framework, yet the former suffers from low effectiveness, while the latter suffers from low efficiency. In this study, we explore a new Hybrid-Tower framework that can hybridize the advantages of the Two-Tower and Single-Tower framework, achieving high effectiveness and efficiency simultaneously. We propose a novel hybrid method, Fine-grained Pseudo-query Interaction and Generation for T2VR, i.e., PIG, which includes a new pseudo-query generator designed to generate a pseudo-query for each video. This enables the video feature and the textual features of the pseudo-query to interact in a fine-grained manner, similar to Single-Tower approaches, thereby achieving high effectiveness even before the real textual query is received. Simultaneously, our method introduces no additional storage or computational overhead compared to the Two-Tower framework during the inference stage, thus maintaining high efficiency. Extensive experiments on five commonly used text-video retrieval benchmarks demonstrate that our method achieves a significant improvement over the baseline, with an increase of $1.6\% \sim 3.9\%$ in R@1. Furthermore, our method matches the efficiency of Two-Tower models while achieving near state-of-the-art performance, highlighting the advantages of the Hybrid-Tower framework.

[142] Teacher-Student Model for Detecting and Classifying Mitosis in the MIDOG 2025 Challenge

Seungho Choe, Xiaoli Qin, Abubakr Shafique, Amanda Dy, Susan Done, Dimitrios Androutsos, April Khademi

Main category: cs.CV

TL;DR: AI-based mitosis detection using teacher-student UNet with domain generalization for robust segmentation and classification across different tissue domains.

DetailsMotivation: Manual mitosis counting is time-consuming and inconsistent between pathologists. AI solutions face challenges with domain shift (organ/species/staining variations) and severe class imbalance between mitoses and normal nuclei.

Method: Pixel-level segmentation approach using UNet backbone with domain generalization modules (contrastive learning and domain-adversarial training). Teacher-student strategy generates pseudo-masks for mitoses, hard negatives, and normal nuclei. Multi-scale CNN classifier leverages segmentation features in multi-task learning.
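
Two standard ingredients of such a teacher-student setup are an EMA weight update and confidence-thresholded pseudo-masks; a minimal sketch follows, where the momentum and threshold values are assumptions.

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, momentum=0.999):
    """Exponential-moving-average teacher update (a common choice for
    teacher-student training; the paper's exact schedule may differ)."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1 - momentum)

@torch.no_grad()
def pseudo_masks(teacher, images, thresh=0.8):
    """Pixel-level pseudo-masks from confident teacher predictions."""
    prob = teacher(images).softmax(dim=1)    # (B, K, H, W)
    conf, label = prob.max(dim=1)
    label[conf < thresh] = 255               # ignore-index for uncertain pixels
    return label
```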

Result: Achieved F1 score of 0.7660 for mitosis detection (Track 1) and balanced accuracy of 0.8414 for atypical mitosis classification (Track 2) on preliminary test set.

Conclusion: The integrated segmentation-based detection and classification framework effectively addresses domain shift and class imbalance, providing robust mitosis analysis with consistent performance across different domains.

Abstract: Counting mitotic figures is time-intensive for pathologists and leads to inter-observer variability. Artificial intelligence (AI) promises a solution by automatically detecting mitotic figures while maintaining decision consistency. However, AI tools are susceptible to domain shift, where a significant drop in performance can occur due to differences in the training and testing sets, including morphological diversity between organs, species, and variations in staining protocols. Furthermore, the number of mitoses is much smaller than the count of normal nuclei, which introduces severely imbalanced data for the detection task. In this work, we formulate mitosis detection as a pixel-level segmentation task and propose a teacher-student model that simultaneously addresses mitosis detection (Track 1) and atypical mitosis classification (Track 2). Our method is based on a UNet segmentation backbone that integrates domain generalization modules, namely contrastive representation learning and domain-adversarial training. A teacher-student strategy is employed to generate pixel-level pseudo-masks not only for annotated mitoses and hard negatives but also for normal nuclei, thereby enhancing feature discrimination and improving robustness against domain shift. For the classification task, we introduce a multi-scale CNN classifier that leverages feature maps from the segmentation model within a multi-task learning paradigm. On the preliminary test set, the algorithm achieved an F1 score of 0.7660 in Track 1 and balanced accuracy of 0.8414 in Track 2, demonstrating the effectiveness of integrating segmentation-based detection and classification into a unified framework for robust mitosis analysis.

[143] Comparative Evaluation of Traditional and Deep Learning Feature Matching Algorithms using Chandrayaan-2 Lunar Data

R. Makharia, J. G. Singla, Amitabh, N. Dube, H. Sharma

Main category: cs.CV

TL;DR: Evaluation of five feature matching algorithms for lunar image registration across different sensor modalities, showing SuperGlue (deep learning) outperforms classical methods, especially in challenging polar conditions.

DetailsMotivation: Accurate image registration is critical for lunar exploration tasks like surface mapping and resource localization, but aligning data from diverse lunar sensors (optical, hyperspectral, radar) is challenging due to resolution, illumination, and sensor distortion differences.

Method: Evaluated five feature matching algorithms (SIFT, ASIFT, AKAZE, RIFT2, and SuperGlue) using cross-modality image pairs from equatorial and polar regions with a preprocessing pipeline including georeferencing, resolution alignment, intensity normalization, and enhancements like adaptive histogram equalization, PCA, and shadow correction.
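
The classical side of this comparison is straightforward to reproduce with OpenCV: CLAHE normalization followed by SIFT or AKAZE matching with Lowe's ratio test, as sketched below (SuperGlue requires its own learned pipeline and is not shown).

```python
import cv2

def match_pair(img1_gray, img2_gray, detector="sift", ratio=0.75):
    """CLAHE-normalized SIFT/AKAZE matching with Lowe's ratio test
    (the classical baseline side of the comparison; parameter values
    are illustrative defaults)."""
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    img1, img2 = clahe.apply(img1_gray), clahe.apply(img2_gray)
    det = cv2.SIFT_create() if detector == "sift" else cv2.AKAZE_create()
    kp1, des1 = det.detectAndCompute(img1, None)
    kp2, des2 = det.detectAndCompute(img2, None)
    # SIFT descriptors are float (L2); AKAZE's default MLDB is binary (Hamming)
    norm = cv2.NORM_L2 if detector == "sift" else cv2.NORM_HAMMING
    knn = cv2.BFMatcher(norm).knnMatch(des1, des2, k=2)
    good = [m for m, n in knn if m.distance < ratio * n.distance]
    return kp1, kp2, good
```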

Result: SuperGlue consistently yielded the lowest root mean square error and fastest runtimes. Classical methods like SIFT and AKAZE performed well near the equator but degraded under polar lighting conditions.

Conclusion: The results highlight the importance of preprocessing and learning-based approaches for robust lunar image registration across diverse conditions, with deep learning methods like SuperGlue showing superior performance.

Abstract: Accurate image registration is critical for lunar exploration, enabling surface mapping, resource localization, and mission planning. Aligning data from diverse lunar sensors – optical (e.g., Orbital High Resolution Camera, Narrow and Wide Angle Cameras), hyperspectral (Imaging Infrared Spectrometer), and radar (e.g., Dual-Frequency Synthetic Aperture Radar, Selene/Kaguya mission) – is challenging due to differences in resolution, illumination, and sensor distortion. We evaluate five feature matching algorithms: SIFT, ASIFT, AKAZE, RIFT2, and SuperGlue (a deep learning-based matcher), using cross-modality image pairs from equatorial and polar regions. A preprocessing pipeline is proposed, including georeferencing, resolution alignment, intensity normalization, and enhancements like adaptive histogram equalization, principal component analysis, and shadow correction. SuperGlue consistently yields the lowest root mean square error and fastest runtimes. Classical methods such as SIFT and AKAZE perform well near the equator but degrade under polar lighting. The results highlight the importance of preprocessing and learning-based approaches for robust lunar image registration across diverse conditions.

[144] Toward Accessible Dermatology: Skin Lesion Classification Using Deep Learning Models on Mobile-Acquired Images

Asif Newaz, Masum Mushfiq Ishti, A Z M Ashraful Azam, Asif Ur Rahman Adib

Main category: cs.CV

TL;DR: Transformer-based models, especially Swin Transformer, outperform CNNs for mobile-acquired skin disease classification across 50+ disease categories, with Grad-CAM providing interpretable predictions for real-world applications.

DetailsMotivation: Skin disease diagnosis is costly and inaccessible in low-resource settings, and existing automated methods are limited to dermoscopic datasets with narrow disease ranges, creating a need for more practical mobile-based solutions.

Method: Curated a large mobile-captured dataset of 50+ skin disease categories, evaluated multiple CNN and Transformer architectures, and incorporated Grad-CAM for interpretability and clinical relevance visualization.

Result: Transformer models, particularly Swin Transformer, achieved superior performance by effectively capturing global contextual features, demonstrating better classification accuracy than traditional CNNs.

Conclusion: Transformer-based approaches show strong potential for accessible AI-assisted dermatological screening in resource-limited environments, with interpretable predictions that can support early diagnosis using mobile devices.

Abstract: Skin diseases are among the most prevalent health concerns worldwide, yet conventional diagnostic methods are often costly, complex, and unavailable in low-resource settings. Automated classification using deep learning has emerged as a promising alternative, but existing studies are mostly limited to dermoscopic datasets and a narrow range of disease classes. In this work, we curate a large dataset of over 50 skin disease categories captured with mobile devices, making it more representative of real-world conditions. We evaluate multiple convolutional neural networks and Transformer-based architectures, demonstrating that Transformer models, particularly the Swin Transformer, achieve superior performance by effectively capturing global contextual features. To enhance interpretability, we incorporate Gradient-weighted Class Activation Mapping (Grad-CAM), which highlights clinically relevant regions and provides transparency in model predictions. Our results underscore the potential of Transformer-based approaches for mobile-acquired skin lesion classification, paving the way toward accessible AI-assisted dermatological screening and early diagnosis in resource-limited environments.

[145] Extracting Uncertainty Estimates from Mixtures of Experts for Semantic Segmentation

Svetlana Pavlitska, Beyza Keskin, Alwin Faßbender, Christian Hubschneider, J. Marius Zöllner

Main category: cs.CV

TL;DR: Mixture of Experts (MoE) provides well-calibrated predictive uncertainty estimates for semantic segmentation without architectural changes, outperforming ensemble methods especially on out-of-distribution data.

DetailsMotivation: Enhancing reliability of computer vision models for safety-critical applications like traffic scene perception by improving uncertainty estimation accuracy and calibration.

Method: Used MoE with gating network to dynamically weight expert predictions. Investigated three uncertainty extraction methods: predictive entropy, mutual information, and expert variance. Evaluated on A2D2 dataset with semantic split and Cityscapes dataset.
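
The three uncertainty measures can be computed directly from the gate weights and per-expert class probabilities; here is a sketch of all three.

```python
import torch

def moe_uncertainties(gate_w, expert_probs, eps=1e-8):
    """gate_w: (B, E) gating weights; expert_probs: (B, E, K, H, W)
    per-expert class probabilities. Returns per-pixel predictive entropy,
    mutual information, and expert variance (a sketch of the three metrics)."""
    mix = (gate_w[:, :, None, None, None] * expert_probs).sum(dim=1)  # (B,K,H,W)
    pred_entropy = -(mix * (mix + eps).log()).sum(dim=1)              # total
    exp_entropy = -(expert_probs * (expert_probs + eps).log()).sum(dim=2)
    expected_entropy = (gate_w[:, :, None, None] * exp_entropy).sum(dim=1)
    mutual_info = pred_entropy - expected_entropy                     # epistemic
    variance = expert_probs.var(dim=1).mean(dim=1)                    # per pixel
    return pred_entropy, mutual_info, variance
```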

Result: MoEs yield more reliable uncertainty estimates than ensembles in conditional correctness metrics under OOD data. Simple gating mechanisms provide better calibration than complex classwise gates. Increasing number of experts further enhances uncertainty calibration.

Conclusion: MoEs offer an efficient and effective alternative to ensembles for uncertainty quantification in semantic segmentation, with well-calibrated estimates that improve reliability for safety-critical applications.

Abstract: Estimating accurate and well-calibrated predictive uncertainty is important for enhancing the reliability of computer vision models, especially in safety-critical applications like traffic scene perception. While ensemble methods are commonly used to quantify uncertainty by combining multiple models, a mixture of experts (MoE) offers an efficient alternative by leveraging a gating network to dynamically weight expert predictions based on the input. Building on the promising use of MoEs for semantic segmentation in our previous works, we show that well-calibrated predictive uncertainty estimates can be extracted from MoEs without architectural modifications. We investigate three methods to extract predictive uncertainty estimates: predictive entropy, mutual information, and expert variance. We evaluate these methods for an MoE with two experts trained on a semantical split of the A2D2 dataset. Our results show that MoEs yield more reliable uncertainty estimates than ensembles in terms of conditional correctness metrics under out-of-distribution (OOD) data. Additionally, we evaluate routing uncertainty computed via gate entropy and find that simple gating mechanisms lead to better calibration of routing uncertainty estimates than more complex classwise gates. Finally, our experiments on the Cityscapes dataset suggest that increasing the number of experts can further enhance uncertainty calibration. Our code is available at https://github.com/KASTEL-MobilityLab/mixtures-of-experts/.

[146] Exploring Non-Local Spatial-Angular Correlations with a Hybrid Mamba-Transformer Framework for Light Field Super-Resolution

Haosong Liu, Xiancheng Zhu, Huanqiang Zeng, Jianqing Zhu, Jiuwen Cao, Junhui Hou

Main category: cs.CV

TL;DR: Proposed LFMT framework combines Mamba and Transformer for light field super-resolution, using Subspace Simple Scanning strategy and dual-stage modeling to efficiently extract spatial-angular features while maintaining low computational complexity.

DetailsMotivation: Current Mamba-based methods have inefficient and redundant feature extraction for complex light field data, and state space models struggle to preserve spatial-angular and disparity information effectively.

Method: Introduces Subspace Simple Scanning strategy and Subspace Simple Mamba Block for efficient feature extraction. Uses dual-stage modeling: Stage I with Spatial-Angular Residual Subspace Mamba Block for shallow features, Stage II with dual-branch parallel structure combining Epipolar Plane Mamba Block and Epipolar Plane Transformer Block for deep refinement.

Result: LFMT significantly outperforms current state-of-the-art methods in light field super-resolution, achieving substantial performance improvements while maintaining low computational complexity on both real-world and synthetic datasets.

Conclusion: The hybrid Mamba-Transformer framework successfully integrates strengths of both models for comprehensive information exploration across spatial, angular, and epipolar-plane domains, providing an effective solution for light field super-resolution.

Abstract: Recently, Mamba-based methods, with their advantages in long-range information modeling and linear complexity, have shown great potential in optimizing both the computational cost and performance of light field image super-resolution (LFSR). However, current multi-directional scanning strategies lead to inefficient and redundant feature extraction when applied to complex LF data. To overcome this challenge, we propose a Subspace Simple Scanning (Sub-SS) strategy, based on which we design the Subspace Simple Mamba Block (SSMB) to achieve more efficient and precise feature extraction. Furthermore, we propose a dual-stage modeling strategy to address the limitation of state space in preserving spatial-angular and disparity information, thereby enabling a more comprehensive exploration of non-local spatial-angular correlations. Specifically, in stage I, we introduce the Spatial-Angular Residual Subspace Mamba Block (SA-RSMB) for shallow spatial-angular feature extraction; in stage II, we use a dual-branch parallel structure combining the Epipolar Plane Mamba Block (EPMB) and Epipolar Plane Transformer Block (EPTB) for deep epipolar feature refinement. Building upon meticulously designed modules and strategies, we introduce a hybrid Mamba-Transformer framework, termed LFMT. LFMT integrates the strengths of Mamba and Transformer models for LFSR, enabling comprehensive information exploration across spatial, angular, and epipolar-plane domains. Experimental results demonstrate that LFMT significantly outperforms current state-of-the-art methods in LFSR, achieving substantial improvements in performance while maintaining low computational complexity on both real-world and synthetic LF datasets.

[147] PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination

Ming Dai, Wenxuan Cheng, Jiedong Zhuang, Jiang-jiang Liu, Hongshen Zhao, Zhenhua Feng, Wankou Yang

Main category: cs.CV

TL;DR: PropVG is an end-to-end proposal-based visual grounding framework that integrates foreground object proposal generation with referential object comprehension, using contrastive learning and multi-granularity discrimination to improve object identification in complex scenarios.

DetailsMotivation: Current end-to-end visual grounding methods rely only on referred targets for supervision and lack multi-granularity discrimination, limiting their robustness in complex scenarios. The authors aim to address these limitations by leveraging prospective targets and incorporating multi-level discrimination.

Method: Proposed PropVG framework with two key modules: 1) Contrastive-based Refer Scoring (CRS) module using contrastive learning at sentence and word levels to enhance object understanding, and 2) Multi-granularity Target Discrimination (MTD) module that fuses object- and semantic-level information for better absent target recognition.

Result: Extensive experiments on multiple benchmarks including gRefCOCO, Ref-ZOM, R-RefCOCO, and RefCOCO demonstrate the effectiveness of PropVG, showing improved performance in visual grounding tasks.

Conclusion: PropVG successfully addresses limitations of existing methods by integrating proposal generation with referential comprehension and incorporating multi-granularity discrimination, achieving state-of-the-art performance on various visual grounding benchmarks.

Abstract: Recent advances in visual grounding have largely shifted away from traditional proposal-based two-stage frameworks due to their inefficiency and high computational complexity, favoring end-to-end direct reference paradigms. However, these methods rely exclusively on the referred target for supervision, overlooking the potential benefits of prominent prospective targets. Moreover, existing approaches often fail to incorporate multi-granularity discrimination, which is crucial for robust object identification in complex scenarios. To address these limitations, we propose PropVG, an end-to-end proposal-based framework that, to the best of our knowledge, is the first to seamlessly integrate foreground object proposal generation with referential object comprehension without requiring additional detectors. Furthermore, we introduce a Contrastive-based Refer Scoring (CRS) module, which employs contrastive learning at both sentence and word levels to enhance the capability in understanding and distinguishing referred objects. Additionally, we design a Multi-granularity Target Discrimination (MTD) module that fuses object- and semantic-level information to improve the recognition of absent targets. Extensive experiments on gRefCOCO (GREC/GRES), Ref-ZOM, R-RefCOCO, and RefCOCO (REC/RES) benchmarks demonstrate the effectiveness of PropVG. The codes and models are available at https://github.com/Dmmm1997/PropVG.

[148] TemporalFlowViz: Parameter-Aware Visual Analytics for Interpreting Scramjet Combustion Evolution

Yifei Jia, Shiyu Cheng, Yu Dong, Guan Li, Dong Tian, Ruixiao Peng, Xuyi Lu, Yu Wang, Wei Yao, Guihua Shan

Main category: cs.CV

TL;DR: TemporalFlowViz is a visual analytics system that uses AI and expert annotations to analyze complex temporal flow field data from scramjet combustion simulations, enabling better pattern discovery and interpretation.

DetailsMotivation: The large scale and high dimensionality of simulation-generated temporal flow field data from scramjet engines present significant challenges for visual interpretation, feature differentiation, and cross-case comparison, making it difficult for experts to analyze complex combustion dynamics.

Method: Uses pretrained Vision Transformers to extract embeddings from flow field images, applies dimensionality reduction and density-based clustering to uncover combustion modes, constructs temporal trajectories, and employs expert annotations with vision-language models to generate natural-language summaries.
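
The analytics backbone reduces to a short pipeline: embed frames, project, cluster, and trace trajectories. A skeleton using scikit-learn is sketched below; PCA and DBSCAN stand in for whichever reduction and clustering methods the system actually uses, and the ViT embeddings are assumed precomputed.

```python
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

def cluster_and_trace(embeddings, eps=0.5, min_samples=10):
    """embeddings: (n_cases, n_timesteps, d) precomputed ViT features.
    Project all frames to 2-D, cluster into candidate combustion modes,
    and return one 2-D trajectory per simulation case (pipeline sketch)."""
    n_cases, n_t, d = embeddings.shape
    flat = embeddings.reshape(-1, d)
    xy = PCA(n_components=2).fit_transform(flat)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(xy)
    return xy.reshape(n_cases, n_t, 2), labels.reshape(n_cases, n_t)
```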

Result: The system effectively enhances hypothesis generation, supports interpretable pattern discovery, and improves knowledge discovery in large-scale scramjet combustion analysis, as demonstrated through expert-informed case studies and feedback.

Conclusion: TemporalFlowViz successfully bridges the gap between latent AI representations and expert reasoning, providing a powerful workflow for analyzing complex temporal combustion data and advancing high-speed propulsion technologies.

Abstract: Understanding the complex combustion dynamics within scramjet engines is critical for advancing high-speed propulsion technologies. However, the large scale and high dimensionality of simulation-generated temporal flow field data present significant challenges for visual interpretation, feature differentiation, and cross-case comparison. In this paper, we present TemporalFlowViz, a parameter-aware visual analytics workflow and system designed to support expert-driven clustering, visualization, and interpretation of temporal flow fields from scramjet combustion simulations. Our approach leverages hundreds of simulated combustion cases with varying initial conditions, each producing time-sequenced flow field images. We use pretrained Vision Transformers to extract high-dimensional embeddings from these frames, apply dimensionality reduction and density-based clustering to uncover latent combustion modes, and construct temporal trajectories in the embedding space to track the evolution of each simulation over time. To bridge the gap between latent representations and expert reasoning, domain specialists annotate representative cluster centroids with descriptive labels. These annotations are used as contextual prompts for a vision-language model, which generates natural-language summaries for individual frames and full simulation cases. The system also supports parameter-based filtering, similarity-based case retrieval, and coordinated multi-view exploration to facilitate in-depth analysis. We demonstrate the effectiveness of TemporalFlowViz through two expert-informed case studies and expert feedback, showing that TemporalFlowViz enhances hypothesis generation, supports interpretable pattern discovery, and improves knowledge discovery in large-scale scramjet combustion analysis.

[149] Pose-Free 3D Quantitative Phase Imaging of Flowing Cellular Populations

Enze Ye, Wei Lin, Shaochi Ren, Yakun Liu, Xiaoping Li, Hao Wang, He Sun, Feng Pan

Main category: cs.CV

TL;DR: OmniFHT enables high-throughput 3D quantitative phase imaging of flowing cells without requiring known poses, supporting arbitrary cell geometries and complex rotations through Fourier diffraction theorem and implicit neural representations.

DetailsMotivation: Current 3D QPI methods assume uniform single-axis rotation and known cell poses, limiting applicability to near-spherical cells and preventing accurate imaging of irregularly shaped cells with complex rotations, which restricts statistical analysis capabilities.

Method: Uses Fourier diffraction theorem and implicit neural representations (INRs) to jointly optimize each cell’s unknown rotational trajectory and volumetric structure under weak scattering assumptions, enabling reconstruction from sparse projections and limited angular coverage.

Result: Produces high-fidelity 3D refractive index reconstructions with as few as 10 views or only 120 degrees of angular range, supporting arbitrary cell geometries and multi-axis rotations.

Conclusion: OmniFHT provides the first scalable and unbiased solution for in situ, high-throughput tomographic imaging of entire flowing cell populations, enabling label-free morphometric analysis in flow cytometry platforms.

Abstract: High-throughput 3D quantitative phase imaging (QPI) in flow cytometry enables label-free, volumetric characterization of individual cells by reconstructing their refractive index (RI) distributions from multiple viewing angles during flow through microfluidic channels. However, current imaging methods assume that cells undergo uniform, single-axis rotation, which requires their poses to be known at each frame. This assumption restricts applicability to near-spherical cells and prevents accurate imaging of irregularly shaped cells with complex rotations. As a result, only a subset of the cellular population can be analyzed, limiting the ability of flow-based assays to perform robust statistical analysis. We introduce OmniFHT, a pose-free 3D RI reconstruction framework that leverages the Fourier diffraction theorem and implicit neural representations (INRs) for high-throughput flow cytometry tomographic imaging. By jointly optimizing each cell’s unknown rotational trajectory and volumetric structure under weak scattering assumptions, OmniFHT supports arbitrary cell geometries and multi-axis rotations. Its continuous representation also allows accurate reconstruction from sparsely sampled projections and restricted angular coverage, producing high-fidelity results with as few as 10 views or only 120 degrees of angular range. OmniFHT enables, for the first time, in situ, high-throughput tomographic imaging of entire flowing cell populations, providing a scalable and unbiased solution for label-free morphometric analysis in flow cytometry platforms.

[150] CoRe-GS: Coarse-to-Refined Gaussian Splatting with Semantic Object Focus

Hannah Schieber, Dominik Frischmann, Simon Boche, Victor Schaack, Angela Schoellig, Stefan Leutenegger, Daniel Roth

Main category: cs.CV

TL;DR: CoRe-GS accelerates 3D reconstruction by focusing on points of interest using semantic Gaussian Splatting with color-based filtering, reducing training time by 25% while maintaining quality.

DetailsMotivation: Mobile robotics applications like tele-guidance and disaster response require fast and accurate 3D reconstruction, particularly for specific objects rather than entire scenes.

Method: Uses semantic Gaussian Splatting to generate coarse segmentation-ready scenes, then applies novel color-based effective filtering to isolate and refine semantic objects before full training completion.
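
The filtering step itself can be as simple as thresholding each splat's color distance to the point of interest's target color; the sketch below illustrates that idea, though the paper's criterion is likely richer.

```python
import torch

def filter_splats_by_color(splat_colors, target_color, max_dist=0.15):
    """splat_colors: (N, 3) per-Gaussian RGB in [0, 1]. Keep splats whose
    color lies within max_dist of the semantic target color (a sketch of
    color-based object isolation; threshold is an assumption)."""
    dist = torch.linalg.norm(splat_colors - target_color[None, :], dim=1)
    return dist < max_dist          # boolean mask over the splat set

# e.g. isolate reddish PoI splats:
# mask = filter_splats_by_color(colors, torch.tensor([0.8, 0.1, 0.1]))
```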

Result: Achieves about 25% reduction in training time compared to full semantic GS training, with higher novel-view-synthesis quality on both real-world outdoor and synthetic indoor datasets.

Conclusion: CoRe-GS effectively balances reconstruction quality and training efficiency by focusing computational resources on semantically relevant areas, making it suitable for time-critical mobile robotics applications.

Abstract: Mobile reconstruction for autonomous aerial robotics holds strong potential for critical applications such as tele-guidance and disaster response. These tasks demand both accurate 3D reconstruction and fast scene processing. Instead of reconstructing the entire scene in detail, it is often more efficient to focus on specific objects, i.e., points of interest (PoIs). Mobile robots equipped with advanced sensing can usually detect these early during data acquisition or preliminary analysis, reducing the need for full-scene optimization. Gaussian Splatting (GS) has recently shown promise in delivering high-quality novel view synthesis and 3D representation through an incremental learning process. Extending GS with scene editing and semantics adds useful per-splat features to isolate objects effectively. Semantic 3D Gaussian editing can already be achieved before the full training cycle is completed, reducing the overall training time. Moreover, the semantically relevant area, the PoI, is usually already known during capturing. To balance high-quality reconstruction with reduced training time, we propose CoRe-GS. We first generate a coarse segmentation-ready scene with semantic GS and then refine the semantic object using our novel color-based filtering for effective object isolation. This shortens training to roughly three-quarters of a full semantic GS training cycle. We evaluate our approach on two datasets, SCRREAM (real-world, outdoor) and NeRDS 360 (synthetic, indoor), showing reduced runtime and higher novel-view-synthesis quality.

[151] Cryo-RL: automating prostate cancer cryoablation planning with reinforcement learning

Trixia Simangan, Ahmed Nadeem Abbasi, Yipeng Hu, Shaheer U. Saeed

Main category: cs.CV

TL;DR: Cryo-RL is a reinforcement learning framework that automates cryoablation planning for prostate cancer, achieving human-expert level performance with 8% Dice improvement over automated baselines and significantly reduced planning time.

DetailsMotivation: Current cryoablation planning is manual, expertise-dependent, time-consuming, and leads to treatment variability and limited scalability, requiring an automated solution.

Method: Reinforcement learning framework modeling cryoablation planning as Markov decision process, where an agent sequentially selects cryoprobe positions and ice sphere diameters in a simulated environment with clinical constraints.
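
To make the MDP formulation concrete, here is a toy gym-style environment where each action places one ice sphere and the reward is the gain in tumour coverage. Everything about it (geometry, reward, constraints) is a deliberate simplification, not the paper's simulator.

```python
import numpy as np

class CryoPlanEnv:
    """Toy MDP skeleton: state = current ice coverage, action = (x, y, z,
    diameter) of the next cryoprobe's ice sphere, reward = gain in tumour
    coverage. Purely illustrative."""
    def __init__(self, tumour_mask, max_probes=5):
        self.tumour = tumour_mask.astype(bool)   # 3-D boolean volume
        self.max_probes = max_probes
        self.reset()

    def reset(self):
        self.covered = np.zeros_like(self.tumour)
        self.n = 0
        return self.covered.copy()

    def step(self, action):
        x, y, z, diam = action
        zz, yy, xx = np.ogrid[:self.tumour.shape[0],
                              :self.tumour.shape[1],
                              :self.tumour.shape[2]]
        sphere = ((xx - x)**2 + (yy - y)**2 + (zz - z)**2) <= (diam / 2)**2
        before = (self.covered & self.tumour).sum()
        self.covered |= sphere
        after = (self.covered & self.tumour).sum()
        reward = (after - before) / max(self.tumour.sum(), 1)  # coverage gain
        self.n += 1
        done = self.n >= self.max_probes
        return self.covered.copy(), reward, done
```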

Result: Achieved over 8 percentage-point Dice improvements compared to best automated baselines, matched human expert performance, and required substantially less planning time on 583 retrospective prostate cancer cases.

Conclusion: Reinforcement learning shows potential to deliver clinically viable, reproducible, and efficient cryoablation plans for prostate cancer treatment.

Abstract: Cryoablation is a minimally invasive localised treatment for prostate cancer that destroys malignant tissue during thawing, while sparing surrounding healthy structures. Its success depends on accurate preoperative planning of cryoprobe placements to fully cover the tumour and avoid critical anatomy. This planning is currently manual, expertise-dependent, and time-consuming, leading to variability in treatment quality and limited scalability. In this work, we introduce Cryo-RL, a reinforcement learning framework that models cryoablation planning as a Markov decision process and learns an optimal policy for cryoprobe placement. Within a simulated environment that models clinical constraints and stochastic intraoperative variability, an agent sequentially selects cryoprobe positions and ice sphere diameters. Guided by a reward function based on tumour coverage, this agent learns a cryoablation strategy that leads to optimal cryoprobe placements without the need for any manually-designed plans. Evaluated on 583 retrospective prostate cancer cases, Cryo-RL achieved over 8 percentage-point Dice improvements compared with the best automated baselines, based on geometric optimisation, and matched human expert performance while requiring substantially less planning time. These results highlight the potential of reinforcement learning to deliver clinically viable, reproducible, and efficient cryoablation plans.

Dominik Pegler, David Steyrl, Mengfan Zhang, Alexander Karner, Jozsef Arato, Frank Scharnowski, Filip Melinscak

Main category: cs.CV

TL;DR: Pretrained computer vision models can predict human fear ratings from spider images with good accuracy (MAE 10.1-11.0), showing potential for adaptive exposure therapy systems.

DetailsMotivation: To enable computerized exposure therapy where visual stimuli can be dynamically adjusted based on patient responses, by investigating if computer vision models can accurately predict fear levels from images.

Method: Adapted three diverse pretrained computer vision models using transfer learning to predict human fear ratings (0-100 scale) from 313 spider images, evaluated with cross-validation, learning curve analysis, and explainability assessments.
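
A typical transfer-learning setup for this regression task replaces the classifier head with a single scaled output trained under MAE; the ResNet-50 choice below is illustrative only (the paper adapted three different architectures).

```python
import torch
import torch.nn as nn
from torchvision import models

def build_fear_regressor(finetune_backbone: bool = True):
    """Pretrained backbone adapted to regress a 0-100 fear rating
    (a transfer-learning sketch, not the paper's exact setup)."""
    net = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    for p in net.parameters():
        p.requires_grad = finetune_backbone
    # New head: single output squashed to [0, 1]
    net.fc = nn.Sequential(nn.Linear(net.fc.in_features, 1), nn.Sigmoid())
    return net

def mae_loss(pred, target):
    # predictions in [0, 1] scaled to the 0-100 rating scale
    return (pred.squeeze(1) * 100 - target).abs().mean()
```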

Result: Models achieved average MAE between 10.1-11.0. Learning curves showed performance degrades with smaller datasets but plateaus with larger ones. Models focused on spider-related features, with higher errors for distant views and artificial spiders.

Conclusion: Computer vision models show promise for predicting fear ratings in emotion-aware therapeutic technologies, with both model explainability and sufficient dataset size being critical factors for effectiveness.

Abstract: Advances in computer vision have opened new avenues for clinical applications, particularly in computerized exposure therapy where visual stimuli can be dynamically adjusted based on patient responses. As a critical step toward such adaptive systems, we investigated whether pretrained computer vision models can accurately predict fear levels from spider-related images. We adapted three diverse models using transfer learning to predict human fear ratings (on a 0-100 scale) from a standardized dataset of 313 images. The models were evaluated using cross-validation, achieving an average mean absolute error (MAE) between 10.1 and 11.0. Our learning curve analysis revealed that reducing the dataset size significantly harmed performance, though further increases yielded no substantial gains. Explainability assessments showed the models’ predictions were based on spider-related features. A category-wise error analysis further identified visual conditions associated with higher errors (e.g., distant views and artificial/painted spiders). These findings demonstrate the potential of explainable computer vision models in predicting fear ratings, highlighting the importance of both model explainability and a sufficient dataset size for developing effective emotion-aware therapeutic technologies.

[153] SynGen-Vision: Synthetic Data Generation for training industrial vision models

Alpana Dubey, Suma Mani Kuriakose, Nitish Bhardwaj

Main category: cs.CV

TL;DR: Synthetic data generation approach using vision language models and 3D simulation for industrial wear detection, achieving 0.87 mAP50 on rust detection tasks.

DetailsMotivation: Data curation for industrial wear and tear detection is expensive and time-consuming due to unavailability of datasets for different scenarios, making synthetic data generation necessary.

Method: Uses vision language model with 3D simulation and rendering engine to generate synthetic data for varying rust conditions, then trains computer vision models on this synthetic dataset.

Result: Model trained with synthetic data outperforms other approaches with mAP50 score of 0.87 when tested on real images of rusted industrial objects.

Conclusion: The approach is effective for industrial wear detection and is customizable for extending to other wear and tear detection scenarios beyond rust.

Abstract: We propose an approach to generate synthetic data to train computer vision (CV) models for industrial wear and tear detection. Wear and tear detection is an important CV problem for predictive maintenance tasks in any industry. However, data curation for training such models is expensive and time-consuming due to the unavailability of datasets for different wear and tear scenarios. Our approach employs a vision language model along with a 3D simulation and rendering engine to generate synthetic data for varying rust conditions. We evaluate our approach by training a CV model for rust detection using the generated dataset and testing the trained model on real images of rusted industrial objects. The model trained with the synthetic data generated by our approach outperforms the other approaches with a mAP50 score of 0.87. The approach is customizable and can be easily extended to other industrial wear and tear detection scenarios.

[154] Evaluating Multiple Instance Learning Strategies for Automated Sebocyte Droplet Counting

Maryam Adelipour, Gustavo Carneiro, Jeongkwon Kim

Main category: cs.CV

TL;DR: Simple attention-based MIL framework for automated sebocyte lipid droplet counting shows baseline MLP with bag-level aggregation outperforms attention-based MIL in stability and accuracy.

DetailsMotivation: Manual counting of sebocyte lipid droplets is labor-intensive and subjective, motivating the need for automated solutions in sebocyte biology research.

Method: Used Nile Red-stained sebocyte images annotated into 14 classes by droplet counts, expanded to ~50,000 cells via augmentation. Benchmarked baseline MLP trained on aggregated patch-level counts against attention-based MIL model using ResNet-50 features with instance weighting, evaluated with five-fold cross-validation.

Result: Baseline MLP achieved more stable performance (mean MAE = 5.6) compared to attention-based MIL (mean MAE = 10.7), though MIL occasionally performed better in specific folds. Simple bag-level aggregation proved more robust for slide-level droplet counting.

Conclusion: Simple bag-level aggregation provides a robust baseline for sebocyte droplet counting, while attention-based MIL requires task-aligned pooling and regularization to fully realize its potential in sebocyte image analysis.

Abstract: Sebocytes are lipid-secreting cells whose differentiation is marked by the accumulation of intracellular lipid droplets, making their quantification a key readout in sebocyte biology. Manual counting is labor-intensive and subjective, motivating automated solutions. Here, we introduce a simple attention-based multiple instance learning (MIL) framework for sebocyte image analysis. Nile Red-stained sebocyte images were annotated into 14 classes according to droplet counts, expanded via data augmentation to about 50,000 cells. Two models were benchmarked: a baseline multi-layer perceptron (MLP) trained on aggregated patch-level counts, and an attention-based MIL model leveraging ResNet-50 features with instance weighting. Experiments using five-fold cross-validation showed that the baseline MLP achieved more stable performance (mean MAE = 5.6) compared with the attention-based MIL, which was less consistent (mean MAE = 10.7) but occasionally superior in specific folds. These findings indicate that simple bag-level aggregation provides a robust baseline for slide-level droplet counting, while attention-based MIL requires task-aligned pooling and regularization to fully realize its potential in sebocyte image analysis.
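
For readers unfamiliar with attention-based MIL, the sketch below shows a generic attention pooling layer in the spirit of Ilse et al., with the 2048-d ResNet-50 features and 14 droplet-count classes taken from the summary; the authors' exact architecture may differ.

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Attention pooling: score each instance, softmax the scores into
    weights, and classify the weighted bag embedding."""
    def __init__(self, feat_dim=2048, attn_dim=128, n_classes=14):
        super().__init__()
        # ResNet-50 features are 2048-d, matching the summary above.
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, instances):           # (n_instances, feat_dim)
        scores = self.attention(instances)  # (n_instances, 1)
        weights = torch.softmax(scores, dim=0)
        bag = (weights * instances).sum(dim=0)  # weighted bag embedding
        return self.classifier(bag), weights.squeeze(1)

bag = torch.randn(37, 2048)  # one bag of 37 cell-patch embeddings
logits, attn = AttentionMILPooling()(bag)
print(logits.shape, attn.shape)  # torch.Size([14]) torch.Size([37])
```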

[155] UniView: Enhancing Novel View Synthesis From A Single Image By Unifying Reference Features

Haowang Cui, Rui Chen, Tao Luo, Rui Li, Jiaze Wang

Main category: cs.CV

TL;DR: UniView is a novel view synthesis model that uses reference images from similar objects to provide strong priors, addressing ambiguity issues in single-image view synthesis through retrieval systems, MLLM-assisted selection, and multi-level feature integration.

DetailsMotivation: Single-image novel view synthesis is highly ill-posed due to multiple explanations for unobserved areas. Current methods often generate severe distortions from ambiguity priors and interpolation near input views.

Method: Proposes UniView with retrieval and augmentation system using MLLM for reference image selection, plug-and-play adapter with multi-level isolation layers for dynamic reference feature generation, and decoupled triple attention mechanism for multi-branch feature alignment and integration.

Result: Extensive experiments demonstrate that UniView significantly improves novel view synthesis performance and outperforms state-of-the-art methods on challenging datasets.

Conclusion: Leveraging reference images from similar objects provides strong prior information that effectively addresses the limitations of single-image view synthesis, reducing distortions and improving synthesis quality.

Abstract: The task of synthesizing novel views from a single image is highly ill-posed due to multiple explanations for unobserved areas. Most current methods tend to generate unseen regions from ambiguity priors and interpolation near input views, which often lead to severe distortions. To address this limitation, we propose a novel model dubbed UniView, which can leverage reference images from a similar object to provide strong prior information during view synthesis. More specifically, we construct a retrieval and augmentation system and employ a multimodal large language model (MLLM) to assist in selecting reference images that meet our requirements. Additionally, a plug-and-play adapter module with multi-level isolation layers is introduced to dynamically generate reference features for the target views. Moreover, in order to preserve the details of an original input image, we design a decoupled triple attention mechanism, which can effectively align and integrate multi-branch features into the synthesis process. Extensive experiments have demonstrated that our UniView significantly improves novel view synthesis performance and outperforms state-of-the-art methods on challenging datasets.

[156] Dual-Domain Perspective on Degradation-Aware Fusion: A VLM-Guided Robust Infrared and Visible Image Fusion Framework

Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui

Main category: cs.CV

TL;DR: GD^2Fusion is a novel infrared-visible image fusion framework that handles dual-source degraded scenarios by integrating vision-language models for degradation perception with dual-domain frequency/spatial joint optimization, eliminating the need for manual pre-enhancement steps.

DetailsMotivation: Existing IVIF methods assume high-quality inputs and struggle with degraded scenarios, requiring manual selection and sequential application of multiple pre-enhancement steps which leads to error accumulation and performance degradation.

Method: Proposes Guided Dual-Domain Fusion (GD^2Fusion) with two modules: Guided Frequency Modality-Specific Extraction (GFMSE) for frequency-domain degradation perception/suppression and feature extraction, and Guided Spatial Modality-Aggregated Fusion (GSMAF) for cross-modal degradation filtering and adaptive feature aggregation in spatial domain.

Result: Extensive qualitative and quantitative experiments demonstrate superior fusion performance compared to existing algorithms and strategies in dual-source degraded scenarios.

Conclusion: GD^2Fusion effectively overcomes limitations of traditional decoupled approaches by synergistically integrating VLMs with dual-domain optimization, achieving better performance in handling degraded infrared-visible image fusion scenarios.

Abstract: Most existing infrared-visible image fusion (IVIF) methods assume high-quality inputs, and therefore struggle to handle dual-source degraded scenarios, typically requiring manual selection and sequential application of multiple pre-enhancement steps. This decoupled pre-enhancement-to-fusion pipeline inevitably leads to error accumulation and performance degradation. To overcome these limitations, we propose Guided Dual-Domain Fusion (GD^2Fusion), a novel framework that synergistically integrates vision-language models (VLMs) for degradation perception with dual-domain (frequency/spatial) joint optimization. Concretely, the designed Guided Frequency Modality-Specific Extraction (GFMSE) module performs frequency-domain degradation perception and suppression and discriminatively extracts fusion-relevant sub-band features. Meanwhile, the Guided Spatial Modality-Aggregated Fusion (GSMAF) module carries out cross-modal degradation filtering and adaptive multi-source feature aggregation in the spatial domain to enhance modality complementarity and structural consistency. Extensive qualitative and quantitative experiments demonstrate that GD^2Fusion achieves superior fusion performance compared with existing algorithms and strategies in dual-source degraded scenarios. The code will be publicly released after acceptance of this paper.
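
The frequency-domain branch presupposes some way of separating sub-bands. A generic low/high-frequency split with an FFT mask is sketched below; it assumes nothing about GFMSE beyond the idea of frequency decomposition, and the mask radius is arbitrary.

```python
import torch

def split_frequency_bands(img, radius=0.1):
    """Split an image into low/high-frequency parts with a circular mask
    in the shifted FFT spectrum. A generic building block, not the
    paper's GFMSE module."""
    fft = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    h, w = img.shape[-2:]
    yy, xx = torch.meshgrid(
        torch.linspace(-0.5, 0.5, h), torch.linspace(-0.5, 0.5, w),
        indexing="ij")
    low_mask = ((xx**2 + yy**2).sqrt() <= radius).to(fft.dtype)
    low = torch.fft.ifft2(
        torch.fft.ifftshift(fft * low_mask, dim=(-2, -1))).real
    high = img - low          # residual carries the high frequencies
    return low, high

ir = torch.randn(1, 1, 128, 128)   # stand-in infrared image
low, high = split_frequency_bands(ir)
print(low.shape, high.shape)
```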

[157] Interpretable Deep Transfer Learning for Breast Ultrasound Cancer Detection: A Multi-Dataset Study

Mohammad Abbadi, Yassine Himeur, Shadi Atalla, Wathiq Mansoor

Main category: cs.CV

TL;DR: Deep learning models, particularly ResNet-18, achieve superior performance (99.7% accuracy) for breast cancer classification in ultrasound images compared to classical ML methods, with Grad-CAM providing interpretability for clinical integration.

DetailsMotivation: Breast cancer is a leading cause of cancer mortality, and ultrasound imaging is crucial for early detection, especially in dense breast tissue. There's a need for automated, accurate AI-based diagnostic tools to assist clinicians.

Method: Evaluated classical ML models (SVM, KNN) and deep CNNs (ResNet-18, EfficientNet-B0, GoogLeNet) on multiple ultrasound datasets (BUSI, BUS-BRA, BrEaST-Lesions USG). Used deep feature extraction for classical models and Grad-CAM for visualization.

Result: ResNet-18 achieved the highest accuracy (99.7%) and perfect sensitivity for malignant lesions. Classical ML models performed competitively when enhanced with deep features. Grad-CAM provided transparent visualizations of diagnostically relevant regions.

Conclusion: AI-based diagnostic tools, particularly deep learning models, show high performance and interpretability for breast cancer detection in ultrasound images, supporting their integration into clinical workflows for improved early detection.

Abstract: Breast cancer remains a leading cause of cancer-related mortality among women worldwide. Ultrasound imaging, widely used due to its safety and cost-effectiveness, plays a key role in early detection, especially in patients with dense breast tissue. This paper presents a comprehensive study on the application of machine learning and deep learning techniques for breast cancer classification using ultrasound images. Using datasets such as BUSI, BUS-BRA, and BrEaST-Lesions USG, we evaluate classical machine learning models (SVM, KNN) and deep convolutional neural networks (ResNet-18, EfficientNet-B0, GoogLeNet). Experimental results show that ResNet-18 achieves the highest accuracy (99.7%) and perfect sensitivity for malignant lesions. Classical ML models, though outperformed by CNNs, achieve competitive performance when enhanced with deep feature extraction. Grad-CAM visualizations further improve model transparency by highlighting diagnostically relevant image regions. These findings support the integration of AI-based diagnostic tools into clinical workflows and demonstrate the feasibility of deploying high-performing, interpretable systems for ultrasound-based breast cancer detection.
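
Grad-CAM itself is a standard technique: weight the last convolutional feature maps by their spatially pooled gradients, sum, and rectify. A minimal sketch on a ResNet-18 follows (weights omitted so the snippet runs offline; the paper fine-tunes on ultrasound data).

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None).eval()
feats, grads = {}, {}

def fwd_hook(_, __, output):
    feats["a"] = output            # activations of the hooked layer

def bwd_hook(_, __, grad_output):
    grads["a"] = grad_output[0]    # gradients flowing back into it

layer = model.layer4               # last conv stage of ResNet-18
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

x = torch.randn(1, 3, 224, 224)    # stand-in ultrasound image
logits = model(x)
logits[0, logits.argmax()].backward()   # gradient of the top class

w = grads["a"].mean(dim=(2, 3), keepdim=True)   # channel importance
cam = F.relu((w * feats["a"]).sum(dim=1))       # weighted activation map
cam = F.interpolate(cam.unsqueeze(1), size=(224, 224), mode="bilinear")
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-9)
print(cam.shape)  # (1, 1, 224, 224) heatmap over the input
```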

[158] A biologically inspired separable learning vision model for real-time traffic object perception in Dark

Hulin Li, Qiliang Ren, Jun Li, Hanbing Wei, Zheng Liu, Linfang Fan

Main category: cs.CV

TL;DR: The paper introduces Dark-traffic, the largest dataset for low-light traffic scenes, and proposes SLVM - a biologically inspired vision model that achieves state-of-the-art performance in object detection, instance segmentation, and optical flow estimation with reduced computational overhead.

DetailsMotivation: Existing perception models struggle with low-light traffic scenes due to severe illumination degradation and lack of reliable visual cues. There is also no large-scale benchmark specifically for low-light traffic environments.

Method: Proposes Separable Learning Vision Model (SLVM) with four components: light-adaptive pupillary mechanism, feature-level separable learning strategy, task-specific decoupled branches, and spatial misalignment-aware fusion module. Also introduces Dark-traffic dataset using physically grounded illumination degradation method.

Result: SLVM outperforms RT-DETR by 11.2 percentage points in detection, YOLOv12 by 6.1 percentage points in instance segmentation, and reduces endpoint error by 12.37% on Dark-traffic. On the LIS benchmark, it surpasses Swin Transformer+EnlightenGAN and ConvNeXt-T+EnlightenGAN by an average of 11 percentage points.

Conclusion: The proposed SLVM framework effectively addresses low-light perception challenges in traffic scenes, achieving superior performance with reduced computational costs. The Dark-traffic dataset provides a valuable benchmark for future research in this domain.

Abstract: Fast and accurate object perception in low-light traffic scenes has attracted increasing attention. However, due to severe illumination degradation and the lack of reliable visual cues, existing perception models and methods struggle to quickly adapt to and accurately predict in low-light environments. Moreover, no large-scale benchmark specifically focused on low-light traffic scenes has been available. To bridge this gap, we introduce a physically grounded illumination degradation method tailored to real-world low-light settings and construct Dark-traffic, the largest densely annotated dataset to date for low-light traffic scenes, supporting object detection, instance segmentation, and optical flow estimation. We further propose the Separable Learning Vision Model (SLVM), a biologically inspired framework designed to enhance perception under adverse lighting. SLVM integrates four key components: a light-adaptive pupillary mechanism for illumination-sensitive feature extraction, a feature-level separable learning strategy for efficient representation, task-specific decoupled branches for multi-task separable learning, and a spatial misalignment-aware fusion module for precise multi-feature alignment. Extensive experiments demonstrate that SLVM achieves state-of-the-art performance with reduced computational overhead. Notably, it outperforms RT-DETR by 11.2 percentage points in detection, YOLOv12 by 6.1 percentage points in instance segmentation, and reduces the endpoint error (EPE) of the baseline by 12.37% on Dark-traffic. On the LIS benchmark, the end-to-end trained SLVM surpasses Swin Transformer+EnlightenGAN and ConvNeXt-T+EnlightenGAN by an average of 11 percentage points across key metrics, and exceeds Mask RCNN (with light enhancement) by 3.1 percentage points. The Dark-traffic dataset and complete code are released at https://github.com/alanli1997/slvm.

[159] Leveraging Transfer Learning and Mobile-enabled Convolutional Neural Networks for Improved Arabic Handwritten Character Recognition

Mohsine El Khayati, Ayyad Maafiri, Yassine Himeur, Hamzah Ali Alkhazaleh, Shadi Atalla, Wathiq Mansoor

Main category: cs.CV

TL;DR: Transfer learning with mobile convolutional neural networks improves Arabic handwritten character recognition, with MobileNet performing best overall and full fine-tuning being the most effective strategy.

DetailsMotivation: Address computational requirements and dataset scarcity challenges in Arabic Handwritten Character Recognition (AHCR) by leveraging transfer learning with lightweight mobile networks.

Method: Evaluated three transfer learning strategies (full fine-tuning, partial fine-tuning, training from scratch) using four mobile networks (MobileNet, SqueezeNet, MnasNet, ShuffleNet) on three Arabic character datasets (AHCD, HIJJA, IFHCDB).

Result: MobileNet achieved superior accuracy, robustness and efficiency. IFHCDB dataset yielded 99% accuracy with MnasNet, AHCD achieved 97% with ShuffleNet, HIJJA reached 92% with ShuffleNet. Full fine-tuning performed best overall.

Conclusion: Combining transfer learning with mobile networks enables resource-efficient Arabic handwritten character recognition, with potential for further optimizations through architectural modifications and data augmentation.

Abstract: The study explores the integration of transfer learning (TL) with mobile-enabled convolutional neural networks (MbNets) to enhance Arabic Handwritten Character Recognition (AHCR). Addressing challenges like extensive computational requirements and dataset scarcity, this research evaluates three TL strategies–full fine-tuning, partial fine-tuning, and training from scratch–using four lightweight MbNets: MobileNet, SqueezeNet, MnasNet, and ShuffleNet. Experiments were conducted on three benchmark datasets: AHCD, HIJJA, and IFHCDB. MobileNet emerged as the top-performing model, consistently achieving superior accuracy, robustness, and efficiency, with ShuffleNet excelling in generalization, particularly under full fine-tuning. The IFHCDB dataset yielded the highest results, with 99% accuracy using MnasNet under full fine-tuning, highlighting its suitability for robust character recognition. The AHCD dataset achieved competitive accuracy (97%) with ShuffleNet, while HIJJA posed significant challenges due to its variability, achieving a peak accuracy of 92% with ShuffleNet. Notably, full fine-tuning demonstrated the best overall performance, balancing accuracy and convergence speed, while partial fine-tuning underperformed across metrics. These findings underscore the potential of combining TL and MbNets for resource-efficient AHCR, paving the way for further optimizations and broader applications. Future work will explore architectural modifications, in-depth dataset feature analysis, data augmentation, and advanced sensitivity analysis to enhance model robustness and generalizability.
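
The three transfer-learning strategies reduce to which parameters are trainable. A sketch with torchvision's MobileNetV2 follows; the 28-class head is a placeholder (the Arabic alphabet has 28 base letters, but the datasets' class counts differ), and pretrained weights download on first use.

```python
import torch.nn as nn
from torchvision import models

def build_mobilenet(n_classes=28, strategy="full"):
    """Three TL strategies from the study: full fine-tuning, partial
    fine-tuning (frozen feature extractor), and training from scratch."""
    pretrained = strategy != "scratch"
    weights = models.MobileNet_V2_Weights.DEFAULT if pretrained else None
    model = models.mobilenet_v2(weights=weights)
    if strategy == "partial":
        for p in model.features.parameters():
            p.requires_grad = False   # only the classifier head trains
    # Replace the 1000-class head with the character classes.
    model.classifier[1] = nn.Linear(model.last_channel, n_classes)
    return model

model = build_mobilenet(strategy="partial")
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```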

[160] LUIVITON: Learned Universal Interoperable VIrtual Try-ON

Cong Cao, Xianhang Cheng, Jingyuan Liu, Yujian Zheng, Zhenhui Lin, Meriem Chkir, Hao Li

Main category: cs.CV

TL;DR: LUIVITON is an automated virtual try-on system that drapes multi-layer clothing onto diverse humanoid characters using SMPL proxy representation and dual correspondence tasks with geometric learning and diffusion models.

DetailsMotivation: To address the challenge of aligning complex garments with arbitrary body shapes and diverse humanoid characters for fully automated virtual try-on without manual intervention.

Method: Uses SMPL as proxy representation with two correspondence tasks: clothing-to-SMPL (geometric learning-based approach) and body-to-SMPL (diffusion model using multi-view consistent appearance features and pre-trained 2D foundation model).

Result: Handles complex geometries, non-manifold meshes, and generalizes to various humanoid characters including humans, robots, cartoons, creatures, and aliens while maintaining computational efficiency.

Conclusion: Produces high-quality 3D clothing fittings automatically without human labor or 2D sewing patterns, supporting fast customization of clothing size and material properties after draping.

Abstract: We present LUIVITON, an end-to-end system for fully automated virtual try-on, capable of draping complex, multi-layer clothing onto diverse and arbitrarily posed humanoid characters. To address the challenge of aligning complex garments with arbitrary and highly diverse body shapes, we use SMPL as a proxy representation and separate the clothing-to-body draping problem into two correspondence tasks: 1) clothing-to-SMPL and 2) body-to-SMPL correspondence, where each has its unique challenges. While we address the clothing-to-SMPL fitting problem using a geometric learning-based approach for partial-to-complete shape correspondence prediction, we introduce a diffusion model-based approach for body-to-SMPL correspondence using multi-view consistent appearance features and a pre-trained 2D foundation model. Our method can handle complex geometries, non-manifold meshes, and generalizes effectively to a wide range of humanoid characters – including humans, robots, cartoon subjects, creatures, and aliens, while maintaining computational efficiency for practical adoption. In addition to offering a fully automatic fitting solution, LUIVITON supports fast customization of clothing size, allowing users to adjust clothing sizes and material properties after they have been draped. We show that our system can produce high-quality 3D clothing fittings without any human labor, even when 2D clothing sewing patterns are not available.

[161] Towards Efficient Pixel Labeling for Industrial Anomaly Detection and Localization

Jingqi Wu, Hanxi Li, Lin Yuanbo Wu, Hao Chen, Deyin Liu, Peng Wang

Main category: cs.CV

TL;DR: ADClick is an interactive image segmentation tool that generates pixel-level anomaly annotations from minimal user input (clicks + text), enabling efficient labeling and boosting anomaly detection performance.

DetailsMotivation: Industrial anomaly detection typically requires pixel-level annotations of defective samples, which are costly and limit scalability. Current methods struggle to leverage defective samples without extensive manual labeling.

Method: Proposes ADClick (Interactive Image Segmentation algorithm) that uses few user clicks and brief text descriptions to generate precise anomaly annotations. Also introduces ADClick-Seg, a cross-modal framework that aligns visual features with textual prompts using prototype-based approach.

Result: Achieves AP = 96.1% on MVTec AD with ADClick, and state-of-the-art results with ADClick-Seg: AP = 80.0%, PRO = 97.5%, Pixel-AUROC = 99.1% on multi-class AD task.

Conclusion: The approach enables efficient pixel-level annotation generation, significantly improves anomaly detection performance, and provides a scalable solution for industrial inspection by combining visual and textual cues.

Abstract: Industrial product inspection is often performed using Anomaly Detection (AD) frameworks trained solely on non-defective samples. Although defective samples can be collected during production, leveraging them usually requires pixel-level annotations, limiting scalability. To address this, we propose ADClick, an Interactive Image Segmentation (IIS) algorithm for industrial anomaly detection. ADClick generates pixel-wise anomaly annotations from only a few user clicks and a brief textual description, enabling precise and efficient labeling that significantly improves AD model performance (e.g., AP = 96.1% on MVTec AD). We further introduce ADClick-Seg, a cross-modal framework that aligns visual features and textual prompts via a prototype-based approach for anomaly detection and localization. By combining pixel-level priors with language-guided cues, ADClick-Seg achieves state-of-the-art results on the challenging “Multi-class” AD task (AP = 80.0%, PRO = 97.5%, Pixel-AUROC = 99.1% on MVTec AD).
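
The summary does not specify how clicks enter the network, but a common encoding in interactive segmentation is to stack click heatmaps with the image. The sketch below shows that generic encoding, purely as an illustration of the input side of such systems.

```python
import torch

def clicks_to_maps(clicks, h, w, sigma=8.0):
    """Encode positive/negative clicks as Gaussian heatmaps, a common
    input encoding for interactive segmentation models."""
    yy, xx = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32), indexing="ij")
    maps = torch.zeros(2, h, w)
    for (y, x, positive) in clicks:
        g = torch.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma**2))
        ch = 0 if positive else 1
        maps[ch] = torch.maximum(maps[ch], g)
    return maps

image = torch.randn(3, 256, 256)
clicks = [(120, 140, True), (30, 40, False)]   # (row, col, is_positive)
net_input = torch.cat([image, clicks_to_maps(clicks, 256, 256)], dim=0)
print(net_input.shape)  # (5, 256, 256): RGB + 2 click channels
```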

[162] Systematic Review and Meta-analysis of AI-driven MRI Motion Artifact Detection and Correction

Mojtaba Safari, Zach Eidex, Richard L. J. Qiu, Matthew Goette, Tonghe Wang, Xiaofeng Yang

Main category: cs.CV

TL;DR: AI-driven deep learning methods, particularly generative models, show promise for detecting and correcting MRI motion artifacts but face challenges with generalizability and data requirements.

DetailsMotivation: To systematically review and assess the effectiveness of AI methods for addressing MRI motion artifacts, which degrade image quality and diagnostic accuracy.

Method: Comprehensive systematic review and meta-analysis focusing on deep learning approaches, especially generative models, with quantitative data extraction on datasets, architectures, and performance metrics.

Result: Deep learning generative models demonstrate potential for reducing motion artifacts and improving image quality, but face limitations including limited generalizability, reliance on paired training data, and risk of visual distortions.

Conclusion: AI-driven methods show significant potential for MRI quality improvement, but require standardized datasets, reporting protocols, and more advanced DL techniques to reduce data dependency and enhance diagnostic accuracy.

Abstract: Background: To systematically review and perform a meta-analysis of artificial intelligence (AI)-driven methods for detecting and correcting magnetic resonance imaging (MRI) motion artifacts, assessing current developments, effectiveness, challenges, and future research directions. Methods: A comprehensive systematic review and meta-analysis were conducted, focusing on deep learning (DL) approaches, particularly generative models, for the detection and correction of MRI motion artifacts. Quantitative data were extracted regarding utilized datasets, DL architectures, and performance metrics. Results: DL, particularly generative models, show promise for reducing motion artifacts and improving image quality; however, limited generalizability, reliance on paired training data, and risk of visual distortions remain key challenges that motivate standardized datasets and reporting. Conclusions: AI-driven methods, particularly DL generative models, show significant potential for improving MRI image quality by effectively addressing motion artifacts. However, critical challenges must be addressed, including the need for comprehensive public datasets, standardized reporting protocols for artifact levels, and more advanced, adaptable DL techniques to reduce reliance on extensive paired datasets. Addressing these aspects could substantially enhance MRI diagnostic accuracy, reduce healthcare costs, and improve patient care outcomes.

[163] GeoSplat: A Deep Dive into Geometry-Constrained Gaussian Splatting

Yangming Li, Chaoyu Liu, Lihao Liu, Simon Masnou, Carola-Bibiane Schönlieb

Main category: cs.CV

TL;DR: GeoSplat is a geometry-constrained optimization framework that uses first-order and second-order geometric priors to improve Gaussian splatting training, including better initialization, gradient updates, and densification with noise-robust estimation methods.

DetailsMotivation: Previous works used low-order geometric priors (like normal vectors) that were unreliably estimated by noise-sensitive methods, limiting the performance of Gaussian splatting optimization.

Method: The framework exploits both first-order and second-order geometric quantities, initializes Gaussian primitive scales using principal curvatures, and introduces efficient noise-robust estimation methods based on geometric structures like local manifolds.

Result: Extensive experiments on multiple datasets for novel view synthesis show that GeoSplat significantly improves Gaussian splatting performance and outperforms previous baselines.

Conclusion: GeoSplat provides a general geometry-constrained optimization framework that effectively addresses limitations of previous methods by incorporating robust geometric priors throughout the Gaussian splatting training pipeline.

Abstract: A few recent works explored incorporating geometric priors to regularize the optimization of Gaussian splatting, further improving its performance. However, those early studies mainly focused on the use of low-order geometric priors (e.g., normal vectors), which are also unreliably estimated by noise-sensitive methods such as local principal component analysis. To address these limitations, we first present GeoSplat, a general geometry-constrained optimization framework that exploits both first-order and second-order geometric quantities to improve the entire training pipeline of Gaussian splatting, including Gaussian initialization, gradient update, and densification. As an example, we initialize the scales of 3D Gaussian primitives in terms of principal curvatures, leading to a better coverage of the object surface than random initialization. Secondly, based on certain geometric structures (e.g., local manifolds), we introduce efficient and noise-robust estimation methods that provide dynamic geometric priors for our framework. We conduct extensive experiments on multiple datasets for novel view synthesis, showing that our framework, GeoSplat, significantly improves the performance of Gaussian splatting and outperforms previous baselines.

[164] Scale-Interaction Transformer: A Hybrid CNN-Transformer Model for Facial Beauty Prediction

Djamel Eddine Boukhari

Main category: cs.CV

TL;DR: A hybrid CNN-Transformer model called Scale-Interaction Transformer (SIT) achieves state-of-the-art performance in facial beauty prediction by explicitly modeling multi-scale feature interactions.

DetailsMotivation: Convolutional Neural Networks often process information at fixed scales and may overlook critical inter-dependencies between features at different levels of granularity in facial beauty prediction tasks.

Method: The SIT architecture combines CNNs for feature extraction with Transformers for relational modeling. It uses a multi-scale module with parallel convolutions to capture facial characteristics at varying receptive fields, then processes these representations as sequences through a Transformer encoder with self-attention.

Result: The model achieves a Pearson Correlation of 0.9187 on the SCUT-FBP5500 benchmark dataset, establishing a new state-of-the-art performance and outperforming previous methods.

Conclusion: Explicitly modeling the interplay between multi-scale visual cues is crucial for high-performance facial beauty prediction, and hybrid CNN-Transformer models show great potential for complex image regression tasks requiring holistic, context-aware understanding.

Abstract: Automated Facial Beauty Prediction (FBP) is a challenging computer vision task due to the complex interplay of local and global facial features that influence human perception. While Convolutional Neural Networks (CNNs) excel at feature extraction, they often process information at a fixed scale, potentially overlooking the critical inter-dependencies between features at different levels of granularity. To address this limitation, we introduce the Scale-Interaction Transformer (SIT), a novel hybrid deep learning architecture that synergizes the feature extraction power of CNNs with the relational modeling capabilities of Transformers. The SIT first employs a multi-scale module with parallel convolutions to capture facial characteristics at varying receptive fields. These multi-scale representations are then framed as a sequence and processed by a Transformer encoder, which explicitly models their interactions and contextual relationships via a self-attention mechanism. We conduct extensive experiments on the widely-used SCUT-FBP5500 benchmark dataset, where the proposed SIT model establishes a new state-of-the-art. It achieves a Pearson Correlation of 0.9187, outperforming previous methods. Our findings demonstrate that explicitly modeling the interplay between multi-scale visual cues is crucial for high-performance FBP. The success of the SIT architecture highlights the potential of hybrid CNN-Transformer models for complex image regression tasks that demand a holistic, context-aware understanding.
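
The core design is easy to express: parallel convolutions at several receptive fields produce one token per scale, and a Transformer encoder attends across those tokens. The sketch below follows that description with assumed dimensions (64-d tokens, kernels 3/5/7, two encoder layers); it is a schematic, not the published SIT.

```python
import torch
import torch.nn as nn

class ScaleInteractionBlock(nn.Module):
    """Parallel multi-scale convolutions whose pooled outputs become a
    sequence of 'scale tokens' for a Transformer encoder."""
    def __init__(self, in_ch=3, dim=64, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, dim, k, padding=k // 2) for k in kernel_sizes)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)   # beauty-score regression

    def forward(self, x):
        # Each branch -> global average pool -> one token per scale.
        tokens = torch.stack(
            [b(x).mean(dim=(2, 3)) for b in self.branches], dim=1)
        fused = self.encoder(tokens)    # self-attention across scales
        return self.head(fused.mean(dim=1)).squeeze(1)

scores = ScaleInteractionBlock()(torch.randn(2, 3, 224, 224))
print(scores.shape)  # torch.Size([2])
```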

[165] Robust Experts: the Effect of Adversarial Training on CNNs with Sparse Mixture-of-Experts Layers

Svetlana Pavlitska, Haixi Fan, Konstantin Ditschuneit, J. Marius Zöllner

Main category: cs.CV

TL;DR: Using sparse mixture-of-experts layers in CNNs improves adversarial robustness without increasing inference cost, with some individual experts outperforming the full MoE model due to specialized robust subpaths.

DetailsMotivation: Robustifying CNNs against adversarial attacks is challenging and often requires resource-intensive countermeasures. The paper explores using sparse MoE layers to improve robustness by increasing model capacity without additional inference costs.

Method: Replace selected residual blocks or convolutional layers with sparse mixture-of-experts layers in ResNet architectures trained on CIFAR-100. Combine with adversarial training and analyze routing behavior using switch loss for balancing.

Result: Inserting a single MoE layer in deeper stages leads to consistent robustness improvements under PGD and AutoPGD attacks. Switch loss causes routing collapse onto overused experts, concentrating adversarial training on these paths and making them more robust. Some individual experts outperform the gated MoE model in robustness.

Conclusion: Sparse MoE layers effectively improve CNN robustness without inference cost overhead. Robust subpaths emerge through specialization, with some individual experts becoming particularly robust due to concentrated adversarial training on frequently used paths.

Abstract: Robustifying convolutional neural networks (CNNs) against adversarial attacks remains challenging and often requires resource-intensive countermeasures. We explore the use of sparse mixture-of-experts (MoE) layers to improve robustness by replacing selected residual blocks or convolutional layers, thereby increasing model capacity without additional inference cost. On ResNet architectures trained on CIFAR-100, we find that inserting a single MoE layer in the deeper stages leads to consistent improvements in robustness under PGD and AutoPGD attacks when combined with adversarial training. Furthermore, we discover that when switch loss is used for balancing, it causes routing to collapse onto a small set of overused experts, thereby concentrating adversarial training on these paths and inadvertently making them more robust. As a result, some individual experts outperform the gated MoE model in robustness, suggesting that robust subpaths emerge through specialization. Our code is available at https://github.com/KASTEL-MobilityLab/robust-sparse-moes.
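
A sparse MoE layer of the kind described routes each input through one of several experts, so capacity grows without extra inference cost. A top-1 (switch-style) routed convolutional MoE is sketched below; the per-image routing and expert shapes are illustrative assumptions, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class SparseConvMoE(nn.Module):
    """Top-1 routed mixture of conv experts: each input runs exactly one
    expert, so inference cost matches a single conv layer."""
    def __init__(self, channels=64, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1)
            for _ in range(n_experts))
        self.router = nn.Linear(channels, n_experts)

    def forward(self, x):
        gate_logits = self.router(x.mean(dim=(2, 3)))   # route per image
        expert_idx = gate_logits.argmax(dim=1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out, expert_idx

y, routes = SparseConvMoE()(torch.randn(8, 64, 32, 32))
print(y.shape, routes.tolist())
```

The routing-collapse effect the paper reports would show up here as `routes` concentrating on one or two expert indices when a switch-style balancing loss is added during adversarial training.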

[166] Semi-supervised Deep Transfer for Regression without Domain Alignment

Mainak Biswas, Ambedkar Dukkipati, Devarajan Sridharan

Main category: cs.CV

TL;DR: CRAFT is a source-free semi-supervised domain adaptation method for regression tasks that improves model performance when source data is unavailable and labeled target data is scarce, achieving up to 9% RMSE improvement over fine-tuned models.

DetailsMotivation: Real-world applications like medicine face challenges where source models don't generalize to domain-shifted target data, and source data cannot be shared due to privacy concerns or computational costs. Labeled target data is often limited, especially in neuroscience settings with continuous-valued outputs.

Method: CRAFT builds upon Contradistinguisher framework but extends it for source-free semi-supervised transfer in regression tasks. It learns a shared model across labeled source and unlabeled target samples without intermediate representation alignment, leveraging unlabeled target data.

Result: CRAFT yielded up to 9% improvement in RMSE over fine-tuned models when labeled training examples were scarce. It outperformed four competing state-of-the-art source-free domain adaptation models by more than 3% in neuroscience applications (EEG gaze prediction and structural MRI brain age prediction).

Conclusion: CRAFT is an efficient approach for source-free, semi-supervised deep transfer for regression that is particularly valuable in biology and medicine where privacy concerns and data scarcity are common challenges.

Abstract: Deep learning models deployed in real-world applications (e.g., medicine) face challenges because source models do not generalize well to domain-shifted target data. Many successful domain adaptation (DA) approaches require full access to source data. Yet, such requirements are unrealistic in scenarios where source data cannot be shared either because of privacy concerns or because it is too large and incurs prohibitive storage or computational costs. Moreover, resource constraints may limit the availability of labeled targets. We illustrate this challenge in a neuroscience setting where source data are unavailable, labeled target data are meager, and predictions involve continuous-valued outputs. We build upon Contradistinguisher (CUDA), an efficient framework that learns a shared model across the labeled source and unlabeled target samples, without intermediate representation alignment. Yet, CUDA was designed for unsupervised DA, with full access to source data, and for classification tasks. We develop CRAFT – a Contradistinguisher-based Regularization Approach for Flexible Training – for source-free (SF), semi-supervised transfer of pretrained models in regression tasks. We showcase the efficacy of CRAFT in two neuroscience settings: gaze prediction with electroencephalography (EEG) data and “brain age” prediction with structural MRI data. For both datasets, CRAFT yielded up to 9% improvement in root-mean-squared error (RMSE) over fine-tuned models when labeled training examples were scarce. Moreover, CRAFT leveraged unlabeled target data and outperformed four competing state-of-the-art source-free domain adaptation models by more than 3%. Lastly, we demonstrate the efficacy of CRAFT on two other real-world regression benchmarks. We propose CRAFT as an efficient approach for source-free, semi-supervised deep transfer for regression that is ubiquitous in biology and medicine.

[167] A Scalable Attention-Based Approach for Image-to-3D Texture Mapping

Arianna Rampini, Kanika Madan, Bruno Roy, AmirHossein Zamani, Derek Cheung

Main category: cs.CV

TL;DR: A transformer-based framework that generates high-fidelity 3D textures directly from a single image and mesh in 0.2s, eliminating UV mapping and differentiable rendering while outperforming state-of-the-art methods.

DetailsMotivation: Existing texture generation methods are slow, rely on UV maps, and often fail to remain faithful to reference images, creating challenges for realistic 3D content creation.

Method: Uses transformer-based framework with triplane representation and depth-based backprojection losses to predict 3D texture field directly from single image and mesh.

Result: Generates high-fidelity textures in 0.2s per shape, outperforming state-of-the-art baselines in fidelity to input image and perceptual quality through extensive evaluations.

Conclusion: The method enables scalable, high-quality, and controllable 3D content creation with practical efficiency and superior performance compared to existing approaches.

Abstract: High-quality textures are critical for realistic 3D content creation, yet existing generative methods are slow, rely on UV maps, and often fail to remain faithful to a reference image. To address these challenges, we propose a transformer-based framework that predicts a 3D texture field directly from a single image and a mesh, eliminating the need for UV mapping and differentiable rendering, and enabling faster texture generation. Our method integrates a triplane representation with depth-based backprojection losses, enabling efficient training and faster inference. Once trained, it generates high-fidelity textures in a single forward pass, requiring only 0.2s per shape. Extensive qualitative, quantitative, and user preference evaluations demonstrate that our method outperforms state-of-the-art baselines on single-image texture reconstruction in terms of both fidelity to the input image and perceptual quality, highlighting its practicality for scalable, high-quality, and controllable 3D content creation.
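
A triplane texture field is queried by projecting a 3D point onto three axis-aligned feature planes and combining the bilinear samples. The sketch below shows that lookup step under assumed shapes; the decoder MLP and the paper's depth-based backprojection losses are omitted.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, points):
    """Query a triplane field: project each 3D point onto the XY, XZ and
    YZ feature planes, bilinearly sample, and sum the features.
    planes: (3, C, H, W); points: (N, 3) with coordinates in [-1, 1]."""
    projections = [points[:, [0, 1]], points[:, [0, 2]], points[:, [1, 2]]]
    feats = 0.0
    for plane, uv in zip(planes, projections):
        grid = uv.view(1, -1, 1, 2)                     # (1, N, 1, 2)
        sampled = F.grid_sample(plane.unsqueeze(0), grid,
                                align_corners=True)     # (1, C, N, 1)
        feats = feats + sampled[0, :, :, 0].t()         # (N, C)
    return feats

planes = torch.randn(3, 32, 64, 64)    # learned triplane features
points = torch.rand(1000, 3) * 2 - 1   # surface samples in [-1, 1]^3
features = sample_triplane(planes, points)
print(features.shape)  # (1000, 32); decode to RGB with a small MLP
```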

[168] SGS-3D: High-Fidelity 3D Instance Segmentation via Reliable Semantic Mask Splitting and Growing

Chaolei Wang, Yang Luo, Jing Du, Siyu Chen, Yiping Chen, Ting Han

Main category: cs.CV

TL;DR: SGS-3D is a training-free refinement method that uses a “split-then-grow” framework to improve 3D instance segmentation by purifying ambiguous 2D-to-3D lifted masks with geometric primitives and growing them into complete instances.

DetailsMotivation: Existing 3D instance segmentation methods based on 2D-to-3D lifting struggle with precise instance-level segmentation due to accumulated errors from ambiguous semantic guidance and insufficient depth constraints during the lifting process.

Method: Proposes a “split-then-grow” framework that first purifies and splits ambiguous lifted masks using geometric primitives through a mask filtering strategy, then grows them into complete instances by exploiting spatial continuity and high-level features, particularly for resolving semantic ambiguity between distinct objects.

Result: Experimental results on ScanNet200, ScanNet++, and KITTI-360 show that SGS-3D substantially improves segmentation accuracy and robustness against inaccurate masks from pre-trained models, yielding high-fidelity object instances while maintaining strong generalization across diverse indoor and outdoor environments.

Conclusion: SGS-3D effectively addresses the limitations of 2D-to-3D lifting approaches by jointly fusing semantic and geometric information, enabling precise 3D instance segmentation without requiring additional training, and demonstrates strong performance across multiple benchmark datasets.

Abstract: Accurate 3D instance segmentation is crucial for high-quality scene understanding in the 3D vision domain. However, 3D instance segmentation based on 2D-to-3D lifting approaches struggle to produce precise instance-level segmentation, due to accumulated errors introduced during the lifting process from ambiguous semantic guidance and insufficient depth constraints. To tackle these challenges, we propose splitting and growing reliable semantic mask for high-fidelity 3D instance segmentation (SGS-3D), a novel “split-then-grow” framework that first purifies and splits ambiguous lifted masks using geometric primitives, and then grows them into complete instances within the scene. Unlike existing approaches that directly rely on raw lifted masks and sacrifice segmentation accuracy, SGS-3D serves as a training-free refinement method that jointly fuses semantic and geometric information, enabling effective cooperation between the two levels of representation. Specifically, for semantic guidance, we introduce a mask filtering strategy that leverages the co-occurrence of 3D geometry primitives to identify and remove ambiguous masks, thereby ensuring more reliable semantic consistency with the 3D object instances. For the geometric refinement, we construct fine-grained object instances by exploiting both spatial continuity and high-level features, particularly in the case of semantic ambiguity between distinct objects. Experimental results on ScanNet200, ScanNet++, and KITTI-360 demonstrate that SGS-3D substantially improves segmentation accuracy and robustness against inaccurate masks from pre-trained models, yielding high-fidelity object instances while maintaining strong generalization across diverse indoor and outdoor environments. Code is available in the supplementary materials.

[169] SL-SLR: Self-Supervised Representation Learning for Sign Language Recognition

Ariel Basso Madjoukeng, Jérôme Fink, Pierre Poitier, Edith Belise Kenmogne, Benoit Frenay

Main category: cs.CV

TL;DR: A self-supervised learning framework for sign language recognition that addresses issues with contrastive learning by using free-negative pairs and a new data augmentation technique, achieving significant accuracy improvements.

DetailsMotivation: Contrastive learning methods in sign language recognition treat all video parts equally and struggle with similar negative pairs due to shared movements between signs, leading to non-discriminative features.

Method: Proposes a self-supervised framework with two components: (1) a new self-supervised approach using free-negative pairs, and (2) a novel data augmentation technique designed to work together.

Result: Shows considerable accuracy gains compared to several contrastive and self-supervised methods across linear evaluation, semi-supervised learning, and transferability between sign languages.

Conclusion: The proposed framework effectively addresses the limitations of traditional contrastive learning in SLR by focusing on relevant video parts and handling similar negative pairs, resulting in improved performance across multiple evaluation scenarios.

Abstract: Sign language recognition (SLR) is a machine learning task aiming to identify signs in videos. Due to the scarcity of annotated data, unsupervised methods like contrastive learning have become promising in this field. They learn meaningful representations by pulling positive pairs (two augmented versions of the same instance) closer and pushing negative pairs (different from the positive pairs) apart. In SLR, in a sign video, only certain parts provide information that is truly useful for its recognition. Applying contrastive methods to SLR raises two issues: (i) contrastive learning methods treat all parts of a video in the same way, without taking into account the relevance of certain parts over others; (ii) shared movements between different signs make negative pairs highly similar, complicating sign discrimination. These issues lead to learning non-discriminative features for sign recognition and poor results in downstream tasks. In response, this paper proposes a self-supervised learning framework designed to learn meaningful representations for SLR. This framework consists of two key components designed to work together: (i) a new self-supervised approach with free-negative pairs; (ii) a new data augmentation technique. This approach shows a considerable gain in accuracy compared to several contrastive and self-supervised methods, across linear evaluation, semi-supervised learning, and transferability between sign languages.

[170] Enhancing 3D Point Cloud Classification with ModelNet-R and Point-SkipNet

Mohammad Saeid, Amir Salarpour, Pedram MohajerAnsari

Main category: cs.CV

TL;DR: This paper introduces ModelNet-R, a refined version of ModelNet40 to address dataset limitations, and proposes Point-SkipNet, a lightweight graph-based neural network for efficient 3D point cloud classification.

DetailsMotivation: The ModelNet40 dataset has limitations including inconsistent labeling, 2D data, size mismatches, and poor class differentiation that hinder model performance in 3D point cloud classification applications like autonomous driving and robotics.

Method: The paper introduces ModelNet-R as a refined benchmark dataset and proposes Point-SkipNet, a lightweight graph-based neural network that uses efficient sampling, neighborhood grouping, and skip connections to reduce computational overhead.

Result: Extensive experiments show significant performance improvements on ModelNet-R. Point-SkipNet achieves state-of-the-art accuracy with substantially lower parameter count compared to contemporary models.

Conclusion: The research demonstrates the crucial role of dataset quality in optimizing model efficiency for 3D point cloud classification, providing both an improved benchmark and an efficient network architecture.

Abstract: The classification of 3D point clouds is crucial for applications such as autonomous driving, robotics, and augmented reality. However, the commonly used ModelNet40 dataset suffers from limitations such as inconsistent labeling, 2D data, size mismatches, and inadequate class differentiation, which hinder model performance. This paper introduces ModelNet-R, a meticulously refined version of ModelNet40 designed to address these issues and serve as a more reliable benchmark. Additionally, this paper proposes Point-SkipNet, a lightweight graph-based neural network that leverages efficient sampling, neighborhood grouping, and skip connections to achieve high classification accuracy with reduced computational overhead. Extensive experiments demonstrate that models trained on ModelNet-R exhibit significant performance improvements. Notably, Point-SkipNet achieves state-of-the-art accuracy on ModelNet-R with a substantially lower parameter count compared to contemporary models. This research highlights the crucial role of dataset quality in optimizing model efficiency for 3D point cloud classification. For more details, see the code at: https://github.com/m-saeid/ModeNetR_PointSkipNet.

[171] Symbolic Graphics Programming with Large Language Models

Yamei Chen, Haoquan Zhang, Yangyi Huang, Zeju Qiu, Kaipeng Zhang, Yandong Wen, Weiyang Liu

Main category: cs.CV

TL;DR: This paper studies LLMs’ ability to generate symbolic graphics programs (SVGs) from natural language descriptions, introduces a benchmark (SGP-GenBench), and proposes a reinforcement learning approach with verifiable rewards to improve SVG generation quality.

DetailsMotivation: To explore how well LLMs can generate precise symbolic graphics programs (SGPs) that render into visual content, and to understand how LLMs comprehend the visual world through program synthesis.

Method: Introduced SGP-GenBench benchmark covering object fidelity, scene fidelity, and compositionality. Proposed reinforcement learning with verifiable rewards using format-validity gates and cross-modal rewards via vision encoders (SigLIP for text-image, DINO for image-image alignment).

Result: Frontier proprietary models outperformed open-source models, with performance correlating with general coding capabilities. The RL approach applied to Qwen-2.5-7B substantially improved SVG generation quality and semantics, achieving performance comparable to frontier systems.

Conclusion: Symbolic graphics programming provides a precise and interpretable framework for studying cross-modal grounding in LLMs, with RL inducing better object decomposition and contextual details for improved scene coherence.

Abstract: Large language models (LLMs) excel at program synthesis, yet their ability to produce symbolic graphics programs (SGPs) that render into precise visual content remains underexplored. We study symbolic graphics programming, where the goal is to generate an SGP from a natural-language description. This task also serves as a lens into how LLMs understand the visual world by prompting them to generate images rendered from SGPs. Among various SGPs, our paper sticks to scalable vector graphics (SVGs). We begin by examining the extent to which LLMs can generate SGPs. To this end, we introduce SGP-GenBench, a comprehensive benchmark covering object fidelity, scene fidelity, and compositionality (attribute binding, spatial relations, numeracy). On SGP-GenBench, we discover that frontier proprietary models substantially outperform open-source models, and performance correlates well with general coding capabilities. Motivated by this gap, we aim to improve LLMs’ ability to generate SGPs. We propose a reinforcement learning (RL) with verifiable rewards approach, where a format-validity gate ensures renderable SVG, and a cross-modal reward aligns text and the rendered image via strong vision encoders (e.g., SigLIP for text-image and DINO for image-image). Applied to Qwen-2.5-7B, our method substantially improves SVG generation quality and semantics, achieving performance on par with frontier systems. We further analyze training dynamics, showing that RL induces (i) finer decomposition of objects into controllable primitives and (ii) contextual details that improve scene coherence. Our results demonstrate that symbolic graphics programming offers a precise and interpretable lens on cross-modal grounding.
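
The reward structure is concrete enough to sketch: a validity gate that returns zero for unparseable SVG, then a cross-modal cosine similarity between the rendered image's embedding and the caption embedding. In the sketch below, `render_and_embed` is a hypothetical stand-in for rendering plus a SigLIP-style encoder, not a real API.

```python
import xml.etree.ElementTree as ET
import torch
import torch.nn.functional as F

def svg_reward(svg_text, text_emb, render_and_embed):
    """Format-validity gate plus cross-modal similarity reward, in the
    spirit of the RL recipe described above."""
    try:
        root = ET.fromstring(svg_text)
        if not root.tag.endswith("svg"):
            return 0.0                      # parses, but not an SVG
    except ET.ParseError:
        return 0.0                          # gate: unrenderable, no reward
    image_emb = render_and_embed(svg_text)  # embedding of the render
    return F.cosine_similarity(image_emb, text_emb, dim=0).item()

fake_encoder = lambda svg: torch.randn(512)     # stand-in for SigLIP
text_emb = torch.randn(512)                     # caption embedding
good = '<svg xmlns="http://www.w3.org/2000/svg"><circle r="5"/></svg>'
print(svg_reward(good, text_emb, fake_encoder))
print(svg_reward("<svg><circle", text_emb, fake_encoder))  # 0.0
```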

[172] COGITAO: A Visual Reasoning Framework To Study Compositionality & Generalization

Yassine Taoudi-Benchekroun, Klim Troyan, Pascal Sager, Stefan Gerber, Lukas Tuggener, Benjamin Grewe

Main category: cs.CV

TL;DR: COGITAO is a modular framework for generating millions of rule-based visual tasks to study compositionality and generalization in AI models, showing current models fail to generalize novel combinations despite good in-domain performance.

DetailsMotivation: Address the persistent limitation in machine learning models' ability to compose learned concepts and apply them in novel settings, which is key to human intelligence.

Method: Developed a modular data generation framework that constructs rule-based tasks applying transformations to objects in grid environments, supporting composition over 28 interoperable transformations with adjustable depth and extensive control over grid parameters.

Result: Created millions of unique task rules across difficulty ranges, with baseline experiments showing state-of-the-art vision models consistently fail to generalize to novel combinations of familiar elements despite strong in-domain performance.

Conclusion: COGITAO provides an open-source benchmark to support continued research on compositionality and generalization in visual domains, highlighting current model limitations that need to be addressed.

Abstract: The ability to compose learned concepts and apply them in novel settings is key to human intelligence, but remains a persistent limitation in state-of-the-art machine learning models. To address this issue, we introduce COGITAO, a modular and extensible data generation framework and benchmark designed to systematically study compositionality and generalization in visual domains. Drawing inspiration from ARC-AGI’s problem-setting, COGITAO constructs rule-based tasks which apply a set of transformations to objects in grid-like environments. It supports composition, at adjustable depth, over a set of 28 interoperable transformations, along with extensive control over grid parametrization and object properties. This flexibility enables the creation of millions of unique task rules – surpassing concurrent datasets by several orders of magnitude – across a wide range of difficulties, while allowing virtually unlimited sample generation per rule. We provide baseline experiments using state-of-the-art vision models, highlighting their consistent failures to generalize to novel combinations of familiar elements, despite strong in-domain performance. COGITAO is fully open-sourced, including all code and datasets, to support continued research in this field.
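
Rule-based grid tasks of this kind are compositions of primitive transformations. The sketch below composes three illustrative primitives (translate, recolor, rotate) into a task rule; the actual 28 COGITAO transformations and grid conventions are not reproduced here.

```python
import numpy as np

def translate(grid, dy, dx):
    """Shift the grid contents with wrap-around."""
    return np.roll(grid, shift=(dy, dx), axis=(0, 1))

def recolor(grid, src, dst):
    """Replace one color index with another."""
    out = grid.copy()
    out[out == src] = dst
    return out

def rotate90(grid):
    return np.rot90(grid)

def compose(*transforms):
    """A task rule is a composition of transformations at some depth."""
    def rule(grid):
        for t in transforms:
            grid = t(grid)
        return grid
    return rule

grid = np.zeros((5, 5), dtype=int)
grid[1, 1] = 3                              # one colored object
rule = compose(lambda g: translate(g, 0, 2),
               lambda g: recolor(g, 3, 7),
               rotate90)
print(rule(grid))
```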

[173] WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool

Zizun Li, Jianjun Zhou, Yifan Wang, Haoyu Guo, Wenzheng Chang, Yang Zhou, Haoyi Zhu, Junyi Chen, Chunhua Shen, Tong He

Main category: cs.CV

TL;DR: WinT3R is a feed-forward reconstruction model that achieves real-time high-quality 3D reconstruction and camera pose estimation through a sliding window mechanism and global camera token pool.

DetailsMotivation: Previous methods face a trade-off between reconstruction quality and real-time performance, limiting practical applications that require both high-quality results and online processing capabilities.

Method: Uses a sliding window mechanism for efficient information exchange between frames and maintains a global camera token pool with compact camera representations to enhance pose estimation reliability while maintaining computational efficiency.

Result: Achieves state-of-the-art performance in online reconstruction quality, camera pose estimation accuracy, and reconstruction speed across diverse datasets.

Conclusion: WinT3R successfully addresses the quality-speed trade-off in 3D reconstruction, enabling real-time high-quality reconstruction with precise camera pose estimation through innovative architectural designs.

Abstract: We present WinT3R, a feed-forward reconstruction model capable of online prediction of precise camera poses and high-quality point maps. Previous methods suffer from a trade-off between reconstruction quality and real-time performance. To address this, we first introduce a sliding window mechanism that ensures sufficient information exchange among frames within the window, thereby improving the quality of geometric predictions without large computation. In addition, we leverage a compact representation of cameras and maintain a global camera token pool, which enhances the reliability of camera pose estimation without sacrificing efficiency. These designs enable WinT3R to achieve state-of-the-art performance in terms of online reconstruction quality, camera pose estimation, and reconstruction speed, as validated by extensive experiments on diverse datasets. Code and model are publicly available at https://github.com/LiZizun/WinT3R.

[174] FlowSeek: Optical Flow Made Easier with Depth Foundation Models and Motion Bases

Matteo Poggi, Fabio Tosi

Main category: cs.CV

TL;DR: FlowSeek is a lightweight optical flow framework that achieves state-of-the-art performance with minimal hardware requirements, using only a single consumer GPU for training.

DetailsMotivation: To develop an optical flow solution that requires significantly fewer hardware resources than current methods while maintaining high accuracy and cross-dataset generalization capabilities.

Method: Combines latest optical flow network designs with single-image depth foundation models and classical low-dimensional motion parametrization to create a compact architecture.

Result: Achieves superior cross-dataset generalization on Sintel Final and KITTI with 10-15% improvement over previous state-of-the-art SEA-RAFT, while using 8x less hardware resources.

Conclusion: FlowSeek demonstrates that high-performance optical flow can be achieved with minimal hardware requirements through innovative architectural design combining modern and classical approaches.

Abstract: We present FlowSeek, a novel framework for optical flow requiring minimal hardware resources for training. FlowSeek marries the latest advances in the design space of optical flow networks with cutting-edge single-image depth foundation models and classical low-dimensional motion parametrization, implementing a compact yet accurate architecture. FlowSeek is trained on a single consumer-grade GPU, a hardware budget about 8x lower compared to most recent methods, and still achieves superior cross-dataset generalization on Sintel Final and KITTI, with relative improvements of 10% and 15% over the previous state-of-the-art SEA-RAFT, as well as on the Spring and LayeredFlow datasets.
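
Classical low-dimensional motion parametrization writes a dense flow field as a linear combination of a few basis flows. Below is a sketch of recovering those coefficients by least squares, with a toy three-element basis (horizontal shift, vertical shift, rotation) standing in for whatever bases FlowSeek actually uses.

```python
import torch

def fit_motion_bases(flow, bases):
    """Least-squares coefficients so that flow ~ sum_k c_k * bases[k].
    flow: (2, H, W); bases: (K, 2, H, W)."""
    K = bases.shape[0]
    A = bases.reshape(K, -1).t()            # (2HW, K) design matrix
    b = flow.reshape(-1, 1)                 # (2HW, 1) target
    coeffs = torch.linalg.lstsq(A, b).solution
    return coeffs.squeeze(1), (A @ coeffs).reshape_as(flow)

H, W = 32, 32
yy, xx = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
bases = torch.stack([
    torch.stack([torch.ones(H, W), torch.zeros(H, W)]),  # shift right
    torch.stack([torch.zeros(H, W), torch.ones(H, W)]),  # shift down
    torch.stack([-yy, xx]),                              # rotation
])
true_flow = 0.5 * bases[0] + 2.0 * bases[2]
coeffs, recon = fit_motion_bases(true_flow, bases)
print(coeffs)  # approximately [0.5, 0.0, 2.0]
```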

[175] Representation-Centric Survey of Skeletal Action Recognition and the ANUBIS Benchmark

Yang Liu, Jiyao Yang, Madhawa Perera, Pan Ji, Dongwoo Kim, Min Xu, Tianyang Wang, Saeed Anwar, Tom Gedeon, Lei Wang, Zhenyue Qin

Main category: cs.CV

TL;DR: Survey of skeleton-based action recognition methods categorized by input representations, introduces ANUBIS dataset with multi-view, multi-person, fine-grained actions, and benchmarks models showing strong action-feature dependencies.

DetailsMotivation: Address fragmentation in skeleton-based action recognition research and lack of evaluation under modern real-world challenges like multi-view perspectives, complex interactions, and contemporary behaviors.

Method: Systematic categorization of state-of-the-art methods by input feature types (joint coordinates, bone vectors, motion flows, extended representations) and creation of ANUBIS dataset with multi-view recordings, back-view perspectives, complex multi-person interactions, fine-grained/violent actions, and contemporary social behaviors.

Result: Benchmarking shows strong action-feature dependencies, limitations of naive multi-representational fusion, and need for task-aware, semantically aligned integration strategies across 102 action categories.

Conclusion: Provides comprehensive foundation and practical benchmarking resource to guide development of robust, generalizable skeleton-based action recognition systems for complex real-world scenarios.

Abstract: 3D skeleton-based human action recognition has emerged as a powerful alternative to traditional RGB and depth-based approaches, offering robustness to environmental variations, computational efficiency, and enhanced privacy. Despite remarkable progress, current research remains fragmented across diverse input representations and lacks evaluation under scenarios that reflect modern real-world challenges. This paper presents a representation-centric survey of skeleton-based action recognition, systematically categorizing state-of-the-art methods by their input feature types: joint coordinates, bone vectors, motion flows, and extended representations, and analyzing how these choices influence spatial-temporal modeling strategies. Building on the insights from this review, we introduce ANUBIS, a large-scale, challenging skeleton action dataset designed to address critical gaps in existing benchmarks. ANUBIS incorporates multi-view recordings with back-view perspectives, complex multi-person interactions, fine-grained and violent actions, and contemporary social behaviors. We benchmark a diverse set of state-of-the-art models on ANUBIS and conduct an in-depth analysis of how different feature types affect recognition performance across 102 action categories. Our results show strong action-feature dependencies, highlight the limitations of naïve multi-representational fusion, and point toward the need for task-aware, semantically aligned integration strategies. This work offers both a comprehensive foundation and a practical benchmarking resource, aiming to guide the next generation of robust, generalizable skeleton-based action recognition systems for complex real-world scenarios. The dataset website, benchmarking framework, and download link are available at https://yliu1082.github.io/ANUBIS/.
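
For readers unfamiliar with the surveyed input feature types, the following sketch derives bone vectors and motion flows from raw joint coordinates. The toy kinematic chain is an assumption, not the ANUBIS skeleton definition.

```python
# Minimal sketch of the surveyed skeleton input representations.
import numpy as np

def bone_vectors(joints, edges):
    """joints: (T, J, 3) joint coordinates; edges: (child, parent) index pairs."""
    return np.stack([joints[:, c] - joints[:, p] for c, p in edges], axis=1)

def motion_flow(joints):
    """First-order temporal difference of joint positions: (T-1, J, 3)."""
    return joints[1:] - joints[:-1]

T, J = 100, 25
joints = np.random.randn(T, J, 3)
edges = [(1, 0), (2, 1), (3, 2)]          # toy kinematic chain, not ANUBIS's
bones = bone_vectors(joints, edges)       # (T, len(edges), 3)
flow = motion_flow(joints)                # (T-1, J, 3)
```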

[176] Net2Brain: A Toolbox to compare artificial vision models with human brain responses

Domenic Bersch, Kshitij Dwivedi, Martina Vilas, Radoslaw M. Cichy, Gemma Roig

Main category: cs.CV

TL;DR: Net2Brain is a toolbox that compares artificial neural networks and human brain recordings using representational similarity analysis, supporting over 600 DNNs across diverse vision tasks.

DetailsMotivation: Existing toolboxes have limited functionality and focus only on small subsets of supervised image classification models, lacking comprehensive comparison capabilities between DNNs and brain data.

Method: The toolbox extracts activations from diverse DNNs, computes representational dissimilarity matrices (RDMs), and compares them to brain recordings using RSA and weighted RSA, both within specific ROIs and via searchlight analysis.

Result: Net2Brain enables extraction from 600+ DNNs across various vision tasks (semantic segmentation, depth estimation, action recognition) and supports adding new stimulus datasets and brain recordings.

Conclusion: The toolbox provides comprehensive functionality for testing cognitive computational neuroscience hypotheses by bridging artificial neural networks and human brain data analysis.

Abstract: We introduce Net2Brain, a graphical and command-line user interface toolbox for comparing the representational spaces of artificial deep neural networks (DNNs) and human brain recordings. While different toolboxes facilitate only single functionalities or only focus on a small subset of supervised image classification models, Net2Brain allows the extraction of activations of more than 600 DNNs trained to perform a diverse range of vision-related tasks (e.g semantic segmentation, depth estimation, action recognition, etc.), over both image and video datasets. The toolbox computes the representational dissimilarity matrices (RDMs) over those activations and compares them to brain recordings using representational similarity analysis (RSA), weighted RSA, both in specific ROIs and with searchlight search. In addition, it is possible to add a new data set of stimuli and brain recordings to the toolbox for evaluation. We demonstrate the functionality and advantages of Net2Brain with an example showcasing how it can be used to test hypotheses of cognitive computational neuroscience.
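
The core RSA comparison the toolbox performs can be sketched in a few lines: build representational dissimilarity matrices (RDMs) from DNN activations and brain responses, then correlate their pairwise dissimilarities. Function names and data below are illustrative, not the toolbox's actual API.

```python
# Minimal RSA sketch, assuming toy activations; see the toolbox for its real API.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(responses):
    """responses: (n_stimuli, n_features) -> condensed dissimilarity vector."""
    return pdist(responses, metric="correlation")  # 1 - Pearson r per stimulus pair

dnn_acts = np.random.randn(50, 512)     # DNN activations for 50 stimuli
brain = np.random.randn(50, 200)        # voxel responses in one ROI
rho, _ = spearmanr(rdm(dnn_acts), rdm(brain))
print(f"RSA score (Spearman rho): {rho:.3f}")
```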

[177] FAGC:Feature Augmentation on Geodesic Curve in the Pre-Shape Space

Yuexing Han, Gan Hu, Guanxin Wan, Bing Wang

Main category: cs.CV

TL;DR: FAGC is a feature augmentation method that uses geodesic curves in pre-shape space to generate new features for small-sample learning tasks, significantly improving model performance.

DetailsMotivation: Existing data augmentation methods suffer from information loss and perform poorly in small-sample scenarios, limiting their application in deep learning.

Method: Extract image features using pre-trained network, project to pre-shape space by removing position/scale information, construct optimal geodesic curve, and interpolate new feature vectors along the curve.

Result: Extensive experiments show FAGC significantly improves performance of deep learning and machine learning methods in small-sample tasks.

Conclusion: FAGC is an effective and versatile feature augmentation method that overcomes limitations of traditional approaches in small-sample scenarios.

Abstract: Due to the constraints on model performance imposed by the size of the training data, data augmentation has become an essential technique in deep learning. However, most existing data augmentation methods are affected by information loss and perform poorly in small-sample scenarios, which limits their application. To overcome the limitation, we propose a Feature Augmentation method on Geodesic Curve in the pre-shape space, called the FAGC. First, a pre-trained neural network model is employed to extract features from the input images. Then, the image features as a vector is projected into the pre-shape space by removing its position and scale information. In the pre-shape space, an optimal Geodesic curve is constructed to fit the feature vectors. Finally, new feature vectors are generated for model learning by interpolating along the constructed Geodesic curve. We conducted extensive experiments to demonstrate the effectiveness and versatility of the FAGC. The results demonstrate that applying the FAGC to deep learning or machine learning methods can significantly improve their performance in small-sample tasks.
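
A minimal sketch of the underlying geometry: centering and unit-normalizing a feature vector projects it to the pre-shape space (a unit hypersphere), and a geodesic between two pre-shapes is traversed by spherical linear interpolation. FAGC fits an optimal geodesic to many feature vectors; the two-point case below is a simplification.

```python
# Hedged sketch: pre-shape projection and two-point geodesic interpolation.
import numpy as np

def to_preshape(v):
    v = v - v.mean()                       # remove position
    return v / np.linalg.norm(v)           # remove scale -> unit hypersphere

def geodesic_interp(u, w, t):
    """Spherical linear interpolation between unit vectors u, w at t in [0, 1]."""
    theta = np.arccos(np.clip(u @ w, -1.0, 1.0))
    if theta < 1e-8:
        return u
    return (np.sin((1 - t) * theta) * u + np.sin(t * theta) * w) / np.sin(theta)

f1, f2 = np.random.randn(256), np.random.randn(256)   # features from a backbone
u, w = to_preshape(f1), to_preshape(f2)
augmented = [geodesic_interp(u, w, t) for t in np.linspace(0.1, 0.9, 9)]
```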

[178] FADE: A Dataset for Detecting Falling Objects around Buildings in Video

Zhigang Tu, Zhengbo Zhang, Zitao Gao, Chunluan Zhou, Junsong Yuan, Bo Du

Main category: cs.CV

TL;DR: A new large-scale dataset FADE and detection method for falling object detection around buildings, addressing challenges of small object size, fast motion, and lack of standardized data.

DetailsMotivation: Falling objects from buildings pose serious safety risks but detection is challenging due to small object size, fast motion, and lack of large-scale datasets for training and evaluation.

Method: Created FADE dataset with 2,611 videos from 25 scenes featuring 8 object categories, 4 weather conditions, and 4 resolutions. Developed novel detection method that leverages motion information and generates small-sized high-quality detection proposals.

Result: The method's efficacy was demonstrated on the FADE dataset through comparisons with state-of-the-art approaches in generic object detection, video object detection, and moving object detection.

Conclusion: The FADE dataset and proposed detection method provide a comprehensive benchmark solution for falling object detection around buildings, with publicly available dataset and code.

Abstract: Objects falling from buildings, a frequently occurring event in daily life, can cause severe injuries to pedestrians due to the high impact force they exert. Surveillance cameras are often installed around buildings to detect falling objects, but such detection remains challenging due to the small size and fast motion of the objects. Moreover, the field of falling object detection around buildings (FODB) lacks a large-scale dataset for training learning-based detection methods and for standardized evaluation. To address these challenges, we propose a large and diverse video benchmark dataset named FADE. Specifically, FADE contains 2,611 videos from 25 scenes, featuring 8 falling object categories, 4 weather conditions, and 4 video resolutions. Additionally, we develop a novel detection method for FODB that effectively leverages motion information and generates small-sized yet high-quality detection proposals. The efficacy of our method is evaluated on the proposed FADE dataset by comparing it with state-of-the-art approaches in generic object detection, video object detection, and moving object detection. The dataset and code are publicly available at https://fadedataset.github.io/FADE.github.io/.

[179] SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs

Yuanyang Yin, Yaqi Zhao, Yajie Zhang, Yuanxing Zhang, Ke Lin, Jiahao Wang, Xin Tao, Pengfei Wan, Wentao Zhang, Feng Zhao

Main category: cs.CV

TL;DR: SEA is a token-level supervision alignment method that improves visual-text alignment in multimodal LLMs, especially benefiting smaller models with 7.61% average performance gain for Gemma-2B.

DetailsMotivation: Current MLLMs suffer from suboptimal modality alignment due to simple adapter architectures and image-level supervision, which limits LLMs' ability to interpret visual features, particularly for smaller models with capacity constraints.

Method: Proposes Supervised Embedding Alignment (SEA) - a token-level supervision alignment method that enables precise visual-text alignment during pretraining with minimal computational overhead while preserving language capabilities.

Result: SEA consistently improves performance across various model sizes, with smaller models benefiting the most (7.61% average gain for Gemma-2B). Comprehensive analyses reveal critical insights into the adapter's role in multimodal integration.

Conclusion: SEA establishes a foundation for developing more effective alignment strategies for future multimodal systems, addressing fundamental limitations in current MLLM alignment approaches.

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities by integrating visual and textual inputs, yet modality alignment remains one of the most challenging aspects. Current MLLMs typically rely on simple adapter architectures and pretraining approaches to bridge vision encoders with large language models (LLM), guided by image-level supervision. We identify that this paradigm often leads to suboptimal alignment between modalities, significantly constraining the LLM’s ability to properly interpret and reason with visual features, particularly for smaller language models. This limitation degrades overall performance, particularly for smaller language models where capacity constraints are more pronounced and adaptation capabilities are limited. To address this fundamental limitation, we propose Supervised Embedding Alignment (SEA), a token-level supervision alignment method that enables more precise visual-text alignment during pretraining. SEA introduces minimal computational overhead while preserving language capabilities and substantially improving cross-modal understanding. Our comprehensive analyses reveal critical insights into the adapter’s role in multimodal integration, and extensive experiments demonstrate that SEA consistently improves performance across various model sizes, with smaller models benefiting the most (average performance gain of 7.61% for Gemma-2B). This work establishes a foundation for developing more effective alignment strategies for future multimodal systems.

[180] Automated detection of underdiagnosed medical conditions via opportunistic imaging

Asad Aali, Andrew Johnston, Louis Blankemeier, Dave Van Veen, Laura T Derry, David Svec, Jason Hom, Robert D. Boutin, Akshay S. Chaudhari

Main category: cs.CV

TL;DR: Deep learning analysis of 2,674 abdominal CT scans reveals significant under-documentation of sarcopenia, hepatic steatosis, and ascites in ICD coding despite detection through opportunistic CT imaging.

DetailsMotivation: To address the underdiagnosis of conditions like sarcopenia, hepatic steatosis, and ascites by leveraging opportunistic CT scans and identifying discrepancies between imaging findings and clinical documentation.

Method: Used deep learning methods to analyze 2,674 inpatient CT scans, comparing imaging phenotypes from opportunistic CT with corresponding radiology reports and ICD coding documentation.

Result: Only 0.5% of sarcopenia cases, 3.2% of hepatic steatosis cases, and 30.7% of ascites cases detected through imaging or radiology reports were properly ICD-coded.

Conclusion: Opportunistic CT has significant potential to improve diagnostic precision and accuracy of risk adjustment models, advancing precision medicine by capturing conditions that are currently under-documented.

Abstract: Abdominal computed tomography (CT) scans are frequently performed in clinical settings. Opportunistic CT involves repurposing routine CT images to extract diagnostic information and is an emerging tool for detecting underdiagnosed conditions such as sarcopenia, hepatic steatosis, and ascites. This study utilizes deep learning methods to promote accurate diagnosis and clinical documentation. We analyze 2,674 inpatient CT scans to identify discrepancies between imaging phenotypes (characteristics derived from opportunistic CT scans) and their corresponding documentation in radiology reports and ICD coding. Through our analysis, we find that only 0.5%, 3.2%, and 30.7% of scans diagnosed with sarcopenia, hepatic steatosis, and ascites (respectively) through either opportunistic imaging or radiology reports were ICD-coded. Our findings demonstrate opportunistic CT’s potential to enhance diagnostic precision and accuracy of risk adjustment models, offering advancements in precision medicine.

[181] PKF: Probabilistic Data Association Kalman Filter for Multi-Object Tracking

Hanwen Cao, George J. Pappas, Nikolay Atanasov

Main category: cs.CV

TL;DR: A new Kalman filter with probabilistic data association using variational inference and EM, achieving better tracking accuracy than JPDAF while maintaining real-time performance.

DetailsMotivation: To improve multi-object tracking by developing a more accurate probabilistic data association method that maintains computational efficiency.

Method: Formulated variational inference problem with data association as latent variable, applied EM to obtain Kalman filter with expanded measurement vector, computed association probabilities using matrix permanents, and implemented ambiguity check to reduce computation.

Result: Achieved lower tracking errors than JPDAF in simulations, better HOTA than previous Kalman-filter methods on real-world datasets (MOT17, MOT20, DanceTrack), top-10 ranking on MOT17/MOT20, and 250+ fps tracking on laptop CPU.

Conclusion: The proposed probabilistic Kalman filter provides superior tracking accuracy while maintaining real-time performance, demonstrating effectiveness for multi-object tracking without requiring deep features or velocities.

Abstract: In this paper, we derive a new Kalman filter with probabilistic data association between measurements and states. We formulate a variational inference problem to approximate the posterior density of the state conditioned on the measurement data. We view the unknown data association as a latent variable and apply Expectation Maximization (EM) to obtain a filter with update step in the same form as the Kalman filter but with expanded measurement vector of all potential associations. We show that the association probabilities can be computed as permanents of matrices with measurement likelihood entries. We also propose an ambiguity check that associates only a subset of ambiguous measurements and states probabilistically, thus reducing the association time and preventing low-probability measurements from harming the estimation accuracy. Experiments in simulation show that our filter achieves lower tracking errors than the well-established joint probabilistic data association filter (JPDAF), while running at comparable rate. We also demonstrate the effectiveness of our filter in multi-object tracking (MOT) on multiple real-world datasets, including MOT17, MOT20, and DanceTrack. We achieve better higher order tracking accuracy (HOTA) than previous Kalman-filter methods and remain real-time. Associating only bounding boxes without deep features or velocities, our method ranks top-10 on both MOT17 and MOT20 in terms of HOTA. Given offline detections, our algorithm tracks at 250+ fps on a single laptop CPU. Code is available at https://github.com/hwcao17/pkf.
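
The permanent-based association probabilities can be illustrated directly: with a likelihood matrix L over tracks and measurements, P(track i associated with measurement j) equals L[i,j] times the permanent of the minor of L (row i and column j removed), divided by the permanent of L. The brute-force permanent below is viable only for toy sizes, and the numbers are made up.

```python
# Hedged sketch: permanent-based data-association probabilities (toy sizes only).
import numpy as np
from itertools import permutations

def permanent(M):
    n = M.shape[0]
    return sum(np.prod([M[i, p[i]] for i in range(n)]) for p in permutations(range(n)))

# L[i, j] = likelihood of measurement j under track i (illustrative values)
L = np.array([[0.90, 0.10, 0.05],
              [0.20, 0.70, 0.10],
              [0.05, 0.10, 0.80]])

total = permanent(L)
P = np.zeros_like(L)
for i in range(3):
    for j in range(3):
        minor = np.delete(np.delete(L, i, axis=0), j, axis=1)
        P[i, j] = L[i, j] * permanent(minor) / total

print(P.sum(axis=1))  # each row sums to 1: a proper association distribution
```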

[182] HypDAE: Hyperbolic Diffusion Autoencoders for Hierarchical Few-shot Image Generation

Lingxiao Li, Kaixuan Fan, Boqing Gong, Xiangyu Yue

Main category: cs.CV

TL;DR: HypDAE is a novel hyperbolic diffusion autoencoder approach for few-shot image generation that operates in hyperbolic space to better balance category consistency and diversity while providing enhanced control over semantic attributes.

DetailsMotivation: Few-shot image generation faces challenges in balancing category consistency and image diversity, with existing methods offering limited control over generated image attributes.

Method: Uses hyperbolic space to capture hierarchical relationships among images, leveraging pre-trained foundation models and generating images by varying stochastic subcodes or semantic codes with radius adjustment in hyperbolic disk for control.

Result: Significantly outperforms prior methods in achieving better balance between category preservation and diversity, with exceptional quality and highly controllable/interpretable generation process.

Conclusion: HypDAE provides a superior approach for few-shot image generation with improved control, interpretability, and performance compared to existing methods.

Abstract: Few-shot image generation aims to generate diverse and high-quality images for an unseen class given only a few examples in that class. A key challenge in this task is balancing category consistency and image diversity, which often compete with each other. Moreover, existing methods offer limited control over the attributes of newly generated images. In this work, we propose Hyperbolic Diffusion Autoencoders (HypDAE), a novel approach that operates in hyperbolic space to capture hierarchical relationships among images from seen categories. By leveraging pre-trained foundation models, HypDAE generates diverse new images for unseen categories with exceptional quality by varying stochastic subcodes or semantic codes. Most importantly, the hyperbolic representation introduces an additional degree of control over semantic diversity through the adjustment of radii within the hyperbolic disk. Extensive experiments and visualizations demonstrate that HypDAE significantly outperforms prior methods by achieving a better balance between preserving category-relevant features and promoting image diversity with limited data. Furthermore, HypDAE offers a highly controllable and interpretable generation process.
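
A hedged sketch of the radius-based control: a code in the Poincaré ball can be moved along the ray through the origin to a target radius, trading specificity (near the boundary) for semantic diversity (near the origin). The constants and the exact control rule are assumptions, not HypDAE's implementation.

```python
# Illustrative radius adjustment inside the Poincare ball (||z|| < 1).
import numpy as np

def adjust_radius(z, target_radius, eps=1e-5):
    """Move z to a new radius along the same ray from the origin
    (a geodesic through the origin in the Poincare ball)."""
    r = np.linalg.norm(z)
    if r < eps:
        return z
    return z * (target_radius / r)

z = np.random.randn(16)
z = z / (np.linalg.norm(z) + 1.0)          # squash inside the unit ball
generic = adjust_radius(z, 0.3)            # toward origin: broader semantics
specific = adjust_radius(z, 0.9)           # near boundary: finer-grained
```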

[183] DirectorLLM for Human-Centric Video Generation

Kunpeng Song, Tingbo Hou, Zecheng He, Haoyu Ma, Jialiang Wang, Animesh Sinha, Sam Tsai, Yaqiao Luo, Xiaoliang Dai, Li Chen, Xide Xia, Peizhao Zhang, Peter Vajda, Ahmed Elgammal, Felix Juefei-Xu

Main category: cs.CV

TL;DR: DirectorLLM uses LLMs as video directors to generate human pose instructions for more realistic text-to-video generation, outperforming existing methods in motion fidelity and prompt faithfulness.

DetailsMotivation: As text-to-video models advance, there's growing demand for high-quality human motion and interaction. Current models need better authenticity in human motions, so the authors aim to enhance realism by leveraging LLMs as motion directors.

Method: Extends LLMs from text generators to video directors and human motion simulators. Uses Llama 3 resources to train DirectorLLM to generate detailed human pose instructions that guide video generation. The LLM creates informative outlines for human-centric scenes, which are used as conditions by video renderers (UNet, DiT).

Result: Experiments show DirectorLLM outperforms existing models in generating videos with higher human motion fidelity, improved prompt faithfulness, and enhanced rendered subject naturalness in both automatic benchmarks and human evaluations.

Conclusion: DirectorLLM successfully transforms LLMs into effective video directors for human motion simulation, providing a flexible module that can work with different video renderers to produce more realistic and prompt-following human-centric videos.

Abstract: In this paper, we introduce DirectorLLM, a novel video generation model that employs a large language model (LLM) to orchestrate human poses within videos. As foundational text-to-video models rapidly evolve, the demand for high-quality human motion and interaction grows. To address this need and enhance the authenticity of human motions, we extend the LLM from a text generator to a video director and human motion simulator. Utilizing open-source resources from Llama 3, we train the DirectorLLM to generate detailed instructional signals, such as human poses, to guide video generation. This approach offloads the simulation of human motion from the video generator to the LLM, effectively creating informative outlines for human-centric scenes. These signals are used as conditions by the video renderer, facilitating more realistic and prompt-following video generation. As an independent LLM module, it can be applied to different video renderers, including UNet and DiT, with minimal effort. Experiments on automatic evaluation benchmarks and human evaluations show that our model outperforms existing ones in generating videos with higher human motion fidelity, improved prompt faithfulness, and enhanced rendered subject naturalness.

[184] Spoof Trace Discovery for Deep Learning Based Explainable Face Anti-Spoofing

Haoyuan Zhang, Xiangyu Zhu, Li Gao, Jiawei Pan, Kai Pang, Guoying Zhao, Zhen Lei

Main category: cs.CV

TL;DR: The paper proposes X-FAS (eXplainable Face Anti-Spoofing) to address the lack of explanation in existing face anti-spoofing systems, introducing SPTD method that discovers spoof concepts and provides reliable explanations.

DetailsMotivation: Current face anti-spoofing models only classify faces as fake without explaining why, which undermines trustworthiness and causes user confusion when requests are denied without justification.

Method: Proposes SPTD (Spoof Trace Discovery), an X-FAS method that discovers spoof concepts and provides explanations based on these discovered concepts. Also creates an X-FAS benchmark with expert-annotated spoof traces for evaluation.

Result: Experimental results demonstrate SPTD’s ability to generate reliable explanations on face anti-spoofing datasets, outperforming previous XAI methods both quantitatively and qualitatively on the proposed benchmark.

Conclusion: Incorporating XAI into face anti-spoofing through the X-FAS framework and SPTD method successfully addresses the explanation gap, providing trustworthy and understandable justifications for spoof detection decisions.

Abstract: With the rapidly growing use of face recognition in people’s daily life, face anti-spoofing becomes increasingly important for preventing malicious attacks. Recent face anti-spoofing models can reach a high classification accuracy on multiple datasets, but these models can only tell people “this face is fake” while lacking the explanation to answer “why it is fake”. Such a system undermines trustworthiness and causes user confusion, as it denies their requests without providing any explanations. In this paper, we incorporate XAI into face anti-spoofing and propose a new problem termed X-FAS (eXplainable Face Anti-Spoofing), empowering face anti-spoofing models to provide an explanation. We propose SPTD (SPoof Trace Discovery), an X-FAS method which can discover spoof concepts and provide reliable explanations on the basis of discovered concepts. To evaluate the quality of X-FAS methods, we propose an X-FAS benchmark with spoof traces annotated by experts. We analyze SPTD explanations on face anti-spoofing datasets and compare SPTD quantitatively and qualitatively with previous XAI methods on the proposed X-FAS benchmark. Experimental results demonstrate SPTD’s ability to generate reliable explanations.

[185] PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

Lingzhi Yuan, Xinfeng Li, Chejian Xu, Guanhong Tao, Xiaojun Jia, Yihao Huang, Wei Dong, Yang Liu, Bo Li

Main category: cs.CV

TL;DR: PromptGuard is a novel content moderation technique that uses optimized safety soft prompts to prevent text-to-image models from generating NSFW content while maintaining benign image quality and inference efficiency.

DetailsMotivation: Text-to-image models are vulnerable to misuse for generating NSFW content (sexually explicit, violent, political, disturbing images), raising serious ethical concerns that need to be addressed without compromising model performance.

Method: Optimizes a universal safety soft prompt that functions as an implicit system prompt within the T2I model’s textual embedding space, using a divide-and-conquer strategy with category-specific soft prompts combined into holistic safety guidance.

Result: Runs 3.8x faster than prior content moderation methods and reduces the unsafe generation ratio to 5.84%, effectively mitigating NSFW content while preserving high-quality benign outputs across five datasets.

Conclusion: PromptGuard provides an efficient and effective content moderation solution that doesn’t alter inference efficiency or require proxy models, outperforming eight state-of-the-art defenses in preventing unsafe image generation.

Abstract: Recent text-to-image (T2I) models have exhibited remarkable performance in generating high-quality images from text descriptions. However, these models are vulnerable to misuse, particularly generating not-safe-for-work (NSFW) content, such as sexually explicit, violent, political, and disturbing images, raising serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism in large language models (LLMs) for safety alignment. Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model’s textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without altering the inference efficiency or requiring proxy models. We further enhance its reliability and helpfulness through a divide-and-conquer strategy, which optimizes category-specific soft prompts and combines them into holistic safety guidance. Extensive experiments across five datasets demonstrate that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. PromptGuard is 3.8 times faster than prior content moderation methods, surpassing eight state-of-the-art defenses with an optimal unsafe ratio down to 5.84%.
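
A minimal sketch of optimizing a universal safety soft prompt in a frozen text encoder's embedding space follows. The stand-in encoder, the MSE objective pulling guarded embeddings toward a safe target, and all dimensions are illustrative assumptions rather than PromptGuard's actual models or training objective.

```python
# Hedged sketch: learn a soft prompt P* prepended to token embeddings.
import torch
import torch.nn.functional as F

embed_dim, n_soft_tokens = 768, 8
soft_prompt = torch.nn.Parameter(torch.randn(n_soft_tokens, embed_dim) * 0.02)
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)

def text_encoder(token_embeds):              # stand-in for the frozen T2I text encoder
    return token_embeds.mean(dim=0)

unsafe_embeds = torch.randn(12, embed_dim)   # toy embeddings of an NSFW prompt
safe_target = torch.randn(embed_dim)         # toy embedding of a safe rewrite

for step in range(100):
    optimizer.zero_grad()
    guarded = torch.cat([soft_prompt, unsafe_embeds], dim=0)  # prepend P*
    loss = F.mse_loss(text_encoder(guarded), safe_target)     # pull toward safety
    loss.backward()
    optimizer.step()                         # only the soft prompt is updated
```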

[186] Disentangled Clothed Avatar Generation with Layered Representation

Weitian Zhang, Yichao Yan, Sijing Wu, Manwen Liao, Xiaokang Yang

Main category: cs.CV

TL;DR: LayerAvatar is a feed-forward diffusion-based method for generating component-disentangled clothed avatars using layered UV feature plane representation, enabling high-resolution rendering and expressive animation.

DetailsMotivation: Previous methods struggled with generating avatars with disentangled components (body, hair, clothes), which is crucial for applications in virtual/augmented reality and filmmaking.

Method: Proposes a layered UV feature plane representation with semantic labels for different components, trained using a single-stage diffusion model with constraint terms to handle occlusion of the human body layer.

Result: Extensive experiments show impressive performance in generating disentangled clothed avatars and successful component transfer applications.

Conclusion: LayerAvatar successfully addresses the challenge of component-disentangled avatar generation and enables high-quality, animatable avatars with practical applications.

Abstract: Clothed avatar generation has wide applications in virtual and augmented reality, filmmaking, and more. Previous methods have achieved success in generating diverse digital avatars; however, generating avatars with disentangled components (e.g., body, hair, and clothes) has long been a challenge. In this paper, we propose LayerAvatar, the first feed-forward diffusion-based method for generating component-disentangled clothed avatars. To achieve this, we first propose a layered UV feature plane representation, where components are distributed in different layers of the Gaussian-based UV feature plane with corresponding semantic labels. This representation supports high-resolution and real-time rendering, as well as expressive animation including controllable gestures and facial expressions. Based on the well-designed representation, we train a single-stage diffusion model and introduce constraint terms to address the severe occlusion problem of the innermost human body layer. Extensive experiments demonstrate the impressive performance of our method in generating disentangled clothed avatars, and we further explore its applications in component transfer. The project page is available at: https://olivia23333.github.io/LayerAvatar/

[187] ActiveGAMER: Active GAussian Mapping through Efficient Rendering

Liyan Chen, Huangying Zhan, Kevin Chen, Xiangyu Xu, Qingan Yan, Changjiang Cai, Yi Xu

Main category: cs.CV

TL;DR: ActiveGAMER is an active mapping system using 3D Gaussian Splatting for real-time scene mapping and exploration, outperforming traditional NeRF-based methods with superior efficiency and reconstruction quality.

DetailsMotivation: Traditional NeRF-based active mapping methods are computationally demanding and limit performance. There's a need for more efficient approaches that can achieve high-quality real-time scene mapping and exploration in complex environments.

Method: Leverages 3D Gaussian Splatting for efficient rendering, uses a rendering-based information gain module for next-best-view planning, and integrates a balanced framework with coarse-to-fine exploration, post-refinement, and global-local keyframe selection.

Result: Achieves state-of-the-art geometric and photometric accuracy and completeness in autonomous environment exploration and reconstruction, significantly surpassing existing approaches on benchmark datasets like Replica and MP3D.

Conclusion: ActiveGAMER demonstrates that 3D Gaussian Splatting enables highly efficient and effective active mapping with superior reconstruction quality compared to traditional methods, making it suitable for real-time applications in complex environments.

Abstract: We introduce ActiveGAMER, an active mapping system that utilizes 3D Gaussian Splatting (3DGS) to achieve high-quality, real-time scene mapping and exploration. Unlike traditional NeRF-based methods, which are computationally demanding and restrict active mapping performance, our approach leverages the efficient rendering capabilities of 3DGS, allowing effective and efficient exploration in complex environments. The core of our system is a rendering-based information gain module that dynamically identifies the most informative viewpoints for next-best-view planning, enhancing both geometric and photometric reconstruction accuracy. ActiveGAMER also integrates a carefully balanced framework, combining coarse-to-fine exploration, post-refinement, and a global-local keyframe selection strategy to maximize reconstruction completeness and fidelity. Our system autonomously explores and reconstructs environments with state-of-the-art geometric and photometric accuracy and completeness, significantly surpassing existing approaches in both aspects. Extensive evaluations on benchmark datasets such as Replica and MP3D highlight ActiveGAMER’s effectiveness in active mapping tasks.

[188] 3D Densification for Multi-Map Monocular VSLAM in Endoscopy

X. Anadón, Javier Rodríguez-Puigvert, J. M. M. Montiel

Main category: cs.CV

TL;DR: A method to densify sparse monocular endoscopic SLAM maps by combining CudaSIFT-SLAM with NN LightDepth depth predictions, using LMedS for alignment and outlier filtering.

DetailsMotivation: Sparse multi-maps from monocular endoscopic SLAM are poor for environment representation - they are noisy, contain inaccurate 3D points with outliers, and have unacceptably low density for clinical applications.

Method: Combine sparse CudaSIFT-SLAM submaps with NN LightDepth dense depth predictions, align them using robust LMedS method to handle scale ambiguity and filter outliers.

Result: Achieves 4.15 mm RMS accuracy in densified maps on C3VD phantom colon dataset with affordable computing time, with qualitative results on real colonoscopy from Endomapper dataset.

Conclusion: The proposed system successfully mitigates scale ambiguity in monocular depth estimation while filtering outliers, producing reliable densified 3D maps suitable for clinical applications.

Abstract: Multi-map Sparse Monocular visual Simultaneous Localization and Mapping applied to monocular endoscopic sequences has proven efficient at robustly recovering tracking after the frequent losses in endoscopy due to motion blur, temporal occlusion, tool interaction, or water jets. The sparse multi-maps are adequate for robust camera localization, but they are very poor for environment representation: they are noisy, with a high percentage of inaccurately reconstructed 3D points, including significant outliers, and, more importantly, their density is unacceptably low for clinical applications. We propose a method to remove outliers and densify the maps of the state-of-the-art sparse endoscopy multi-map CudaSIFT-SLAM. Dense up-to-scale depth predictions from the LightDepth neural network are aligned with the sparse CudaSIFT submaps by means of LMedS, which is robust to spurious data. Our system mitigates the inherent scale ambiguity in monocular depth estimation while filtering outliers, leading to reliable densified 3D maps. We provide experimental evidence of accurate densified maps (4.15 mm RMS accuracy) at affordable computing time on the C3VD phantom colon dataset. We report qualitative results on real colonoscopy from the Endomapper dataset.
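
The LMedS alignment step can be sketched as follows: sample candidate scales from single sparse-point/predicted-depth ratios and keep the one minimizing the median squared residual, which tolerates up to roughly 50% outliers. This random-sampling variant is an assumption; the paper's exact estimator may differ.

```python
# Hedged sketch: LMedS scale alignment of up-to-scale depths to sparse SLAM points.
import numpy as np

def lmeds_scale(sparse_depth, pred_depth, n_trials=500, seed=0):
    rng = np.random.default_rng(seed)
    best_scale, best_med = 1.0, np.inf
    for _ in range(n_trials):
        i = rng.integers(len(sparse_depth))          # minimal sample: one point
        s = sparse_depth[i] / pred_depth[i]          # candidate scale
        med = np.median((sparse_depth - s * pred_depth) ** 2)
        if med < best_med:
            best_scale, best_med = s, med
    return best_scale

sparse = np.array([1.2, 2.5, 0.8, 30.0, 1.9])        # one gross outlier (30.0)
pred = np.array([0.6, 1.25, 0.4, 0.5, 0.95])         # up-to-scale depth predictions
print(lmeds_scale(sparse, pred))                     # ~2.0 despite the outlier
```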

[189] WikiAutoGen: Towards Multi-Modal Wikipedia-Style Article Generation

Zhongyu Yang, Jun Chen, Dannong Xu, Junjie Fei, Xiaoqian Shen, Liangbing Zhao, Chun-Mei Feng, Mohamed Elhoseiny

Main category: cs.CV

TL;DR: WikiAutoGen is a novel system for automated multimodal Wikipedia-style article generation that integrates both text and images, outperforming previous text-only methods by 8%-29% on a new benchmark.

DetailsMotivation: Traditional knowledge discovery requires significant human effort, and existing multi-agent frameworks focus only on text generation, overlooking the importance of multimodal content for informativeness and engagement.

Method: WikiAutoGen retrieves and integrates relevant images alongside text, and uses a multi-perspective self-reflection mechanism to critically assess retrieved content from diverse viewpoints for improved factual accuracy and comprehensiveness.

Result: Experimental results show WikiAutoGen outperforms previous methods by 8%-29% on the WikiSeek benchmark, producing more accurate, coherent, and visually enriched Wikipedia-style articles.

Conclusion: The system successfully automates multimodal Wikipedia-style article generation with enhanced reliability, breadth, and visual appeal, addressing limitations of text-only approaches.

Abstract: Knowledge discovery and collection are intelligence-intensive tasks that traditionally require significant human effort to ensure high-quality outputs. Recent research has explored multi-agent frameworks for automating Wikipedia-style article generation by retrieving and synthesizing information from the internet. However, these methods primarily focus on text-only generation, overlooking the importance of multimodal content in enhancing informativeness and engagement. In this work, we introduce WikiAutoGen, a novel system for automated multimodal Wikipedia-style article generation. Unlike prior approaches, WikiAutoGen retrieves and integrates relevant images alongside text, enriching both the depth and visual appeal of generated content. To further improve factual accuracy and comprehensiveness, we propose a multi-perspective self-reflection mechanism, which critically assesses retrieved content from diverse viewpoints to enhance reliability, breadth, and coherence, etc. Additionally, we introduce WikiSeek, a benchmark comprising Wikipedia articles with topics paired with both textual and image-based representations, designed to evaluate multimodal knowledge generation on more challenging topics. Experimental results show that WikiAutoGen outperforms previous methods by 8%-29% on our WikiSeek benchmark, producing more accurate, coherent, and visually enriched Wikipedia-style articles. Our code and examples are available at https://wikiautogen.github.io/

[190] Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs

Zitian Wang, Yue Liao, Kang Rong, Fengyun Rao, Yibo Yang, Si Liu

Main category: cs.CV

TL;DR: IPA is a scalable preference alignment framework that automatically constructs instruction-oriented preferences to enhance multimodal comprehension in MLLMs, going beyond hallucination mitigation.

DetailsMotivation: Existing preference alignment methods focus mainly on hallucination factors but overlook essential multimodal comprehension capabilities, limiting their effectiveness in improving overall instruction fulfillment.

Method: Automated preference construction with verification process to identify instruction-oriented factors, plus progressive preference collection pipeline using model self-evolution and reference-guided refinement.

Result: Experiments on Qwen2VL-7B show effectiveness across multiple benchmarks including hallucination evaluation, visual question answering, and text understanding tasks.

Conclusion: IPA successfully enhances general comprehension capabilities in MLLMs by focusing on instruction fulfillment efficacy rather than just hallucination mitigation.

Abstract: Preference alignment has emerged as an effective strategy to enhance the performance of Multimodal Large Language Models (MLLMs) following supervised fine-tuning. While existing preference alignment methods predominantly target hallucination factors, they overlook the factors essential for multi-modal comprehension capabilities, often narrowing their improvements on hallucination mitigation. To bridge this gap, we propose Instruction-oriented Preference Alignment (IPA), a scalable framework designed to automatically construct alignment preferences grounded in instruction fulfillment efficacy. Our method involves an automated preference construction coupled with a dedicated verification process that identifies instruction-oriented factors, avoiding significant variability in response representations. Additionally, IPA incorporates a progressive preference collection pipeline, further recalling challenging samples through model self-evolution and reference-guided refinement. Experiments conducted on Qwen2VL-7B demonstrate IPA’s effectiveness across multiple benchmarks, including hallucination evaluation, visual question answering, and text understanding tasks, highlighting its capability to enhance general comprehension.

[191] RailGoerl24: Görlitz Rail Test Center CV Dataset 2024

Rustam Tagiew, Ilkay Wunderlich, Mark Sastuba, Kilian Göller, Steffen Seitz

Main category: cs.CV

TL;DR: RailGoerl24 dataset with 12,205 frames and 33,556 person annotations for developing driverless train obstacle detection systems

DetailsMotivation: Lack of sufficient publicly available railway-specific datasets for training machine learning algorithms in obstacle detection, especially for human detection in railway environments

Method: Created a visual dataset using on-board Full HD cameras at a railway test center, including terrestrial LiDAR scans and comprehensive boxwise annotations for person class

Result: Produced RailGoerl24 dataset with 12,205 frames and 33,556 person annotations, available publicly for research and development

Conclusion: RailGoerl24 addresses the data scarcity problem in railway obstacle detection and supports development of driverless train operation systems beyond just collision prediction tasks

Abstract: Driverless train operation for open tracks on urban guided transport and mainline railways requires, among other things, automatic detection of actual and potential obstacles, especially humans, in the danger zone of the train’s path. Machine learning algorithms have proven to be powerful state-of-the-art tools for this task. However, these algorithms require large amounts of high-quality annotated data containing human beings in railway-specific environments as training data. Unfortunately, the amount of publicly available datasets is not yet sufficient and is significantly inferior to the datasets in the road domain. Therefore, this paper presents RailGoerl24, an on-board visible-light Full HD camera dataset of 12,205 frames recorded in a railway test center of TÜV SÜD Rail in Görlitz, Germany. Its main purpose is to support the development of driverless train operation for guided transport. RailGoerl24 also includes a terrestrial LiDAR scan covering parts of the area used to acquire the RGB data. In addition to the raw data, the dataset contains 33,556 boxwise annotations in total for the object class ‘person’. The faces of recorded actors are not blurred or altered in any other way. RailGoerl24, available at data.fid-move.de/dataset/railgoerl24, can also be used for tasks beyond collision prediction.

[192] Histo-Miner: Deep Learning based Tissue Features Extraction Pipeline from H&E Whole Slide Images of Cutaneous Squamous Cell Carcinoma

Lucas Sancéré, Carina Lorenz, Doris Helbig, Oana-Diana Persa, Sonja Dengler, Alexander Kreuter, Martim Laimer, Anne Fröhlich, Jennifer Landsberg, Johannes Brägelmann, Katarzyna Bozek

Main category: cs.CV

TL;DR: Histo-Miner is a deep learning pipeline for skin tissue analysis that generates two annotated datasets and achieves state-of-the-art performance in nucleus segmentation/classification and tumor segmentation, with applications in predicting immunotherapy response.

DetailsMotivation: There is a lack of labeled datasets and open source pipelines specifically tailored for analysis of skin tissue in digital pathology, particularly for cutaneous squamous cell carcinoma.

Method: Developed a deep learning pipeline using convolutional neural networks and vision transformers for nucleus segmentation/classification and tumor region segmentation on two datasets (47,392 annotated cell nuclei and 144 tumor-segmented WSIs from cSCC patients).

Result: Achieved mPQ of 0.569 for nucleus segmentation, macro-averaged F1 of 0.832 for nucleus classification, and mIoU of 0.884 for tumor segmentation. Successfully predicted immunotherapy response in 45 patients using tissue morphology features.

Conclusion: Histo-Miner provides an effective tool for skin tissue analysis with clinical applicability, offering interpretable features for therapy response prediction and biological insights.

Abstract: Recent advancements in digital pathology have enabled comprehensive analysis of Whole-Slide Images (WSI) from tissue samples, leveraging high-resolution microscopy and computational capabilities. Despite this progress, there is a lack of labeled datasets and open source pipelines specifically tailored for analysis of skin tissue. Here we propose Histo-Miner, a deep learning-based pipeline for analysis of skin WSIs and generate two datasets with labeled nuclei and tumor regions. We develop our pipeline for the analysis of patient samples of cutaneous squamous cell carcinoma (cSCC), a frequent non-melanoma skin cancer. Utilizing the two datasets, comprising 47,392 annotated cell nuclei and 144 tumor-segmented WSIs respectively, both from cSCC patients, Histo-Miner employs convolutional neural networks and vision transformers for nucleus segmentation and classification as well as tumor region segmentation. Performance of trained models positively compares to state of the art with multi-class Panoptic Quality (mPQ) of 0.569 for nucleus segmentation, macro-averaged F1 of 0.832 for nucleus classification and mean Intersection over Union (mIoU) of 0.884 for tumor region segmentation. From these predictions we generate a compact feature vector summarizing tissue morphology and cellular interactions, which can be used for various downstream tasks. Here, we use Histo-Miner to predict cSCC patient response to immunotherapy based on pre-treatment WSIs from 45 patients. Histo-Miner identifies percentages of lymphocytes, the granulocyte to lymphocyte ratio in tumor vicinity and the distances between granulocytes and plasma cells in tumors as predictive features for therapy response. This highlights the applicability of Histo-Miner to clinically relevant scenarios, providing direct interpretation of the classification and insights into the underlying biology.

[193] Unlocking Smarter Device Control: Foresighted Planning with a World Model-Driven Code Execution Approach

Xiaoran Yin, Xu Luo, Hao Wu, Lianli Gao, Jingkuan Song

Main category: cs.CV

TL;DR: FPWC framework uses world modeling and iterative planning with code execution to improve mobile device automation, achieving 44.4% higher success rate than state-of-the-art methods.

DetailsMotivation: Current reactive policies for mobile device automation focus only on immediate visual observations, leading to suboptimal decision-making in multi-step tasks due to limited environmental information.

Method: Proposes Foresighted Planning with World Model-Driven Code Execution (FPWC) - develops a task-oriented, refinable world model at task outset, then generates foresighted actions through iterative planning within this model, executed as executable code.

Result: Extensive experiments in simulated environments and real mobile devices show FPWC outperforms previous approaches, achieving 44.4% relative improvement in task success rate compared to state-of-the-art in simulated environment.

Conclusion: The FPWC framework successfully addresses limitations of reactive policies by enhancing global environmental understanding through world modeling and structured reasoning, significantly improving mobile automation performance.

Abstract: The automatic control of mobile devices is essential for efficiently performing complex tasks that involve multiple sequential steps. However, these tasks pose significant challenges due to the limited environmental information available at each step, primarily through visual observations. As a result, current approaches, which typically rely on reactive policies, focus solely on immediate observations and often lead to suboptimal decision-making. To address this problem, we propose Foresighted Planning with World Model-Driven Code Execution (FPWC), a framework that prioritizes natural language understanding and structured reasoning to enhance the agent’s global understanding of the environment by developing a task-oriented, refinable world model at the outset of the task. Foresighted actions are subsequently generated through iterative planning within this world model, executed in the form of executable code. Extensive experiments conducted in simulated environments and on real mobile devices demonstrate that our method outperforms previous approaches, particularly achieving a 44.4% relative improvement in task success rate compared to the state-of-the-art in the simulated environment. Code and demo are provided in the supplementary material.

[194] High-resolution efficient image generation from WiFi CSI using a pretrained latent diffusion model

Eshan Ramesh, Takayuki Nishio

Main category: cs.CV

TL;DR: LatentCSI uses WiFi CSI measurements with a pretrained latent diffusion model to generate high-quality environmental images, outperforming traditional methods in efficiency and quality while enabling text-guided control.

DetailsMotivation: To overcome the computational complexity and limitations of prior GAN-based approaches for generating images from WiFi CSI measurements, seeking a more efficient and high-quality solution.

Method: Uses a lightweight neural network to map CSI amplitudes directly into the latent space of a pretrained LDM, then applies denoising diffusion with text guidance before decoding to obtain high-resolution images.

Result: Outperforms comparable baselines trained on ground-truth images in both computational efficiency and perceptual quality, while providing text-guided controllability.

Conclusion: LatentCSI provides an efficient, high-quality approach for WiFi CSI-to-image generation that bypasses pixel-space challenges and offers practical advantages through text-based guidance.

Abstract: We present LatentCSI, a novel method for generating images of the physical environment from WiFi CSI measurements that leverages a pretrained latent diffusion model (LDM). Unlike prior approaches that rely on complex and computationally intensive techniques such as GANs, our method employs a lightweight neural network to map CSI amplitudes directly into the latent space of an LDM. We then apply the LDM’s denoising diffusion model to the latent representation with text-based guidance before decoding using the LDM’s pretrained decoder to obtain a high-resolution image. This design bypasses the challenges of pixel-space image generation and avoids the explicit image encoding stage typically required in conventional image-to-image pipelines, enabling efficient and high-quality image synthesis. We validate our approach on two datasets: a wide-band CSI dataset we collected with off-the-shelf WiFi devices and cameras; and a subset of the publicly available MM-Fi dataset. The results demonstrate that LatentCSI outperforms baselines of comparable complexity trained directly on ground-truth images in both computational efficiency and perceptual quality, while additionally providing practical advantages through its unique capacity for text-guided controllability.
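
The key component is the lightweight mapping from CSI amplitudes to the LDM latent; a sketch is below. The subcarrier count, hidden width, and 4x64x64 latent shape (typical of Stable-Diffusion-style models) are assumptions, and the denoising and decoding stages are omitted.

```python
# Hedged sketch: lightweight CSI-amplitude -> LDM-latent regressor (dims assumed).
import torch
import torch.nn as nn

class CSIToLatent(nn.Module):
    def __init__(self, n_subcarriers=2048, latent_shape=(4, 64, 64)):
        super().__init__()
        self.latent_shape = latent_shape
        out = latent_shape[0] * latent_shape[1] * latent_shape[2]
        self.net = nn.Sequential(
            nn.Linear(n_subcarriers, 1024), nn.ReLU(),
            nn.Linear(1024, out),
        )

    def forward(self, csi_amplitude):            # (B, n_subcarriers)
        z = self.net(csi_amplitude)
        return z.view(-1, *self.latent_shape)    # (B, 4, 64, 64) latent

model = CSIToLatent()
latent = model(torch.randn(2, 2048))  # next: text-guided denoising, then decoding
```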

[195] Auto-Connect: Connectivity-Preserving RigFormer with Direct Preference Optimization

Jingfeng Guo, Jian Liu, Jinnan Chen, Shiwei Mao, Changrong Hu, Puhua Jiang, Junlin Yu, Jing Xu, Qi Liu, Lixin Xu, Zhuo Chen, Chunchao Guo

Main category: cs.CV

TL;DR: Auto-Connect introduces connectivity-preserving tokenization for automatic rigging, ensuring skeletal connectivity through special tokens, topology-aware rewards, and geodesic features for improved skinning quality.

DetailsMotivation: Previous automatic rigging methods either predict bone positions as two joints or first predict points before determining connectivity, which can lead to topological inaccuracies and poor skeletal structures.

Method: Uses connectivity-preserving tokenization with special tokens for endpoints, topology-aware reward function with Direct Preference Optimization, and implicit geodesic features for latent top-k bone selection to improve skinning quality.

Result: The approach significantly enhances topological accuracy, generates anatomically plausible skeletal structures with superior deformation properties, and effectively mitigates common skinning artifacts.

Conclusion: Auto-Connect’s combination of connectivity-preserving tokenization, reward-guided fine-tuning, and geodesic-aware bone selection enables consistent generation of high-quality skeletal rigs with improved topological correctness.

Abstract: We introduce Auto-Connect, a novel approach for automatic rigging that explicitly preserves skeletal connectivity through a connectivity-preserving tokenization scheme. Unlike previous methods that predict bone positions represented as two joints or first predict points before determining connectivity, our method employs special tokens to define endpoints for each joint’s children and for each hierarchical layer, effectively automating connectivity relationships. This approach significantly enhances topological accuracy by integrating connectivity information directly into the prediction framework. To further guarantee high-quality topology, we implement a topology-aware reward function that quantifies topological correctness, which is then utilized in a post-training phase through reward-guided Direct Preference Optimization. Additionally, we incorporate implicit geodesic features for latent top-k bone selection, which substantially improves skinning quality. By leveraging geodesic distance information within the model’s latent space, our approach intelligently determines the most influential bones for each vertex, effectively mitigating common skinning artifacts. This combination of connectivity-preserving tokenization, reward-guided fine-tuning, and geodesic-aware bone selection enables our model to consistently generate more anatomically plausible skeletal structures with superior deformation properties.
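
To illustrate connectivity-preserving tokenization, the sketch below serializes a joint tree depth-first and closes each joint's child list with a special end token, so the tree structure is fully recoverable from the flat sequence. Token names and the toy rig are made up, not Auto-Connect's vocabulary.

```python
# Hedged sketch: endpoint tokens make skeletal connectivity explicit in the sequence.
END_CHILDREN = "<eoc>"

def tokenize_skeleton(joint, children):
    """joint: joint id; children: dict mapping joint id -> list of child ids."""
    tokens = [f"J{joint}"]
    for c in children.get(joint, []):
        tokens += tokenize_skeleton(c, children)
    tokens.append(END_CHILDREN)          # closes this joint's child list
    return tokens

children = {0: [1, 4], 1: [2], 2: [3], 4: [5]}   # toy rig: root with two chains
print(tokenize_skeleton(0, children))
# ['J0', 'J1', 'J2', 'J3', '<eoc>', '<eoc>', '<eoc>', 'J4', 'J5', '<eoc>', '<eoc>', '<eoc>']
```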

[196] YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception

Mengqi Lei, Siqi Li, Yihong Wu, Han Hu, You Zhou, Xinhu Zheng, Guiguang Ding, Shaoyi Du, Zongze Wu, Yue Gao

Main category: cs.CV

TL;DR: YOLOv13 introduces HyperACE for global high-order correlations and FullPAD for full-pipeline feature distribution, achieving SOTA performance with fewer parameters.

DetailsMotivation: Previous YOLO models lack global multi-to-multi high-order correlation modeling, limiting detection performance in complex scenarios.

Method: Hypergraph-based Adaptive Correlation Enhancement (HyperACE) mechanism, Full-Pipeline Aggregation-and-Distribution (FullPAD) paradigm, and depthwise separable convolutions to replace vanilla large-kernel convolutions.

Result: Achieves state-of-the-art performance on MS COCO with fewer parameters and FLOPs. YOLOv13-N improves mAP by 3.0% over YOLO11-N and 1.5% over YOLOv12-N.

Conclusion: YOLOv13 provides an accurate and lightweight object detector that effectively captures global high-order correlations while maintaining computational efficiency.

Abstract: The YOLO series models reign supreme in real-time object detection due to their superior accuracy and computational efficiency. However, both the convolutional architectures of YOLO11 and earlier versions and the area-based self-attention mechanism introduced in YOLOv12 are limited to local information aggregation and pairwise correlation modeling, lacking the capability to capture global multi-to-multi high-order correlations, which limits detection performance in complex scenarios. In this paper, we propose YOLOv13, an accurate and lightweight object detector. To address the above-mentioned challenges, we propose a Hypergraph-based Adaptive Correlation Enhancement (HyperACE) mechanism that adaptively exploits latent high-order correlations and overcomes the limitation of previous methods that are restricted to pairwise correlation modeling based on hypergraph computation, achieving efficient global cross-location and cross-scale feature fusion and enhancement. Subsequently, we propose a Full-Pipeline Aggregation-and-Distribution (FullPAD) paradigm based on HyperACE, which effectively achieves fine-grained information flow and representation synergy within the entire network by distributing correlation-enhanced features to the full pipeline. Finally, we propose to leverage depthwise separable convolutions to replace vanilla large-kernel convolutions, and design a series of blocks that significantly reduce parameters and computational complexity without sacrificing performance. We conduct extensive experiments on the widely used MS COCO benchmark, and the experimental results demonstrate that our method achieves state-of-the-art performance with fewer parameters and FLOPs. Specifically, our YOLOv13-N improves mAP by 3.0% over YOLO11-N and by 1.5% over YOLOv12-N. The code and models of our YOLOv13 model are available at: https://github.com/iMoonLab/yolov13.
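
The depthwise separable replacement for a vanilla large-kernel convolution is standard and easy to sketch: a KxK depthwise convolution followed by a 1x1 pointwise convolution, cutting parameters from c_in*c_out*K^2 to c_in*K^2 + c_in*c_out. The normalization and activation choices below are assumptions about the block design, not YOLOv13's exact layers.

```python
# Sketch of a depthwise separable block replacing a large-kernel conv.
import torch
import torch.nn as nn

class DSConv(nn.Module):
    def __init__(self, c_in, c_out, k=7):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

x = torch.randn(1, 64, 80, 80)
print(DSConv(64, 128)(x).shape)  # torch.Size([1, 128, 80, 80])
# Parameters: 64*7*7 + 64*128 (separable) vs. 64*128*7*7 (vanilla 7x7 conv).
```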

[197] Colorectal Cancer Tumor Grade Segmentation in Digital Histopathology Images: From Giga to Mini Challenge

Alper Bahcekapili, Duygu Arslan, Umut Ozdemir, Berkay Ozkirli, Emre Akbas, Ahmet Acar, Gozde B. Akar, Bingdou He, Shuoyu Xu, Umit Mert Caglar, Alptekin Temizel, Guillaume Picaud, Marc Chaumont, Gérard Subsol, Luc Téot, Fahad Alsharekh, Shahad Alghannam, Hexiang Mao, Wenhua Zhang

Main category: cs.CV

TL;DR: ICIP Grand Challenge on Colorectal Cancer Tumor Grading and Segmentation using METU CCTGS dataset with 103 whole-slide images and expert annotations, where 6 out of 39 teams outperformed Swin Transformer baseline.

DetailsMotivation: Address subjective histopathological grading of colorectal cancer prone to observer variability and global shortage of trained pathologists by promoting automated and standardized solutions.

Method: Organized challenge using publicly available METU CCTGS dataset with expert pixel-level annotations for five tissue classes, evaluated submissions using metrics like macro F-score and mIoU.

Result: Six teams outperformed the Swin Transformer baseline (62.92 F-score) among 39 participating teams.

Conclusion: The challenge successfully promoted automated solutions for colorectal cancer tumor grading and segmentation, demonstrating improved performance over baseline methods.

Abstract: Colorectal cancer (CRC) is the third most diagnosed cancer and the second leading cause of cancer-related death worldwide. Accurate histopathological grading of CRC is essential for prognosis and treatment planning but remains a subjective process prone to observer variability and limited by global shortages of trained pathologists. To promote automated and standardized solutions, we organized the ICIP Grand Challenge on Colorectal Cancer Tumor Grading and Segmentation using the publicly available METU CCTGS dataset. The dataset comprises 103 whole-slide images with expert pixel-level annotations for five tissue classes. Participants submitted segmentation masks via Codalab, evaluated using metrics such as macro F-score and mIoU. Among 39 participating teams, six outperformed the Swin Transformer baseline (62.92 F-score). This paper presents an overview of the challenge, dataset, and the top-performing methods.
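
For reference, the two headline metrics can both be computed from a single confusion matrix. A minimal sketch follows; the challenge's exact averaging protocol (per-image vs. aggregate) is an assumption here.

```python
import numpy as np

def macro_f1_miou(pred, gt, num_classes):
    """Macro F-score and mean IoU from integer label masks."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt.ravel(), pred.ravel()), 1)
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp   # predicted as class c but wrong
    fn = conf.sum(axis=1) - tp   # class c pixels missed
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    iou = tp / np.maximum(tp + fp + fn, 1)
    return f1.mean(), iou.mean()

gt = np.random.randint(0, 5, (64, 64))
pred = np.random.randint(0, 5, (64, 64))
print(macro_f1_miou(pred, gt, 5))
```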

[198] Empowering Bridge Digital Twins by Bridging the Data Gap with a Unified Synthesis Framework

Wang Wang, Mingyu Shi, Jun Jiang, Wenqian Ma, Chong Liu, Yasutaka Narazaki, Xuguang Wang

Main category: cs.CV

TL;DR: A framework for generating synthetic 3D bridge point clouds with complete annotations to overcome data incompleteness issues in real-world bridge inspection, achieving 84.2% mIoU in semantic segmentation.

DetailsMotivation: Traditional manual bridge inspection methods are inefficient, and real-world 3D point cloud data suffers from incompleteness due to missing labels and scanning occlusions, limiting the application of data-driven approaches.

Method: Proposes a systematic framework that automatically generates complete 3D point clouds with component-level instance annotations, high-fidelity color, and precise normal vectors. Can also simulate diverse and physically realistic incomplete point clouds for training segmentation and completion networks.

Result: PointNet++ model trained with synthetic data achieves 84.2% mIoU in real-world bridge semantic segmentation. Fine-tuned KT-Net shows superior performance on component completion tasks.

Conclusion: The research provides an innovative methodology and foundational dataset for 3D visual analysis of bridge structures, significantly advancing automated infrastructure management and maintenance.

Abstract: As critical transportation infrastructure, bridges face escalating challenges from aging and deterioration, while traditional manual inspection methods suffer from low efficiency. Although 3D point cloud technology provides a new data-driven paradigm, its application potential is often constrained by the incompleteness of real-world data, which results from missing labels and scanning occlusions. To overcome the bottleneck of insufficient generalization in existing synthetic data methods, this paper proposes a systematic framework for generating 3D bridge data. This framework can automatically generate complete point clouds featuring component-level instance annotations, high-fidelity color, and precise normal vectors. It can be further extended to simulate the creation of diverse and physically realistic incomplete point clouds, designed to support the training of segmentation and completion networks, respectively. Experiments demonstrate that a PointNet++ model trained with our synthetic data achieves a mean Intersection over Union (mIoU) of 84.2% in real-world bridge semantic segmentation. Concurrently, a fine-tuned KT-Net exhibits superior performance on the component completion task. This research offers an innovative methodology and a foundational dataset for the 3D visual analysis of bridge structures, holding significant implications for advancing the automated management and maintenance of infrastructure.

[199] BayesSDF: Surface-Based Laplacian Uncertainty Estimation for 3D Geometry with Neural Signed Distance Fields

Rushil Desai

Main category: cs.CV

TL;DR: BayesSDF is a probabilistic framework for uncertainty estimation in neural implicit 3D representations using Signed Distance Functions, addressing the lack of principled uncertainty quantification in current models.

DetailsMotivation: Current neural implicit surface models lack principled uncertainty quantification, limiting their reliability in real-world applications where accurate surface estimation is critical for scientific simulation downstream tasks.

Method: BayesSDF applies a Laplace approximation over SDF weights and derives Hessian-based metrics to estimate local geometric instability, leveraging the continuous and differentiable properties of Signed Distance Functions.

Result: Empirical demonstrations show that BayesSDF’s uncertainty estimates correlate strongly with surface reconstruction error across both synthetic and real-world benchmarks.

Conclusion: BayesSDF enables surface-aware uncertainty quantification, laying the groundwork for more robust, interpretable, and actionable 3D perception systems.

Abstract: Accurate surface estimation is critical for downstream tasks in scientific simulation, and quantifying uncertainty in implicit neural 3D representations still remains a substantial challenge due to computational inefficiencies, scalability issues, and geometric inconsistencies. However, current neural implicit surface models do not offer a principled way to quantify uncertainty, limiting their reliability in real-world applications. Inspired by recent probabilistic rendering approaches, we introduce BayesSDF, a novel probabilistic framework for uncertainty estimation in neural implicit 3D representations. Unlike radiance-based models such as Neural Radiance Fields (NeRF) or 3D Gaussian Splatting, Signed Distance Functions (SDFs) provide continuous, differentiable surface representations, making them especially well-suited for uncertainty-aware modeling. BayesSDF applies a Laplace approximation over SDF weights and derives Hessian-based metrics to estimate local geometric instability. We empirically demonstrate that these uncertainty estimates correlate strongly with surface reconstruction error across both synthetic and real-world benchmarks. By enabling surface-aware uncertainty quantification, BayesSDF lays the groundwork for more robust, interpretable, and actionable 3D perception systems.
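
A rough sketch of the general recipe, assuming a diagonal Laplace approximation with a squared-gradient (Fisher/GGN-style) proxy for the Hessian and a delta-method predictive variance; the paper's actual Hessian-based metrics may differ in form.

```python
import torch
import torch.nn as nn

# Tiny SDF MLP; weights are assumed to be the trained (MAP) estimate.
sdf = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 1))

def diag_laplace_variance(sdf, train_pts, query, prior_prec=1.0):
    """Delta-method variance of the SDF at `query` under a diagonal
    Laplace approximation over the network weights."""
    params = list(sdf.parameters())
    # Diagonal precision: prior plus sum of squared per-point gradients.
    prec = [torch.full_like(p, prior_prec) for p in params]
    for x in train_pts:
        sdf.zero_grad()
        sdf(x).sum().backward()
        for h, p in zip(prec, params):
            h += p.grad ** 2
    sdf.zero_grad()
    sdf(query).sum().backward()
    # var f(x) ~= g^T Sigma g with Sigma = diag(1 / prec).
    return sum((p.grad ** 2 / h).sum() for p, h in zip(params, prec))

pts = torch.randn(32, 3)
print(diag_laplace_variance(sdf, pts, torch.zeros(3)))
```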

[200] Beyond the Linear Separability Ceiling: Aligning Representations in VLMs

Enrico Vompa, Tanel Tammet, Mohit Vaishnav

Main category: cs.CV

TL;DR: VLMs struggle with abstract reasoning due to an ‘alignment gap’ where they fail to outperform linear separability of their own visual embeddings. A contrastive fine-tuning method activates dormant reasoning pathways to surpass this limitation.

DetailsMotivation: To determine whether VLM failures on abstract reasoning tasks stem from flawed perception or faulty reasoning, and to address the performance gap between generative capabilities and linear separability of visual representations.

Method: Introduces Linear Separability Ceiling (LSC) diagnostic framework and augments standard next-token prediction with contrastive objective during fine-tuning to improve linear structure of representations.

Result: Most state-of-the-art VLMs fail to generatively outperform linear separability of their own representations. The proposed contrastive fine-tuning method systematically improves representation structure to significantly surpass LSC.

Conclusion: The alignment gap is not a fundamental limitation but a solvable issue. Activating dormant reasoning pathways through contrastive objectives can bridge the gap between perception and reasoning in VLMs.

Abstract: A challenge in advancing Visual-Language Models (VLMs) is determining whether their failures on abstract reasoning tasks, such as Bongard problems, stem from flawed perception or faulty top-down reasoning. To disentangle these factors, we introduce a diagnostic framework centered on the Linear Separability Ceiling (LSC), the performance achievable by a linear classifier on a VLM’s raw visual embeddings. Applying this framework to state-of-the-art VLMs, we uncover a pervasive “alignment gap”, where most models fail to generatively outperform the linear separability of their own representations. We find that the few models surpassing this ceiling do so via two mechanisms: by further refining visual representations into a more linearly separable format or by executing non-linear decision logic. We demonstrate that this bottleneck is not a fundamental limitation but a solvable alignment issue. By augmenting standard next-token prediction with a contrastive objective, our fine-tuning method activates dormant reasoning pathways, systematically improving the linear structure of representations to significantly surpass the LSC.
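
The LSC itself is easy to reproduce in spirit: fit a linear probe on frozen visual embeddings and compare its accuracy against the model's generative accuracy. A minimal scikit-learn sketch, where the probe type and the train/test split are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_separability_ceiling(embeddings, labels):
    """Accuracy of a linear probe on frozen visual embeddings (the LSC).
    A VLM whose generative accuracy falls below this number is losing
    linearly decodable information somewhere after the encoder."""
    tr_x, te_x, tr_y, te_y = train_test_split(
        embeddings, labels, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=2000).fit(tr_x, tr_y)
    return probe.score(te_x, te_y)

# Toy stand-in for pooled VLM embeddings of Bongard-style problems.
emb = np.random.randn(400, 256)
lab = np.random.randint(0, 2, 400)
print(linear_separability_ceiling(emb, lab))
```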

[201] RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images

Xiaozheng Jiang, Wei Zhang, Xuerui Mao

Main category: cs.CV

TL;DR: RS-TinyNet is a novel multi-stage feature fusion model designed specifically for tiny object detection in remote sensing imagery, achieving state-of-the-art performance with 4.0% AP and 6.5% AP75 improvements.

DetailsMotivation: Tiny object detection in remote sensing faces challenges due to limited spatial information, weak feature representations, and dense distributions in complex backgrounds, where mainstream detectors underperform.

Method: Proposes RS-TinyNet with two novel designs: tiny object saliency modeling and feature integrity reconstruction. Includes three modules: multi-dimensional collaborative attention (MDCA) for saliency enhancement, auxiliary reversible branch (ARB) for information preservation, and progressive fusion detection head (PFDH) for multi-level feature fusion.

Result: Achieves 4.0% AP and 6.5% AP75 improvements over SOTA detectors on AI-TOD dataset. Superior performance validated on DIOR benchmark across diverse remote sensing scenarios.

Conclusion: The multi-stage feature fusion strategy provides an effective and practical solution for tiny object detection in complex remote sensing environments.

Abstract: Detecting tiny objects in remote sensing (RS) imagery has been a long-standing challenge due to their extremely limited spatial information, weak feature representations, and dense distributions across complex backgrounds. Despite numerous efforts devoted, mainstream detectors still underperform in such scenarios. To bridge this gap, we introduce RS-TinyNet, a multi-stage feature fusion and enhancement model explicitly tailored for RS tiny object detection in various RS scenarios. RS-TinyNet comes with two novel designs: tiny object saliency modeling and feature integrity reconstruction. Guided by these principles, we design three step-wise feature enhancement modules. Among them, the multi-dimensional collaborative attention (MDCA) module employs multi-dimensional attention to enhance the saliency of tiny objects. Additionally, the auxiliary reversible branch (ARB) and a progressive fusion detection head (PFDH) module are introduced to preserve information flow and fuse multi-level features to bridge semantic gaps and retain structural detail. Comprehensive experiments on public RS dataset AI-TOD show that our RS-TinyNet surpasses existing state-of-the-art (SOTA) detectors by 4.0% AP and 6.5% AP75. Evaluations on DIOR benchmark dataset further validate its superior detection performance in diverse RS scenarios. These results demonstrate that the proposed multi-stage feature fusion strategy offers an effective and practical solution for tiny object detection in complex RS environments.

[202] 3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection

Yung-Hsu Yang, Luigi Piccinelli, Mattia Segu, Siyuan Li, Rui Huang, Yuqian Fu, Marc Pollefeys, Hermann Blum, Zuria Bauer

Main category: cs.CV

TL;DR: First end-to-end 3D monocular open-set object detector that generalizes to new environments and object categories, achieving SOTA results on both closed-set and open-set benchmarks.

DetailsMotivation: Existing monocular 3D object detection methods are limited to closed-set settings with same scenes/object categories, but real-world applications require handling novel environments and categories.

Method: Lifts open-set 2D detection to 3D space using designed 3D bounding box head, conditions object queries with geometry prior, and uses canonical image space for efficient cross-dataset training.

Result: Achieves new state-of-the-art results on both closed-set (Omni3D) and open-set (Omni3D to Argoverse 2, ScanNet) settings.

Conclusion: 3D-MOOD successfully addresses monocular 3D object detection in open-set scenarios, enabling generalization to diverse real-world applications with novel environments and object categories.

Abstract: Monocular 3D object detection is valuable for various applications such as robotics and AR/VR. Existing methods are confined to closed-set settings, where the training and testing sets consist of the same scenes and/or object categories. However, real-world applications often introduce new environments and novel object categories, posing a challenge to these methods. In this paper, we address monocular 3D object detection in an open-set setting and introduce the first end-to-end 3D Monocular Open-set Object Detector (3D-MOOD). We propose to lift the open-set 2D detection into 3D space through our designed 3D bounding box head, enabling end-to-end joint training for both 2D and 3D tasks to yield better overall performance. We condition the object queries on a geometry prior and overcome the generalization challenge for 3D estimation across diverse scenes. To further improve performance, we design the canonical image space for more efficient cross-dataset training. We evaluate 3D-MOOD on both closed-set settings (Omni3D) and open-set settings (Omni3D to Argoverse 2, ScanNet), and achieve new state-of-the-art results. Code and models are available at royyang0714.github.io/3D-MOOD.

[203] GCRPNet: Graph-Enhanced Contextual and Regional Perception Network For Salient Object Detection in Optical Remote Sensing Images

Mengyu Ren, Yutong Li, Hua Li, Runmin Cong, Sam Kwong

Main category: cs.CV

TL;DR: GCRPNet - a graph-enhanced contextual and regional perception network based on Mamba architecture for salient object detection in remote sensing images, achieving SOTA performance.

DetailsMotivation: Existing ViT and CNN methods struggle to effectively integrate global and local features for SOD in optical remote sensing images, which face challenges like significant scale variations and low target-background contrast.

Method: Uses visual state space (VSS) encoder for multi-scale feature extraction, DS-HGAM module for cross-layer interaction and structural perception, and LEVSS decoder with adaptive scanning strategy and MCAEM for enhanced local modeling.

Result: Extensive experiments demonstrate state-of-the-art performance, validating the model’s effectiveness and superiority.

Conclusion: GCRPNet successfully overcomes limitations of previous methods by simultaneously capturing long-range dependencies and enhancing regional feature representation through innovative graph-enhanced modules.

Abstract: Salient object detection (SOD) in optical remote sensing images (ORSIs) faces numerous challenges, including significant variations in target scales and low contrast between targets and the background. Existing methods based on vision transformer (ViT) and convolutional neural network (CNN) architectures aim to leverage both global and local features, but the difficulty in effectively integrating these heterogeneous features limits their overall performance. To overcome these limitations, we propose a graph-enhanced contextual and regional perception network (GCRPNet), which builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation. Specifically, we employ the visual state space (VSS) encoder to extract multi-scale features. To further achieve deep guidance and enhancement of these features, we first design a difference-similarity guided hierarchical graph attention module (DS-HGAM). This module strengthens cross-layer interaction capabilities between features of different scales while enhancing the model’s structural perception, allowing it to distinguish between foreground and background more effectively. Then, we design the LEVSS block as the decoder of GCRPNet. This module integrates our proposed adaptive scanning strategy and multi-granularity collaborative attention enhancement module (MCAEM). It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information and enhancing Mamba’s local modeling capability. Extensive experimental results demonstrate that the proposed model achieves state-of-the-art performance, validating its effectiveness and superiority.

[204] Online 3D Gaussian Splatting Modeling with Novel View Selection

Byeonggwon Lee, Junkyu Park, Khang Truong Giang, Soohwan Song

Main category: cs.CV

TL;DR: Novel method for online 3D Gaussian Splatting from RGB-only frames using adaptive view selection to improve model completeness by integrating both keyframes and optimal non-keyframes.

DetailsMotivation: Existing methods rely solely on keyframes which are insufficient for complete scene reconstruction, and online processing constraints limit the use of many frames or extensive training iterations.

Method: Adaptive view selection that analyzes reconstruction quality online to select optimal non-keyframes for additional training, combined with an online multi-view stereo approach to ensure 3D consistency throughout the modeling process.

Result: Outperforms state-of-the-art methods, delivering exceptional performance in complex outdoor scenes with significantly enhanced completeness.

Conclusion: The proposed approach effectively addresses the limitations of keyframe-only methods by intelligently selecting additional viewpoints, resulting in more complete and high-quality 3DGS models for online RGB-only reconstruction.

Abstract: This study addresses the challenge of generating online 3D Gaussian Splatting (3DGS) models from RGB-only frames. Previous studies have employed dense SLAM techniques to estimate 3D scenes from keyframes for 3DGS model construction. However, these methods are limited by their reliance solely on keyframes, which are insufficient to capture an entire scene, resulting in incomplete reconstructions. Moreover, building a generalizable model requires incorporating frames from diverse viewpoints to achieve broader scene coverage. However, online processing restricts the use of many frames or extensive training iterations. Therefore, we propose a novel method for high-quality 3DGS modeling that improves model completeness through adaptive view selection. By analyzing reconstruction quality online, our approach selects optimal non-keyframes for additional training. By integrating both keyframes and selected non-keyframes, the method refines incomplete regions from diverse viewpoints, significantly enhancing completeness. We also present a framework that incorporates an online multi-view stereo approach, ensuring consistency in 3D information throughout the 3DGS modeling process. Experimental results demonstrate that our method outperforms state-of-the-art methods, delivering exceptional performance in complex outdoor scenes.
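
A toy version of the selection step, assuming reconstruction quality is scored by PSNR between each candidate non-keyframe and the model's rendering from the same pose (the paper's actual quality measure is not specified in this summary):

```python
import numpy as np

def psnr(a, b, max_val=1.0):
    mse = np.mean((a - b) ** 2)
    return 10 * np.log10(max_val ** 2 / max(mse, 1e-12))

def select_extra_views(frames, renders, budget=2):
    """Pick the non-keyframes the current model reconstructs worst.

    frames, renders: lists of HxWx3 float arrays in [0, 1] (captured
    frame vs. the model's rendering from the same pose).
    """
    scores = [psnr(f, r) for f, r in zip(frames, renders)]
    # Lowest PSNR first: these views expose incomplete regions.
    return list(np.argsort(scores)[:budget])

frames = [np.random.rand(48, 64, 3) for _ in range(5)]
renders = [f + 0.05 * np.random.rand(48, 64, 3) for f in frames]
print(select_extra_views(frames, renders))
```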

[205] TPA: Temporal Prompt Alignment for Fetal Congenital Heart Defect Classification

Darya Taratynova, Alya Almsouti, Beknur Kalmakhanbet, Numan Saeed, Mohammad Yaqub

Main category: cs.CV

TL;DR: TPA is a novel framework for fetal congenital heart defect classification in ultrasound videos that combines temporal modeling, prompt-aware contrastive learning, and uncertainty quantification to achieve state-of-the-art performance with improved calibration.

DetailsMotivation: Current automated methods for CHD detection in ultrasound videos neglect temporal information, limit to binary classification, and lack prediction calibration, which hinders clinical reliability.

Method: Temporal Prompt Alignment (TPA) extracts frame features with an image encoder, aggregates them with a temporal extractor, aligns video representations with class-specific text prompts via contrastive loss, and uses CVAESM module for uncertainty quantification and style modulation.

Result: TPA achieves 85.40% macro F1 score for CHD diagnosis, reduces expected calibration error by 5.38% and adaptive ECE by 6.8%, and boosts macro F1 by 4.73% on EchoNet-Dynamic’s three-class task.

Conclusion: TPA effectively addresses limitations of current methods by integrating temporal modeling, prompt learning, and uncertainty quantification, demonstrating superior performance and improved calibration for clinical applications.

Abstract: Congenital heart defect (CHD) detection in ultrasound videos is hindered by image noise and probe positioning variability. While automated methods can reduce operator dependence, current machine learning approaches often neglect temporal information, limit themselves to binary classification, and do not account for prediction calibration. We propose Temporal Prompt Alignment (TPA), a method leveraging a foundation image-text model and prompt-aware contrastive learning to classify fetal CHD on cardiac ultrasound videos. TPA extracts features from each frame of video subclips using an image encoder, aggregates them with a trainable temporal extractor to capture heart motion, and aligns the video representation with class-specific text prompts via a margin-hinge contrastive loss. To enhance calibration for clinical reliability, we introduce a Conditional Variational Autoencoder Style Modulation (CVAESM) module, which learns a latent style vector to modulate embeddings and quantifies classification uncertainty. Evaluated on a private dataset for CHD detection and on a large public dataset, EchoNet-Dynamic, for systolic dysfunction, TPA achieves state-of-the-art macro F1 scores of 85.40% for CHD diagnosis, while also reducing expected calibration error by 5.38% and adaptive ECE by 6.8%. On EchoNet-Dynamic’s three-class task, it boosts macro F1 by 4.73% (from 53.89% to 58.62%).
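
Of the components above, the calibration metric is the most standard. Expected calibration error can be sketched in a few lines (10 equal-width confidence bins assumed):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: bin-weighted gap between accuracy and mean confidence.

    conf: (N,) predicted max-class probabilities; correct: (N,) 0/1.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap
    return ece

conf = np.random.uniform(0.4, 1.0, 1000)
correct = (np.random.rand(1000) < conf).astype(float)  # well calibrated
print(expected_calibration_error(conf, correct))
```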

[206] Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vector Drawings

Feiwei Qin, Shichao Lu, Junhao Hou, Changmiao Wang, Meie Fang, Ligang Liu

Main category: cs.CV

TL;DR: Drawing2CAD is a framework that converts 2D engineering drawings into parametric CAD models using sequence-to-sequence learning, preserving geometric precision and design intent.

DetailsMotivation: Traditional CAD generative methods diverge from industrial workflows that start with 2D drawings, and automatic generation from vector drawings remains underexplored despite being critical for engineering design.

Method: Uses a dual-decoder transformer architecture with network-friendly vector primitive representation, decoupling command type and parameter generation while maintaining correspondence, and employs soft target distribution loss for CAD parameter flexibility.

Result: Developed CAD-VGDrawing dataset and demonstrated effectiveness through experiments, with code and dataset publicly available.

Conclusion: The framework successfully bridges the gap between 2D engineering drawings and parametric CAD models, preserving geometric precision and design intent throughout the transformation process.

Abstract: Computer-Aided Design (CAD) generative modeling is driving significant innovations across industrial applications. Recent works have shown remarkable progress in creating solid models from various inputs such as point clouds, meshes, and text descriptions. However, these methods fundamentally diverge from traditional industrial workflows that begin with 2D engineering drawings. The automatic generation of parametric CAD models from these 2D vector drawings remains underexplored despite being a critical step in engineering design. To address this gap, our key insight is to reframe CAD generation as a sequence-to-sequence learning problem where vector drawing primitives directly inform the generation of parametric CAD operations, preserving geometric precision and design intent throughout the transformation process. We propose Drawing2CAD, a framework with three key technical components: a network-friendly vector primitive representation that preserves precise geometric information, a dual-decoder transformer architecture that decouples command type and parameter generation while maintaining precise correspondence, and a soft target distribution loss function accommodating inherent flexibility in CAD parameters. To train and evaluate Drawing2CAD, we create CAD-VGDrawing, a dataset of paired engineering drawings and parametric CAD models, and conduct thorough experiments to demonstrate the effectiveness of our method. Code and dataset are available at https://github.com/lllssc/Drawing2CAD.
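
One plausible reading of the soft target distribution loss is cross-entropy against a Gaussian-smoothed distribution over quantized parameter bins, so near-miss parameter values are penalized less than distant ones. The bin count and smoothing width below are assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def soft_target_loss(logits, target_bins, num_bins=256, sigma=2.0):
    """Cross-entropy against a Gaussian-smoothed target distribution.

    Instead of a one-hot bin, nearby quantized parameter values receive
    probability mass, tolerating small deviations in CAD parameters.
    logits: (N, num_bins); target_bins: (N,) integer bin indices.
    """
    bins = torch.arange(num_bins, dtype=torch.float32)
    # (N, num_bins) Gaussian bump centred on each target bin.
    soft = torch.exp(-0.5 * ((bins - target_bins[:, None].float()) / sigma) ** 2)
    soft = soft / soft.sum(dim=1, keepdim=True)
    return -(soft * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

logits = torch.randn(8, 256)
targets = torch.randint(0, 256, (8,))
print(soft_target_loss(logits, targets))
```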

[207] InfoScale: Unleashing Training-free Variable-scaled Image Generation via Effective Utilization of Information

Guohui Zhang, Jiangtong Tan, Linjiang Huang, Zhonghang Yuan, Naishan Zheng, Jie Huang, Feng Zhao

Main category: cs.CV

TL;DR: InfoScale is a plug-and-play framework that addresses three key information-related challenges in diffusion models for variable-scale image generation: frequency loss from dilated convolution, inflexible attention mechanisms, and misaligned initial noise distribution.

DetailsMotivation: Diffusion models suffer performance degradation when generating images at resolutions different from their training scale due to information handling issues across varying resolutions.

Method: Proposes three modules: Progressive Frequency Compensation for high-frequency information recovery, Adaptive Information Aggregation for flexible attention, and Noise Adaptation for proper initial noise distribution in variable-scale generation.

Result: Extensive experiments demonstrate that InfoScale effectively enables diffusion models to generate high-quality images at various resolutions beyond their training scale.

Conclusion: InfoScale provides an information-centric solution that successfully addresses the fundamental challenges of variable-scale image generation in diffusion models through targeted modules for frequency compensation, adaptive attention, and noise distribution.

Abstract: Diffusion models (DMs) have become dominant in visual generation but suffer a performance drop when tested on resolutions that differ from the training scale, whether lower or higher. In fact, the key challenge in generating variable-scale images lies in the differing amounts of information across resolutions, which requires information conversion procedures to be varied for generating variable-scaled images. In this paper, we investigate the issues of three critical aspects in DMs for a unified analysis in variable-scaled generation: dilated convolution, attention mechanisms, and initial noise. Specifically, 1) dilated convolution in DMs for the higher-resolution generation loses high-frequency information. 2) Attention for variable-scaled image generation struggles to adjust the information aggregation adaptively. 3) The spatial distribution of information in the initial noise is misaligned with the variable-scaled image. To solve the above problems, we propose InfoScale, an information-centric framework for variable-scaled image generation by effectively utilizing information from three aspects correspondingly. For information loss in 1), we introduce the Progressive Frequency Compensation module to compensate for high-frequency information lost by dilated convolution in higher-resolution generation. For information aggregation inflexibility in 2), we introduce the Adaptive Information Aggregation module to adaptively aggregate information in lower-resolution generation and achieve an effective balance between local and global information in higher-resolution generation. For information distribution misalignment in 3), we design the Noise Adaptation module to re-distribute information in the initial noise for variable-scaled generation. Our method is plug-and-play for DMs and extensive experiments demonstrate the effectiveness in variable-scaled image generation.
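
As a rough illustration of frequency compensation (not the paper's Progressive Frequency Compensation module itself), one can splice the high-frequency band of a reference feature map into another via the FFT:

```python
import torch

def compensate_high_freq(feat, ref, cutoff=0.25):
    """Replace the high-frequency band of `feat` with that of `ref`.

    feat, ref: (B, C, H, W) tensors; cutoff is the normalized frequency
    radius below which content is considered 'low' and kept from feat.
    """
    B, C, H, W = feat.shape
    Ff, Fr = torch.fft.fft2(feat), torch.fft.fft2(ref)
    fy = torch.fft.fftfreq(H)[:, None]
    fx = torch.fft.fftfreq(W)[None, :]
    low = ((fy ** 2 + fx ** 2).sqrt() < cutoff).to(feat.dtype)
    mixed = Ff * low + Fr * (1 - low)
    return torch.fft.ifft2(mixed).real

feat = torch.randn(1, 8, 32, 32)
ref = torch.randn(1, 8, 32, 32)
print(compensate_high_freq(feat, ref).shape)
```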

[208] Towards Interpretable Geo-localization: a Concept-Aware Global Image-GPS Alignment Framework

Furong Jia, Lanxin Liu, Ce Hou, Fan Zhang, Xinyan Liu, Yu Liu

Main category: cs.CV

TL;DR: First interpretable geo-localization framework using concept bottlenecks to enhance both accuracy and explainability by aligning image-location embeddings with geographic concepts.

DetailsMotivation: Current geo-localization models lack interpretability, and existing concept-based methods don't align well with geo-alignment objectives, leading to poor interpretability and performance.

Method: Proposes Concept-Aware Alignment Module that projects image and location embeddings onto shared geographic concepts (climate, landmarks, architecture) and minimizes concept-level loss for better alignment in concept-specific subspace.

Result: Outperforms GeoCLIP in geo-localization accuracy and improves performance across various geospatial prediction tasks while providing richer semantic insights.

Conclusion: Successfully integrates interpretability into geo-localization through concept bottlenecks, achieving both better performance and explainable geographic decision-making.

Abstract: Worldwide geo-localization involves determining the exact geographic location of images captured globally, typically guided by geographic cues such as climate, landmarks, and architectural styles. Despite advancements in geo-localization models like GeoCLIP, which leverages images and location alignment via contrastive learning for accurate predictions, the interpretability of these models remains insufficiently explored. Current concept-based interpretability methods fail to align effectively with Geo-alignment image-location embedding objectives, resulting in suboptimal interpretability and performance. To address this gap, we propose a novel framework integrating global geo-localization with concept bottlenecks. Our method inserts a Concept-Aware Alignment Module that jointly projects image and location embeddings onto a shared bank of geographic concepts (e.g., tropical climate, mountain, cathedral) and minimizes a concept-level loss, enhancing alignment in a concept-specific subspace and enabling robust interpretability. To our knowledge, this is the first work to introduce interpretability into geo-localization. Extensive experiments demonstrate that our approach surpasses GeoCLIP in geo-localization accuracy and boosts performance across diverse geospatial prediction tasks, revealing richer semantic insights into geographic decision-making processes.
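
A minimal sketch of concept-level alignment under stated assumptions: both embeddings are scored against a shared concept bank, and the two resulting concept distributions are pulled together with a KL term (the paper's concept-level loss may take a different form).

```python
import torch
import torch.nn.functional as F

def concept_alignment_loss(img_emb, loc_emb, concept_bank, tau=0.07):
    """Align image and GPS embeddings in a shared concept space.

    Each embedding is scored against a bank of geographic concept
    vectors (e.g. 'tropical climate', 'cathedral'); the resulting
    concept distributions are matched with a KL divergence.
    """
    def concept_probs(e):
        e = F.normalize(e, dim=-1)
        c = F.normalize(concept_bank, dim=-1)
        return F.softmax(e @ c.T / tau, dim=-1)

    p_img, p_loc = concept_probs(img_emb), concept_probs(loc_emb)
    return F.kl_div(p_loc.log(), p_img, reduction="batchmean")

img = torch.randn(16, 512)
loc = torch.randn(16, 512)
concepts = torch.randn(64, 512)  # 64 named geographic concepts
print(concept_alignment_loss(img, loc, concepts))
```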

[209] Towards High-Fidelity, Identity-Preserving Real-Time Makeup Transfer: Decoupling Style Generation

Lydia Kin Ching Chau, Zhi Yu, Ruowei Jiang

Main category: cs.CV

TL;DR: Real-time virtual makeup try-on framework with high-fidelity cosmetic transfer, identity preservation, and temporal consistency

DetailsMotivation: Existing methods struggle with disentangling semitransparent cosmetics from skin tones, causing identity shifts and fairness concerns. Current approaches lack real-time capabilities and temporal consistency, limiting practical adoption.

Method: Decouple makeup transfer into two steps: transparent makeup mask extraction and graphics-based mask rendering. Use pseudo-ground-truth data from graphics-based rendering and unsupervised k-means clustering. Specialized training objectives include alpha-weighted reconstruction and lip color losses.

Result: Achieves robust makeup transfer across diverse poses, expressions, and skin tones while preserving temporal smoothness. Outperforms existing baselines in capturing fine details, maintaining temporal stability, and preserving identity integrity.

Conclusion: The proposed framework enables real-time, high-fidelity virtual makeup try-on with improved identity preservation and temporal consistency, addressing key limitations of existing methods.

Abstract: We present a novel framework for real-time virtual makeup try-on that achieves high-fidelity, identity-preserving cosmetic transfer with robust temporal consistency. In live makeup transfer applications, it is critical to synthesize temporally coherent results that accurately replicate fine-grained makeup and preserve user’s identity. However, existing methods often struggle to disentangle semitransparent cosmetics from skin tones and other identify features, causing identity shifts and raising fairness concerns. Furthermore, current methods lack real-time capabilities and fail to maintain temporal consistency, limiting practical adoption. To address these challenges, we decouple makeup transfer into two steps: transparent makeup mask extraction and graphics-based mask rendering. After the makeup extraction step, the makeup rendering can be performed in real time, enabling live makeup try-on. Our makeup extraction model trained on pseudo-ground-truth data generated via two complementary methods: a graphics-based rendering pipeline and an unsupervised k-means clustering approach. To further enhance transparency estimation and color fidelity, we propose specialized training objectives, including alpha-weighted reconstruction and lip color losses. Our method achieves robust makeup transfer across diverse poses, expressions, and skin tones while preserving temporal smoothness. Extensive experiments demonstrate that our approach outperforms existing baselines in capturing fine details, maintaining temporal stability, and preserving identity integrity.

[210] Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens

Sohee Kim, Soohyun Ryu, Joonhyung Park, Eunho Yang

Main category: cs.CV

TL;DR: LVLMs often mistakenly treat text inputs as visual content. The paper identifies specific FFN neurons that detect visual absence and uses them to create a detection module that improves response accuracy by filtering out ungrounded text inputs.

DetailsMotivation: Large Vision-Language Models frequently make errors by assuming text inputs without visual evidence are part of the image, leading to incorrect responses. The researchers wanted to investigate whether LVLMs have internal mechanisms to detect when textual concepts lack visual grounding.

Method: Identified Visual Absence-aware (VA) neurons in Feed-Forward Networks that signal visual absence through distinctive activation patterns. Developed a detection module to classify input tokens as visually grounded or not, and used this to refine outputs by reinterpreting prompts or replacing absent tokens during generation.

Result: The method effectively reduces LVLMs’ tendency to falsely presume visual presence of text input. Extensive experiments demonstrate the approach works across various LVLM architectures.

Conclusion: LVLMs contain specific internal neurons capable of detecting visual absence, and leveraging these neurons through a systematic detection module can significantly improve model performance by preventing erroneous assumptions about text inputs being visually grounded.

Abstract: Large Vision-Language Models (LVLMs) generate contextually relevant responses by jointly interpreting visual and textual inputs. However, our finding reveals they often mistakenly perceive text inputs lacking visual evidence as being part of the image, leading to erroneous responses. In light of this finding, we probe whether LVLMs possess an internal capability to determine if textual concepts are grounded in the image, and discover a specific subset of Feed-Forward Network (FFN) neurons, termed Visual Absence-aware (VA) neurons, that consistently signal the visual absence through a distinctive activation pattern. Leveraging these patterns, we develop a detection module that systematically classifies whether an input token is visually grounded. Guided by its prediction, we propose a method to refine the outputs by reinterpreting question prompts or replacing the detected absent tokens during generation. Extensive experiments show that our method effectively mitigates the models’ tendency to falsely presume the visual presence of text input and its generality across various LVLMs.

[211] SLENet: A Guidance-Enhanced Network for Underwater Camouflaged Object Detection

Xinxin Huang, Han Sun, Ningzhong Liu, Huiyu Zhou, Yinan Yao

Main category: cs.CV

TL;DR: Proposes SLENet framework and DeepCamo dataset for underwater camouflaged object detection, achieving state-of-the-art performance through semantic localization and enhancement modules.

DetailsMotivation: Underwater camouflaged object detection is crucial for marine ecology but faces challenges from optical distortions, water turbidity, and complex marine organism traits, making accurate identification difficult.

Method: Developed SLENet with Gamma-Asymmetric Enhancement module and Localization Guidance Branch to enhance multi-scale features and generate semantic-rich location maps, guided by Multi-Scale Supervised Decoder for accurate predictions.

Result: SLENet demonstrates superior performance on DeepCamo dataset and three benchmark COD datasets, showing high generality for broader camouflaged object detection tasks.

Conclusion: The proposed framework effectively addresses underwater camouflaged object detection challenges and establishes a strong benchmark for future research in this domain.

Abstract: Underwater Camouflaged Object Detection (UCOD) aims to identify objects that blend seamlessly into underwater environments. This task is critically important to marine ecology. However, it remains largely underexplored and accurate identification is severely hindered by optical distortions, water turbidity, and the complex traits of marine organisms. To address these challenges, we introduce the UCOD task and present DeepCamo, a benchmark dataset designed for this domain. We also propose Semantic Localization and Enhancement Network (SLENet), a novel framework for UCOD. We first benchmark state-of-the-art COD models on DeepCamo to reveal key issues, upon which SLENet is built. In particular, we incorporate Gamma-Asymmetric Enhancement (GAE) module and a Localization Guidance Branch (LGB) to enhance multi-scale feature representation while generating a location map enriched with global semantic information. This map guides the Multi-Scale Supervised Decoder (MSSD) to produce more accurate predictions. Experiments on our DeepCamo dataset and three benchmark COD datasets confirm SLENet’s superior performance over SOTA methods, and underscore its high generality for the broader COD task.

[212] Improving Vessel Segmentation with Multi-Task Learning and Auxiliary Data Available Only During Model Training

Daniel Sobotka, Alexander Herold, Matthias Perkonigg, Lucian Beer, Nina Bastati, Alina Sablatnig, Ahmed Ba-Ssalamah, Georg Langs

Main category: cs.CV

TL;DR: Multi-task learning framework for liver vessel segmentation in non-contrast MRI using auxiliary contrast-enhanced data during training to reduce annotation requirements.

DetailsMotivation: Liver vessel segmentation is crucial for analyzing vascular changes in liver diseases, but existing methods require contrast-enhanced MRI which is not always available. Non-contrast MRI is more common but challenging to segment without large annotated datasets.

Method: Proposes a multi-task learning framework that leverages paired native and contrast-enhanced MRI data (with and without vessel annotations) during training. The approach uses auxiliary contrast-enhanced data to improve feature representation through shared task structure.

Result: Auxiliary contrast-enhanced data improves vessel segmentation accuracy in non-contrast MRI, even when not available during inference. Benefits are most significant when few annotations are available for training. The approach was also validated for brain tumor segmentation, showing cross-domain applicability.

Conclusion: An auxiliary informative imaging modality can effectively augment expert annotations and improve segmentation performance even when only available during training, reducing the need for large-scale annotated datasets.

Abstract: Liver vessel segmentation in magnetic resonance imaging data is important for the computational analysis of vascular remodelling, associated with a wide spectrum of diffuse liver diseases. Existing approaches rely on contrast enhanced imaging data, but the necessary dedicated imaging sequences are not uniformly acquired. Images without contrast enhancement are acquired more frequently, but vessel segmentation is challenging, and requires large-scale annotated data. We propose a multi-task learning framework to segment vessels in liver MRI without contrast. It exploits auxiliary contrast enhanced MRI data available only during training to reduce the need for annotated training examples. Our approach draws on paired native and contrast enhanced data with and without vessel annotations for model training. Results show that auxiliary data improves the accuracy of vessel segmentation, even if they are not available during inference. The advantage is most pronounced if only few annotations are available for training, since the feature representation benefits from the shared task structure. A validation of this approach to augment a model for brain tumor segmentation confirms its benefits across different domains. An auxiliary informative imaging modality can augment expert annotations even if it is only available during training.
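
The training setup can be sketched as a shared encoder with two heads, where the auxiliary loss is only added when a paired contrast-enhanced image exists; at inference only the segmentation head is used. The architecture and loss weight below are illustrative, not the paper's model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSeg(nn.Module):
    """Shared encoder; the auxiliary head is only consulted in training."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.seg_head = nn.Conv2d(16, 2, 1)   # vessel / background
        self.aux_head = nn.Conv2d(16, 1, 1)   # predict contrast-enhanced image

    def forward(self, x):
        h = self.enc(x)
        return self.seg_head(h), self.aux_head(h)

model = MultiTaskSeg()
native = torch.randn(4, 1, 64, 64)
labels = torch.randint(0, 2, (4, 64, 64))
ce_img = torch.randn(4, 1, 64, 64)            # paired CE scan, training only

seg, aux = model(native)
loss = F.cross_entropy(seg, labels)
if ce_img is not None:                        # skip when no paired CE scan
    loss = loss + 0.5 * F.l1_loss(aux, ce_img)
loss.backward()
# At inference, only seg_head is used: no contrast-enhanced input needed.
```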

[213] GeoArena: An Open Platform for Benchmarking Large Vision-language Models on WorldWide Image Geolocalization

Pengyue Jia, Yingyi Zhang, Xiangyu Zhao, Yixuan Li

Main category: cs.CV

TL;DR: GeoArena is an open platform that addresses data leakage and privacy issues in image geolocalization evaluation by using in-the-wild images and human pairwise judgments instead of exact coordinates.

DetailsMotivation: Current image geolocalization evaluation methods suffer from data leakage (LVLMs pretrained on test data) and privacy concerns with exact coordinate metrics that neglect reasoning processes.

Method: Developed GeoArena platform that allows users to upload in-the-wild images and uses pairwise human judgments to evaluate which model outputs better align with human expectations.

Result: Deployed online for 2 months, collected thousands of voting records, established a leaderboard comparing different LVLMs on image geolocalization performance.

Conclusion: GeoArena provides a more accurate, privacy-preserving, and human-centered benchmarking approach for evaluating image geolocalization capabilities of large vision-language models.

Abstract: Image geolocalization aims to predict the geographic location of images captured anywhere on Earth, but its global nature presents significant challenges. Current evaluation methodologies suffer from two major limitations. First, data leakage: advanced approaches often rely on large vision-language models (LVLMs) to predict image locations, yet these models are frequently pretrained on the test datasets, compromising the accuracy of evaluating a model’s actual geolocalization capability. Second, existing metrics primarily rely on exact geographic coordinates to assess predictions, which not only neglects the reasoning process but also raises privacy concerns when user-level location data is required. To address these issues, we propose GeoArena, the first open platform for evaluating LVLMs on worldwide image geolocalization tasks, offering true in-the-wild and human-centered benchmarking. GeoArena enables users to upload in-the-wild images for a more diverse evaluation corpus, and it leverages pairwise human judgments to determine which model output better aligns with human expectations. Our platform has been deployed online for two months, during which we collected thousands of voting records. Based on this data, we conduct a detailed analysis and establish a leaderboard of different LVLMs on the image geolocalization task.
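
The summary does not say how votes are aggregated; a common choice for pairwise-preference leaderboards is an Elo-style update, sketched here with hypothetical model names:

```python
from collections import defaultdict

def elo_leaderboard(votes, k=32, base=1000.0):
    """Elo ratings from pairwise human votes given as (winner, loser) names."""
    rating = defaultdict(lambda: base)
    for winner, loser in votes:
        # Expected win probability of the current winner before the update.
        exp_win = 1.0 / (1.0 + 10 ** ((rating[loser] - rating[winner]) / 400))
        rating[winner] += k * (1 - exp_win)
        rating[loser] -= k * (1 - exp_win)
    return sorted(rating.items(), key=lambda kv: -kv[1])

votes = [("gpt-4o", "llava"), ("gpt-4o", "qwen-vl"), ("qwen-vl", "llava")]
print(elo_leaderboard(votes))
```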

[214] AnomalyLMM: Bridging Generative Knowledge and Discriminative Retrieval for Text-Based Person Anomaly Search

Hao Ju, Hu Zhang, Zhedong Zheng

Main category: cs.CV

TL;DR: AnomalyLMM is the first framework using Large Multi-modal Models for text-based person anomaly search, addressing fine-grained cross-modal alignment and sparse anomaly samples through a training-free adaptation approach.

DetailsMotivation: Text-based person anomaly search is crucial for public safety but faces challenges in fine-grained cross-modal alignment and sparse real-world anomaly samples. Current LMMs have untapped potential for this task due to domain gaps and lack of efficient adaptation strategies.

Method: A coarse-to-fine pipeline integrating LMMs with masked cross-modal prompting, behavioral saliency prediction, and knowledge-aware re-ranking. This training-free approach enables zero-shot focus on subtle anomaly cues without additional training.

Result: Achieved +0.96% improvement in Recall@1 accuracy on the PAB dataset compared to competitive baselines. The method demonstrates interpretable alignment between textual anomalies and visual behaviors across diverse scenarios like falling, collision, and being hit.

Conclusion: AnomalyLMM successfully bridges generative world knowledge with retrieval-centric anomaly detection, providing an effective zero-shot solution for text-based person anomaly search with interpretable results and potential for future research.

Abstract: With growing public safety demands, text-based person anomaly search has emerged as a critical task, aiming to retrieve individuals with abnormal behaviors via natural language descriptions. Unlike conventional person search, this task presents two unique challenges: (1) fine-grained cross-modal alignment between textual anomalies and visual behaviors, and (2) anomaly recognition under sparse real-world samples. While Large Multi-modal Models (LMMs) excel in multi-modal understanding, their potential for fine-grained anomaly retrieval remains underexplored, hindered by: (1) a domain gap between generative knowledge and discriminative retrieval, and (2) the absence of efficient adaptation strategies for deployment. In this work, we propose AnomalyLMM, the first framework that harnesses LMMs for text-based person anomaly search. Our key contributions are: (1) A novel coarse-to-fine pipeline integrating LMMs to bridge generative world knowledge with retrieval-centric anomaly detection; (2) A training-free adaptation cookbook featuring masked cross-modal prompting, behavioral saliency prediction, and knowledge-aware re-ranking, enabling zero-shot focus on subtle anomaly cues. As the first study to explore LMMs for this task, we conduct a rigorous evaluation on the PAB dataset, the only publicly available benchmark for text-based person anomaly search, with its curated real-world anomalies covering diverse scenarios (e.g., falling, collision, and being hit). Experiments show the effectiveness of the proposed method, surpassing the competitive baseline by +0.96% Recall@1 accuracy. Notably, our method reveals interpretable alignment between textual anomalies and visual behaviors, validated via qualitative analysis. Our code and models will be released for future research.

[215] Aesthetic Image Captioning with Saliency Enhanced MLLMs

Yilin Tao, Jiashui Huang, Huaze Xu, Ling Shao

Main category: cs.CV

TL;DR: ASE-MLLM is a novel framework that integrates aesthetic saliency into multimodal large language models for improved aesthetic image captioning, achieving state-of-the-art performance.

DetailsMotivation: Existing AIC methods using MLLMs don't specifically adapt to focus on aesthetic content and primarily rely on fine-tuning without explicit aesthetic saliency incorporation.

Method: Proposes ASE-MLLM with Image Aesthetic Saliency Module (IASM) to extract aesthetic features and IAS-ViT encoder that fuses aesthetic saliency with original image features via cross-attention.

Result: Significantly outperforms traditional methods and generic MLLMs on mainstream AIC benchmarks, achieving state-of-the-art performance.

Conclusion: First framework to integrate image aesthetic saliency into MLLMs specifically for AIC tasks, demonstrating the effectiveness of explicit aesthetic feature incorporation.

Abstract: Aesthetic Image Captioning (AIC) aims to generate textual descriptions of image aesthetics, becoming a key research direction in the field of computational aesthetics. In recent years, pretrained Multimodal Large Language Models (MLLMs) have advanced rapidly, leading to a significant increase in image aesthetics research that integrates both visual and textual modalities. However, most existing studies on image aesthetics primarily focus on predicting aesthetic ratings and have shown limited application in AIC. Existing AIC works leveraging MLLMs predominantly rely on fine-tuning methods without specifically adapting MLLMs to focus on target aesthetic content. To address this limitation, we propose the Aesthetic Saliency Enhanced Multimodal Large Language Model (ASE-MLLM), an end-to-end framework that explicitly incorporates aesthetic saliency into MLLMs. Within this framework, we introduce the Image Aesthetic Saliency Module (IASM), which efficiently and effectively extracts aesthetic saliency features from images. Additionally, we design IAS-ViT as the image encoder for MLLMs, this module fuses aesthetic saliency features with original image features via a cross-attention mechanism. To the best of our knowledge, ASE-MLLM is the first framework to integrate image aesthetic saliency into MLLMs specifically for AIC tasks. Extensive experiments demonstrated that our approach significantly outperformed traditional methods and generic MLLMs on current mainstream AIC benchmarks, achieving state-of-the-art (SOTA) performance.
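
The cross-attention fusion step admits a compact stand-in: image tokens query the aesthetic-saliency features, with a residual connection preserving the original tokens. Dimensions and the residual/LayerNorm placement are assumptions, not the IAS-ViT specifics.

```python
import torch
import torch.nn as nn

class SaliencyFusion(nn.Module):
    """Fuse aesthetic-saliency features into image tokens via cross-attention."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, saliency_tokens):
        # Image tokens attend to saliency features; residual keeps originals.
        fused, _ = self.attn(img_tokens, saliency_tokens, saliency_tokens)
        return self.norm(img_tokens + fused)

img = torch.randn(2, 196, 768)           # ViT patch tokens
sal = torch.randn(2, 49, 768)            # aesthetic saliency features
print(SaliencyFusion()(img, sal).shape)  # torch.Size([2, 196, 768])
```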

cs.AI

[216] Language-Driven Hierarchical Task Structures as Explicit World Models for Multi-Agent Learning

Brennen Hill

Main category: cs.AI

TL;DR: The paper argues for developing explicit hierarchical World Models using LLM-generated scaffolding to overcome exploration and reward challenges in complex multi-agent tasks like robotic soccer, enabling more efficient strategic learning.

DetailsMotivation: Current AI systems struggle with complex long-horizon multi-agent tasks due to intractable exploration spaces and sparse rewards in structurally-flat simulators. The development of sophisticated explicit World Models remains a key bottleneck for advancing agent capabilities.

Method: Proposes hierarchical scaffolding where complex goals are decomposed into structured subgoals, leveraging Large Language Models to dynamically generate this hierarchical scaffold and create language-configurable task layers that provide intrinsic curriculum and dense learning signals.

Result: Systematic review of 2024 multi-agent soccer research shows a decisive trend towards integrating symbolic/hierarchical methods with MARL, where approaches implicitly or explicitly construct task-based world models to guide agent learning.

Conclusion: Building environments with explicit, language-configurable task layers can bridge low-level reactive behaviors and high-level strategic team play, creating a powerful generalizable framework for training next-generation intelligent agents with greater sample efficiency.

Abstract: The convergence of Language models, Agent models, and World models represents a critical frontier for artificial intelligence. While recent progress has focused on scaling Language and Agent models, the development of sophisticated, explicit World Models remains a key bottleneck, particularly for complex, long-horizon multi-agent tasks. In domains such as robotic soccer, agents trained via standard reinforcement learning in high-fidelity but structurally-flat simulators often fail due to intractable exploration spaces and sparse rewards. This position paper argues that the next frontier in developing capable agents lies in creating environments that possess an explicit, hierarchical World Model. We contend that this is best achieved through hierarchical scaffolding, where complex goals are decomposed into structured, manageable subgoals. Drawing evidence from a systematic review of 2024 research in multi-agent soccer, we identify a clear and decisive trend towards integrating symbolic and hierarchical methods with multi-agent reinforcement learning (MARL). These approaches implicitly or explicitly construct a task-based world model to guide agent learning. We then propose a paradigm shift: leveraging Large Language Models to dynamically generate this hierarchical scaffold, effectively using language to structure the World Model on the fly. This language-driven world model provides an intrinsic curriculum, dense and meaningful learning signals, and a framework for compositional learning, enabling Agent Models to acquire sophisticated, strategic behaviors with far greater sample efficiency. By building environments with explicit, language-configurable task layers, we can bridge the gap between low-level reactive behaviors and high-level strategic team play, creating a powerful and generalizable framework for training the next generation of intelligent agents.

[217] The Ethical Compass of the Machine: Evaluating Large Language Models for Decision Support in Construction Project Management

Somtochukwu Azie, Yiping Meng

Main category: cs.AI

TL;DR: Study evaluates LLMs’ ethical viability in construction project management, finding they perform well on legal compliance but struggle with contextual nuance, accountability, and transparency. Experts recommend human oversight rather than autonomous AI for ethical decisions.

DetailsMotivation: To critically evaluate the ethical viability and reliability of Large Language Models (LLMs) when applied to ethically sensitive, high-risk decision-making contexts in construction project management.

Method: Mixed-methods research design: quantitative performance testing of two leading LLMs against 12 real-world ethical scenarios using Ethical Decision Support Assessment Checklist (EDSAC), plus qualitative analysis of semi-structured interviews with 12 industry experts.

Result: LLMs demonstrate adequate performance in structured domains like legal compliance but show significant deficiencies in handling contextual nuance, ensuring accountability, and providing transparent reasoning. Stakeholders expressed reservations about autonomous AI for ethical judgments.

Conclusion: LLMs are currently best positioned as decision-support aids rather than autonomous ethical agents, requiring robust human-in-the-loop oversight. The study introduces EDSAC as a replicable methodology for ethical AI evaluation.

Abstract: The integration of Artificial Intelligence (AI) into construction project management (CPM) is accelerating, with Large Language Models (LLMs) emerging as accessible decision-support tools. This study aims to critically evaluate the ethical viability and reliability of LLMs when applied to the ethically sensitive, high-risk decision-making contexts inherent in CPM. A mixed-methods research design was employed, involving the quantitative performance testing of two leading LLMs against twelve real-world ethical scenarios using a novel Ethical Decision Support Assessment Checklist (EDSAC), and qualitative analysis of semi-structured interviews with 12 industry experts to capture professional perceptions. The findings reveal that while LLMs demonstrate adequate performance in structured domains such as legal compliance, they exhibit significant deficiencies in handling contextual nuance, ensuring accountability, and providing transparent reasoning. Stakeholders expressed considerable reservations regarding the autonomous use of AI for ethical judgments, strongly advocating for robust human-in-the-loop oversight. To our knowledge, this is one of the first studies to empirically test the ethical reasoning of LLMs within the construction domain. It introduces the EDSAC framework as a replicable methodology and provides actionable recommendations, emphasising that LLMs are currently best positioned as decision-support aids rather than autonomous ethical agents.

[218] ProToM: Promoting Prosocial Behaviour via Theory of Mind-Informed Feedback

Matteo Bortoletto, Yichao Zhou, Lance Ying, Tianmin Shu, Andreas Bulling

Main category: cs.AI

TL;DR: ProToM is an AI system that uses Theory of Mind to provide targeted feedback promoting prosocial behavior in multi-agent systems, outperforming LLM baselines in efficiency and human preference.

DetailsMotivation: Humans struggle with knowing when and how to assist others when pursuing independent goals, which hinders cooperation. The paper aims to develop an AI system that promotes prosocial behavior through effective feedback.

Method: ProToM uses Bayesian inverse planning to infer agents’ goals, then selects feedback by maximizing expected utility based on the inferred goal distribution. It’s evaluated in Doors, Keys, and Gems and Overcooked environments.

Result: ProToM achieved higher success rates, shorter completion times, and was consistently preferred by human users compared to state-of-the-art LLMs, which produced poorly-timed feedback with higher communication overhead.

Conclusion: Theory of Mind-informed approaches like ProToM provide more effective, context-sensitive feedback for promoting prosocial behavior than current large language models, demonstrating the value of goal inference and utility-maximizing feedback selection.

Abstract: While humans are inherently social creatures, the challenge of identifying when and how to assist and collaborate with others - particularly when pursuing independent goals - can hinder cooperation. To address this challenge, we aim to develop an AI system that provides useful feedback to promote prosocial behaviour - actions that benefit others, even when not directly aligned with one’s own goals. We introduce ProToM, a Theory of Mind-informed facilitator that promotes prosocial actions in multi-agent systems by providing targeted, context-sensitive feedback to individual agents. ProToM first infers agents’ goals using Bayesian inverse planning, then selects feedback to communicate by maximising expected utility, conditioned on the inferred goal distribution. We evaluate our approach against baselines in two multi-agent environments: Doors, Keys, and Gems, as well as Overcooked. Our results suggest that state-of-the-art large language and reasoning models fall short of communicating feedback that is both contextually grounded and well-timed - leading to higher communication overhead and lower task speedup. In contrast, ProToM provides targeted and helpful feedback, achieving a higher success rate and shorter task completion times, and is consistently preferred by human users.
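A minimal sketch of the two-step procedure described above - Bayesian inverse planning followed by expected-utility feedback selection - with `likelihood`, `prior`, and `utility` as hypothetical caller-supplied functions:

```python
# Hypothetical sketch of ProToM-style feedback selection: infer a
# posterior over an agent's goals, then pick the feedback message
# with the highest expected utility under that posterior.
def goal_posterior(trajectory, goals, likelihood, prior):
    """P(goal | trajectory) proportional to P(trajectory | goal) * P(goal)."""
    scores = {g: likelihood(trajectory, g) * prior[g] for g in goals}
    z = sum(scores.values())
    return {g: s / z for g, s in scores.items()}

def select_feedback(trajectory, goals, candidate_messages,
                    likelihood, prior, utility):
    """Choose the message maximizing expected utility over inferred goals."""
    posterior = goal_posterior(trajectory, goals, likelihood, prior)
    def expected_utility(msg):
        return sum(p * utility(msg, g) for g, p in posterior.items())
    return max(candidate_messages, key=expected_utility)
```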

[219] Maestro: Joint Graph & Config Optimization for Reliable AI Agents

Wenxiao Wang, Priyatham Kattakinda, Soheil Feizi

Main category: cs.AI

TL;DR: Maestro is a holistic optimizer for LLM agents that jointly searches over both graph structures and node configurations, outperforming existing prompt optimizers by addressing structural failure modes that configuration-only tuning cannot fix.

DetailsMotivation: Existing LLM agent optimizers only tune configurations while keeping the graph structure fixed, leaving structural failure modes unaddressed. There's a need for a holistic approach that optimizes both graph structure and node configurations.

Method: Maestro is a framework-agnostic optimizer that jointly searches over agent graphs (modules and information flow) and node configurations (models, prompts, tools, control knobs). It uses reflective textual feedback from traces to prioritize edits and maximize agent quality within rollout/token budgets.

Result: On IFBench and HotpotQA benchmarks, Maestro surpassed leading prompt optimizers (MIPROv2, GEPA, GEPA+Merge) by average margins of 12%, 4.9%, and 4.86% respectively. Even in prompt-only optimization, it led by 9.65%, 2.37%, and 2.41%. Achieved results with fewer rollouts than GEPA.

Conclusion: Joint graph and configuration search is essential for addressing structural failure modes in LLM agents that prompt tuning alone cannot fix. Maestro demonstrates significant performance improvements across multiple benchmarks and real-world applications.

Abstract: Building reliable LLM agents requires decisions at two levels: the graph (which modules exist and how information flows) and the configuration of each node (models, prompts, tools, control knobs). Most existing optimizers tune configurations while holding the graph fixed, leaving structural failure modes unaddressed. We introduce Maestro, a framework-agnostic holistic optimizer for LLM agents that jointly searches over graphs and configurations to maximize agent quality, subject to explicit rollout/token budgets. Beyond numeric metrics, Maestro leverages reflective textual feedback from traces to prioritize edits, improving sample efficiency and targeting specific failure modes. On the IFBench and HotpotQA benchmarks, Maestro consistently surpasses leading prompt optimizers (MIPROv2, GEPA, and GEPA+Merge) by an average of 12%, 4.9%, and 4.86%, respectively; even when restricted to prompt-only optimization, it still leads by 9.65%, 2.37%, and 2.41%. Maestro achieves these results with far fewer rollouts than GEPA. We further show large gains on two applications (interviewer & RAG agents), highlighting that joint graph & configuration search addresses structural failure modes that prompt tuning alone cannot fix.
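The joint search can be pictured as alternating structural and configuration moves under a shared rollout budget. The greedy loop below is an illustrative sketch, not Maestro's actual search strategy (which additionally uses reflective textual feedback from traces to prioritize edits); all function arguments are hypothetical stand-ins:

```python
# Hypothetical sketch of joint graph & configuration search: each
# iteration proposes either a graph edit or a node-configuration edit,
# evaluates it within a rollout budget, and keeps the best candidate.
import random

def joint_search(initial_graph, initial_config, propose_graph_edit,
                 propose_config_edit, evaluate, rollout_budget):
    best = (initial_graph, initial_config)
    best_score, spent = evaluate(*best), 1
    while spent < rollout_budget:
        graph, config = best
        if random.random() < 0.5:                 # structural move
            candidate = (propose_graph_edit(graph), config)
        else:                                     # configuration move
            candidate = (graph, propose_config_edit(config))
        score = evaluate(*candidate)
        spent += 1
        if score > best_score:                    # greedy accept
            best, best_score = candidate, score
    return best, best_score
```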

[220] Towards Personalized Explanations for Health Simulations: A Mixed-Methods Framework for Stakeholder-Centric Summarization

Philippe J. Giabbanelli, Ameeta Agrawal

Main category: cs.AI

TL;DR: A framework for generating tailored explanations of health simulation models using LLMs to address diverse stakeholder needs

DetailsMotivation: Health simulation models are complex and inaccessible to stakeholders who need them most, and current LLM approaches provide generic summaries that don't meet varied informational needs

Method: Mixed-methods design: eliciting stakeholder explanation needs and stylistic preferences, optimizing LLMs via controllable attribute tuning, and comprehensive evaluation metrics

Result: A step-by-step framework to guide LLMs in generating tailored explanations for different health stakeholders

Conclusion: The framework addresses the gap in systematic understanding of stakeholder needs and enables personalized explanations of health simulations

Abstract: Modeling & Simulation (M&S) approaches such as agent-based models hold significant potential to support decision-making activities in health, with recent examples including the adoption of vaccines, and a vast literature on healthy eating behaviors and physical activity behaviors. These models are potentially usable by different stakeholder groups, as they support policy-makers to estimate the consequences of potential interventions and they can guide individuals in making healthy choices in complex environments. However, this potential may not be fully realized because of the models’ complexity, which makes them inaccessible to the stakeholders who could benefit the most. While Large Language Models (LLMs) can translate simulation outputs and the design of models into text, current approaches typically rely on one-size-fits-all summaries that fail to reflect the varied informational needs and stylistic preferences of clinicians, policymakers, patients, caregivers, and health advocates. This limitation stems from a fundamental gap: we lack a systematic understanding of what these stakeholders need from explanations and how to tailor them accordingly. To address this gap, we present a step-by-step framework to identify stakeholder needs and guide LLMs in generating tailored explanations of health simulations. Our procedure uses a mixed-methods design by first eliciting the explanation needs and stylistic preferences of diverse health stakeholders, then optimizing the ability of LLMs to generate tailored outputs (e.g., via controllable attribute tuning), and then evaluating through a comprehensive range of metrics to further improve the tailored generation of summaries.

[221] An Approach to Grounding AI Model Evaluations in Human-derived Criteria

Sasha Mitts

Main category: cs.AI

TL;DR: Proposes human-derived evaluation criteria to enhance AI benchmarks for physical world modeling, identifying key cognitive skills like Prioritization, Memorizing, Discerning, and Contextualizing through interviews and surveys.

DetailsMotivation: Traditional AI benchmarks fall short in capturing nuanced capabilities, particularly in physical world modeling, and need better interpretability and applicability aligned with human cognitive processes.

Method: Conducted in-depth interviews and large-scale surveys using Perception Test and OpenEQA benchmarks to identify critical cognitive skills for AI and human reasoning.

Result: Found that participants perceive AI as lacking interpretive and empathetic skills but hold high performance expectations. Developed a framework for human-aligned benchmark design.

Conclusion: Highlights importance of user-centered evaluation in AI development, providing actionable guidelines for researchers to align AI capabilities with human cognition and improve benchmarking practices.

Abstract: In the rapidly evolving field of artificial intelligence (AI), traditional benchmarks can fall short in attempting to capture the nuanced capabilities of AI models. We focus on the case of physical world modeling and propose a novel approach to augment existing benchmarks with human-derived evaluation criteria, aiming to enhance the interpretability and applicability of model behaviors. Grounding our study in the Perception Test and OpenEQA benchmarks, we conducted in-depth interviews and large-scale surveys to identify key cognitive skills, such as Prioritization, Memorizing, Discerning, and Contextualizing, that are critical for both AI and human reasoning. Our findings reveal that participants perceive AI as lacking in interpretive and empathetic skills yet hold high expectations for AI performance. By integrating insights from our findings into benchmark design, we offer a framework for developing more human-aligned means of defining and measuring progress. This work underscores the importance of user-centered evaluation in AI development, providing actionable guidelines for researchers and practitioners aiming to align AI capabilities with human cognitive processes. Our approach both enhances current benchmarking practices and sets the stage for future advancements in AI model evaluation.

[222] What-If Analysis of Large Language Models: Explore the Game World Using Proactive Thinking

Yuan Sui, Yanming Zhang, Yi Liao, Yu Gu, Guohua Tang, Zhongqian Sun, Wei Yang, Bryan Hooi

Main category: cs.AI

TL;DR: WiA-LLM introduces what-if analysis to LLMs, enabling proactive thinking and future forecasting through reinforcement learning, achieving 74.2% accuracy in game-state prediction with 2x gains over baselines.

DetailsMotivation: LLMs currently lack proactive thinking capabilities - they can't explore hypothetical futures or forecast consequences before acting, which limits their utility in dynamic, high-stakes decision-making scenarios.

Method: Integrates What-If Analysis (WIA) with reinforcement learning to enable LLMs to dynamically simulate outcomes of potential actions. Validated in Honor of Kings game environment with rapid state changes.

Result: Achieves 74.2% accuracy in forecasting game-state changes (2x improvement over baselines), with particularly significant gains in high-difficulty scenarios requiring accurate foresight.

Conclusion: WiA-LLM represents the first formal integration of what-if analysis in LLMs, providing a scalable framework for proactive reasoning and robust decision-making in dynamic environments.

Abstract: Large language models (LLMs) excel at processing information reactively but lack the ability to systematically explore hypothetical futures. They cannot ask, “what if we take this action? how will it affect the final outcome?” and forecast its potential consequences before acting. This critical gap limits their utility in dynamic, high-stakes scenarios like strategic planning, risk assessment, and real-time decision making. To bridge this gap, we propose WiA-LLM, a new paradigm that equips LLMs with proactive thinking capabilities. Our approach integrates What-If Analysis (WIA), a systematic approach for evaluating hypothetical scenarios by changing input variables. By leveraging environmental feedback via reinforcement learning, WiA-LLM moves beyond reactive thinking. It dynamically simulates the outcomes of each potential action, enabling the model to anticipate future states rather than merely react to present conditions. We validate WiA-LLM in Honor of Kings (HoK), a complex multiplayer game environment characterized by rapid state changes and intricate interactions. The game’s real-time state changes require precise multi-step consequence prediction, making it an ideal testbed for our approach. Experimental results demonstrate WiA-LLM achieves a remarkable 74.2% accuracy in forecasting game-state changes (up to two times gain over baselines). The model shows particularly significant gains in high-difficulty scenarios where accurate foresight is critical. To our knowledge, this is the first work to formally explore and integrate what-if analysis capabilities within LLMs. WiA-LLM represents a fundamental advance toward proactive reasoning in LLMs, providing a scalable framework for robust decision-making in dynamic environments with broad implications for strategic applications.
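The core decision loop can be sketched as follows, with `forecast` standing in for the RL-trained state forecaster and `value` for a hypothetical state scorer; neither name comes from the paper:

```python
# Hypothetical what-if loop in the spirit of WiA-LLM: before acting,
# simulate the predicted next state for each candidate action and
# commit to the action whose predicted outcome scores best.
def what_if_decision(state, candidate_actions, forecast, value):
    """forecast(state, action) -> predicted next state; value scores states."""
    best_action, best_value = None, float("-inf")
    for action in candidate_actions:
        predicted = forecast(state, action)   # "what if we take this action?"
        v = value(predicted)
        if v > best_value:
            best_action, best_value = action, v
    return best_action
```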

[223] TalkToAgent: A Human-centric Explanation of Reinforcement Learning Agents with Large Language Models

Haechang Kim, Hao Chen, Can Li, Jong Min Lee

Main category: cs.AI

TL;DR: TalkToAgent is a multi-agent LLM framework that provides interactive natural language explanations for RL policies using five specialized agents to map user queries to XRL tools and generate various explanation types.

DetailsMotivation: There's a gap between complex RL policies and domain experts due to limited comprehensibility of XRL results and isolated coverage of current approaches, leaving users uncertain about which tools to use.

Method: A multi-agent LLM framework with five specialized agents (Coordinator, Explainer, Coder, Evaluator, Debugger) that automatically maps user queries to relevant XRL tools and provides explanations through key state variables, expected outcomes, or counterfactual explanations.

Result: Successfully mapped user queries into XRL tasks with high accuracy, minimized failures in counterfactual generation through coder-debugger interactions, and effectively interpreted the agent’s actions within the problem domain.

Conclusion: TalkToAgent addresses the comprehensibility gap in XRL by providing interactive natural language explanations through a multi-agent LLM framework, successfully bridging the understanding between RL policies and domain experts.

Abstract: Explainable Reinforcement Learning (XRL) has emerged as a promising approach in improving the transparency of Reinforcement Learning (RL) agents. However, there remains a gap between complex RL policies and domain experts, due to the limited comprehensibility of XRL results and isolated coverage of current XRL approaches that leave users uncertain about which tools to employ. To address these challenges, we introduce TalkToAgent, a multi-agent Large Language Models (LLM) framework that delivers interactive, natural language explanations for RL policies. The architecture with five specialized LLM agents (Coordinator, Explainer, Coder, Evaluator, and Debugger) enables TalkToAgent to automatically map user queries to relevant XRL tools and clarify an agent’s actions in terms of either key state variables, expected outcomes, or counterfactual explanations. Moreover, our approach extends previous counterfactual explanations by deriving alternative scenarios from qualitative behavioral descriptions, or even new rule-based policies. We validated TalkToAgent on a quadruple-tank process control problem, a well-known nonlinear control benchmark. Results demonstrated that TalkToAgent successfully mapped user queries into XRL tasks with high accuracy, and coder-debugger interactions minimized failures in counterfactual generation. Furthermore, qualitative evaluation confirmed that TalkToAgent effectively interpreted the agent’s actions and contextualized their meaning within the problem domain.
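A toy sketch of the Coordinator's routing step, with an illustrative (not the paper's) set of XRL tool names and a caller-supplied `classify` LLM call:

```python
# Hypothetical coordinator routing, loosely following TalkToAgent's
# design: map a user query to one of several XRL explanation tools.
XRL_TOOLS = {
    "feature_importance": "Which state variables drove this action?",
    "expected_outcome":   "What happens if the agent keeps acting this way?",
    "counterfactual":     "What if the agent had acted differently?",
}

def coordinate(user_query: str, classify) -> str:
    """classify is an LLM call returning one key of XRL_TOOLS."""
    task = classify(user_query, options=list(XRL_TOOLS))
    if task not in XRL_TOOLS:
        raise ValueError(f"Unsupported XRL task: {task}")
    return task
```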

[224] Collaboration and Conflict between Humans and Language Models through the Lens of Game Theory

Mukul Singh, Arjun Radhakrishna, Sumit Gulwani

Main category: cs.AI

TL;DR: Language models achieve top-tier performance in iterated prisoner’s dilemma, matching or exceeding classical strategies while demonstrating strong cooperative properties and rapid adaptability to opponent strategy changes.

DetailsMotivation: To understand how language models behave in long-term multi-party interactions and cooperative settings, addressing gaps in prior work that focused on isolated or short-term game-theoretic contexts.

Method: Conducted Axelrod-style tournaments pitting language model agents against 240 classical strategies in iterated prisoner’s dilemma, with controlled strategy switch experiments to test adaptability.

Result: Language models performed on par with or better than best-known classical strategies, exhibiting niceness, provocability, generosity, and rapid detection/response to opponent strategy changes within few rounds.

Conclusion: Language models show sophisticated long-term cooperative behaviors and adaptability, providing foundation for studying their role in complex human-AI social environments.

Abstract: Language models are increasingly deployed in interactive online environments, from personal chat assistants to domain-specific agents, raising questions about their cooperative and competitive behavior in multi-party settings. While prior work has examined language model decision-making in isolated or short-term game-theoretic contexts, these studies often neglect long-horizon interactions, human-model collaboration, and the evolution of behavioral patterns over time. In this paper, we investigate the dynamics of language model behavior in the iterated prisoner’s dilemma (IPD), a classical framework for studying cooperation and conflict. We pit model-based agents against a suite of 240 well-established classical strategies in an Axelrod-style tournament and find that language models achieve performance on par with, and in some cases exceeding, the best-known classical strategies. Behavioral analysis reveals that language models exhibit key properties associated with strong cooperative strategies - niceness, provocability, and generosity - while also demonstrating rapid adaptability to changes in opponent strategy mid-game. In controlled “strategy switch” experiments, language models detect and respond to shifts within only a few rounds, rivaling or surpassing human adaptability. These results provide the first systematic characterization of long-term cooperative behaviors in language model agents, offering a foundation for future research into their role in more complex, mixed human-AI social environments.
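For readers unfamiliar with the setup, a minimal Axelrod-style match looks like the following; an LLM-backed agent would simply be another strategy callable alongside classics such as tit-for-tat:

```python
# Minimal iterated prisoner's dilemma match with standard payoffs.
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play_match(strategy_a, strategy_b, rounds=200):
    history_a, history_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strategy_a(history_a, history_b)   # each sees both histories
        move_b = strategy_b(history_b, history_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a, score_b = score_a + pay_a, score_b + pay_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b

def tit_for_tat(own_history, opp_history):
    """Cooperate first, then mirror the opponent's last move."""
    return opp_history[-1] if opp_history else "C"
```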

[225] Cloning a Conversational Voice AI Agent from Call Recording Datasets for Telesales

Krittanon Kaewtawee, Wachiravit Modecrua, Krittin Pachtrachai, Touchapon Kraisingkorn

Main category: cs.AI

TL;DR: A methodology for cloning conversational voice AI agents from call recordings, integrating speech recognition, LLM dialogue management, and speech synthesis, achieving near-human performance in routine tasks but lagging in persuasion.

DetailsMotivation: To automate repetitive tasks in domains like customer service and healthcare by creating voice assistants that can understand and generate human dialogue in real time, reducing operational costs and providing constant support.

Method: General methodology using call transcripts to clone voice AI agents, involving domain selection, knowledge extraction, prompt engineering, and integrating automatic speech recognition, large language model dialogue management, and text-to-speech synthesis into a streaming pipeline.

Result: The cloned AI agent approaches human performance in routine call aspects but underperforms in persuasion and objection handling. Blind tests evaluated the agent against 22 criteria covering introduction, product communication, sales drive, objection handling, and closing.

Conclusion: The approach generalizes to any domain with call transcripts, with identified shortcomings in persuasion leading to prompt refinement. Future research directions include large-scale simulation and automated evaluation.

Abstract: Recent advances in language and speech modelling have made it possible to build autonomous voice assistants that understand and generate human dialogue in real time. These systems are increasingly being deployed in domains such as customer service and healthcare, where they can automate repetitive tasks, reduce operational costs, and provide constant support around the clock. In this paper, we present a general methodology for cloning a conversational voice AI agent from a corpus of call recordings. Although the case study described in this paper uses telesales data to illustrate the approach, the underlying process generalizes to any domain where call transcripts are available. Our system listens to customers over the telephone, responds with a synthetic voice, and follows a structured playbook learned from top performing human agents. We describe the domain selection, knowledge extraction, and prompt engineering used to construct the agent, integrating automatic speech recognition, a large language model-based dialogue manager, and text-to-speech synthesis into a streaming inference pipeline. The cloned agent is evaluated against human agents on a rubric of 22 criteria covering introduction, product communication, sales drive, objection handling, and closing. Blind tests show that the AI agent approaches human performance in routine aspects of the call while underperforming in persuasion and objection handling. We analyze these shortcomings and refine the prompt accordingly. The paper concludes with design lessons and avenues for future research, including large scale simulation and automated evaluation.
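A skeleton of the streaming turn loop described above, with `asr`, `llm`, and `tts` as stand-ins for real services and `playbook` as the learned system prompt; all names are assumptions for illustration:

```python
# Hypothetical pipeline skeleton for the cloned telesales agent:
# ASR -> LLM dialogue manager (guided by a playbook prompt) -> TTS.
def handle_turn(audio_chunk, asr, llm, tts, playbook, dialogue_state):
    user_text = asr(audio_chunk)                       # speech -> text
    dialogue_state.append({"role": "user", "content": user_text})
    reply = llm(system=playbook, messages=dialogue_state)
    dialogue_state.append({"role": "assistant", "content": reply})
    return tts(reply)                                  # text -> speech
```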

[226] OSC: Cognitive Orchestration through Dynamic Knowledge Alignment in Multi-Agent LLM Collaboration

Jusheng Zhang, Yijia Fan, Kaitong Cai, Xiaofei Sun, Keze Wang

Main category: cs.AI

TL;DR: OSC is a knowledge-aware adaptive collaboration framework that enhances cognitive synergy in multi-agent LLM systems by introducing Collaborator Knowledge Models for real-time cognitive gap analysis and adaptive communication adjustments.

DetailsMotivation: Prior work has advanced agent selection and result aggregation, but efficient linguistic interactions for deep collaboration among expert agents remain a critical bottleneck that needs to be addressed.

Method: OSC introduces Collaborator Knowledge Models (CKM) to enable agents to dynamically perceive collaborators’ cognitive states, perform real-time cognitive gap analysis, and adaptively adjust communication behaviors including content focus, detail level, and expression style.

Result: Experiments on complex reasoning and problem-solving benchmarks demonstrate that OSC significantly improves task performance and communication efficiency, transforming individual agents into a deeply collaborative cognitive team.

Conclusion: OSC not only optimizes multi-agent collaboration but also offers new insights into LLM agent interaction behaviors, serving as a pivotal intermediate layer between selection and aggregation processes.

Abstract: This paper introduces OSC (Orchestrating Cognitive Synergy), a knowledge-aware adaptive collaboration framework designed to enhance cognitive synergy in multi-agent systems with large language models. While prior work has advanced agent selection and result aggregation, efficient linguistic interactions for deep collaboration among expert agents remain a critical bottleneck. OSC addresses this gap as a pivotal intermediate layer between selection and aggregation, introducing Collaborator Knowledge Models (CKM) to enable each agent to dynamically perceive its collaborators’ cognitive states. Through real-time cognitive gap analysis, agents adaptively adjust communication behaviors, including content focus, detail level, and expression style, using learned strategies. Experiments on complex reasoning and problem-solving benchmarks demonstrate that OSC significantly improves task performance and communication efficiency, transforming “parallel-working individuals” into a “deeply collaborative cognitive team.” This framework not only optimizes multi-agent collaboration but also offers new insights into LLM agent interaction behaviors.

[227] Dynamic Speculative Agent Planning

Yilin Guan, Wenyue Hua, Qingfeng Lan, Sun Fei, Dujian Ding, Devang Acharya, Chi Wang, William Yang Wang

Main category: cs.AI

TL;DR: DSP is a reinforcement learning framework that provides lossless acceleration for LLM agents with 30% cost reduction and up to 60% reduction in unnecessary costs, while allowing users to control the latency-cost tradeoff.

DetailsMotivation: Large language model agents face prohibitive latency and inference costs in deployment, and existing acceleration methods either sacrifice performance fidelity, require extensive offline training, or offer minimal user control over the acceleration-cost tradeoff.

Method: Dynamic Speculative Planning (DSP) - an asynchronous online reinforcement learning framework that optimizes a joint objective balancing end-to-end latency against dollar cost without requiring additional pre-deployment preparation.

Result: DSP achieves comparable efficiency to the fastest lossless acceleration method while reducing total cost by 30% and unnecessary cost by up to 60% on standard agent benchmarks.

Conclusion: DSP provides a practical solution for lossless acceleration of LLM agents with substantial cost savings and flexible user control over the latency-cost tradeoff continuum.

Abstract: Despite their remarkable success in complex tasks propelling widespread adoption, large language-model-based agents still face critical deployment challenges due to prohibitive latency and inference costs. While recent work has explored various methods to accelerate inference, existing approaches suffer from significant limitations: they either fail to preserve performance fidelity, require extensive offline training of router modules, or incur excessive operational costs. Moreover, they provide minimal user control over the tradeoff between acceleration and other performance metrics. To address these gaps, we introduce Dynamic Speculative Planning (DSP), an asynchronous online reinforcement learning framework that provides lossless acceleration with substantially reduced costs without requiring additional pre-deployment preparation. DSP explicitly optimizes a joint objective balancing end-to-end latency against dollar cost, allowing practitioners to adjust a single parameter that steers the system toward faster responses, cheaper operation, or any point along this continuum. Experiments on two standard agent benchmarks demonstrate that DSP achieves comparable efficiency to the fastest lossless acceleration method while reducing total cost by 30% and unnecessary cost by up to 60%. Our code and data are available through https://github.com/guanyilin428/Dynamic-Speculative-Planning.
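The abstract's single steering parameter suggests an objective of roughly the following shape; the exact functional form is an assumption for illustration, not the paper's formula:

```python
# Hypothetical sketch of DSP's user-steerable objective: a single
# parameter lam trades end-to-end latency against dollar cost.
def dsp_objective(latency_s: float, cost_usd: float, lam: float) -> float:
    """lam=1.0 optimizes purely for speed; lam=0.0 purely for cost."""
    assert 0.0 <= lam <= 1.0
    return lam * latency_s + (1.0 - lam) * cost_usd
```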

[228] SparkUI-Parser: Enhancing GUI Perception with Robust Grounding and Parsing

Hongyi Jing, Jiafu Chen, Chen Rao, Ziqiang Dang, Jiajie Teng, Tianyi Chu, Juncheng Mo, Shuo Fang, Huaizhong Lin, Rui Lv, Chenguang Ma, Lei Zhao

Main category: cs.AI

TL;DR: SparkUI-Parser is a novel MLLM-based framework that improves GUI perception with continuous coordinate modeling, achieving higher accuracy and faster inference than previous discrete coordinate methods.

DetailsMotivation: Existing MLLMs for GUI perception suffer from low grounding accuracy, slow inference due to discrete coordinate modeling, and limited capability to parse entire interfaces rather than just predefined elements.

Method: Uses continuous coordinate modeling with a pre-trained MLLM, adding a token router and coordinate decoder. Includes a rejection mechanism based on modified Hungarian matching to identify non-existent elements and reduce false positives.

Result: Outperforms state-of-the-art methods on ScreenSpot, ScreenSpot-v2, CAGUI-Grounding and ScreenParse benchmarks, achieving higher accuracy and faster inference speed.

Conclusion: SparkUI-Parser successfully addresses limitations of previous GUI perception methods by enabling continuous coordinate modeling and full interface parsing, making it more suitable for broad applications and downstream tasks.

Abstract: The existing Multimodal Large Language Models (MLLMs) for GUI perception have made great progress. However, the following challenges still exist in prior methods: 1) They model discrete coordinates based on text autoregressive mechanism, which results in lower grounding accuracy and slower inference speed. 2) They can only locate predefined sets of elements and are not capable of parsing the entire interface, which hampers the broad application and support for downstream tasks. To address the above issues, we propose SparkUI-Parser, a novel end-to-end framework where higher localization precision and fine-grained parsing capability of the entire interface are simultaneously achieved. Specifically, instead of using probability-based discrete modeling, we perform continuous modeling of coordinates based on a pre-trained Multimodal Large Language Model (MLLM) with an additional token router and coordinate decoder. This effectively mitigates the limitations inherent in the discrete output characteristics and the token-by-token generation process of MLLMs, consequently boosting both the accuracy and the inference speed. To further enhance robustness, a rejection mechanism based on a modified Hungarian matching algorithm is introduced, which empowers the model to identify and reject non-existent elements, thereby reducing false positives. Moreover, we present ScreenParse, a rigorously constructed benchmark to systematically assess structural perception capabilities of GUI models across diverse scenarios. Extensive experiments demonstrate that our approach consistently outperforms SOTA methods on ScreenSpot, ScreenSpot-v2, CAGUI-Grounding and ScreenParse benchmarks. The resources are available at https://github.com/antgroup/SparkUI-Parser.
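A plausible reading of the rejection mechanism, sketched with SciPy's Hungarian solver: predictions whose best assignment cost exceeds a threshold are rejected as non-existent elements. The paper's actual modification of the matching algorithm may differ; the threshold here is an assumed hyper-parameter.

```python
# Hypothetical rejection step in the spirit of SparkUI-Parser: match
# predicted GUI elements to references with the Hungarian algorithm
# and reject matches whose cost is too high.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_with_rejection(cost_matrix: np.ndarray, reject_threshold: float):
    """cost_matrix[i, j]: cost of pairing prediction i with reference j."""
    rows, cols = linear_sum_assignment(cost_matrix)
    matches, rejected = [], []
    for i, j in zip(rows, cols):
        if cost_matrix[i, j] <= reject_threshold:
            matches.append((i, j))
        else:
            rejected.append(i)        # treated as a non-existent element
    return matches, rejected
```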

[229] Towards Ontology-Based Descriptions of Conversations with Qualitatively-Defined Concepts

Barbara Gendron, Gaël Guibon, Mathieu D’aquin

Main category: cs.AI

TL;DR: Ontology-based framework for controlling LLM conversational proficiency levels using formal CEFR definitions and fine-tuning

DetailsMotivation: Addressing the challenge of ensuring predictable and user-personalized responses from LLMs by providing formal, quantitative definitions for qualitative conversational features

Method: Leveraging linguistic descriptors to derive quantitative definitions for qualitative concepts, formalizing them in description logic, incorporating into an ontology, and using this to guide LLM fine-tuning for controlled text generation

Result: Experimental results show consistent and explainable proficiency-level definitions that improve transparency in conversational AI

Conclusion: The ontology-based approach successfully enables formal definition and control of conversational features, providing a framework for predictable and transparent LLM behavior in conversational settings

Abstract: The controllability of Large Language Models (LLMs) when used as conversational agents is a key challenge, particularly to ensure predictable and user-personalized responses. This work proposes an ontology-based approach to formally define conversational features that are typically qualitative in nature. By leveraging a set of linguistic descriptors, we derive quantitative definitions for qualitatively-defined concepts, enabling their integration into an ontology for reasoning and consistency checking. We apply this framework to the task of proficiency-level control in conversations, using CEFR language proficiency levels as a case study. These definitions are then formalized in description logic and incorporated into an ontology, which guides controlled text generation of an LLM through fine-tuning. Experimental results demonstrate that our approach provides consistent and explainable proficiency-level definitions, improving transparency in conversational AI.

[230] Internet 3.0: Architecture for a Web-of-Agents with its Algorithm for Ranking Agents

Rajesh Tembarai Krishnamachari, Srividya Rajesh

Main category: cs.AI

TL;DR: DOVIS protocol enables privacy-preserving agent ranking for Web of Agents using usage and competence metrics with theoretical guarantees.

DetailsMotivation: AI agents need performance-based ranking but current systems lack transparent interaction networks and have fragmented private usage data.

Method: Five-layer DOVIS protocol (Discovery, Orchestration, Verification, Incentives, Semantics) with AgentRank-UC algorithm combining usage frequency and competence metrics.

Result: Simulation shows convergence, robustness, and Sybil resistance, enabling scalable trustworthy agent ranking.

Conclusion: Coordinated protocols and performance-aware ranking are viable for building a trustworthy Agentic Web ecosystem.

Abstract: AI agents – powered by reasoning-capable large language models (LLMs) and integrated with tools, data, and web search – are poised to transform the internet into a Web of Agents: a machine-native ecosystem where autonomous agents interact, collaborate, and execute tasks at scale. Realizing this vision requires Agent Ranking – selecting agents not only by declared capabilities but by proven, recent performance. Unlike Web 1.0’s PageRank, a global, transparent network of agent interactions does not exist; usage signals are fragmented and private, making ranking infeasible without coordination. We propose DOVIS, a five-layer operational protocol (Discovery, Orchestration, Verification, Incentives, Semantics) that enables the collection of minimal, privacy-preserving aggregates of usage and performance across the ecosystem. On this substrate, we implement AgentRank-UC, a dynamic, trust-aware algorithm that combines usage (selection frequency) and competence (outcome quality, cost, safety, latency) into a unified ranking. We present simulation results and theoretical guarantees on convergence, robustness, and Sybil resistance, demonstrating the viability of coordinated protocols and performance-aware ranking in enabling a scalable, trustworthy Agentic Web.
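A hypothetical power-iteration sketch of a usage-plus-competence ranking in the spirit of AgentRank-UC; the precise update rule, mixing weight, and normalizations are assumptions, not the paper's algorithm:

```python
# Hypothetical sketch: rank agents by a convex combination of a
# usage (selection-frequency) random walk and a competence signal.
import numpy as np

def agent_rank_uc(usage: np.ndarray, competence: np.ndarray,
                  alpha=0.7, iters=100, tol=1e-9):
    """usage: column-stochastic selection matrix; competence: per-agent
    quality scores (outcome quality, cost, safety, latency) in [0, 1]."""
    n = usage.shape[0]
    comp = competence / competence.sum()
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        new_rank = alpha * usage @ rank + (1 - alpha) * comp
        if np.abs(new_rank - rank).sum() < tol:   # converged
            break
        rank = new_rank
    return rank / rank.sum()
```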

[231] Sticker-TTS: Learn to Utilize Historical Experience with a Sticker-driven Test-Time Scaling Framework

Jie Chen, Jinhao Jiang, Yingqian Min, Zican Dong, Shijie Wang, Wayne Xin Zhao, Ji-Rong Wen

Main category: cs.AI

TL;DR: Sticker-TTS is a test-time scaling framework that coordinates three collaborative large reasoning models to iteratively explore and refine solutions using historical experience, achieving superior performance on mathematical reasoning benchmarks.

DetailsMotivation: Current test-time scaling methods rely on redundant sampling and ignore historical experience utilization, limiting computational efficiency during inference.

Method: Proposes a framework with three collaborative LRMs that extract, refine, and reuse critical information (called ‘stickers’) across multiple reasoning rounds. Uses a two-stage optimization strategy combining imitation learning with self-improvement for progressive refinement.

Result: Extensive evaluations on AIME-24, AIME-25, and OlymMATH benchmarks show Sticker-TTS consistently surpasses strong baselines including self-consistency and reinforcement learning approaches under comparable inference budgets.

Conclusion: The framework demonstrates the effectiveness of sticker-guided historical experience utilization for improving computational efficiency and performance in complex reasoning tasks.

Abstract: Large reasoning models (LRMs) have exhibited strong performance on complex reasoning tasks, with further gains achievable through increased computational budgets at inference. However, current test-time scaling methods predominantly rely on redundant sampling, ignoring historical experience utilization, thereby limiting computational efficiency. To overcome this limitation, we propose Sticker-TTS, a novel test-time scaling framework that coordinates three collaborative LRMs to iteratively explore and refine solutions guided by historical attempts. At the core of our framework are distilled key conditions - termed stickers - which drive the extraction, refinement, and reuse of critical information across multiple rounds of reasoning. To further enhance the efficiency and performance of our framework, we introduce a two-stage optimization strategy that combines imitation learning with self-improvement, enabling progressive refinement. Extensive evaluations on three challenging mathematical reasoning benchmarks, including AIME-24, AIME-25, and OlymMATH, demonstrate that Sticker-TTS consistently surpasses strong baselines, including self-consistency and advanced reinforcement learning approaches, under comparable inference budgets. These results highlight the effectiveness of sticker-guided historical experience utilization. Our code and data are available at https://github.com/RUCAIBox/Sticker-TTS.
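The three-model loop can be sketched as follows, with `extractor`, `refiner`, `solver`, and `verifier` as hypothetical stand-ins for the collaborating LRMs and the stopping check:

```python
# Hypothetical control loop for sticker-driven test-time scaling:
# extract key conditions ("stickers"), solve with them, and refine
# the stickers against failed attempts across rounds.
def sticker_tts(problem, extractor, refiner, solver, verifier, rounds=3):
    stickers = extractor(problem)            # distill key conditions
    history = []
    for _ in range(rounds):
        solution = solver(problem, stickers) # reason with the stickers
        if verifier(problem, solution):
            return solution
        history.append(solution)
        stickers = refiner(problem, stickers, history)  # reuse past attempts
    return solution                          # best effort within budget
```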

[232] Finding your MUSE: Mining Unexpected Solutions Engine

Nir Sweed, Hanit Hakim, Ben Wolfson, Hila Lifshitz, Dafna Shahaf

Main category: cs.AI

TL;DR: A methodology for building Functional Concept Graphs (FCGs) to overcome cognitive fixation in innovation, with MUSE algorithm for generating creative inspirations from patent data.

DetailsMotivation: Innovators often get stuck on existing solutions or early ideas, limiting exploration of novel alternatives. This cognitive fixation hinders creative problem-solving.

Method: Construct large-scale Functional Concept Graphs (FCGs) with explicit abstraction relations from patent data, and develop MUSE algorithm to generate creative inspirations by leveraging these graphs.

Result: Created high-quality FCGs from 500K patents with explicit abstraction relations, overcoming limitations of prior work. Released the computed FCG dataset for further research.

Conclusion: FCGs provide a powerful framework for supporting abstraction, problem reframing, and analogical inspiration, enabling more effective creative problem-solving by breaking cognitive fixation patterns.

Abstract: Innovators often exhibit cognitive fixation on existing solutions or nascent ideas, hindering the exploration of novel alternatives. This paper introduces a methodology for constructing Functional Concept Graphs (FCGs), interconnected representations of functional elements that support abstraction, problem reframing, and analogical inspiration. Our approach yields large-scale, high-quality FCGs with explicit abstraction relations, overcoming limitations of prior work. We further present MUSE, an algorithm leveraging FCGs to generate creative inspirations for a given problem. We demonstrate our method by computing an FCG on 500K patents, which we release for further research.

[233] Evaluation and Comparison Semantics for ODRL

Jaime Osvaldo Salas, Paolo Pareti, Semih Yumuşak, Soulmaz Gheisari, Luis-Daniel Ibáñez, George Konstantinidis

Main category: cs.AI

TL;DR: A formal semantics for ODRL based on query answering, enabling policy evaluation and comparison for digital rights management.

DetailsMotivation: ODRL is the de facto standard for governing digital resource access but lacks a comprehensive formal semantics, making policy evaluation and comparison difficult.

Method: Developed a simple and intuitive formal semantics based on query answering, refining previous formalizations and aligning with ODRL 2.2 specification.

Result: Created a formal evaluation framework that allows detecting equivalent, more restrictive, or more permissive policies in data sharing scenarios.

Conclusion: The proposed semantics provides a solid foundation for policy analysis and comparison, addressing a critical gap in ODRL implementation and standardization.

Abstract: We consider the problem of evaluating and comparing computational policies in the Open Digital Rights Language (ODRL), which has become the de facto standard for governing the access and usage of digital resources. Although preliminary progress has been made on the formal specification of the language’s features, a comprehensive formal semantics of ODRL is still missing. In this paper, we provide a simple and intuitive formal semantics for ODRL that is based on query answering. Our semantics refines previous formalisations, and is aligned with the latest published specification of the language (2.2). Building on our evaluation semantics, and motivated by data sharing scenarios, we also define and study the problem of comparing two policies, detecting equivalent, more restrictive or more permissive policies.
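As a toy illustration of this style of comparison, one can evaluate each policy as the set of requests it permits over a finite request space and compare by set containment; actual ODRL evaluation semantics are much richer than this sketch:

```python
# Toy policy comparison via query answering: a policy is evaluated as
# the set of (actor, action, asset) requests it permits.
def permitted_set(policy, request_space):
    return {req for req in request_space if policy(req)}

def compare(policy_a, policy_b, request_space):
    a = permitted_set(policy_a, request_space)
    b = permitted_set(policy_b, request_space)
    if a == b:
        return "equivalent"
    if a < b:                       # proper subset: permits strictly less
        return "A is more restrictive"
    if a > b:
        return "A is more permissive"
    return "incomparable"
```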

[234] LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation

Yinglin Duan, Zhengxia Zou, Tongwei Gu, Wei Jia, Zhan Zhao, Luyi Xu, Xinzhu Liu, Hao Jiang, Kang Chen, Shuang Qiu

Main category: cs.AI

TL;DR: LatticeWorld is a 3D world generation framework that uses lightweight LLMs and Unreal Engine 5 to create dynamic, interactive 3D environments from multimodal inputs, achieving 90x efficiency gains over manual methods.

DetailsMotivation: To bridge the sim-to-real gap and enable convenient creation of realistic 3D simulations for applications like embodied AI, autonomous driving, and entertainment by automating the industrial production pipeline.

Method: Proposes LatticeWorld framework that combines lightweight LLaMA-2-7B LLMs with Unreal Engine 5 rendering engine to generate dynamic 3D environments from textual descriptions and visual instructions as multimodal inputs.

Result: Achieves superior accuracy in scene layout generation and visual fidelity, with over 90x increase in industrial production efficiency compared to traditional manual methods while maintaining high creative quality.

Conclusion: LatticeWorld demonstrates an effective approach to automated 3D world generation that significantly streamlines production pipelines while delivering high-quality, interactive environments with realistic physics and multi-agent interactions.

Abstract: Recent research has been increasingly focusing on developing 3D world models that simulate complex real-world scenarios. World models have found broad applications across various domains, including embodied AI, autonomous driving, entertainment, etc. A more realistic simulation with accurate physics will effectively narrow the sim-to-real gap and allow us to gather rich information about the real world conveniently. While traditional manual modeling has enabled the creation of virtual 3D scenes, modern approaches have leveraged advanced machine learning algorithms for 3D world generation, with most recent advances focusing on generative methods that can create virtual worlds based on user instructions. This work explores such a research direction by proposing LatticeWorld, a simple yet effective 3D world generation framework that streamlines the industrial production pipeline of 3D environments. LatticeWorld leverages lightweight LLMs (LLaMA-2-7B) alongside an industry-grade rendering engine (e.g., Unreal Engine 5) to generate a dynamic environment. Our proposed framework accepts textual descriptions and visual instructions as multimodal inputs and creates large-scale 3D interactive worlds with dynamic agents, featuring competitive multi-agent interaction, high-fidelity physics simulation, and real-time rendering. We conduct comprehensive experiments to evaluate LatticeWorld, showing that it achieves superior accuracy in scene layout generation and visual fidelity. Moreover, LatticeWorld achieves over a 90x increase in industrial production efficiency while maintaining high creative quality compared with traditional manual production methods. Our demo video is available at https://youtu.be/8VWZXpERR18

[235] MMoE: Robust Spoiler Detection with Multi-modal Information and Domain-aware Mixture-of-Experts

Zinan Zeng, Sen Ye, Zijian Cai, Heng Wang, Yuhan Liu, Haokai Zhang, Minnan Luo

Main category: cs.AI

TL;DR: MMoE is a multi-modal network that uses graph, text, and metadata features with Mixture-of-Experts architecture for robust spoiler detection in movie reviews, achieving state-of-the-art performance.

DetailsMotivation: Existing spoiler detection methods only use text content and ignore valuable metadata and user information. Spoiler language is genre-specific, creating domain generalization challenges.

Method: Extracts graph features from user-movie network, text features from review content, and meta features from review metadata. Uses Mixture-of-Experts architecture to handle genre-specific spoilers and expert fusion layer to integrate multi-modal features.

Result: Achieves SOTA performance on two datasets with 2.56% and 8.41% improvements in accuracy and F1-score. Demonstrates superior robustness and generalization.

Conclusion: MMoE effectively leverages multi-modal information and domain adaptation to significantly improve spoiler detection performance across different movie genres.

Abstract: Online movie review websites are valuable for information and discussion about movies. However, the massive spoiler reviews detract from the movie-watching experience, making spoiler detection an important task. Previous methods simply focus on reviews’ text content, ignoring the heterogeneity of information in the platform. For instance, the metadata and the corresponding user’s information of a review could be helpful. Besides, the spoiler language of movie reviews tends to be genre-specific, thus posing a domain generalization challenge for existing methods. To this end, we propose MMoE, a multi-modal network that utilizes information from multiple modalities to facilitate robust spoiler detection and adopts Mixture-of-Experts to enhance domain generalization. MMoE first extracts graph, text, and meta feature from the user-movie network, the review’s textual content, and the review’s metadata respectively. To handle genre-specific spoilers, we then adopt Mixture-of-Experts architecture to process information in three modalities to promote robustness. Finally, we use an expert fusion layer to integrate the features from different perspectives and make predictions based on the fused embedding. Experiments demonstrate that MMoE achieves state-of-the-art performance on two widely-used spoiler detection datasets, surpassing previous SOTA methods by 2.56% and 8.41% in terms of accuracy and F1-score. Further experiments also demonstrate MMoE’s superiority in robustness and generalization. Our code is available at https://github.com/zzqbjt/Spoiler-Detection.
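A hypothetical PyTorch sketch of the three-modality Mixture-of-Experts plus expert fusion layer; dimensions, expert counts, and class names are illustrative assumptions, not the paper's architecture details:

```python
# Hypothetical sketch: per-modality features (graph, text, meta) pass
# through a mixture of experts, then a fusion layer makes the
# spoiler/non-spoiler prediction.
import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    def __init__(self, dim, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)             # (B, E)
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, D)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)          # (B, D)

class SpoilerClassifier(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.graph_moe = MixtureOfExperts(dim)
        self.text_moe = MixtureOfExperts(dim)
        self.meta_moe = MixtureOfExperts(dim)
        self.fusion = nn.Linear(3 * dim, 2)   # expert fusion layer

    def forward(self, graph_f, text_f, meta_f):
        fused = torch.cat([self.graph_moe(graph_f),
                           self.text_moe(text_f),
                           self.meta_moe(meta_f)], dim=-1)
        return self.fusion(fused)             # spoiler logits
```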

[236] Low-Dimensional Federated Knowledge Graph Embedding via Knowledge Distillation

Xiaoxiong Zhang, Zhiwei Zeng, Xin Zhou, Zhiqi Shen

Main category: cs.AI

TL;DR: FedKD is a knowledge distillation method for federated knowledge graph embedding that compresses high-dimensional models to low-dimensional ones while maintaining performance and reducing communication costs.

DetailsMotivation: High-dimensional embeddings in federated learning improve performance but create storage and communication challenges. Existing compression methods require multiple trainings which are inefficient for federated settings.

Method: Uses knowledge distillation with adaptive temperature scaling to help low-dimensional student models mimic high-dimensional teacher models. Separately adjusts positive/negative triple scores and dynamically weights KD loss.

Result: Extensive experiments on three datasets demonstrate FedKD’s effectiveness in compressing embeddings while maintaining performance.

Conclusion: FedKD provides an efficient compression solution specifically designed for federated knowledge graph embedding, addressing communication costs and storage issues while preserving model performance.

Abstract: Federated Knowledge Graph Embedding (FKGE) aims to facilitate collaborative learning of entity and relation embeddings from distributed Knowledge Graphs (KGs) across multiple clients, while preserving data privacy. Training FKGE models with higher dimensions is typically favored due to their potential for achieving superior performance. However, high-dimensional embeddings present significant challenges in terms of storage resource and inference speed. Unlike traditional KG embedding methods, FKGE involves multiple client-server communication rounds, where communication efficiency is critical. Existing embedding compression methods for traditional KGs may not be directly applicable to FKGE as they often require multiple model trainings which potentially incur substantial communication costs. In this paper, we propose a light-weight component based on Knowledge Distillation (KD) which is titled FedKD and tailored specifically for FKGE methods. During client-side local training, FedKD facilitates the low-dimensional student model to mimic the score distribution of triples from the high-dimensional teacher model using a KL divergence loss. Unlike the traditional KD approach, FedKD adaptively learns a temperature to scale the scores of positive triples and separately adjusts the scores of corresponding negative triples using a predefined temperature, thereby mitigating the teacher over-confidence issue. Furthermore, we dynamically adjust the weight of the KD loss to optimize the training process. Extensive experiments on three datasets support the effectiveness of FedKD.
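A sketch of the distillation term as described: a learned temperature scales positive-triple scores, a predefined temperature scales the negatives, and the student matches the teacher's distribution under a KL loss. `kd_weight` stands in for the dynamically adjusted loss weight; the exact formulation is an assumption:

```python
# Hypothetical sketch of a FedKD-style distillation loss.
import torch
import torch.nn.functional as F

def fedkd_loss(student_scores, teacher_scores, learned_t, fixed_t=2.0,
               kd_weight=1.0):
    """scores: (batch, 1 + n_neg); column 0 is the positive triple."""
    def scale(scores):
        pos = scores[:, :1] / learned_t   # adaptive temperature (positives)
        neg = scores[:, 1:] / fixed_t     # predefined temperature (negatives)
        return torch.cat([pos, neg], dim=1)

    s = F.log_softmax(scale(student_scores), dim=1)
    t = F.softmax(scale(teacher_scores).detach(), dim=1)
    return kd_weight * F.kl_div(s, t, reduction="batchmean")
```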

[237] Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning

Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Josh Kimball, Ling Liu

Main category: cs.AI

TL;DR: Antidote is a post-fine-tuning defense that uses one-shot pruning to remove harmful weights from safety-aligned LLMs compromised by harmful fine-tuning attacks, working regardless of training hyper-parameters.

DetailsMotivation: Existing defenses fail when specific training hyper-parameters (large learning rate or many epochs) are used in harmful fine-tuning attacks that break LLM safety alignment.

Method: A post-fine-tuning pruning stage that removes harmful parameters responsible for generating harmful content, based on the philosophy that removing harmful weights can recover model safety regardless of how they were formed.

Result: Empirical results show Antidote reduces harmful score while maintaining accuracy on downstream tasks, remaining effective across different training hyper-parameters.

Conclusion: Antidote provides an effective and hyper-parameter-agnostic defense against harmful fine-tuning attacks through simple one-shot pruning after fine-tuning.

Abstract: Safety-aligned Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks – a small amount of harmful data mixed into the fine-tuning dataset can break the LLM’s safety alignment. While several defenses have been proposed, our evaluation shows that existing defenses fail when specific training hyper-parameters are chosen – a large learning rate or a large number of training epochs in the fine-tuning stage can easily invalidate the defense. To this end, we propose Antidote, a post-fine-tuning stage solution which remains agnostic to the training hyper-parameters of the fine-tuning stage. Antidote relies on the philosophy that by removing the harmful parameters, the harmful model can be recovered from the harmful behaviors, regardless of how those harmful parameters are formed in the fine-tuning stage. With this philosophy, we introduce a one-shot pruning stage after harmful fine-tuning to remove the harmful weights that are responsible for the generation of harmful content. Despite its embarrassing simplicity, empirical results show that Antidote can reduce the harmful score while maintaining accuracy on downstream tasks. Code is available at https://github.com/git-disl/Antidote.
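A hypothetical one-shot pruning step in this spirit, using |w * grad| under a harmful-data loss as the importance score; the paper's actual scoring rule may differ, and `prune_ratio` is an assumed hyper-parameter:

```python
# Hypothetical sketch: rank weights by importance on harmful prompts
# and zero out the top fraction, in one shot after fine-tuning.
import torch

def prune_harmful_weights(model, harmful_loss, prune_ratio=0.01):
    model.zero_grad()
    harmful_loss.backward()                      # gradients w.r.t. harmful data
    flat_scores = torch.cat([(p * p.grad).abs().flatten()
                             for p in model.parameters() if p.grad is not None])
    k = max(1, int(prune_ratio * flat_scores.numel()))
    threshold = torch.topk(flat_scores, k).values.min()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                # Zero the weights deemed responsible for harmful output.
                p.masked_fill_((p * p.grad).abs() >= threshold, 0.0)
    return model
```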

[238] Neural Network Verification with PyRAT

Augustin Lemesle, Julien Lehmann, Tristan Le Gall

Main category: cs.AI

TL;DR: PyRAT is an abstract interpretation-based tool for verifying neural network safety and robustness, achieving second place in VNN-Comp 2024.

DetailsMotivation: As AI systems become increasingly deployed in critical domains like health, transport, and energy, there is a growing need to provide safety guarantees and build trust in these systems.

Method: PyRAT uses abstract interpretation to analyze neural networks by computing reachable states from input through various abstraction techniques, providing fast and accurate analysis.

Result: The tool has been successfully used in multiple collaborations to ensure safety guarantees and demonstrated strong performance by achieving second place in the VNN-Comp 2024 competition.

Conclusion: PyRAT provides an effective approach for verifying neural network safety and robustness through abstract interpretation, making it suitable for critical AI applications where safety guarantees are essential.

Abstract: As AI systems become increasingly popular and are used in various critical domains (health, transport, energy, …), the need to provide guarantees of their safety and to build trust in them is undeniable. To this end, we present PyRAT, a tool based on abstract interpretation to verify the safety and robustness of neural networks. In this paper, we describe the different abstractions used by PyRAT to find the reachable states of a neural network starting from its input, as well as the main features of the tool that provide fast and accurate analysis of neural networks. PyRAT has already been used in several collaborations to ensure safety guarantees, with its second place at the VNN-Comp 2024 showcasing its performance.
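For intuition, the simplest abstract domain such a tool builds on is interval (box) propagation through affine and ReLU layers; PyRAT itself implements more precise abstractions than this textbook sketch:

```python
# Textbook interval abstraction for a feed-forward ReLU network:
# propagate an input box [lower, upper] to bounds on the outputs.
import numpy as np

def interval_forward(lower, upper, weights, biases):
    last = len(weights) - 1
    for i, (W, b) in enumerate(zip(weights, biases)):
        W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
        new_lower = W_pos @ lower + W_neg @ upper + b
        new_upper = W_pos @ upper + W_neg @ lower + b
        if i < last:   # ReLU abstract transformer on hidden layers
            new_lower = np.maximum(new_lower, 0.0)
            new_upper = np.maximum(new_upper, 0.0)
        lower, upper = new_lower, new_upper
    return lower, upper
```

If the output bounds returned here satisfy a property (e.g., the correct class's lower bound exceeds every other class's upper bound), the network is verified safe on that input box.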

[239] GUI Agents: A Survey

Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. Rossi, Franck Dernoncourt

Main category: cs.AI

TL;DR: A comprehensive survey of GUI agents powered by Large Foundation Models, covering benchmarks, evaluation metrics, architectures, training methods, and proposing a unified framework for perception, reasoning, planning, and acting capabilities.

DetailsMotivation: The growing interest and fundamental importance of GUI agents that automate human-computer interaction by emulating human actions like clicking, typing, and navigating visual elements across diverse platforms.

Method: Provides a comprehensive survey that categorizes GUI agent benchmarks, evaluation metrics, architectures, and training methods. Proposes a unified framework delineating perception, reasoning, planning, and acting capabilities.

Result: Identifies important open challenges and discusses key future directions for GUI agent development and research.

Conclusion: Serves as a basis for practitioners and researchers to understand current progress, techniques, benchmarks, and critical open problems in GUI agent technology.

Abstract: Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. These agents autonomously interact with digital systems or software applications via GUIs, emulating human actions such as clicking, typing, and navigating visual elements across diverse platforms. Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods. We propose a unified framework that delineates their perception, reasoning, planning, and acting capabilities. Furthermore, we identify important open challenges and discuss key future directions. Finally, this work serves as a basis for practitioners and researchers to gain an intuitive understanding of current progress, techniques, benchmarks, and critical open problems that remain to be addressed.

[240] Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment

Gaole Dai, Shiqi Jiang, Ting Cao, Yuanchun Li, Yuqing Yang, Rui Tan, Mo Li, Lili Qiu

Main category: cs.AI

TL;DR: V-Droid is a mobile GUI task automation agent that uses LLMs as verifiers instead of generators, achieving state-of-the-art performance on multiple benchmarks with significantly faster execution speed.

DetailsMotivation: Previous mobile agents use LLMs to directly generate actions at each step, which can be inefficient and error-prone. V-Droid introduces a novel paradigm where LLMs serve as verifiers to evaluate candidate actions before making decisions.

Method: The framework includes: 1) discretized action space construction with prefilling-only workflow for faster verification, 2) pair-wise progress preference training to enhance verifier decision-making, and 3) scalable human-agent joint annotation for efficient data collection.

Result: V-Droid achieves 59.5% on AndroidWorld, 38.3% on AndroidLab, and 49% on MobileAgentBench, surpassing existing agents by 5.2%, 2.1%, and 9% respectively. It also achieves 4.3s per step latency, which is 6.1X faster than existing agents.

Conclusion: The verifier-driven approach using LLMs for action evaluation rather than direct generation significantly improves both success rates and execution speed in mobile GUI task automation, demonstrating the effectiveness of this novel paradigm.

Abstract: We propose V-Droid, a mobile GUI task automation agent. Unlike previous mobile agents that utilize Large Language Models (LLMs) as generators to directly generate actions at each step, V-Droid employs LLMs as verifiers to evaluate candidate actions before making final decisions. To realize this novel paradigm, we introduce a comprehensive framework for constructing verifier-driven mobile agents: the discretized action space construction coupled with the prefilling-only workflow to accelerate the verification process, the pair-wise progress preference training to significantly enhance the verifier’s decision-making capabilities, and the scalable human-agent joint annotation scheme to efficiently collect the necessary data at scale. V-Droid obtains a substantial task success rate across several public mobile task automation benchmarks: 59.5% on AndroidWorld, 38.3% on AndroidLab, and 49% on MobileAgentBench, surpassing existing agents by 5.2%, 2.1%, and 9%, respectively. Furthermore, V-Droid achieves a remarkably low latency of 4.3s per step, which is 6.1X faster compared with existing mobile agents. The source code is available at https://github.com/V-Droid-Agent/V-Droid.
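
A minimal sketch of the verifier-driven control loop: rather than asking an LLM to generate the next action, each discretized candidate is scored and the best one is executed. `llm_verifier` is a hypothetical stand-in; V-Droid's prefilling-only batching and trained verifier are not reproduced here.

```python
def select_action(llm_verifier, task, ui_state, candidate_actions):
    """Pick the candidate action the verifier scores highest."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        prompt = (
            f"Task: {task}\nUI state: {ui_state}\n"
            f"Candidate action: {action}\n"
            "Does this action make progress toward completing the task?"
        )
        score = llm_verifier(prompt)  # stand-in: returns a scalar score
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

Scoring each candidate with a short, fixed-shape verification prompt is what makes a prefilling-only (no decoding) workflow possible, which is where the reported latency gains come from.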

[241] Epistemic Skills: Reasoning about Knowledge and Oblivion

Xiaolong Liang, Yì N. Wáng

Main category: cs.AI

TL;DR: A weighted epistemic logic framework that models knowledge acquisition (upskilling) and oblivion (downskilling) with group knowledge concepts, including knowability/forgettability analysis and de re/de dicto distinctions.

DetailsMotivation: To develop a formal system that captures the dynamic processes of gaining knowledge and forgetting, while incorporating group epistemic concepts and providing computational complexity analysis.

Method: Uses a system of weighted models with an “epistemic skills” metric to represent epistemic capacities. Models knowledge acquisition as upskilling and oblivion as downskilling processes.

Result: The framework enables exploration of knowability (potential to gain knowledge) and forgettability (potential to lapse into oblivion), and supports analysis of epistemic de re vs de dicto expressions.

Conclusion: The paper establishes a comprehensive epistemic logic system that dynamically models knowledge changes, provides computational complexity insights for model checking and satisfiability problems, and offers theoretical foundations for understanding knowledge dynamics.

Abstract: This paper presents a class of epistemic logics that captures the dynamics of acquiring knowledge and descending into oblivion, while incorporating concepts of group knowledge. The approach is grounded in a system of weighted models, introducing an “epistemic skills” metric to represent the epistemic capacities tied to knowledge updates. Within this framework, knowledge acquisition is modeled as a process of upskilling, whereas oblivion is represented as a consequence of downskilling. The framework further enables exploration of “knowability” and “forgettability,” defined as the potential to gain knowledge through upskilling and to lapse into oblivion through downskilling, respectively. Additionally, it supports a detailed analysis of the distinctions between epistemic de re and de dicto expressions. The computational complexity of the model checking and satisfiability problems is examined, offering insights into their theoretical foundations and practical implications.
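
As a toy illustration only, one plausible reading of skill-bounded knowledge can be model-checked in a few lines: the agent cannot distinguish worlds whose edge weight exceeds its skill, and knows a proposition iff it holds at every indistinguishable world. The `weight`/`valuation` encoding and this indistinguishability rule are assumptions for illustration, not the paper's formal semantics.

```python
def accessible(weight, skill, w):
    # Worlds indistinguishable from w: telling a pair apart is assumed
    # to require skill at least the edge weight.
    return {v for v, cost in weight[w].items() if cost > skill} | {w}

def knows(weight, valuation, skill, w, p):
    # Knowledge as truth at every accessible world.
    return all(p in valuation[v] for v in accessible(weight, skill, w))

weight = {"w": {"v": 3}, "v": {"w": 3}}
valuation = {"w": {"p"}, "v": set()}
print(knows(weight, valuation, skill=1, w="w", p="p"))  # False: v still possible
print(knows(weight, valuation, skill=5, w="w", p="p"))  # True after upskilling
```

Lowering `skill` enlarges the accessible set again, which is one way to picture downskilling into oblivion.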

[242] ArtRAG: Retrieval-Augmented Generation with Structured Context for Visual Art Understanding

Shuai Wang, Ivona Najdenkoska, Hongyi Zhu, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring

Main category: cs.AI

TL;DR: ArtRAG is a training-free framework that combines knowledge graphs with retrieval-augmented generation to provide multi-perspective artwork explanations, outperforming existing methods on art interpretation tasks.

DetailsMotivation: Current multimodal LLMs fail to capture nuanced interpretations required for fine art analysis, which demands reasoning across cultural, historical, and stylistic perspectives beyond simple object recognition.

Method: ArtRAG automatically constructs an Art Context Knowledge Graph (ACKG) from domain-specific texts, organizing entities like artists, movements, and historical events. It uses a multi-granular structured retriever to select relevant subgraphs to guide MLLM generation.

Result: Experiments on SemArt and Artpedia datasets show ArtRAG outperforms heavily trained baselines. Human evaluations confirm it generates coherent, insightful, and culturally enriched interpretations.

Conclusion: The framework successfully enables MLLMs to produce contextually grounded, culturally informed art descriptions without additional training, demonstrating the value of structured knowledge integration for art interpretation.

Abstract: Understanding visual art requires reasoning across multiple perspectives – cultural, historical, and stylistic – beyond mere object recognition. While recent multimodal large language models (MLLMs) perform well on general image captioning, they often fail to capture the nuanced interpretations that fine art demands. We propose ArtRAG, a novel, training-free framework that combines structured knowledge with retrieval-augmented generation (RAG) for multi-perspective artwork explanation. ArtRAG automatically constructs an Art Context Knowledge Graph (ACKG) from domain-specific textual sources, organizing entities such as artists, movements, themes, and historical events into a rich, interpretable graph. At inference time, a multi-granular structured retriever selects semantically and topologically relevant subgraphs to guide generation. This enables MLLMs to produce contextually grounded, culturally informed art descriptions. Experiments on the SemArt and Artpedia datasets show that ArtRAG outperforms several heavily trained baselines. Human evaluations further confirm that ArtRAG generates coherent, insightful, and culturally enriched interpretations.

[243] Translating Federated Learning Algorithms in Python into CSP Processes Using ChatGPT

Miroslav Popovic, Marko Popovic, Miodrag Djukic, Ilija Basicevic

Main category: cs.AI

TL;DR: A framework that uses ChatGPT to automatically translate Python federated learning algorithms into CSP processes for model checking verification.

DetailsMotivation: To automate the translation of federated learning algorithms from Python to CSP processes, making verification more accessible and reducing manual effort.

Method: Using ChatGPT to translate Python FL algorithms into CSP processes with minimal context, then verifying correctness with the PAT model checker.

Result: Successful automated translation and verification of both centralized and decentralized federated learning algorithms.

Conclusion: ChatGPT can effectively automate the translation process for formal verification of federated learning algorithms, reducing manual translation effort.

Abstract: The Python Testbed for Federated Learning Algorithms is a simple Python FL framework that is easy for ML&AI developers, who need not be professional programmers, to use, and is also amenable to LLMs. In previous research, generic federated learning algorithms provided by this framework were manually translated into CSP processes, and the algorithms’ safety and liveness properties were automatically verified by the model checker PAT. In this paper, a simple translation process is introduced in which ChatGPT is used to automate the translation of these federated learning algorithms from Python into the corresponding CSP processes. Within the process, the minimality of the context used is estimated based on feedback from ChatGPT. The proposed translation process was experimentally validated by the successful translation (verified by the model checker PAT) of both generic centralized and decentralized federated learning algorithms.

[244] Don’t Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning

William F. Shen, Xinchi Qiu, Nicola Cancedda, Nicholas D. Lane

Main category: cs.AI

TL;DR: SEAT: A fine-tuning method that preserves LLMs’ ignorance awareness (ability to express uncertainty) while learning new knowledge, preventing hallucination-causing activation displacement.

DetailsMotivation: Conventional fine-tuning causes catastrophic forgetting of aligned capabilities like epistemic uncertainty expression, leading to hallucinations despite preserving performance on seen data.

Method: SEAT combines sparse tuning to constrain activation drift and entity perturbation to counter knowledge entanglement during fine-tuning.

Result: SEAT significantly outperforms baselines in preserving ignorance awareness while maintaining optimal fine-tuning performance on both real-world and synthetic datasets.

Conclusion: SEAT offers a robust solution for LLM fine-tuning that maintains critical aligned capabilities like uncertainty awareness when learning new knowledge instances.

Abstract: Existing work on mitigating catastrophic forgetting during large language models (LLMs) fine-tuning for new knowledge instances has primarily focused on preserving performance on previously seen data, while critically overlooking the collapse of essential capabilities instilled through alignment, most notably the model’s ability to faithfully express epistemic uncertainty (a property we term ‘Ignorance Awareness’). In this work, we formalize the notion of Ignorance Awareness and illustrate that conventional fine-tuning methods can result in substantial activation displacement. This displacement undermines the critical capability of ignorance awareness, leading to undesirable behaviors such as hallucinations. To address this challenge, we introduce SEAT, a simple and principled fine-tuning approach that not only enables the model to effectively acquire new knowledge instances but also preserves its aligned ignorance awareness. SEAT integrates two key components: (1) sparse tuning that constrains activation drift, and (2) a novel entity perturbation method designed to counter knowledge entanglement. Experimental results demonstrate that, across both real-world and synthetic datasets, SEAT significantly outperforms baselines in preserving ignorance awareness while retaining optimal fine-tuning performance, offering a more robust solution for LLM fine-tuning.
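
A hedged PyTorch sketch of the sparse-tuning half of the recipe: constrain activation drift by letting only a small, high-magnitude subset of gradients update the model. The top-k selection rule here is a generic stand-in, not SEAT's exact procedure, and the entity-perturbation component is not shown.

```python
import torch

def sparse_grad_masks(model, keep_frac=0.01):
    """Per-tensor masks keeping only the largest-magnitude gradients.
    A generic sparse-tuning stand-in, not SEAT's selection rule."""
    masks = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        flat = p.grad.abs().flatten()
        k = max(1, int(keep_frac * flat.numel()))
        thresh = torch.topk(flat, k).values.min()
        masks[name] = (p.grad.abs() >= thresh).to(p.grad.dtype)
    return masks

def masked_step(model, optimizer, masks):
    # Zero gradients outside the sparse subset, then update as usual.
    for name, p in model.named_parameters():
        if name in masks and p.grad is not None:
            p.grad.mul_(masks[name])
    optimizer.step()
```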

[245] DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning

Hang Wu, Hongkai Chen, Yujun Cai, Chang Liu, Qingwen Ye, Ming-Hsuan Yang, Yiwei Wang

Main category: cs.AI

TL;DR: DiMo-GUI is a training-free framework for GUI grounding that separates visual elements into text and icons, then uses dynamic region zooming to refine ambiguous predictions without additional training.

DetailsMotivation: GUI grounding faces challenges from visual element diversity, spatial clutter, and language ambiguity, requiring methods that can handle these complexities without extensive training.

Method: Splits GUI into textual and iconic elements for independent reasoning, uses dynamic focal region generation centered on initial predictions, and incrementally zooms into subregions for refinement.

Result: Consistent improvements over baseline inference pipelines on standard GUI grounding benchmarks, demonstrating effective disambiguation of crowded layouts.

Conclusion: Combining modality separation with region-focused reasoning provides an effective training-free solution for GUI grounding challenges.

Abstract: Grounding natural language queries in graphical user interfaces (GUIs) poses unique challenges due to the diversity of visual elements, spatial clutter, and the ambiguity of language. In this paper, we introduce DiMo-GUI, a training-free framework for GUI grounding that leverages two core strategies: dynamic visual grounding and modality-aware optimization. Instead of treating the GUI as a monolithic image, our method splits the input into textual elements and iconic elements, allowing the model to reason over each modality independently using general-purpose vision-language models. When predictions are ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focal regions centered on the model’s initial predictions and incrementally zooms into subregions to refine the grounding result. This hierarchical refinement process helps disambiguate visually crowded layouts without the need for additional training or annotations. We evaluate our approach on standard GUI grounding benchmarks and demonstrate consistent improvements over baseline inference pipelines, highlighting the effectiveness of combining modality separation with region-focused reasoning.
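
A minimal sketch of the zoom-and-refine loop, assuming a stand-in `model(image, query) -> (x, y)` grounding call and PIL-style images; the text/icon modality split happens before this loop and is omitted here.

```python
def ground(model, image, query, steps=3, zoom=0.5):
    """Hierarchical refinement: re-query the model on successively
    smaller crops centered on its last prediction."""
    W, H = image.size
    cx, cy = model(image, query)              # coarse guess, full image
    w, h = float(W), float(H)
    for _ in range(steps):
        w, h = w * zoom, h * zoom             # shrink the focal region
        left = int(min(max(cx - w / 2, 0), W - w))
        top = int(min(max(cy - h / 2, 0), H - h))
        crop = image.crop((left, top, int(left + w), int(top + h)))
        rx, ry = model(crop, query)           # prediction in crop coordinates
        cx, cy = left + rx, top + ry          # map back to full image
    return cx, cy
```

Because every step reuses the same general-purpose model on a smaller view, the loop needs no extra training or annotation, matching the training-free framing above.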

[246] Conversational Education at Scale: A Multi-LLM Agent Workflow for Procedural Learning and Pedagogic Quality Assessment

Jiahuan Pei, Fanghua Ye, Xin Sun, Wentao Deng, Koen Hindriks, Junxiao Wang

Main category: cs.AI

TL;DR: WikiHowAgent is a multi-agent LLM system that simulates teacher-learner conversations for procedural learning, using a large dataset of 114k+ conversations across diverse domains with comprehensive evaluation metrics.

DetailsMotivation: Existing LLM-based educational systems lack scalability, fail to leverage large-scale course content, and have limited frameworks for assessing pedagogic quality.

Method: A multi-agent workflow with teacher and learner agents, interaction manager, and evaluator that simulates interactive teaching-learning conversations grounded in WikiHow tutorials across 17 domains and 727 topics.

Result: Created a dataset of 114,296 teacher-learner conversations from 14,287 tutorials. The workflow demonstrated effectiveness across diverse setups and provided insights into LLM capabilities across different domains.

Conclusion: WikiHowAgent offers a scalable framework for procedural learning with comprehensive pedagogic assessment, and all datasets and implementations are open-sourced for community use.

Abstract: Large language models (LLMs) have advanced virtual educators and learners, bridging NLP with AI4Education. Existing work often lacks scalability and fails to leverage diverse, large-scale course content, with limited frameworks for assessing pedagogic quality. To this end, we propose WikiHowAgent, a multi-agent workflow leveraging LLMs to simulate interactive teaching-learning conversations. It integrates teacher and learner agents, an interaction manager, and an evaluator to facilitate procedural learning and assess pedagogic quality. We introduce a dataset of 114,296 teacher-learner conversations grounded in 14,287 tutorials across 17 domains and 727 topics. Our evaluation protocol combines computational and rubric-based metrics with human judgment alignment. Results demonstrate the workflow’s effectiveness in diverse setups, offering insights into LLM capabilities across domains. Our datasets and implementations are fully open-sourced.

[247] MeLA: A Metacognitive LLM-Driven Architecture for Automatic Heuristic Design

Zishang Qiu, Xinan Chen, Long Chen, Ruibin Bai

Main category: cs.AI

TL;DR: MeLA is a metacognitive LLM architecture that evolves prompts instead of code for automatic heuristic design, outperforming traditional methods by using performance feedback to refine generative strategies.

DetailsMotivation: Traditional evolutionary methods for automatic heuristic design operate directly on code, which can be inefficient. The authors propose using LLMs with evolved prompts to generate better heuristics through a metacognitive approach.

Method: MeLA uses prompt evolution driven by a metacognitive framework: problem analyzer creates initial prompts, error diagnosis repairs faulty code, and metacognitive search optimizes prompts based on heuristic effectiveness.

Result: In comprehensive experiments across benchmark and real-world problems, MeLA consistently generated more effective and robust heuristics, significantly outperforming state-of-the-art methods.

Conclusion: The research demonstrates the potential of using cognitive science as a blueprint for AI architecture, showing that metacognitive regulation of LLM problem-solving enables more robust and interpretable automatic heuristic design.

Abstract: This paper introduces MeLA, a Metacognitive LLM-Driven Architecture that presents a new paradigm for Automatic Heuristic Design (AHD). Traditional evolutionary methods operate directly on heuristic code; in contrast, MeLA evolves the instructional prompts used to guide a Large Language Model (LLM) in generating these heuristics. This process of “prompt evolution” is driven by a novel metacognitive framework where the system analyzes performance feedback to systematically refine its generative strategy. MeLA’s architecture integrates a problem analyzer to construct an initial strategic prompt, an error diagnosis system to repair faulty code, and a metacognitive search engine that iteratively optimizes the prompt based on heuristic effectiveness. In comprehensive experiments across both benchmark and real-world problems, MeLA consistently generates more effective and robust heuristics, significantly outperforming state-of-the-art methods. Ultimately, this research demonstrates the profound potential of using cognitive science as a blueprint for AI architecture, revealing that by enabling an LLM to metacognitively regulate its problem-solving process, we unlock a more robust and interpretable path to AHD.
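
A compact sketch of the prompt-evolution feedback loop described above; `llm` and `evaluate` are hypothetical stand-ins, and MeLA's problem analyzer and metacognitive search are far richer than this greedy loop.

```python
def evolve_prompt(llm, evaluate, seed_prompt, generations=10):
    """Evolve the instructions that generate heuristics, not the code.
    llm(prompt) returns heuristic code; evaluate(code) returns
    (score, error_text_or_None)."""
    best_prompt, best_code, best_score = seed_prompt, None, float("-inf")
    prompt = seed_prompt
    for _ in range(generations):
        code = llm(prompt)
        score, error = evaluate(code)
        if error:  # crude stand-in for MeLA's error-diagnosis system
            prompt = f"{prompt}\nThe last heuristic failed with: {error}. Fix it."
            continue
        if score > best_score:
            best_prompt, best_code, best_score = prompt, code, score
        prompt = (f"{best_prompt}\nThe best heuristic so far scored "
                  f"{best_score:.3f}. Revise the instructions to produce "
                  "a stronger heuristic.")
    return best_code
```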

[248] FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction

Zhiyuan Zeng, Jiashuo Liu, Siyuan Chen, Tianci He, Yali Liao, Yixiao Tian, Jinpeng Wang, Zaiyuan Wang, Yang Yang, Lingyue Yin, Mingren Yin, Zhenwei Zhu, Tianle Cai, Zehui Chen, Jiecao Chen, Yantao Du, Xiang Gao, Jiacheng Guo, Liang Hu, Jianpeng Jiao, Xiangsheng Li, Jingkai Liu, Shuang Ni, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xin Zhou, Jose Blanchet, Xipeng Qiu, Mengdi Wang, Wenhao Huang

Main category: cs.AI

TL;DR: FutureX is a dynamic live benchmark for evaluating LLM agents on future prediction tasks, featuring real-time updates and automated pipelines to prevent data contamination, with comprehensive evaluation of 25 models.

DetailsMotivation: No large-scale benchmark exists for evaluating LLM agents on future prediction due to challenges with real-time updates and timely answer retrieval, despite the importance of this complex analytical task.

Method: Created FutureX benchmark with automated pipeline for question gathering and answer collection, supporting real-time daily updates. Evaluated 25 LLM/agent models including reasoning, search capabilities, and external tool integration.

Result: Comprehensive evaluation assessed agents’ adaptive reasoning and performance in dynamic environments, with in-depth analysis of failure modes including vulnerability to fake web pages and temporal validity issues.

Conclusion: FutureX establishes a dynamic, contamination-free evaluation standard to drive development of LLM agents capable of performing at professional human analyst levels in complex reasoning and predictive thinking.

Abstract: Future prediction is a complex task for LLM agents, requiring a high level of analytical thinking, information gathering, contextual understanding, and decision-making under uncertainty. Agents must not only gather and interpret vast amounts of dynamic information but also integrate diverse data sources, weigh uncertainties, and adapt predictions based on emerging trends, just as human experts do in fields like politics, economics, and finance. Despite its importance, no large-scale benchmark exists for evaluating agents on future prediction, largely due to challenges in handling real-time updates and retrieving timely, accurate answers. To address this, we introduce FutureX, a dynamic and live evaluation benchmark specifically designed for LLM agents performing future prediction tasks. FutureX is the largest and most diverse live benchmark for future prediction, supporting real-time daily updates and eliminating data contamination through an automated pipeline for question gathering and answer collection. We evaluate 25 LLM/agent models, including those with reasoning, search capabilities, and integration of external tools such as the open-source Deep Research Agent and closed-source Deep Research models. This comprehensive evaluation assesses agents’ adaptive reasoning and performance in dynamic environments. Additionally, we provide in-depth analyses of agents’ failure modes and performance pitfalls in future-oriented tasks, including their vulnerability to fake web pages and issues of temporal validity. Our goal is to establish a dynamic, contamination-free evaluation standard that drives the development of LLM agents capable of performing at the level of professional human analysts in complex reasoning and predictive thinking.

[249] MHSNet: An MoE-based Hierarchical Semantic Representation Network for Accurate Duplicate Resume Detection with Large Language Model

Yu Li, Zulong Chen, Wenjian Xu, Hong Wen, Yipeng Yu, Man Lung Yiu, Yuyu Yin

Main category: cs.AI

TL;DR: MHSNet is a multi-level identity verification framework that uses fine-tuned BGE-M3 with contrastive learning and Mixture-of-Experts to detect duplicate resumes from third-party websites, addressing challenges of semantic complexity and information incompleteness.

DetailsMotivation: To improve the quality of third-party resumes and enrich company talent pools by detecting duplicates between fetched resumes and existing ones, despite challenges like semantic complexity, structural heterogeneity, and information incompleteness.

Method: Proposes MHSNet framework that fine-tunes BGE-M3 using contrastive learning. Uses Mixture-of-Experts (MoE) to generate multi-level sparse and dense representations for resumes, enabling computation of multi-level semantic similarities. Employs state-aware MoE to handle diverse incomplete resumes.

Result: Experimental results verify the effectiveness of MHSNet in resume duplication detection.

Conclusion: MHSNet provides an effective solution for resume duplication detection that addresses the challenges of semantic complexity and information incompleteness in third-party resumes.

Abstract: To maintain the company’s talent pool, recruiters need to continuously search for resumes from third-party websites (e.g., LinkedIn, Indeed). However, fetched resumes are often incomplete and inaccurate. To improve the quality of third-party resumes and enrich the company’s talent pool, it is essential to conduct duplication detection between the fetched resumes and those already in the company’s talent pool. Such duplication detection is challenging due to the semantic complexity, structural heterogeneity, and information incompleteness of resume texts. To this end, we propose MHSNet, a multi-level identity verification framework that fine-tunes BGE-M3 using contrastive learning. With the fine-tuned BGE-M3, a Mixture-of-Experts (MoE) module generates multi-level sparse and dense representations for resumes, enabling the computation of corresponding multi-level semantic similarities. Moreover, a state-aware MoE is employed in MHSNet to handle diverse incomplete resumes. Experimental results verify the effectiveness of MHSNet.
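
A small sketch of the multi-level similarity fusion, assuming each resume is represented by one dense vector per level; the level names, the cosine measure, and the weighted-sum fusion are illustrative assumptions rather than MHSNet's exact design.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def multilevel_similarity(resume_a, resume_b, weights):
    """Fuse per-level similarities into a single duplicate score.
    `resume_a`/`resume_b` map level names to embedding vectors."""
    return sum(
        weights[level] * cosine(resume_a[level], resume_b[level])
        for level in weights
    )

# Hypothetical usage with sentence- and document-level embeddings.
a = {"sentence": np.ones(4), "document": np.array([1.0, 0, 0, 0])}
b = {"sentence": np.ones(4), "document": np.array([0.0, 1, 0, 0])}
print(multilevel_similarity(a, b, {"sentence": 0.5, "document": 0.5}))
```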

[250] Graph RAG as Human Choice Model: Building a Data-Driven Mobility Agent with Preference Chain

Kai Hu, Parfait Atchade-Adelomou, Carlo Adornetto, Adrian Mora-Carrero, Luis Alonso-Pastor, Ariel Noyman, Yubo Liu, Kent Larson

Main category: cs.AI

TL;DR: Preference Chain method combines Graph RAG with LLMs to improve human behavior simulation in transportation systems, outperforming standard LLMs in real-world alignment.

DetailsMotivation: Address limitations of existing generative agents in producing consistent, context-sensitive, and realistic human behavioral outputs for urban environments, especially in data-scarce newly developed areas.

Method: Introduces Preference Chain - a novel approach integrating Graph Retrieval-Augmented Generation (RAG) with Large Language Models to enhance context-aware simulation of human behavior in transportation systems.

Result: Experiments on Replica dataset show Preference Chain outperforms standard LLM in aligning with real-world transportation mode choices.

Conclusion: The method provides a promising framework for simulating complex human behavior in data-scarce environments, with applications in urban mobility modeling, personalized travel analysis, and traffic forecasting, despite limitations like slow inference and hallucination risks.

Abstract: Understanding human behavior in urban environments is a crucial field within city sciences. However, collecting accurate behavioral data, particularly in newly developed areas, poses significant challenges. Recent advances in generative agents, powered by Large Language Models (LLMs), have shown promise in simulating human behaviors without relying on extensive datasets. Nevertheless, these methods often struggle with generating consistent, context-sensitive, and realistic behavioral outputs. To address these limitations, this paper introduces the Preference Chain, a novel method that integrates Graph Retrieval-Augmented Generation (RAG) with LLMs to enhance context-aware simulation of human behavior in transportation systems. Experiments conducted on the Replica dataset demonstrate that the Preference Chain outperforms a standard LLM in aligning with real-world transportation mode choices. The development of the Mobility Agent highlights potential applications of the proposed method in urban mobility modeling for emerging cities, personalized travel behavior analysis, and dynamic traffic forecasting. Despite limitations such as slow inference and the risk of hallucination, the method offers a promising framework for simulating complex human behavior in data-scarce environments, where traditional data-driven models struggle due to limited data availability.

[251] AI-SearchPlanner: Modular Agentic Search via Pareto-Optimal Multi-Objective Reinforcement Learning

Lang Mei, Zhihan Yang, Chong Chen

Main category: cs.AI

TL;DR: AI-SearchPlanner is a novel RL framework that uses a small trainable LLM for search planning to enhance frozen QA models, outperforming existing end-to-end approaches in effectiveness and efficiency.

DetailsMotivation: Existing RL-based search agents use a single LLM for both search planning and QA, limiting optimization of both capabilities. Real-world systems use large frozen LLMs for QA quality, so a dedicated small planner is more effective.

Method: Proposes AI-SearchPlanner with three innovations: 1) Decoupled architecture separating planner and generator, 2) Dual-reward alignment for search planning, 3) Pareto optimization of planning utility and cost.

Result: Extensive experiments show AI-SearchPlanner outperforms existing RL-based search agents in effectiveness and efficiency, with strong generalization across diverse frozen QA models and data domains.

Conclusion: The framework successfully enhances frozen QA models by focusing search planning optimization through a dedicated small LLM, achieving better performance than end-to-end approaches.

Abstract: Recent studies have explored integrating Large Language Models (LLMs) with search engines to leverage both the LLMs’ internal pre-trained knowledge and external information. Specifically, reinforcement learning (RL) has emerged as a promising paradigm for enhancing LLM reasoning through multi-turn interactions with search engines. However, existing RL-based search agents rely on a single LLM to handle both search planning and question-answering (QA) tasks in an end-to-end manner, which limits their ability to optimize both capabilities simultaneously. In practice, sophisticated AI search systems often employ a large, frozen LLM (e.g., GPT-4, DeepSeek-R1) to ensure high-quality QA. Thus, a more effective and efficient approach is to utilize a small, trainable LLM dedicated to search planning. In this paper, we propose AI-SearchPlanner, a novel reinforcement learning framework designed to enhance the performance of frozen QA models by focusing on search planning. Specifically, our approach introduces three key innovations: 1) Decoupling the Architecture of the Search Planner and Generator, 2) Dual-Reward Alignment for Search Planning, and 3) Pareto Optimization of Planning Utility and Cost, to achieve the objectives. Extensive experiments on real-world datasets demonstrate that AI-SearchPlanner outperforms existing RL-based search agents in both effectiveness and efficiency, while exhibiting strong generalization capabilities across diverse frozen QA models and data domains.
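
A brief sketch of the Pareto view of the planning objective: keep candidate plans that no other plan dominates on (utility up, cost down). The dict fields are assumptions, and the paper optimizes this trade-off inside RL training rather than by post-hoc filtering.

```python
def pareto_front(plans):
    """Plans not dominated by any other plan: a dominator is at least
    as good on both utility and cost and strictly better on one."""
    front = []
    for p in plans:
        dominated = any(
            q["utility"] >= p["utility"] and q["cost"] <= p["cost"]
            and (q["utility"] > p["utility"] or q["cost"] < p["cost"])
            for q in plans
        )
        if not dominated:
            front.append(p)
    return front

plans = [{"utility": 0.9, "cost": 5}, {"utility": 0.8, "cost": 2},
         {"utility": 0.7, "cost": 4}]  # the last plan is dominated
print(pareto_front(plans))
```

A scalarized training reward would then trade the two objectives off, e.g. reward = utility - lam * cost for some weight lam.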

[252] UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Aoyan Li, Bo Li, Chen Dun, Chong Liu, Daoguang Zan, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu Shi, Lin Yan, Peiyao Zhao, Pengfei Liu, Qinghao Ye, Renjie Zheng, Shulin Xin, Wayne Xin Zhao, Wen Heng, Wenhao Huang, Wenqian Wang, Xiaobo Qin, Yi Lin, Youbin Wu, Zehui Chen, Zihao Wang, Baoquan Zhong, Xinchun Zhang, Xujing Li, Yuanfan Li, Zhongkai Zhao, Chengquan Jiang, Faming Wu, Haotian Zhou, Jinlin Pang, Li Han, Qi Liu, Qianli Ma, Siyao Liu, Songhua Cai, Wenqi Fu, Xin Liu, Yaohui Wang, Zhi Zhang, Bo Zhou, Guoliang Li, Jiajun Shi, Jiale Yang, Jie Tang, Li Li, Qihua Han, Taoran Lu, Woyu Lin, Xiaokang Tong, Xinyao Li, Yichi Zhang, Yu Miao, Zhengxuan Jiang, Zili Li, Ziyuan Zhao, Chenxin Li, Dehua Ma, Feng Lin, Ge Zhang, Haihua Yang, Hangyu Guo, Hongda Zhu, Jiaheng Liu, Junda Du, Kai Cai, Kuanye Li, Lichen Yuan, Meilan Han, Minchao Wang, Shuyue Guo, Tianhao Cheng, Xiaobo Ma, Xiaojun Xiao, Xiaolong Huang, Xinjie Chen, Yidi Du, Yilin Chen, Yiwen Wang, Zhaojian Li, Zhenzhu Yang, Zhiyuan Zeng, Chaolin Jin, Chen Li, Hao Chen, Haoli Chen, Jian Chen, Qinghao Zhao, Guang Shi

Main category: cs.AI

TL;DR: UI-TARS-2 is a native GUI agent model that addresses data scalability, multi-turn RL, GUI-only limitations, and environment stability through systematic training methods, achieving state-of-the-art performance on various benchmarks.

DetailsMotivation: Address challenges in autonomous GUI agents including data scalability, multi-turn reinforcement learning limitations, GUI-only operation constraints, and environment stability issues.

Method: Uses a systematic training methodology with: data flywheel for scalable data generation, stabilized multi-turn RL framework, hybrid GUI environment integrating file systems and terminals, and unified sandbox platform for large-scale rollouts.

Result: Achieves 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, 73.3 on AndroidWorld, and 59.8 mean normalized score across 15-game suite (60% human-level performance). Outperforms Claude and OpenAI agents, competitive with frontier proprietary models.

Conclusion: UI-TARS-2 demonstrates significant improvements over predecessor, strong generalization to diverse agent tasks, and potential to advance GUI agent capabilities for real-world interactive scenarios.

Abstract: The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite (roughly 60% of human-level performance) and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2’s potential to advance the state of GUI agents and exhibit strong generalization to real-world interactive scenarios.

[253] PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming

Wesley Hanwen Deng, Sunnie S. Y. Kim, Akshita Jha, Ken Holstein, Motahhare Eslami, Lauren Wilcox, Leon A Gatys

Main category: cs.AI

TL;DR: PersonaTeaming introduces personas into automated red-teaming to improve attack success rates by 144.1% while maintaining diversity, addressing the gap in identity-aware AI safety testing.

DetailsMotivation: Current automated red-teaming approaches do not consider the role of identity and background, which human red-teamers draw on to uncover different risks. Personas need to be incorporated into automated methods to explore a wider spectrum of adversarial strategies.

Method: Developed PersonaTeaming with two approaches: using predefined “red-teaming expert” and “regular AI user” personas, and a dynamic persona-generating algorithm that creates adaptive personas for different seed prompts. Also created new metrics to measure mutation distance alongside diversity.

Result: Experiments showed up to 144.1% improvement in attack success rates compared to state-of-the-art RainbowPlus method, while maintaining prompt diversity. Different persona types and mutation methods showed varying strengths.

Conclusion: PersonaTeaming demonstrates the value of incorporating identity-aware approaches in automated red-teaming, revealing opportunities for complementarity between automated and human red-teaming methods in AI safety research.

Abstract: Recent developments in AI governance and safety research have called for red-teaming methods that can effectively surface potential risks posed by AI models. Many of these calls have emphasized how the identities and backgrounds of red-teamers can shape their red-teaming strategies, and thus the kinds of risks they are likely to uncover. While automated red-teaming approaches promise to complement human red-teaming by enabling larger-scale exploration of model behavior, current approaches do not consider the role of identity. As an initial step towards incorporating people’s background and identities in automated red-teaming, we develop and evaluate a novel method, PersonaTeaming, that introduces personas in the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. In particular, we first introduce a methodology for mutating prompts based on either “red-teaming expert” personas or “regular AI user” personas. We then develop a dynamic persona-generating algorithm that automatically generates various persona types adaptive to different seed prompts. In addition, we develop a set of new metrics to explicitly measure the “mutation distance” to complement existing diversity measurements of adversarial prompts. Our experiments show promising improvements (up to 144.1%) in the attack success rates of adversarial prompts through persona mutation, while maintaining prompt diversity, compared to RainbowPlus, a state-of-the-art automated red-teaming method. We discuss the strengths and limitations of different persona types and mutation methods, shedding light on future opportunities to explore complementarities between automated and human red-teaming approaches.
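
A minimal sketch of persona-conditioned prompt mutation with a stand-in `llm` call; the fixed persona list is illustrative, whereas the paper also generates personas dynamically per seed prompt and measures the resulting mutation distance.

```python
PERSONAS = [
    "a red-teaming expert probing for policy violations",
    "a regular AI user asking in everyday language",
]

def persona_mutate(llm, seed_prompt, persona):
    """Rewrite a seed attack prompt in a persona's voice.
    llm(text) -> text is a hypothetical stand-in for an LLM call."""
    instruction = (
        f"Rewrite the following prompt as it might be phrased by {persona}, "
        f"keeping its underlying intent:\n{seed_prompt}"
    )
    return llm(instruction)

# Usage sketch: mutate one seed under every persona, then score each
# mutation for attack success and distance from the seed.
# candidates = [persona_mutate(llm, seed, p) for p in PERSONAS]
```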

[254] The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs

Pengrui Han, Rafal Kocielnik, Peiyang Song, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez

Main category: cs.AI

TL;DR: LLMs show personality-like traits that stabilize with training alignment, but self-reported traits don’t reliably predict actual behavior, and persona interventions affect self-reports more than behavior.

DetailsMotivation: To systematically characterize LLM personality across training evolution, predictive validity of self-reports, and impact of interventions, addressing gaps in prior work that relied on simplified methods without behavioral validation.

Method: Analyzed LLM personality across three dimensions: trait emergence throughout training stages, predictive validity of self-reported traits in behavioral tasks, and impact of targeted interventions like persona injection on both self-reports and behavior.

Result: Instructional alignment (RLHF, instruction tuning) stabilizes trait expression and strengthens trait correlations similar to humans, but self-reported traits don’t reliably predict behavior. Persona injection successfully steers self-reports but has little or inconsistent effect on actual behavior.

Conclusion: LLM surface-level trait expression differs from behavioral consistency, challenging assumptions about LLM personality and highlighting the need for deeper evaluation in alignment and interpretability.

Abstract: Personality traits have long been studied as predictors of human behavior. Recent advances in Large Language Models (LLMs) suggest similar patterns may emerge in artificial systems, with advanced LLMs displaying consistent behavioral tendencies resembling human traits like agreeableness and self-regulation. Understanding these patterns is crucial, yet prior work primarily relied on simplified self-reports and heuristic prompting, with little behavioral validation. In this study, we systematically characterize LLM personality across three dimensions: (1) the dynamic emergence and evolution of trait profiles throughout training stages; (2) the predictive validity of self-reported traits in behavioral tasks; and (3) the impact of targeted interventions, such as persona injection, on both self-reports and behavior. Our findings reveal that instructional alignment (e.g., RLHF, instruction tuning) significantly stabilizes trait expression and strengthens trait correlations in ways that mirror human data. However, these self-reported traits do not reliably predict behavior, and observed associations often diverge from human patterns. While persona injection successfully steers self-reports in the intended direction, it exerts little or inconsistent effect on actual behavior. By distinguishing surface-level trait expression from behavioral consistency, our findings challenge assumptions about LLM personality and underscore the need for deeper evaluation in alignment and interpretability.

cs.SD

[255] Ecologically Valid Benchmarking and Adaptive Attention: Scalable Marine Bioacoustic Monitoring

Nicholas R. Rasmussen, Rodrigue Rizk, Longwei Wang, KC Santosh

Main category: cs.SD

TL;DR: The GetNetUPAM framework with the ARPA-N architecture improves underwater acoustic monitoring, yielding a 14.4% gain in average precision and reduced variability through hierarchical cross-validation and adaptive pooling.

DetailsMotivation: Underwater Passive Acoustic Monitoring faces challenges with noise, signal dependencies, and environmental variability that hinder model stability and generalization across different sites and years.

Method: Hierarchical nested cross-validation framework (GetNetUPAM) partitions data by site-year segments, combined with ARPA-N neural architecture featuring adaptive resolution pooling and spatial attention for irregular spectrogram dimensions.

Result: 14.4% gain in average precision over DenseNet baselines and order-of-magnitude reduction in variability across all metrics, enabling consistent detection across diverse environmental conditions.

Conclusion: The framework enables robust, scalable bioacoustic monitoring by addressing environmental diversity and reducing overfitting to localized noise artifacts.

Abstract: Underwater Passive Acoustic Monitoring (UPAM) provides rich spatiotemporal data for long-term ecological analysis, but intrinsic noise and complex signal dependencies hinder model stability and generalization. Multilayered windowing has improved target sound localization, yet variability from shifting ambient noise, diverse propagation effects, and mixed biological and anthropogenic sources demands robust architectures and rigorous evaluation. We introduce GetNetUPAM, a hierarchical nested cross-validation framework designed to quantify model stability under ecologically realistic variability. Data are partitioned into distinct site-year segments, preserving recording heterogeneity and ensuring each validation fold reflects a unique environmental subset, reducing overfitting to localized noise and sensor artifacts. Site-year blocking enforces evaluation against genuine environmental diversity, while standard cross-validation on random subsets measures generalization across UPAM’s full signal distribution, a dimension absent from current benchmarks. Using GetNetUPAM as the evaluation backbone, we propose the Adaptive Resolution Pooling and Attention Network (ARPA-N), a neural architecture for irregular spectrogram dimensions. Adaptive pooling with spatial attention extends the receptive field, capturing global context without excessive parameters. Under GetNetUPAM, ARPA-N achieves a 14.4% gain in average precision over DenseNet baselines and a log2-scale order-of-magnitude drop in variability across all metrics, enabling consistent detection across site-year folds and advancing scalable, accurate bioacoustic monitoring.
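
Site-year blocking maps naturally onto grouped cross-validation: every recording from the same site and year must land in the same fold. A minimal sketch with scikit-learn's GroupKFold follows; the record layout is an assumption.

```python
from sklearn.model_selection import GroupKFold

def site_year_folds(records, n_splits=5):
    """Yield (train_idx, test_idx) splits that never mix site-year blocks.
    `records` is assumed to be a list of (features, label, site, year)."""
    X = [r[0] for r in records]
    y = [r[1] for r in records]
    groups = [f"{r[2]}-{r[3]}" for r in records]  # one block per site-year
    yield from GroupKFold(n_splits=n_splits).split(X, y, groups)
```

Each held-out fold is then an environmentally distinct subset, which is what forces the evaluation to reflect genuine site and year diversity rather than localized noise.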

[256] A Multiclass Acoustic Dataset and Interactive Tool for Analyzing Drone Signatures in Real-World Environments

Mia Y. Wang, Mackenzie Linn, Andrew P. Berg, Qian Zhang

Main category: cs.SD

TL;DR: A comprehensive drone acoustic dataset with 32 drone categories, featuring raw audio, spectrograms, and MFCC plots, plus an interactive web application for exploration and analysis.

DetailsMotivation: Address limitations of current visual/radar drone detection systems by developing effective acoustic-based detection methods to combat privacy, security, and noise pollution challenges from drone proliferation.

Method: Created a unique dataset of drone acoustic signatures across 32 categories, developed an interactive web application allowing users to explore audio, spectrograms, and MFCC plots, and detailed the dataset creation and implementation process.

Result: Successfully developed a comprehensive acoustic dataset and functional web tool that facilitates drone detection research, classification, and acoustic analysis with experimental results and user feedback.

Conclusion: The project provides valuable resources for drone acoustic research and education, with potential for future expansion and enhancement of applications in drone detection technology.

Abstract: The rapid proliferation of drones across various industries has introduced significant challenges related to privacy, security, and noise pollution. Current drone detection systems, primarily based on visual and radar technologies, face limitations under certain conditions, highlighting the need for effective acoustic-based detection methods. This paper presents a unique and comprehensive dataset of drone acoustic signatures, encompassing 32 different categories differentiated by brand and model. The dataset includes raw audio recordings, spectrogram plots, and Mel-frequency cepstral coefficient (MFCC) plots for each drone. Additionally, we introduce an interactive web application that allows users to explore this dataset by selecting specific drone categories, listening to the associated audio, and viewing the corresponding spectrogram and MFCC plots. This tool aims to facilitate research in drone detection, classification, and acoustic analysis, supporting both technological advancements and educational initiatives. The paper details the dataset creation process, the design and implementation of the web application, and provides experimental results and user feedback. Finally, we discuss potential applications and future work to expand and enhance the project.
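
The per-recording representations described here (raw audio, spectrogram, and MFCC plots) can be reproduced with standard tooling; below is a sketch using librosa with common default parameters, which may differ from the dataset's exact settings.

```python
import numpy as np
import librosa

def drone_features(wav_path, n_mfcc=13):
    """Return a mel spectrogram (in dB) and MFCCs for one recording."""
    y, sr = librosa.load(wav_path, sr=None)        # keep native sample rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # spectrogram plot data
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mel_db, mfcc
```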

[257] WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning

Gagan Mundada, Yash Vishe, Amit Namburi, Xin Xu, Zachary Novack, Julian McAuley, Junda Wu

Main category: cs.SD

TL;DR: WildScore is the first multimodal benchmark for evaluating MLLMs’ symbolic music reasoning using real-world music scores and authentic user questions, revealing both strengths and limitations in current models.

DetailsMotivation: While MLLMs show impressive capabilities in vision-language tasks, their reasoning abilities in multimodal symbolic music domain remain unexplored, creating a gap in understanding how these models handle complex musicological analysis.

Method: Created WildScore benchmark with real musical compositions and authentic user-generated questions, organized through systematic taxonomy with musicological ontologies. Framed complex music reasoning as multiple-choice QA for controlled evaluation.

Result: Empirical benchmarking revealed intriguing patterns in MLLMs’ visual-symbolic reasoning, uncovering both promising directions and persistent challenges in symbolic music understanding.

Conclusion: The study provides the first comprehensive evaluation framework for MLLMs in symbolic music reasoning, highlighting current limitations and future research directions while releasing the dataset and code for community use.

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, their reasoning abilities in the multimodal symbolic music domain remain largely unexplored. We introduce WildScore, the first in-the-wild multimodal symbolic music reasoning and analysis benchmark, designed to evaluate MLLMs’ capacity to interpret real-world music scores and answer complex musicological queries. Each instance in WildScore is sourced from genuine musical compositions and accompanied by authentic user-generated questions and discussions, capturing the intricacies of practical music analysis. To facilitate systematic evaluation, we propose a systematic taxonomy, comprising both high-level and fine-grained musicological ontologies. Furthermore, we frame complex music reasoning as multiple-choice question answering, enabling controlled and scalable assessment of MLLMs’ symbolic music understanding. Empirical benchmarking of state-of-the-art MLLMs on WildScore reveals intriguing patterns in their visual-symbolic reasoning, uncovering both promising directions and persistent challenges for MLLMs in symbolic music reasoning and analysis. We release the dataset and code.

[258] Quantum Fourier Transform Based Denoising: Unitary Filtering for Enhanced Speech Clarity

Rajeshwar Tripathi, Sahil Tomar, Sandeep Kumar, Monika Aggarwal

Main category: cs.SD

TL;DR: Quantum-inspired denoising framework using Quantum Fourier Transform (QFT) instead of FFT in audio enhancement, achieving up to 15 dB SNR improvement with reduced artifacts and no extra computational cost.

DetailsMotivation: Conventional FFT-based methods lack the global phase coherence and energy preservation properties of QFT, which could enable better discrimination between speech and noise in audio enhancement.

Method: Replaces FFT in Wiener and spectral subtraction filters with QFT operator, maintaining consistent hyperparameters for fair comparison. Tested on clean speech, synthetic tones, and noisy mixtures across various SNR conditions.

Result: Statistically significant SNR gains up to 15 dB improvement, reduced artifact generation, and robustness under low SNR and nonstationary noise scenarios without additional computational overhead.

Conclusion: QFT-based denoising offers a scalable pathway toward quantum-enhanced speech processing with superior performance compared to traditional FFT methods.

Abstract: This paper introduces a quantum-inspired denoising framework that integrates the Quantum Fourier Transform (QFT) into classical audio enhancement pipelines. Unlike conventional Fast Fourier Transform (FFT) based methods, QFT provides a unitary transformation with global phase coherence and energy preservation, enabling improved discrimination between speech and noise. The proposed approach replaces FFT in Wiener and spectral subtraction filters with a QFT operator, ensuring consistent hyperparameter settings for fair comparison. Experiments on clean speech, synthetic tones, and noisy mixtures across diverse signal-to-noise ratio (SNR) conditions demonstrate statistically significant gains in SNR, with up to 15 dB of improvement and reduced artifact generation. Results confirm that QFT-based denoising offers robustness under low SNR and nonstationary noise scenarios without additional computational overhead, highlighting its potential as a scalable pathway toward quantum-enhanced speech processing.
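
On classical vectors the QFT coincides with the unitary DFT, so its action can be simulated with numpy's orthonormal FFT. Below is a spectral-subtraction sketch under that assumption; the paper's Wiener variant, hyperparameters, and framing are not reproduced.

```python
import numpy as np

def qft_spectral_subtraction(noisy, noise_est, alpha=1.0):
    """Subtract an estimated noise magnitude spectrum in a unitary
    (energy-preserving) transform domain, keeping the noisy phase."""
    X = np.fft.fft(noisy, norm="ortho")    # unitary DFT = QFT on classical data
    N = np.fft.fft(noise_est, norm="ortho")
    mag = np.maximum(np.abs(X) - alpha * np.abs(N), 0.0)
    phase = np.exp(1j * np.angle(X))
    return np.fft.ifft(mag * phase, norm="ortho").real
```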

[259] Learning and composing of classical music using restricted Boltzmann machines

Mutsumi Kobayashi, Hiroshi Watanabe

Main category: cs.SD

TL;DR: Using restricted Boltzmann machines to analyze and compose music in J.S. Bach’s style, providing interpretable insights into musical characteristics.

DetailsMotivation: Existing machine learning models for music composition are so complex that it is difficult to understand how they capture the characteristics of a composer's style.

Method: Trained a restricted Boltzmann machine (RBM) on J.S. Bach’s music due to its simple structure that allows analysis of internal states.

Result: The learned RBM was able to successfully compose music in Bach’s style.

Conclusion: RBMs provide an interpretable alternative to complex models for music composition and style analysis.

Abstract: Recently, software has been developed that uses machine learning to mimic the style of a particular composer, such as J. S. Bach. However, since such software often adopts machine learning models with complex structures, it is difficult to analyze how the software understands the characteristics of the composer’s music. In this study, we adopted J. S. Bach’s music for training of a restricted Boltzmann machine (RBM). Since the structure of RBMs is simple, it allows us to investigate the internal states after learning. We found that the learned RBM is able to compose music.
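
The training step behind such a study is the classic contrastive-divergence (CD-1) update for a binary RBM; the piano-roll encoding of the scores is an assumption here, not the paper's stated representation.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, b_v, b_h, v0, lr=0.01):
    """One CD-1 update on a batch v0 of binary vectors (batch x visible)."""
    ph0 = sigmoid(v0 @ W + b_h)                    # positive phase
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b_v)                  # one Gibbs step back
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b_h)
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(v0)  # data - model statistics
    b_v += lr * (v0 - v1).mean(axis=0)
    b_h += lr * (ph0 - ph1).mean(axis=0)
    return W, b_v, b_h
```

After training, repeatedly Gibbs-sampling visible vectors from the learned RBM is one simple way such a model can "compose", and the small weight matrix is what makes the internal states inspectable.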

[260] MAIA: An Inpainting-Based Approach for Music Adversarial Attacks

Yuxuan Liu, Peihong Zhang, Rui Sang, Zhixin Li, Shengchen Li

Main category: cs.SD

TL;DR: MAIA is a novel adversarial attack framework for music that uses importance analysis and generative inpainting to create subtle perturbations, achieving high attack success rates in both white-box and black-box scenarios while maintaining audio quality.

DetailsMotivation: To address vulnerabilities in Music Information Retrieval (MIR) systems by developing an effective adversarial attack framework that can identify and exploit critical audio segments.

Method: Uses importance analysis to identify critical audio segments, then employs generative inpainting models to reconstruct these segments with guidance from the attacked model’s output to create subtle adversarial perturbations.

Result: Achieves high attack success rates in both white-box and black-box settings with minimal perceptual distortion, and subjective listening tests confirm high audio fidelity of adversarial samples.

Conclusion: Highlights vulnerabilities in current MIR systems and emphasizes the need for more robust and secure models in music information retrieval.

Abstract: Music adversarial attacks have garnered significant interest in the field of Music Information Retrieval (MIR). In this paper, we present Music Adversarial Inpainting Attack (MAIA), a novel adversarial attack framework that supports both white-box and black-box attack scenarios. MAIA begins with an importance analysis to identify critical audio segments, which are then targeted for modification. Utilizing generative inpainting models, these segments are reconstructed with guidance from the output of the attacked model, ensuring subtle and effective adversarial perturbations. We evaluate MAIA on multiple MIR tasks, demonstrating high attack success rates in both white-box and black-box settings while maintaining minimal perceptual distortion. Additionally, subjective listening tests confirm the high audio fidelity of the adversarial samples. Our findings highlight vulnerabilities in current MIR systems and emphasize the need for more robust and secure models.

[261] Training a Perceptual Model for Evaluating Auditory Similarity in Music Adversarial Attack

Yuxuan Liu, Rui Sang, Peihong Zhang, Zhixin Li, Shengchen Li

Main category: cs.SD

TL;DR: PAMT framework bridges the gap between model feature spaces and human auditory perception in music AI systems, achieving better perceptual alignment and robustness against adversarial attacks.

DetailsMotivation: MIR systems are vulnerable to adversarial attacks due to misalignment between model features and human auditory perception, with existing defenses failing to capture auditory nuances.

Method: Introduces Perceptually-Aligned MERT Transformer (PAMT) with psychoacoustically-conditioned sequential contrastive transformer built on frozen MERT encoder.

Result: Achieves 0.65 Spearman correlation with subjective scores (outperforming existing metrics) and 9.15% improvement in robust accuracy on MIR tasks under adversarial attacks.

Conclusion: Pioneers architecturally-integrated psychoacoustic conditioning, yielding representations significantly more aligned with human perception and robust against music adversarial attacks.

Abstract: Music Information Retrieval (MIR) systems are highly vulnerable to adversarial attacks that are often imperceptible to humans, primarily due to a misalignment between model feature spaces and human auditory perception. Existing defenses and perceptual metrics frequently fail to adequately capture these auditory nuances, a limitation supported by our initial listening tests showing low correlation between common metrics and human judgments. To bridge this gap, we introduce Perceptually-Aligned MERT Transformer (PAMT), a novel framework for learning robust, perceptually-aligned music representations. Our core innovation lies in the psychoacoustically-conditioned sequential contrastive transformer, a lightweight projection head built atop a frozen MERT encoder. PAMT achieves a Spearman correlation coefficient of 0.65 with subjective scores, outperforming existing perceptual metrics. Our approach also achieves an average of 9.15% improvement in robust accuracy on challenging MIR tasks, including Cover Song Identification and Music Genre Classification, under diverse perceptual adversarial attacks. This work pioneers architecturally-integrated psychoacoustic conditioning, yielding representations significantly more aligned with human perception and robust against music adversarial attacks.
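
A hedged PyTorch sketch of the general pattern named here: a lightweight projection head on top of a frozen encoder, trained contrastively. The layer sizes and the InfoNCE loss are generic stand-ins for PAMT's psychoacoustically conditioned sequential contrastive transformer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps frozen-encoder embeddings into a perceptual space."""
    def __init__(self, dim_in=768, dim_out=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, dim_in), nn.ReLU(), nn.Linear(dim_in, dim_out)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def info_nce(z_a, z_b, temperature=0.1):
    # Row i of z_a and z_b are embeddings of a perceptually matched pair;
    # all other rows in the batch serve as negatives.
    logits = z_a @ z_b.T / temperature
    targets = torch.arange(len(z_a), device=z_a.device)
    return F.cross_entropy(logits, targets)
```

Only the head is updated during training, so the frozen MERT-style encoder's representations are preserved while the projection learns the perceptual alignment.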

[262] Recomposer: Event-roll-guided generative audio editing

Daniel P. W. Ellis, Eduardo Fonseca, Ron J. Weiss, Kevin Wilson, Scott Wisdom, Hakan Erdogan, John R. Hershey, Aren Jansen, R. Channing Moore, Manoj Plakal

Main category: cs.SD

TL;DR: A transformer-based system for editing individual sound events in complex audio scenes using text descriptions and timing information, enabling deletion, insertion, and enhancement of specific sounds.

DetailsMotivation: Editing complex real-world sound scenes is challenging due to overlapping sound sources. Generative models can leverage their understanding of audio domains to fill in missing or corrupted details, enabling precise editing of individual sound events.

Method: Encoder-decoder transformer architecture working on SoundStream representations, trained on synthetic audio pairs created by adding isolated sound events to dense real-world backgrounds. Uses textual edit descriptions (action, class) and graphical timing information from event roll transcriptions.

Result: Evaluation shows the importance of each component in edit descriptions - action, class, and timing. The system successfully performs deletion, insertion, and enhancement operations on individual sound events within complex audio scenes.

Conclusion: The work demonstrates that ‘recomposition’ - the ability to edit individual sound events in complex audio scenes - is an important and practical application of generative audio models.

Abstract: Editing complex real-world sound scenes is difficult because individual sound sources overlap in time. Generative models can fill in missing or corrupted details based on their strong prior understanding of the data domain. We present a system for editing individual sound events within complex scenes able to delete, insert, and enhance individual sound events based on textual edit descriptions (e.g., "enhance Door") and a graphical representation of the event timing derived from an "event roll" transcription. We present an encoder-decoder transformer working on SoundStream representations, trained on synthetic (input, desired output) audio example pairs formed by adding isolated sound events to dense, real-world backgrounds. Evaluation reveals the importance of each part of the edit descriptions – action, class, timing. Our work demonstrates "recomposition" is an important and practical application.
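
The training-pair construction described in the abstract is simple enough to sketch: the snippet below forms (input, target) pairs by mixing an isolated event into a dense background. Gains, durations, and the synthetic tone are illustrative assumptions.

```python
# Minimal sketch of (input, target) pair construction: "delete" pairs use
# the mix as input and the background as target, "insert" pairs the reverse;
# enhancement pairs would rescale the event gain between input and target.
import numpy as np

def make_pair(background, event, onset, mode="delete"):
    mix = background.copy()
    mix[onset:onset + len(event)] += event
    if mode == "delete":
        return mix, background   # input contains the event, target does not
    return background, mix       # "insert": target gains the event

sr = 16000
background = 0.1 * np.random.randn(10 * sr)   # stand-in dense real-world scene
event = np.hanning(sr) * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
x, y = make_pair(background, event, onset=3 * sr, mode="delete")
```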

[263] WenetSpeech-Yue: A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation

Longhao Li, Zhao Guo, Hongjie Chen, Yuhang Dai, Ziyu Zhang, Hongfei Xue, Tianlun Zuo, Chengyou Wang, Shuiyuan Wang, Jie Li, Jian Kang, Xin Xu, Hui Bu, Binbin Zhang, Ruibin Yuan, Ziya Zhou, Wei Xue, Lei Xie

Main category: cs.SD

TL;DR: WenetSpeech-Pipe pipeline creates WenetSpeech-Yue, the first large-scale Cantonese speech corpus with multi-dimensional annotations, enabling competitive ASR and TTS performance against SOTA systems.

DetailsMotivation: Cantonese has limited annotated resources despite being spoken by 84.9 million native speakers, which has hindered progress in ASR and TTS performance for this language.

Method: Developed the WenetSpeech-Pipe pipeline with six modules: Audio Collection, Speaker Attributes Annotation, Speech Quality Annotation, Automatic Speech Recognition, Text Postprocessing, and Recognizer Output Voting, to create a large-scale speech corpus with rich annotations.

Result: Created WenetSpeech-Yue corpus with 21,800 hours across 10 domains, including ASR transcription, text confidence, speaker identity, age, gender, and speech quality scores. Models trained on this corpus achieve competitive results against SOTA Cantonese ASR and TTS systems.

Conclusion: The proposed pipeline and released dataset successfully address the resource scarcity for Cantonese speech processing, enabling improved ASR and TTS performance comparable to commercial and LLM-based models.

Abstract: The development of speech understanding and generation has been significantly accelerated by the availability of large-scale, high-quality speech datasets. Among these, ASR and TTS are regarded as the most established and fundamental tasks. However, for Cantonese (Yue Chinese), spoken by approximately 84.9 million native speakers worldwide, limited annotated resources have hindered progress and resulted in suboptimal ASR and TTS performance. To address this challenge, we propose WenetSpeech-Pipe, an integrated pipeline for building large-scale speech corpora with multi-dimensional annotation tailored for speech understanding and generation. It comprises six modules: Audio Collection, Speaker Attributes Annotation, Speech Quality Annotation, Automatic Speech Recognition, Text Postprocessing, and Recognizer Output Voting, enabling rich and high-quality annotations. Based on this pipeline, we release WenetSpeech-Yue, the first large-scale Cantonese speech corpus with multi-dimensional annotation for ASR and TTS, covering 21,800 hours across 10 domains with annotations including ASR transcription, text confidence, speaker identity, age, gender, and speech quality scores, among others. We also release WSYue-eval, a comprehensive Cantonese benchmark with two components: WSYue-ASR-eval, a manually annotated set for evaluating ASR on short and long utterances, code-switching, and diverse acoustic conditions, and WSYue-TTS-eval, with base and coverage subsets for standard and generalization testing. Experimental results show that models trained on WenetSpeech-Yue achieve competitive results against state-of-the-art (SOTA) Cantonese ASR and TTS systems, including commercial and LLM-based models, highlighting the value of our dataset and pipeline.

cs.LG

[264] An Arbitration Control for an Ensemble of Diversified DQN variants in Continual Reinforcement Learning

Wonseo Jang, Dongjae Kim

Main category: cs.LG

TL;DR: ACED-DQN framework uses arbitration control over diversified DQN ensemble to prevent catastrophic forgetting in continual reinforcement learning, inspired by human prefrontal cortex decision-making.

DetailsMotivation: Deep RL models suffer from catastrophic forgetting in continual learning scenarios, losing previously learned knowledge and performing poorly in changing environments.

Method: Proposes an ensemble of diversified DQN variants with different value functions, combined with an arbitration control mechanism that prioritizes agents with higher reliability in recent trials.

Result: Demonstrates significant performance improvements in both static and continual environments, showing effectiveness of arbitration control over diversified DQNs.

Conclusion: The framework enables RL agents to continuously learn effectively, drawing inspiration from human brain’s decision-making processes in the prefrontal cortex.

Abstract: Deep reinforcement learning (RL) models, despite their efficiency in learning an optimal policy in static environments, easily lose previously learned knowledge (i.e., catastrophic forgetting). This leads RL models to poor performance in continual reinforcement learning (CRL) scenarios. To address this, we present an arbitration control mechanism over an ensemble of RL agents. It is motivated by and closely aligned with how humans make decisions in a CRL context, using arbitration control of multiple RL agents in parallel, as observed in the prefrontal cortex. We integrated two key ideas into our model: (1) an ensemble of RL agents (i.e., DQN variants) explicitly trained to have diverse value functions and (2) an arbitration control that prioritizes agents with higher reliability (i.e., less error) in recent trials. We propose a framework for CRL, an Arbitration Control for an Ensemble of Diversified DQN variants (ACED-DQN). We demonstrate significant performance improvements in both static and continual environments, supported by empirical evidence showing the effectiveness of arbitration control over diversified DQNs during training. In this work, we introduce a framework that enables RL agents to continuously learn, drawing inspiration from the human brain.
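
As a rough illustration of the arbitration idea, the sketch below weights ensemble members by a softmax over their recent mean absolute TD errors; the interface and the particular reliability rule are assumptions, not the paper's implementation.

```python
# Minimal sketch of reliability-based arbitration over a Q-value ensemble:
# agents with lower recent error get more say in action selection.
import numpy as np

class Arbitrator:
    def __init__(self, n_agents, window=20):
        self.errors = [[] for _ in range(n_agents)]
        self.window = window

    def record_error(self, agent_idx, td_error):
        buf = self.errors[agent_idx]
        buf.append(abs(td_error))
        if len(buf) > self.window:
            buf.pop(0)

    def reliabilities(self):
        # Lower recent error -> higher reliability (softmax over -mean error).
        mean_err = np.array([np.mean(e) if e else 1.0 for e in self.errors])
        w = np.exp(-mean_err)
        return w / w.sum()

    def select_action(self, q_values_per_agent):
        # Reliability-weighted average of each agent's Q-values.
        w = self.reliabilities()
        q = np.tensordot(w, np.asarray(q_values_per_agent), axes=1)
        return int(np.argmax(q))

arb = Arbitrator(n_agents=3)
arb.record_error(0, 0.1); arb.record_error(1, 0.8); arb.record_error(2, 0.4)
action = arb.select_action(np.random.randn(3, 4))  # 3 agents, 4 actions
```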

[265] Q-SafeML: Safety Assessment of Quantum Machine Learning via Quantum Distance Metrics

Oliver Dunn, Koorosh Aslansefat, Yiannis Papadopoulos

Main category: cs.LG

TL;DR: Q-SafeML introduces quantum-adapted safety monitoring for Quantum Machine Learning using quantum-centric distance measures to detect concept drifts and enhance system safety.

DetailsMotivation: Existing classical ML safety monitoring methods are not applicable to QML due to fundamental differences in quantum computation, and dedicated safety mechanisms for QML remain underdeveloped.

Method: Adapts SafeML approach with quantum-centric distance measures defined over quantum state spaces, using model-dependent post-classification evaluation instead of dataset-driven classical approach.

Result: Experiments on QCNN and VQC models show Q-SafeML effectively detects distances between operational and training data, enabling informed human oversight and enhancing system transparency and safety.

Conclusion: Q-SafeML provides a novel safety monitoring framework specifically designed for QML systems, addressing the unique representational constraints of quantum computing while maintaining safety standards.

Abstract: The rise of machine learning in safety-critical systems has paralleled advancements in quantum computing, leading to the emerging field of Quantum Machine Learning (QML). While safety monitoring has progressed in classical ML, existing methods are not directly applicable to QML due to fundamental differences in quantum computation. Given the novelty of QML, dedicated safety mechanisms remain underdeveloped. This paper introduces Q-SafeML, a safety monitoring approach for QML. The method builds on SafeML, a recent method that utilizes statistical distance measures to assess model accuracy and provide confidence in the reasoning of an algorithm. Q-SafeML adapts this approach by incorporating quantum-centric distance measures, aligning with the probabilistic nature of QML outputs. This shift to a model-dependent, post-classification evaluation represents a key departure from classical SafeML, which is dataset-driven and classifier-agnostic. The distinction is motivated by the unique representational constraints of quantum systems, requiring distance metrics defined over quantum state spaces. Q-SafeML detects distances between operational and training data, addressing concept drift in the context of QML. Experiments on QCNN and VQC models show that this enables informed human oversight, enhancing system transparency and safety.
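
A minimal sketch of the SafeML-style monitoring loop is shown below, using a classical Hellinger distance between measurement-outcome distributions as a stand-in for Q-SafeML's quantum-centric distance measures; the threshold and distributions are purely illustrative.

```python
# Minimal sketch of distance-based drift monitoring in the spirit of SafeML.
# Hellinger distance over measurement-outcome distributions stands in for the
# paper's quantum distance metrics; numbers and threshold are illustrative.
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Outcome distributions over computational basis states.
train_dist = np.array([0.70, 0.20, 0.05, 0.05])  # seen during training
ops_dist = np.array([0.40, 0.35, 0.15, 0.10])    # observed in operation

distance = hellinger(train_dist, ops_dist)
if distance > 0.2:  # threshold calibrated on held-out data (assumed)
    print(f"possible concept drift: distance = {distance:.3f}")
```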

[266] Finance-Grounded Optimization For Algorithmic Trading

Kasymkhan Khubiev, Mikhail Semenov, Irina Podlipnova

Main category: cs.LG

TL;DR: Proposes financially grounded loss functions based on Sharpe ratio, PnL, and Maximum Drawdown with turnover regularization, outperforming traditional MSE for trading strategies.

DetailsMotivation: Classical deep learning approaches are not optimal for finance where specialists use different performance metrics, requiring financially relevant loss functions.

Method: Developed loss functions derived from quantitative finance metrics (Sharpe ratio, PnL, Maximum Drawdown) and introduced turnover regularization to constrain position changes.

Result: The proposed financially grounded loss functions with turnover regularization outperform traditional mean squared error loss when evaluated using algorithmic trading metrics.

Conclusion: Financially grounded metrics significantly enhance predictive performance in trading strategies and portfolio optimization compared to traditional loss functions.

Abstract: Deep learning is evolving fast and is being integrated into various domains. Finance is a challenging field for deep learning, especially in the case of interpretable artificial intelligence (AI). Although classical approaches perform very well with natural language processing, computer vision, and forecasting, they are not perfect for the financial world, in which specialists use different metrics to evaluate model performance. We first introduce financially grounded loss functions derived from key quantitative finance metrics, including the Sharpe ratio, Profit-and-Loss (PnL), and Maximum Drawdown. Additionally, we propose turnover regularization, a method that inherently constrains the turnover of generated positions within predefined limits. Our findings demonstrate that the proposed loss functions, in conjunction with turnover regularization, outperform the traditional mean squared error loss for return prediction tasks when evaluated using algorithmic trading metrics. The study shows that financially grounded metrics enhance predictive performance in trading strategies and portfolio optimization.
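
Since the abstract names the loss components concretely, a minimal PyTorch sketch of a differentiable Sharpe-ratio loss with a turnover penalty follows; the batch-level Sharpe proxy, the tanh position head, and the penalty weight are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch: maximize a batch-level Sharpe proxy while penalizing
# turnover above a predefined limit. Constants are illustrative.
import torch

def sharpe_loss(positions, returns, eps=1e-8):
    pnl = positions * returns                # per-step strategy returns
    sharpe = pnl.mean() / (pnl.std() + eps)  # batch-level Sharpe proxy
    return -sharpe                           # maximize Sharpe = minimize loss

def turnover_penalty(positions, limit=0.1):
    turnover = (positions[1:] - positions[:-1]).abs().mean()
    return torch.relu(turnover - limit)      # penalize only above the limit

raw = torch.randn(256, requires_grad=True)   # stand-in model outputs
positions = torch.tanh(raw)                  # positions bounded in [-1, 1]
returns = 0.01 * torch.randn(256)
loss = sharpe_loss(positions, returns) + 5.0 * turnover_penalty(positions)
loss.backward()
```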

[267] i-Mask: An Intelligent Mask for Breath-Driven Activity Recognition

Ashutosh Kumar Sinha, Ayush Patel, Mitul Dudhat, Pritam Anand, Rahul Mishra

Main category: cs.LG

TL;DR: i-Mask is a novel human activity recognition system that uses exhaled breath patterns captured through a custom sensor-equipped mask, achieving over 95% accuracy in activity detection.

DetailsMotivation: Breathing patterns contain important physiological signals that can anticipate human behavior, health trends, and vital parameters, enabling real-time health monitoring and deeper insights into well-being.

Method: Developed a custom mask with integrated sensors to capture exhaled breath patterns, followed by noise filtering, time-series decomposition, and labeling of collected data to train predictive models.

Result: The experimental results validate the approach’s effectiveness, achieving over 95% accuracy in human activity recognition.

Conclusion: i-Mask demonstrates significant potential for healthcare and fitness applications through its accurate breath-based activity recognition system.

Abstract: The patterns of inhalation and exhalation contain important physiological signals that can be used to anticipate human behavior, health trends, and vital parameters. Human activity recognition (HAR) is fundamentally connected to these vital signs, providing deeper insights into well-being and enabling real-time health monitoring. This work presents i-Mask, a novel HAR approach that leverages exhaled breath patterns captured using a custom-developed mask equipped with integrated sensors. Data collected from volunteers wearing the mask undergoes noise filtering, time-series decomposition, and labeling to train predictive models. Our experimental results validate the effectiveness of the approach, achieving over 95% accuracy and highlighting its potential in healthcare and fitness applications.

[268] Bootstrapping Task Spaces for Self-Improvement

Minqi Jiang, Andrei Lupu, Yoram Bachrach

Main category: cs.LG

TL;DR: ExIt is an autocurriculum RL method that trains LLMs for multi-step self-improvement by selectively sampling informative intermediate steps during training, enabling effective inference-time iteration beyond training depths.

DetailsMotivation: Current RL approaches for self-improvement tasks assume fixed maximum iteration depths, which is costly and arbitrary. There's a need for methods that can train agents to reliably self-improve over sequences at inference-time without this limitation.

Method: Exploratory Iteration (ExIt) grows a task space by selectively sampling the most informative intermediate partial histories encountered during episodes. These starting points are treated as new self-iteration task instances to train a self-improvement policy, optionally paired with explicit exploration mechanisms.

Result: ExIt demonstrates strong inference-time self-improvement on held-out task instances across domains including competition math, multi-turn tool-use, and machine learning engineering. The method enables iteration towards higher performance over step budgets extending beyond average training iteration depths.

Conclusion: ExIt provides an effective autocurriculum RL approach for training self-improvement policies that can perform multi-step iteration at inference-time while only requiring training on informative single-step iterations, overcoming limitations of fixed-depth approaches.

Abstract: Progress in many task domains emerges from repeated revisions to previous solution attempts. Training agents that can reliably self-improve over such sequences at inference-time is a natural target for reinforcement learning (RL), yet the naive approach assumes a fixed maximum iteration depth, which can be both costly and arbitrary. We present Exploratory Iteration (ExIt), a family of autocurriculum RL methods that directly exploits the recurrent structure of self-improvement tasks to train LLMs to perform multi-step self-improvement at inference-time while only training on the most informative single-step iterations. ExIt grows a task space by selectively sampling the most informative intermediate, partial histories encountered during an episode for continued iteration, treating these starting points as new self-iteration task instances to train a self-improvement policy. ExIt can further pair with explicit exploration mechanisms to sustain greater task diversity. Across several domains, encompassing competition math, multi-turn tool-use, and machine learning engineering, we demonstrate that ExIt strategies, starting from either a single or many task instances, can produce policies exhibiting strong inference-time self-improvement on held-out task instances, and the ability to iterate towards higher performance over a step budget extending beyond the average iteration depth encountered during training.
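
The autocurriculum mechanics can be sketched compactly: partial histories are scored for informativeness and preferentially resampled as new starting points. The variance-based scoring rule below is an illustrative assumption standing in for however ExIt actually ranks intermediate states.

```python
# Minimal sketch of ExIt-style task-space growth: partial solution histories
# are scored (here by score variance across rollouts, an assumption) and
# re-sampled as fresh starting points for single-step improvement training.
import random

class TaskBuffer:
    def __init__(self, capacity=1000):
        self.items = []  # (partial_history, informativeness)
        self.capacity = capacity

    def add(self, history, scores):
        # High variance in downstream scores = informative starting point.
        mean = sum(scores) / len(scores)
        var = sum((s - mean) ** 2 for s in scores) / len(scores)
        self.items.append((history, var))
        self.items.sort(key=lambda x: -x[1])
        del self.items[self.capacity:]

    def sample(self):
        # Bias sampling toward more informative entries.
        weights = [v for _, v in self.items]
        return random.choices(self.items, weights=weights, k=1)[0][0]

buf = TaskBuffer()
buf.add(history=["draft_1"], scores=[0.2, 0.9, 0.5])
buf.add(history=["draft_1", "draft_2"], scores=[0.6, 0.61, 0.59])
start = buf.sample()  # likely the high-variance (more informative) history
```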

[269] Instance-Wise Adaptive Sampling for Dataset Construction in Approximating Inverse Problem Solutions

Jiequn Han, Kui Ren, Nathan Soedjak

Main category: cs.LG

TL;DR: Instance-wise adaptive sampling framework for efficient training dataset construction in inverse problems, dynamically allocating samples based on test instances to reduce data collection costs.

DetailsMotivation: Traditional learning-based approaches require large training datasets from prior distributions, which becomes costly when priors have high intrinsic dimensions or high accuracy is needed. The method aims to improve sample efficiency by tailoring training to specific test instances.

Method: Iterative adaptive sampling strategy that dynamically allocates sampling effort based on the specific test instance, refining the training dataset conditioned on the latest prediction to tailor it to the inverse map geometry around each test instance.

Result: Demonstrated effectiveness in inverse scattering problems with structured priors, showing that adaptive method advantages become more pronounced with complex priors or higher accuracy requirements.

Conclusion: The adaptive sampling strategy offers a scalable and practical alternative to conventional fixed-dataset training, broadly applicable to various inverse problems beyond the demonstrated scattering problem.

Abstract: We propose an instance-wise adaptive sampling framework for constructing compact and informative training datasets for supervised learning of inverse problem solutions. Typical learning-based approaches aim to learn a general-purpose inverse map from datasets drawn from a prior distribution, with the training process independent of the specific test instance. When the prior has a high intrinsic dimension or when high accuracy of the learned solution is required, a large number of training samples may be needed, resulting in substantial data collection costs. In contrast, our method dynamically allocates sampling effort based on the specific test instance, enabling significant gains in sample efficiency. By iteratively refining the training dataset conditioned on the latest prediction, the proposed strategy tailors the dataset to the geometry of the inverse map around each test instance. We demonstrate the effectiveness of our approach in the inverse scattering problem under two types of structured priors. Our results show that the advantage of the adaptive method becomes more pronounced in settings with more complex priors or higher accuracy requirements. While our experiments focus on a particular inverse problem, the adaptive sampling strategy is broadly applicable and readily extends to other inverse problems, offering a scalable and practical alternative to conventional fixed-dataset training regimes.
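
A toy version of the instance-wise loop, under heavy simplifying assumptions (a 1-D forward map, a nearest-neighbor "inverse model", and a Gaussian proposal around the current estimate), looks like this:

```python
# Minimal sketch of instance-wise adaptive sampling: the training set is
# grown around the current prediction for one specific test instance.
import numpy as np

def simulate_forward(x):            # forward map F: parameters -> data
    return np.sin(3 * x) + 0.5 * x  # toy 1-D stand-in

def adaptive_fit(y_test, n_rounds=5, n_per_round=50, width=1.0):
    x_train = np.random.uniform(-3, 3, 200)  # initial prior samples
    y_train = simulate_forward(x_train)
    x_hat = 0.0
    for _ in range(n_rounds):
        # Fit a local inverse model (nearest neighbor here for simplicity).
        x_hat = x_train[np.argmin(np.abs(y_train - y_test))]
        # Refine: sample new training points near the current estimate.
        x_new = np.random.normal(x_hat, width, n_per_round)
        x_train = np.concatenate([x_train, x_new])
        y_train = simulate_forward(x_train)
        width *= 0.5  # shrink the proposal each round
    return x_hat

estimate = adaptive_fit(y_test=simulate_forward(1.2))
```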

[270] Hierarchical Multi-agent Reinforcement Learning for Cyber Network Defense

Aditya Vikram Singh, Ethan Rathbun, Emma Graham, Lisa Oakley, Simona Boboila, Alina Oprea, Peter Chin

Main category: cs.LG

TL;DR: Hierarchical PPO architecture for autonomous cyber defense using MARL, decomposing tasks into sub-policies for investigation and recovery, achieving superior performance in convergence speed and security metrics.

DetailsMotivation: Cybersecurity defense against sophisticated adversaries is challenging and typically requires teams of human operators. MARL offers opportunities to build autonomous defenses that can handle large policy spaces, partial observability, and deceptive adversarial strategies.

Method: Proposed hierarchical PPO architecture that decomposes cyber defense into sub-tasks (network investigation, host recovery). Trained sub-policies using PPO enhanced with cybersecurity domain expertise, coordinated by a master defense policy. Sub-policies can be fine-tuned and transferred efficiently.

Result: Achieved top performance in CybORG Cage 4 environment compared to multiple baselines. Superior convergence speed, episodic return, and cybersecurity metrics including fraction of clean machines, precision, and false positive rates across different adversaries.

Conclusion: Hierarchical MARL approach effectively addresses cyber defense challenges, enables efficient learning and generalization, and provides strong performance against various adversarial strategies with transferable sub-policies.

Abstract: Recent advances in multi-agent reinforcement learning (MARL) have created opportunities to solve complex real-world tasks. Cybersecurity is a notable application area, where defending networks against sophisticated adversaries remains a challenging task typically performed by teams of security operators. In this work, we explore novel MARL strategies for building autonomous cyber network defenses that address challenges such as large policy spaces, partial observability, and stealthy, deceptive adversarial strategies. To facilitate efficient and generalized learning, we propose a hierarchical Proximal Policy Optimization (PPO) architecture that decomposes the cyber defense task into specific sub-tasks like network investigation and host recovery. Our approach involves training sub-policies for each sub-task using PPO enhanced with cybersecurity domain expertise. These sub-policies are then leveraged by a master defense policy that coordinates their selection to solve complex network defense tasks. Furthermore, the sub-policies can be fine-tuned and transferred with minimal cost to defend against shifts in adversarial behavior or changes in network settings. We conduct extensive experiments using CybORG Cage 4, the state-of-the-art MARL environment for cyber defense. Comparisons with multiple baselines across different adversaries show that our hierarchical learning approach achieves top performance in terms of convergence speed, episodic return, and several interpretable metrics relevant to cybersecurity, including the fraction of clean machines on the network, precision, and false positives.
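
The hierarchical control flow reduces to a master policy dispatching to sub-policies; the stub below shows that structure only, with random and rule-based stand-ins where the paper uses PPO-trained networks.

```python
# Minimal sketch of hierarchical dispatch: a master policy picks a sub-policy
# (investigate vs. recover), which then emits the low-level action. All
# policies are stubs; in the paper each is trained with PPO.
import random

class SubPolicy:
    def __init__(self, actions):
        self.actions = actions

    def act(self, obs):
        return random.choice(self.actions)  # stand-in for a PPO policy

investigate = SubPolicy(["scan_host", "analyse_traffic"])
recover = SubPolicy(["restore_host", "remove_malware"])

def master_policy(obs):
    # Stand-in for the learned coordinator: recover if a compromise is
    # suspected, otherwise keep investigating.
    return recover if obs.get("suspected_compromise") else investigate

obs = {"suspected_compromise": True}
action = master_policy(obs).act(obs)
```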

[271] Balanced Multimodal Learning: An Unidirectional Dynamic Interaction Perspective

Shijie Wang, Li Zhang, Xinyan Liang, Yuhua Qian, Shen Hu

Main category: cs.LG

TL;DR: UDI introduces a sequential training strategy that avoids joint loss to prevent modality imbalance, using anchor modality guidance and dynamic interaction adjustment for better multimodal learning performance.

DetailsMotivation: Traditional multimodal joint loss causes modality imbalance where strong modalities dominate weaker ones, limiting effective information utilization from individual modalities and their interactions.

Method: Unidirectional Dynamic Interaction (UDI) - sequential training that first converges anchor modality, then uses its representations to guide other modalities via unsupervised loss with dynamic interaction adjustment.

Result: UDI outperforms existing methods in handling modality imbalance and achieves performance improvement in multimodal learning tasks.

Conclusion: By abandoning joint loss and enabling proactive sequential training with directed information flow, UDI effectively prevents modality domination and fosters better cross-modal feature learning.

Abstract: Multimodal learning typically utilizes a multimodal joint loss to integrate different modalities and enhance model performance. However, this joint learning strategy can induce modality imbalance, where strong modalities overwhelm weaker ones, limiting the exploitation of individual information from each modality and of inter-modality interaction information. Existing strategies such as dynamic loss weighting, auxiliary objectives, and gradient modulation mitigate modality imbalance based on the joint loss. These methods remain fundamentally reactive, detecting and correcting imbalance after it arises, while leaving the competitive nature of the joint loss untouched. This limitation drives us to explore a new strategy for multimodal imbalance learning that does not rely on the joint loss, enabling more effective interactions between modalities and better utilization of information from individual modalities and their interactions. In this paper, we introduce Unidirectional Dynamic Interaction (UDI), a novel strategy that abandons the conventional joint loss in favor of a proactive, sequential training scheme. UDI first trains the anchor modality to convergence, then uses its learned representations to guide the other modality via an unsupervised loss. Furthermore, the dynamic adjustment of modality interactions allows the model to adapt to the task at hand, ensuring that each modality contributes optimally. By decoupling modality optimization and enabling directed information flow, UDI prevents domination by any single modality and fosters effective cross-modal feature learning. Our experimental results demonstrate that UDI outperforms existing methods in handling modality imbalance, leading to performance improvement in multimodal learning tasks.
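
A minimal sketch of the two-stage scheme follows: the anchor encoder is trained first (omitted), then frozen, and the second modality is fit to it with an unsupervised alignment loss. Cosine alignment and the linear encoders are illustrative assumptions.

```python
# Minimal sketch of unidirectional guidance: anchor trained first, then
# frozen and used as the target for the other modality. The cosine loss
# is one plausible choice of unsupervised alignment objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

anchor_enc = nn.Linear(64, 32)  # stage 1: trained to convergence (omitted)
other_enc = nn.Linear(128, 32)  # stage 2: trained against anchor targets

for p in anchor_enc.parameters():
    p.requires_grad_(False)     # anchor is frozen during stage 2

x_anchor, x_other = torch.randn(16, 64), torch.randn(16, 128)
with torch.no_grad():
    target = anchor_enc(x_anchor)
pred = other_enc(x_other)
# Unsupervised guidance: align the weaker modality to anchor representations.
loss = 1 - F.cosine_similarity(pred, target, dim=-1).mean()
loss.backward()
```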

[272] Toward Faithfulness-guided Ensemble Interpretation of Neural Network

Siyu Zhang, Kenneth Mcmillan

Main category: cs.LG

TL;DR: FEI is a faithfulness-guided ensemble framework that enhances neural network interpretability through improved visualization and quantitative faithfulness scores using smooth approximation techniques.

DetailsMotivation: Interpretable and faithful explanations are crucial for understanding neural network behavior, but existing methods lack comprehensive faithfulness evaluation and effective visualization.

Method: FEI uses faithfulness-guided ensemble interpretation with smooth approximation to improve quantitative scores, targets hidden layer faithfulness, and introduces a novel qualitative metric for evaluation.

Result: FEI surpasses existing methods in extensive experiments, showing substantial advances in both qualitative visualization and quantitative faithfulness scores.

Conclusion: The research establishes a comprehensive framework for elevating faithfulness in neural network explanations with emphasis on both breadth and precision.

Abstract: Interpretable and faithful explanations for specific neural inferences are crucial for understanding and evaluating model behavior. Our work introduces Faithfulness-guided Ensemble Interpretation (FEI), an innovative framework that enhances the breadth and effectiveness of faithfulness, advancing interpretability by providing superior visualization. Through an analysis of existing evaluation benchmarks, FEI employs a smooth approximation to elevate quantitative faithfulness scores. Diverse variations of FEI target enhanced faithfulness in hidden layer encodings, expanding interpretability. Additionally, we propose a novel qualitative metric that assesses hidden layer faithfulness. In extensive experiments, FEI surpasses existing methods, demonstrating substantial advances in qualitative visualization and quantitative faithfulness scores. Our research establishes a comprehensive framework for elevating faithfulness in neural network explanations, emphasizing both breadth and precision.

[273] Quantum-Enhanced Multi-Task Learning with Learnable Weighting for Pharmacokinetic and Toxicity Prediction

Han Zhang, Fengji Ma, Jiamin Su, Xinyue Yang, Lei Wang, Wen-Cai Ye, Li Liu

Main category: cs.LG

TL;DR: A new quantum-enhanced multi-task learning framework (QW-MTL) for ADMET prediction that outperforms single-task methods on 12/13 tasks using quantum chemical descriptors and adaptive task weighting.

DetailsMotivation: Existing single-task learning methods fail to exploit task complementarities and require more computational resources for independent task training and inference.

Method: Built on Chemprop-RDKit backbone, uses quantum chemical descriptors for enriched molecular representations and introduces exponential task weighting scheme combining dataset-scale priors with learnable parameters for dynamic loss balancing.

Result: Significantly outperforms single-task baselines on 12 out of 13 TDC classification benchmarks with minimal model complexity and fast inference.

Conclusion: Demonstrates effectiveness and efficiency of multi-task molecular learning enhanced by quantum-informed features and adaptive task weighting for ADMET prediction.

Abstract: Prediction for ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) plays a crucial role in drug discovery and development, accelerating the screening and optimization of new drugs. Existing methods primarily rely on single-task learning (STL), which often fails to fully exploit the complementarities between tasks. Moreover, training and running inference for each task independently requires more computational resources. To address these issues, we propose a new unified Quantum-enhanced and task-Weighted Multi-Task Learning (QW-MTL) framework, specifically designed for ADMET classification tasks. Built upon the Chemprop-RDKit backbone, QW-MTL adopts quantum chemical descriptors to enrich molecular representations with additional information about the electronic structure and interactions. Meanwhile, it introduces a novel exponential task weighting scheme that combines dataset-scale priors with learnable parameters to achieve dynamic loss balancing across tasks. To the best of our knowledge, this is the first work to systematically conduct joint multi-task training across all 13 Therapeutics Data Commons (TDC) classification benchmarks, using leaderboard-style data splits to ensure a standardized and realistic evaluation setting. Extensive experimental results show that QW-MTL significantly outperforms single-task baselines on 12 out of 13 tasks, achieving high predictive performance with minimal model complexity and fast inference, demonstrating the effectiveness and efficiency of multi-task molecular learning enhanced by quantum-informed features and adaptive task weighting.
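
One plausible reading of the weighting scheme (a softmax over a dataset-size prior plus a learnable per-task offset) can be sketched as follows; the exact functional form in QW-MTL may differ.

```python
# Minimal sketch of an exponential task-weighting scheme: per-task loss
# weights combine a fixed dataset-size prior with a learnable parameter.
# Illustrative interpretation only, not the paper's exact formulation.
import torch
import torch.nn as nn

class ExpTaskWeighting(nn.Module):
    def __init__(self, task_sizes):
        super().__init__()
        sizes = torch.tensor(task_sizes, dtype=torch.float32)
        self.log_prior = torch.log(sizes / sizes.sum())           # fixed prior
        self.alpha = nn.Parameter(torch.zeros(len(task_sizes)))   # learnable

    def forward(self, task_losses):
        # Weight = softmax(prior + learnable offset) across tasks.
        w = torch.softmax(self.log_prior + self.alpha, dim=0)
        return (w * task_losses).sum()

weighting = ExpTaskWeighting(task_sizes=[1200, 450, 9800])
task_losses = torch.stack([torch.rand(()) for _ in range(3)])
total = weighting(task_losses)  # scalar loss for joint backpropagation
```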

[274] Measuring the Measures: Discriminative Capacity of Representational Similarity Metrics Across Model Families

Jialin Wu, Shreya Saha, Yiqing Bo, Meenakshi Khosla

Main category: cs.LG

TL;DR: Systematic comparison of representational similarity metrics shows that separability increases with stricter alignment constraints, with soft-matching performing best among mapping-based methods.

DetailsMotivation: Lack of systematic comparisons of representational similarity metrics' discriminative power across different model families and training regimes in neuroscience and AI.

Method: Quantitative framework using three separability measures (d-prime, silhouette coefficients, ROC-AUC) to evaluate metrics including RSA, linear predictivity, Procrustes, and soft matching across various architectures and training approaches.

Result: Separability systematically increases with more stringent alignment constraints; soft-matching achieves highest separability among mapping-based methods, followed by Procrustes and linear predictivity; non-fitting methods like RSA also show strong separability.

Conclusion: Provides first systematic comparison of similarity metrics through separability lens, clarifying relative sensitivity and guiding metric selection for large-scale model and brain comparisons.

Abstract: Representational similarity metrics are fundamental tools in neuroscience and AI, yet we lack systematic comparisons of their discriminative power across model families. We introduce a quantitative framework to evaluate representational similarity measures based on their ability to separate model families, across architectures (CNNs, Vision Transformers, Swin Transformers, ConvNeXt) and training regimes (supervised vs. self-supervised). Using three complementary separability measures (d-prime from signal detection theory, silhouette coefficients, and ROC-AUC), we systematically assess the discriminative capacity of commonly used metrics including RSA, linear predictivity, Procrustes, and soft matching. We show that separability systematically increases as metrics impose more stringent alignment constraints. Among mapping-based approaches, soft-matching achieves the highest separability, followed by Procrustes alignment and linear predictivity. Non-fitting methods such as RSA also yield strong separability across families. These results provide the first systematic comparison of similarity metrics through a separability lens, clarifying their relative sensitivity and guiding metric choice for large-scale model and brain comparisons.
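
The separability computation itself is standard signal detection: given the similarity scores a metric assigns to same-family versus different-family model pairs, d-prime measures how far apart the two score distributions sit. A small sketch with synthetic scores:

```python
# Minimal sketch of the d-prime separability measure over synthetic
# pairwise similarity scores (the real scores come from metrics like
# RSA or Procrustes applied to model representations).
import numpy as np

def d_prime(within, between):
    """Signal-detection d': separation of two score distributions."""
    mu_w, mu_b = np.mean(within), np.mean(between)
    pooled_sd = np.sqrt(0.5 * (np.var(within) + np.var(between)))
    return (mu_w - mu_b) / pooled_sd

within_family = np.random.normal(0.8, 0.1, 100)   # same architecture family
between_family = np.random.normal(0.5, 0.1, 100)  # different families
print(f"d' = {d_prime(within_family, between_family):.2f}")
```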

[275] Learning to Coordinate: Distributed Meta-Trajectory Optimization Via Differentiable ADMM-DDP

Bingheng Wang, Yichao Gao, Tianchen Sun, Lin Zhao

Main category: cs.LG

TL;DR: L2C is a meta-learning framework that automatically tunes hyperparameters for distributed trajectory optimization (ADMM-DDP) using agent-wise neural networks, enabling adaptive coordination across diverse multi-agent tasks with efficient gradient computation.

DetailsMotivation: Distributed trajectory optimization via ADMM-DDP requires extensive manual tuning of tightly coupled hyperparameters that affect both local task performance and global coordination, which is challenging and time-consuming.

Method: Proposes Learning to Coordinate (L2C) framework with lightweight agent-wise neural networks that meta-learn hyperparameters. Uses end-to-end differentiation through ADMM-DDP pipeline, reuses DDP components for efficient meta-gradient computation, and employs truncated iterations with meta-learned ADMM penalty parameters.

Result: L2C generates dynamically feasible trajectories in high-fidelity simulation, reconfigures quadrotor formations for safe 6-DoF load manipulation in tight spaces, adapts to varying team sizes and task conditions, and achieves up to 88% faster gradient computation than state-of-the-art methods.

Conclusion: L2C provides a general framework for automatic hyperparameter tuning in distributed multi-agent coordination, enabling robust adaptation across diverse scenarios while significantly improving computational efficiency compared to existing approaches.

Abstract: Distributed trajectory optimization via ADMM-DDP is a powerful approach for coordinating multi-agent systems, but it requires extensive tuning of tightly coupled hyperparameters that jointly govern local task performance and global coordination. In this paper, we propose Learning to Coordinate (L2C), a general framework that meta-learns these hyperparameters, modeled by lightweight agent-wise neural networks, to adapt across diverse tasks and agent configurations. L2C differentiates end-to-end through the ADMM-DDP pipeline in a distributed manner. It also enables efficient meta-gradient computation by reusing DDP components such as Riccati recursions and feedback gains. These gradients correspond to the optimal solutions of distributed matrix-valued LQR problems, coordinated across agents via an auxiliary ADMM framework that becomes convex under mild assumptions. Training is further accelerated by truncating iterations and meta-learning ADMM penalty parameters optimized for rapid residual reduction, with provable Lipschitz-bounded gradient errors. On a challenging cooperative aerial transport task, L2C generates dynamically feasible trajectories in high-fidelity simulation using IsaacSIM, reconfigures quadrotor formations for safe 6-DoF load manipulation in tight spaces, and adapts robustly to varying team sizes and task conditions, while achieving up to 88% faster gradient computation than state-of-the-art methods.

[276] Split Conformal Prediction in the Function Space with Neural Operators

David Millard, Lars Lindemann, Ali Baheri

Main category: cs.LG

TL;DR: Extends split conformal prediction to function spaces for neural operators, providing finite-sample coverage guarantees through discretization and asymptotic convergence analysis.

DetailsMotivation: Neural operators lack finite-sample coverage guarantees for functional outputs, and existing methods require strong assumptions or yield conservative coverage.

Method: Two-step approach: establish finite-sample guarantees in finite-dimensional space using discretization, then lift to function-space via asymptotic convergence. Includes regression-based correction and diagnostic metrics.

Result: Method maintains calibrated coverage with less variation under resolution shifts and achieves better coverage in super-resolution tasks.

Conclusion: Successfully extends conformal prediction to function spaces, providing reliable uncertainty quantification for neural operators with practical diagnostic tools.

Abstract: Uncertainty quantification for neural operators remains an open problem in the infinite-dimensional setting due to the lack of finite-sample coverage guarantees over functional outputs. While conformal prediction offers finite-sample guarantees in finite-dimensional spaces, it does not directly extend to function-valued outputs. Existing approaches (Gaussian processes, Bayesian neural networks, and quantile-based operators) require strong distributional assumptions or yield conservative coverage. This work extends split conformal prediction to function spaces following a two-step method. We first establish finite-sample coverage guarantees in a finite-dimensional space using a discretization map in the output function space. These guarantees are then lifted to the function space by considering the asymptotic convergence as the discretization is refined. To characterize the effect of resolution, we decompose the conformal radius into discretization, calibration, and misspecification components. This decomposition motivates a regression-based correction to transfer calibration across resolutions. Additionally, we propose two diagnostic metrics (conformal ensemble score and internal agreement) to quantify forecast degradation in autoregressive settings. Empirical results show that our method maintains calibrated coverage with less variation under resolution shifts and achieves better coverage in super-resolution tasks.
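
At a fixed resolution, the finite-dimensional step is ordinary split conformal prediction with a nonconformity score over grid points; the sketch below shows that step with a sup-norm score, a stand-in predictor, and synthetic calibration data.

```python
# Minimal sketch of split conformal prediction on a discretized function
# output: the sup-norm residual over grid points gives a uniform band with
# finite-sample coverage at this resolution. The predictor is a stand-in.
import numpy as np

def predictor(x, grid):  # stand-in for a trained neural operator
    return np.sin(np.outer(x, grid))

grid = np.linspace(0, 1, 64)  # output discretization
x_cal = np.random.rand(200)
y_cal = np.sin(np.outer(x_cal, grid)) + 0.05 * np.random.randn(200, 64)

# Calibration scores: sup-norm residual per calibration function.
scores = np.max(np.abs(y_cal - predictor(x_cal, grid)), axis=1)
alpha = 0.1
q = np.quantile(scores, np.ceil((1 - alpha) * (len(scores) + 1)) / len(scores))

# Prediction band for a new input: pointwise prediction +/- q on the grid.
x_new = np.array([0.3])
band_center = predictor(x_new, grid)[0]
lower, upper = band_center - q, band_center + q
```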

[277] Fundamental bounds on efficiency-confidence trade-off for transductive conformal prediction

Arash Behboodi, Alvaro H. C. Correia, Fabio Valerio Massoli, Christos Louizos

Main category: cs.LG

TL;DR: Transductive conformal prediction shows a fundamental trade-off: higher confidence requires exponentially larger prediction sets, with size scaling linearly with sample count and proportional to data entropy.

DetailsMotivation: To understand the fundamental limits of transductive conformal prediction methods in simultaneously predicting multiple data points while maintaining confidence guarantees.

Method: Derived strict finite-sample bounds analyzing the relationship between confidence levels and prediction set efficiency, measured by set size. Examined special case where all test points share the same label.

Result: Proved that non-trivial confidence levels lead to exponential growth in prediction set size, with exponent scaling linearly with sample count and proportional to conditional entropy. Showed bound is achievable in idealized settings.

Conclusion: There exists an inherent trade-off between confidence and efficiency in transductive conformal prediction, with fundamental limits governed by data entropy and dispersion characteristics.

Abstract: Transductive conformal prediction addresses the simultaneous prediction for multiple data points. Given a desired confidence level, the objective is to construct a prediction set that includes the true outcomes with the prescribed confidence. We demonstrate a fundamental trade-off between confidence and efficiency in transductive methods, where efficiency is measured by the size of the prediction sets. Specifically, we derive a strict finite-sample bound showing that any non-trivial confidence level leads to exponential growth in prediction set size for data with inherent uncertainty. The exponent scales linearly with the number of samples and is proportional to the conditional entropy of the data. Additionally, the bound includes a second-order term, dispersion, defined as the variance of the log conditional probability distribution. We show that this bound is achievable in an idealized setting. Finally, we examine a special case of transductive prediction where all test data points share the same label. We show that this scenario reduces to the hypothesis testing problem with empirically observed statistics and provide an asymptotically optimal confidence predictor, along with an analysis of the error exponent.

[278] Interpreting Transformer Architectures as Implicit Multinomial Regression

Jonas A. Actor, Anthony Gruber, Eric C. Cyr

Main category: cs.LG

TL;DR: Attention mechanisms in transformers are connected to multinomial regression, showing that optimal feature learning through attention aligns with classification-optimal features.

DetailsMotivation: To understand the opaque attention mechanism in transformers and its relationship to concepts like feature polysemanticity, superposition, and model performance.

Method: Established a connection between attention mechanisms and multinomial regression by showing that optimizing over latent features yields solutions that align with attention block dynamics.

Result: The evolution of representations through transformers can be interpreted as a trajectory that recovers optimal features for classification.

Conclusion: Attention mechanisms in transformers effectively learn optimal classification features through their representation evolution process.

Abstract: Mechanistic interpretability aims to understand how internal components of modern machine learning models, such as weights, activations, and layers, give rise to the model’s overall behavior. One particularly opaque mechanism is attention: despite its central role in transformer models, its mathematical underpinnings and relationship to concepts like feature polysemanticity, superposition, and model performance remain poorly understood. This paper establishes a novel connection between attention mechanisms and multinomial regression. Specifically, we show that in a fixed multinomial regression setting, optimizing over latent features yields optimal solutions that align with the dynamics induced by attention blocks. In other words, the evolution of representations through a transformer can be interpreted as a trajectory that recovers the optimal features for classification.

[279] Flexible inference of learning rules from de novo learning data using neural networks

Yuhan Helena Liu, Victor Geadah, Jonathan Pillow

Main category: cs.LG

TL;DR: A flexible nonparametric framework using deep neural networks to infer biological learning rules from animal decision-making data during de novo task learning, revealing asymmetric updates and history dependence.

DetailsMotivation: To understand how animals learn new behaviors from scratch, which is challenging for existing approaches that assume specific parametric forms or are limited to simplified settings, and to develop animal-aligned artificial intelligence.

Method: Proposed a nonparametric framework parameterizing per-trial policy weight updates with deep neural networks (DNN), validated through simulation, and extended to recurrent variants (RNN) to capture non-Markovian dynamics and trial history dependence.

Result: Models improved predictions on held-out behavioral data from mice learning sensory decision-making tasks, revealing asymmetric updates after correct vs error trials and history dependence consistent with non-Markovian learning.

Conclusion: The framework provides a flexible approach for inferring biological learning rules from behavioral data in de novo learning tasks, offering insights for experimental protocols and behavioral digital twin development.

Abstract: Understanding how animals learn is a central challenge in neuroscience, with growing relevance to the development of animal- or human-aligned artificial intelligence. However, most existing approaches assume specific parametric forms for the learning rule (e.g., Q-learning, policy gradient) or are limited to simplified settings like bandit tasks, which do not involve learning a new input-output mapping from scratch. In contrast, animals must often learn new behaviors de novo, which poses a rich challenge for learning-rule inference. We target this problem by inferring learning rules directly from animal decision-making data during de novo task learning, a setting that requires models flexible enough to capture suboptimality, history dependence, and rich external stimulus integration without strong structural priors. We first propose a nonparametric framework that parameterizes the per-trial update of policy weights with a deep neural network (DNN), and validate it by recovering ground-truth rules in simulation. We then extend to a recurrent variant (RNN) that captures non-Markovian dynamics by allowing updates to depend on trial history. Applied to a large behavioral dataset of mice learning a sensory decision-making task over multiple weeks, our models improved predictions on held-out data. The inferred rules revealed asymmetric updates after correct versus error trials and history dependence, consistent with non-Markovian learning. Overall, these results introduce a flexible framework for inferring biological learning rules from behavioral data in de novo learning tasks, providing insights to inform experimental training protocols and the development of behavioral digital twins.
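
The core modeling move, replacing a parametric update rule with a small network that maps the current policy weights and trial outcome to the next weights, can be sketched as follows; the shapes and inputs are illustrative assumptions.

```python
# Minimal sketch of the nonparametric learning-rule model: an MLP
# parameterizes the per-trial update of the animal's policy weights.
# Inputs (choice, reward) and sizes are illustrative stand-ins.
import torch
import torch.nn as nn

class LearnedUpdateRule(nn.Module):
    def __init__(self, n_weights=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_weights + 2, 32), nn.Tanh(), nn.Linear(32, n_weights))

    def forward(self, w, choice, reward):
        inp = torch.cat([w, choice.unsqueeze(-1), reward.unsqueeze(-1)], -1)
        return w + self.net(inp)  # next trial's policy weights

rule = LearnedUpdateRule()
w = torch.zeros(4)  # initial policy weights
w = rule(w, choice=torch.tensor(1.0), reward=torch.tensor(0.0))
```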

[280] Beyond Ordinary Lipschitz Constraints: Differentially Private Stochastic Optimization with Tsybakov Noise Condition

Difei Xu, Meng Ding, Zihang Xiang, Jinhui Xu, Di Wang

Main category: cs.LG

TL;DR: This paper studies differentially private stochastic convex optimization under Tsybakov Noise Condition, proposing novel algorithms with utility bounds independent of Lipschitz constant and establishing matching lower bounds.

DetailsMotivation: Previous DP-SCO studies assumed Lipschitz losses, but many real-world problems have unbounded Lipschitz constants. The paper aims to handle cases where the gradient has bounded moments rather than requiring bounded Lipschitz constants.

Method: Proposed (ε,δ)-DP algorithms for population risk functions satisfying TNC with θ>1. For Lipschitz case with θ≥2, developed algorithm with utility bound independent of Lipschitz constant. Extended to non-Lipschitz cases when ε is small.

Result: Achieved utility bounds of Õ((r̃_{2k}(1/√n + √d/(nε))^{(k-1)/k})^{θ/(θ-1)}) for Lipschitz case, and Õ((r̃_k(1/√n + √d/(nε))^{(k-1)/k})^{θ/(θ-1)}) for non-Lipschitz case. Showed matching lower bounds for ρ-zCDP.

Conclusion: The paper provides the first DP-SCO results for TNC settings with potentially unbounded Lipschitz constants, establishing optimal rates that depend on gradient moments rather than Lipschitz bounds, significantly expanding the applicability of DP-SCO.

Abstract: We study Stochastic Convex Optimization in the Differential Privacy model (DP-SCO). Unlike previous studies, here we assume the population risk function satisfies the Tsybakov Noise Condition (TNC) with some parameter $\theta>1$, where the Lipschitz constant of the loss could be extremely large or even unbounded, but the $\ell_2$-norm gradient of the loss has bounded $k$-th moment with $k\geq 2$. For the Lipschitz case with $\theta\geq 2$, we first propose an $(\varepsilon, \delta)$-DP algorithm whose utility bound is $\tilde{O}\left(\left(\tilde{r}_{2k}\left(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\varepsilon}\right)^{\frac{k-1}{k}}\right)^{\frac{\theta}{\theta-1}}\right)$ in high probability, where $n$ is the sample size, $d$ is the model dimension, and $\tilde{r}_{2k}$ is a term that only depends on the $2k$-th moment of the gradient. It is notable that such an upper bound is independent of the Lipschitz constant. We then extend to the case where $\theta\geq \bar{\theta}> 1$ for some known constant $\bar{\theta}$. Moreover, when the privacy budget $\varepsilon$ is small enough, we show an upper bound of $\tilde{O}\left(\left(\tilde{r}_{k}\left(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\varepsilon}\right)^{\frac{k-1}{k}}\right)^{\frac{\theta}{\theta-1}}\right)$ even if the loss function is not Lipschitz. For the lower bound, we show that for any $\theta\geq 2$, the private minimax rate for $\rho$-zero-concentrated differential privacy is lower bounded by $\Omega\left(\left(\tilde{r}_{k}\left(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\sqrt{\rho}}\right)^{\frac{k-1}{k}}\right)^{\frac{\theta}{\theta-1}}\right)$.

[281] Echoes Before Collapse: Deep Learning Detection of Flickering in Complex Systems

Yazdan Babazadeh Maghsoodlo, Madhur Anand, Chris T. Bauch

Main category: cs.LG

TL;DR: Deep learning models can detect flickering patterns (noise-driven switching between stable states) in complex systems, providing early warning signals for critical regime shifts.

DetailsMotivation: Flickering is a key indicator of reduced resilience in climate systems, ecosystems, and financial markets, often preceding impactful but hard-to-predict regime shifts. Current methods for detecting these early warning signals need improvement.

Method: Used convolutional LSTM (CNN LSTM) models trained on synthetic time series generated from simple polynomial functions with additive noise to identify flickering patterns.

Result: Models accurately identified flickering patterns and generalized well to diverse stochastic systems. Successfully detected flickering in empirical datasets including dormouse body temperature records and African Humid Period palaeoclimate proxies.

Conclusion: Deep learning provides a flexible framework for extracting early warning signals from noisy, nonlinear time series, enabling identification of instability across various dynamical systems.

Abstract: Deep learning offers powerful tools for anticipating tipping points in complex systems, yet its potential for detecting flickering (noise-driven switching between coexisting stable states) remains unexplored. Flickering is a hallmark of reduced resilience in climate systems, ecosystems, financial markets, and other systems. It can precede critical regime shifts that are highly impactful but difficult to predict. Here we show that convolutional long short-term memory (CNN LSTM) models, trained on synthetic time series generated from simple polynomial functions with additive noise, can accurately identify flickering patterns. Despite being trained on simplified dynamics, our models generalize to diverse stochastic systems and reliably detect flickering in empirical datasets, including dormouse body temperature records and palaeoclimate proxies from the African Humid Period. These findings demonstrate that deep learning can extract early warning signals from noisy, nonlinear time series, providing a flexible framework for identifying instability across a wide range of dynamical systems.
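
A minimal sketch of the recipe follows: synthetic bistable series with high or low noise serve as flickering/non-flickering training examples for a small CNN-LSTM classifier (Keras assumed; layer sizes are illustrative, not the paper's architecture).

```python
# Minimal sketch: generate noisy double-well time series (high noise drives
# flickering between the two stable states) and train a small CNN-LSTM to
# distinguish flickering from non-flickering dynamics.
import numpy as np
from tensorflow import keras

def synthetic_series(flicker, n=500):
    """Noise-driven switching between two states of a bistable system."""
    x = np.zeros(n)
    for t in range(1, n):
        drift = -(x[t - 1] ** 3) + x[t - 1]       # bistable polynomial drift
        noise = (0.6 if flicker else 0.1) * np.random.randn()
        x[t] = x[t - 1] + 0.1 * drift + noise
    return x

X = np.stack([synthetic_series(f) for f in ([True] * 64 + [False] * 64)])
y = np.array([1] * 64 + [0] * 64)

model = keras.Sequential([
    keras.layers.Conv1D(16, 5, activation="relu", input_shape=(500, 1)),
    keras.layers.MaxPooling1D(2),
    keras.layers.LSTM(32),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X[..., None], y, epochs=2, verbose=0)
```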

[282] KRAFT: A Knowledge Graph-Based Framework for Automated Map Conflation

Farnoosh Hashemi, Laks V. S. Lakshmanan

Main category: cs.LG

TL;DR: KRAFT is a learning-based map conflation approach that addresses limitations of existing methods by using knowledge graphs, neural matching, and optimization to merge geospatial databases without inconsistencies.

DetailsMotivation: Existing map conflation methods are limited to linear objects and rely on heuristic rules, missing most entities and lacking data-driven learning capabilities for entity matching.

Method: Three-part approach: (1) Knowledge Graph Construction to represent GDBs, (2) Map Matching using knowledge graph alignment and geospatial feature encoding, (3) Map Merging via mixed integer linear programming for consistent merging.

Result: KRAFT achieves outstanding performance compared to state-of-the-art methods in map conflation tasks, with each module separately outperforming traditional matching and merging methods.

Conclusion: KRAFT provides a comprehensive learning-based solution for map conflation that handles both linear and non-linear objects while ensuring consistency through optimization techniques.

Abstract: Digital maps play a crucial role in various applications such as navigation, fleet management, and ride-sharing, necessitating their accuracy and currency, which require timely updates. While the majority of geospatial databases (GDBs) provide high-quality information, their data is (i) limited to specific regions and/or (ii) missing some entities, even in their covered areas. Map conflation is the process of augmenting a GDB using another GDB to conflate missing spatial features. Existing map conflation methods suffer from two main limitations: (1) They are designed for the conflation of linear objects (e.g., road networks) and cannot simply be extended to non-linear objects, thus missing information about most entities in the map. (2) They are heuristic algorithmic approaches that are based on pre-defined rules, unable to learn entity matching in a data-driven manner. To address these limitations, we design KRAFT, a learning-based approach consisting of three parts: (1) Knowledge Graph Construction - where each GDB is represented by a knowledge graph, (2) Map Matching - where we use a knowledge graph alignment method as well as a geospatial feature encoder to match entities in the obtained knowledge graphs, and (3) Map Merging - where we merge the entities matched in the previous modules in a consistent manner, using a mixed integer linear programming formulation that fully merges the GDBs without adding any inconsistencies. Our experimental evaluation shows that not only does KRAFT achieve outstanding performance compared to state-of-the-art and baseline methods in map conflation tasks, but each of its modules (e.g., Map Matching and Map Merging) also separately outperforms traditional matching and merging methods.

[283] CPEP: Contrastive Pose-EMG Pre-training Enhances Gesture Generalization on EMG Signals

Wenhui Cui, Christopher Sandino, Hadi Pouransari, Ran Liu, Juri Minxha, Ellen L. Zippi, Aman Verma, Anna Sedlackova, Behrooz Mahasseni, Erdrin Azemi

Main category: cs.LG

TL;DR: A framework that aligns EMG signals with pose representations to enable zero-shot gesture classification using weak biosignals instead of traditional visual data.

DetailsMotivation: Hand gesture classification typically relies on high-quality structured data like videos and images, but using low-power biosignals like EMG allows for continuous prediction on wearables. The goal is to improve representation quality from weak-modality data by aligning it with structured data.

Method: Proposed Contrastive Pose-EMG Pre-training (CPEP) framework to align EMG and pose representations, learning an EMG encoder that produces high-quality, pose-informative representations. Evaluated through linear probing and zero-shot classification setups.

Result: Outperforms emg2pose benchmark models by up to 21% on in-distribution gesture classification and 72% on unseen (out-of-distribution) gesture classification.

Conclusion: Learning representations from weak-modality EMG data aligned with structured pose data significantly improves gesture classification performance and enables effective zero-shot classification capabilities.

Abstract: Hand gesture classification using high-quality structured data such as videos, images, and hand skeletons is a well-explored problem in computer vision. Leveraging low-power, cost-effective biosignals, e.g., surface electromyography (sEMG), allows for continuous gesture prediction on wearables. In this paper, we demonstrate that learning representations from weak-modality data that are aligned with those from structured, high-quality data can improve representation quality and enable zero-shot classification. Specifically, we propose a Contrastive Pose-EMG Pre-training (CPEP) framework to align EMG and pose representations, where we learn an EMG encoder that produces high-quality and pose-informative representations. We assess the gesture classification performance of our model through linear probing and zero-shot setups. Our model outperforms emg2pose benchmark models by up to 21% on in-distribution gesture classification and 72% on unseen (out-of-distribution) gesture classification.

[284] Natural Spectral Fusion: p-Exponent Cyclic Scheduling and Early Decision-Boundary Alignment in First-Order Optimization

Gongyue Zhang, Honghai Liu

Main category: cs.LG

TL;DR: Natural Spectral Fusion (NSF) reveals that first-order optimizers have intrinsic frequency preferences and reframes training as controllable spectral coverage. It uses p-exponent cyclic scheduling to dynamically balance low- and high-frequency information, improving cross-band fusion and achieving better accuracy with less training cost.

DetailsMotivation: The spectral behaviors in machine learning have been widely studied, but the optimizer's own spectral bias remains unclear. The authors argue that first-order optimizers exhibit intrinsic frequency preferences that significantly reshape optimization paths, motivating the need to understand and control this spectral bias.

Method: Propose Natural Spectral Fusion (NSF) with two core principles: treating optimizer as spectral controller that dynamically balances frequency information, and periodically reweighting frequency bands. Implement via p-exponent extension of second-moment term (allowing both positive/negative exponents) with cyclic scheduling.
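
An illustrative sketch of a p-exponent second-moment update with cyclic scheduling, under the reading that Adam corresponds to p = 0.5 and SGD to p = 0; the exact parameterization, schedule, and ranges in the paper may differ.

```python
import numpy as np

def p_schedule(step, period=100, p_min=-0.2, p_max=0.5):
    # Cosine cycle between p_min and p_max (values are illustrative).
    phase = (step % period) / period
    return p_min + 0.5 * (p_max - p_min) * (1 + np.cos(2 * np.pi * phase))

def update(theta, grad, v, step, lr=1e-3, beta2=0.999, eps=1e-8):
    v = beta2 * v + (1 - beta2) * grad**2        # second-moment estimate
    p = p_schedule(step)
    # p generalizes Adam's square root; negative p amplifies large-gradient
    # (high-frequency) directions instead of damping them.
    theta = theta - lr * grad / (v + eps) ** p
    return theta, v

theta, v = np.ones(4), np.zeros(4)
for step in range(200):
    grad = 0.1 * theta                            # toy quadratic-loss gradient
    theta, v = update(theta, grad, v, step)
print(theta)
```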

Result: Theory and experiments show adaptive methods emphasize low frequencies, SGD is near-neutral, negative exponents amplify high frequencies. Cyclic scheduling broadens spectral coverage, improves cross-band fusion, induces early decision-boundary alignment. Consistently reduces test error across benchmarks, matches baseline accuracy with only 1/4 training cost on some tasks.

Conclusion: NSF reveals optimizer’s role as active spectral controller and provides unified, controllable, and efficient framework for first-order optimization, demonstrating that spectral control can significantly improve training efficiency and performance without modifying model, data, or training pipeline.

Abstract: Spectral behaviors have been widely discussed in machine learning, yet the optimizer’s own spectral bias remains unclear. We argue that first-order optimizers exhibit an intrinsic frequency preference that significantly reshapes the optimization path. To address this, we propose Natural Spectral Fusion (NSF): reframing training as controllable spectral coverage and information fusion rather than merely scaling step sizes. NSF has two core principles: treating the optimizer as a spectral controller that dynamically balances low- and high-frequency information; and periodically reweighting frequency bands at negligible cost, without modifying the model, data, or training pipeline. We realize NSF via a p-exponent extension of the second-moment term, enabling both positive and negative exponents, and implement it through cyclic scheduling. Theory and experiments show that adaptive methods emphasize low frequencies, SGD is near-neutral, and negative exponents amplify high-frequency information. Cyclic scheduling broadens spectral coverage, improves cross-band fusion, and induces early decision-boundary alignment, where accuracy improves even while loss remains high. Across multiple benchmarks, with identical learning-rate strategies and fixed hyperparameters, p-exponent cyclic scheduling consistently reduces test error and demonstrates distinct convergence behavior; on some tasks, it matches baseline accuracy with only one-quarter of the training cost. Overall, NSF reveals the optimizer’s role as an active spectral controller and provides a unified, controllable, and efficient framework for first-order optimization.

[285] CoVeR: Conformal Calibration for Versatile and Reliable Autoregressive Next-Token Prediction

Yuzhu Chen, Yingjie Wang, Shunyu Liu, Yongcheng Jing, Dacheng Tao

Main category: cs.LG

TL;DR: CoVeR is a conformal prediction-based decoding strategy that provides provable coverage guarantees for autoregressive models while maintaining efficient search.

DetailsMotivation: Current decoding methods like beam search lack provable coverage guarantees and struggle to balance efficiency with diverse trajectory generation, especially for long-tail sequences needed in real-world applications.

Method: Proposes CoVeR, a model-free decoding strategy within the conformal prediction framework that maintains a compact search space while ensuring high coverage probability over desirable trajectories.
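
A generic split-conformal calibration sketch of the kind of guarantee CoVeR builds on: pick a score threshold on held-out calibration data so that, for exchangeable data, the retained set of trajectories covers the true one with probability at least 1-α. The trajectory-level score function here is a placeholder, not the paper's construction.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    n = len(cal_scores)
    # Finite-sample corrected quantile level.
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q, 1.0), method="higher")

# Toy scores: e.g., negative log-likelihood of the true trajectory.
rng = np.random.default_rng(1)
cal_scores = rng.exponential(scale=2.0, size=500)
tau = conformal_threshold(cal_scores, alpha=0.1)

# At test time, keep candidate trajectories whose score is <= tau.
candidate_scores = rng.exponential(scale=2.0, size=20)
kept = candidate_scores <= tau
print(f"threshold={tau:.3f}, kept {kept.sum()} of {len(kept)} candidates")
```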

Result: Theoretical analysis establishes a PAC-style generalization bound, guaranteeing that CoVeR asymptotically achieves a coverage rate of at least 1-α for any target level α ∈ (0,1).

Conclusion: CoVeR addresses limitations of traditional decoding methods by providing provable coverage guarantees while maintaining search efficiency, making it suitable for real-world applications requiring diverse reasoning trajectories.

Abstract: Autoregressive pre-trained models combined with decoding methods have achieved impressive performance on complex reasoning tasks. While mainstream decoding strategies such as beam search can generate plausible candidate sets, they often lack provable coverage guarantees and struggle to effectively balance search efficiency with the need for versatile trajectories, particularly those involving long-tail sequences that are essential in certain real-world applications. To address these limitations, we propose CoVeR, a novel model-free decoding strategy within the conformal prediction framework that simultaneously maintains a compact search space and ensures high coverage probability over desirable trajectories. Theoretically, we establish a PAC-style generalization bound, guaranteeing that CoVeR asymptotically achieves a coverage rate of at least $1 - \alpha$ for any target level $\alpha \in (0,1)$.

[286] Beyond I-Con: Exploring New Dimension of Distance Measures in Representation Learning

Jasmine Shone, Shaden Alshammari, Mark Hamilton, Zhening Li, William Freeman

Main category: cs.LG

TL;DR: Beyond I-Con framework replaces KL divergence with alternative statistical divergences and similarity kernels in representation learning, achieving state-of-the-art results in unsupervised clustering, supervised contrastive learning, and dimensionality reduction.

DetailsMotivation: KL divergence used in existing representation learning methods has limitations including asymmetry, unboundedness, and potential misalignment with true objectives, motivating exploration of alternative divergences.

Method: Proposes Beyond I-Con framework that systematically explores alternative statistical divergences (total variation distance, bounded f-divergences) and similarity kernels (distance-based instead of angular) to replace KL divergence.
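
A small sketch contrasting KL divergence with total variation (TV) distance on two neighborhood distributions, as in the loss substitution Beyond I-Con explores: TV is symmetric and bounded in [0, 1], unlike KL. The toy distributions are placeholders.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def tv(p, q):
    return 0.5 * float(np.sum(np.abs(p - q)))

p = np.array([0.7, 0.2, 0.1])   # supervisory similarity distribution
q = np.array([0.1, 0.2, 0.7])   # learned similarity distribution
print(f"KL(p||q)={kl(p, q):.3f}  KL(q||p)={kl(q, p):.3f}  TV={tv(p, q):.3f}")
```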

Result: Achieved SOTA in unsupervised clustering using TV distance with PMI algorithm; outperformed standard supervised contrastive learning with TV and distance-based kernel; superior dimensionality reduction results with bounded f-divergence compared to SNE.

Conclusion: Divergence and similarity kernel choices significantly impact representation learning optimization, and moving beyond KL divergence can lead to improved performance across various tasks.

Abstract: The Information Contrastive (I-Con) framework revealed that over 23 representation learning methods implicitly minimize KL divergence between data and learned distributions that encode similarities between data points. However, a KL-based loss may be misaligned with the true objective, and properties of KL divergence such as asymmetry and unboundedness may create optimization challenges. We present Beyond I-Con, a framework that enables systematic discovery of novel loss functions by exploring alternative statistical divergences and similarity kernels. Key findings: (1) on unsupervised clustering of DINO-ViT embeddings, we achieve state-of-the-art results by modifying the PMI algorithm to use total variation (TV) distance; (2) on supervised contrastive learning, we outperform the standard approach by using TV and a distance-based similarity kernel instead of KL and an angular kernel; (3) on dimensionality reduction, we achieve superior qualitative results and better performance on downstream tasks than SNE by replacing KL with a bounded f-divergence. Our results highlight the importance of considering divergence and similarity kernel choices in representation learning optimization.

[287] VARMA-Enhanced Transformer for Time Series Forecasting

Jiajun Song, Xiaoou Liu

Main category: cs.LG

TL;DR: VARMAformer combines cross-attention transformer efficiency with classical VARMA statistical modeling to improve time series forecasting by capturing both global dependencies and local temporal patterns.

DetailsMotivation: Current transformer-based time series models remove self-attention for efficiency but may overlook fine-grained local temporal dependencies that classical statistical models like VARMA effectively capture.

Method: Proposes VARMAformer with two innovations: 1) VARMA-inspired Feature Extractor (VFE) that models autoregressive and moving-average patterns at patch level, and 2) VARMA-Enhanced Attention mechanism with temporal gate for context-aware queries.
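
A rough sketch of a VARMA-inspired feature extractor on patch sequences, assuming the VFE amounts to a learnable AR term over the last q values plus an MA term over the residuals of the AR fit; the layer sizes, causal-conv parameterization, and fusion step are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VARMAFeatures(nn.Module):
    def __init__(self, order=3):
        super().__init__()
        self.order = order
        # Causal 1D convs acting as learnable AR / MA coefficient filters.
        self.ar = nn.Conv1d(1, 1, kernel_size=order, bias=False)
        self.ma = nn.Conv1d(1, 1, kernel_size=order, bias=False)

    def forward(self, x):                      # x: (batch, seq_len)
        xp = F.pad(x.unsqueeze(1), (self.order, 0))
        ar_part = self.ar(xp)[..., :-1]        # predicts x_t from x_{t-q..t-1}
        resid = x.unsqueeze(1) - ar_part       # innovations (residuals)
        rp = F.pad(resid, (self.order, 0))
        ma_part = self.ma(rp)[..., :-1]        # MA term over past residuals
        return (ar_part + ma_part).squeeze(1)

feat = VARMAFeatures(order=3)
print(feat(torch.randn(2, 24)).shape)          # torch.Size([2, 24])
```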

Result: Extensive experiments on benchmark datasets show VARMAformer consistently outperforms existing state-of-the-art methods in time series forecasting.

Conclusion: Integrating classical statistical insights into modern deep learning frameworks provides significant benefits for time series forecasting, validating the synergy between traditional statistical methods and contemporary neural architectures.

Abstract: Transformer-based models have significantly advanced time series forecasting. Recent work, like the Cross-Attention-only Time Series transformer (CATS), shows that removing self-attention can make the model more accurate and efficient. However, these streamlined architectures may overlook the fine-grained, local temporal dependencies effectively captured by classical statistical models like Vector AutoRegressive Moving Average model (VARMA). To address this gap, we propose VARMAformer, a novel architecture that synergizes the efficiency of a cross-attention-only framework with the principles of classical time series analysis. Our model introduces two key innovations: (1) a dedicated VARMA-inspired Feature Extractor (VFE) that explicitly models autoregressive (AR) and moving-average (MA) patterns at the patch level, and (2) a VARMA-Enhanced Attention (VE-atten) mechanism that employs a temporal gate to make queries more context-aware. By fusing these classical insights into a modern backbone, VARMAformer captures both global, long-range dependencies and local, statistical structures. Through extensive experiments on widely-used benchmark datasets, we demonstrate that our model consistently outperforms existing state-of-the-art methods. Our work validates the significant benefit of integrating classical statistical insights into modern deep learning frameworks for time series forecasting.

[288] Graph Unlearning: Efficient Node Removal in Graph Neural Networks

Faqian Guan, Tianqing Zhu, Zhoutian Wang, Wei Ren, Wanlei Zhou

Main category: cs.LG

TL;DR: Three novel node unlearning methods for GNNs that effectively leverage graph topology to efficiently remove sensitive training nodes while maintaining model performance.

DetailsMotivation: Address privacy concerns in GNNs by developing efficient methods to remove sensitive training data without compromising graph topology or imposing structural restrictions on GNNs.

Method: Proposed three methods: Class-based Label Replacement, Topology-guided Neighbor Mean Posterior Probability, and Class-consistent Neighbor Node Filtering that utilize graph topological features for effective node unlearning.
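
A simplified sketch of the neighbor-mean-posterior idea as described: each node slated for removal gets a new soft target equal to the mean posterior of its retained neighbors, after which the GNN would be fine-tuned on these targets. The GNN and the fine-tuning loop are omitted, and the uniform fallback for isolated nodes is an assumption.

```python
import numpy as np

def neighbor_mean_targets(posteriors, adj, forget_nodes):
    # posteriors: (N, C) model outputs; adj: (N, N) 0/1 adjacency matrix.
    targets = posteriors.copy()
    for v in forget_nodes:
        neighbors = [u for u in np.flatnonzero(adj[v]) if u not in forget_nodes]
        if neighbors:  # replace with mean posterior of retained neighbors
            targets[v] = posteriors[neighbors].mean(axis=0)
        else:          # fall back to a uniform target if isolated (assumption)
            targets[v] = np.full(posteriors.shape[1], 1.0 / posteriors.shape[1])
    return targets

rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(4), size=6)          # 6 nodes, 4 classes
adj = (rng.random((6, 6)) > 0.6).astype(int)
np.fill_diagonal(adj, 0)
print(neighbor_mean_targets(post, adj, forget_nodes={2, 5}).round(2))
```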

Result: Experimental results on three benchmark datasets show superior performance in model utility, unlearning utility, and efficiency compared to state-of-the-art methods.

Conclusion: The proposed methods successfully protect sensitive node privacy in GNNs while maintaining efficiency, contributing to enhanced privacy and security in graph neural networks.

Abstract: With increasing concerns about privacy attacks and potential sensitive information leakage, researchers have actively explored methods to efficiently remove sensitive training data and reduce privacy risks in graph neural network (GNN) models. Node unlearning has emerged as a promising technique for protecting the privacy of sensitive nodes by efficiently removing specific training node information from GNN models. However, existing node unlearning methods either impose restrictions on the GNN structure or do not effectively utilize the graph topology for node unlearning. Some methods even compromise the graph’s topology, making it challenging to achieve a satisfactory performance-complexity trade-off. To address these issues and achieve efficient unlearning for training node removal in GNNs, we propose three novel node unlearning methods: Class-based Label Replacement, Topology-guided Neighbor Mean Posterior Probability, and Class-consistent Neighbor Node Filtering. Among these methods, Topology-guided Neighbor Mean Posterior Probability and Class-consistent Neighbor Node Filtering effectively leverage the topological features of the graph, resulting in more effective node unlearning. To validate the superiority of our proposed methods in node unlearning, we conducted experiments on three benchmark datasets. The evaluation criteria included model utility, unlearning utility, and unlearning efficiency. The experimental results demonstrate the utility and efficiency of the proposed methods and illustrate their superiority compared to state-of-the-art node unlearning methods. Overall, the proposed methods efficiently remove sensitive training nodes and protect the privacy information of sensitive nodes in GNNs. The findings contribute to enhancing the privacy and security of GNN models and provide valuable insights into the field of node unlearning.

[289] Revolution or Hype? Seeking the Limits of Large Models in Hardware Design

Qiang Xu, Leon Stok, Rolf Drechsler, Xi Wang, Grace Li Zhang, Igor L. Markov

Main category: cs.LG

TL;DR: Critical examination of LLMs and LCMs in electronic design automation, assessing whether they represent genuine revolution or inflated expectations in circuit design.

DetailsMotivation: Recent breakthroughs in LLMs and LCMs have generated both excitement and skepticism in EDA community, requiring systematic evaluation of their practical capabilities and limitations.

Method: Synthesizes perspectives from leading academic and industry experts, critically examining reliability, scalability, and interpretability of large AI models in hardware design.

Result: Authoritative overview framing the debate on whether AI models can meaningfully outperform or complement traditional EDA methods.

Conclusion: Provides fresh insights into one of today’s most contentious technology trends, serving as foundational text for ICCAD 2025 panel discussion.

Abstract: Recent breakthroughs in Large Language Models (LLMs) and Large Circuit Models (LCMs) have sparked excitement across the electronic design automation (EDA) community, promising a revolution in circuit design and optimization. Yet, this excitement is met with significant skepticism: Are these AI models a genuine revolution in circuit design, or a temporary wave of inflated expectations? This paper serves as a foundational text for the corresponding ICCAD 2025 panel, bringing together perspectives from leading experts in academia and industry. It critically examines the practical capabilities, fundamental limitations, and future prospects of large AI models in hardware design. The paper synthesizes the core arguments surrounding reliability, scalability, and interpretability, framing the debate on whether these models can meaningfully outperform or complement traditional EDA methods. The result is an authoritative overview offering fresh insights into one of today’s most contentious and impactful technology trends.

[290] Scaling Law for Large-Scale Pre-Training Using Chaotic Time Series and Predictability in Financial Time Series

Yuki Takemoto

Main category: cs.LG

TL;DR: Proposes generating artificial chaotic time series to model financial data, uses large-scale pre-training with 10B samples, and achieves significant performance improvements in Bitcoin trading predictions through zero-shot forecasting.

DetailsMotivation: Financial time series forecasting is challenging due to chaotic properties. Existing foundation models need diverse training data, and synthetic chaotic time series generation can help create realistic financial data simulations.

Method: Generate artificial chaotic time series, apply resampling techniques to simulate financial data, conduct large-scale pre-training with 10B samples per case, and perform zero-shot prediction on actual Bitcoin trade data without retraining.
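
A toy sketch of this pipeline: generate a chaotic series (a logistic map here; the paper's family of chaotic generators may differ) and resample it at coarser intervals to extend the effective prediction horizon.

```python
import numpy as np

def logistic_map(n, r=3.9, x0=0.3):
    x = np.empty(n)
    x[0] = x0
    for t in range(1, n):
        x[t] = r * x[t - 1] * (1 - x[t - 1])   # chaotic regime for r near 4
    return x

series = logistic_map(10_000)
for interval in (1, 5, 20):                     # resampling intervals
    resampled = series[::interval]
    print(interval, resampled.shape, resampled[:3].round(3))
```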

Result: Significant performance improvements over autocorrelation models in trading strategy profitability. Observed scaling law phenomenon where predictive performance improves with exponentially increasing training samples for extended horizons.

Conclusion: Large-scale training with synthetic chaotic data enables effective financial forecasting. The observed scaling law suggests potential for near-future event prediction with substantial computational resources, warranting further verification across diverse chaotic models.

Abstract: Time series forecasting plays a critical role in decision-making processes across diverse fields including meteorology, traffic, electricity, economics, finance, and so on. Especially, predicting returns on financial instruments is a challenging problem. Some researchers have proposed time series foundation models applicable to various forecasting tasks. Simultaneously, based on the recognition that real-world time series exhibit chaotic properties, methods have been developed to artificially generate synthetic chaotic time series, construct diverse datasets and train models. In this study, we propose a methodology for modeling financial time series by generating artificial chaotic time series and applying resampling techniques to simulate financial time series data, which we then use as training samples. Increasing the resampling interval to extend predictive horizons, we conducted large-scale pre-training using 10 billion training samples for each case. We subsequently created test datasets for multiple timeframes using actual Bitcoin trade data and performed zero-shot prediction without re-training the pre-trained model. The results of evaluating the profitability of a simple trading strategy based on these predictions demonstrated significant performance improvements over autocorrelation models. During the large-scale pre-training process, we observed a scaling-law-like phenomenon: a given level of predictive performance can be maintained at extended predictive horizons for chaotic time series by increasing the number of training samples exponentially. If this scaling law proves robust and holds true across various chaotic models, it suggests the potential to predict near-future events by investing substantial computational resources. Future research should focus on further large-scale training and verifying the applicability of this scaling law to diverse chaotic models.

[291] A transformer-BiGRU-based framework with data augmentation and confident learning for network intrusion detection

Jiale Zhang, Pengfei He, Fei Li, Kewei Li, Yan Wang, Lan Huang, Ruochi Zhang, Fengfeng Zhou

Main category: cs.LG

TL;DR: TrailGate is a novel network intrusion detection framework combining Transformer and BiGRU architectures with feature selection and data augmentation to detect both common attacks and emerging threats.

DetailsMotivation: Address the limitations of conventional machine learning methods in handling complex patterns, data scarcity, and class imbalance in network intrusion datasets.

Method: Integrated machine learning and deep learning techniques using Transformer and BiGRU architectures with advanced feature selection strategies and data augmentation techniques.

Result: Developed TrailGate framework that excels at detecting common attack types and has unique ability to swiftly identify and neutralize emerging threats.

Conclusion: The fusion of machine learning and deep learning techniques effectively bridges the gap in network intrusion detection, providing robust and precise solutions for both known and emerging threats.

Abstract: In today’s fast-paced digital communication, the surge in network traffic data and frequency demands robust and precise network intrusion solutions. Conventional machine learning methods struggle with complex patterns within the vast network intrusion datasets, which suffer from data scarcity and class imbalance. As a result, we have integrated machine learning and deep learning techniques within the network intrusion detection system to bridge this gap. This study has developed TrailGate, a novel framework that combines machine learning and deep learning techniques. By integrating Transformer and Bidirectional Gated Recurrent Unit (BiGRU) architectures with advanced feature selection strategies and supplemented by data augmentation techniques, TrailGate identifies common attack types and excels at detecting and mitigating emerging threats. This algorithmic fusion excels at detecting common and well-understood attack types and has the unique ability to swiftly identify and neutralize emerging threats that stem from existing paradigms.

[292] Ontology-Aligned Embeddings for Data-Driven Labour Market Analytics

Heinke Hihn, Dennis A. V. Dittrich, Carl Jeske, Cayo Costa Sobral, Helio Pais, Timm Lochmann

Main category: cs.LG

TL;DR: Using Sentence-BERT embeddings to create a semantic search system that maps German job titles to standardized occupational classifications without manual ontology maintenance.

DetailsMotivation: Traditional hand-crafted ontologies for occupational data alignment are computationally expensive and require expert maintenance, creating bottlenecks in labor market analytics.

Method: Fine-tuned Sentence-BERT model on German job title data to learn semantic embeddings, creating a similarity graph structure for efficient approximate nearest-neighbor search to classify job titles as semantic search.
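
A minimal sketch of the classification-as-semantic-search setup: embed ontology-labeled job titles with a Sentence-BERT model and classify a free-form query by nearest neighbors. The model name, the tiny label set, and the single-neighbor rule are placeholders; the paper fine-tunes its own model and searches a much larger similarity graph.

```python
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

# Any multilingual Sentence-BERT checkpoint would do here (assumption).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

titles = ["Softwareentwickler", "Krankenpfleger", "Bauleiter"]
labels = ["IT occupations", "Health occupations", "Construction occupations"]
index = NearestNeighbors(n_neighbors=1, metric="cosine")
index.fit(model.encode(titles))

# Classify an unseen job title by its nearest labeled neighbor.
dist, idx = index.kneighbors(model.encode(["Backend Programmierer"]))
print(labels[idx[0][0]], f"(cosine distance {dist[0][0]:.3f})")
```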

Result: Developed an embedding-based alignment process that links free-form German job titles to established occupational classifications (German Klassifikation der Berufe and ISCO) with greater flexibility than traditional approaches.

Conclusion: The approach offers a scalable alternative to manual ontology curation, enabling flexible classification and potential extension to multiple ontologies and multilingual titles.

Abstract: The limited ability to reason across occupational data from different sources is a long-standing bottleneck for data-driven labour market analytics. Previous research has relied on hand-crafted ontologies that allow such reasoning but are computationally expensive and require careful maintenance by human experts. The rise of language processing machine learning models offers a scalable alternative by learning shared semantic spaces that bridge diverse occupational vocabularies without extensive human curation. We present an embedding-based alignment process that links any free-form German job title to two established ontologies - the German Klassifikation der Berufe and the International Standard Classification of Education. Using publicly available data from the German Federal Employment Agency, we construct a dataset to fine-tune a Sentence-BERT model to learn the structure imposed by the ontologies. The enriched pairs (job title, embedding) define a similarity graph structure that we can use for efficient approximate nearest-neighbour search, allowing us to frame the classification process as a semantic search problem. This allows for greater flexibility, e.g., adding more classes. We discuss design decisions, open challenges, and outline ongoing work on extending the graph with other ontologies and multilingual titles.

Artem Lensky, Yiding Qiu

Main category: cs.LG

TL;DR: Deep learning models for EEG blink detection using 1-5 frontal electrodes, with CNN-RNN hybrid achieving best accuracy (up to 95.8% in healthy subjects, 75.8% in Parkinson’s patients).

DetailsMotivation: Blinks in EEG are often treated as artifacts but contain valuable physiological information about cognitive load, attention, and neurological disorders. Accurate blink detection is crucial for monitoring these markers.

Method: Evaluated various deep learning architectures (RNNs, CNNs, TCNs, transformers, hybrids) for sequence-to-sequence blink segmentation on raw EEG signals. Used UCSD dataset with 31 subjects (15 healthy, 16 Parkinson’s patients) with minimal pre-processing.

Result: CNN-RNN hybrid model consistently outperformed others, achieving 93.8-95.8% accuracy in healthy cohort and 73.8-75.8% in Parkinson’s patients across 1-5 electrode configurations.

Conclusion: Deep learning models, particularly CNN-RNN hybrids, can effectively detect blinks in EEG signals with minimal pre-processing, demonstrating robustness even in patients with tremor conditions like Parkinson’s disease.

Abstract: Blinks in electroencephalography (EEG) are often treated as unwanted artifacts. However, recent studies have demonstrated that blink rate and its variability are important physiological markers to monitor cognitive load, attention, and potential neurological disorders. This paper addresses the critical task of accurate blink detection by evaluating various deep learning models for segmenting EEG signals into involuntary blinks and non-blinks. We present a pipeline for blink detection using 1, 3, or 5 frontal EEG electrodes. The problem is formulated as a sequence-to-sequence task and tested on various deep learning architectures including standard recurrent neural networks, convolutional neural networks (both standard and depth-wise), temporal convolutional networks (TCN), transformer-based models, and hybrid architectures. The models were trained on raw EEG signals with minimal pre-processing. Training and testing were carried out on a public dataset of 31 subjects collected at UCSD. This dataset consisted of 15 healthy participants and 16 patients with Parkinson’s disease, allowing us to verify the model’s robustness to tremor. Of all the models, the CNN-RNN hybrid consistently outperformed the others and achieved the best blink detection accuracy of 93.8%, 95.4% and 95.8% with 1, 3, and 5 channels in the healthy cohort and correspondingly 73.8%, 75.4% and 75.8% in patients with PD. The paper compares neural networks for the task of segmenting EEG recordings into involuntary blinks and non-blinks, allowing blink rate and other statistics to be computed.

[294] On the Normalization of Confusion Matrices: Methods and Geometric Interpretations

Johan Erbani, Pierre-Edouard Portier, Elod Egyed-Zsigmond, Sonia Ben Mokhtar, Diana Nurbakova

Main category: cs.LG

TL;DR: The paper introduces bistochastic normalization using Iterative Proportional Fitting to disentangle class similarity from distribution bias in confusion matrices, enabling better model diagnosis and improvement.

DetailsMotivation: Confusion matrices mix class similarity and distribution bias, making it hard to understand individual contributions to classification errors in heterogeneous settings.

Method: Proposed bistochastic normalization using Iterative Proportional Fitting, which generalizes row and column normalization to recover underlying class similarity structure.
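
A short sketch of bistochastic normalization via Iterative Proportional Fitting (Sinkhorn-style alternating row/column scaling), which removes the marginal, distribution-bias effects from a confusion matrix and leaves its class-similarity structure. The example matrix and tolerances are illustrative.

```python
import numpy as np

def ipf_bistochastic(C, n_iter=1000, tol=1e-9):
    M = C.astype(float) + 1e-12              # strictly positive for convergence
    for _ in range(n_iter):
        M /= M.sum(axis=1, keepdims=True)    # normalize rows
        M /= M.sum(axis=0, keepdims=True)    # normalize columns
        if (np.abs(M.sum(axis=1) - 1).max() < tol and
                np.abs(M.sum(axis=0) - 1).max() < tol):
            break
    return M

conf = np.array([[50, 10, 0],
                 [8, 30, 2],
                 [0, 4, 6]])                 # skewed class counts
B = ipf_bistochastic(conf)
print(B.round(3))
print(B.sum(axis=0).round(3), B.sum(axis=1).round(3))  # both ~1
```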

Result: The method successfully disentangles error sources, provides more accurate model diagnosis, and shows geometric interpretation in the model’s internal class representation space.

Conclusion: Bistochastic normalization offers deeper insights into classifier behavior by separating class similarity from distribution bias, supporting more targeted model improvements.

Abstract: The confusion matrix is a standard tool for evaluating classifiers by providing insights into class-level errors. In heterogeneous settings, its values are shaped by two main factors: class similarity – how easily the model confuses two classes – and distribution bias, arising from skewed distributions in the training and test sets. However, confusion matrix values reflect a mix of both factors, making it difficult to disentangle their individual contributions. To address this, we introduce bistochastic normalization using Iterative Proportional Fitting, a generalization of row and column normalization. Unlike standard normalizations, this method recovers the underlying structure of class similarity. By disentangling error sources, it enables more accurate diagnosis of model behavior and supports more targeted improvements. We also show a correspondence between confusion matrix normalizations and the model’s internal class representations. Both standard and bistochastic normalizations can be interpreted geometrically in this space, offering a deeper understanding of what normalization reveals about a classifier.

[295] Neuro-Spectral Architectures for Causal Physics-Informed Networks

Arthur Bizzi, Leonardo M. Moreira, Márcio Marques, Leonardo Mendonça, Christian Júnior de Oliveira, Vitor Balestro, Lucas dos Santos Fernandez, Daniel Yukimura, Pavel Petrov, João M. Pereira, Tiago Novello, Lucas Nissenbaum

Main category: cs.LG

TL;DR: NeuSA introduces spectral-inspired PINNs that overcome convergence issues in complex PDEs by combining spectral basis projection with Neural ODEs, achieving better performance than standard MLP-based PINNs.

DetailsMotivation: Standard MLP-based Physics-Informed Neural Networks (PINNs) often fail to converge for complex initial-value problems, violating causality and suffering from spectral bias towards low-frequency components.

Method: NeuSA learns a projection of PDEs onto a spectral basis for finite-dimensional representation, integrates with Neural ODEs to enforce causality, and uses classical method-based initialization to start training near target solutions.
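
A classical spectral-method sketch of the idea NeuSA builds on: project the state onto a spectral (here Fourier) basis so a linear PDE becomes one ODE per mode, then integrate the coefficients in time. NeuSA learns this projection and integrates with a Neural ODE instead; the 1D heat equation and exact mode-wise solve below are just the textbook analogue.

```python
import numpy as np

N, nu, T = 128, 0.01, 1.0
x = np.linspace(0, 2 * np.pi, N, endpoint=False)
u0 = np.sin(x) + 0.5 * np.sin(4 * x)                 # initial condition

k = np.fft.fftfreq(N, d=2 * np.pi / N) * 2 * np.pi   # angular wavenumbers
u_hat = np.fft.fft(u0)                                # spectral projection
# For u_t = nu * u_xx, each Fourier mode obeys du_k/dt = -nu * k^2 * u_k.
u_hat_T = u_hat * np.exp(-nu * k**2 * T)              # exact mode-wise solve
u_T = np.fft.ifft(u_hat_T).real

print(np.abs(u_T).max())   # amplitudes decay; higher modes decay faster
```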

Result: Validated on linear and nonlinear wave equations, NeuSA demonstrates faster convergence, improved temporal consistency, and superior predictive accuracy compared to other architectures.

Conclusion: NeuSA effectively addresses spectral bias and causality issues in PINNs through spectral-inspired architecture and Neural ODE integration, providing a robust framework for solving complex PDEs.

Abstract: Physics-Informed Neural Networks (PINNs) have emerged as a powerful neural framework for solving partial differential equations (PDEs). However, standard MLP-based PINNs often fail to converge when dealing with complex initial-value problems, leading to solutions that violate causality and suffer from a spectral bias towards low-frequency components. To address these issues, we introduce NeuSA (Neuro-Spectral Architectures), a novel class of PINNs inspired by classical spectral methods, designed to solve linear and nonlinear PDEs with variable coefficients. NeuSA learns a projection of the underlying PDE onto a spectral basis, leading to a finite-dimensional representation of the dynamics which is then integrated with an adapted Neural ODE (NODE). This allows us to overcome spectral bias, by leveraging the high-frequency components enabled by the spectral representation; to enforce causality, by inheriting the causal structure of NODEs, and to start training near the target solution, by means of an initialization scheme based on classical methods. We validate NeuSA on canonical benchmarks for linear and nonlinear wave equations, demonstrating strong performance as compared to other architectures, with faster convergence, improved temporal consistency and superior predictive accuracy. Code and pretrained models will be released.

[296] Topology-Aware Graph Reinforcement Learning for Dynamic Routing in Cloud Networks

Yuxi Wang, Heyao Liu, Guanzi Yao, Nyutian Long, Yue Kang

Main category: cs.LG

TL;DR: Topology-aware graph RL for routing policy optimization in cloud networks using SASE for state encoding and PAGU for adaptive graph updates.

DetailsMotivation: Address decision instability and insufficient structural awareness in dynamic cloud network topologies for better routing optimization.

Method: Combines Structure-Aware State Encoding (SASE) module with multi-layer graph convolution and Policy-Adaptive Graph Update (PAGU) mechanism for adaptive structural updates.

Result: Outperforms existing graph RL models on GEANT dataset across throughput, latency control, and link balance metrics.

Conclusion: Proposed approach enables efficient and robust routing in dynamic cloud networks through improved structural awareness and adaptive updates.

Abstract: This paper proposes a topology-aware graph reinforcement learning approach to address the routing policy optimization problem in cloud server environments. The method builds a unified framework for state representation and structural evolution by integrating a Structure-Aware State Encoding (SASE) module and a Policy-Adaptive Graph Update (PAGU) mechanism. It aims to tackle the challenges of decision instability and insufficient structural awareness under dynamic topologies. The SASE module models node states through multi-layer graph convolution and structural positional embeddings, capturing high-order dependencies in the communication topology and enhancing the expressiveness of state representations. The PAGU module adjusts the graph structure based on policy behavior shifts and reward feedback, enabling adaptive structural updates in dynamic environments. Experiments are conducted on the real-world GEANT topology dataset, where the model is systematically evaluated against several representative baselines in terms of throughput, latency control, and link balance. Additional experiments, including hyperparameter sensitivity, graph sparsity perturbation, and node feature dimensionality variation, further explore the impact of structure modeling and graph updates on model stability and decision quality. Results show that the proposed method outperforms existing graph reinforcement learning models across multiple performance metrics, achieving efficient and robust routing in dynamic and complex cloud networks.

[297] Adapt in the Wild: Test-Time Entropy Minimization with Sharpness and Feature Regularization

Shuaicheng Niu, Guohao Chen, Deyu Chen, Yifan Zhang, Jiaxiang Wu, Zhiquan Wen, Yaofo Chen, Peilin Zhao, Chunyan Miao, Mingkui Tan

Main category: cs.LG

TL;DR: SAR and SAR^2 methods stabilize test-time adaptation by addressing batch norm issues, gradient noise, and representation collapse through sharpness-aware entropy minimization and regularization techniques.

DetailsMotivation: Test-time adaptation (TTA) methods often fail in real-world scenarios with mixed distribution shifts, small batch sizes, and imbalanced label distributions, primarily due to instability issues with batch norm layers and model collapse into trivial solutions.

Method: Proposes SAR (sharpness-aware and reliable entropy minimization) to remove noisy samples with large gradients and encourage flat minima, and SAR^2 with additional redundancy and inequity regularizers to prevent representation collapse by reducing feature correlations and maximizing prediction entropy.
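
A bare-bones sketch of the reliable-entropy-minimization part: compute prediction entropy per test sample, drop high-entropy (noisy, large-gradient) samples, and adapt only on the rest. The sharpness-aware step, the SAR^2 regularizers, and the entropy margin value are omitted or assumed here.

```python
import torch

def reliable_entropy_loss(logits, e_margin=0.4):
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    # Normalize by log(C) so the margin is comparable across class counts.
    entropy = entropy / torch.log(torch.tensor(float(logits.size(-1))))
    mask = entropy < e_margin             # keep only confident samples
    if mask.sum() == 0:
        return logits.sum() * 0.0         # nothing reliable in this batch
    return entropy[mask].mean()

model = torch.nn.Linear(16, 10)           # stand-in for a deployed model
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(32, 16)                   # unlabeled test batch
loss = reliable_entropy_loss(model(x))
loss.backward()
opt.step()
print(loss.item())
```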

Result: The methods demonstrate promising performance with improved stability over prior TTA approaches and maintain computational efficiency across various challenging test scenarios including mixed shifts, small batches, and imbalanced distributions.

Conclusion: SAR and SAR^2 effectively address the key stability issues in TTA by tackling batch norm limitations, gradient noise, and representation collapse, making them suitable for real-world deployment under wild test conditions.

Abstract: Test-time adaptation (TTA) may fail to improve or even harm the model performance when test data have: 1) mixed distribution shifts, 2) small batch sizes, 3) online imbalanced label distribution shifts. This is often a key obstacle preventing existing TTA methods from being deployed in the real world. In this paper, we investigate the reasons for this instability and find that the batch norm layer is a crucial factor hindering TTA stability. Conversely, TTA can perform more stably with batch-agnostic norm layers, i.e., group or layer norm. However, we observe that TTA with group and layer norms does not always succeed and still suffers many failure cases, i.e., the model collapses into trivial solutions by assigning the same class label for all samples. By digging into this, we find that, during the collapse process: 1) the model gradients often undergo an initial explosion followed by rapid degradation, suggesting that certain noisy test samples with large gradients may disrupt adaptation; and 2) the model representations tend to exhibit high correlations and classification bias. To address this, we first propose a sharpness-aware and reliable entropy minimization method, called SAR, for stabilizing TTA from two aspects: 1) remove partial noisy samples with large gradients, 2) encourage model weights to go to a flat minimum so that the model is robust to the remaining noisy samples. Based on SAR, we further introduce SAR^2 to prevent representation collapse with two regularizers: 1) a redundancy regularizer to reduce inter-dimensional correlations among centroid-invariant features; and 2) an inequity regularizer to maximize the prediction entropy of a prototype centroid, thereby penalizing biased representations toward any specific class. Promising results demonstrate that our methods perform more stably than prior methods and are computationally efficient under the above wild test scenarios.

[298] Directed Evolution of Proteins via Bayesian Optimization in Embedding Space

Matouš Soldát, Jiří Kléma

Main category: cs.LG

TL;DR: A novel method combining Bayesian optimization with protein language model embeddings for machine-learning-assisted directed evolution, showing improved performance over state-of-the-art methods with fewer screenings.

DetailsMotivation: Directed evolution is time-consuming and expensive due to iterative protein variant screening. Machine learning can help select more informative variants to reduce screening costs while improving protein function.

Method: Combines Bayesian optimization with sequence embeddings from a pre-trained protein language model to select promising protein variants for screening.
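
A compact sketch of one Bayesian-optimization round over protein-language-model embeddings: fit a GP on already-screened variants, score the candidate pool with expected improvement, and pick the next batch to screen. The embeddings here are random placeholders for PLM outputs, and the kernel and batch size are assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_screened = rng.normal(size=(20, 32))    # embeddings of screened variants
y_screened = rng.normal(size=20)          # measured fitness values
X_pool = rng.normal(size=(200, 32))       # embeddings of candidate variants

gp = GaussianProcessRegressor(kernel=RBF(length_scale=5.0), normalize_y=True)
gp.fit(X_screened, y_screened)

mu, sigma = gp.predict(X_pool, return_std=True)
best = y_screened.max()
z = (mu - best) / np.maximum(sigma, 1e-9)
ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement

next_batch = np.argsort(-ei)[:8]          # variants to screen next
print(next_batch)
```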

Result: The new representation significantly improves Bayesian optimization performance, yielding better results with the same number of screenings, and outperforms state-of-the-art regression-based methods.

Conclusion: Protein language model embeddings enhance machine-learning-assisted directed evolution, making the process more efficient and effective compared to existing approaches.

Abstract: Directed evolution is an iterative laboratory process for designing proteins with improved function by synthesizing new protein variants and evaluating their desired property with expensive and time-consuming biochemical screening. Machine learning methods can help select informative or promising variants for screening, increasing their quality and reducing the amount of necessary screening. In this paper, we present a novel method for machine-learning-assisted directed evolution of proteins which combines Bayesian optimization with an informative representation of protein variants extracted from a pre-trained protein language model. We demonstrate that the new representation based on sequence embeddings significantly improves the performance of Bayesian optimization, yielding better results with the same total number of screenings. At the same time, our method outperforms state-of-the-art machine-learning-assisted directed evolution methods with a regression objective.

[299] Depth-Aware Initialization for Stable and Efficient Neural Network Training

Vijay Pandey

Main category: cs.LG

TL;DR: A novel neural network initialization method that incorporates layer depth information and increases variance progressively from first to last layer, outperforming existing schemes.

DetailsMotivation: Existing initialization methods either ignore depth information or don't properly handle variance propagation in deep networks, where theoretical unit variance assumptions fail for deeper architectures.

Method: Proposed a flexible variance-increasing initialization scheme that incorporates both individual layer depth and total network depth information to better scale variance from first to last layer.
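
A toy sketch of a depth-aware initialization: scale each layer's weight standard deviation by a factor that grows with its normalized depth, so activation variance increases from the first layer to the last. The growth schedule below is illustrative; the paper studies how to choose such a schedule flexibly.

```python
import torch
import torch.nn as nn

def depth_aware_init(model, growth=1.1):
    linear_layers = [m for m in model.modules() if isinstance(m, nn.Linear)]
    L = len(linear_layers)
    for l, layer in enumerate(linear_layers, start=1):
        fan_in = layer.weight.size(1)
        # He-style std, scaled up with normalized depth l / L (assumption).
        std = (2.0 / fan_in) ** 0.5 * growth ** (l / L)
        nn.init.normal_(layer.weight, mean=0.0, std=std)
        nn.init.zeros_(layer.bias)

net = nn.Sequential(nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 10))
depth_aware_init(net)
print([round(m.weight.std().item(), 4) for m in net if isinstance(m, nn.Linear)])
```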

Result: Experimental results show the proposed method performs better than existing initialization schemes like Glorot, He, orthogonal matrix, and random walk methods.

Conclusion: Incorporating layer-specific depth information and progressively increasing variance through the network leads to superior initialization performance, especially for deep neural networks.

Abstract: In the past few years, various initialization schemes have been proposed, including Glorot initialization, He initialization, orthogonal-matrix initialization, and the random walk method. Some of these methods stress keeping unit variance of activations and gradients as they propagate through the network layers. Some are independent of depth information, while others consider the total network depth for better initialization. In this paper, a comprehensive study is presented in which the depth of each layer, as well as the total network depth, is incorporated into the initialization scheme. We also show that, for deeper networks, the theoretical assumption of unit variance throughout the network does not perform well; instead, the variance needs to increase from the first-layer activations to the last-layer activations. We propose a novel, flexible way to increase the variance of the network that incorporates the depth of each layer. Experiments show that the proposed method performs better than existing initialization schemes.

[300] MultiSurv: A Multimodal Deep Survival Framework for Prostate and Bladder Cancer

Noorul Wahab, Ethar Alzaid, Jiaqi Lv, Adam Shephard, Shan E Ahmed Raza

Main category: cs.LG

TL;DR: MultiSurv is a multimodal deep survival model that integrates clinical, MRI, RNA-seq, and pathology data using DeepHit with cross-attention to predict cancer recurrence in prostate and bladder cancer patients.

DetailsMotivation: Accurate time-to-event prediction is crucial in oncology for treatment planning and patient management, but existing methods may not effectively integrate heterogeneous multimodal patient data.

Method: MultiSurv uses DeepHit with a projection layer and inter-modality cross-attention to integrate clinical, MRI, RNA-seq, and whole-slide pathology features for survival prediction.

Result: Achieved C-index of 0.843 (prostate) and 0.662 (bladder) on cross-validation, and 0.818 (prostate) and 0.457 (bladder) on development sets in the CHIMERA Grand Challenge.

Conclusion: Multimodal integration with deep survival learning provides promising personalized risk stratification for prostate and bladder cancer, with broad applicability to survival prediction tasks involving heterogeneous biomedical data.

Abstract: Accurate prediction of time-to-event outcomes is a central challenge in oncology, with significant implications for treatment planning and patient management. In this work, we present MultiSurv, a multimodal deep survival model utilising DeepHit with a projection layer and inter-modality cross-attention, which integrates heterogeneous patient data, including clinical, MRI, RNA-seq and whole-slide pathology features. The model is designed to capture complementary prognostic signals across modalities and estimate individualised time-to-biochemical recurrence in prostate cancer and time-to-cancer recurrence in bladder cancer. Our approach was evaluated in the context of the CHIMERA Grand Challenge, across two of the three provided tasks. For Task 1 (prostate cancer bio-chemical recurrence prediction), the proposed framework achieved a concordance index (C-index) of 0.843 on 5-folds cross-validation and 0.818 on CHIMERA development set, demonstrating robust discriminatory ability. For Task 3 (bladder cancer recurrence prediction), the model obtained a C-index of 0.662 on 5-folds cross-validation and 0.457 on development set, highlighting its adaptability and potential for clinical translation. These results suggest that leveraging multimodal integration with deep survival learning provides a promising pathway toward personalised risk stratification in prostate and bladder cancer. Beyond the challenge setting, our framework is broadly applicable to survival prediction tasks involving heterogeneous biomedical data.

[301] Recurrent State Encoders for Efficient Neural Combinatorial Optimization

Tim Dernedde, Daniela Thyssens, Lars Schmidt-Thieme

Main category: cs.LG

TL;DR: Proposes recurrent encoder for neural combinatorial optimization that reuses computation from previous steps, achieving equivalent/better performance with 3x fewer layers and lower latency.

DetailsMotivation: Typical construction methods in NCO make small state changes between steps, suggesting computation reuse could improve efficiency.

Method: Train recurrent encoder that computes state embeddings based on current state and previous step embeddings, enabling computation reuse.
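
A minimal sketch of the recurrent-encoder idea: instead of re-encoding the full state from scratch at every construction step, update the previous step's node embeddings with a lightweight recurrent cell conditioned on the new state. The dimensions, the GRU-cell update rule, and the construction loop are purely illustrative.

```python
import torch
import torch.nn as nn

class RecurrentStateEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)   # reuses prior embeddings as state

    def forward(self, step_features, prev_embeddings):
        # step_features: (n_nodes, dim) features of the updated state;
        # prev_embeddings: (n_nodes, dim) embeddings from the previous step.
        return self.cell(step_features, prev_embeddings)

enc = RecurrentStateEncoder(dim=64)
emb = torch.zeros(50, 64)                  # initial embeddings for 50 nodes
for step in range(5):                      # construction loop (simplified)
    feats = torch.randn(50, 64)            # stand-in for current state features
    emb = enc(feats, emb)                  # cheap incremental re-encoding
print(emb.shape)
```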

Result: Recurrent encoder achieves equivalent or better performance than non-recurrent encoder with 3x fewer layers, significantly reducing latency.

Conclusion: Recurrent approach is practical and efficient for NCO problems, demonstrated on TSP, CVRP, and OP with large neighborhood search integration.

Abstract: The primary paradigm in Neural Combinatorial Optimization (NCO) are construction methods, where a neural network is trained to sequentially add one solution component at a time until a complete solution is constructed. We observe that the typical changes to the state between two steps are small, since usually only the node that gets added to the solution is removed from the state. An efficient model should be able to reuse computation done in prior steps. To that end, we propose to train a recurrent encoder that computes the state embeddings not only based on the state but also the embeddings of the step before. We show that the recurrent encoder can achieve equivalent or better performance than a non-recurrent encoder even if it consists of $3\times$ fewer layers, thus significantly improving on latency. We demonstrate our findings on three different problems: the Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP), and the Orienteering Problem (OP) and integrate the models into a large neighborhood search algorithm, to showcase the practical relevance of our findings.

[302] HyPINO: Multi-Physics Neural Operators via HyperPINNs and the Method of Manufactured Solutions

Rafael Bischof, Michal Piovarči, Michael A. Kraus, Siddhartha Mishra, Bernd Bickel

Main category: cs.LG

TL;DR: HyPINO is a multi-physics neural operator that achieves zero-shot generalization across parametric PDEs without fine-tuning, using a Swin Transformer hypernetwork with mixed supervision from analytical solutions and physics-informed objectives.

DetailsMotivation: To create a neural operator that can generalize across various PDE types without task-specific fine-tuning, handling different geometries, boundary conditions, and source terms while improving accuracy and reducing computational costs.

Method: Combines Swin Transformer-based hypernetwork with mixed supervision: labeled data from Method of Manufactured Solutions and unlabeled samples optimized via physics-informed objectives. Includes iterative refinement procedure that creates ensemble PINNs to progressively reduce error.

Result: Outperforms U-Nets, Poseidon, and PINO on seven benchmark problems. Achieves over 100x gain in average L2 loss with iterative refinement. PINNs initialized by HyPINO converge faster and to lower error than random initialization or Reptile-meta-learned approaches.

Conclusion: HyPINO demonstrates strong zero-shot generalization capabilities and provides a scalable foundation for solving complex, nonlinear, high-dimensional PDE problems with improved accuracy and reduced computational requirements.

Abstract: We present HyPINO, a multi-physics neural operator designed for zero-shot generalization across a broad class of parametric PDEs without requiring task-specific fine-tuning. Our approach combines a Swin Transformer-based hypernetwork with mixed supervision: (i) labeled data from analytical solutions generated via the Method of Manufactured Solutions (MMS), and (ii) unlabeled samples optimized using physics-informed objectives. The model maps PDE parametrizations to target Physics-Informed Neural Networks (PINNs) and can handle linear elliptic, hyperbolic, and parabolic equations in two dimensions with varying source terms, geometries, and mixed Dirichlet/Neumann boundary conditions, including interior boundaries. HyPINO achieves strong zero-shot accuracy on seven benchmark problems from PINN literature, outperforming U-Nets, Poseidon, and Physics-Informed Neural Operators (PINO). Further, we introduce an iterative refinement procedure that compares the physics of the generated PINN to the requested PDE and uses the discrepancy to generate a “delta” PINN. Summing their contributions and repeating this process forms an ensemble whose combined solution progressively reduces the error on six benchmarks and achieves over 100x gain in average $L_2$ loss in the best case, while retaining forward-only inference. Additionally, we evaluate the fine-tuning behavior of PINNs initialized by HyPINO and show that they converge faster and to lower final error than both randomly initialized and Reptile-meta-learned PINNs on five benchmarks, performing on par on the remaining two. Our results highlight the potential of this scalable approach as a foundation for extending neural operators toward solving increasingly complex, nonlinear, and high-dimensional PDE problems with significantly improved accuracy and reduced computational cost.

[303] Should We Always Train Models on Fine-Grained Classes?

Davide Pirovano, Federico Milanesio, Michele Caselle, Piero Fariselli, Matteo Osella

Main category: cs.LG

TL;DR: Fine-grained label training doesn’t always improve classification accuracy - effectiveness depends on data geometry, label hierarchy relations, dataset size, and model capacity.

DetailsMotivation: To investigate whether training with fine-grained hierarchical labels universally improves classification performance, as empirical evidence suggests but lacks systematic understanding.

Method: Used both real and synthetic datasets to analyze how geometric structure of data, label hierarchy relations, dataset size, and model capacity affect fine-grained training effectiveness.

Result: Fine-grained training does not universally improve accuracy; its effectiveness depends critically on data geometry and hierarchy relations, with dataset size and model capacity being significant factors.

Conclusion: The benefit of fine-grained label training is conditional rather than universal, requiring careful consideration of data structure, hierarchy relationships, and model/dataset characteristics for optimal performance.

Abstract: In classification problems, models must predict a class label based on the input data features. However, class labels are organized hierarchically in many datasets. While a classification task is often defined at a specific level of this hierarchy, training can utilize a finer granularity of labels. Empirical evidence suggests that such fine-grained training can enhance performance. In this work, we investigate the generality of this observation and explore its underlying causes using both real and synthetic datasets. We show that training on fine-grained labels does not universally improve classification accuracy. Instead, the effectiveness of this strategy depends critically on the geometric structure of the data and its relations with the label hierarchy. Additionally, factors such as dataset size and model capacity significantly influence whether fine-grained labels provide a performance benefit.

[304] On the Learnability of Distribution Classes with Adaptive Adversaries

Tosca Lechner, Alex Bie, Gautam Kamath

Main category: cs.LG

TL;DR: Learnability with adaptive adversaries (who can intercept and manipulate samples) is strictly stronger than with oblivious adversaries (who can only modify the underlying distribution).

DetailsMotivation: To understand how learnability changes when adversaries can actively intercept and manipulate individual samples in real-time, rather than just modifying the underlying data distribution.

Method: Formulated a general notion of learnability with respect to adaptive adversaries considering adversary budget constraints, and compared it against learnability with oblivious adversaries.

Result: Learnability with respect to additive adaptive adversaries is strictly stronger than learnability with respect to additive oblivious adversaries.

Conclusion: The ability to learn in the presence of adaptive adversaries who can actively manipulate samples requires more robust learning algorithms than those needed for oblivious adversaries who only modify the data distribution.

Abstract: We consider the question of learnability of distribution classes in the presence of adaptive adversaries – that is, adversaries capable of intercepting the samples requested by a learner and applying manipulations with full knowledge of the samples before passing them on to the learner. This stands in contrast to oblivious adversaries, who can only modify the underlying distribution the samples come from but not their i.i.d. nature. We formulate a general notion of learnability with respect to adaptive adversaries, taking into account the budget of the adversary. We show that learnability with respect to additive adaptive adversaries is a strictly stronger condition than learnability with respect to additive oblivious adversaries.

[305] Foundational Models and Federated Learning: Survey, Taxonomy, Challenges and Practical Insights

Cosmin-Andrei Hatfaludi, Alex Serban

Main category: cs.LG

TL;DR: Survey paper on integrating federated learning with foundational models, presenting taxonomy and technical comparison of 42 methods, with healthcare case study.

DetailsMotivation: To address the need for collaborative training of complex foundational models using distributed private data without sharing it, bridging the gap between federated learning and foundational models.

Method: Conducted literature survey of 4,200+ articles, narrowed to 250+ reviewed articles, identified 42 unique methods, developed taxonomy based on development life-cycle stages, and performed technical comparison.

Result: Created comprehensive taxonomy and technical comparison framework covering federated learning, self-supervised learning, fine-tuning, distillation, and transfer learning methods with complexity, efficiency, and scalability metrics.

Conclusion: Provides practical guidelines for implementing federated foundational models, particularly in healthcare, and offers insights for adoption and evolution of these integrated approaches.

Abstract: Federated learning has the potential to unlock siloed data and distributed resources by enabling collaborative model training without sharing private data. As more complex foundational models gain widespread use, the need to expand training resources and integrate privately owned data grows as well. In this article, we explore the intersection of federated learning and foundational models, aiming to identify, categorize, and characterize technical methods that integrate the two paradigms. As a unified survey is currently unavailable, we present a literature survey structured around a novel taxonomy that follows the development life-cycle stages, along with a technical comparison of available methods. Additionally, we provide practical insights and guidelines for implementing and evolving these methods, with a specific focus on the healthcare domain as a case study, where the potential impact of federated learning and foundational models is considered significant. Our survey covers multiple intersecting topics, including but not limited to federated learning, self-supervised learning, fine-tuning, distillation, and transfer learning. Initially, we retrieved and reviewed a set of over 4,200 articles. This collection was narrowed to more than 250 thoroughly reviewed articles through inclusion criteria, featuring 42 unique methods. The methods were used to construct the taxonomy and enabled their comparison based on complexity, efficiency, and scalability. We present these results as a self-contained overview that not only summarizes the state of the field but also provides insights into the practical aspects of adopting, evolving, and integrating foundational models with federated learning.

[306] KVCompose: Efficient Structured KV Cache Compression with Composite Tokens

Dmitry Akulov, Mohamed Sana, Antonio De Domenico, Tareq Si Salem, Nicola Piovesan, Fadhel Ayed

Main category: cs.LG

TL;DR: A KV cache compression method using attention-guided composite tokens that reduces memory usage while maintaining accuracy and compatibility with standard inference engines.

DetailsMotivation: KV cache size grows linearly with context length and model depth, creating a major bottleneck for long-context inference in LLMs. Existing compression methods have limitations like rigid heuristics, disrupted tensor layouts, or require specialized kernels.

Method: Uses attention-guided, layer-adaptive composite tokens that aggregate attention scores to estimate token importance, select head-specific tokens independently, and align them into composite tokens that respect uniform cache structure. Includes global allocation mechanism to adapt retention budgets across layers.

Result: Achieves significant memory reduction while preserving accuracy, consistently outperforming prior structured and semi-structured methods.

Conclusion: Provides a practical and scalable solution for efficient long-context LLM deployment that remains fully compatible with standard inference pipelines.

Abstract: Large language models (LLMs) rely on key-value (KV) caches for efficient autoregressive decoding; however, cache size grows linearly with context length and model depth, becoming a major bottleneck in long-context inference. Prior KV cache compression methods either enforce rigid heuristics, disrupt tensor layouts with per-attention-head variability, or require specialized compute kernels. We propose a simple, yet effective, KV cache compression framework based on attention-guided, layer-adaptive composite tokens. Our method aggregates attention scores to estimate token importance, selects head-specific tokens independently, and aligns them into composite tokens that respect the uniform cache structure required by existing inference engines. A global allocation mechanism further adapts retention budgets across layers, assigning more capacity to layers with informative tokens. This approach achieves significant memory reduction while preserving accuracy, consistently outperforming prior structured and semi-structured methods. Crucially, our approach remains fully compatible with standard inference pipelines, offering a practical and scalable solution for efficient long-context LLM deployment.
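
A minimal sketch of the composite-token idea: each head keeps its own top-k cached tokens by aggregated attention mass, and the per-head selections are packed into shared slots so the cache stays uniformly shaped. The function and variable names below are illustrative, not the paper's.

```python
import torch

def compress_kv(keys, values, attn_mass, keep):
    """Attention-guided KV compression sketch (illustrative, not the paper's
    exact algorithm). keys/values: [heads, seq, dim]; attn_mass: [heads, seq],
    e.g. the total attention each cached token has received."""
    # Per-head top-`keep` tokens; re-sort indices to preserve positional order.
    idx = attn_mass.topk(keep, dim=-1).indices.sort(dim=-1).values  # [heads, keep]
    gather = idx.unsqueeze(-1).expand(-1, -1, keys.shape[-1])
    # Slot s may hold a different original token in each head (a "composite
    # token"), but every head exposes exactly `keep` slots, so the cache keeps
    # the uniform [heads, keep, dim] layout standard inference engines expect.
    return keys.gather(1, gather), values.gather(1, gather)

heads, seq, dim = 8, 1024, 64
k, v = torch.randn(heads, seq, dim), torch.randn(heads, seq, dim)
mass = torch.rand(heads, seq)
ck, cv = compress_kv(k, v, mass, keep=256)
print(ck.shape)  # torch.Size([8, 256, 64]) -- a 4x smaller cache
```

The paper's global allocation mechanism would additionally vary `keep` per layer, giving more capacity to layers whose tokens are more informative.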

[307] Accuracy-Constrained CNN Pruning for Efficient and Reliable EEG-Based Seizure Detection

Mounvik K, N Harshit

Main category: cs.LG

TL;DR: Lightweight 1D CNN with structured pruning achieves efficient EEG seizure detection with 92.87% accuracy and improved F1 score while reducing model size by 50%.

DetailsMotivation: Address challenges of deep learning models (size and compute requirements) for real-time seizure detection in resource-limited environments.

Method: One-dimensional CNN with structured pruning (removing 50% of convolutional kernels based on importance) and mild early stopping to prevent overfitting.

Result: After pruning, model maintained predictive capabilities with 92.87% accuracy (vs 92.78% baseline) and improved macro-F1 score to 0.8707 (vs 0.8686 baseline) while reducing weights and memory by 50%.

Conclusion: Structured pruning removes redundancy, improves generalization, and combined with mild early stopping provides an efficient and reliable approach for seizure detection in resource-constrained settings.

Abstract: Deep learning models, especially convolutional neural networks (CNNs), have shown considerable promise for biomedical signals such as EEG-based seizure detection. However, these models come with challenges, primarily due to their size and compute requirements, in settings that demand real-time detection or offer limited resources. In this study, we present a lightweight one-dimensional CNN model with structured pruning to improve efficiency and reliability. The model was trained with mild early stopping to address possible overfitting, achieving an accuracy of 92.78% and a macro-F1 score of 0.8686. Structured pruning of the baseline CNN involved removing 50% of the convolutional kernels based on their importance to model predictions. Surprisingly, after cutting the weights and memory by 50%, the pruned network still maintained its predictive capabilities, while modestly increasing accuracy to 92.87% and improving the macro-F1 score to 0.8707. Overall, we present a convincing case that structured pruning removes redundancy, improves generalization, and, in combination with mild early stopping, offers a promising way to improve the efficiency and reliability of seizure detection in resource-limited settings.
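
As a concrete illustration of the pruning step, here is a minimal PyTorch sketch that removes half of a Conv1d layer's kernels. L1 norm stands in for the paper's importance score, which is described only as "importance to model predictions."

```python
import torch
import torch.nn as nn

def prune_conv1d_half(conv: nn.Conv1d) -> nn.Conv1d:
    """Structured pruning sketch: drop 50% of a Conv1d layer's kernels, ranked
    here by L1 norm (one common importance proxy; the paper's criterion may
    differ). Note: the next layer's in_channels must be shrunk to match."""
    w = conv.weight.data                        # [out_ch, in_ch, kernel]
    importance = w.abs().sum(dim=(1, 2))        # one score per output kernel
    keep = importance.argsort(descending=True)[: conv.out_channels // 2]
    keep, _ = keep.sort()                       # keep original channel order
    pruned = nn.Conv1d(conv.in_channels, len(keep), conv.kernel_size[0],
                       stride=conv.stride[0], padding=conv.padding[0],
                       bias=conv.bias is not None)
    pruned.weight.data = w[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()
    return pruned

layer = nn.Conv1d(in_channels=1, out_channels=32, kernel_size=7, padding=3)
smaller = prune_conv1d_half(layer)              # 16 kernels remain
print(smaller.weight.shape)                     # torch.Size([16, 1, 7])
```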

[308] Shift Before You Learn: Enabling Low-Rank Representations in Reinforcement Learning

Bastien Dubail, Stefan Stojanovic, Alexandre Proutière

Main category: cs.LG

TL;DR: This paper challenges the common low-rank assumption in RL by showing that the successor measure itself is not low-rank, but a low-rank structure emerges in the shifted successor measure after bypassing initial transitions.

DetailsMotivation: Many modern RL algorithms assume low-rank structure in successor measures, but this work questions this assumption and explores where low-rank structure actually emerges.

Method: The authors analyze the spectral recoverability of matrices, derive Type II Poincaré inequalities for Markov chains, and provide finite-sample guarantees for estimating low-rank approximations of the shifted successor measure.

Result: The analysis shows that approximation and estimation errors are governed by spectral recoverability, and that the required shift depends on decay of high-order singular values and local mixing properties.

Conclusion: Shifting the successor measure leads to effective low-rank approximation and improved performance in goal-conditioned RL, with practical shift amounts being typically small.

Abstract: Low-rank structure is a common implicit assumption in many modern reinforcement learning (RL) algorithms. For instance, reward-free and goal-conditioned RL methods often presume that the successor measure admits a low-rank representation. In this work, we challenge this assumption by first remarking that the successor measure itself is not low-rank. Instead, we demonstrate that a low-rank structure naturally emerges in the shifted successor measure, which captures the system dynamics after bypassing a few initial transitions. We provide finite-sample performance guarantees for the entry-wise estimation of a low-rank approximation of the shifted successor measure from sampled entries. Our analysis reveals that both the approximation and estimation errors are primarily governed by the so-called spectral recoverability of the corresponding matrix. To bound this parameter, we derive a new class of functional inequalities for Markov chains that we call Type II Poincaré inequalities and from which we can quantify the amount of shift needed for effective low-rank approximation and estimation. This analysis shows in particular that the required shift depends on decay of the high-order singular values of the shifted successor measure and is hence typically small in practice. Additionally, we establish a connection between the necessary shift and the local mixing properties of the underlying dynamical system, which provides a natural way of selecting the shift. Finally, we validate our theoretical findings with experiments, and demonstrate that shifting the successor measure indeed leads to improved performance in goal-conditioned RL.
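
For orientation, one standard definition of the successor measure and its $k$-shifted variant (the notation here is illustrative and may differ from the paper's):

$$
M^{\pi}(s, B) = (1-\gamma) \sum_{t \ge 0} \gamma^{t}\, \Pr\left(s_t \in B \mid s_0 = s\right), \qquad M^{\pi}_{k}(s, B) = (1-\gamma) \sum_{t \ge k} \gamma^{t}\, \Pr\left(s_t \in B \mid s_0 = s\right).
$$

Skipping the first $k$ transitions discards the identity-like short-horizon terms (e.g., the $t=0$ term) that keep the unshifted matrix effectively full-rank, which is what lets the low-rank structure emerge; per the abstract, the required $k$ is typically small in practice.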

[309] RapidGNN: Energy and Communication-Efficient Distributed Training on Large-Scale Graph Neural Networks

Arefin Niam, Tevfik Kosar, M S Q Zulkar Nine

Main category: cs.LG

TL;DR: RapidGNN is a distributed GNN training framework that uses deterministic sampling-based scheduling to improve training efficiency by reducing communication overhead and enabling better cache management.

DetailsMotivation: Distributed training of GNNs on large-scale graphs faces significant challenges due to high computational loads and communication overhead from traditional sampling-based approaches.

Method: Presents RapidGNN framework with deterministic sampling-based scheduling for efficient cache construction and prefetching of remote features.

Result: Improves training throughput by 2.46x-3.00x, reduces remote feature fetches by 9.70x-15.39x, shows near-linear scalability, and achieves 44% CPU and 32% GPU energy efficiency improvements.

Conclusion: RapidGNN effectively addresses communication bottlenecks in distributed GNN training through intelligent scheduling and caching strategies, delivering significant performance and efficiency gains.

Abstract: Graph Neural Networks (GNNs) have become popular across a diverse set of tasks in exploring structural relationships between entities. However, due to the highly connected structure of the datasets, distributed training of GNNs on large-scale graphs poses significant challenges. Traditional sampling-based approaches mitigate the computational loads, yet the communication overhead remains a challenge. This paper presents RapidGNN, a distributed GNN training framework with deterministic sampling-based scheduling to enable efficient cache construction and prefetching of remote features. Evaluation on benchmark graph datasets demonstrates RapidGNN's effectiveness across different scales and topologies. RapidGNN improves end-to-end training throughput by 2.46x to 3.00x on average over baseline methods across the benchmark datasets, while cutting remote feature fetches by 9.70x to 15.39x. RapidGNN further demonstrates near-linear scalability as the number of computing units increases. Furthermore, it improves energy efficiency over the baseline methods by 44% on CPU and 32% on GPU.

[310] An Efficient Subspace Algorithm for Federated Learning on Heterogeneous Data

Jiaojiao Zhang, Yuqi Xu, Kun Yuan

Main category: cs.LG

TL;DR: FedSub is an efficient subspace algorithm for federated learning that reduces communication, computation, and memory costs while mitigating client drift through low-dimensional subspace projections and dual variables.

DetailsMotivation: Address key challenges in applying federated learning to large-scale deep neural networks, particularly client drift due to data heterogeneity and high costs of communication, computation, and memory.

Method: Utilizes subspace projection to constrain local updates within low-dimensional subspaces, reducing costs. Incorporates low-dimensional dual variables to mitigate client drift. Provides convergence analysis examining step size and projection matrices.

Result: Experimental results demonstrate efficiency of the proposed FedSub approach.

Conclusion: FedSub effectively addresses federated learning challenges by combining subspace projection with dual variables, offering reduced costs and improved performance on heterogeneous data.

Abstract: This work addresses the key challenges of applying federated learning to large-scale deep neural networks, particularly the issue of client drift due to data heterogeneity across clients and the high costs of communication, computation, and memory. We propose FedSub, an efficient subspace algorithm for federated learning on heterogeneous data. Specifically, FedSub utilizes subspace projection to constrain each client's local updates to low-dimensional subspaces, thereby reducing communication, computation, and memory costs. Additionally, it incorporates low-dimensional dual variables to mitigate client drift. We provide a convergence analysis that reveals the impact of key factors such as step size and subspace projection matrices on convergence. Experimental results demonstrate its efficiency.
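
A minimal sketch of the subspace-projection idea (illustrative; the paper's algorithm also includes dual variables and a specific choice of projection matrices):

```python
import torch

def local_subspace_step(w, grad, P, lr):
    """Subspace-constrained local update sketch: P is a d x r matrix with
    orthonormal columns, so the update never leaves the r-dimensional
    subspace span(P), and the client only needs to communicate the
    r-dimensional coordinates instead of the full d-dimensional model."""
    coords = P.T @ grad              # project the gradient into the subspace
    w_new = w - lr * (P @ coords)    # step stays inside span(P)
    return w_new, coords             # send `coords` (r floats) to the server

d, r = 10_000, 64
P, _ = torch.linalg.qr(torch.randn(d, r))   # orthonormal basis of the subspace
w, g = torch.randn(d), torch.randn(d)
w_new, msg = local_subspace_step(w, g, P, lr=0.1)
print(msg.shape)                             # torch.Size([64]) -> low comm. cost
```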

[311] Deep Learning-Enhanced for Amine Emission Monitoring and Performance Analysis in Industrial Carbon Capture Plants

Lokendra Poudel, David Tincher, Duy-Nhat Phan, Rahul Bhowmik

Main category: cs.LG

TL;DR: Deep learning models for amine emission forecasting and carbon capture system performance monitoring using LSTM architectures, achieving >99% accuracy and enabling operational optimization through causal impact analysis.

DetailsMotivation: To develop data-driven deep learning approaches for real-time monitoring and optimization of amine-based carbon capture systems, addressing the need for predictive tools that can handle both steady-state and dynamic operational conditions while reducing environmental impacts.

Method: Developed four DL architectures (Basic LSTM, Stacked LSTM, Bi-directional LSTM, Convolutional LSTM) using operational data from CESAR1 solvent campaign. Models predict amine emissions (AMP and Piperazine) and four key performance parameters. Conducted causal impact analysis by perturbing eight input variables ±20% to assess operational impacts.

Result: Models achieved high predictive accuracy exceeding 99%, effectively tracking both steady trends and abrupt fluctuations. Causal analysis revealed that adjusting specific parameters like lean solvent temperature and water wash conditions can significantly reduce amine emissions and enhance system performance.

Conclusion: ML framework serves as both predictive tool and decision support system, enabling real-time monitoring, scenario testing, and operational optimization for carbon capture systems. Represents a step toward intelligent, data-driven control strategies that enhance efficiency, stability, and sustainability of carbon capture technologies.

Abstract: We present data-driven deep learning models for forecasting and monitoring amine emissions and key performance parameters in amine-based post-combustion carbon capture systems. Using operational data from the CESAR1 solvent campaign at Technology Center Mongstad, four DL architectures, namely Basic Long Short-Term Memory (LSTM), Stacked LSTM, Bi-directional LSTM, and Convolutional LSTM, were developed to capture time-dependent process behavior. For emission prediction, models were designed for 2-amino-2-methyl-1-propanol (AMP) and Piperazine emissions measured via FTIR and IMR-MS methods. System performance models target four critical parameters: CO$_2$ product flow, absorber outlet temperature, depleted flue gas outlet temperature, and RFCC stripper bottom temperature. These models achieved high predictive accuracy exceeding 99% and effectively tracked both steady trends and abrupt fluctuations. Additionally, we conducted a causal impact analysis to evaluate how operational variables influence emissions and system performance. Eight input variables were systematically perturbed within $\pm$20% of nominal values to simulate deviations and assess their impact. This analysis revealed that adjusting specific operational parameters, such as lean solvent temperature and water wash conditions, can significantly reduce amine emissions and enhance system performance. This study highlights ML not only as a predictive tool but also as a decision support system for optimizing carbon capture operations under steady-state and dynamic conditions. By enabling real-time monitoring, scenario testing, and operational optimization, the developed ML framework offers a practical pathway for mitigating environmental impacts. This work represents a step toward intelligent, data-driven control strategies that enhance the efficiency, stability, and sustainability of carbon capture and storage technologies.

[312] A Kolmogorov-Arnold Network for Interpretable Cyberattack Detection in AGC Systems

Jehad Jilan, Niranjana Naveen Nambiar, Ahmad Mohammad Saber, Alok Paranjape, Amr Youssef, Deepa Kundur

Main category: cs.LG

TL;DR: Proposes interpretable Kolmogorov-Arnold Networks (KAN) for detecting False Data Injection Attacks in Automatic Generation Control systems, achieving ~96% detection rates with symbolic formula extraction for enhanced interpretability.

DetailsMotivation: Traditional AGC systems are vulnerable to stealthy cyberattacks like FDIAs that evade conventional detection methods, and existing approaches lack interpretability.

Method: Uses Kolmogorov-Arnold Networks (KAN) trained offline to learn nonlinear relationships in AGC measurements, with symbolic equation extraction capability for interpretable detection.

Result: Achieves 95.97% detection rate for initial model and 95.9% for symbolic formula, both with low false alarm rates.

Conclusion: KAN provides an accurate and interpretable solution for FDIA detection in AGC systems, enhancing cybersecurity while maintaining transparency through symbolic formulas.

Abstract: Automatic Generation Control (AGC) is essential for power grid stability but remains vulnerable to stealthy cyberattacks, such as False Data Injection Attacks (FDIAs), which can disturb the system's stability while evading traditional detection methods. Unlike previous works that relied on black-box approaches, this work proposes Kolmogorov-Arnold Networks (KAN) as an interpretable and accurate method for FDIA detection in AGC systems, considering the system nonlinearities. KAN models include a method for extracting symbolic equations, and are thus able to provide more interpretability than the majority of machine learning models. The proposed KAN is trained offline to learn the complex nonlinear relationships between the AGC measurements under different operating scenarios. After training, symbolic formulas that describe the trained model's behavior can be extracted and leveraged, greatly enhancing interpretability. Our findings confirm that the proposed KAN model achieves FDIA detection rates of up to 95.97% and 95.9% for the initial model and the symbolic formula, respectively, with a low false alarm rate, offering a reliable approach to enhancing AGC cybersecurity.
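
For intuition about what makes KANs interpretable, here is a toy KAN-style layer: each input-output edge applies its own learnable univariate function. Real KANs use B-spline bases; a cubic polynomial basis keeps this sketch short and is an assumption, not the paper's design.

```python
import torch
import torch.nn as nn

class PolyKANLayer(nn.Module):
    """Toy KAN-style layer: each edge i -> o applies a learnable cubic
    polynomial phi_oi(x), and each output sums its incoming edges.
    (Real KANs use spline bases; polynomials here are a simplification.)"""
    def __init__(self, in_dim, out_dim, degree=3):
        super().__init__()
        # coeffs[o, i, k] = coefficient of x^(k+1) on edge i -> o
        self.coeffs = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, degree))

    def forward(self, x):                      # x: (batch, in_dim)
        degree = self.coeffs.shape[-1]
        powers = torch.stack([x ** (k + 1) for k in range(degree)], dim=-1)
        # sum over input dims i and polynomial degrees d
        return torch.einsum('bid,oid->bo', powers, self.coeffs)

layer = PolyKANLayer(in_dim=4, out_dim=2)
print(layer(torch.randn(8, 4)).shape)          # torch.Size([8, 2])
```

After training, the per-edge polynomials can be read off directly from `coeffs`, which is the analogue of the paper's symbolic-formula extraction step.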

[313] Greener Deep Reinforcement Learning: Analysis of Energy and Carbon Efficiency Across Atari Benchmarks

Jason Gardner, Ayan Dutta, Swapnoneel Roy, O. Patrick Kreidl, Ladislau Boloni

Main category: cs.LG

TL;DR: Systematic benchmarking of 7 DRL algorithms shows significant variations in energy efficiency, with some algorithms consuming up to 24% less energy and emitting 68% less CO2 while maintaining comparable performance.

DetailsMotivation: Growing concerns about environmental and economic costs of deep reinforcement learning training, as energy requirements, emissions, and monetary costs remain largely unexplored despite extensive study of algorithmic learning performance.

Method: Benchmarked 7 state-of-the-art DRL algorithms (DQN, TRPO, A2C, ARS, PPO, RecurrentPPO, QR-DQN) using Stable Baselines, trained for 1M steps each on 10 Atari 2600 games with real-time power measurement to estimate energy usage, CO2 emissions, and electricity costs.

Result: Substantial variation in energy efficiency: ARS vs DQN - 24% less energy; QR-DQN vs RecurrentPPO - 68% less CO2 emissions and 68% lower monetary cost, while achieving comparable learning performance.

Conclusion: Algorithmic choices can significantly mitigate environmental and economic impact without sacrificing performance, providing actionable insights for energy-aware DRL practices and establishing foundation for sustainability considerations in future algorithmic design.

Abstract: The growing computational demands of deep reinforcement learning (DRL) have raised concerns about the environmental and economic costs of training large-scale models. While algorithmic efficiency in terms of learning performance has been extensively studied, the energy requirements, greenhouse gas emissions, and monetary costs of DRL algorithms remain largely unexplored. In this work, we present a systematic benchmarking study of the energy consumption of seven state-of-the-art DRL algorithms, namely DQN, TRPO, A2C, ARS, PPO, RecurrentPPO, and QR-DQN, implemented using Stable Baselines. Each algorithm was trained for one million steps each on ten Atari 2600 games, and power consumption was measured in real-time to estimate total energy usage, CO2-Equivalent emissions, and electricity cost based on the U.S. national average electricity price. Our results reveal substantial variation in energy efficiency and training cost across algorithms, with some achieving comparable performance while consuming up to 24% less energy (ARS vs. DQN), emitting nearly 68% less CO2, and incurring almost 68% lower monetary cost (QR-DQN vs. RecurrentPPO) than less efficient counterparts. We further analyze the trade-offs between learning performance, training time, energy use, and financial cost, highlighting cases where algorithmic choices can mitigate environmental and economic impact without sacrificing learning performance. This study provides actionable insights for developing energy-aware and cost-efficient DRL practices and establishes a foundation for incorporating sustainability considerations into future algorithmic design and evaluation.
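
A minimal sketch of what such a measurement loop can look like, assuming Stable Baselines3 for training and the CodeCarbon package for energy and emissions estimation; the paper's exact instrumentation is not specified here, and CartPole stands in for the Atari setup to keep the example light.

```python
from codecarbon import EmissionsTracker      # pip install codecarbon
from stable_baselines3 import DQN            # pip install stable-baselines3

# Wrap the training run in an emissions tracker; CodeCarbon samples power
# draw in real time and converts it to energy and CO2-equivalent estimates.
tracker = EmissionsTracker(project_name="drl-energy-benchmark")
tracker.start()
model = DQN("MlpPolicy", "CartPole-v1", verbose=0)
model.learn(total_timesteps=10_000)          # paper: 1M steps per Atari game
kg_co2 = tracker.stop()                      # estimated kg of CO2-equivalent
print(f"estimated emissions: {kg_co2:.6f} kg CO2eq")
```

Repeating this per algorithm and per game, then multiplying energy by a local electricity price, yields the kind of cost comparison the paper reports.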

[314] SpikingBrain Technical Report: Spiking Brain-inspired Large Models

Yuqi Pan, Yupeng Feng, Jinghao Zhuang, Siyu Ding, Zehao Liu, Bohan Sun, Yuhong Chou, Han Xu, Xuerui Qiu, Anlin Deng, Anjie Hu, Peng Zhou, Man Yao, Jibin Wu, Jian Yang, Guoliang Sun, Bo Xu, Guoqi Li

Main category: cs.LG

TL;DR: SpikingBrain introduces brain-inspired models with linear/hybrid-linear attention and spiking neurons for efficient long-context training and inference on non-NVIDIA platforms, achieving comparable performance to Transformers with significantly better efficiency.

DetailsMotivation: Address efficiency bottlenecks in Transformer-based LLMs where training computation scales quadratically with sequence length and inference memory grows linearly, limiting long-context processing on non-NVIDIA platforms.

Method: Develops SpikingBrain models with: (1) linear/hybrid-linear attention architectures with adaptive spiking neurons, (2) efficient conversion-based training pipeline and spike coding framework, (3) customized training frameworks and operator libraries for MetaX hardware.

Result: Created SpikingBrain-7B and SpikingBrain-76B models that achieve performance comparable to open-source Transformers using only ~150B tokens. Achieved 100x speedup in Time to First Token for 4M-token sequences, 23.4% Model FLOPs Utilization, 69.15% sparsity for low-power operation, and stable training on hundreds of MetaX GPUs.

Conclusion: Demonstrates the potential of brain-inspired mechanisms for efficient and scalable large model design, enabling large-scale LLM development on non-NVIDIA platforms with significantly improved long-sequence efficiency and constant memory inference.

Abstract: Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models significantly improve long-sequence training efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains over 100x speedup in Time to First Token for 4M-token sequences. Training remains stable for weeks on hundreds of MetaX C550 GPUs, with the 7B model reaching a Model FLOPs Utilization of 23.4 percent. The proposed spiking scheme achieves 69.15 percent sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.
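
For intuition, here is the generic linear-attention recurrence that gives constant-memory decoding; SpikingBrain's actual architecture layers adaptive spiking neurons and hybrid attention on top of this idea, so treat this only as a sketch of the underlying mechanism.

```python
import torch

def linear_attention(q, k, v):
    """Constant-memory causal linear attention: replacing softmax(QK^T)V with
    phi(Q)(phi(K)^T V) lets decoding carry a fixed-size state S instead of a
    KV cache that grows with sequence length."""
    phi = lambda x: torch.nn.functional.elu(x) + 1          # positive feature map
    q, k = phi(q), phi(k)
    S = torch.zeros(q.shape[-1], v.shape[-1])               # running state: d x d_v
    z = torch.zeros(q.shape[-1])                            # running normalizer
    out = []
    for t in range(q.shape[0]):                             # step-by-step decoding
        S = S + torch.outer(k[t], v[t])                     # O(1) memory per step
        z = z + k[t]
        out.append((q[t] @ S) / (q[t] @ z + 1e-6))
    return torch.stack(out)

q, k, v = torch.randn(128, 32), torch.randn(128, 32), torch.randn(128, 64)
print(linear_attention(q, k, v).shape)                      # torch.Size([128, 64])
```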

[315] Dual-Branch Convolutional Framework for Spatial and Frequency-Based Image Forgery Detection

Naman Tyagi

Main category: cs.LG

TL;DR: A dual-branch CNN framework combining spatial and frequency features for image forgery detection, achieving 77.9% accuracy on CASIA 2.0 with balanced computational complexity for practical deployment.

DetailsMotivation: Addressing the rapid increase in deepfakes and digital image forgeries by ensuring image authenticity, which is crucial for media verification, law enforcement, and digital content reliability.

Method: Proposes a dual-branch convolutional neural network that extracts features from both spatial and frequency domains, fuses them, and uses a Siamese network to generate 64-dimensional embeddings for classification.

Result: Achieves 77.9% accuracy on the CASIA 2.0 dataset, outperforming traditional statistical methods while balancing computational complexity and detection reliability.

Conclusion: Provides a practical and deployable solution for image forensic scrutiny, advancing visual forensics and addressing urgent needs in media verification and content authenticity.

Abstract: With the rapid increase in deepfakes and digital image forgeries, ensuring the authenticity of images is becoming increasingly challenging. This report introduces a forgery detection framework that combines spatial and frequency-based features for detecting forgeries. We propose a dual-branch convolutional neural network that operates on features extracted from the spatial and frequency domains. Features from both branches are fused and compared within a Siamese network, yielding 64-dimensional embeddings for classification. When benchmarked on the CASIA 2.0 dataset, our method achieves an accuracy of 77.9%, outperforming traditional statistical methods. Despite its relatively weaker performance compared to larger, more complex forgery detection pipelines, our approach balances computational complexity and detection reliability, making it ready for practical deployment. It provides a strong methodology for forensic scrutiny of digital images. In a broader sense, it advances the state of the art in visual forensics, addressing an urgent requirement in media verification, law enforcement, and digital content reliability.
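
A minimal sketch of the dual-branch idea: one CNN branch sees the image, the other sees its log-magnitude spectrum, and the fused features map to a 64-dimensional embedding. Layer sizes here are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DualBranchEmbed(nn.Module):
    """Dual-branch forgery-detection sketch: spatial branch on RGB pixels,
    spectral branch on the log-magnitude FFT, fused into a 64-dim embedding
    that a Siamese setup would compare across image pairs."""
    def __init__(self):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten())
        self.spatial, self.spectral = branch(), branch()
        self.head = nn.Linear(2 * 32 * 4 * 4, 64)            # fuse -> 64 dims

    def forward(self, img):                                   # img: (B, 3, H, W)
        freq = torch.fft.fft2(img).abs().add(1e-6).log()      # frequency input
        feats = torch.cat([self.spatial(img), self.spectral(freq)], dim=1)
        return self.head(feats)

emb = DualBranchEmbed()(torch.rand(2, 3, 128, 128))
print(emb.shape)                                              # torch.Size([2, 64])
```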

[316] Learning to accelerate distributed ADMM using graph neural networks

Henri Doerks, Paul Häusner, Daniel Hernández Escobar, Jens Sjölund

Main category: cs.LG

TL;DR: A GNN-based approach to learn adaptive hyperparameters for distributed ADMM, improving convergence speed and solution quality while preserving convergence guarantees.

DetailsMotivation: ADMM is popular for distributed optimization but suffers from slow convergence and sensitivity to hyperparameter choices, which limits its practical effectiveness.

Method: Represent ADMM iterations as message-passing in GNNs, then learn adaptive step sizes and communication weights using a GNN that predicts hyperparameters based on iterates. Train end-to-end by unrolling ADMM for fixed iterations.

Result: Numerical experiments show the learned variant consistently improves convergence speed and solution quality compared to standard ADMM.

Conclusion: The connection between ADMM and GNNs enables learning adaptive hyperparameters that significantly enhance ADMM performance while maintaining theoretical convergence properties.

Abstract: Distributed optimization is fundamental in large-scale machine learning and control applications. Among existing methods, the Alternating Direction Method of Multipliers (ADMM) has gained popularity due to its strong convergence guarantees and suitability for decentralized computation. However, ADMM often suffers from slow convergence and sensitivity to hyperparameter choices. In this work, we show that distributed ADMM iterations can be naturally represented within the message-passing framework of graph neural networks (GNNs). Building on this connection, we propose to learn adaptive step sizes and communication weights by a graph neural network that predicts the hyperparameters based on the iterates. By unrolling ADMM for a fixed number of iterations, we train the network parameters end-to-end to minimize the error of the final iterates for a given problem class, while preserving the algorithm's convergence properties. Numerical experiments demonstrate that our learned variant consistently improves convergence speed and solution quality compared to standard ADMM. The code is available at https://github.com/paulhausner/learning-distributed-admm.
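
The paper predicts hyperparameters with a GNN over the distributed problem graph; a compact way to see the unrolling idea is a centralized lasso-style ADMM with one learnable penalty per iteration, as sketched below. This is an illustration of end-to-end unrolled training, not the paper's setup.

```python
import torch

def soft_threshold(x, tau):
    return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)

def unrolled_admm(A, b, lam, log_rho, iters=10):
    """Unrolled ADMM for min 0.5||Ax-b||^2 + lam*||z||_1 s.t. x = z, with a
    learnable penalty rho_t per iteration (stand-in for the paper's
    GNN-predicted step sizes and communication weights)."""
    n = A.shape[1]
    AtA, Atb = A.T @ A, A.T @ b
    x = z = u = torch.zeros(n)
    for t in range(iters):
        rho = log_rho[t].exp()                   # parametrization keeps rho_t > 0
        x = torch.linalg.solve(AtA + rho * torch.eye(n), Atb + rho * (z - u))
        z = soft_threshold(x + u, lam / rho)
        u = u + x - z
    return z

torch.manual_seed(0)
A, b = torch.randn(40, 20), torch.randn(40)
log_rho = torch.zeros(10, requires_grad=True)    # trainable hyperparameters
opt = torch.optim.Adam([log_rho], lr=0.05)
for _ in range(50):                              # train on the final-iterate loss
    loss = torch.norm(A @ unrolled_admm(A, b, 0.1, log_rho) - b)
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))                               # should shrink as rho_t adapts
```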

[317] Deep Reinforcement Learning for Ranking Utility Tuning in the Ad Recommender System at Pinterest

Xiao Yang, Mehdi Ben Ayed, Longyu Zhao, Fan Zhou, Yuchen Shen, Abe Engle, Jinfeng Zhuang, Ling Leng, Jiajing Xu, Charles Rosenberg, Prathibha Deshikachar

Main category: cs.LG

TL;DR: DRL-PUT uses deep reinforcement learning to automatically optimize multi-objective ad ranking utility functions, improving click-through rates by 9.7% compared to manual tuning.

DetailsMotivation: Manual tuning of ad ranking utility functions is suboptimal due to unprincipled objectives, vast parameter space, lack of personalization, and inability to adapt to seasonality.

Method: Formulated as RL task: predict optimal hyperparameters for each ad request state to maximize reward. Directly learns policy model from online logs without value function estimation.

Result: Online A/B test showed 9.7% improvement in click-through rate and 7.7% improvement in long click-through rate compared to manual tuning baseline.

Conclusion: DRL-PUT framework effectively addresses multi-objective optimization challenges in ad recommender systems, providing personalized and adaptive utility tuning with significant performance gains.

Abstract: The ranking utility function in an ad recommender system, which linearly combines predictions of various business goals, plays a central role in balancing values across the platform, advertisers, and users. Traditional manual tuning, while offering simplicity and interpretability, often yields suboptimal results due to its unprincipled tuning objectives, the vast amount of parameter combinations, and its lack of personalization and adaptability to seasonality. In this work, we propose a general Deep Reinforcement Learning framework for Personalized Utility Tuning (DRL-PUT) to address the challenges of multi-objective optimization within ad recommender systems. Our key contributions include: 1) Formulating the problem as a reinforcement learning task: given the state of an ad request, we predict the optimal hyperparameters to maximize a pre-defined reward. 2) Developing an approach to directly learn an optimal policy model using online serving logs, avoiding the need to estimate a value function, which is inherently challenging due to the high variance and unbalanced distribution of immediate rewards. We evaluated DRL-PUT through an online A/B experiment in Pinterest’s ad recommender system. Compared to the baseline manual utility tuning approach, DRL-PUT improved the click-through rate by 9.7% and the long click-through rate by 7.7% on the treated segment. We conducted a detailed ablation study on the impact of different reward definitions and analyzed the personalization aspect of the learned policy model.

[318] FC-PINO: High Precision Physics-Informed Neural Operators via Fourier Continuation

Adarsh Ganeshram, Haydn Maust, Valentin Duruisseaux, Zongyi Li, Yixuan Wang, Daniel Leibovici, Oscar Bruno, Thomas Hou, Anima Anandkumar

Main category: cs.LG

TL;DR: FC-PINO extends PINO to handle non-periodic and non-smooth PDEs using Fourier continuation methods (FC-Legendre and FC-Gram) to avoid boundary errors from spectral differentiation.

DetailsMotivation: Standard PINO struggles with non-periodic PDEs due to spectral differentiation's periodicity assumption, causing boundary errors and degraded accuracy for higher-order derivatives.

Method: Integrates Fourier continuation into PINO framework to transform non-periodic signals into periodic functions on extended domains, enabling accurate derivative computations without finite differences or automatic differentiation overhead.

Result: FC-PINO substantially outperforms standard PINO on non-periodic and non-smooth PDE benchmarks, providing accurate, robust, and scalable solutions with high precision.

Conclusion: Fourier continuation is critical for extending PINO to a wider range of PDE problems requiring high-precision solutions, overcoming limitations of spectral differentiation for non-periodic cases.

Abstract: The physics-informed neural operator (PINO) is a machine learning paradigm that has demonstrated promising results for learning solutions to partial differential equations (PDEs). It leverages the Fourier Neural Operator to learn solution operators in function spaces and leverages physics losses during training to penalize deviations from known physics laws. Spectral differentiation provides an efficient way to compute derivatives for the physics losses, but it inherently assumes periodicity. When applied to non-periodic functions, this assumption of periodicity can lead to significant errors, including Gibbs phenomena near domain boundaries which degrade the accuracy of both function representations and derivative computations, especially for higher order derivatives. To overcome this limitation, we introduce the FC-PINO (Fourier-Continuation-based PINO) architecture which extends the accuracy and efficiency of PINO and spectral differentiation to non-periodic and non-smooth PDEs. In FC-PINO, we propose integrating Fourier continuation into the PINO framework, and test two different continuation approaches: FC-Legendre and FC-Gram. By transforming non-periodic signals into periodic functions on extended domains in a well-conditioned manner, Fourier continuation enables fast and accurate derivative computations. This approach avoids the discretization sensitivity of finite differences and the memory overhead of automatic differentiation. We demonstrate that standard PINO struggles to solve non-periodic and non-smooth PDEs with high precision, across challenging benchmarks. In contrast, the proposed FC-PINO provides accurate, robust, and scalable solutions, substantially outperforming PINO alternatives, and demonstrating that Fourier continuation is critical for extending PINO to a wider range of PDE problems when high-precision solutions are needed.
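
A toy NumPy demonstration of the underlying problem and the continuation idea; the mirror extension below is a crude stand-in for the paper's FC-Legendre and FC-Gram constructions, which build smooth, well-conditioned periodic extensions suitable for high-order derivatives.

```python
import numpy as np

# A plain FFT derivative treats f(x) = x^2 on [0,1) as periodic, so the
# boundary jump pollutes the derivative well into the interior. Mirroring f
# onto [0,2) removes the jump; true FC methods go further and make the
# extension smooth, not merely continuous.
n = 256
dx = 1.0 / n
x = np.arange(n) * dx
f, df_true = x**2, 2 * x

def fft_derivative(g, dx):
    k = 2j * np.pi * np.fft.fftfreq(len(g), d=dx)   # spectral wavenumbers
    return np.fft.ifft(k * np.fft.fft(g)).real

naive = fft_derivative(f, dx)
mirrored = fft_derivative(np.concatenate([f, f[::-1]]), dx)[:n]

interior = slice(n // 8, -n // 8)                   # compare away from boundaries
print(np.abs(naive - df_true)[interior].max())      # large: jump artifacts leak inward
print(np.abs(mirrored - df_true)[interior].max())   # far smaller
```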

[319] Continuum Attention for Neural Operators

Edoardo Calvello, Nikola B. Kovachki, Matthew E. Levine, Andrew M. Stuart

Main category: cs.LG

TL;DR: The paper extends transformers to function spaces, formulating attention as an operator between infinite-dimensional function spaces and proving it’s a Monte Carlo approximation of this operator. They introduce transformer neural operators with universal approximation capabilities and propose efficient patching strategies for multidimensional domains.

DetailsMotivation: Transformers have shown success in modeling nonlocal, long-range correlations in various domains. Since neural operators need to be both nonlinear and nonlocal to be universal, the authors investigate whether attention mechanisms can be effectively used in neural operator design for function space mappings.

Method: The authors formulate attention as a map between infinite-dimensional function spaces and prove it approximates practical implementations. They design transformer neural operators with universal approximation properties and introduce a function space generalization of patching strategies from computer vision to handle multidimensional domains efficiently.

Result: The paper provides the first universal approximation result for transformer neural operators using only slight modifications of existing architectures. Numerical experiments across various operator learning problems demonstrate the effectiveness of their function space attention formulations.

Conclusion: The work successfully bridges transformers and neural operators, showing that attention mechanisms can be effectively formulated and implemented in function spaces, with promising results for learning mappings between function spaces through transformer-based architectures.

Abstract: Transformers, and the attention mechanism in particular, have become ubiquitous in machine learning. Their success in modeling nonlocal, long-range correlations has led to their widespread adoption in natural language processing, computer vision, and time series problems. Neural operators, which map spaces of functions into spaces of functions, are necessarily both nonlinear and nonlocal if they are universal; it is thus natural to ask whether the attention mechanism can be used in the design of neural operators. Motivated by this, we study transformers in the function space setting. We formulate attention as a map between infinite dimensional function spaces and prove that the attention mechanism as implemented in practice is a Monte Carlo or finite difference approximation of this operator. The function space formulation allows for the design of transformer neural operators, a class of architectures designed to learn mappings between function spaces. In this paper, we state and prove the first universal approximation result for transformer neural operators, using only a slight modification of the architecture implemented in practice. The prohibitive cost of applying the attention operator to functions defined on multi-dimensional domains leads to the need for more efficient attention-based architectures. For this reason we also introduce a function space generalization of the patching strategy from computer vision, and introduce a class of associated neural operators. Numerical results, on an array of operator learning problems, demonstrate the promise of our approaches to function space formulations of attention and their use in neural operators.
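
As a schematic of the function-space view (an illustrative form; the paper's precise operator definition may differ), attention maps an input function $u$ on a domain $D$ to

$$
(\mathcal{A}u)(x) = \int_{D} \frac{\exp\left(\langle q(u)(x),\, k(u)(y)\rangle\right)}{\int_{D} \exp\left(\langle q(u)(x),\, k(u)(y')\rangle\right) dy'}\; v(u)(y)\, dy,
$$

where $q$, $k$, and $v$ play the roles of query, key, and value maps acting on functions. Evaluating the integrals at finitely many sample points recovers ordinary softmax attention, which is the Monte Carlo correspondence the paper makes precise.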

[320] Dynamic Range Reduction via Branch-and-Bound

Thore Gerlach, Nico Piatkowski

Main category: cs.LG

TL;DR: A principled Branch-and-Bound algorithm for reducing precision requirements in QUBO problems using dynamic range as complexity measure, validated on quantum annealers.

DetailsMotivation: High-performance computing for AI requires specialized hardware accelerators where precision reduction enhances speed, reduces latency, memory bandwidth, and energy consumption while increasing throughput.

Method: Developed a fully principled Branch-and-Bound algorithm that uses dynamic range as a measure of complexity to reduce precision needs in quadratic unconstrained binary optimization (QUBO) problems.

Result: Experiments demonstrated the algorithm’s effectiveness when tested on an actual quantum annealer, showing successful precision reduction for QUBO problems.

Conclusion: The proposed algorithm provides an effective approach for precision reduction in QUBO problems, benefiting hardware accelerators like quantum annealers by improving efficiency while maintaining problem-solving capabilities.

Abstract: The demand for high-performance computing in machine learning and artificial intelligence has led to the development of specialized hardware accelerators like Tensor Processing Units (TPUs), Graphics Processing Units (GPUs), and Field-Programmable Gate Arrays (FPGAs). A key strategy to enhance these accelerators is the reduction of precision in arithmetic operations, which increases processing speed and lowers latency - crucial for real-time AI applications. Precision reduction minimizes memory bandwidth requirements and energy consumption, essential for large-scale and mobile deployments, and increases throughput by enabling more parallel operations per cycle, maximizing hardware resource utilization. This strategy is equally vital for solving NP-hard quadratic unconstrained binary optimization (QUBO) problems common in machine learning, which often require high precision for accurate representation. Special hardware solvers, such as quantum annealers, benefit significantly from precision reduction. This paper introduces a fully principled Branch-and-Bound algorithm for reducing precision needs in QUBO problems by utilizing dynamic range as a measure of complexity. Experiments validate our algorithm’s effectiveness on an actual quantum annealer.
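
One common way to quantify the precision a QUBO instance needs is the ratio of its largest coefficient magnitude to its smallest nonzero one; the snippet below computes this dynamic-range proxy (the paper's exact definition and its Branch-and-Bound reduction are more involved, so treat this only as the measured quantity, not the algorithm).

```python
import numpy as np

def dynamic_range_bits(Q: np.ndarray) -> float:
    """Dynamic-range proxy for a QUBO matrix Q: bits needed to span the ratio
    between the largest and smallest nonzero coefficient magnitudes. Hardware
    solvers such as quantum annealers prefer this number to be small."""
    mags = np.abs(Q[Q != 0])
    return float(np.log2(mags.max() / mags.min()))

Q = np.array([[ 4.0, -0.01],
              [-0.01, 1024.0]])
print(dynamic_range_bits(Q))   # ~16.6 bits -> hard to represent at low precision
```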

[321] Reinforcement Learning for Aligning Large Language Models Agents with Interactive Environments: Quantifying and Mitigating Prompt Overfitting

Mohamed Salim Aissi, Clement Romac, Thomas Carta, Sylvain Lamprier, Pierre-Yves Oudeyer, Olivier Sigaud, Laure Soulier, Nicolas Thome

Main category: cs.LG

TL;DR: RL fine-tuning makes LLMs sensitive to prompt variations, degrading performance on different prompt formulations than training. Analysis shows internal representation changes, and contrastive loss improves robustness.

DetailsMotivation: Few studies have investigated how RL fine-tuning affects LLM agents' capabilities in specific environments, particularly their sensitivity to prompt formulations after training.

Method: Proposed framework to analyze LLM sensitivity to prompt formulations after RL training, examined internal representations and salient tokens, and used contrastive loss to mitigate sensitivity.

Result: Performance degrades when faced with prompt formulations different from training phase. Contrastive loss successfully improves robustness and generalization capabilities.

Conclusion: RL training creates prompt sensitivity in LLMs, but contrastive learning can effectively mitigate this issue and enhance model robustness to different prompt formulations.

Abstract: Reinforcement learning (RL) is a promising approach for aligning the knowledge of large language models (LLMs) with sequential decision-making tasks. However, few studies have thoroughly investigated the impact of RL fine-tuning in a specific environment on the capabilities of LLM agents. In this paper, we propose a novel framework to analyze the sensitivity of LLMs to prompt formulations following RL training in a textual environment. Our findings reveal that the performance of LLMs degrades when faced with prompt formulations different from those used during the RL training phase. Besides, we analyze the source of this sensitivity by examining the model's internal representations and salient tokens. Finally, we propose to use a contrastive loss to mitigate this sensitivity and improve the robustness and generalization capabilities of LLMs.
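
An InfoNCE-style sketch of the mitigation idea: hidden states of two prompt formulations of the same environment state are pulled together, while formulations of different states are pushed apart. The loss details here are illustrative, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def prompt_contrastive_loss(h_a, h_b, temperature=0.1):
    """Contrastive loss over paired prompt formulations: h_a[i] and h_b[i]
    encode the *same* state under two phrasings (positives on the diagonal);
    off-diagonal pairs encode different states (negatives)."""
    h_a, h_b = F.normalize(h_a, dim=-1), F.normalize(h_b, dim=-1)
    logits = h_a @ h_b.T / temperature            # [batch, batch] similarities
    targets = torch.arange(h_a.shape[0])          # match i-th row to i-th column
    return F.cross_entropy(logits, targets)

batch, dim = 16, 768                              # e.g. a last-layer hidden size
h_prompt_v1 = torch.randn(batch, dim)             # states under formulation 1
h_prompt_v2 = torch.randn(batch, dim)             # same states, formulation 2
print(prompt_contrastive_loss(h_prompt_v1, h_prompt_v2))
```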

[322] Towards Generative Ray Path Sampling for Faster Point-to-Point Ray Tracing

Jérome Eertmans, Nicola Di Cicco, Claude Oestges, Laurent Jacques, Enrico M. Vitucci, Vittorio Degli-Esposti

Main category: cs.LG

TL;DR: ML-aided Ray Tracing approach that efficiently samples potential ray paths to reduce computational load while maintaining accuracy in radio propagation modeling.

DetailsMotivation: Ray Tracing is computationally intensive due to exponential growth of path testing, while existing ML approaches are too specific and can't capture underlying propagation mechanisms.

Method: Machine Learning model that dynamically learns to prioritize potentially valid paths among all possible paths, scaling linearly with scene complexity and being invariant to geometry transformations.

Result: Significantly reduced computational load while maintaining high accuracy in path identification.

Conclusion: Proposed approach provides an efficient alternative to traditional Ray Tracing that is geometry-invariant and avoids dependency on specific environment characteristics.

Abstract: Radio propagation modeling is essential in telecommunication research, as radio channels result from complex interactions with environmental objects. Recently, Machine Learning has been attracting attention as a potential alternative to computationally demanding tools, like Ray Tracing, which can model these interactions in detail. However, existing Machine Learning approaches often attempt to directly learn specific channel characteristics, such as the coverage map, making them highly specific to the frequency and material properties and unable to fully capture the underlying propagation mechanisms. Hence, Ray Tracing, particularly the Point-to-Point variant, remains popular to accurately identify all possible paths between transmitter and receiver nodes. Still, path identification is computationally intensive because the number of paths to be tested grows exponentially while only a small fraction is valid. In this paper, we propose a Machine Learning-aided Ray Tracing approach to efficiently sample potential ray paths, significantly reducing the computational load while maintaining high accuracy. Our model dynamically learns to prioritize potentially valid paths among all possible paths and scales linearly with scene complexity. Unlike recent alternatives, our approach is invariant to translation, scaling, and rotation of the geometry, and avoids dependency on specific environment characteristics.

[323] Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing

Keltin Grimes, Marco Christiani, David Shriver, Marissa Connor

Main category: cs.LG

TL;DR: Model editing methods can insert complex trojan attacks that trigger on high-level concepts like ‘computer science’ to jailbreak safety-tuned LLMs and make them answer harmful questions.

DetailsMotivation: Previous model editing methods focused on simple word-output associations, but this research demonstrates that more complex behaviors can be inserted with similar effectiveness, raising concerns about malicious applications of model editing techniques.

Method: Developed Concept-ROT, a model editing-based method that efficiently inserts trojans triggered by high-level concepts rather than simple trigger words, targeting frontier safety-tuned LLMs.

Result: Successfully inserted trojans that jailbreak safety-tuned models when specific concepts are present, causing them to answer harmful questions they would normally refuse.

Conclusion: The research demonstrates a new class of sophisticated trojan attacks and motivates increased concerns about the practicality and potential ramifications of such attacks on machine learning models.

Abstract: Model editing methods modify specific behaviors of Large Language Models by altering a small, targeted set of network weights and require very little data and compute. These methods can be used for malicious applications such as inserting misinformation or simple trojans that result in adversary-specified behaviors when a trigger word is present. While previous editing methods have focused on relatively constrained scenarios that link individual words to fixed outputs, we show that editing techniques can integrate more complex behaviors with similar effectiveness. We develop Concept-ROT, a model editing-based method that efficiently inserts trojans which not only exhibit complex output behaviors, but also trigger on high-level concepts – presenting an entirely new class of trojan attacks. Specifically, we insert trojans into frontier safety-tuned LLMs which trigger only in the presence of concepts such as ‘computer science’ or ‘ancient civilizations.’ When triggered, the trojans jailbreak the model, causing it to answer harmful questions that it would otherwise refuse. Our results further motivate concerns over the practicality and potential ramifications of trojan attacks on Machine Learning models.

[324] Using Causality for Enhanced Prediction of Web Traffic Time Series

Chang Tian, Mingzhe Xing, Zenglin Shi, Matthew B. Blaschko, Yinliang Yue, Marie-Francine Moens

Main category: cs.LG

TL;DR: Proposes CCMPlus neural module to capture causal relationships between web services for improved traffic prediction, outperforming state-of-the-art methods on real datasets.

DetailsMotivation: Web service traffic prediction is valuable for resource scaling, load balancing, and anomaly detection, but existing methods overlook causal relationships between services that influence traffic patterns.

Method: Developed CCMPlus neural network module that extracts causal relationship features across services, which can be integrated with existing time series models to enhance prediction performance.

Result: Empirical results on Microsoft Azure, Alibaba, and Ant Group datasets show superior performance in MSE and MAE metrics compared to state-of-the-art approaches.

Conclusion: Leveraging causal relationships between web services through the CCMPlus module effectively improves web service traffic prediction accuracy.

Abstract: Predicting web service traffic has significant social value, as it can be applied to various practical scenarios, including but not limited to dynamic resource scaling, load balancing, system anomaly detection, service-level agreement compliance, and fraud detection. Web service traffic is characterized by frequent and drastic fluctuations over time and is influenced by heterogeneous web user behaviors, making accurate prediction a challenging task. Previous research has extensively explored statistical approaches and neural networks to mine features from preceding service traffic time series for prediction. However, these methods have largely overlooked the causal relationships between services. Drawing inspiration from causality in ecological systems, we empirically recognize the causal relationships between web services. To leverage these relationships for improved web service traffic prediction, we propose an effective neural network module, CCMPlus, designed to extract causal relationship features across services. This module can be seamlessly integrated with existing time series models to consistently enhance the performance of web service traffic predictions. We theoretically justify that the causal correlation matrix generated by the CCMPlus module captures causal relationships among services. Empirical results on real-world datasets from Microsoft Azure, Alibaba Group, and Ant Group confirm that our method surpasses state-of-the-art approaches in Mean Squared Error (MSE) and Mean Absolute Error (MAE) for predicting service traffic time series. These findings highlight the efficacy of leveraging causal relationships for improved predictions.

[325] Don’t Trade Off Safety: Diffusion Regularization for Constrained Offline RL

Junyu Guo, Zhi Zheng, Donghao Ying, Ming Jin, Shangding Gu, Costas Spanos, Javad Lavaei

Main category: cs.LG

TL;DR: DRCORL is a constrained offline RL method that uses diffusion models to capture behavioral policies from offline data, applies gradient manipulation for safety adaptation, and achieves reliable safety performance with fast inference.

DetailsMotivation: Address the challenge of constrained reinforcement learning in offline settings where agents only have fixed datasets, preventing unsafe exploration in realistic tasks.

Method: Uses diffusion model to capture behavioral policy from offline data, extracts simplified policy for efficient inference, and applies gradient manipulation to balance reward objective and constraint satisfaction.

Result: Achieves reliable safety performance, fast inference, and strong reward outcomes across robot learning tasks. Consistently meets cost limits and performs well with same hyperparameters.

Conclusion: DRCORL demonstrates practical applicability in real-world scenarios by leveraging high-quality offline data while incorporating safety requirements, outperforming existing safe offline RL methods.

Abstract: Constrained reinforcement learning (RL) seeks high-performance policies under safety constraints. We focus on an offline setting where the agent has only a fixed dataset – common in realistic tasks to prevent unsafe exploration. To address this, we propose Diffusion-Regularized Constrained Offline Reinforcement Learning (DRCORL), which first uses a diffusion model to capture the behavioral policy from offline data and then extracts a simplified policy to enable efficient inference. We further apply gradient manipulation for safety adaptation, balancing the reward objective and constraint satisfaction. This approach leverages high-quality offline data while incorporating safety requirements. Empirical results show that DRCORL achieves reliable safety performance, fast inference, and strong reward outcomes across robot learning tasks. Compared to existing safe offline RL methods, it consistently meets cost limits and performs well with the same hyperparameters, indicating practical applicability in real-world scenarios.
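
One standard gradient-surgery recipe conveys the gradient-manipulation idea; DRCORL's exact rule may differ, so read this as a sketch of the general technique.

```python
import torch

def manipulate_gradient(g_reward, g_cost):
    """If ascending the reward gradient would also increase the cost objective
    (positive inner product), remove the conflicting component by projecting
    g_reward onto the hyperplane orthogonal to g_cost; otherwise leave it."""
    conflict = torch.dot(g_reward, g_cost)
    if conflict > 0:  # reward step would push the policy toward higher cost
        g_reward = g_reward - (conflict / g_cost.dot(g_cost)) * g_cost
    return g_reward

g_r = torch.tensor([1.0, 1.0])
g_c = torch.tensor([1.0, 0.0])          # cost rises along the first axis
print(manipulate_gradient(g_r, g_c))    # tensor([0., 1.]) -- conflict removed
```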

[326] Learning Counterfactually Fair Models via Improved Generation with Neural Causal Models

Krishn Vishwas Kher, Saksham Mittal, Aditya Varun V, Shantanu Das, SakethaNath Jagarlapudi

Main category: cs.LG

TL;DR: Proposes Neural Causal Models with kernel least squares loss for better counterfactual generation and a new MMD-based regularizer to directly enforce counterfactual fairness, improving fairness-accuracy trade-off.

DetailsMotivation: Existing counterfactual fairness methods have limitations: (1) generating counterfactual samples faithful to causal graphs, and (2) using proxy regularizers instead of directly enforcing the exact fairness definition.

Method: Uses Neural Causal Models for counterfactual generation with novel kernel least squares loss to enforce L3 constraints, plus a new MMD-based regularizer to directly enforce counterfactual fairness conditions during training.

Result: Shows improved trade-off between counterfactual fairness and generalization performance on both synthetic and benchmark datasets compared to existing baselines.

Conclusion: The proposed approach addresses both limitations of existing methods by providing more faithful counterfactual generation and direct enforcement of counterfactual fairness, leading to better fairness-accuracy balance.

Abstract: One of the main concerns while deploying machine learning models in real-world applications is fairness. Counterfactual fairness has emerged as an intuitive and natural definition of fairness. However, existing methodologies for enforcing counterfactual fairness seem to have two limitations: (i) generating counterfactual samples faithful to the underlying causal graph, and (ii) as we argue in this paper, existing regularizers are mere proxies and do not directly enforce the exact definition of counterfactual fairness. In this work, our aim is to mitigate both issues. Firstly, we propose employing Neural Causal Models (NCMs) for generating the counterfactual samples. For implementing the abduction step in NCMs, the posteriors of the exogenous variables need to be estimated given a counterfactual query, as they are not readily available. As a consequence, $\mathcal{L}_3$ consistency with respect to the underlying causal graph cannot be guaranteed in practice due to the estimation errors involved. To mitigate this issue, we propose a novel kernel least squares loss term that enforces the $\mathcal{L}_3$ constraints explicitly. Thus, we obtain an improved counterfactual generation suitable for the counterfactual fairness task. Secondly, we propose a new MMD-based regularizer term that explicitly enforces the counterfactual fairness conditions into the base model while training. We show an improved trade-off between counterfactual fairness and generalization over existing baselines on synthetic and benchmark datasets.
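
A minimal sketch of an RBF-kernel MMD term of the kind the paper uses: driving the MMD between model outputs on factual inputs and on their counterfactuals toward zero pushes the two output distributions together. The estimator details below (biased estimator, single bandwidth) are simplifications and may differ from the paper's.

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Squared MMD with an RBF kernel between two output samples x and y.
    Adding this as a regularizer penalizes distributional differences between
    predictions on factual and counterfactual inputs."""
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

preds_factual = torch.randn(64, 1)        # f(x) on observed samples
preds_counter = torch.randn(64, 1) + 0.5  # f(x') on counterfactual samples
print(rbf_mmd2(preds_factual, preds_counter))   # > 0: the distributions differ
```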

[327] Do Sparse Autoencoders Generalize? A Case Study of Answerability

Lovis Heindrich, Philip Torr, Fazl Barez, Veronika Thost

Main category: cs.LG

TL;DR: SAE features show inconsistent generalization across answerability domains, sometimes performing worse than random and sometimes better than residual stream probes, highlighting the need for better evaluation methods.

DetailsMotivation: To evaluate how well sparse autoencoder (SAE) features generalize across different domains for interpretability, specifically testing "answerability" - a model's ability to recognize answerable questions.

Method: Extensive evaluation of SAE feature generalization using diverse, partly self-constructed answerability datasets for Gemma 2 SAEs, comparing performance against residual stream probes.

Result: Residual stream probes outperform SAE features within domains, but SAE features show highly variable out-of-domain transfer - performance ranges from nearly random to superior to residual stream probes.

Conclusion: SAE-based interpretability requires robust evaluation methods and quantitative approaches to predict feature generalization due to inconsistent cross-domain performance.

Abstract: Sparse autoencoders (SAEs) have emerged as a promising approach in language model interpretability, offering unsupervised extraction of sparse features. For interpretability methods to succeed, they must identify abstract features across domains, and these features can often manifest differently in each context. We examine this through “answerability” - a model’s ability to recognize answerable questions. We extensively evaluate SAE feature generalization across diverse, partly self-constructed answerability datasets for Gemma 2 SAEs. Our analysis reveals that residual stream probes outperform SAE features within domains, but generalization performance differs sharply. SAE features show inconsistent out-of-domain transfer, with performance varying from almost random to outperforming residual stream probes. Overall, this demonstrates the need for robust evaluation methods and quantitative approaches to predict feature generalization in SAE-based interpretability.
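
The probing comparison can be sketched as two linear probes, one on residual-stream activations and one on SAE features, evaluated out of domain; the shapes, random data, and the generic ReLU `sae_encode` below are stand-ins, not the released Gemma 2 SAEs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Stand-ins for in-domain and out-of-domain residual-stream activations.
resid_tr, resid_ood = rng.normal(size=(512, 256)), rng.normal(size=(256, 256))
y_tr, y_ood = rng.integers(0, 2, 512), rng.integers(0, 2, 256)

def sae_encode(resid, W_enc, b_enc):
    # A typical SAE encoder: ReLU(x @ W_enc + b) yields sparse features.
    return np.maximum(resid @ W_enc + b_enc, 0.0)

W_enc, b_enc = rng.normal(size=(256, 1024)) * 0.05, -0.1
for name, tr, te in [
    ("residual probe", resid_tr, resid_ood),
    ("SAE-feature probe", sae_encode(resid_tr, W_enc, b_enc),
     sae_encode(resid_ood, W_enc, b_enc)),
]:
    probe = LogisticRegression(max_iter=1000).fit(tr, y_tr)
    print(name, accuracy_score(y_ood, probe.predict(te)))
```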

[328] Revealing higher-order neural representations of uncertainty with the Noise Estimation through Reinforcement-based Diffusion (NERD) model

Hojjat Azimi Asrari, Megan A. K. Peters

Main category: cs.LG

TL;DR: This paper studies higher-order representations (HORs) of uncertainty in the brain, particularly “noise expectation” HORs, using decoded neurofeedback tasks and developing a computational model called NERD to explain human learning behavior.

DetailsMotivation: While first-order representations encoding environmental aspects are well-studied, higher-order representations about uncertainty (noise expectations) remain poorly understood. The brain likely uses noisy estimation processes with prior expectations about uncertainty rather than direct read-outs of first-order characteristics.

Method: Used neural data from decoded neurofeedback tasks where subjects learn to volitionally produce target neural patterns. Developed and applied a Noise Estimation through Reinforcement-based Diffusion (NERD) model to characterize how brains learn about their own noise.

Result: The NERD model demonstrated high explanatory power for human behavior in the neurofeedback learning task, suggesting it effectively captures how brains represent and learn about expected uncertainty distributions.

Conclusion: The study provides insights into how the brain represents and learns about its own uncertainty through higher-order representations, with the NERD model offering a promising computational framework for understanding noise expectation processes in neural learning.

Abstract: Studies often aim to reveal “first-order” representations (FORs), which encode aspects of an observer’s environment, such as contents or structure. A less-common target is “higher-order” representations (HORs), which are “about” FORs – e.g., their strength or uncertainty – and which may contribute to learning. HORs about uncertainty are unlikely to be direct “read-outs” of FOR characteristics, instead reflecting noisy estimation processes incorporating prior expectations about uncertainty, but how the brain represents such expected uncertainty distributions remains largely unexplored. Here, we study “noise expectation” HORs using neural data from a task which may require the brain to learn about its own noise: decoded neurofeedback, wherein human subjects learn to volitionally produce target neural patterns. We develop and apply a Noise Estimation through Reinforcement-based Diffusion (NERD) model to characterize how brains may undertake this process, and show that NERD offers high explanatory power for human behavior.

[329] STADE: Standard Deviation as a Pruning Metric

Diego Coello de Portugal Mecke, Haya Alyoussef, Maximilian Stubbemann, Ilia Koloiarov, Tom Hanika, Lars Schmidt-Thieme

Main category: cs.LG

TL;DR: This paper provides theoretical analysis of Wanda pruning method for LLMs, develops a new method called STADE based on standard deviation, and validates findings through experiments on Llama and OPT models.

DetailsMotivation: LLMs require extensive computational resources, and while pruning methods like Wanda can reduce demands without retraining, there's a need for theoretical understanding and improved methods for different scenarios.

Method: Theoretical analysis of pruning problem, identification of scenarios where Wanda is optimal, development of STADE method based on input standard deviation, and extensive experiments on Llama and OPT models.

Result: Theoretical framework predicts Wanda’s varying optimal performance based on training conditions. STADE demonstrates better generality across different scenarios than Wanda.

Conclusion: The study provides theoretical insights into pruning strategies, develops a more robust method (STADE), and contributes to better understanding of practical implications for LLM pruning without retraining.

Abstract: Recently, Large Language Models (LLMs) have become very widespread and are used to solve a wide variety of tasks. To successfully handle these tasks, LLMs require longer training times and larger model sizes. This makes LLMs ideal candidates for pruning methods that reduce computational demands while maintaining performance. Previous methods require a retraining phase after pruning to maintain the original model’s performance. However, state-of-the-art pruning methods, such as Wanda, prune the model without retraining, making the pruning process faster and more efficient. Building upon Wanda’s work, this study provides a theoretical explanation of why the method is effective and leverages these insights to enhance the pruning process. Specifically, a theoretical analysis of the pruning problem reveals a common scenario in Machine Learning where Wanda is the optimal pruning method. Furthermore, this analysis is extended to cases where Wanda is no longer optimal, leading to the development of a new method, STADE, based on the standard deviation of the input. From a theoretical standpoint, STADE demonstrates better generality across different scenarios. Finally, extensive experiments on Llama and Open Pre-trained Transformers (OPT) models validate these theoretical findings, showing that depending on the training conditions, Wanda’s optimal performance varies as predicted by the theoretical framework. These insights contribute to a more robust understanding of pruning strategies and their practical implications. Code is available at: https://github.com/Coello-dev/STADE/
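
Wanda's published importance score is |W_ij| · ||x_j||_2 over calibration activations; STADE is described as replacing the norm with the input's standard deviation. A hedged sketch of both scores and of per-row unstructured pruning (the exact STADE formula may differ from this reading):

```python
import numpy as np

def wanda_scores(W, X):
    # Wanda: |W_ij| * ||x_j||_2, with X of shape (tokens, in_features).
    return np.abs(W) * np.linalg.norm(X, axis=0)[None, :]

def stade_like_scores(W, X):
    # Hedged sketch: scale |W| by the per-feature standard deviation
    # of the input instead of its norm.
    return np.abs(W) * X.std(axis=0)[None, :]

def prune(W, scores, sparsity=0.5):
    # Zero the lowest-scoring half of the weights in each output row.
    k = int(W.shape[1] * sparsity)
    idx = np.argsort(scores, axis=1)[:, :k]
    W_pruned = W.copy()
    np.put_along_axis(W_pruned, idx, 0.0, axis=1)
    return W_pruned

rng = np.random.default_rng(0)
W, X = rng.normal(size=(8, 16)), rng.normal(size=(128, 16))
print(np.count_nonzero(prune(W, wanda_scores(W, X))))       # 64 of 128 remain
print(np.count_nonzero(prune(W, stade_like_scores(W, X))))  # 64 of 128 remain
```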

[330] Variational Online Mirror Descent for Robust Learning in Schrödinger Bridge

Dong-Sig Han, Jaein Kim, Hee Bin Yoo, Byoung-Tak Zhang

Main category: cs.LG

TL;DR: Proposes Variational Mirrored Schrödinger Bridge (VMSB) algorithm using online mirror descent framework for more stable and reliable SB solvers with proven convergence guarantees.

DetailsMotivation: Existing Schrödinger bridge methods rely on speculative optimal scenarios and lack robustness against uncertain learning signals. Recent insights from Sinkhorn algorithm through mirror descent suggest geometric approaches could improve SB solution acquisition.

Method: Variational online mirror descent framework for SB problems using Wasserstein-Fisher-Rao geometry of Gaussian mixture parameterization. Provides tractable learning dynamics that approximate each OMD step based on Wasserstein gradient flow theory.

Result: VMSB consistently outperforms contemporary SB solvers across extensive benchmarks, demonstrating robustness and generality as predicted by the OMD theory.

Conclusion: The proposed VMSB algorithm offers a more stable and reliable approach to Schrödinger bridge problems with proven convergence bounds, addressing the innate uncertainty in learning signals that plagues existing methods.

Abstract: The Schrödinger bridge (SB) has evolved into a universal class of probabilistic generative models. In practice, however, estimated learning signals are innately uncertain, and the reliability promised by existing methods is often based on speculative optimal case scenarios. Recent studies regarding the Sinkhorn algorithm through mirror descent (MD) have gained attention, revealing geometric insights into solution acquisition of the SB problems. In this paper, we propose a variational online MD (OMD) framework for the SB problems, which provides further stability to SB solvers. We formally prove convergence and a regret bound for the novel OMD formulation of SB acquisition. As a result, we propose a simulation-free SB algorithm called Variational Mirrored Schrödinger Bridge (VMSB) by utilizing the Wasserstein-Fisher-Rao geometry of the Gaussian mixture parameterization for Schrödinger potentials. Based on the Wasserstein gradient flow theory, the algorithm offers tractable learning dynamics that precisely approximate each OMD step. In experiments, we validate the performance of the proposed VMSB algorithm across an extensive suite of benchmarks. VMSB consistently outperforms contemporary SB solvers on a wide range of SB problems, demonstrating the robustness as well as generality predicted by our OMD theory.

[331] AutoPDL: Automatic Prompt Optimization for LLM Agents

Claudio Spiess, Mandana Vaziri, Louis Mandel, Martin Hirzel

Main category: cs.LG

TL;DR: AutoPDL automates LLM prompt configuration discovery by framing it as a structured AutoML problem, using successive halving to efficiently search through combinatorial prompting patterns and demonstrations.

DetailsMotivation: Manual prompt tuning for LLMs is tedious, error-prone, and model/task-specific, requiring an automated approach to discover optimal prompting configurations.

Method: Frames prompt discovery as structured AutoML over combinatorial space of agentic/non-agentic patterns and demonstrations using successive halving. Implements common patterns via PDL programming language library.

Result: Achieves consistent accuracy gains (9.21±15.46 percentage points, up to 67.5pp) across 3 tasks and 7 LLMs (3B-70B parameters), showing prompting strategies vary by model and task.

Conclusion: AutoPDL successfully automates prompt configuration discovery, producing human-readable and editable PDL programs that enable source-to-source optimization and human refinement.

Abstract: The performance of large language models (LLMs) depends on how they are prompted, with choices spanning both the high-level prompting pattern (e.g., Zero-Shot, CoT, ReAct, ReWOO) and the specific prompt content (instructions and few-shot demonstrations). Manually tuning this combination is tedious, error-prone, and specific to a given LLM and task. Therefore, this paper proposes AutoPDL, an automated approach to discovering good LLM agent configurations. Our approach frames this as a structured AutoML problem over a combinatorial space of agentic and non-agentic prompting patterns and demonstrations, using successive halving to efficiently navigate this space. We introduce a library implementing common prompting patterns using the PDL prompt programming language. AutoPDL solutions are human-readable, editable, and executable PDL programs that use this library. This approach also enables source-to-source optimization, allowing human-in-the-loop refinement and reuse. Evaluations across three tasks and seven LLMs (ranging from 3B to 70B parameters) show consistent accuracy gains ($9.21\pm15.46$ percentage points), up to 67.5pp, and reveal that selected prompting strategies vary across models and tasks.
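
The successive-halving search can be sketched in a few lines; `evaluate` is a stand-in for scoring one candidate PDL program on `budget` task examples (an assumption about the interface, not AutoPDL's actual API).

```python
import random

def successive_halving(configs, evaluate, budget0=16, eta=2):
    """Evaluate all candidates cheaply, keep the top 1/eta fraction,
    multiply the budget by eta, and repeat until one survives."""
    budget = budget0
    while len(configs) > 1:
        ranked = sorted(configs, key=lambda c: evaluate(c, budget), reverse=True)
        configs = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return configs[0]

# Toy usage over a combinatorial (pattern, n_demos) prompt space.
space = [(p, k) for p in ("zero-shot", "cot", "react", "rewoo") for k in (0, 3, 5)]
random.seed(0)
true_score = lambda cfg: {"zero-shot": 60, "cot": 75, "react": 80, "rewoo": 70}[cfg[0]] + cfg[1]
noisy_eval = lambda cfg, budget: true_score(cfg) + random.gauss(0, 50 / budget)
print(successive_halving(space, noisy_eval))  # likely ('react', 5)
```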

[332] ALF: Advertiser Large Foundation Model for Multi-Modal Advertiser Understanding

Santosh Rajagopalan, Jonathan Vronsky, Songbai Yan, S. Alireza Golestaneh, Shubhra Chandra, Min Zhou

Main category: cs.LG

TL;DR: ALF is a multi-modal transformer that achieves state-of-the-art performance in advertiser behavior understanding across text, image, video, and structured data, delivering significant real-world improvements in fraud detection and policy enforcement.

DetailsMotivation: To create a unified model for understanding advertiser behavior and intent across multiple data modalities (text, image, video, structured data) to improve critical advertising platform tasks like fraud detection and policy enforcement.

Method: Uses a multi-modal transformer architecture with contrastive learning and multi-task optimization, featuring novel components including multi-modal transformations, inter-sample attention mechanism, spectrally normalized projections, and calibrated probabilistic outputs.

Result: Achieves state-of-the-art performance with significant real-world impact: boosts recall by over 40 percentage points on critical policies and increases precision to 99.8%. Delivers simultaneous gains in both precision and recall in production deployment.

Conclusion: ALF effectively creates unified advertiser representations that capture both content and behavioral patterns, demonstrating the power of multi-modal transformer architectures for understanding complex advertiser behavior across diverse data types.

Abstract: We present ALF (Advertiser Large Foundation model), a multi-modal transformer architecture for understanding advertiser behavior and intent across text, image, video, and structured data modalities. Through contrastive learning and multi-task optimization, ALF creates unified advertiser representations that capture both content and behavioral patterns. Our model achieves state-of-the-art performance on critical tasks including fraud detection, policy violation identification, and advertiser similarity matching. In production deployment, ALF demonstrates significant real-world impact by delivering simultaneous gains in both precision and recall, for instance boosting recall by over 40 percentage points on one critical policy and increasing precision to 99.8% on another. The architecture’s effectiveness stems from its novel combination of multi-modal transformations, inter-sample attention mechanism, spectrally normalized projections, and calibrated probabilistic outputs.
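
The contrastive component can be illustrated with a standard InfoNCE objective between two modality towers; the pairing scheme and temperature are illustrative assumptions, since the abstract does not detail ALF's exact loss.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """Hedged sketch: embeddings of the same advertiser from two
    modalities attract; every other pair in the batch repels."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature        # (batch, batch) similarities
    labels = torch.arange(z_a.size(0))        # positives on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage: text-tower vs. image-tower embeddings for 8 advertisers.
text_emb, image_emb = torch.randn(8, 64), torch.randn(8, 64)
print(float(info_nce(text_emb, image_emb)))
```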

[333] Scalable Unit Harmonization in Medical Informatics via Bayesian-Optimized Retrieval and Transformer-Based Re-ranking

Jordi de la Torre

Main category: cs.LG

TL;DR: Hybrid retrieval system combining BM25 and sentence embeddings with transformer reranker achieves high accuracy (MRR: 0.9833) for harmonizing inconsistent units in large clinical datasets.

DetailsMotivation: Address the critical barrier of inconsistent units in clinical data that hinders data interoperability and prevents reliable multi-institutional studies.

Method: Multi-stage pipeline combining BM25, sentence embeddings, Bayesian optimization, and bidirectional transformer classifier for filtering, identification, harmonization proposal generation, automated re-ranking, and manual validation.

Result: Hybrid approach (MRR: 0.8833) outperformed lexical-only (0.7985) and embedding-only (0.5277) methods. Transformer reranker improved MRR by 0.10 to final 0.9833, with 83.39% precision at rank 1 and 94.66% recall at rank 5.

Conclusion: Efficient, scalable solution for unit harmonization that reduces manual effort while improving accuracy, enabling seamless data reuse and consistent multi-institutional healthcare studies.

Abstract: Objective: To develop and evaluate a scalable methodology for harmonizing inconsistent units in large-scale clinical datasets, addressing a key barrier to data interoperability. Materials and Methods: We designed a novel unit harmonization system combining BM25, sentence embeddings, Bayesian optimization, and a bidirectional transformer based binary classifier for retrieving and matching laboratory test entries. The system was evaluated using the Optum Clinformatics Datamart dataset (7.5 billion entries). We implemented a multi-stage pipeline: filtering, identification, harmonization proposal generation, automated re-ranking, and manual validation. Performance was assessed using Mean Reciprocal Rank (MRR) and other standard information retrieval metrics. Results: Our hybrid retrieval approach combining BM25 and sentence embeddings (MRR: 0.8833) significantly outperformed both lexical-only (MRR: 0.7985) and embedding-only (MRR: 0.5277) approaches. The transformer-based reranker further improved performance (absolute MRR improvement: 0.10), bringing the final system MRR to 0.9833. The system achieved 83.39% precision at rank 1 and 94.66% recall at rank 5. Discussion: The hybrid architecture effectively leverages the complementary strengths of lexical and semantic approaches. The reranker addresses cases where initial retrieval components make errors due to complex semantic relationships in medical terminology. Conclusion: Our framework provides an efficient, scalable solution for unit harmonization in clinical datasets, reducing manual effort while improving accuracy. Once harmonized, data can be reused seamlessly in different analyses, ensuring consistency across healthcare systems and enabling more reliable multi-institutional studies and meta-analyses.
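
The hybrid retrieval idea can be sketched as a normalized blend of BM25 and embedding-similarity scores, evaluated with MRR; the min-max fusion and the blend weight `alpha` (the kind of knob Bayesian optimization would tune) are assumptions, not the paper's exact pipeline.

```python
import numpy as np

def hybrid_scores(bm25, emb_sim, alpha=0.5):
    # Min-max normalize each score list onto [0, 1], then blend.
    def norm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.min()) / (s.max() - s.min() + 1e-12)
    return alpha * norm(bm25) + (1 - alpha) * norm(emb_sim)

def mean_reciprocal_rank(rankings, gold):
    # rankings: per-query candidate ids, best first; gold: correct ids.
    rr = [1.0 / (r.index(g) + 1) for r, g in zip(rankings, gold) if g in r]
    return sum(rr) / len(rankings)

# Toy usage: fuse lexical and semantic scores for one query.
bm25 = [12.1, 3.4, 8.8]
emb = [0.31, 0.72, 0.65]
order = list(np.argsort(-hybrid_scores(bm25, emb)))
print(order, mean_reciprocal_rank([order], gold=[2]))  # gold at rank 1 -> MRR 1.0
```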

[334] Cutting Through Privacy: A Hyperplane-Based Data Reconstruction Attack in Federated Learning

Francesco Diana, André Nusser, Chuan Xu, Giovanni Neglia

Main category: cs.LG

TL;DR: A novel data reconstruction attack in Federated Learning that overcomes limitations of existing methods, enabling perfect recovery of arbitrarily large data batches without prior knowledge of client data.

DetailsMotivation: Existing FL data reconstruction attacks have limitations - they rely on assumptions about client data distribution and degrade significantly with batch sizes beyond a few tens of samples.

Method: Leverages a new geometric perspective on fully connected layers to craft malicious model parameters for perfect data recovery in classification tasks.

Result: Outperforms existing methods and achieves perfect reconstruction of data batches two orders of magnitude larger than state-of-the-art, demonstrated on both image and tabular datasets.

Conclusion: The attack successfully overcomes key limitations of previous FL reconstruction attacks, demonstrating significant vulnerability in FL systems even with large batch sizes and no prior data knowledge.

Abstract: Federated Learning (FL) enables collaborative training of machine learning models across distributed clients without sharing raw data, ostensibly preserving data privacy. Nevertheless, recent studies have revealed critical vulnerabilities in FL, showing that a malicious central server can manipulate model updates to reconstruct clients’ private training data. Existing data reconstruction attacks have important limitations: they often rely on assumptions about the clients’ data distribution or their efficiency significantly degrades when batch sizes exceed just a few tens of samples. In this work, we introduce a novel data reconstruction attack that overcomes these limitations. Our method leverages a new geometric perspective on fully connected layers to craft malicious model parameters, enabling the perfect recovery of arbitrarily large data batches in classification tasks without any prior knowledge of clients’ data. Through extensive experiments on both image and tabular datasets, we demonstrate that our attack outperforms existing methods and achieves perfect reconstruction of data batches two orders of magnitude larger than the state of the art.

[335] TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning

Tunyu Zhang, Haizhou Shi, Yibin Wang, Hengyi Wang, Xiaoxiao He, Zhuowei Li, Haoxian Chen, Ligong Han, Kai Xu, Huan Zhang, Dimitris Metaxas, Hao Wang

Main category: cs.LG

TL;DR: TokUR framework uses token-level uncertainty estimation through low-rank random weight perturbation to help LLMs self-assess and improve mathematical reasoning quality.

DetailsMotivation: LLM output quality is inconsistent across applications, making it hard to identify trustworthy responses in complex multi-step reasoning tasks.

Method: Introduces low-rank random weight perturbation during LLM decoding to generate predictive distributions for token-level uncertainty estimation, then aggregates uncertainties to reflect semantic uncertainty of sequences.

Result: Token-level uncertainty metrics strongly correlate with answer correctness and model robustness on mathematical reasoning datasets. The approach outperforms existing uncertainty estimation methods.

Conclusion: Effective uncertainty estimation serves as a valuable tool for both evaluating and improving reasoning generation in LLMs, enabling self-assessment and self-improvement capabilities.

Abstract: While Large Language Models (LLMs) have demonstrated impressive capabilities, their output quality remains inconsistent across various application scenarios, making it difficult to identify trustworthy responses, especially in complex tasks requiring multi-step reasoning. In this paper, we propose a Token-level Uncertainty estimation framework for Reasoning (TokUR) to enable LLMs to self-assess and self-improve their generation quality in mathematical reasoning. Specifically, we introduce low-rank random weight perturbation to LLM decoding, generating predictive distributions that we use to estimate token-level uncertainties. We then aggregate these uncertainties to reflect semantic uncertainty of the generated sequences. Experiments on mathematical reasoning datasets of varying difficulty demonstrate that our token-level uncertainty metrics strongly correlate with answer correctness and model robustness. Additionally, we explore using uncertainty to directly enhance the model’s reasoning performance through multiple generations and the particle filtering algorithm. Our approach consistently outperforms existing uncertainty estimation methods, establishing effective uncertainty estimation as a valuable tool for both evaluating and improving reasoning generation in LLMs.
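
The low-rank perturbation idea can be sketched as follows: add a thin random factorized update to a weight matrix, sample several forward passes, and read token-level uncertainty from the spread of the predictive distributions. The rank, scale, and entropy-based aggregation are illustrative assumptions, not TokUR's exact settings.

```python
import torch

@torch.no_grad()
def lowrank_perturb(weight, rank=4, scale=1e-3):
    # Add scale * (U @ V) with thin Gaussian factors U, V.
    out_dim, in_dim = weight.shape
    U = torch.randn(out_dim, rank) * scale
    V = torch.randn(rank, in_dim)
    return weight + U @ V

# Toy usage: entropy of the mean predictive distribution per token.
W = torch.randn(32, 16)          # stand-in for an LM output projection
h = torch.randn(5, 16)           # hidden states for 5 tokens
probs = torch.stack([
    torch.softmax(h @ lowrank_perturb(W).T, dim=-1) for _ in range(8)
]).mean(0)
token_uncertainty = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
print(token_uncertainty)         # higher entropy = less trustworthy token
```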

[336] A Deep Learning Framework for Two-Dimensional, Multi-Frequency Propagation Factor Estimation

Sarah E. Wessinger, Leslie N. Smith, Jacob Gull, Jonathan Gehman, Zachary Beever, Andrew J. Kammerer

Main category: cs.LG

TL;DR: Deep neural networks for efficient pattern propagation factor estimation in marine environments as alternative to computationally expensive traditional methods

DetailsMotivation: Traditional parabolic equation simulations for radar deployment in marine atmospheric boundary layer are computationally expensive and time-intensive, limiting practical application

Method: Image-to-image translation generators using deep neural networks that ingest modified refractivity data to generate pattern propagation factor predictions over the same domain

Result: Deep neural networks can be trained to analyze multiple frequencies and reasonably predict pattern propagation factor

Conclusion: Deep neural networks offer an effective alternative to traditional methods for estimating pattern propagation factors in marine environments

Abstract: Accurately estimating the refractive environment over multiple frequencies within the marine atmospheric boundary layer is crucial for the effective deployment of radar technologies. Traditional parabolic equation simulations, while effective, can be computationally expensive and time-intensive, limiting their practical application. This communication explores a novel approach using deep neural networks to estimate the pattern propagation factor, a critical parameter for characterizing environmental impacts on signal propagation. Image-to-image translation generators designed to ingest modified refractivity data and generate predictions of pattern propagation factors over the same domain were developed. Findings demonstrate that deep neural networks can be trained to analyze multiple frequencies and reasonably predict the pattern propagation factor, offering an alternative to traditional methods.

[337] Q-learning with Posterior Sampling

Priyank Agrawal, Shipra Agrawal, Azmat Azati

Main category: cs.LG

TL;DR: PSQL combines Q-learning with Bayesian posterior sampling for exploration, achieving near-optimal regret bounds in tabular MDPs.

DetailsMotivation: Bayesian posterior sampling techniques show strong empirical performance in exploration-exploitation settings but lack theoretical analysis, especially in complex RL environments.

Method: Q-Learning with Posterior Sampling (PSQL) uses Gaussian posteriors on Q-values for exploration, similar to Thompson Sampling in bandits, combining posterior sampling with dynamic programming and TD-learning.

Result: PSQL achieves a regret bound of $\tilde{O}(H^2\sqrt{SAT})$ in tabular episodic MDPs, closely matching the known lower bound of $\Omega(H\sqrt{SAT})$.

Conclusion: The work provides technical insights into combining posterior sampling with RL algorithms and serves as a foundation for analyzing this efficient technique in more complex RL settings.

Abstract: Bayesian posterior sampling techniques have demonstrated superior empirical performance in many exploration-exploitation settings. However, their theoretical analysis remains a challenge, especially in complex settings like reinforcement learning. In this paper, we introduce Q-Learning with Posterior Sampling (PSQL), a simple Q-learning-based algorithm that uses Gaussian posteriors on Q-values for exploration, akin to the popular Thompson Sampling algorithm in the multi-armed bandit setting. We show that in the tabular episodic MDP setting, PSQL achieves a regret bound of $\tilde O(H^2\sqrt{SAT})$, closely matching the known lower bound of $\Omega(H\sqrt{SAT})$. Here, S, A denote the number of states and actions in the underlying Markov Decision Process (MDP), and $T=KH$ with $K$ being the number of episodes and $H$ being the planning horizon. Our work provides several new technical insights into the core challenges in combining posterior sampling with dynamic programming and TD-learning-based RL algorithms, along with novel ideas for resolving those difficulties. We hope this will form a starting point for analyzing this efficient and important algorithmic technique in even more complex RL settings.
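
The action-selection rule can be sketched as Thompson-style sampling from Gaussian posteriors over Q-values whose variance shrinks with visit counts; the variance schedule below is illustrative, not the one analyzed in the paper. With H = 1 the loop reduces to a Gaussian bandit.

```python
import numpy as np

def psql_action(q_mean, q_count, h, H, rng):
    # Sample each Q-value from a Gaussian posterior and act greedily on
    # the sample; variance shrinks as a state-action pair is visited.
    std = (H - h) / np.sqrt(np.maximum(q_count, 1))
    return int(np.argmax(rng.normal(q_mean, std)))

# Toy usage: one state, one step (H = 1), three actions.
rng = np.random.default_rng(0)
q_mean, q_count = np.zeros(3), np.zeros(3)
true_means = np.array([0.2, 0.5, 0.8])
for _ in range(1000):
    a = psql_action(q_mean, q_count, h=0, H=1, rng=rng)
    r = rng.normal(true_means[a], 1.0)
    q_count[a] += 1
    q_mean[a] += (r - q_mean[a]) / q_count[a]  # running-mean Q update
print(q_mean.round(2), q_count)  # visits concentrate on action 2
```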

[338] Kernel $k$-Medoids as General Vector Quantization

Thore Gerlach, Sascha Mücke, Christian Bauckhage

Main category: cs.LG

TL;DR: The paper reveals that Kernel Density Estimation (KDE) and k-medoids clustering, two seemingly different VQ methods, are connected through QUBO formulations, with KDE-QUBO being a special case of k-medoids-QUBO.

DetailsMotivation: To investigate the connection between distance-based k-medoids clustering and probability density-based KDE approaches in Vector Quantization, which appear unrelated but may share underlying mathematical structure.

Method: The authors use Quadratic Unconstrained Binary Optimization (QUBO) formulations to compare heuristic k-medoids approaches with principled KDE-based VQ methods, analyzing their mathematical relationships under kernel feature map assumptions.

Result: The study shows that the KDE-QUBO formulation is actually a special case of the k-medoids-QUBO formulation when mild assumptions are made about the kernel’s feature map, revealing their structural relationship.

Conclusion: This work provides new geometric insights into weighting parameters in QUBO formulations for VQ and demonstrates a deeper connection between two seemingly disparate VQ paradigms through their QUBO representations.

Abstract: Vector Quantization (VQ) is a widely used technique in machine learning and data compression, valued for its simplicity and interpretability. Among hard VQ methods, $k$-medoids clustering and Kernel Density Estimation (KDE) approaches represent two prominent yet seemingly unrelated paradigms – one distance-based, the other rooted in probability density matching. In this paper, we investigate their connection through the lens of Quadratic Unconstrained Binary Optimization (QUBO). We compare a heuristic QUBO formulation for $k$-medoids, which balances centrality and diversity, with a principled QUBO derived from minimizing Maximum Mean Discrepancy in KDE-based VQ. Surprisingly, we show that the KDE-QUBO is a special case of the $k$-medoids-QUBO under mild assumptions on the kernel’s feature map. This reveals a deeper structural relationship between these two approaches and provides new insight into the geometric interpretation of the weighting parameters used in QUBO formulations for VQ.
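
The heuristic k-medoids QUBO can be sketched as a matrix construction over binary indicators x_i ("point i is a medoid"): diagonal terms score centrality, off-diagonal terms reward diversity, and a quadratic penalty enforces exactly k selections. The coefficients follow the common heuristic form and are assumptions, not the paper's exact weights.

```python
import numpy as np

def kmedoids_qubo(D, k, alpha=1.0, beta=1.0, gamma=10.0):
    """Build Q so that minimizing x^T Q x over x in {0,1}^n selects k
    central yet mutually distant medoids (hedged sketch)."""
    n = D.shape[0]
    Q = -beta * D / 2.0                          # diversity: far-apart pairs lower energy
    np.fill_diagonal(Q, alpha * D.mean(axis=1))  # centrality cost of each candidate
    # Penalty gamma * (sum_i x_i - k)^2, expanded into QUBO coefficients.
    Q += gamma * (np.ones((n, n)) - np.eye(n))
    Q[np.diag_indices(n)] += gamma * (1 - 2 * k)
    return Q

rng = np.random.default_rng(0)
pts = rng.normal(size=(10, 2))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
Q = kmedoids_qubo(D, k=3)
x = rng.integers(0, 2, 10)   # a candidate selection a QUBO solver would refine
print(float(x @ Q @ x))      # energy to be minimized
```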

[339] A Weighted Loss Approach to Robust Federated Learning under Data Heterogeneity

Johan Erbani, Sonia Ben Mokhtar, Pierre-Edouard Portier, Elod Egyed-Zsigmond, Diana Nurbakova

Main category: cs.LG

TL;DR: WoLA loss function aligns honest worker gradients in federated learning to better distinguish Byzantine attacks from honest data heterogeneity, outperforming state-of-the-art methods.

DetailsMotivation: Federated learning faces security threats from Byzantine participants who submit poisonous gradients. Current methods struggle to distinguish Byzantine gradients from honest outliers in heterogeneous data settings where honest gradients naturally differ significantly.

Method: Introduces Worker Label Alignment Loss (WoLA), a weighted loss function that aligns honest worker gradients despite data heterogeneity, making Byzantine gradient identification easier.

Result: WoLA significantly outperforms state-of-the-art Byzantine-resilient FL methods in heterogeneous settings, with both theoretical analysis and empirical evidence supporting its effectiveness.

Conclusion: The WoLA approach provides an effective solution for Byzantine resilience in federated learning with heterogeneous data, enabling better security without compromising model convergence.

Abstract: Federated learning (FL) is a machine learning paradigm that enables multiple data holders to collaboratively train a machine learning model without sharing their training data with external parties. In this paradigm, workers locally update a model and share with a central server their updated gradients (or model parameters). While FL seems appealing from a privacy perspective, it opens a number of threats from a security perspective as (Byzantine) participants can contribute poisonous gradients (or model parameters) harming model convergence. Byzantine-resilient FL addresses this issue by ensuring that the training proceeds as if Byzantine participants were absent. Towards this purpose, common strategies ignore outlier gradients during model aggregation, assuming that Byzantine gradients deviate more from honest gradients than honest gradients do from each other. However, in heterogeneous settings, honest gradients may differ significantly, making it difficult to distinguish honest outliers from Byzantine ones. In this paper, we introduce the Worker Label Alignment Loss (WoLA), a weighted loss that aligns honest worker gradients despite data heterogeneity, which facilitates the identification of Byzantines’ gradients. This approach significantly outperforms state-of-the-art methods in heterogeneous settings. In this paper, we provide both theoretical insights and empirical evidence of its effectiveness.

[340] The Features at Convergence Theorem: a first-principles alternative to the Neural Feature Ansatz for how networks learn representations

Enric Boix-Adsera, Neil Mallinar, James B. Simon, Mikhail Belkin

Main category: cs.LG

TL;DR: The paper introduces FACT (Features at Convergence Theorem) as a principled alternative to the Neural Feature Ansatz (NFA), providing theoretical foundation for feature learning in neural networks using first-order optimality conditions.

DetailsMotivation: The Neural Feature Ansatz (NFA) is an empirically validated but theoretically ungrounded conjecture about feature learning in neural networks. The authors aim to provide a first-principles theoretical basis for understanding feature learning mechanisms.

Method: The authors use first-order optimality conditions to derive the Features at Convergence Theorem (FACT), which serves as an alternative framework to the NFA for analyzing feature learning in neural networks.

Result: FACT achieves better agreement with learned features at convergence, explains why NFA holds in most settings, and captures essential feature learning phenomena like grokking behavior in modular arithmetic and phase transitions in sparse parities learning.

Conclusion: The results unify theoretical first-order optimality analyses with empirical NFA literature, providing a principled alternative that provably and empirically holds at convergence.

Abstract: It is a central challenge in deep learning to understand how neural networks learn representations. A leading approach is the Neural Feature Ansatz (NFA) (Radhakrishnan et al. 2024), a conjectured mechanism for how feature learning occurs. Although the NFA is empirically validated, it is an educated guess and lacks a theoretical basis, and thus it is unclear when it might fail, and how to improve it. In this paper, we take a first-principles approach to understanding why this observation holds, and when it does not. We use first-order optimality conditions to derive the Features at Convergence Theorem (FACT), an alternative to the NFA that (a) obtains greater agreement with learned features at convergence, (b) explains why the NFA holds in most settings, and (c) captures essential feature learning phenomena in neural networks such as grokking behavior in modular arithmetic and phase transitions in learning sparse parities, similarly to the NFA. Thus, our results unify theoretical first-order optimality analyses of neural networks with the empirically-driven NFA literature, and provide a principled alternative that provably and empirically holds at convergence.

[341] Simple Yet Effective: An Information-Theoretic Approach to Multi-LLM Uncertainty Quantification

Maya Kruse, Majid Afshar, Saksham Khatwani, Anoop Mayampurath, Guanhua Chen, Yanjun Gao

Main category: cs.LG

TL;DR: MUSE is a method that leverages multiple LLMs’ complementary predictions through Jensen-Shannon Divergence to identify well-calibrated subsets, improving uncertainty quantification and calibration in binary prediction tasks.

DetailsMotivation: LLMs often show inconsistent behavior across inputs, indicating uncertainty that needs quantification in high-stakes settings. Prior work focuses on individual models, overlooking the potential benefits of model diversity.

Method: Proposes MUSE (Multi-LLM Uncertainty via Subset Ensembles), an information-theoretic method using Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs. Also explores using MUSE with chain-of-thought distillation for fine-tuning.

Result: Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naive ensemble baselines.

Conclusion: Aggregating outputs from diverse LLMs leads to more reliable uncertainty estimates, and MUSE provides an effective framework for leveraging model diversity to improve calibration in language models.

Abstract: Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings. Prior work on calibration and uncertainty quantification often focuses on individual models, overlooking the potential of model diversity. We hypothesize that LLMs make complementary predictions due to differences in training and the Zipfian nature of language, and that aggregating their outputs leads to more reliable uncertainty estimates. To leverage this, we propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a simple information-theoretic method that uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naïve ensemble baselines. In addition, we explore using MUSE as guided signals with chain-of-thought distillation to fine-tune LLMs for calibration. MUSE is available at: https://github.com/LARK-NLP-Lab/MUSE.
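
The subset-selection idea can be sketched with pairwise Jensen-Shannon divergences over the models' predictive distributions: keep the largest subset whose average pairwise JSD stays under a budget, then average its members. The selection criterion and threshold here are assumptions, not MUSE's exact rule.

```python
import numpy as np
from itertools import combinations

def jsd(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two discrete distributions.
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    m = (p + q) / 2
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def muse_like_aggregate(model_probs, max_avg_jsd=0.05):
    n, best = len(model_probs), (1, None)
    for r in range(2, n + 1):
        for idx in combinations(range(n), r):
            avg = np.mean([jsd(model_probs[i], model_probs[j])
                           for i, j in combinations(idx, 2)])
            if avg <= max_avg_jsd and r > best[0]:
                best = (r, idx)
    keep = best[1] or tuple(range(n))  # fall back to all models
    return np.mean([model_probs[i] for i in keep], axis=0)

# Toy usage: three LLMs' P(yes/no) on one binary question; the outlier
# third model gets excluded from the ensemble.
probs = [np.array([0.70, 0.30]), np.array([0.65, 0.35]), np.array([0.20, 0.80])]
print(muse_like_aggregate(probs))  # ~[0.675, 0.325]
```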

[342] Quantifying Holistic Review: A Multi-Modal Approach to College Admissions Prediction

Jun-Wei Zeng, Jerry Shen

Main category: cs.LG

TL;DR: CAPS is a multi-modal framework that quantitatively models college admissions using academic scores, essay quality, and extracurricular engagement with transformer embeddings and XGBoost, achieving strong performance metrics.

DetailsMotivation: To address opacity, inconsistency, and applicant anxiety in traditional holistic college admissions reviews by creating a transparent, explainable quantitative evaluation system.

Method: Decomposes applicant profiles into three components (SAS, EQI, EIS) using transformer-based semantic embeddings, LLM scoring, and XGBoost regression on a synthetic realistic dataset.

Result: Achieved EQI prediction R^2 of 0.80, classification accuracy over 75%, macro F1 score of 0.69, and weighted F1 score of 0.74, demonstrating strong alignment with human judgment.

Conclusion: CAPS provides equitable and data-informed admissions practices by offering transparent and explainable evaluations that address key limitations of traditional holistic review systems.

Abstract: This paper introduces the Comprehensive Applicant Profile Score (CAPS), a novel multi-modal framework designed to quantitatively model and interpret holistic college admissions evaluations. CAPS decomposes applicant profiles into three interpretable components: academic performance (Standardized Academic Score, SAS), essay quality (Essay Quality Index, EQI), and extracurricular engagement (Extracurricular Impact Score, EIS). Leveraging transformer-based semantic embeddings, LLM scoring, and XGBoost regression, CAPS provides transparent and explainable evaluations aligned with human judgment. Experiments on a synthetic but realistic dataset demonstrate strong performance, achieving an EQI prediction R^2 of 0.80, classification accuracy over 75%, a macro F1 score of 0.69, and a weighted F1 score of 0.74. CAPS addresses key limitations in traditional holistic review – particularly the opacity, inconsistency, and anxiety faced by applicants – thus paving the way for more equitable and data-informed admissions practices.

[343] Train-Once Plan-Anywhere Kinodynamic Motion Planning via Diffusion Trees

Yaniv Hassidof, Tom Jurgenson, Kiril Solovey

Main category: cs.LG

TL;DR: DiTree combines diffusion policies with sampling-based planners to achieve provably-safe kinodynamic motion planning with better generalization than learning-only approaches and faster performance than traditional planners.

DetailsMotivation: Traditional sampling-based planners are slow due to uninformed action sampling, while learning-based approaches lack safety guarantees and fail to generalize to out-of-distribution scenarios, limiting their deployment on physical robots.

Method: DiTree leverages diffusion policies as informed samplers to guide state-space search within sampling-based planners, combining DP’s ability to model expert trajectories with the completeness guarantees of SBPs.

Result: DiTree achieves 30% higher success rate compared to standalone approaches, works on complex dynamical systems (car and ant robot), and demonstrates superior trajectory quality and robustness in real-world experiments with severe sim-to-real gaps.

Conclusion: DiTree provides a provably-generalizable framework that combines the benefits of learning-based efficiency with the safety guarantees of sampling-based planners, enabling reliable deployment on physical robots.

Abstract: Kinodynamic motion planning is concerned with computing collision-free trajectories while abiding by the robot’s dynamic constraints. This critical problem is often tackled using sampling-based planners (SBPs) that explore the robot’s high-dimensional state space by constructing a search tree via action propagations. Although SBPs can offer global guarantees on completeness and solution quality, their performance is often hindered by slow exploration due to uninformed action sampling. Learning-based approaches can yield significantly faster runtimes, yet they fail to generalize to out-of-distribution (OOD) scenarios and lack critical guarantees, e.g., safety, thus limiting their deployment on physical robots. We present Diffusion Tree (DiTree): a provably-generalizable framework leveraging diffusion policies (DPs) as informed samplers to efficiently guide state-space search within SBPs. DiTree combines DP’s ability to model complex distributions of expert trajectories, conditioned on local observations, with the completeness of SBPs to yield provably-safe solutions within a few action propagation iterations for complex dynamical systems. We demonstrate DiTree’s power with an implementation combining the popular RRT planner with a DP action sampler trained on a single environment. In comprehensive evaluations on OOD scenarios, DiTree achieves on average a 30% higher success rate compared to standalone DP or SBPs, on a dynamic car and Mujoco’s ant robot settings (for the latter, SBPs fail completely). Beyond simulation, real-world car experiments confirm DiTree’s applicability, demonstrating superior trajectory quality and robustness even under severe sim-to-real gaps. Project webpage: https://sites.google.com/view/ditree.

[344] Crystal Structure Prediction with a Geometric Permutation-Invariant Loss Function

Emmanuel Jehanno, Romain Menegaux, Julien Mairal, Sergei Grudinin

Main category: cs.LG

TL;DR: SinkFast - a novel method for molecular crystal structure prediction using differentiable linear assignment with Sinkhorn algorithm, outperforming complex flow-matching approaches.

DetailsMotivation: Accurate prediction of 3D crystal structures for organic materials remains challenging despite computational advances, particularly for molecular assembly problems where identical rigid molecules need to be packed efficiently.

Method: Proposes a novel loss function that captures key geometric molecular properties while maintaining permutation invariance through a differentiable linear assignment scheme based on the Sinkhorn algorithm.

Result: Significantly outperforms more complex flow-matching approaches on the COD-Cluster17 benchmark, demonstrating that even simple regression with this method achieves superior results.

Conclusion: The SinkFast approach provides an effective and computationally efficient solution for molecular crystal structure prediction, offering better performance than existing state-of-the-art methods.

Abstract: Crystalline structure prediction remains an open challenge in materials design. Despite recent advances in computational materials science, accurately predicting the three-dimensional crystal structures of organic materials – an essential first step for designing materials with targeted properties – remains elusive. In this work, we address the problem of molecular assembly, where a set $\mathcal{S}$ of identical rigid molecules is packed to form a crystalline structure. Existing state-of-the-art models typically rely on computationally expensive, iterative flow-matching approaches. We propose a novel loss function that correctly captures key geometric molecular properties while maintaining permutation invariance over $\mathcal{S}$. We achieve this via a differentiable linear assignment scheme based on the Sinkhorn algorithm. Remarkably, we show that even a simple regression using our method *SinkFast* significantly outperforms more complex flow-matching approaches on the COD-Cluster17 benchmark, a curated subset of the Crystallography Open Database (COD).
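
The differentiable assignment at the heart of the loss is the Sinkhorn algorithm: exponentiate a negative cost matrix and alternately normalize rows and columns toward a doubly-stochastic soft permutation. A generic sketch under that reading (the geometric molecular terms of the actual loss are omitted):

```python
import torch

def sinkhorn(cost, n_iters=50, tau=0.1):
    # Soft assignment: smaller cost -> larger transport weight; the
    # alternating normalizations approach a doubly-stochastic matrix.
    K = torch.exp(-cost / tau)
    for _ in range(n_iters):
        K = K / K.sum(dim=1, keepdim=True)  # normalize rows
        K = K / K.sum(dim=0, keepdim=True)  # normalize columns
    return K

# Toy usage: a permutation-invariant loss that soft-matches predicted
# items to targets before measuring error; gradients flow end to end.
pred = torch.randn(5, 3, requires_grad=True)
target = torch.randn(5, 3)
cost = torch.cdist(pred, target)
P = sinkhorn(cost)
loss = (P * cost).sum()   # transport cost under the soft assignment
loss.backward()
print(P.sum(dim=1))       # each row sums to ~1
```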

[345] Fantastic Pretraining Optimizers and Where to Find Them

Kaiyue Wen, David Hall, Tengyu Ma, Percy Liang

Main category: cs.LG

TL;DR: Systematic study reveals that claimed 1.4-2x optimizer speedups over AdamW are exaggerated due to unfair comparisons. Proper tuning shows matrix-based optimizers offer only 1.1-1.4x speedup, decreasing with model size.

DetailsMotivation: Previous optimizer comparisons suffered from unequal hyperparameter tuning and misleading evaluation setups, obscuring fair comparisons and hindering practical adoption of potentially better optimizers.

Method: Conducted systematic study of 10 deep learning optimizers across 4 model scales (0.1B-1.2B parameters) and data-to-model ratios (1-8x Chinchilla optimum), with rigorous hyperparameter tuning and end-of-training evaluations.

Result: Matrix-based optimizers (Muon, Soap) show speedup over AdamW but it decreases from 1.4x for 0.1B models to only 1.1x for 1.2B models. Optimal hyperparameters are optimizer-specific and intermediate checkpoint comparisons can be misleading.

Conclusion: AdamW remains competitive, with alternative optimizers providing diminishing returns as model size increases. Fair comparisons require rigorous tuning and end-of-training evaluation across multiple scales.

Abstract: AdamW has long been the dominant optimizer in language model pretraining, despite numerous claims that alternative optimizers offer 1.4 to 2x speedup. We posit that two methodological shortcomings have obscured fair comparisons and hindered practical adoption: (i) unequal hyperparameter tuning and (ii) limited or misleading evaluation setups. To address these two issues, we conduct a systematic study of ten deep learning optimizers across four model scales (0.1B-1.2B parameters) and data-to-model ratios (1-8x the Chinchilla optimum). We find that fair and informative comparisons require rigorous hyperparameter tuning and evaluations across a range of model scales and data-to-model ratios, performed at the end of training. First, optimal hyperparameters for one optimizer may be suboptimal for another, making blind hyperparameter transfer unfair. Second, the actual speedup of many proposed optimizers over well-tuned baselines is lower than claimed and decreases with model size to only 1.1x for 1.2B parameter models. Third, comparing intermediate checkpoints before reaching the target training budgets can be misleading, as rankings between two optimizers can flip during training due to learning rate decay. Through our thorough investigation, we find that all the fastest optimizers, such as Muon and Soap, use matrices as preconditioners – multiplying gradients with matrices rather than entry-wise scalars. However, the speedup of matrix-based optimizers is inversely proportional to model scale, decreasing from 1.4x over AdamW for 0.1B parameter models to merely 1.1x for 1.2B parameter models.

[346] Abex-rat: Synergizing Abstractive Augmentation and Adversarial Training for Classification of Occupational Accident Reports

Jian Chen, Jiabao Dou, Jinbao Tian, Yunqi Xu, Zhou Li

Main category: cs.LG

TL;DR: ABEX-RAT is a novel framework that combines generative data augmentation and adversarial training to address class imbalance in occupational accident report classification, achieving state-of-the-art performance with 90.32% macro-F1 score.

DetailsMotivation: Severe class imbalance in occupational accident datasets compromises model performance, especially for rare but severe incident types, hindering reliable automated classification systems for workplace safety.

Method: Two-step approach: 1) ABEX pipeline uses LLM to distill core incident semantics and generative model to create diverse synthetic samples for underrepresented classes; 2) Lightweight classifier trained with random adversarial training (RAT) that stochastically applies perturbations for enhanced generalization.

Result: Achieves new SOTA performance on OSHA dataset with 90.32% macro-F1 score, significantly outperforming previous SOTA and fine-tuned large model baselines.

Conclusion: The synergistic strategy of generative data augmentation with adversarial training is highly effective and efficient alternative to brute-force fine-tuning for specialized, imbalanced classification tasks.

Abstract: The automatic classification of occupational accident reports is a critical research area for enhancing workplace safety and enabling large-scale risk analysis. However, the severe class imbalance inherent in these real-world datasets often compromises the performance of analytical models, particularly for rare but severe incident types, hindering the development of reliable automated systems. To address this challenge, we propose ABEX-RAT, a novel and efficient framework that synergizes generative data augmentation with robust adversarial training. Our approach first employs a two-step abstractive-expansive (ABEX) pipeline, which leverages a large language model to distill core incident semantics and then uses a generative model to create diverse, high-quality synthetic samples for underrepresented classes. Subsequently, a lightweight classifier is trained on the augmented data using a computationally efficient random adversarial training (RAT) protocol, which stochastically applies perturbations to enhance model generalization and robustness without significant overhead. Experimental results on the public OSHA dataset demonstrate that our method achieves new state-of-the-art performance, reaching a macro-F1 score of 90.32% and significantly outperforming previous SOTA and fine-tuned large model baselines. Our work validates that this synergistic strategy is a highly effective and efficient alternative to brute-force fine-tuning for specialized, imbalanced classification tasks. The code is publicly available at: https://github.com/nxcc-lab/ABEX-RAT.

[347] CEHR-XGPT: A Scalable Multi-Task Foundation Model for Electronic Health Records

Chao Pang, Jiheum Park, Xinzhuo Jiang, Nishanth Parameshwar Pavinkurve, Krishna S. Kalluri, Shalmali Joshi, Noémie Elhadad, Karthik Natarajan

Main category: cs.LG

TL;DR: CEHR-XGPT is a general-purpose foundation model for EHR data that unifies feature representation, zero-shot prediction, and synthetic data generation in a single architecture with time-token-based learning for temporal reasoning.

DetailsMotivation: Most AI models for EHRs are designed for narrow single-purpose tasks, limiting their generalizability and utility in real-world clinical settings.

Method: Developed CEHR-XGPT with a novel time-token-based learning framework that explicitly encodes patients’ dynamic timelines into the model structure, enabling temporal reasoning over clinical sequences.

Result: Demonstrates strong performance across all three tasks (feature representation, zero-shot prediction, synthetic data generation) and generalizes effectively to external datasets through vocabulary expansion and fine-tuning.

Conclusion: CEHR-XGPT enables rapid model development, cohort discovery, and patient outcome forecasting without task-specific retraining, providing a versatile foundation model for EHR data analysis.

Abstract: Electronic Health Records (EHRs) provide a rich, longitudinal view of patient health and hold significant potential for advancing clinical decision support, risk prediction, and data-driven healthcare research. However, most artificial intelligence (AI) models for EHRs are designed for narrow, single-purpose tasks, limiting their generalizability and utility in real-world settings. Here, we present CEHR-XGPT, a general-purpose foundation model for EHR data that unifies three essential capabilities - feature representation, zero-shot prediction, and synthetic data generation - within a single architecture. To support temporal reasoning over clinical sequences, CEHR-XGPT incorporates a novel time-token-based learning framework that explicitly encodes patients’ dynamic timelines into the model structure. CEHR-XGPT demonstrates strong performance across all three tasks and generalizes effectively to external datasets through vocabulary expansion and fine-tuning. Its versatility enables rapid model development, cohort discovery, and patient outcome forecasting without the need for task-specific retraining.
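
The time-token idea can be sketched by interleaving clinical concept codes with artificial tokens for the elapsed time between events; the bucket boundaries and token names below are illustrative, not CEHR-XGPT's actual vocabulary.

```python
from datetime import date

def to_time_token(days: int) -> str:
    # Bucket inter-event gaps into coarse artificial time tokens
    # (illustrative boundaries, not the paper's scheme).
    if days == 0:
        return "<same-day>"
    if days <= 7:
        return f"<wait-{days}d>"
    if days <= 30:
        return "<wait-weeks>"
    return "<wait-months+>"

def encode_timeline(events):
    """Interleave concept codes with time tokens so a sequence model
    can reason over the patient's dynamic timeline."""
    tokens, prev = [], None
    for when, code in sorted(events):
        if prev is not None:
            tokens.append(to_time_token((when - prev).days))
        tokens.append(code)
        prev = when
    return tokens

visits = [(date(2020, 1, 1), "ICD10:E11.9"),
          (date(2020, 1, 4), "RX:metformin"),
          (date(2020, 3, 2), "LOINC:4548-4")]
print(encode_timeline(visits))
# ['ICD10:E11.9', '<wait-3d>', 'RX:metformin', '<wait-months+>', 'LOINC:4548-4']
```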

cs.MA

[348] Emergent Social Dynamics of LLM Agents in the El Farol Bar Problem

Ryosuke Takata, Atsushi Masumori, Takashi Ikegammi

Main category: cs.MA

TL;DR: LLM agents in El Farol Bar problem show emergent social dynamics, balancing game-theoretic rationality with human-like social motivations, creating new collective decision-making models.

DetailsMotivation: To investigate how LLM agents navigate social dilemmas and whether they exhibit human-like social behaviors and collective decision-making in classic game theory problems.

Method: Used LLM agents in a spatially extended El Farol Bar problem, observing their autonomous decision-making with prompt-specified constraints (60% threshold) and pre-trained social preferences.

Result: LLM agents developed spontaneous motivation to attend the bar, formed collective decision-making behaviors, and balanced external constraints with internal social preferences, behaving more like humans than perfect rational agents.

Conclusion: LLM agents can realize new models of group decision-making that combine formal rationality with social motivations, offering insights beyond traditional game-theoretic approaches for understanding human-like collective behavior.

Abstract: We investigate the emergent social dynamics of Large Language Model (LLM) agents in a spatially extended El Farol Bar problem, observing how they autonomously navigate this classic social dilemma. As a result, the LLM agents generated a spontaneous motivation to go to the bar and changed their decision making by becoming a collective. We also observed that the LLM agents did not solve the problem completely, but rather behaved more like humans. These findings reveal a complex interplay between external incentives (prompt-specified constraints such as the 60% threshold) and internal incentives (culturally-encoded social preferences derived from pre-training), demonstrating that LLM agents naturally balance formal game-theoretic rationality with social motivations that characterize human behavior. These findings suggest that a new model of group decision making, which could not be handled in the previous game-theoretic problem setting, can be realized by LLM agents.

[349] LLM Enabled Multi-Agent System for 6G Networks: Framework and Method of Dual-Loop Edge-Terminal Collaboration

Zheyan Qu, Wenbo Wang, Zitong Yu, Boquan Sun, Yang Li, Xing Zhang

Main category: cs.MA

TL;DR: A dual-loop terminal-edge collaboration framework for LLM-enabled multi-agent systems in 6G networks that enhances planning capability through task decomposition and improves execution efficiency via parallel tool calling with offloading strategies.

DetailsMotivation: The limited resources of individual network devices hinder efficient operation of LLM-enabled agents with complex tool calls, creating an urgent need for efficient multi-level device collaborations in 6G networks.

Method: Proposes a dual-loop framework: outer loop for global agent and sub-agents collaboration with task decomposition and parallel distribution; inner loop for sub-agents to circularly reason, execute, and replan sub-tasks with parallel tool calling and offloading strategies.

Result: Improved task planning capability and execution efficiency validated through case study in 6G-supported urban safety governance.

Conclusion: The framework addresses resource limitations in 6G networks and accelerates the development of 6G era applications, though open challenges remain that require further research.

Abstract: The ubiquitous computing resources in 6G networks provide ideal environments for the fusion of large language models (LLMs) and intelligent services through the agent framework. With auxiliary modules and planning cores, LLM-enabled agents can autonomously plan and take actions to deal with diverse environment semantics and user intentions. However, the limited resources of individual network devices significantly hinder the efficient operation of LLM-enabled agents with complex tool calls, highlighting the urgent need for efficient multi-level device collaborations. To this end, the framework and method of the LLM-enabled multi-agent system with dual-loop terminal-edge collaborations are proposed in 6G networks. Firstly, the outer loop consists of the iterative collaborations between the global agent and multiple sub-agents deployed on edge servers and terminals, where the planning capability is enhanced through task decomposition and parallel sub-task distribution. Secondly, the inner loop utilizes sub-agents with dedicated roles to circularly reason, execute, and replan the sub-task, and the parallel tool calling generation with offloading strategies is incorporated to improve efficiency. The improved task planning capability and task execution efficiency are validated through the conducted case study in 6G-supported urban safety governance. Finally, the open challenges and future directions are thoroughly analyzed in 6G networks, accelerating the advent of the 6G era.

[350] Skill-Aligned Fairness in Multi-Agent Learning for Collaboration in Healthcare

Promise Osaine Ekpo, Brian La, Thomas Wiener, Saesha Agarwal, Arshia Agrawal, Gonzalo Gonzalez-Pumariega, Lekan P. Molu, Angelique Taylor

Main category: cs.MA

TL;DR: FairSkillMARL framework combines workload balance and skill-task alignment for fairness in healthcare MARL, with MARLHospital environment for testing. Results show workload-only fairness causes skill mismatches.

DetailsMotivation: Traditional MARL fairness focuses only on workload balance, ignoring the agent expertise and structured coordination needed in real domains like healthcare, where both workload distribution and skill-task alignment are crucial to prevent burnout.

Method: Proposed FairSkillMARL framework defining fairness as dual objective of workload balance and skill-task alignment. Created MARLHospital environment for testing team compositions and energy-constrained scheduling impacts.

Result: Experiments comparing FairSkillMARL with standard MARL methods and state-of-the-art fairness metrics showed that fairness based solely on equal workload leads to task-skill mismatches.

Conclusion: Work provides tools and foundation for studying fairness in heterogeneous multi-agent systems where aligning effort with expertise is critical, highlighting need for robust metrics capturing skill-task misalignment.

Abstract: Fairness in multi-agent reinforcement learning (MARL) is often framed as a workload balance problem, overlooking agent expertise and the structured coordination required in real-world domains. In healthcare, equitable task allocation requires workload balance or expertise alignment to prevent burnout and overuse of highly skilled agents. Workload balance refers to distributing an approximately equal number of subtasks or equalised effort across healthcare workers, regardless of their expertise. We make two contributions to address this problem. First, we propose FairSkillMARL, a framework that defines fairness as the dual objective of workload balance and skill-task alignment. Second, we introduce MARLHospital, a customizable healthcare-inspired environment for modeling team compositions and energy-constrained scheduling impacts on fairness, as no existing simulators are well-suited for this problem. We conducted experiments to compare FairSkillMARL in conjunction with four standard MARL methods, and against two state-of-the-art fairness metrics. Our results suggest that fairness based solely on equal workload might lead to task-skill mismatches and highlight the need for more robust metrics that capture skill-task misalignment. Our work provides tools and a foundation for studying fairness in heterogeneous multi-agent systems where aligning effort with expertise is critical.
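
As a concrete reading of the dual objective, the sketch below scores an allocation by combining workload imbalance with a skill-task mismatch rate; the exact FairSkillMARL formulation is not given in this summary, so the two terms and the `alpha` weighting are assumptions.

```python
# Illustrative dual-objective fairness score (assumed form, not the paper's).
import numpy as np

def fairness_score(workloads, skills, assignments, alpha=0.5):
    """workloads: subtask counts per agent; skills[i][t] = 1 if agent i has the
    skill for task type t; assignments: list of (agent, task_type) pairs."""
    workloads = np.asarray(workloads, dtype=float)
    imbalance = workloads.std() / (workloads.mean() + 1e-8)      # workload balance term
    mismatches = sum(1 for a, t in assignments if not skills[a][t])
    mismatch_rate = mismatches / max(len(assignments), 1)        # skill-task alignment term
    return -(alpha * imbalance + (1 - alpha) * mismatch_rate)    # higher is fairer

# Two clinicians with equal workloads, but one subtask is assigned off-skill.
skills = [[1, 1], [1, 0]]
print(fairness_score([2, 2], skills, [(0, 0), (0, 1), (1, 0), (1, 1)]))
```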

cs.MM

[351] REMOTE: A Unified Multimodal Relation Extraction Framework with Multilevel Optimal Transport and Mixture-of-Experts

Xinkui Lin, Yongxiu Xu, Minghao Tang, Shilong Zhang, Hongbo Xu, Hao Xu, Yubin Wang

Main category: cs.MM

TL;DR: REMOTE is a unified multimodal relation extraction framework that uses multilevel optimal transport and mixture-of-experts to simultaneously extract intra-modal and inter-modal relations between text entities and visual objects, achieving state-of-the-art performance.

DetailsMotivation: Existing multimodal relation extraction methods are limited to extracting single types of relational triplets and cannot capture dynamic cross-modal interactions efficiently, leading to computational redundancy.

Method: Proposes REMOTE framework with mixture-of-experts mechanism for dynamic feature selection and multilevel optimal transport fusion module to preserve low-level features while maintaining multilayer encoding.

Result: Extensive experiments show REMOTE effectively extracts various types of relational triplets and achieves state-of-the-art performance on almost all metrics across multiple datasets.

Conclusion: REMOTE provides a unified solution for multimodal relation extraction that handles diverse entity combinations (text-text, image-image, text-image) while maintaining computational efficiency and preserving feature information across encoding layers.

Abstract: Multimodal relation extraction (MRE) is a crucial task in the fields of Knowledge Graph and Multimedia, playing a pivotal role in multimodal knowledge graph construction. However, existing methods are typically limited to extracting a single type of relational triplet, which restricts their ability to extract triplets beyond the specified types. Directly combining these methods fails to capture dynamic cross-modal interactions and introduces significant computational redundancy. Therefore, we propose a novel unified multimodal Relation Extraction framework with Multilevel Optimal Transport and mixture-of-Experts, termed REMOTE, which can simultaneously extract intra-modal and inter-modal relations between textual entities and visual objects. To dynamically select optimal interaction features for different types of relational triplets, we introduce a mixture-of-experts mechanism, ensuring the most relevant modality information is utilized. Additionally, considering that the inherent property of multilayer sequential encoding in existing encoders often leads to the loss of low-level information, we adopt a multilevel optimal transport fusion module to preserve low-level features while maintaining multilayer encoding, yielding more expressive representations. Correspondingly, we also create a Unified Multimodal Relation Extraction (UMRE) dataset to evaluate the effectiveness of our framework, encompassing diverse cases where the head and tail entities can originate from either text or image. Extensive experiments show that REMOTE effectively extracts various types of relational triplets and achieves state-of-the-art performance on almost all metrics across two other public MRE datasets. We release our resources at https://github.com/Nikol-coder/REMOTE.

[352] An Emotion Recognition Framework via Cross-modal Alignment of EEG and Eye Movement Data

Jianlu Wang, Yanan Wang, Tong Liu

Main category: cs.MM

TL;DR: A multimodal emotion recognition framework using EEG and eye movement data with cross-modal attention achieves 90.62% accuracy on SEED-IV dataset.

DetailsMotivation: Conventional single-modality emotion recognition systems fail to capture the complexity of affective states, necessitating multimodal approaches.

Method: Hybrid architecture based on cross-modal attention mechanism for accurate multimodal alignment of EEG and eye movement data.

Result: Achieves 90.62% accuracy on the SEED-IV dataset, demonstrating superior performance.

Conclusion: Provides a promising foundation for leveraging multimodal data in emotion recognition applications.

Abstract: Emotion recognition is essential for applications in affective computing and behavioral prediction, but conventional systems relying on single-modality data often fail to capture the complexity of affective states. To address this limitation, we propose an emotion recognition framework that achieves accurate multimodal alignment of Electroencephalogram (EEG) and eye movement data through a hybrid architecture based on a cross-modal attention mechanism. Experiments on the SEED-IV dataset demonstrate that our method achieves 90.62% accuracy. This work provides a promising foundation for leveraging multimodal data in emotion recognition.
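
A minimal sketch of the kind of cross-modal attention described above, with EEG features attending to eye-movement features; the dimensions, token counts, and residual fusion are illustrative assumptions, not the paper's architecture.

```python
# Cross-modal attention sketch: EEG queries attend over eye-movement features.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, eeg, eye):
        # Queries come from EEG; keys/values from eye movements, aligning the
        # two modalities (a symmetric eye->EEG branch could be added).
        aligned, _ = self.attn(query=eeg, key=eye, value=eye)
        return aligned + eeg  # residual fusion

eeg = torch.randn(8, 62, 128)  # (batch, EEG tokens, feature dim) -- assumed shapes
eye = torch.randn(8, 31, 128)  # (batch, eye-movement tokens, feature dim)
print(CrossModalAttention()(eeg, eye).shape)  # torch.Size([8, 62, 128])
```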

eess.AS

[353] On Time Delay Interpolation for Improved Acoustic Reflector Localization

Hannes Rosseel, Toon van Waterschoot

Main category: eess.AS

TL;DR: Comprehensive study on subsample time delay interpolation methods for acoustic reflector localization, showing sinc and Whittaker-Shannon interpolation outperform existing methods in both simulations and real-world measurements.

DetailsMotivation: Traditional TDE algorithms lack sufficient time resolution as they yield time delays that are integer multiples of the sampling period, limiting precision in acoustic reflector localization applications.

Method: Derived Whittaker-Shannon interpolation formula from sinc interpolation for short-time windowed TDE, compared various interpolation methods (parabolic, Gaussian, frequency, sinc) through simulations and real-world measurements from MYRiAD dataset.

Result: Sinc and Whittaker-Shannon interpolation outperformed existing methods in time delay error and positional error for critically sampled and band-limited reflections, showing consistent reliable performance across different sensor-source pairs and loudspeaker positions.

Conclusion: These interpolation methods can significantly enhance the precision of acoustic reflector localization systems for applications in room acoustics analysis, sound source localization, and acoustic scene analysis.

Abstract: The localization of acoustic reflectors is a fundamental component in various applications, including room acoustics analysis, sound source localization, and acoustic scene analysis. Time Delay Estimation (TDE) is essential for determining the position of reflectors relative to a sensor array. Traditional TDE algorithms generally yield time delays that are integer multiples of the operating sampling period, potentially lacking sufficient time resolution. To achieve subsample TDE accuracy, various interpolation methods, including parabolic, Gaussian, frequency, and sinc interpolation, have been proposed. This paper presents a comprehensive study on time delay interpolation to achieve subsample accuracy for acoustic reflector localization in reverberant conditions. We derive the Whittaker-Shannon interpolation formula from the previously proposed sinc interpolation in the context of short-time windowed TDE for acoustic reflector localization. Simulations show that sinc and Whittaker-Shannon interpolation outperform existing methods in terms of time delay error and positional error for critically sampled and band-limited reflections. Performance is evaluated on real-world measurements from the MYRiAD dataset, showing that sinc and Whittaker-Shannon interpolation consistently provide reliable performance across different sensor-source pairs and loudspeaker positions. These results can enhance the precision of acoustic reflector localization systems, vital for applications such as room acoustics analysis, sound source localization, and acoustic scene analysis.
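
The Whittaker-Shannon formula reconstructs a band-limited signal from its samples as $x(t) = \sum_n x[n]\,\mathrm{sinc}\!\big((t - nT)/T\big)$, which is what allows a cross-correlation peak to be located between sample instants. The sketch below applies this to subsample peak refinement; it is a simplified version of the idea, not the paper's short-time windowed implementation.

```python
# Subsample time-delay estimation via sinc (Whittaker-Shannon) interpolation.
import numpy as np

def sinc_interp(samples, t):
    """Evaluate x(t) = sum_n x[n] * sinc(t - n), with t in sampling-period units."""
    n = np.arange(len(samples))
    return np.dot(samples, np.sinc(t - n))

def subsample_peak(xcorr, oversample=100):
    k = int(np.argmax(xcorr))                        # integer-lag peak
    ts = k + np.linspace(-1, 1, 2 * oversample + 1)  # refine within +/- 1 sample
    vals = [sinc_interp(xcorr, t) for t in ts]
    return ts[int(np.argmax(vals))]

# Example: a band-limited pulse delayed by 5.3 samples.
n = np.arange(64)
pulse = np.sinc(n - 20.0)
delayed = np.sinc(n - 25.3)
xcorr = np.correlate(delayed, pulse, mode="full")    # lag = index - (len(n) - 1)
delay = subsample_peak(xcorr) - (len(n) - 1)
print(f"estimated delay: {delay:.2f} samples")       # ~5.30, despite 1-sample grid
```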

[354] DarkStream: real-time speech anonymization with low latency

Waris Quamer, Ricardo Gutierrez-Osuna

Main category: eess.AS

TL;DR: DarkStream is a real-time streaming speech synthesis model for speaker anonymization that combines causal waveform encoding, transformer layers, and GAN-generated pseudo-speaker embeddings to achieve strong privacy protection with low latency.

DetailsMotivation: To enable privacy-preserving real-time speech communication by anonymizing speaker identity while maintaining low latency and acceptable speech intelligibility under strict streaming constraints.

Method: Uses causal waveform encoder with short lookahead buffer, transformer-based contextual layers, direct waveform generation via neural vocoder (bypassing mel-spectrogram conversion), and injects GAN-generated pseudo-speaker embeddings into linguistic features.

Result: Achieves near-chance speaker verification performance (50% EER) on lazy-informed attack scenario while maintaining acceptable linguistic intelligibility (WER within 9%).

Conclusion: DarkStream provides a practical solution for real-time speech anonymization by balancing low latency, robust privacy protection, and minimal intelligibility degradation.

Abstract: We propose DarkStream, a streaming speech synthesis model for real-time speaker anonymization. To improve content encoding under strict latency constraints, DarkStream combines a causal waveform encoder, a short lookahead buffer, and transformer-based contextual layers. To further reduce inference time, the model generates waveforms directly via a neural vocoder, thus removing intermediate mel-spectrogram conversions. Finally, DarkStream anonymizes speaker identity by injecting a GAN-generated pseudo-speaker embedding into linguistic features from the content encoder. Evaluations show our model achieves strong anonymization, yielding close to 50% speaker verification EER (near-chance performance) in the lazy-informed attack scenario, while maintaining acceptable linguistic intelligibility (WER within 9%). By balancing low latency, robust privacy, and minimal intelligibility degradation, DarkStream provides a practical solution for privacy-preserving real-time speech communication.

[355] Say More with Less: Variable-Frame-Rate Speech Tokenization via Adaptive Clustering and Implicit Duration Coding

Rui-Chen Zheng, Wenrui Liu, Hui-Peng Du, Qinglin Zhang, Chong Deng, Qian Chen, Wen Wang, Yang Ai, Zhen-Hua Ling

Main category: eess.AS

TL;DR: VARSTok is a variable-frame-rate speech tokenizer that adapts token allocation based on local feature similarity, outperforming fixed-rate baselines with fewer tokens and better quality.

DetailsMotivation: Existing speech tokenizers use fixed token rates that mismatch speech's uneven information distribution over time, leading to inefficient representation.

Method: Uses temporal-aware density peak clustering for adaptive segmentation and implicit duration coding that combines content and temporal span in single tokens.

Result: Achieves superior reconstruction naturalness with 23% fewer tokens than 40Hz baseline, lower word error rates, and improved TTS synthesis quality.

Conclusion: First demonstration that fully dynamic variable-frame-rate speech tokenizers can be seamlessly integrated into downstream speech language models effectively.

Abstract: Existing speech tokenizers typically assign a fixed number of tokens per second, regardless of the varying information density or temporal fluctuations in the speech signal. This uniform token allocation mismatches the intrinsic structure of speech, where information is distributed unevenly over time. To address this, we propose VARSTok, a VAriable-frame-Rate Speech Tokenizer that adapts token allocation based on local feature similarity. VARSTok introduces two key innovations: (1) a temporal-aware density peak clustering algorithm that adaptively segments speech into variable-length units, and (2) a novel implicit duration coding scheme that embeds both content and temporal span into a single token index, eliminating the need for auxiliary duration predictors. Extensive experiments show that VARSTok significantly outperforms strong fixed-rate baselines. Notably, it achieves superior reconstruction naturalness while using up to 23% fewer tokens than a 40 Hz fixed-frame-rate baseline. VARSTok further yields lower word error rates and improved naturalness in zero-shot text-to-speech synthesis. To the best of our knowledge, this is the first work to demonstrate that a fully dynamic, variable-frame-rate acoustic speech tokenizer can be seamlessly integrated into downstream speech language models. Speech samples are available at https://zhengrachel.github.io/VARSTok.
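
The abstract's "implicit duration coding" packs a unit's content and its temporal span into one integer. One plausible instance of such a packing is sketched below; the actual mapping used by VARSTok may differ, so treat `MAX_DURATION` and the arithmetic as assumptions.

```python
# Assumed packing: token = content_id * MAX_DURATION + (duration - 1).
MAX_DURATION = 8  # assumed cap on how many frames a single token may span

def encode(content_id: int, duration: int) -> int:
    assert 1 <= duration <= MAX_DURATION
    return content_id * MAX_DURATION + (duration - 1)

def decode(token: int) -> tuple[int, int]:
    return token // MAX_DURATION, token % MAX_DURATION + 1

tok = encode(content_id=417, duration=3)
print(tok, decode(tok))  # 3338 (417, 3) -- no auxiliary duration predictor needed
```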

[356] Layer-wise Analysis for Quality of Multilingual Synthesized Speech

Erica Cooper, Takuma Okamoto, Yamato Ohtani, Tomoki Toda, Hisashi Kawai

Main category: eess.AS

TL;DR: Layer-wise analysis of multilingual pretrained speech models shows early SSL layers correlate with human ratings of synthesized speech quality, while later ASR layers predict non-neural system quality and intelligibility.

DetailsMotivation: Supervised quality predictors require labeled data and lack generalization, while unsupervised SSL/ASR approaches show promise but their quality encoding mechanisms are not well understood in multilingual settings.

Method: Conducted layer-wise analysis of multilingual pretrained speech models using reference modeling to understand how different speech quality aspects are encoded across model layers.

Result: Early SSL layers show correlations with human ratings of synthesized speech; later ASR layers predict quality of non-neural systems and intelligibility; well-matched reference data is crucial.

Conclusion: Different layers of pretrained speech models encode distinct quality aspects, with SSL and ASR models capturing complementary information, and proper reference data matching is essential for accurate quality assessment.

Abstract: While supervised quality predictors for synthesized speech have demonstrated strong correlations with human ratings, their requirement for in-domain labeled training data hinders their generalization ability to new domains. Unsupervised approaches based on pretrained self-supervised learning (SSL) based models and automatic speech recognition (ASR) models are a promising alternative; however, little is known about how these models encode information about speech quality. Towards the goal of better understanding how different aspects of speech quality are encoded in a multilingual setting, we present a layer-wise analysis of multilingual pretrained speech models based on reference modeling. We find that features extracted from early SSL layers show correlations with human ratings of synthesized speech, and later layers of ASR models can predict quality of non-neural systems as well as intelligibility. We also demonstrate the importance of using well-matched reference data.
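
A minimal probe in the spirit of this analysis: pull per-layer hidden states from a multilingual SSL model for a synthesized utterance and its reference, then score each layer by feature distance. The model choice and the distance-based score are assumptions; correlating the per-layer scores with human ratings (e.g., via scipy.stats.spearmanr) would reproduce the layer-wise picture.

```python
# Layer-wise feature extraction sketch for reference-based quality probing.
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m").eval()

def layer_features(wave):
    with torch.no_grad():
        out = model(wave.unsqueeze(0), output_hidden_states=True)
    # One (1, T, D) tensor per layer, mean-pooled over time.
    return [h.mean(dim=1).squeeze(0) for h in out.hidden_states]

def layer_scores(synth, ref):
    # Reference modeling: negated distance between synthesized and reference
    # features, computed independently at every layer.
    return [-torch.dist(s, r).item()
            for s, r in zip(layer_features(synth), layer_features(ref))]

wave = torch.randn(16_000)            # one second of dummy 16 kHz audio
print(len(layer_features(wave)))      # embedding output + 24 layers = 25
```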

[357] Lightweight DNN for Full-Band Speech Denoising on Mobile Devices: Exploiting Long and Short Temporal Patterns

Konstantinos Drossos, Mikko Heikkinen, Paschalis Tsiaflakis

Main category: eess.AS

TL;DR: A lightweight, causal, low-latency deep neural network for full-band speech denoising optimized for mobile devices, achieving real-time performance with superior SI-SDR results.

DetailsMotivation: Most existing speech denoising methods are not optimized for resource-constrained platforms like mobile devices, and few focus on full-band signals (48 kHz) with low latency requirements.

Method: Modified UNet architecture with look-back frames, temporal spanning convolutional kernels, and recurrent neural networks. Uses STFT magnitude input, inverted bottlenecks from MobileNet, causal instance normalization, and operates on causal frame-by-frame basis.

Result: Achieves real-time factor below 0.02 on modern mobile phones and demonstrates superior SI-SDR performance compared to existing full-band and low latency speech denoising methods.

Conclusion: The proposed method successfully addresses the need for efficient, low-latency full-band speech denoising on resource-constrained mobile platforms while maintaining high performance.

Abstract: Speech denoising (SD) is an important task of many, if not all, modern signal processing chains used in devices and for everyday-life applications. While there are many published and powerful deep neural network (DNN)-based methods for SD, few are optimized for resource-constrained platforms such as mobile devices. Additionally, most DNN-based methods for SD do not focus on full-band (FB) signals, i.e. those with a 48 kHz sampling rate, and/or low-latency cases. In this paper we present a causal, low-latency, and lightweight DNN-based method for full-band SD, leveraging both short and long temporal patterns. The method is based on a modified UNet architecture employing look-back frames, temporal spanning of convolutional kernels, and recurrent neural networks for exploiting short and long temporal patterns in the signal and estimated denoising mask. The DNN operates on a causal frame-by-frame basis taking the STFT magnitude as input, utilizes inverted bottlenecks inspired by MobileNet, employs causal instance normalization for channel-wise normalization, and achieves a real-time factor below 0.02 when deployed on a modern mobile phone. The proposed method is evaluated using established speech denoising metrics and publicly available datasets, demonstrating its effectiveness in achieving an (SI-)SDR value that outperforms existing FB and low-latency SD methods.
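
The real-time factor quoted above is simply processing time divided by the duration of the audio processed; an RTF below 0.02 means each 10 ms frame must be denoised in under 0.2 ms. A minimal measurement sketch follows (the per-frame "model" here is a stand-in):

```python
# Real-time factor: wall-clock processing time / audio duration.
import time

def real_time_factor(process_frame, frame, sample_rate=48_000):
    start = time.perf_counter()
    process_frame(frame)
    elapsed = time.perf_counter() - start
    return elapsed / (len(frame) / sample_rate)

frame = [0.0] * 480  # one 10 ms frame at the full-band 48 kHz rate
print(real_time_factor(lambda f: [x * 0.5 for x in f], frame))
```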

[358] Room-acoustic simulations as an alternative to measurements for audio-algorithm evaluation

Georg Götz, Daniel Gert Nielsen, Steinar Guðjónsson, Finnur Pind

Main category: eess.AS

TL;DR: This paper explores using room-acoustic simulations instead of costly measurements for evaluating audio signal processing and machine learning algorithms, finding that wave-based simulations match measurement results better than geometrical acoustic simulations.

DetailsMotivation: Audio signal processing and machine learning algorithm evaluation typically requires expensive and time-consuming measurements with limited diversity. The paper aims to determine if room-acoustic simulations can provide reliable alternative evaluation data.

Method: The researchers evaluated three ASP/AML algorithms using both actual room-acoustic measurements and data from different simulation engines, including a numerical wave-based solver and two geometrical acoustics simulators, comparing the results.

Result: Numerical wave-based simulations produced evaluation results similar to actual measurements for all three algorithms, while geometrical acoustic simulations could not reliably replicate the measured evaluation results.

Conclusion: Room-acoustic simulations, particularly numerical wave-based methods, show promise for evaluating ASP/AML algorithms as they can provide results comparable to costly measurements, though geometrical acoustic approaches are less reliable.

Abstract: Audio-signal-processing and audio-machine-learning (ASP/AML) algorithms are ubiquitous in modern technology like smart devices, wearables, and entertainment systems. Development of such algorithms and models typically involves a formal evaluation to demonstrate their effectiveness and progress beyond the state-of-the-art. Ideally, a thorough evaluation should cover many diverse application scenarios and room-acoustic conditions. However, in practice, evaluation datasets are often limited in size and diversity because they rely on costly and time-consuming measurements. This paper explores how room-acoustic simulations can be used for evaluating ASP/AML algorithms. To this end, we evaluate three ASP/AML algorithms with room-acoustic measurements and data from different simulation engines, and assess the match between the evaluation results obtained from measurements and simulations. The presented investigation compares a numerical wave-based solver with two geometrical acoustics simulators. While numerical wave-based simulations yielded evaluation results similar to those of the measurements for all three evaluated ASP/AML algorithms, geometrical acoustic simulations could not replicate the measured evaluation results as reliably.

[359] MEAN-RIR: Multi-Modal Environment-Aware Network for Robust Room Impulse Response Estimation

Jiajian Chen, Jiakang Chen, Hang Chen, Qing Wang, Yu Gao, Jun Du

Main category: eess.AS

TL;DR: MEAN-RIR is a multi-modal network that predicts room impulse responses using audio, visual, and textual inputs through an encoder-decoder framework with cross-attention, achieving significant improvements in RIR estimation.

DetailsMotivation: To improve room impulse response prediction by leveraging multi-level environmental information from multiple modalities (audio, visual, text) rather than relying on single modality inputs.

Method: Encoder-decoder framework with separate encoders for audio (reverberant speech), visual (panoramic images), and text inputs. Uses cross-attention modules for modality interaction. Decoder generates direct sound/early reflections and late reverberation masks that modulate filtered noise.

Result: Significantly improves RIR estimation with notable gains in acoustic parameters compared to previous methods.

Conclusion: Multi-modal environmental information from audio, visual, and textual sources effectively enhances room impulse response prediction through cross-modal attention mechanisms.

Abstract: This paper presents a Multi-Modal Environment-Aware Network (MEAN-RIR), which uses an encoder-decoder framework to predict room impulse response (RIR) based on multi-level environmental information from audio, visual, and textual sources. Specifically, reverberant speech capturing room acoustic properties serves as the primary input, which is combined with panoramic images and text descriptions as supplementary inputs. Each input is processed by its respective encoder, and the outputs are fed into cross-attention modules to enable effective interaction between different modalities. The MEAN-RIR decoder generates two distinct components: the first component captures the direct sound and early reflections, while the second produces masks that modulate learnable filtered noise to synthesize the late reverberation. These two components are mixed to reconstruct the final RIR. The results show that MEAN-RIR significantly improves RIR estimation, with notable gains in acoustic parameters.
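
The decoder's two-component output can be pictured as below: a sparse direct/early part plus late reverberation built by masking filtered noise. Shapes, the moving-average "filter", and the exponential mask are illustrative assumptions, not MEAN-RIR's learned components.

```python
# Two-component RIR composition sketch: early part + mask-modulated noise.
import numpy as np

rng = np.random.default_rng(0)
L = 4800  # RIR length in samples (assumed)

early = np.zeros(L)                    # component 1: direct sound + early reflections
early[[0, 300, 520]] = [1.0, 0.45, 0.3]

noise = rng.standard_normal(L)
filtered = np.convolve(noise, np.ones(8) / 8, mode="same")  # learnable-filter stand-in
mask = np.exp(-np.arange(L) / 1200.0)  # component 2: mask shaping the late reverb

rir = early + mask * filtered          # mixed to reconstruct the final RIR
print(rir.shape)
```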

[360] Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets

Chenlin Liu, Minghui Fang, Patrick Zhang, Wei Zhou, Jie Gao, Jiqing Han

Main category: eess.AS

TL;DR: GOAT is a post-training framework that reduces hallucinations in LM-based TTS systems by aligning model distributions through flow optimization, achieving over 50% error reduction without extra training costs.

DetailsMotivation: LM-based TTS systems often generate hallucinated speech that deviates from input text, and existing mitigation strategies require excessive training resources or introduce significant inference latency.

Method: Reformulate TTS generation as trajectory flow optimization, use enhanced Subtrajectory Balance objective with sharpened internal reward as target distribution, integrate reward temperature decay and learning rate optimization.

Result: Reduces over 50% character error rates on challenging test cases and lowers uncertainty by up to 58%.

Conclusion: GOAT effectively mitigates hallucinations in LM-based TTS without requiring massive resources or adding inference cost, demonstrating strong generalization ability.

Abstract: Language Model (LM)-based Text-to-Speech (TTS) systems often generate hallucinated speech that deviates from the input text. Existing mitigation strategies either demand excessive training resources or introduce significant inference latency. In this paper, we propose GFlOwNet-guided distribution AlignmenT (GOAT) for LM-based TTS, a post-training framework that mitigates hallucinations without relying on massive resources or added inference cost. Specifically, we first conduct an uncertainty analysis, revealing a strong positive correlation between hallucination and model uncertainty. Based on this, we reformulate TTS generation as a trajectory flow optimization problem and introduce an enhanced Subtrajectory Balance objective together with a sharpened internal reward as the target distribution. We further integrate reward temperature decay and learning rate optimization for stability and performance balance. Extensive experiments show that GOAT reduces character error rates by over 50% on challenging test cases and lowers uncertainty by up to 58%, demonstrating its strong generalization ability and effectiveness.
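
For reference, the standard Subtrajectory Balance objective from the GFlowNet literature, which the enhanced objective above builds on (the paper's reward sharpening and temperature decay are not shown), constrains every subtrajectory $s_m \rightarrow \cdots \rightarrow s_n$ of the generation process:

$$
\mathcal{L}_{\mathrm{SubTB}}(s_m{:}s_n) = \left( \log \frac{F(s_m)\prod_{i=m}^{n-1} P_F(s_{i+1} \mid s_i)}{F(s_n)\prod_{i=m}^{n-1} P_B(s_i \mid s_{i+1})} \right)^{2}
$$

Here $F$ is a learned state-flow function, $P_F$ is the forward (token-generation) policy, and $P_B$ is the backward policy; for autoregressive token sequences each state has a unique parent, so $P_B$ reduces to 1 and the objective ties partial-sequence flows to the reward at terminal states.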

eess.IV

[361] Inferring the Graph Structure of Images for Graph Neural Networks

Mayur S Gowda, John Shi, Augusto Santos, José M. F. Moura

Main category: eess.IV

TL;DR: Improving GNN accuracy on image datasets by finding alternative graph representations beyond traditional grid graphs and superpixel methods.

DetailsMotivation: Traditional grid graph representations of images in datasets like MNIST may not be optimal for Graph Neural Network performance, suggesting that alternative graph structures could improve downstream task accuracy

Method: Finding row correlation, column correlation, and product graphs for each image using correlations between pixel values, building on existing methods from previous research

Result: Experiments show that using these alternative graph representations as input to GNN models improves accuracy compared to traditional grid graph and superpixel methods

Conclusion: Alternative graph representations based on pixel correlations can significantly enhance GNN performance on image classification tasks, offering better accuracy than conventional approaches

Abstract: Image datasets such as MNIST are a key benchmark for testing Graph Neural Network (GNN) architectures. The images are traditionally represented as a grid graph, with each node representing a pixel and edges connecting neighboring pixels (vertically and horizontally). The graph signal is the values (intensities) of each pixel in the image. The graphs are commonly used as input to graph neural networks (e.g., Graph Convolutional Neural Networks (Graph CNNs) [1, 2], Graph Attention Networks (GAT) [3], GatedGCN [4]) to classify the images. In this work, we improve the accuracy of downstream graph neural network tasks by finding alternative graphs to the grid graph and superpixel methods to represent the dataset images, following the approach in [5, 6]. We find row correlation, column correlation, and product graphs for each image in MNIST and Fashion-MNIST using correlations between the pixel values, building on the method in [5, 6]. Experiments show that using these different graph representations and features as input to downstream GNN models improves the accuracy over using the traditional grid graph and superpixel methods in the literature.
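
A sketch of the core construction for one image: treat rows (or columns) as nodes and connect those whose pixel vectors are strongly correlated. The threshold and the Kronecker-product composition below are illustrative choices, not necessarily those of the cited method.

```python
# Row/column correlation graphs from pixel values (illustrative construction).
import numpy as np

def correlation_graph(vectors, threshold=0.5):
    """vectors: (N, D) array; nodes are the N vectors, edges link correlated pairs."""
    corr = np.corrcoef(vectors)                  # (N, N) pairwise correlations
    adj = (np.abs(corr) > threshold).astype(float)
    np.fill_diagonal(adj, 0.0)                   # no self-loops
    return adj

img = np.random.rand(28, 28)                     # stand-in for an MNIST digit
A_row = correlation_graph(img)                   # rows as nodes
A_col = correlation_graph(img.T)                 # columns as nodes
A_prod = np.kron(A_row, A_col)                   # one product-graph choice: 784 pixel nodes
print(A_row.shape, A_col.shape, A_prod.shape)
```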

[362] AURAD: Anatomy-Pathology Unified Radiology Synthesis with Progressive Representations

Shuhan Ding, Jingjing Fu, Yu Gu, Naiteek Sangani, Mu Wei, Paul Vozila, Nan Liu, Jiang Bian, Hoifung Poon

Main category: eess.IV

TL;DR: AURAD is a controllable chest X-ray synthesis framework that generates both images and pseudo semantic masks from clinical prompts, ensuring anatomical-pathological consistency and clinical relevance through expert model filtering.

DetailsMotivation: Medical image synthesis needs fine-grained controllability for data augmentation, but existing methods struggle with chest radiographs due to diverse disease patterns intertwined with anatomy and limited high-quality annotations.

Method: Progressive pipeline: first generates pseudo masks from clinical prompts conditioned on anatomical structures, then uses masks to guide image synthesis. Leverages pretrained medical models for output filtering to ensure clinical plausibility.

Result: 78% of synthesized images classified as authentic by radiologists, over 40% of segmentation overlays rated clinically useful. Demonstrates effectiveness across tasks and datasets.

Conclusion: AURAD bridges generative modeling with clinical applications by providing both realistic images and usable segmentation masks, enabling downstream tasks like detection and segmentation in data-scarce settings.

Abstract: Medical image synthesis has become an essential strategy for augmenting datasets and improving model generalization in data-scarce clinical settings. However, fine-grained and controllable synthesis remains difficult due to limited high-quality annotations and domain shifts across datasets. Existing methods, often designed for natural images or well-defined tumors, struggle to generalize to chest radiographs, where disease patterns are morphologically diverse and tightly intertwined with anatomical structures. To address these challenges, we propose AURAD, a controllable radiology synthesis framework that jointly generates high-fidelity chest X-rays and pseudo semantic masks. Unlike prior approaches that rely on randomly sampled masks-limiting diversity, controllability, and clinical relevance-our method learns to generate masks that capture multi-pathology coexistence and anatomical-pathological consistency. It follows a progressive pipeline: pseudo masks are first generated from clinical prompts conditioned on anatomical structures, and then used to guide image synthesis. We also leverage pretrained expert medical models to filter outputs and ensure clinical plausibility. Beyond visual realism, the synthesized masks also serve as labels for downstream tasks such as detection and segmentation, bridging the gap between generative modeling and real-world clinical applications. Extensive experiments and blinded radiologist evaluations demonstrate the effectiveness and generalizability of our method across tasks and datasets. In particular, 78% of our synthesized images are classified as authentic by board-certified radiologists, and over 40% of predicted segmentation overlays are rated as clinically useful. All code, pre-trained models, and the synthesized dataset will be released upon publication.

[363] Multi-modal Uncertainty Robust Tree Cover Segmentation For High-Resolution Remote Sensing Images

Yuanyuan Gui, Wei Li, Yinjian Wang, Xiang-Gen Xia, Mauro Marty, Christian Ginzler, Zuyuan Wang

Main category: eess.IV

TL;DR: MURTreeFormer is a novel multi-modal segmentation framework that addresses temporal misalignment issues in remote sensing by modeling aleatoric uncertainty and reconstructing uncertain patches for robust tree cover mapping.

DetailsMotivation: Temporal misalignments between multi-modal remote sensing data (optical, LiDAR, SAR) acquired days/months apart introduce cross-modal uncertainty that degrades segmentation accuracy for tree cover mapping applications.

Method: Treats one modality as primary and others as auxiliary, models patch-level uncertainty via probabilistic latent representation, uses VAE-based resampling to reconstruct uncertain patches, and integrates gradient magnitude attention module with refinement head for structure guidance and detail preservation.

Result: Extensive experiments on Shanghai and Zurich datasets demonstrate significant improvement in segmentation performance and effective reduction of temporally induced aleatoric uncertainty.

Conclusion: MURTreeFormer provides an effective solution for handling temporal misalignments in multi-modal remote sensing data, enabling more robust and accurate tree cover segmentation despite acquisition time differences.

Abstract: Recent advances in semantic segmentation of multi-modal remote sensing images have significantly improved the accuracy of tree cover mapping, supporting applications in urban planning, forest monitoring, and ecological assessment. Integrating data from multiple modalities-such as optical imagery, light detection and ranging (LiDAR), and synthetic aperture radar (SAR)-has shown superior performance over single-modality methods. However, these data are often acquired days or even months apart, during which various changes may occur, such as vegetation disturbances (e.g., logging, and wildfires) and variations in imaging quality. Such temporal misalignments introduce cross-modal uncertainty, especially in high-resolution imagery, which can severely degrade segmentation accuracy. To address this challenge, we propose MURTreeFormer, a novel multi-modal segmentation framework that mitigates and leverages aleatoric uncertainty for robust tree cover mapping. MURTreeFormer treats one modality as primary and others as auxiliary, explicitly modeling patch-level uncertainty in the auxiliary modalities via a probabilistic latent representation. Uncertain patches are identified and reconstructed from the primary modality’s distribution through a VAE-based resampling mechanism, producing enhanced auxiliary features for fusion. In the decoder, a gradient magnitude attention (GMA) module and a lightweight refinement head (RH) are further integrated to guide attention toward tree-like structures and to preserve fine-grained spatial details. Extensive experiments on multi-modal datasets from Shanghai and Zurich demonstrate that MURTreeFormer significantly improves segmentation performance and effectively reduces the impact of temporally induced aleatoric uncertainty.

[364] INR meets Multi-Contrast MRI Reconstruction

Natascha Niessen, Carolin M. Pirkl, Ana Beatriz Solana, Hannah Eichhorn, Veronika Spieker, Wenqi Huang, Tim Sprenger, Marion I. Menzel, Julia A. Schnabel

Main category: eess.IV

TL;DR: Proposed implicit neural representation network for multi-contrast MRI reconstruction that leverages complementary undersampling patterns across contrasts to achieve higher acceleration rates while maintaining image quality.

DetailsMotivation: Multi-contrast MRI sequences provide valuable tissue information but have long scan times. Undersampling accelerates acquisition but requires advanced reconstruction techniques to maintain image quality.

Method: Uses complementary undersampling patterns across contrasts (capturing center k-space contrast info, complementary high-frequency undersampling) and an implicit neural representation network that jointly reconstructs all contrast images.

Result: Outperforms state-of-the-art parallel imaging compressed sensing (PICS) reconstruction method, even at higher acceleration factors, when applied to MPnRAGE multi-contrast MRI sequence.

Conclusion: The proposed INR method effectively leverages redundant anatomical information across multi-contrast sequences to achieve higher acceleration rates while maintaining reconstruction quality superior to existing methods.

Abstract: Multi-contrast MRI sequences allow for the acquisition of images with varying tissue contrast within a single scan. The resulting multi-contrast images can be used to extract quantitative information on tissue microstructure. To make such multi-contrast sequences feasible for clinical routine, the usually very long scan times need to be shortened e.g. through undersampling in k-space. However, this comes with challenges for the reconstruction. In general, advanced reconstruction techniques such as compressed sensing or deep learning-based approaches can enable the acquisition of high-quality images despite the acceleration. In this work, we leverage redundant anatomical information of multi-contrast sequences to achieve even higher acceleration rates. We use undersampling patterns that capture the contrast information located at the k-space center, while performing complementary undersampling across contrasts for high frequencies. To reconstruct this highly sparse k-space data, we propose an implicit neural representation (INR) network that is ideal for using the complementary information acquired across contrasts as it jointly reconstructs all contrast images. We demonstrate the benefits of our proposed INR method by applying it to multi-contrast MRI using the MPnRAGE sequence, where it outperforms the state-of-the-art parallel imaging compressed sensing (PICS) reconstruction method, even at higher acceleration factors.
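
An INR in this setting is a coordinate network fit per scan: it maps (x, y, contrast index) to intensity, so all contrasts share one set of weights and complementary k-space samples constrain each other. The sketch below shows such a network with Fourier features; sizes and the encoding are assumptions, and the k-space data-consistency loss is omitted.

```python
# Coordinate-network (INR) sketch for joint multi-contrast reconstruction.
import torch
import torch.nn as nn

class FourierINR(nn.Module):
    def __init__(self, n_freqs=32, hidden=256):
        super().__init__()
        # Fixed random Fourier-feature projection of (x, y, contrast).
        self.B = nn.Parameter(torch.randn(3, n_freqs) * 10.0, requires_grad=False)
        self.net = nn.Sequential(
            nn.Linear(2 * n_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # predicted image intensity
        )

    def forward(self, coords):  # coords: (N, 3) scaled to [-1, 1]
        proj = coords @ self.B
        feats = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)
        return self.net(feats)

coords = torch.rand(1024, 3) * 2 - 1
print(FourierINR()(coords).shape)  # torch.Size([1024, 1])
```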

[365] VLSM-Ensemble: Ensembling CLIP-based Vision-Language Models for Enhanced Medical Image Segmentation

Julia Dietlmeier, Oluwabukola Grace Adegboro, Vayangi Ganepola, Claudia Mazo, Noel E. O’Connor

Main category: eess.IV

TL;DR: Ensembling vision-language segmentation models with low-complexity CNNs improves performance on medical image segmentation tasks, achieving up to 6.3% Dice score improvement on polyp datasets.

DetailsMotivation: Vision-language models for image segmentation show great potential but implementations based on CLIP and BiomedCLIP lag behind more sophisticated architectures like CRIS. Instead of focusing on text prompt engineering, the authors explore ensembling approaches to narrow this performance gap.

Method: The authors ensemble vision-language segmentation models (VLSMs) with a low-complexity CNN architecture to improve segmentation performance across multiple medical imaging datasets.

Result: Achieved significant Dice score improvement of 6.3% on BKAI polyp dataset using ensembled BiomedCLIPSeg, with gains ranging from 1% to 6% across other datasets. Results vary across different datasets, with some outperforming and others underperforming the CRIS benchmark.

Conclusion: Ensembling works differently across various medical imaging datasets, indicating this approach requires further investigation. The method shows promise for improving vision-language segmentation models but performance is dataset-dependent.

Abstract: Vision-language models and their adaptations to image segmentation tasks present enormous potential for producing highly accurate and interpretable results. However, implementations based on CLIP and BiomedCLIP are still lagging behind more sophisticated architectures such as CRIS. In this work, instead of focusing on text prompt engineering as is the norm, we attempt to narrow this gap by showing how to ensemble vision-language segmentation models (VLSMs) with a low-complexity CNN. By doing so, we achieve a significant Dice score improvement of 6.3% on the BKAI polyp dataset using the ensembled BiomedCLIPSeg, while other datasets exhibit gains ranging from 1% to 6%. Furthermore, we provide initial results on four additional radiology and non-radiology datasets. We conclude that ensembling works differently across these datasets (from outperforming to underperforming the CRIS model), indicating a topic for future investigation by the community. The code is available at https://github.com/juliadietlmeier/VLSM-Ensemble.

[366] Exploring Autoregressive Vision Foundation Models for Image Compression

Huu-Tai Phung, Yu-Hsiang Lin, Yen-Kuan Ho, Wen-Hsiao Peng

Main category: eess.IV

TL;DR: First attempt to repurpose vision foundation models (VFMs) as image codecs for low-rate compression, showing superior perceptual quality at extremely low bitrates compared to specialized learned codecs.

DetailsMotivation: To explore the generation capability of vision foundation models for image compression, leveraging their encoder-decoder architecture and autoregressive modeling that resembles end-to-end learned image codecs.

Method: Repurpose the autoregressive model in VFMs for entropy coding by predicting next tokens based on previously coded tokens, rather than relying solely on conditional generation for image reconstruction.

Result: Certain pre-trained general-purpose VFMs demonstrate superior perceptual quality at extremely low bitrates compared to specialized learned image codecs optimized for distortion or perceptual quality.

Conclusion: This approach paves the way for leveraging VFMs for low-rate, semantically rich image compression as a promising research direction.

Abstract: This work presents the first attempt to repurpose vision foundation models (VFMs) as image codecs, aiming to explore their generation capability for low-rate image compression. VFMs are widely employed in both conditional and unconditional generation scenarios across diverse downstream tasks, e.g., physical AI applications. Many VFMs employ an encoder-decoder architecture similar to that of end-to-end learned image codecs and learn an autoregressive (AR) model to perform next-token prediction. To enable compression, we repurpose the AR model in VFM for entropy coding the next token based on previously coded tokens. This approach deviates from early semantic compression efforts that rely solely on conditional generation for reconstructing input images. Extensive experiments and analysis are conducted to compare VFM-based codec to current SOTA codecs optimized for distortion or perceptual quality. Notably, certain pre-trained, general-purpose VFMs demonstrate superior perceptual quality at extremely low bitrates compared to specialized learned image codecs. This finding paves the way for a promising research direction that leverages VFMs for low-rate, semantically rich image compression.
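
The repurposing step rests on a standard identity: an autoregressive model that assigns probability $p(t_i \mid t_{<i})$ to the next token yields an ideal code length of $-\log_2 p$ bits per token, which an arithmetic coder can realize. The toy prior below stands in for a VFM's AR model.

```python
# Ideal code length under an autoregressive prior (entropy-coding view).
import math

def ideal_code_length(tokens, prob_fn):
    """prob_fn(history, token) -> model probability of `token` given `history`."""
    return sum(-math.log2(prob_fn(tokens[:i], t)) for i, t in enumerate(tokens))

def toy_prob(history, token, vocab=4):
    # Stand-in AR prior that favors repeating the previous token.
    if not history:
        return 1.0 / vocab
    return 0.7 if token == history[-1] else 0.1  # 0.7 + 3 * 0.1 = 1.0

print(ideal_code_length([0, 0, 0, 1, 1], toy_prob))  # repetitive tokens cost few bits
```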

[367] Generation of realistic cardiac ultrasound sequences with ground truth motion and speckle decorrelation

Thierry Judge, Nicolas Duchateau, Khuram Faraz, Pierre-Marc Jodoin, Olivier Bernard

Main category: eess.IV

TL;DR: Improved ultrasound simulation framework that incorporates speckle decorrelation for more realistic left ventricular strain estimation training data

DetailsMotivation: Existing ultrasound simulation pipelines lack realism because they don't account for speckle decorrelation, which limits their effectiveness for training machine learning algorithms

Method: Builds on existing ultrasound simulation by adding dynamic speckle variation model using coherence maps derived from correlation values measured from real ultrasound data, adapting locally over time

Result: Evaluated on 98 patients from CAMUS database, achieved lower mean absolute error compared to baseline, better reproducing decorrelation behavior from clinical data

Conclusion: The proposed framework successfully addresses the limitation of previous methods by explicitly modeling speckle decorrelation, resulting in more realistic simulated ultrasound sequences for machine learning applications

Abstract: Simulated ultrasound image sequences are key for training and validating machine learning algorithms for left ventricular strain estimation. Several simulation pipelines have been proposed to generate sequences with corresponding ground truth motion, but they suffer from limited realism as they do not consider speckle decorrelation. In this work, we address this limitation by proposing an improved simulation framework that explicitly accounts for speckle decorrelation. Our method builds on an existing ultrasound simulation pipeline by incorporating a dynamic model of speckle variation. Starting from real ultrasound sequences and myocardial segmentations, we generate meshes that guide image formation. Instead of applying a fixed ratio of myocardial and background scatterers, we introduce a coherence map that adapts locally over time. This map is derived from correlation values measured directly from the real ultrasound data, ensuring that simulated sequences capture the characteristic temporal changes observed in practice. We evaluated the realism of our approach using ultrasound data from 98 patients in the CAMUS database. Performance was assessed by comparing correlation curves from real and simulated images. The proposed method achieved lower mean absolute error compared to the baseline pipeline, indicating that it more faithfully reproduces the decorrelation behavior seen in clinical data.

[368] Automatic segmentation of Organs at Risk in Head and Neck cancer patients from CT and MRI scans

Sébastien Quetin, Andrew Heschl, Mauricio Murillo, Rohit Murali, Piotr Pater, George Shenouda, Shirin A. Enger, Farhad Maleki

Main category: eess.IV

TL;DR: A deep learning pipeline using nnU-Net with Modality Dropout achieves state-of-the-art automatic segmentation of 30 head and neck organs-at-risk from CT, MRI, or both modalities.

DetailsMotivation: To develop a robust and flexible deep learning solution for automatic segmentation of organs-at-risk in head and neck cancer patients that can handle different imaging modalities (CT, MRI, or both) and maintain high performance.

Method: Trained nnU-Net pipeline on 296 patients with paired CT and MRI-T1 scans from multiple datasets. Used Modality Dropout during training for robustness to missing modalities, merged left/right OARs during training and separated them at inference based on anatomical position.

Result: Achieved state-of-the-art performance on HaN-Seg challenge (mean Dice Score: 78.12%, Hausdorff Distance: 3.42 mm) and maintained strong agreement with commercial software on TCIA datasets (DS: 77.43%, HD: 3.27 mm) while flagging low-quality contours.

Conclusion: The pipeline establishes new state-of-the-art for fully automated multi-modal segmentation of head and neck organs-at-risk, demonstrating seamless segmentation capability from CT, MRI, or both modalities with robust performance.

Abstract: Purpose: To present a high-performing, robust, and flexible deep learning pipeline for automatic segmentation of 30 organs-at-risk (OARs) in head and neck (H&N) cancer patients, using MRI, CT, or both. Method: We trained a segmentation pipeline on paired CT and MRI-T1 scans from 296 patients. We combined data from the H&N OARs CT and MR segmentation (HaN-Seg) challenge and the Burdenko and GLIS-RT datasets from the Cancer Imaging Archive (TCIA). MRI was rigidly registered to CT, and both were stacked as input to an nnU-Net pipeline. Left and right OARs were merged into single classes during training and separated at inference time based on anatomical position. Modality Dropout was applied during the training, ensuring the model would learn from both modalities and robustly handle missing modalities during inference. The trained model was evaluated on the HaN-Seg test set and three TCIA datasets. Predictions were also compared with Limbus AI software. Dice Score (DS) and Hausdorff Distance (HD) were used as evaluation metrics. Results: The pipeline achieved state-of-the-art performance on the HaN-Seg challenge with a mean DS of 78.12% and HD of 3.42 mm. On TCIA datasets, the model maintained strong agreement with Limbus AI software (DS: 77.43% , HD: 3.27 mm), while also flagging low-quality contours. The pipeline can segment seamlessly from the CT, the MRI scan, or both. Conclusion: The proposed pipeline achieved the best DS and HD scores among all HaN-Seg challenge participants and establishes a new state-of-the-art for fully automated, multi-modal segmentation of H&N OARs.
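
Modality Dropout, as described, randomly blanks one of the stacked inputs during training so the network stays usable when only CT or only MRI is available at inference. A minimal sketch (drop probabilities are assumptions):

```python
# Modality Dropout sketch for stacked CT + MRI input channels.
import torch

def modality_dropout(ct, mri, p_drop_mri=0.25, p_drop_ct=0.25):
    """ct, mri: (B, 1, D, H, W) volumes stacked as input channels."""
    r = torch.rand(1).item()
    if r < p_drop_mri:
        mri = torch.zeros_like(mri)      # this train step sees CT only
    elif r < p_drop_mri + p_drop_ct:
        ct = torch.zeros_like(ct)        # this train step sees MRI only
    return torch.cat([ct, mri], dim=1)   # (B, 2, D, H, W) network input

x = modality_dropout(torch.randn(1, 1, 8, 32, 32), torch.randn(1, 1, 8, 32, 32))
print(x.shape)  # torch.Size([1, 2, 8, 32, 32])
```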

[369] Generating Synthetic Contrast-Enhanced Chest CT Images from Non-Contrast Scans Using Slice-Consistent Brownian Bridge Diffusion Network

Pouya Shiri, Xin Yi, Neel P. Mistry, Samaneh Javadinia, Mohammad Chegini, Seok-Bum Ko, Amirali Baniasadi, Scott J. Adams

Main category: eess.IV

TL;DR: First bridge diffusion model for generating synthetic contrast-enhanced CT angiography from non-contrast CT scans, eliminating need for contrast agents while maintaining 3D anatomical integrity.

DetailsMotivation: Contrast agents in CT imaging pose risks like nephrotoxicity and allergic reactions. Synthetic contrast-enhanced imaging would improve patient safety, accessibility, and reduce healthcare costs.

Method: Uses Slice-Consistent Brownian Bridge Diffusion Model (SC-BBDM) to model complex mappings while maintaining slice consistency. Includes comprehensive preprocessing with resampling, Symmetric Normalization registration, and dilated segmentation masks. Evaluated on two datasets from Coltea-Lung dataset.

Result: Demonstrates effectiveness in preserving vascular structures while enhancing contrast fidelity compared to baseline methods on both aorta-only and aorta+heart datasets.

Conclusion: Proposed framework successfully generates high-fidelity synthetic contrast-enhanced CTA images without actual contrast administration, operating efficiently under low memory budget while maintaining full 3D anatomical integrity.

Abstract: Contrast-enhanced computed tomography (CT) imaging is essential for diagnosing and monitoring thoracic diseases, including aortic pathologies. However, contrast agents pose risks such as nephrotoxicity and allergic-like reactions. The ability to generate high-fidelity synthetic contrast-enhanced CT angiography (CTA) images without contrast administration would be transformative, enhancing patient safety and accessibility while reducing healthcare costs. In this study, we propose the first bridge diffusion-based solution for synthesizing contrast-enhanced CTA images from non-contrast CT scans. Our approach builds on the Slice-Consistent Brownian Bridge Diffusion Model (SC-BBDM), leveraging its ability to model complex mappings while maintaining consistency across slices. Unlike conventional slice-wise synthesis methods, our framework preserves full 3D anatomical integrity while operating in a high-resolution 2D fashion, allowing seamless volumetric interpretation under a low memory budget. To ensure robust spatial alignment, we implement a comprehensive preprocessing pipeline that includes resampling, registration using the Symmetric Normalization method, and a sophisticated dilated segmentation mask to extract the aorta and surrounding structures. We create two datasets from the Coltea-Lung dataset: one containing only the aorta and another including both the aorta and heart, enabling a detailed analysis of anatomical context. We compare our approach against baseline methods on both datasets, demonstrating its effectiveness in preserving vascular structures while enhancing contrast fidelity.
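
For context, the Brownian bridge forward process underlying BBDM-style models replaces the usual noise-to-image diffusion with a stochastic interpolation between the two image domains. In the notation of the original BBDM formulation (adopted here as an assumption), with $x_0$ the contrast-enhanced target and $y$ the registered non-contrast source:

$$
q(x_t \mid x_0, y) = \mathcal{N}\big((1 - m_t)\,x_0 + m_t\,y,\ \delta_t I\big), \qquad m_t = t/T,
$$

where $\delta_t$ is a variance schedule that vanishes at both endpoints, so sampling starts exactly at the non-contrast scan ($t = T$) and ends at the contrast-enhanced estimate ($t = 0$). The slice-consistency machinery of SC-BBDM is omitted here.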

[370] MitoDetect++: A Domain-Robust Pipeline for Mitosis Detection and Atypical Subtyping

Esha Sadia Nasir, Jiaqi Lv, Mostafa Jahanifar, Shan E Ahmed Raza

Main category: eess.IV

TL;DR: MitoDetect++ is a deep learning pipeline for mitosis detection and atypical mitosis classification that achieves 0.892 balanced accuracy using U-Net with EfficientNetV2-L backbone and Virchow2 transformer with LoRA fine-tuning.

DetailsMotivation: Automated detection and classification of mitotic figures, especially distinguishing atypical from normal mitoses, remains a critical challenge in computational pathology that needs to be addressed.

Method: U-Net-based encoder-decoder with EfficientNetV2-L backbone and attention modules for detection; Virchow2 vision transformer with LoRA fine-tuning for classification; enhanced with strong augmentations, focal loss, stratified cross-validation, and test-time augmentation.

Result: Achieved a balanced accuracy of 0.892 across validation domains, demonstrating strong performance and generalization capabilities.

Conclusion: The method shows clinical applicability and scalability across tasks, effectively addressing both mitosis detection and atypical mitosis classification challenges in computational pathology.

Abstract: Automated detection and classification of mitotic figures especially distinguishing atypical from normal remain critical challenges in computational pathology. We present MitoDetect++, a unified deep learning pipeline designed for the MIDOG 2025 challenge, addressing both mitosis detection and atypical mitosis classification. For detection (Track 1), we employ a U-Net-based encoder-decoder architecture with EfficientNetV2-L as the backbone, enhanced with attention modules, and trained via combined segmentation losses. For classification (Track 2), we leverage the Virchow2 vision transformer, fine-tuned efficiently using Low-Rank Adaptation (LoRA) to minimize resource consumption. To improve generalization and mitigate domain shifts, we integrate strong augmentations, focal loss, and group-aware stratified 5-fold cross-validation. At inference, we deploy test-time augmentation (TTA) to boost robustness. Our method achieves a balanced accuracy of 0.892 across validation domains, highlighting its clinical applicability and scalability across tasks.
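
For reference, the focal loss mentioned above (Lin et al.) down-weights easy examples via $\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log p_t$, which is what suits it to the rare atypical-mitosis class. A standard binary implementation (hyperparameters are common defaults, not the paper's settings):

```python
# Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits: (N,) raw scores; targets: (N,) labels in {0, 1}."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                                 # p if y=1, else 1-p
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()    # easy examples vanish fast

print(focal_loss(torch.tensor([2.0, -1.0]), torch.tensor([1.0, 0.0])))
```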

[371] Adaptive Learning Strategies for Mitotic Figure Classification in MIDOG2025 Challenge

Biwen Meng, Xi Long, Jingxin Liu

Main category: eess.IV

TL;DR: Adapting UNI2 foundation model with Visual Prompt Tuning and stain normalization techniques improves atypical mitosis detection, achieving top-10 performance in MIDOG2025 challenge.

DetailsMotivation: Atypical mitotic figures are clinically important indicators but challenging to detect reliably due to morphological ambiguity and scanner variability.

Method: Three UNI2 adaptation variants: (1) LoRA + UNI2, (2) VPT + UNI2 + Vahadane Normalizer, (3) VPT + UNI2 + GRL + Stain TTA. Best results from combining Visual Prompt Tuning with stain normalization and test-time augmentation.

Result: Final submission achieved balanced accuracy of 0.8837 and ROC-AUC of 0.9513, ranking within top 10 teams on preliminary leaderboard.

Conclusion: Prompt-based adaptation combined with stain-normalization test-time augmentation offers a promising strategy for atypical mitosis classification under diverse imaging conditions.

Abstract: Atypical mitotic figures (AMFs) are clinically relevant indicators of abnormal cell division, yet their reliable detection remains challenging due to morphological ambiguity and scanner variability. In this work, we investigated three variants of adapting the pathology foundation model UNI2 for the MIDOG2025 Track 2 challenge: (1) LoRA + UNI2, (2) VPT + UNI2 + Vahadane Normalizer, and (3) VPT + UNI2 + GRL + Stain TTA. We observed that the integration of Visual Prompt Tuning (VPT) with stain normalization techniques contributed to improved generalization. The best robustness was achieved by further incorporating test-time augmentation (TTA) with Vahadane and Macenko stain normalization. Our final submission achieved a balanced accuracy of 0.8837 and an ROC-AUC of 0.9513 on the preliminary leaderboard, ranking within the top 10 teams. These results suggest that prompt-based adaptation combined with stain-normalization TTA offers a promising strategy for atypical mitosis classification under diverse imaging conditions.
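
LoRA, used above to fine-tune UNI2 cheaply, freezes the pretrained weight $W$ and learns only a low-rank update $\Delta W = BA$, so a linear layer computes $Wx + \frac{\alpha}{r}BAx$. A minimal sketch (rank and scaling are typical choices, not the paper's reported settings):

```python
# LoRA sketch: frozen base layer plus a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 768)).shape)  # torch.Size([2, 768]); only A and B train
```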

Last updated: 2025-09-15
Built with Hugo, theme modified from Stack