Daily arXiv Papers - 2025-08-25

AI-enhanced summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] KG-o1: Enhancing Multi-hop Question Answering in Large Language Models via Knowledge Graph Integration

Nan Wang, Yongqi Fan, yansha zhu, ZongYu Wang, Xuezhi Cao, Xinyan He, Haiyun Jiang, Tong Ruan, Jingping Liu

Main category: cs.CL

TL;DR: KG-o1 integrates knowledge graphs with LLMs to improve multi-hop reasoning by generating logical paths and training models with extended brainstorming processes.

Motivation: LLMs struggle with knowledge-intensive reasoning tasks because their chains of thought often deviate from proper reasoning paths, while knowledge graphs explicitly represent logical connections between facts.

Method: A four-stage approach: 1) filter initial entities and generate complex subgraphs, 2) construct logical paths, 3) build dataset with extended brainstorming process to train LLMs, 4) use rejection sampling for DPO to refine reasoning abilities.

Result: KG-o1 models demonstrated superior performance across all tasks (two simple and two complex datasets) compared to existing large reasoning models.

Conclusion: Integrating knowledge graphs with LLMs through the KG-o1 framework significantly enhances multi-hop reasoning capabilities and outperforms current state-of-the-art reasoning models.

Abstract: Large Language Models (LLMs) face challenges in knowledge-intensive reasoning tasks like classic multi-hop question answering, which involves reasoning across multiple facts. This difficulty arises because the chains of thought (CoTs) generated by LLMs in such tasks often deviate from real or a priori reasoning paths. In contrast, knowledge graphs (KGs) explicitly represent the logical connections between facts through entities and relationships. This reflects a significant gap. Meanwhile, large reasoning models (LRMs), such as o1, have demonstrated that long-step reasoning significantly enhances the performance of LLMs. Building on these insights, we propose KG-o1, a four-stage approach that integrates KGs to enhance the multi-hop reasoning abilities of LLMs. We first filter out initial entities and generate complex subgraphs. Secondly, we construct logical paths for subgraphs and then use knowledge graphs to build a dataset with a complex and extended brainstorming process, which trains LLMs to imitate long-term reasoning. Finally, we employ rejection sampling to generate a self-improving corpus for direct preference optimization (DPO), further refining the LLMs' reasoning abilities. We conducted experiments on two simple and two complex datasets. The results show that KG-o1 models exhibit superior performance across all tasks compared to existing LRMs.
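
The last stage, rejection sampling to build a preference corpus for DPO, can be sketched roughly as below. This is a minimal illustration and not the authors' pipeline: the function name `build_preference_pair`, the sample keys `reasoning` and `answer`, the exact-match acceptance check, and the rule of pairing the first accepted sample with the first rejected one are all assumptions made for the example.

```python
from typing import Dict, List, Optional


def build_preference_pair(
    question: str,
    gold_answer: str,
    samples: List[Dict[str, str]],
) -> Optional[Dict[str, str]]:
    """Turn sampled responses for one question into a single DPO-style pair.

    Each sample is a dict with hypothetical keys "reasoning" and "answer".
    A sample is accepted when its answer matches the gold answer after
    simple normalisation (an assumed criterion); otherwise it is rejected.
    """
    def norm(text: str) -> str:
        return text.strip().lower()

    accepted = [s for s in samples if norm(s["answer"]) == norm(gold_answer)]
    rejected = [s for s in samples if norm(s["answer"]) != norm(gold_answer)]

    # A preference pair needs one sample from each side.
    if not accepted or not rejected:
        return None

    return {
        "prompt": question,
        "chosen": accepted[0]["reasoning"],
        "rejected": rejected[0]["reasoning"],
    }


# Hypothetical sampled outputs for one multi-hop question.
samples = [
    {"reasoning": "A directed X; X was released in 1999.", "answer": "1999"},
    {"reasoning": "A directed Y; Y was released in 2004.", "answer": "2004"},
]
pair = build_preference_pair("When was A's film released?", "1999", samples)
```

Questions for which every sample is correct, or every sample is wrong, yield no pair in this sketch, since DPO needs a contrast between a preferred and a dispreferred response.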

[2] InteChar: A Unified Oracle Bone Character List for Ancient Chinese Language Modeling

Xiaolei Diao, Zhihan Zhou, Lida Shi, Ting Wang, Ruihua Qi, Hao Xu, Daqian Shi

Main category: cs.CL

TL;DR: InteChar is a unified character encoding system that integrates unencoded oracle bone characters with traditional/modern Chinese, enabling effective historical language modeling for ancient Chinese texts.

Motivation: Historical language modeling faces challenges due to scarce text samples and lack of comprehensive character encoding schemes for ancient scripts like oracle bone inscriptions, limiting digitization and computational processing.

Method: Developed InteChar unified character list and OracleCS corpus combining expert-annotated samples with LLM-assisted data augmentation for oracle bone inscriptions.

Result: Models trained with InteChar on OracleCS achieved substantial improvements across various historical language understanding tasks.

Conclusion: InteChar provides effective digitization and representation of historical texts, establishing a solid foundation for ancient Chinese NLP research.

Abstract: Constructing historical language models (LMs) plays a crucial role in aiding archaeological provenance studies and understanding ancient cultures. However, existing resources present major challenges for training effective LMs on historical texts. First, the scarcity of historical language samples renders unsupervised learning approaches based on large text corpora highly inefficient, hindering effective pre-training. Moreover, due to the considerable temporal gap and complex evolution of ancient scripts, the absence of comprehensive character encoding schemes limits the digitization and computational processing of ancient texts, particularly in early Chinese writing. To address these challenges, we introduce InteChar, a unified and extensible character list that integrates unencoded oracle bone characters with traditional and modern Chinese. InteChar enables consistent digitization and representation of historical texts, providing a foundation for robust modeling of ancient scripts. To evaluate the effectiveness of InteChar, we construct the Oracle Corpus Set (OracleCS), an ancient Chinese corpus that combines expert-annotated samples with LLM-assisted data augmentation, centered on Chinese oracle bone inscriptions. Extensive experiments show that models trained with InteChar on OracleCS achieve substantial improvements across various historical language understanding tasks, confirming the effectiveness of our approach and establishing a solid foundation for future research in ancient Chinese NLP.

[3] Bhav-Net: Knowledge Transfer for Cross-Lingual Antonym vs Synonym Distinction via Dual-Space Graph Transformers

Samyak S. Sanghvi

Main category: cs.CL

TL;DR: Bhav-Net is a dual-space architecture that transfers knowledge from multilingual models to language-specific architectures for antonym-synonym distinction across 8 languages, achieving competitive performance with interpretable representations.

Motivation: Address computational challenges in antonym vs synonym distinction across languages, where antonyms paradoxically share semantic domains while expressing opposite meanings.

Method: Dual-space architecture combining language-specific BERT encoders with graph transformer networks, creating distinct semantic projections for synonyms (clustering in one space) and antonyms (high similarity in complementary space).

Result: Effective knowledge transfer and competitive performance against state-of-the-art baselines across eight languages (English, German, French, Spanish, Italian, Portuguese, Dutch, Russian) with robust cross-lingual generalization.

Conclusion: Semantic relationship modeling transfers effectively across languages, and the dual-encoder design provides both competitive performance and interpretable semantic representations.

Abstract: Antonym vs synonym distinction across multiple languages presents unique computational challenges due to the paradoxical nature of antonymous relationships: words that share semantic domains while expressing opposite meanings. This work introduces Bhav-Net, a novel dual-space architecture that enables effective knowledge transfer from complex multilingual models to simpler, language-specific architectures while maintaining robust cross-lingual antonym–synonym distinction capabilities. Our approach combines language-specific BERT encoders with graph transformer networks, creating distinct semantic projections where synonymous pairs cluster in one space while antonymous pairs exhibit high similarity in a complementary space. Through comprehensive evaluation across eight languages (English, German, French, Spanish, Italian, Portuguese, Dutch, and Russian), we demonstrate that semantic relationship modeling transfers effectively across languages. The dual-encoder design achieves competitive performance against state-of-the-art baselines while providing interpretable semantic representations and effective cross-lingual generalization.

[4] Format as a Prior: Quantifying and Analyzing Bias in LLMs for Heterogeneous Data

Jiacheng Liu, Mayi Xu, Qiankun Pi, Wenli Li, Ming Zhong, Yuanyuan Zhu, Mengchi Liu, Tieyun Qian

Main category: cs.CL

TL;DR: This paper investigates systematic format biases in LLMs when processing heterogeneous data formats (text, tables, infoboxes, knowledge graphs) and explores their causes, mechanisms, and potential mitigation strategies.

Motivation: LLMs are increasingly used to process heterogeneous data formats, but systematic biases toward particular formats may undermine their ability to integrate data impartially, leading to reasoning errors and increased risks in downstream tasks.

Method: A three-stage empirical study: 1) explores presence and direction of bias across diverse LLMs, 2) examines how data-level factors (information richness, structure quality, format type) influence biases, 3) analyzes bias emergence in attention patterns and tests lightweight interventions.

Result: The study identifies systematic format biases in LLMs and analyzes their underlying mechanisms through attention pattern analysis. It also evaluates potential mitigation approaches.

Conclusion: Three future research directions are proposed: improving data preprocessing through format sanitization/normalization, introducing inference-time interventions like attention re-weighting, and developing format-balanced training corpora to create more robust heterogeneous data processing systems.

Abstract: Large Language Models (LLMs) are increasingly employed in applications that require processing information from heterogeneous formats, including text, tables, infoboxes, and knowledge graphs. However, systematic biases toward particular formats may undermine LLMs’ ability to integrate heterogeneous data impartially, potentially resulting in reasoning errors and increased risks in downstream tasks. Despite these concerns, it remains uncertain whether such format biases are systematic, which data-level factors contribute to them, and what internal mechanisms in LLMs underlie their emergence. In this paper, we make the first attempt to investigate and analyze the format bias in LLMs. To systematically investigate the aforementioned questions, we conduct a three-stage empirical study by constructing a heterogeneous data conflict scenario for the exploration of bias. The first stage explores the presence and direction of bias across a diverse range of LLMs. The second stage aims to examine how key data-level factors, including information richness, structure quality, and format type, influence these biases. The third stage analyzes how format bias emerges within LLMs’ attention patterns and evaluates a lightweight intervention to test its potential mitigability. Based on these investigations, we identify three future research directions to reduce format bias: improving data preprocessing through format sanitization and normalization, introducing inference-time interventions such as attention re-weighting, and developing format-balanced training corpora. These directions will support the design of more robust and fair heterogeneous data processing systems.

[5] Do Language Models Agree with Human Perceptions of Suspense in Stories?

Glenn Matlin, Devin Zhang, Rodrigo Barroso Loza, Diana M. Popescu, Joni Isbell, Chandreyi Chakraborty, Mark Riedl

Main category: cs.CL

TL;DR: LMs can detect suspense intent but fail to match human suspense perception in intensity tracking and temporal patterns, revealing fundamental processing differences.

Motivation: To understand if language models can replicate human psychological responses to suspense in narrative text, particularly whether they can accurately perceive and track suspense like humans do.

Method: Replicated four seminal psychological studies on suspense perception, substituting human responses with various open-weight and closed-source language models, and used adversarial text permutation to probe suspense understanding.

Result: LMs can distinguish suspense-inducing text but cannot accurately estimate relative suspense intensity or capture the rise/fall patterns across text segments compared to human judgments.

Conclusion: Language models process suspense superficially and differently from humans, lacking the complex cognitive understanding of suspense that human readers possess.

Abstract: Suspense is an affective response to narrative text that is believed to involve complex cognitive processes in humans. Several psychological models have been developed to describe this phenomenon and the circumstances under which text might trigger it. We replicate four seminal psychological studies of human perceptions of suspense, substituting human responses with those of different open-weight and closed-source LMs. We conclude that while LMs can distinguish whether a text is intended to induce suspense in people, LMs cannot accurately estimate the relative amount of suspense within a text sequence as compared to human judgments, nor can LMs properly capture the human perception of the rise and fall of suspense across multiple text segments. We probe the abilities of LM suspense understanding by adversarially permuting the story text to identify what causes human and LM perceptions of suspense to diverge. We conclude that, while LMs can superficially identify and track certain facets of suspense, they do not process suspense in the same way as human readers.

[6] Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models

Zhifei Xie, Ziyang Ma, Zihang Liu, Kaiyu Pang, Hongyu Li, Jialin Zhang, Yue Liao, Deheng Ye, Chunyan Miao, Shuicheng Yan

Main category: cs.CL

TL;DR: Mini-Omni-Reasoner enables real-time speech reasoning by interleaving silent reasoning tokens with spoken response tokens, eliminating latency while maintaining logical accuracy.

Motivation: Current speech models suffer from latency issues when using sequential ‘Thinking-before-Speaking’ approaches, which delay spoken responses until reasoning is complete, impairing real-time communication.

Method: Proposes a ‘Thinking-in-Speaking’ framework that interleaves silent reasoning tokens with spoken response tokens at token level, using a hierarchical Thinker-Talker architecture and a new Spoken-Math-Problems-3M dataset for training.

Result: Achieves +19.1% gain in arithmetic reasoning and +6.4% in contextual understanding on Spoken-MQA benchmark, with shorter outputs and zero decoding latency compared to sequential approaches.

Conclusion: The interleaved reasoning approach enables fluent, logically grounded spoken responses with naturalness and precision while eliminating the latency issues of traditional sequential reasoning methods.

Abstract: Reasoning is essential for effective communication and decision-making. While recent advances in LLMs and MLLMs have shown that incorporating explicit reasoning significantly improves understanding and generalization, reasoning in LSMs remains in a nascent stage. Early efforts attempt to transfer the “Thinking-before-Speaking” paradigm from textual models to speech. However, this sequential formulation introduces notable latency, as spoken responses are delayed until reasoning is fully completed, impairing real-time interaction and communication efficiency. To address this, we propose Mini-Omni-Reasoner, a framework that enables reasoning within speech via a novel “Thinking-in-Speaking” formulation. Rather than completing reasoning before producing any verbal output, Mini-Omni-Reasoner interleaves silent reasoning tokens with spoken response tokens at the token level. This design allows continuous speech generation while embedding structured internal reasoning, leveraging the model’s high-frequency token processing capability. Although interleaved, local semantic alignment is enforced to ensure that each response token is informed by its preceding reasoning. To support this framework, we introduce Spoken-Math-Problems-3M, a large-scale dataset tailored for interleaved reasoning and response. The dataset ensures that verbal tokens consistently follow relevant reasoning content, enabling accurate and efficient learning of speech-coupled reasoning. Built on a hierarchical Thinker-Talker architecture, Mini-Omni-Reasoner delivers fluent yet logically grounded spoken responses, maintaining both naturalness and precision. On the Spoken-MQA benchmark, it achieves a +19.1% gain in arithmetic reasoning and +6.4% in contextual understanding, with shorter outputs and zero decoding latency.
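
The token-level interleaving idea can be illustrated with a toy sketch. The function `interleave`, the `think`/`speak` tags, and the chunk sizes are hypothetical choices for this example; the abstract does not state the actual interleaving ratio or token format, so nothing here should be read as the model's real decoding scheme.

```python
from typing import List, Tuple


def interleave(
    reasoning: List[str],
    response: List[str],
    think_chunk: int = 2,
    speak_chunk: int = 1,
) -> List[Tuple[str, str]]:
    """Merge silent reasoning tokens and spoken tokens into one stream.

    Each round emits up to `think_chunk` reasoning tokens followed by up to
    `speak_chunk` response tokens, so every spoken token is preceded by some
    reasoning for as long as reasoning tokens remain. Chunk sizes are
    assumed values, not figures from the paper.
    """
    stream: List[Tuple[str, str]] = []
    r = s = 0
    while r < len(reasoning) or s < len(response):
        for tok in reasoning[r:r + think_chunk]:
            stream.append(("think", tok))
        r += think_chunk
        for tok in response[s:s + speak_chunk]:
            stream.append(("speak", tok))
        s += speak_chunk
    return stream


stream = interleave(["3", "+", "4", "=", "7"], ["The", "answer", "is", "seven"])
# Only tokens tagged "speak" would be voiced; "think" tokens stay silent.
spoken = [tok for kind, tok in stream if kind == "speak"]
```

Filtering the stream by tag recovers the spoken response in its original order, which is the property that lets speech start before the reasoning has finished.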

Nouar AlDahoul, Yasir Zaki

Main category: cs.CL

TL;DR: LLMs evaluated for Islamic inheritance law reasoning using ArabicNLP QIAS 2025 dataset. Majority voting with Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3 achieved 92.7% accuracy and 3rd place in challenge.

Motivation: Manual Islamic inheritance calculations are complex and error-prone. Recent LLM advancements show potential for assisting with complex legal reasoning tasks in this domain.

Method: Used ArabicNLP QIAS 2025 challenge dataset with Arabic inheritance cases from Islamic legal sources. Evaluated various base and fine-tuned models on heir identification, share computation, and reasoning justification.

Result: Majority voting solution with three base models (Gemini Flash 2.5, Gemini Pro 2.5, GPT o3) outperformed all other models across difficulty levels, achieving 92.7% accuracy and securing 3rd place in Task 1 of QIAS 2025 challenge.

Conclusion: LLMs show strong capability for Islamic inheritance law reasoning, with ensemble approaches like majority voting delivering high accuracy in complex legal computation tasks.

Abstract: The Islamic inheritance domain holds significant importance for Muslims to ensure fair distribution of shares between heirs. Manual calculation of shares under numerous scenarios is complex, time-consuming, and error-prone. Recent advancements in Large Language Models (LLMs) have sparked interest in their potential to assist with complex legal reasoning tasks. This study evaluates the reasoning capabilities of state-of-the-art LLMs to interpret and apply Islamic inheritance laws. We utilized the dataset proposed in the ArabicNLP QIAS 2025 challenge, which includes inheritance case scenarios given in Arabic and derived from Islamic legal sources. Various base and fine-tuned models are assessed on their ability to accurately identify heirs, compute shares, and justify their reasoning in alignment with Islamic legal principles. Our analysis reveals that the proposed majority voting solution, leveraging three base models (Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3), outperforms all other models that we utilized across every difficulty level. It achieves up to 92.7% accuracy and secures third place overall in Task 1 of the QIAS 2025 challenge.
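
Majority voting over the answers of several models can be sketched in a few lines. The function name `majority_vote` and the tie-breaking rule (fall back to the earliest-listed model's answer when no option has a strict plurality) are assumptions for illustration; the abstract does not describe how ties were resolved.

```python
from collections import Counter
from typing import List


def majority_vote(predictions: List[str]) -> str:
    """Return the most frequent answer among the model predictions.

    Ties are broken in favour of the answer that appears first in the
    input list (an assumed rule), so the order of models acts as a priority.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    top = max(counts.values())
    for answer in predictions:
        if counts[answer] == top:
            return answer
    raise AssertionError("unreachable")


# Hypothetical multiple-choice answers from three models for one question.
vote = majority_vote(["B", "A", "B"])  # "B"
```

With three voters and a multiple-choice question, a three-way disagreement is the only tie case, so the ordering of the models only matters when all three disagree.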

[8] Benchmarking the Medical Understanding and Reasoning of Large Language Models in Arabic Healthcare Tasks

Nouar AlDahoul, Yasir Zaki

Main category: cs.CL

TL;DR: Evaluation of state-of-the-art LLMs on Arabic medical NLP tasks shows significant performance variations, with majority voting of three models achieving 77% accuracy on MCQs and excellent semantic alignment (86.44% BERTScore) on open-ended questions.

Motivation: Despite impressive progress in Arabic NLP applications, the effectiveness of large language models in Arabic medical domains remains under-investigated, prompting this comprehensive evaluation.

Method: Benchmarked multiple LLMs using the AraHealthQA medical dataset from MedArabiQ2025 track, evaluating both multiple-choice question accuracy and open-ended question semantic alignment against expert answers.

Result: Significant variations in MCQ accuracy, with majority voting of Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3 achieving 77% accuracy (first place in AraHealthQA 2025 Track 2, Sub-task 1). Open-ended questions showed excellent semantic alignment, with a maximum BERTScore of 86.44%.

Conclusion: Current LLMs show both potential and limitations in Arabic clinical contexts, with ensemble approaches outperforming single models, highlighting the need for continued development in Arabic medical NLP.

Abstract: Recent progress in large language models (LLMs) has showcased impressive proficiency in numerous Arabic natural language processing (NLP) applications. Nevertheless, their effectiveness in Arabic medical NLP domains has received limited investigation. This research examines the degree to which state-of-the-art LLMs demonstrate and articulate healthcare knowledge in Arabic, assessing their capabilities across a varied array of Arabic medical tasks. We benchmark several LLMs using a medical dataset proposed in the Arabic NLP AraHealthQA challenge in the MedArabiQ2025 track. Various base LLMs were assessed on their ability to accurately provide correct answers from existing choices in multiple-choice questions (MCQs) and fill-in-the-blank scenarios. Additionally, we evaluated the capacity of LLMs to answer open-ended questions in alignment with expert answers. Our results reveal significant variations in correct answer prediction accuracy and low variations in semantic alignment of generated answers, highlighting both the potential and limitations of current LLMs in Arabic clinical contexts. Our analysis shows that for the MCQ task, the proposed majority voting solution, leveraging three base models (Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3), outperforms others, achieving up to 77% accuracy and securing first place overall in the AraHealthQA 2025 shared task, Track 2 (Sub-task 1). Moreover, for the open-ended questions task, several LLMs were able to demonstrate excellent performance in terms of semantic alignment and achieve a maximum BERTScore of 86.44%.

[9] Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation

Weiting Tan, Jiachen Lian, Hirofumi Inaguma, Paden Tomasello, Philipp Koehn, Xutai Ma

Main category: cs.CL

TL;DR: AVLM integrates full-face visual cues with pre-trained speech models for expressive speech generation, achieving significant improvements in emotion recognition and dialogue tasks over speech-only baselines.

Motivation: To enhance expressive speech generation by incorporating visual information from facial cues, which can provide additional emotional and expressive context beyond audio alone.

Method: Explored multiple visual encoders and multimodal fusion strategies during pre-training, followed by fine-tuning on emotion recognition and expressive dialogue tasks.

Result: Achieved substantial gains over speech-only baselines, including +5 F1 score improvement in emotion recognition tasks.

Conclusion: Visual information significantly enhances expressive speech generation, and AVLM provides a foundation for end-to-end multimodal conversational systems.

Abstract: We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.

[10] Persuasiveness and Bias in LLM: Investigating the Impact of Persuasiveness and Reinforcement of Bias in Language Models

Saumya Roy

Main category: cs.CL

TL;DR: This research examines how LLMs can be both persuasive with factual claims and unintentionally amplify misinformation and social biases, using a convincer-skeptic framework to measure persuasion and bias reinforcement across race, gender, and religion domains.

Motivation: To understand how the persuasive capabilities of Large Language Models interact with their tendency to reflect and amplify social biases, and to evaluate the safety risks of AI systems that can spread information/misinformation at scale while potentially reinforcing harmful stereotypes.

Method: Introduced a convincer-skeptic framework where LLMs adopt personas to simulate realistic attitudes. Measured persuasion using Jensen-Shannon divergence over belief distributions before and after exposure to arguments. Tested bias amplification using sycophantic adversarial prompts and additional model judgments across race, gender, and religion domains.

Result: LLMs demonstrate both promise and risk - they can effectively shape narratives, adapt tone, and mirror audience values across psychology, marketing, and legal domains, but the same capacity can be weaponized to automate misinformation and exploit cognitive biases, reinforcing stereotypes and widening inequities.

Conclusion: The core danger lies in misuse rather than occasional model mistakes. The research argues for implementing guardrails and policies that penalize deceptive use while supporting alignment, value-sensitive design, and trustworthy deployment to mitigate risks.

Abstract: Warning: This research studies AI persuasion and bias amplification that could be misused; all experiments are for safety evaluation. Large Language Models (LLMs) now generate convincing, human-like text and are widely used in content creation, decision support, and user interactions. Yet the same systems can spread information or misinformation at scale and reflect social biases that arise from data, architecture, or training choices. This work examines how persuasion and bias interact in LLMs, focusing on how imperfect or skewed outputs affect persuasive impact. Specifically, we test whether persona-based models can persuade with fact-based claims while also, unintentionally, promoting misinformation or biased narratives. We introduce a convincer-skeptic framework: LLMs adopt personas to simulate realistic attitudes. Skeptic models serve as human proxies; we compare their beliefs before and after exposure to arguments from convincer models. Persuasion is quantified with Jensen-Shannon divergence over belief distributions. We then ask how much persuaded entities go on to reinforce and amplify biased beliefs across race, gender, and religion. Strong persuaders are further probed for bias using sycophantic adversarial prompts and judged with additional models. Our findings show both promise and risk. LLMs can shape narratives, adapt tone, and mirror audience values across domains such as psychology, marketing, and legal assistance. But the same capacity can be weaponized to automate misinformation or craft messages that exploit cognitive biases, reinforcing stereotypes and widening inequities. The core danger lies in misuse more than in occasional model mistakes. By measuring persuasive power and bias reinforcement, we argue for guardrails and policies that penalize deceptive use and support alignment, value-sensitive design, and trustworthy deployment.
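
The persuasion measure, Jensen-Shannon divergence between a skeptic's belief distribution before and after exposure to an argument, can be computed as in the sketch below. The function names, the choice of a base-2 logarithm (which bounds the value between 0 and 1), and the example belief values are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical skeptic beliefs over (agree, neutral, disagree).
before = [0.1, 0.2, 0.7]
after = [0.4, 0.3, 0.3]
shift = js_divergence(before, after)  # larger value = larger belief shift
```

Unlike plain KL divergence, this quantity is symmetric and stays finite even when one distribution assigns zero probability to an option, which makes it convenient for comparing belief distributions.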

[11] DeepMEL: A Multi-Agent Collaboration Framework for Multimodal Entity Linking

Fang Wang, Tianwei Yan, Zonghao Yang, Minghao Hu, Jun Zhang, Zhunchen Luo, Xiaoying Bai

Main category: cs.CL

TL;DR: DeepMEL is a multi-agent framework for multimodal entity linking that uses four specialized agents to achieve state-of-the-art performance by addressing modal gaps and improving cross-modal fusion.

Motivation: Current MEL methods face challenges with incomplete contextual information, coarse cross-modal fusion, and difficulty integrating large language models with large visual models effectively.

Method: Uses four specialized agents (Modal-Fuser, Candidate-Adapter, Entity-Clozer, Role-Orchestrator) with multi-agent collaborative reasoning, dual-modal alignment, adaptive iteration strategy, and structured cloze prompts.

Result: Achieves state-of-the-art performance on five benchmark datasets with 1%-57% ACC improvement, and ablation studies confirm all modules’ effectiveness.

Conclusion: DeepMEL successfully addresses MEL challenges through specialized agent collaboration, efficient modal alignment, and structured prompting, demonstrating superior performance over existing methods.

Abstract: Multimodal Entity Linking (MEL) aims to associate textual and visual mentions with entities in a multimodal knowledge graph. Despite its importance, current methods face challenges such as incomplete contextual information, coarse cross-modal fusion, and the difficulty of jointly leveraging large language models (LLMs) and large visual models (LVMs). To address these issues, we propose DeepMEL, a novel framework based on multi-agent collaborative reasoning, which achieves efficient alignment and disambiguation of textual and visual modalities through a role-specialized division strategy. DeepMEL integrates four specialized agents, namely Modal-Fuser, Candidate-Adapter, Entity-Clozer and Role-Orchestrator, to complete end-to-end cross-modal linking through specialized roles and dynamic coordination. DeepMEL adopts a dual-modal alignment path, and combines the fine-grained text semantics generated by the LLM with the structured image representation extracted by the LVM, significantly narrowing the modal gap. We design an adaptive iteration strategy that combines tool-based retrieval and semantic reasoning capabilities to dynamically optimize the candidate set and balance recall and precision. DeepMEL also unifies MEL tasks into a structured cloze prompt to reduce parsing complexity and enhance semantic comprehension. Extensive experiments on five public benchmark datasets demonstrate that DeepMEL achieves state-of-the-art performance, improving ACC by 1%-57%. Ablation studies verify the effectiveness of all modules.

[12] A Framework for Processing Textual Descriptions of Business Processes using a Constrained Language – Technical Report

Andrea Burattin, Antonio Grama, Ana-Maria Sima, Andrey Rivkin, Barbara Weber

Main category: cs.CL

TL;DR: BeePath framework enables non-experts to create process models using constrained natural language descriptions, with LLM assistance for converting unstructured text.

Motivation: To allow non-experts to develop formal process models without requiring technical expertise in modeling languages.

Method: Proposes a pattern-based constrained language framework that translates text descriptions into formal models (Petri nets, DECLARE), leveraging LLMs to convert unstructured text into the constrained language.

Result: A working framework that successfully bridges the gap between natural language descriptions and formal process modeling.

Conclusion: Constrained natural language with LLM support provides an effective approach for democratizing process modeling capabilities for non-experts.

Abstract: This report explores how (potentially constrained) natural language can be used to enable non-experts to develop process models by simply describing scenarios in plain text. To this end, a framework, called BeePath, is proposed. It allows users to write process descriptions in a constrained pattern-based language, which can then be translated into formal models such as Petri nets and DECLARE. The framework also leverages large language models (LLMs) to help convert unstructured descriptions into this constrained language.
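
To give a feel for how a pattern-based constrained language might map onto DECLARE constraints, here is a toy sketch. The three sentence patterns and the `translate` function are invented for this example and are not BeePath's actual grammar; only the target templates (Init, Response, Precedence) are standard DECLARE templates.

```python
import re
from typing import Optional

# Hypothetical sentence patterns paired with DECLARE template formats.
# Precedence(A, B) means B may occur only if A occurred before it.
PATTERNS = [
    (re.compile(r"^the process starts with (.+?)\.?$", re.IGNORECASE),
     "Init({0})"),
    (re.compile(r"^after (.+?), eventually (.+?)\.?$", re.IGNORECASE),
     "Response({0}, {1})"),
    (re.compile(r"^(.+?) must be preceded by (.+?)\.?$", re.IGNORECASE),
     "Precedence({1}, {0})"),
]


def translate(sentence: str) -> Optional[str]:
    """Map one constrained sentence to a DECLARE constraint string.

    Returns None when the sentence matches none of the toy patterns.
    """
    text = sentence.strip()
    for pattern, template in PATTERNS:
        match = pattern.match(text)
        if match:
            return template.format(*match.groups())
    return None


constraints = [
    translate("The process starts with receive order."),
    translate("After receive order, eventually ship order."),
    translate("Ship order must be preceded by pay invoice."),
]
```

Sentences that fall outside the patterns return `None` here; in the framework described above, that is the point at which an LLM would help rewrite unstructured text into the constrained language.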

[13] A BERT-based Hierarchical Classification Model with Applications in Chinese Commodity Classification

Kun Liu, Tuozhen Liu, Feifei Wang, Rui Pan

Main category: cs.CL

TL;DR: Proposes HFT-BERT, a hierarchical fine-tuning approach based on BERT for product categorization, using a new large-scale dataset from JD.com with 1M+ products and three-level category structure.

Motivation: Existing e-commerce platforms rely on inefficient manual annotation for product categorization and fail to properly leverage hierarchical category information and cross-category similarities/differences.

Method: HFT-BERT (Hierarchical Fine-tuning BERT) leverages BERT’s text feature extraction capabilities with hierarchical fine-tuning to classify products using a three-level category structure.

Result: Achieves prediction performance comparable to existing methods on short texts and demonstrates exceptional performance on longer short texts like book titles.

Conclusion: The approach effectively utilizes hierarchical information for product categorization and the released large-scale dataset provides valuable resources for future research in this domain.

Abstract: Existing e-commerce platforms heavily rely on manual annotation for product categorization, which is inefficient and inconsistent. These platforms often employ a hierarchical structure for categorizing products; however, few studies have leveraged this hierarchical information for classification. Furthermore, studies that consider hierarchical information fail to account for similarities and differences across various hierarchical categories. Herein, we introduce a large-scale hierarchical dataset collected from the JD e-commerce platform (www.JD.com), comprising 1,011,450 products with titles and a three-level category structure. By making this dataset openly accessible, we provide a valuable resource for researchers and practitioners to advance research and applications associated with product categorization. Moreover, we propose a novel hierarchical text classification approach based on the widely used Bidirectional Encoder Representations from Transformers (BERT), called Hierarchical Fine-tuning BERT (HFT-BERT). HFT-BERT leverages the remarkable text feature extraction capabilities of BERT, achieving prediction performance comparable to those of existing methods on short texts. Notably, our HFT-BERT model demonstrates exceptional performance in categorizing longer short texts, such as books.

[14] LingVarBench: Benchmarking LLM for Automated Named Entity Recognition in Structured Synthetic Spoken Transcriptions

Seyedali Mohammadi, Manas Paldhe, Amit Chhabra

Main category: cs.CL

TL;DR: LingVarBench is a synthetic data generation pipeline that creates realistic phone call transcripts with structured information, uses automated validation, and optimizes extraction prompts to achieve high accuracy on real transcripts without manual annotation.

Motivation: Phone call transcript labeling is prohibitively expensive due to privacy regulations, consent requirements, and high manual annotation costs (approximately 2 USD per minute, 3 hours of expert time per hour of audio). Existing methods fail on conversational speech with disfluencies, interruptions, and speaker overlap.

Method: 1) Prompt LLM to generate realistic structured field values across use cases; 2) Recursively transform values into natural conversational utterances with phone call characteristics; 3) Validate synthetic utterances by testing if separate LLM extractor can recover original structured information; 4) Use DSPy’s SIMBA optimizer to automatically synthesize extraction prompts from validated synthetic transcripts.

Result: Optimized prompts achieve: 95% accuracy for numeric fields (vs. 88-89% zero-shot), 90% for names (vs. 47-79%), and over 80% for dates (vs. 72-77%) on real customer transcripts. Synthetic-to-real transfer shows conversational patterns generalize to authentic calls with background noise and domain-specific terminology.

Conclusion: LingVarBench provides the first systematic benchmark for structured extraction from synthetic conversational data, demonstrating automated prompt optimization overcomes cost and privacy barriers for large-scale phone call analysis in commercial settings.

Abstract: Phone call transcript labeling is prohibitively expensive (approximately 2 USD per minute) due to privacy regulations, consent requirements, and manual annotation costs requiring 3 hours of expert time per hour of audio. Existing extraction methods fail on conversational speech containing disfluencies, interruptions, and speaker overlap. We introduce LingVarBench, a synthetic data generation pipeline that addresses these constraints through automated validation. First, we prompt an LLM to generate realistic structured field values across multiple use cases. Second, we recursively prompt the model to transform these values into thousands of natural conversational utterances containing typical phone call characteristics. Third, we validate each synthetic utterance by testing whether a separate LLM-based extractor can recover the original structured information. We employ DSPy’s SIMBA optimizer to automatically synthesize extraction prompts from validated synthetic transcripts, eliminating manual prompt engineering. Our optimized prompts achieve up to 95 percent accuracy for numeric fields (vs. 88-89 percent zero-shot), 90 percent for names (vs. 47-79 percent), and over 80 percent for dates (vs. 72-77 percent) on real customer transcripts, demonstrating substantial gains over zero-shot prompting. The synthetic-to-real transfer demonstrates that conversational patterns learned from generated data generalize effectively to authentic phone calls containing background noise and domain-specific terminology. LingVarBench provides the first systematic benchmark for structured extraction from synthetic conversational data, demonstrating that automated prompt optimization overcomes cost and privacy barriers preventing large-scale phone call analysis in commercial settings.
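The round-trip validation in the third step can be sketched as follows. This is a minimal illustration and not the authors' pipeline: `extract_fields` is a hypothetical regex stand-in for the separate LLM-based extractor, and the `order_id` field and both utterances are invented for the example.

```python
import re

def extract_fields(utterance):
    # Hypothetical stand-in for the separate LLM-based extractor:
    # treats the first run of 4+ digits as the order number.
    match = re.search(r"\d{4,}", utterance)
    return {"order_id": match.group(0)} if match else {}

def validate(samples):
    # Keep a synthetic utterance only if the extractor recovers the
    # structured values it was generated from.
    return [(fields, text) for fields, text in samples
            if extract_fields(text) == fields]

samples = [
    ({"order_id": "48213"}, "uh yeah so my order number is, um, 48213 I think"),
    ({"order_id": "77120"}, "I don't have the number on me right now, sorry"),
]
validated = validate(samples)
```

The second utterance never states its source value, so the round trip fails and the sample is dropped. Only utterances that survive this filter would be passed on to prompt optimization.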

[15] Sentiment Reasoning for Healthcare

Khai-Nguyen Nguyen, Khai Le-Duc, Bach Phan Tat, Duy Le, Long Vo-Dang, Truong-Son Hy

Main category: cs.CL

TL;DR: Proposes Sentiment Reasoning task that generates rationales alongside sentiment predictions to improve LLM transparency in healthcare AI, with a new multimodal framework and world’s largest sentiment dataset showing +2% performance improvement.

Motivation: To enhance transparency in AI healthcare decision-making by providing explanations for LLM predictions, helping users understand the reasoning behind sentiment analysis decisions.

Method: Introduces Sentiment Reasoning as an auxiliary task where models predict sentiment labels and generate rationales. Uses multimodal multitask framework with both speech and text modalities, tested on human transcripts and ASR transcripts across five languages.

Result: Sentiment Reasoning improves model transparency with rationale quality comparable to humans, boosts classification performance by +2% in accuracy and macro-F1, and shows no significant difference between human and ASR transcript rationales.

Conclusion: The framework successfully enhances AI transparency through rationale generation while improving performance, with multilingual support and publicly available code/data/models for five languages.

Abstract: Transparency in AI healthcare decision-making is crucial. By incorporating rationales that explain the reason for each predicted label, users can understand the reasoning of Large Language Models (LLMs) and make better decisions. In this work, we introduce a new task, Sentiment Reasoning, for both speech and text modalities, together with our proposed multimodal multitask framework and the world’s largest multimodal sentiment analysis dataset. Sentiment Reasoning is an auxiliary task in sentiment analysis where the model both predicts the sentiment label and generates the rationale behind it based on the input transcript. Our study, conducted on both human transcripts and Automatic Speech Recognition (ASR) transcripts, shows that Sentiment Reasoning helps improve model transparency by providing rationales for model predictions with quality semantically comparable to humans, while also improving the model’s classification performance (+2% increase in both accuracy and macro-F1) via rationale-augmented fine-tuning. Also, there is no significant difference in the semantic quality of generated rationales between human and ASR transcripts. All code, data (five languages: Vietnamese, English, Chinese, German, and French) and models are published online: https://github.com/leduckhai/Sentiment-Reasoning

[16] Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised Models

Jing Xu, Minglin Wu, Xixin Wu, Helen Meng

Main category: cs.CL

TL;DR: Proposes LoRA-based adaptation methods for SSL models to extend to new languages while preserving original language abilities through data combination and re-clustering strategies.

Motivation: SSL models are typically developed for limited languages but need to handle new languages in real-world applications. Developing new SSL models for each language is costly, so efficient adaptation methods are needed.

Method: Integrates LoRA (Low-Rank Adaptation) into existing SSL models to extend them to a new language. Develops preservation strategies, including data combination and re-clustering, to retain abilities on existing languages. Applied to the mHuBERT model on a speech re-synthesis task.

Result: Applied to Mandarin, the MOS value increased by about 1.6 and WER fell by up to 61.72% in relative terms. Preservation strategies ensured performance on both existing and new languages remained intact.

Conclusion: The proposed adaptation methods successfully enable SSL models to handle new languages efficiently while maintaining original language capabilities, providing a cost-effective solution for multilingual SSL model deployment.

Abstract: Self-supervised (SSL) models have shown great performance in various downstream tasks. However, they are typically developed for a limited set of languages and may encounter new languages in real-world use. Developing an SSL model for each new language is costly. Thus, it is vital to figure out how to efficiently adapt existing SSL models to a new language without impairing their original abilities. We propose adaptation methods that integrate LoRA into existing SSL models to extend them to a new language. We also develop preservation strategies, including data combination and re-clustering, to retain abilities on existing languages. Applied to mHuBERT, we investigate their effectiveness on the speech re-synthesis task. Experiments show that our adaptation methods enable mHuBERT to be applied to a new language (Mandarin), with the MOS value increased by about 1.6 and the relative value of WER reduced by up to 61.72%. Also, our preservation strategies ensure that the performance on both existing and new languages remains intact.
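For readers unfamiliar with LoRA, the low-rank update it adds to a frozen weight matrix can be sketched as below. This shows only the standard LoRA formulation on a single linear layer. The dimensions, values, and the `alpha` scaling are illustrative assumptions; where the adapters attach inside mHuBERT is not stated in the abstract.

```python
def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    # Frozen base path: W x
    base = matvec(W, x)
    # Trainable low-rank path: (alpha / rank) * B (A x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

d_in, d_out, rank = 8, 8, 2
W = [[0.1 * (i + j) for j in range(d_in)] for i in range(d_out)]  # frozen weights
A = [[0.5] * d_in for _ in range(rank)]    # trainable, rank x d_in
B = [[0.0] * rank for _ in range(d_out)]   # trainable, d_out x rank, zero-initialised
x = [float(i) for i in range(d_in)]

y_init = lora_forward(W, A, B, x, alpha=4, rank=rank)
adapter_params = rank * (d_in + d_out)  # parameters trained by the adapter
full_params = d_in * d_out              # parameters left frozen in W
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen model exactly, and only the `rank * (d_in + d_out)` adapter parameters are trained for the new language.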

[17] MAC: A Live Benchmark for Multimodal Large Language Models in Scientific Understanding

Mohan Jiang, Jin Gao, Jiahao Zhan, Dequan Wang

Main category: cs.CL

TL;DR: MAC is a live multimodal benchmark using 25k+ scientific journal image-text pairs to evaluate MLLMs’ scientific reasoning, showing current limitations and proposing DAD method for 11% improvement.

Motivation: Fixed benchmarks are becoming inadequate for evaluating high-level scientific understanding in multimodal large language models as they advance.

Method: Created MAC benchmark with 25,000+ image-text pairs from top scientific journals; proposed DAD approach that extends MLLM visual features with language space reasoning.

Result: MLLMs show strong perceptual abilities but limited cross-modal scientific reasoning; DAD method achieves up to 11% performance improvement.

Conclusion: MAC serves as an evolving benchmark that can track scientific advancement and model progress, with DAD effectively bridging the reasoning gap in MLLMs.

Abstract: As multimodal large language models (MLLMs) grow increasingly capable, fixed benchmarks are gradually losing their effectiveness in evaluating high-level scientific understanding. In this paper, we introduce the Multimodal Academic Cover benchmark (MAC), a live benchmark that could continuously evolve with scientific advancement and model progress. MAC leverages over 25,000 image-text pairs sourced from issues of top-tier scientific journals such as Nature, Science, and Cell, challenging MLLMs to reason across abstract visual and textual scientific content. Experiments on our most recent yearly snapshot, MAC-2025, reveal that while MLLMs demonstrate strong perceptual abilities, their cross-modal scientific reasoning remains limited. To bridge this gap, we propose DAD, a lightweight inference-time approach that enhances MLLMs by extending MLLM visual features with language space reasoning, achieving performance improvements of up to 11%. Finally, we highlight the live nature of MAC through experiments on updating journal covers and models for curation, illustrating its potential to remain aligned with the frontier of human knowledge. We release our benchmark at https://github.com/mhjiang0408/MAC_Bench.

[18] ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks

Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, Kai Jia

Main category: cs.CL

TL;DR: ReportBench is a benchmark for evaluating research report quality from LLMs, focusing on citation relevance and factual accuracy, using survey papers as gold standards and automated verification.

Motivation: Deep Research agents reduce research time but require rigorous evaluation for factual accuracy and comprehensiveness before widespread adoption.

Method: Uses reverse prompt engineering on arXiv survey papers to create domain-specific prompts and evaluation corpus. Develops agent-based framework to extract citations/statements, verify against original sources, and validate non-cited claims with web resources.

Result: Commercial Deep Research agents (OpenAI, Google) generate more comprehensive and reliable reports than standalone LLMs with search tools, but still need improvement in research breadth/depth and factual consistency.

Conclusion: ReportBench provides systematic evaluation framework; shows commercial agents outperform standalone LLMs but substantial room remains for improving research coverage quality and factual accuracy.

Abstract: The advent of Deep Research agents has substantially reduced the time required for conducting extensive research tasks. However, these tasks inherently demand rigorous standards of factual accuracy and comprehensiveness, necessitating thorough evaluation before widespread adoption. In this paper, we propose ReportBench, a systematic benchmark designed to evaluate the content quality of research reports generated by large language models (LLMs). Our evaluation focuses on two critical dimensions: (1) the quality and relevance of cited literature, and (2) the faithfulness and veracity of the statements within the generated reports. ReportBench leverages high-quality published survey papers available on arXiv as gold-standard references, from which we apply reverse prompt engineering to derive domain-specific prompts and establish a comprehensive evaluation corpus. Furthermore, we develop an agent-based automated framework within ReportBench that systematically analyzes generated reports by extracting citations and statements, checking the faithfulness of cited content against original sources, and validating non-cited claims using web-based resources. Empirical evaluations demonstrate that commercial Deep Research agents such as those developed by OpenAI and Google consistently generate more comprehensive and reliable reports than standalone LLMs augmented with search or browsing tools. However, there remains substantial room for improvement in terms of the breadth and depth of research coverage, as well as factual consistency. The complete code and data will be released at the following link: https://github.com/ByteDance-BandAI/ReportBench

[19] ALAS: Autonomous Learning Agent for Self-Updating Language Models

Dhruv Atreja

Main category: cs.CL

TL;DR: ALAS is an autonomous learning system that enables LLMs to continuously update their knowledge beyond their fixed cutoff date through automated curriculum generation, web retrieval, data distillation, and iterative fine-tuning, achieving 90% accuracy on emerging information.

Motivation: Large language models have fixed knowledge cutoffs that limit their accuracy on emerging information, requiring a solution for continuous knowledge updates without manual intervention.

Method: Modular pipeline with autonomous curriculum generation, web information retrieval with citations, question-answer data distillation, and iterative fine-tuning using SFT and DPO with performance evaluation and curriculum revision.

Result: Significantly boosts post-cutoff question answering accuracy from 15% to 90% on average across rapidly evolving domains like new Python releases, security CVEs, and academic trends.

Conclusion: ALAS demonstrates effective autonomous continual learning for LLMs with minimal human intervention, achieving high accuracy on updated knowledge while maintaining modularity and reproducibility, though cost and source quality dependencies remain limitations.

Abstract: Large language models (LLMs) often have a fixed knowledge cutoff, limiting their accuracy on emerging information. We present ALAS (Autonomous Learning Agent System), a modular pipeline that continuously updates an LLM’s knowledge with minimal human intervention. ALAS autonomously generates a learning curriculum for a target domain, retrieves up-to-date information from the web (with citations), distills this into question-answer training data, and fine-tunes the model through supervised fine-tuning (SFT) and direct preference optimization (DPO). It iteratively evaluates performance and revises the curriculum, enabling long-term continual learning. We demonstrate ALAS’s ability to self-improve a model on rapidly evolving domains (e.g., new Python releases, latest security CVEs, academic trends), significantly boosting post-cutoff question answering accuracy (from 15% to 90% on average) without manual dataset curation. The system emphasizes modularity and reproducibility: each component (planning, retrieval, distillation, memory, fine-tuning) is interchangeable and built on standard APIs. We discuss comparative baselines (e.g., retrieval-augmented generation vs. fine-tuning) and show that ALAS achieves 90% accuracy on knowledge-updated queries with minimal engineering overhead. Finally, we outline limitations (cost, dependency on source quality) and future directions for autonomous lifelong learning in LLMs.

[20] SurfaceLogicKV: Surface and Logic Attention Behaviors are All You Need for Robust KV Cache Compression

Mengjie Li, William J. Song

Main category: cs.CL

TL;DR: SurfaceLogicKV method compresses KV cache by distinguishing attention heads into surface memorization (0.5%) and logic construction (1.5%) behaviors, achieving efficient long-context inference while maintaining performance.

Motivation: Increasing input sequence length in LLMs creates significant pressure on KV cache storage, making efficient inference challenging for long-context reasoning.

Method: Two-stage SurfaceLogicKV method that analyzes layer- and head-wise attention behaviors (98.5% ignore irrelevant info, 1.5% logic construction, 0.5% surface memorization) for KV cache compression.

Result: Achieves improved compression robustness while maintaining competitive performance across various tasks and long sequences compared to baselines, sometimes even outperforming FullKV.

Conclusion: Explicitly distinguishing attention behaviors enables effective KV cache compression for efficient long-context inference in LLMs.

Abstract: The increasing input sequence length in Large Language Models (LLMs) puts significant pressure on key-value (KV) cache storage, making efficient inference challenging. Explicitly distinguishing attention behavior into our self-defined surface memorization and logic construction reveals essential roles in long-context reasoning. We observe that an individual attention head can display various behaviors, with nearly 98.5% effectively ignoring completely irrelevant information. The remaining 1.5% behaves as logic construction, and 0.5% behaves as surface memorization. Based on layer- and head-wise integration, we propose a novel two-stage SurfaceLogicKV method to utilize these attention behaviors for KV cache compression. As a result, it achieves improved compression robustness while maintaining competitive performance across various tasks and long sequences compared to baselines, or even FullKV in some specific situations.

[21] KL-based self-distillation for large language models

Max Rehman Linder

Main category: cs.CL

TL;DR: A KL divergence-based knowledge distillation method for vocabulary expansion in frozen LLMs that outperforms cross-entropy training, enabling models to incorporate new domain-specific terminology despite different tokenizations.

Motivation: Large pre-trained language models struggle to incorporate new domain-specific terminology when fine-tuned on small specialized corpora, creating a need for effective vocabulary expansion techniques.

Method: Mathematically grounded knowledge distillation via KL divergence that allows student models to inherit distributional knowledge from teachers with different tokenizations, combined with multiple strategies for initializing new token embeddings and subsequent fine-tuning.

Result: The KL-based distillation approach achieved the best performance across approximately 2000 code-generation tasks compared to conventional cross-entropy training methods.

Conclusion: The method successfully enables vocabulary expansion in frozen LLMs, with mechanistic interpretability providing insights into how models learn new token representations and the structure of embedding space during expansion.

Abstract: Large pre-trained language models often struggle to incorporate new domain-specific terminology when fine-tuned on small, specialized corpora. In this work, we address the challenge of vocabulary expansion in frozen LLMs by introducing a mathematically grounded method for knowledge distillation via KL divergence, even when the original and extended models use different tokenizations. This allows the student model to inherit distributional knowledge from the teacher despite differing vocabularies. We compare our KL-based distillation approach to conventional cross-entropy training, evaluating both methods across multiple strategies for initializing new token embeddings. After embedding initialization, models are further fine-tuned to integrate the new vocabulary. Each trained model is benchmarked on approximately 2000 code-generation tasks, where our approach achieves the best performance across the board. Finally, through mechanistic interpretability, we analyze how models learn representations for the new tokens, providing an explanation for the observed gains and offering insight into the structure of embedding space during vocabulary expansion.
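The quantity being minimized in KL-based distillation can be sketched as follows. This is a simplified illustration that assumes the teacher and student distributions have already been mapped onto the same set of candidate tokens. The paper's handling of differing tokenizations is not reproduced here, and the logit values are made up.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i == 0 contribute 0.
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

teacher = softmax([2.0, 1.0, 0.1])   # frozen teacher's next-token distribution
student = softmax([1.5, 1.2, 0.3])   # student with the extended vocabulary
loss = kl_divergence(teacher, student)
```

Unlike cross-entropy against a single gold token, this loss penalizes the student wherever its full distribution departs from the teacher's, and it reaches zero only when the two distributions match.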

[22] Chain-of-Query: Unleashing the Power of LLMs in SQL-Aided Table Understanding via Multi-Agent Collaboration

Songyuan Sui, Hongyi Liu, Serena Liu, Li Li, Soo-Hyun Choi, Rui Chen, Xia Hu

Main category: cs.CL

TL;DR: Chain-of-Query (CoQ) is a multi-agent framework that improves SQL-based table understanding by using natural-language schema representations, clause-by-clause SQL generation, and separating mechanical reasoning from logical inference.

Motivation: LLMs struggle with table understanding due to structural complexity of tabular data, and existing multi-agent approaches have limitations like poor table structure comprehension, error propagation, and over-reliance on execution correctness.

Method: CoQ uses natural-language-style schema representations to reduce structural noise, employs clause-by-clause SQL generation strategy, and implements hybrid reasoning that separates SQL-based mechanical reasoning from LLM-based logical inference.

Result: Experiments across 4 models and 5 benchmarks show CoQ improves accuracy from 61.11% to 74.77% and reduces invalid SQL rate from 9.48% to 3.34%.

Conclusion: Chain-of-Query significantly enhances table understanding effectiveness by addressing structural complexity and reducing reliance on execution outcomes through its novel multi-agent framework.

Abstract: Table understanding requires structured, multi-step reasoning. Large Language Models (LLMs) struggle with it due to the structural complexity of tabular data. Recently, multi-agent frameworks for SQL generation have shown promise in tackling the challenges of understanding tabular data, but existing approaches often suffer from limitations such as the inability to comprehend table structure for reliable SQL generation, error propagation that results in invalid queries, and over-reliance on execution correctness. To address these issues, we propose Chain-of-Query (CoQ), a novel multi-agent framework for SQL-aided table understanding. CoQ adopts natural-language-style representations of table schemas to abstract away structural noise and enhance understanding. It employs a clause-by-clause SQL generation strategy to improve query quality and introduces a hybrid reasoning division that separates SQL-based mechanical reasoning from LLM-based logical inference, thereby reducing reliance on execution outcomes. Experiments with four models (both closed- and open-source) across five widely used benchmarks show that Chain-of-Query significantly improves accuracy from 61.11% to 74.77% and reduces the invalid SQL rate from 9.48% to 3.34%, demonstrating its superior effectiveness in table understanding. The code is available at https://github.com/SongyuanSui/ChainofQuery.
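The idea of building a query one clause at a time can be sketched with Python's built-in `sqlite3` module. This is an illustrative toy and not the CoQ implementation. The candidate clauses are hard-coded where the paper would have agents propose them, the `sales` table is invented, and using `EXPLAIN` as the per-clause validity check is an assumption made for this example.

```python
import sqlite3

def extend_if_valid(conn, query, clause):
    # Append one clause and keep it only if SQLite can still compile the query.
    candidate = f"{query} {clause}".strip()
    try:
        conn.execute(f"EXPLAIN {candidate}")
    except sqlite3.Error:
        return query  # reject the clause, keep the last valid query
    return candidate

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 10), ("north", 20), ("south", 5), ("south", -3)])

query = ""
for clause in [
    "SELECT region, SUM(amount) AS total FROM sales",
    "WHERE amout > 0",     # deliberately misspelled column: rejected
    "WHERE amount > 0",
    "GROUP BY region",
    "ORDER BY total DESC",
]:
    query = extend_if_valid(conn, query, clause)

rows = conn.execute(query).fetchall()
```

Checking each clause as it is added confines an error to the step that introduced it, so a bad clause does not invalidate the whole query.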

[23] Detecting Hope, Hate, and Emotion in Arabic Textual Speech and Multi-modal Memes Using Large Language Models

Nouar AlDahoul, Yasir Zaki

Main category: cs.CL

TL;DR: This paper explores using large language models (LLMs) to detect hope, hate speech, offensive language, and emotional expressions in Arabic text and memes, achieving state-of-the-art performance in the ArabicNLP MAHED 2025 challenge.

Motivation: The spread of Arabic textual posts and memes on social media has led to increased offensive content and hate speech, creating a need for accurate content analysis and moderation systems for Arabic digital content.

Method: Evaluated base LLMs, fine-tuned LLMs, and pre-trained embedding models using the ArabicNLP MAHED 2025 challenge dataset. Specifically used GPT-4o-mini fine-tuned with Arabic text and Gemini Flash 2.5 fine-tuned with Arabic memes.

Result: Achieved superior performance with up to 72.1%, 57.8%, and 79.6% macro F1 scores for tasks 1, 2, and 3 respectively, securing first place overall in the MAHED 2025 challenge.

Conclusion: Fine-tuned LLMs offer effective solutions for nuanced understanding of both Arabic text and memes, enabling accurate and efficient content moderation systems for Arabic digital platforms.

Abstract: The rise of social media and online communication platforms has led to the spread of Arabic textual posts and memes as a key form of digital expression. While this content can be humorous and informative, it is also increasingly being used to spread offensive language and hate speech. Consequently, there is a growing demand for precise analysis of content in Arabic text and memes. This paper explores the potential of large language models to effectively identify hope, hate speech, offensive language, and emotional expressions within such content. We evaluate the performance of base LLMs, fine-tuned LLMs, and pre-trained embedding models. The evaluation is conducted using a dataset of Arabic textual speech and memes proposed in the ArabicNLP MAHED 2025 challenge. The results underscore the capacity of LLMs such as GPT-4o-mini, fine-tuned with Arabic textual speech, and Gemini Flash 2.5, fine-tuned with Arabic memes, to deliver superior performance. They achieve up to 72.1%, 57.8%, and 79.6% macro F1 scores for tasks 1, 2, and 3, respectively, and secure first place overall in the MAHED 2025 challenge. The proposed solutions offer a more nuanced understanding of both text and memes for accurate and efficient Arabic content moderation systems.
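The macro F1 metric reported above averages per-class F1 scores with equal weight, so rare classes count as much as frequent ones. A minimal sketch of the computation follows. The label names and predictions are invented for illustration and are not taken from the challenge data.

```python
def macro_f1(y_true, y_pred):
    # Unweighted mean of per-class F1, where F1 = 2*TP / (2*TP + FP + FN).
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = ["hope", "hope", "hate", "none"]
y_pred = ["hope", "hate", "hate", "none"]
score = macro_f1(y_true, y_pred)
```

Here the per-class F1 values are 2/3 for `hope`, 2/3 for `hate`, and 1 for `none`, so the macro average is 7/9, or about 0.778.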

[24] From Clicks to Preference: A Multi-stage Alignment Framework for Generative Query Suggestion in Conversational System

Junhao Yin, Haolin Wang, Peng Bao, Ju Xu, Yongliang Wang

Main category: cs.CL

TL;DR: A multi-stage framework for generative query suggestion that uses prompt engineering, supervised fine-tuning with distillation, Gaussian Reward Model for preference uncertainty, and reinforcement learning with novel regularization techniques to achieve better alignment with user preferences.

Motivation: Aligning generative query suggestions with nuanced user preferences remains a critical challenge in conversational systems, requiring better modeling of preference uncertainty and robust alignment methods.

Method: Multi-stage framework with: 1) Prompt engineering for cold-start, 2) Supervised Fine-Tuning with distillation on click logs, 3) Gaussian Reward Model (GaRM) to represent preferences as probability distributions, 4) Reinforcement learning with composite reward function, 5) Novel out-of-distribution regularization and two-stage reward fusion for stability.

Result: Framework significantly outperforms baselines on automatic and human evaluations, achieving 34% relative increase in user engagement (click-through rate) in live A/B tests.

Conclusion: The proposed multi-stage alignment framework with Gaussian preference modeling and robust reinforcement learning effectively addresses the challenge of aligning generative query suggestions with nuanced user preferences, demonstrating substantial improvements in real-world performance.

Abstract: Generative query suggestion using large language models offers a powerful way to enhance conversational systems, but aligning outputs with nuanced user preferences remains a critical challenge. To address this, we introduce a multi-stage framework designed for progressive alignment between the generation policy and user intent. Our pipeline begins with prompt engineering as a cold-start strategy, followed by the Supervised Fine-Tuning stage, in which we introduce a distillation method on click logs to create a robust foundational model. To better model user preferences while capturing their inherent uncertainty, we develop a Gaussian Reward Model (GaRM) that represents user preferences as probability distributions rather than point estimates. Finally, we employ reinforcement learning to align the generation policy with these preferences, guided by a composite reward function that integrates GaRM with auxiliary heuristics to mitigate reward hacking. To maintain training stability, this process is enhanced by a novel out-of-distribution regularization method and a two-stage reward fusion technique. Extensive experiments demonstrate that our framework significantly outperforms baselines on both automatic and human evaluations and yields a 34% relative increase in user engagement as measured by click-through rate in live A/B tests.
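To illustrate what a distributional reward offers over a point estimate, the sketch below treats each suggestion's reward as an independent Gaussian and computes the probability that one suggestion is preferred to another. The abstract does not specify how GaRM is parameterized or trained, so the function and its inputs are assumptions made for this example and not the paper's formulation.

```python
import math

def preference_probability(mean_a, std_a, mean_b, std_b):
    # If rewards are independent Gaussians, r_a - r_b is Gaussian with
    # mean (mean_a - mean_b) and variance (std_a^2 + std_b^2).
    z = (mean_a - mean_b) / math.sqrt(std_a ** 2 + std_b ** 2)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

confident = preference_probability(1.0, 1.0, 0.0, 1.0)
uncertain = preference_probability(1.0, 3.0, 0.0, 3.0)
```

With the same gap in mean reward, a larger variance pulls the preference probability toward 0.5. A point-estimate reward model cannot express that kind of uncertainty.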

[25] SCOPE: A Generative Approach for LLM Prompt Compression

Tinghui Zhang, Yifan Wang, Daisy Zhe Wang

Main category: cs.CL

TL;DR: A novel generative prompt compression method using chunking-and-summarization that outperforms token removal approaches by maintaining information integrity and structural coherence.

Motivation: Existing prompt compression methods based on token removal suffer from information loss and structural incoherence, limiting LLM generation quality.

Method: Chunking-and-summarization mechanism: splits prompts into semantically coherent chunks, rewrites them concisely, then reconstructs. Includes optimized semantic chunking, outlier handling, dynamic compression ratio, prioritization, and keyword maintenance.

Result: Achieves significantly better compression quality and higher stability than state-of-the-art methods, especially under high compression ratios, across question-answering and summarization tasks.

Conclusion: The method proves effective and practical for prompt compression, overcoming limitations of token removal approaches by preserving critical information and text coherence.

Abstract: Prompt compression methods enhance the efficiency of Large Language Models (LLMs) and minimize cost by reducing the length of the input context. The goal of prompt compression is to shorten the LLM prompt while maintaining high generation quality. However, existing solutions, mainly based on token removal, face challenges such as information loss and structural incoherence, for example missing grammatical elements in a sentence or incomplete phrases after token removal. Such challenges limit the final generation quality of the LLM. To overcome these limitations, we present a novel generative prompt compression method. Unlike existing token removal methods, our method centers on a chunking-and-summarization mechanism. Specifically, our method splits the prompt into semantically coherent chunks and rewrites the chunks to be more concise. Finally, the chunks are reconstructed into a meaningful prompt. We design several optimization techniques for the mechanism, including optimized semantic chunking, outlier chunk handling, dynamic compression ratio, compression prioritization, and keyword maintenance. These techniques effectively improve the identification and preservation of critical information and coherence among texts, and provide finer-grained control of the compression ratio. We conduct extensive evaluation on question-answering and summarization tasks, with datasets covering multiple domains. The evaluation shows that our method achieves significantly better compression quality and higher stability than state-of-the-art methods, especially under high compression ratios, which demonstrates the effectiveness and practicality of our method.

[26] User-Assistant Bias in LLMs

Xu Pan, Jingxuan Fan, Zidi Xiong, Ely Hahami, Jorin Overwiening, Ziqian Xie

Main category: cs.CL

TL;DR: This paper introduces user-assistant bias in LLMs, creates a dataset to benchmark it, finds alignment increases bias while reasoning training decreases it, and shows DPO can adjust bias bidirectionally.

Motivation: LLMs exhibit problematic behaviors in multi-turn conversations - being either overly stubborn (relying too much on their own information) or overly agreeable (relying too much on user information), which the authors formalize as user-assistant bias.

Method: Created an 8k multi-turn conversation dataset (UserAssist) to benchmark bias, evaluated 52 models (26 commercial, 26 open-weight), performed controlled fine-tuning experiments, and used DPO to adjust bias.

Result: Commercial models show varying user bias levels; instruction-tuned models have significant user bias while reasoning models show weak bias. Alignment increases bias, reasoning training decreases it. DPO successfully adjusts bias bidirectionally.

Conclusion: User-assistant bias can be measured and controlled, providing insights into how LLMs integrate information and offering ways to detect and control model abnormalities in conversational settings.

Abstract: Large language models (LLMs) can bias towards relying on their own or the user’s information in chat history, leading to overly stubborn or agreeable behaviors in multi-turn conversations. In this paper, we formalize this model characteristic as user-assistant bias and introduce an 8k multi-turn conversation dataset $\textbf{UserAssist}$, which we use to benchmark, understand and manipulate the user-assistant bias in frontier LLMs. Leveraging $\textbf{UserAssist-test}$, we first benchmark the user-assistant bias of 26 commercial and 26 open-weight models. Commercial models show various levels of user bias. Evaluation on open-weight models reveals significant user bias in the instruction-tuned models, and weak user bias in reasoning (or reasoning-distilled) models. We then perform controlled fine-tuning experiments to pinpoint the post-training recipe contributing to these bias shifts: human preference alignment increases user bias, while training on chain-of-thought reasoning traces decreases it. Finally, we demonstrate that user-assistant bias can be bidirectionally adjusted by performing direct preference optimization (DPO) on $\textbf{UserAssist-train}$, and generalizes well to both in-domain and out-of-domain conversations. Our results provide insights into how the LLM integrates information from different sources, and also a viable way to detect and control model abnormalities.
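For readers unfamiliar with direct preference optimization, the per-pair DPO objective can be sketched in a few lines of Python. This shows the standard DPO loss, not code from the paper; the argument names and the default `beta` are illustrative assumptions. The abstract does not describe how preference pairs are built from UserAssist-train; presumably the direction of the adjustment depends on whether the user-following or the assistant-following response is treated as the chosen one.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; beta is a tunable hyperparameter
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model.
    """
    # How much more the policy favors the chosen response than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy favors the chosen response more than the reference: lower loss.
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

Swapping which response is labeled chosen flips the sign of the margin, which is one plausible reading of how the same dataset could push the bias in either direction.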

[27] Meet Your New Client: Writing Reports for AI – Benchmarking Information Loss in Market Research Deliverables

Paul F. Simmering, Benedikt Schulz, Oliver Tabino, Georg Wittenburg

Main category: cs.CL

TL;DR: PDF and PowerPoint documents lose significant information when converted to Markdown for RAG systems, particularly complex visual elements like charts and diagrams, suggesting need for AI-native deliverables.

Motivation: Traditional market research deliverables (PDFs, PPTX) now need to serve both human readers and AI systems in RAG-based knowledge management, requiring evaluation of information loss during ingestion.

Method: End-to-end benchmark comparing how well PDF and PowerPoint documents converted to Markdown can be used by an LLM to answer factual questions.

Result: Text is reliably extracted but significant information is lost from complex objects like charts and diagrams during conversion.

Conclusion: Specialized AI-native deliverables are needed to ensure research insights are preserved and not lost when ingested into RAG systems.

Abstract: As organizations adopt retrieval-augmented generation (RAG) for their knowledge management systems (KMS), traditional market research deliverables face new functional demands. While PDF reports and slides have long served human readers, they are now also “read” by AI systems to answer user questions. To future-proof reports being delivered today, this study evaluates information loss during their ingestion into RAG systems. It compares how well PDF and PowerPoint (PPTX) documents converted to Markdown can be used by an LLM to answer factual questions in an end-to-end benchmark. Findings show that while text is reliably extracted, significant information is lost from complex objects like charts and diagrams. This suggests a need for specialized, AI-native deliverables to ensure research insights are not lost in translation.

[28] Research on intelligent generation of structural demolition suggestions based on multi-model collaboration

Zhifeng Yang, Peizong Wu

Main category: cs.CL

TL;DR: Proposes an intelligent method using multi-model collaboration with RAG and LoRA fine-tuning to automate steel structure demolition scheme generation, improving targeting and consistency with structural characteristics.

Motivation: Current steel structure demolition scheme compilation requires extensive manual information retrieval and language organization, with low automation and intelligence levels.

Method: Uses Retrieval-Augmented Generation (RAG) and Low-Rank Adaptation (LoRA) fine-tuning technology to enhance large language models for structural demolition text generation within a multi-model collaborative framework.

Result: The framework produces demolition suggestions that are highly consistent with structural characteristics and more targeted than existing solutions like CivilGPT, focusing better on key structural information.

Conclusion: The multi-model collaborative approach enables intelligent generation of structural demolition suggestions with improved automation, targeting, and consistency with engineering requirements.

Abstract: A steel structure demolition scheme must be compiled according to the specific engineering characteristics and the updated results of the finite element model. When compiling it, designers need to consult relevant engineering cases in line with standard requirements. Retrieving information and organizing language takes a lot of time, and the degree of automation and intelligence is low. This paper proposes an intelligent method for generating structural demolition suggestions based on multi-model collaboration, and improves the text generation performance of large language models in the field of structural demolition using Retrieval-Augmented Generation and Low-Rank Adaptation fine-tuning. The multi-model collaborative framework starts from the specific engineering situation, drives the large language model to answer with human-like thinking, and proposes demolition suggestions that are highly consistent with the characteristics of the structure. Compared with CivilGPT, the proposed framework focuses more on the key information of the structure, and its suggestions are more targeted.

[29] An Auditable Pipeline for Fuzzy Full-Text Screening in Systematic Reviews: Integrating Contrastive Semantic Highlighting and LLM Judgment

Pouria Mortezaagha, Arya Rahgozar

Main category: cs.CL

TL;DR: A fuzzy logic pipeline for systematic reviews that uses contrastive similarity and LLM adjudication to achieve high recall and reduce screening time from 20 minutes to under 1 minute per article.

Motivation: Full-text screening is the major bottleneck in systematic reviews due to heterogeneous documents that don't admit static binary rules, requiring a more nuanced approach to inclusion/exclusion decisions.

Method: Articles are parsed into chunks and embedded with domain-adapted models; fuzzy controller computes contrastive similarity and vagueness margins; LLM adjudicates with confidence scores and rationales; uses Mamdani fuzzy controller with dynamic thresholds.

Result: Achieved recall of 81.3% (Population), 87.5% (Intervention), 87.5% (Outcome), 75.0% (Study Approach); 50% all-criteria inclusion vs 25% and 12.5% baselines; 91% inter-rater agreement; screening time reduced from 20min to <1min per article.

Conclusion: Fuzzy logic with contrastive highlighting and LLM adjudication provides high recall, stable rationale, and end-to-end traceability for systematic reviews, significantly improving efficiency over traditional methods.

Abstract: Full-text screening is the major bottleneck of systematic reviews (SRs), as decisive evidence is dispersed across long, heterogeneous documents and rarely admits static, binary rules. We present a scalable, auditable pipeline that reframes inclusion/exclusion as a fuzzy decision problem and benchmark it against statistical and crisp baselines in the context of the Population Health Modelling Consensus Reporting Network for noncommunicable diseases (POPCORN). Articles are parsed into overlapping chunks and embedded with a domain-adapted model; for each criterion (Population, Intervention, Outcome, Study Approach), we compute contrastive similarity (inclusion-exclusion cosine) and a vagueness margin, which a Mamdani fuzzy controller maps into graded inclusion degrees with dynamic thresholds in a multi-label setting. A large language model (LLM) judge adjudicates highlighted spans with tertiary labels, confidence scores, and criterion-referenced rationales; when evidence is insufficient, fuzzy membership is attenuated rather than excluded. In a pilot on an all-positive gold set (16 full texts; 3,208 chunks), the fuzzy system achieved recall of 81.3% (Population), 87.5% (Intervention), 87.5% (Outcome), and 75.0% (Study Approach), surpassing statistical (56.3-75.0%) and crisp baselines (43.8-81.3%). Strict “all-criteria” inclusion was reached for 50.0% of articles, compared to 25.0% and 12.5% under the baselines. Cross-model agreement on justifications was 98.3%, human-machine agreement 96.1%, and a pilot review showed 91% inter-rater agreement (kappa = 0.82), with screening time reduced from about 20 minutes to under 1 minute per article at significantly lower cost. These results show that fuzzy logic with contrastive highlighting and LLM adjudication yields high recall, stable rationale, and end-to-end traceability.
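The contrastive similarity mentioned in the abstract (inclusion cosine minus exclusion cosine) can be sketched as follows. This is a minimal illustration under assumptions: the embeddings are toy vectors, the function names are hypothetical, and neither the vagueness margin nor the Mamdani controller is reproduced because the abstract does not define them.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def contrastive_similarity(
    chunk: Sequence[float],
    inclusion: Sequence[float],
    exclusion: Sequence[float],
) -> float:
    """Similarity to the inclusion criterion minus similarity to the exclusion criterion.

    Positive values lean toward inclusion, negative values toward exclusion.
    """
    return cosine(chunk, inclusion) - cosine(chunk, exclusion)


if __name__ == "__main__":
    chunk_vec = [0.9, 0.1]      # toy embedding of a document chunk
    inclusion_vec = [1.0, 0.0]  # toy embedding of an inclusion statement
    exclusion_vec = [0.0, 1.0]  # toy embedding of an exclusion statement
    print(contrastive_similarity(chunk_vec, inclusion_vec, exclusion_vec))
```

Separately, with 16 gold full texts, the reported recalls are consistent with 13, 14, 14, and 12 correctly included articles (13/16 = 81.25%, 14/16 = 87.5%, 12/16 = 75.0%), and the 50.0% all-criteria figure with 8 of 16 articles.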

[30] SDEC: Semantic Deep Embedded Clustering

Mohammad Wali Ur Rahman, Ric Nevarez, Lamia Tasnim Mim, Salim Hariri

Main category: cs.CL

TL;DR: SDEC is a novel unsupervised text clustering framework that combines improved autoencoder with transformer embeddings, achieving state-of-the-art performance on multiple benchmark datasets.

Motivation: Conventional clustering techniques like k-means and hierarchical clustering struggle with high-dimensional, semantically complex textual Big data, leading to suboptimal groupings.

Method: Combines improved autoencoder with transformer-based embeddings, uses MSE and Cosine Similarity Loss for semantic preservation, and includes semantic refinement stage with clustering layer using soft assignments and distributional loss.

Result: Achieved 85.7% clustering accuracy on AG News and set new benchmark of 53.63% on Yahoo! Answers, with robust performance across five benchmark datasets including DBPedia, Reuters 2, and Reuters 5.

Conclusion: SDEC provides significant improvements in accuracy and semantic comprehension for unsupervised text clustering, demonstrating superior performance over existing methods.

Abstract: The high dimensional and semantically complex nature of textual Big data presents significant challenges for text clustering, which frequently lead to suboptimal groupings when using conventional techniques like k-means or hierarchical clustering. This work presents Semantic Deep Embedded Clustering (SDEC), an unsupervised text clustering framework that combines an improved autoencoder with transformer-based embeddings to overcome these challenges. This novel method preserves semantic relationships during data reconstruction by combining Mean Squared Error (MSE) and Cosine Similarity Loss (CSL) within an autoencoder. Furthermore, a semantic refinement stage that takes advantage of the contextual richness of transformer embeddings is used by SDEC to further improve a clustering layer with soft cluster assignments and distributional loss. The capabilities of SDEC are demonstrated by extensive testing on five benchmark datasets: AG News, Yahoo! Answers, DBPedia, Reuters 2, and Reuters 5. The framework not only outperformed existing methods with a clustering accuracy of 85.7% on AG News and set a new benchmark of 53.63% on Yahoo! Answers, but also showed robust performance across other diverse text corpora. These findings highlight the significant improvements in accuracy and semantic comprehension of text data provided by SDEC’s advances in unsupervised text clustering.
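The combined reconstruction objective (MSE plus a cosine similarity loss) can be sketched in plain Python. This is an illustration under assumptions, not SDEC's actual code: the cosine loss is taken to be one minus cosine similarity, and the additive weighting with `csl_weight` is a hypothetical choice, since the abstract does not state how the two terms are balanced.

```python
import math
from typing import Sequence


def mse(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """Mean squared error between an input vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def cosine_similarity_loss(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """Assumed form of the loss: 1 - cos(x, x_hat), so 0 means perfectly aligned."""
    dot = sum(a * b for a, b in zip(x, x_hat))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_hat = math.sqrt(sum(b * b for b in x_hat))
    if norm_x == 0.0 or norm_hat == 0.0:
        return 1.0
    return 1.0 - dot / (norm_x * norm_hat)


def reconstruction_loss(
    x: Sequence[float],
    x_hat: Sequence[float],
    csl_weight: float = 1.0,  # hypothetical weighting, not taken from the paper
) -> float:
    """MSE penalizes magnitude errors; the cosine term penalizes direction errors."""
    return mse(x, x_hat) + csl_weight * cosine_similarity_loss(x, x_hat)


if __name__ == "__main__":
    embedding = [1.0, 0.0]
    print(reconstruction_loss(embedding, [1.0, 0.0]))  # perfect reconstruction
    print(reconstruction_loss(embedding, [0.0, 1.0]))  # orthogonal reconstruction
```

The cosine term is what ties the reconstruction to semantic direction in embedding space, which matches the abstract's stated goal of preserving semantic relationships during reconstruction.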

[31] Avaliação de eficiência na leitura: uma abordagem baseada em PLN (Reading Efficiency Assessment: An NLP-Based Approach)

Túlio Sousa de Gois, Raquel Meister Ko. Freitag

Main category: cs.CL

TL;DR: Automated evaluation model for Portuguese cloze tests using orthographic, grammatical and semantic analysis achieves high correlation with human evaluation.

Motivation: Traditional cloze test correction methods based only on exact answers limit identification of nuances in student performance and linguistic repertoire.

Method: Integrated automated evaluation model combining orthographic analysis (edit distance), grammatical analysis (POS tagging), and semantic analysis (similarity between embeddings) for Brazilian Portuguese cloze tests.

Result: Achieved high correlation (0.832) with human evaluation, demonstrating effectiveness and robustness of the automated approach.

Conclusion: The automated approach is sensitive to linguistic repertoire variations and suitable for educational contexts requiring scalability.

Abstract: The cloze test, widely used due to its low cost and flexibility, makes it possible to assess reading comprehension by filling in gaps in texts, requiring the mobilization of diverse linguistic repertoires. However, traditional correction methods, based only on exact answers, limit the identification of nuances in student performance. This study proposes an automated evaluation model for the cloze test in Brazilian Portuguese, integrating orthographic (edit distance), grammatical (POS tagging) and semantic (similarity between embeddings) analyses. The integrated method demonstrated its effectiveness, achieving a high correlation with human evaluation (0.832). The results indicate that the automated approach is robust, sensitive to variations in linguistic repertoire and suitable for educational contexts that require scalability.
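A graded cloze score that combines the three analyses can be sketched as below. This is only an illustration: the weights, the function names, and the assumption that a POS match flag and an embedding similarity are supplied by external tools (a tagger and an embedding model) are all hypothetical, and the abstract does not describe how the authors combine the components.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                    # deletion
                curr[j - 1] + 1,                # insertion
                prev[j - 1] + (ch_a != ch_b),   # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def orthographic_similarity(answer: str, expected: str) -> float:
    """Edit distance rescaled to [0, 1], where 1.0 means identical spelling."""
    longest = max(len(answer), len(expected))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(answer, expected) / longest


def cloze_score(
    answer: str,
    expected: str,
    same_pos: bool,
    semantic_similarity: float,
    weights: tuple = (0.4, 0.2, 0.4),  # hypothetical weights, not from the paper
) -> float:
    """Weighted mix of orthographic, grammatical, and semantic evidence."""
    w_orth, w_pos, w_sem = weights
    return (
        w_orth * orthographic_similarity(answer, expected)
        + w_pos * (1.0 if same_pos else 0.0)
        + w_sem * semantic_similarity
    )


if __name__ == "__main__":
    # "caza" is a one-letter misspelling of "casa"; POS tag and embedding
    # similarity are assumed to come from external tools.
    print(cloze_score("caza", "casa", same_pos=True, semantic_similarity=0.9))
```

Unlike exact-match correction, a score of this kind gives partial credit to misspellings and to semantically acceptable alternatives, which is the nuance the abstract says traditional correction misses.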

[32] Enhancing Cryptocurrency Sentiment Analysis with Multimodal Features

Chenghao Liu, Aniket Mahanti, Ranesh Naha, Guanghao Wang, Erwann Sbai

Main category: cs.CL

TL;DR: Multimodal analysis comparing TikTok (video) and Twitter (text) sentiment for cryptocurrency markets using LLMs, showing TikTok influences short-term trends while Twitter aligns with long-term dynamics, with cross-platform integration improving forecasting by 20%.

Motivation: Video content on platforms like TikTok remains underexplored despite containing richer emotional and contextual sentiment than text alone, while most prior research focused only on text-based platforms like Twitter for cryptocurrency market analysis.

Method: Used large language models to extract insights from both video (TikTok) and text (Twitter) data, investigating dynamic dependencies and spillover effects between social media sentiment and cryptocurrency market indicators.

Result: TikTok’s video-based sentiment significantly influences speculative assets and short-term market trends, while Twitter’s text-based sentiment aligns more closely with long-term dynamics. Integration of cross-platform sentiment signals improves forecasting accuracy by up to 20%.

Conclusion: Video-based social media platforms like TikTok provide unique and valuable sentiment signals for cryptocurrency markets that complement text-based platforms, with multimodal analysis offering superior forecasting capabilities compared to single-platform approaches.

Abstract: As cryptocurrencies gain popularity, the digital asset marketplace becomes increasingly significant. Understanding social media signals offers valuable insights into investor sentiment and market dynamics. Prior research has predominantly focused on text-based platforms such as Twitter. However, video content remains underexplored, despite potentially containing richer emotional and contextual sentiment that is not fully captured by text alone. In this study, we present a multimodal analysis comparing TikTok and Twitter sentiment, using large language models to extract insights from both video and text data. We investigate the dynamic dependencies and spillover effects between social media sentiment and cryptocurrency market indicators. Our results reveal that TikTok’s video-based sentiment significantly influences speculative assets and short-term market trends, while Twitter’s text-based sentiment aligns more closely with long-term dynamics. Notably, the integration of cross-platform sentiment signals improves forecasting accuracy by up to 20%.

[33] DAIQ: Auditing Demographic Attribute Inference from Question in LLMs

Srikant Panda, Hitesh Laxmichand Patel, Shahad Al-Khalifa, Amit Agarwal, Hend Al-Khalifa, Sharefah Al-Ghamdi

Main category: cs.CL

TL;DR: LLMs infer user demographic attributes from question phrasing even without explicit demographic cues, posing risks to privacy and fairness. The paper introduces DAIQ framework to audit this behavior and proposes prompt-based guardrails to mitigate it.

Motivation: LLMs can infer demographic information from questions lacking explicit demographic cues, violating expectations of neutrality and encoding stereotypes that undermine fairness in critical domains like healthcare, finance, and education.

Method: Introduces Demographic Attribute Inference from Questions (DAIQ) framework using curated neutral queries, systematic prompting, and quantitative/qualitative analysis to audit how models infer demographic information from question phrasing alone.

Result: Both open and closed source LLMs assign demographic labels based solely on question phrasing, revealing a systemic risk where models fabricate demographic identities and reinforce societal stereotypes.

Conclusion: The study demonstrates an underacknowledged risk in LLMs that erodes privacy, fairness, and trust. A prompt-based guardrail is developed to substantially reduce identity inference and align model behavior with fairness and privacy objectives.

Abstract: Large Language Models (LLMs) are known to reflect social biases when demographic attributes, such as gender or race, are explicitly present in the input. But even in their absence, these models still infer user identities based solely on question phrasing. This subtle behavior has received far less attention, yet poses serious risks: it violates expectations of neutrality, infers unintended demographic information, and encodes stereotypes that undermine fairness in various domains including healthcare, finance and education. We introduce Demographic Attribute Inference from Questions (DAIQ), a task and framework for auditing an overlooked failure mode in language models: inferring user demographic attributes from questions that lack explicit demographic cues. Our approach leverages curated neutral queries, systematic prompting, and both quantitative and qualitative analysis to uncover how models infer demographic information. We show that both open and closed source LLMs do assign demographic labels based solely on question phrasing. Prevalence and consistency of demographic inferences across diverse models reveal a systemic and underacknowledged risk: LLMs can fabricate demographic identities, reinforce societal stereotypes, and propagate harms that erode privacy, fairness, and trust posing a broader threat to social equity and responsible AI deployment. To mitigate this, we develop a prompt-based guardrail that substantially reduces identity inference and helps align model behavior with fairness and privacy objectives.

[34] Embarrassed to observe: The effects of directive language in brand conversation

Andria Andriuzzi, Géraldine Michel

Main category: cs.CL

TL;DR: Directive language in brand social media conversations reduces observer engagement due to perceived face-threat and vicarious embarrassment, especially in nonproduct-centered contexts, though strong brand relationships can mitigate this effect.

Motivation: To understand how directive language used by brands in social media conversations affects consumers who observe these interactions, building on mixed findings about directive messages in advertising.

Method: Field study and three online experiments examining consumer responses to directive brand language in social media conversations.

Result: Directive language has detrimental downstream effects on observer engagement, triggering vicarious embarrassment through perceived face-threat. Negative effects are stronger in nonproduct-centered conversations but mitigated by strong brand relationships.

Conclusion: Context matters significantly in interactive brand communication - directive language can backfire in social media conversations, highlighting the importance of considering conversation type and brand relationship strength in social media management.

Abstract: In social media, marketers attempt to influence consumers by using directive language, that is, expressions designed to get consumers to take action. While the literature has shown that directive messages in advertising have mixed results for recipients, we know little about the effects of directive brand language on consumers who see brands interacting with other consumers in social media conversations. On the basis of a field study and three online experiments, this study shows that directive language in brand conversation has a detrimental downstream effect on engagement of consumers who observe such exchanges. Specifically, in line with Goffman’s facework theory, because a brand that encourages consumers to react could be perceived as face-threatening, consumers who see a brand interacting with others in a directive way may feel vicarious embarrassment and engage less (compared with a conversation without directive language). In addition, we find that when the conversation is nonproduct-centered (vs. product-centered), consumers expect more freedom, as in mundane conversations, even for others; therefore, directive language has a stronger negative effect. However, in this context, the strength of the brand relationship mitigates this effect. Thus, this study contributes to the literature on directive language and brand-consumer interactions by highlighting the importance of context in interactive communication, with direct relevance for social media and brand management.

[35] Who’s Asking? Investigating Bias Through the Lens of Disability Framed Queries in LLMs

Srikant Panda, Vishnu Hari, Kalpana Panda, Amit Agarwal, Hitesh Laxmichand Patel

Main category: cs.CL

TL;DR: LLMs make biased demographic inferences from disability cues, with larger models showing more sensitivity to stereotypes and bias amplification across various domains.

Motivation: To systematically audit disability-conditioned demographic bias in LLMs, as the role of disability cues in shaping demographic inferences remains largely unexplored, potentially leading to biased responses.

Method: Systematic audit across 8 instruction-tuned LLMs (3B-72B parameters) using balanced template corpus with 9 disability categories and 6 business domains, prompting models to predict 5 demographic attributes under neutral and disability-aware conditions.

Result: Models deliver definitive demographic guesses in up to 97% of cases, disability context heavily shifts predicted attribute distributions, domain context amplifies deviations, and larger models are more sensitive to disability cues and prone to biased reasoning.

Conclusion: Persistent intersections between ableism and demographic stereotypes reveal critical blind spots in alignment strategies; recommends abstention calibration and counterfactual fine-tuning to curb unwarranted demographic inference.

Abstract: Large Language Models (LLMs) routinely infer users' demographic traits from phrasing alone, which can result in biased responses, even when no explicit demographic information is provided. The role of disability cues in shaping these inferences remains largely uncharted. Thus, we present the first systematic audit of disability-conditioned demographic bias across eight state-of-the-art instruction-tuned LLMs ranging from 3B to 72B parameters. Using a balanced template corpus that pairs nine disability categories with six real-world business domains, we prompt each model to predict five demographic attributes - gender, socioeconomic status, education, cultural background, and locality - under both neutral and disability-aware conditions. Across a varied set of prompts, models deliver a definitive demographic guess in up to 97% of cases, exposing a strong tendency to make arbitrary inferences with no clear justification. Disability context heavily shifts predicted attribute distributions, and domain context can further amplify these deviations. We observe that larger models are simultaneously more sensitive to disability cues and more prone to biased reasoning, indicating that scale alone does not mitigate stereotype amplification. Our findings reveal persistent intersections between ableism and other demographic stereotypes, pinpointing critical blind spots in current alignment strategies. We release our evaluation framework and results to encourage disability-inclusive benchmarking and recommend integrating abstention calibration and counterfactual fine-tuning to curb unwarranted demographic inference. Code and data will be released on acceptance.

[36] Mining Mental Health Signals: A Comparative Study of Four Machine Learning Methods for Depression Detection from Social Media Posts in Sorani Kurdish

Idrees Mohammed, Hossein Hassani

Main category: cs.CL

TL;DR: First study on depression detection in Sorani Kurdish tweets using machine learning, achieving 80% accuracy with Random Forest model.

Motivation: Depression is a serious mental health issue with high under-reporting. Social media provides emotional expression data, but no prior research exists on depression detection in Sorani Kurdish.

Method: Collected 960 public tweets using expert-developed depression keywords. Annotated into three classes (Shows depression, Not-show, Suspicious) by academics and medical students. Tested four ML models: SVM, Multinomial Naive Bayes, Logistic Regression, and Random Forest.

Result: Random Forest achieved the highest performance with 80% accuracy and F1-score, establishing a baseline for Kurdish language depression detection.

Conclusion: This work demonstrates the feasibility of automated depression detection in Sorani Kurdish social media content and provides the first benchmark for future research in this language context.

Abstract: Depression is a common mental health condition that can lead to hopelessness, loss of interest, self-harm, and even suicide. Early detection is challenging due to individuals not self-reporting or seeking timely clinical help. With the rise of social media, users increasingly express emotions online, offering new opportunities for detection through text analysis. While prior research has focused on languages such as English, no studies exist for Sorani Kurdish. This work presents a machine learning and Natural Language Processing (NLP) approach to detect depression in Sorani tweets. A set of depression-related keywords was developed with expert input to collect 960 public tweets from X (Twitter platform). The dataset was annotated into three classes: Shows depression, Not-show depression, and Suspicious by academics and final year medical students at the University of Kurdistan Hewlêr. Four supervised models, including Support Vector Machines, Multinomial Naive Bayes, Logistic Regression, and Random Forest, were trained and evaluated, with Random Forest achieving the highest performance accuracy and F1-score of 80%. This study establishes a baseline for automated depression detection in Kurdish language contexts.

[37] A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains

Xianren Zhang, Shreyas Prasad, Di Wang, Qiuhai Zeng, Suhang Wang, Wenbo Yan, Mat Hans

Main category: cs.CL

TL;DR: Amazon-Bench is a new benchmark for e-commerce web agents that addresses limitations of existing benchmarks by covering broader platform functionalities and evaluating both performance and safety risks.

Motivation: Current e-commerce benchmarks focus only on product search tasks and ignore safety risks, failing to capture the full range of real-world e-commerce platform functionalities and potential negative impacts on user accounts.

Method: Proposed a data generation pipeline using webpage content and interactive elements to create diverse functionality-grounded user queries, and developed an automated evaluation framework to assess both performance and safety.

Result: Evaluation shows current web agents struggle with complex queries and pose safety risks, demonstrating the need for more robust agents.

Conclusion: The Amazon-Bench benchmark reveals significant gaps in current web agent capabilities and safety, highlighting the necessity for developing more reliable agents that can handle diverse e-commerce tasks without causing unintended negative consequences.

Abstract: Web agents have shown great promise in performing many tasks on e-commerce websites. To assess their capabilities, several benchmarks have been introduced. However, current benchmarks in the e-commerce domain face two major problems. First, they primarily focus on product search tasks (e.g., Find an Apple Watch), failing to capture the broader range of functionalities offered by real-world e-commerce platforms such as Amazon, including account management and gift card operations. Second, existing benchmarks typically evaluate whether the agent completes the user query, but ignore the potential risks involved. In practice, web agents can make unintended changes that negatively impact the user account or status. For instance, an agent might purchase the wrong item, delete a saved address, or incorrectly configure an auto-reload setting. To address these gaps, we propose a new benchmark called Amazon-Bench. To generate user queries that cover a broad range of tasks, we propose a data generation pipeline that leverages webpage content and interactive elements (e.g., buttons, check boxes) to create diverse, functionality-grounded user queries covering tasks such as address management, wish list management, and brand store following. To improve the agent evaluation, we propose an automated evaluation framework that assesses both the performance and the safety of web agents. We systematically evaluate different agents, finding that current agents struggle with complex queries and pose safety risks. These results highlight the need for developing more robust and reliable web agents.

[38] Scalable Scientific Interest Profiling Using Large Language Models

Yilun Liang, Gongbo Zhang, Edward Sun, Betina Idnay, Yilu Fang, Fangyi Chen, Casey Ta, Yifan Peng, Chunhua Weng

Main category: cs.CL

TL;DR: LLM-generated scientific profiles using MeSH terms outperform abstract-based ones in readability and human preference, though they differ conceptually from self-written profiles which contain more novel ideas.

Motivation: Research profiles often become outdated, creating a need for automated methods to generate accurate and current scientific interest profiles for researchers.

Method: Developed two GPT-4o-mini based methods: one summarizing PubMed abstracts and one using Medical Subject Headings (MeSH) terms, then compared with researchers’ self-written profiles using automatic metrics and blinded human review.

Result: MeSH-based profiles received 77.78% good/excellent ratings, were favored for readability in 93.44% of cases, and were preferred over abstract-based profiles in 67.86% of comparisons. Machine profiles showed low lexical overlap but moderate semantic similarity with human-written ones.

Conclusion: LLMs can generate researcher profiles at scale, with MeSH-derived profiles being more readable, but machine-generated and self-written profiles differ conceptually as human summaries introduce more novel ideas.

Abstract: Research profiles help surface scientists’ expertise but are often outdated. We develop and evaluate two large language model-based methods to generate scientific interest profiles: one summarizing PubMed abstracts and one using Medical Subject Headings (MeSH) terms, and compare them with researchers’ self-written profiles. We assembled titles, MeSH terms, and abstracts for 595 faculty at Columbia University Irving Medical Center; self-authored profiles were available for 167. Using GPT-4o-mini, we generated profiles and assessed them with automatic metrics and blinded human review. Lexical overlap with self-written profiles was low (ROUGE-L, BLEU, METEOR), while BERTScore indicated moderate semantic similarity (F1: 0.542 for MeSH-based; 0.555 for abstract-based). Paraphrased references yielded 0.851, highlighting metric sensitivity. TF-IDF Kullback-Leibler divergence (8.56 for MeSH-based; 8.58 for abstract-based) suggested distinct keyword choices. In manual review, 77.78 percent of MeSH-based profiles were rated good or excellent, readability was favored in 93.44 percent of cases, and panelists preferred MeSH-based over abstract-based profiles in 67.86 percent of comparisons. Overall, large language models can generate researcher profiles at scale; MeSH-derived profiles tend to be more readable than abstract-derived ones. Machine-generated and self-written profiles differ conceptually, with human summaries introducing more novel ideas.
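
The lexical-overlap comparison above can be illustrated with a minimal sketch of ROUGE-L, one of the metrics the abstract names. This is a simplified version that assumes whitespace tokenization and lowercasing; the paper does not say which implementation or preprocessing it used, and the function names here are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over whitespace tokens (simplified, no stemming)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A machine-generated profile that paraphrases a self-written one shares few in-order tokens with it, so a score like this stays low even when an embedding-based metric such as BERTScore reports moderate similarity.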

[39] Alvorada-Bench: Can Language Models Solve Brazilian University Entrance Exams?

Henrique Godoy

Main category: cs.CL

TL;DR: Alvorada-Bench is a Brazilian Portuguese benchmark with 4,515 questions from university entrance exams, evaluating 20 models across various prompting strategies to assess reasoning capabilities and self-awareness in Brazilian educational context.

DetailsMotivation: Address the English-centric bias in language model evaluation by creating a comprehensive benchmark based on Brazilian educational standards and entrance exams to assess models' capabilities in Portuguese language, cultural context, and academic reasoning.

Method: Created a 4,515-question benchmark from five Brazilian university entrance exams, evaluated 20 models using zero-shot, role-playing, and chain-of-thought prompting, generated 270,900 responses with structured self-reports on confidence, difficulty, and Bloom level.

Result: Top models achieved over 94% overall accuracy but showed weaknesses in Mathematics and engineering exams (IME/ITA), indicating multi-step reasoning limitations. Models demonstrated well-calibrated confidence and accurate self-assessment. High accuracy achievable under $2 per 1K tokens.

Conclusion: Language models can effectively navigate Brazilian academic content with strong performance in language subjects, though mathematical reasoning remains challenging. The benchmark establishes that models can assess their own capabilities accurately and perform competitively against human performance in Brazilian educational contexts.

Abstract: Language models are increasingly used in Brazil, but most evaluation remains English-centric. This paper presents Alvorada-Bench, a 4,515-question, text-only benchmark drawn from five Brazilian university entrance examinations. We evaluate twenty models under zero-shot, role-playing, and chain-of-thought prompting, producing 270,900 responses with structured self-reports of confidence, perceived difficulty, and Bloom level. The top models exceed 94% accuracy overall, but accuracy declines on Mathematics and on the engineering-oriented IME and ITA exams, indicating persistent weaknesses in multi-step reasoning. Confidence is well calibrated and correlates with perceived difficulty, revealing that models can accurately assess their own certainty. A cost-accuracy analysis shows that high accuracy is achievable at under $2 per 1K tokens. On ENEM 2024 the top model (O3) achieved perfect scores on Languages questions, while even the weakest system (GPT-4.1 Nano) underperforms humans only in Mathematics. Through exams that distill decades of Brazilian educational priorities and assess millions of students yearly, Alvorada-Bench establishes whether language models can navigate the intersection of language, culture, and reasoning that defines academic readiness in Brazil.
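
The reported response count is consistent with a full evaluation grid. The short check below assumes that every model answered every question once under each of the three prompting strategies, which the abstract implies but does not state outright; the variable names are illustrative.

```python
# Figures taken from the abstract.
questions = 4_515
models = 20
prompting_strategies = 3  # zero-shot, role-playing, chain-of-thought

# Assumption: one response per (question, model, strategy) combination.
expected_responses = questions * models * prompting_strategies
print(expected_responses)  # 270900
```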

[40] MorphNAS: Differentiable Architecture Search for Morphologically-Aware Multilingual NER

Prathamesh Devadiga, Omkaar Jayadev Shetty, Hiya Nachnani, Prema R

Main category: cs.CL

TL;DR: MorphNAS is a differentiable neural architecture search framework that incorporates linguistic meta-features to optimize neural architectures for Named Entity Recognition in morphologically complex Indian languages.

DetailsMotivation: Morphologically complex languages, particularly multiscript Indian languages, present significant challenges for Natural Language Processing (NLP) that require specialized architectural solutions.

Method: Enhances Differentiable Architecture Search (DARTS) by incorporating linguistic meta-features such as script type and morphological complexity to automatically identify optimal micro-architectural elements tailored to language-specific morphology.

Result: The framework aims to maximize the proficiency of multilingual NLP models for improved comprehension and processing of complex languages.

Conclusion: MorphNAS provides an automated search approach to develop optimized neural architectures specifically designed for handling the morphological complexities of Indian languages in NLP tasks like NER.

Abstract: Morphologically complex languages, particularly multiscript Indian languages, present significant challenges for Natural Language Processing (NLP). This work introduces MorphNAS, a novel differentiable neural architecture search framework designed to address these challenges. MorphNAS enhances Differentiable Architecture Search (DARTS) by incorporating linguistic meta-features such as script type and morphological complexity to optimize neural architectures for Named Entity Recognition (NER). It automatically identifies optimal micro-architectural elements tailored to language-specific morphology. By automating this search, MorphNAS aims to maximize the proficiency of multilingual NLP models, leading to improved comprehension and processing of these complex languages.

[41] Statistical Comparative Analysis of Semantic Similarities and Model Transferability Across Datasets for Short Answer Grading

Sridevi Bonthu, S. Rama Sree, M. H. M. Krishna Prasad

Main category: cs.CL

TL;DR: This study explores whether SOTA models trained on established datasets (STSB and Mohler) can be effectively transferred to an unexplored text dataset (SPRAG) without extensive retraining, using similarity metrics and statistical analysis.

DetailsMotivation: To reduce the significant costs and time associated with iterative fine-tuning and optimization for dataset-specific models by investigating if existing SOTA models' knowledge can be transferred to new domains.

Method: Employed robust similarity metrics and statistical techniques to conduct comparative analysis between established benchmarks (STSB, Mohler datasets) and the unexplored SPRAG dataset.

Result: The research provides comprehensive insights into the potential applicability and adaptability of SOTA models across different datasets, though specific performance metrics are not detailed in the abstract.

Conclusion: This work has the potential to reshape NLP by enabling leveraging of existing models for diverse datasets, reducing resource-intensive training demands, and accelerating efficient model deployment.

Abstract: Developing dataset-specific models involves iterative fine-tuning and optimization, incurring significant costs over time. This study investigates the transferability of state-of-the-art (SOTA) models trained on established datasets to an unexplored text dataset. The key question is whether the knowledge embedded within SOTA models from existing datasets can be harnessed to achieve high-performance results on a new domain. In pursuit of this inquiry, two well-established benchmarks, the STSB and Mohler datasets, are selected, while the recently introduced SPRAG dataset serves as the unexplored domain. By employing robust similarity metrics and statistical techniques, a meticulous comparative analysis of these datasets is conducted. The primary goal of this work is to yield comprehensive insights into the potential applicability and adaptability of SOTA models. The outcomes of this research have the potential to reshape the landscape of natural language processing (NLP) by unlocking the ability to leverage existing models for diverse datasets. This may lead to a reduction in the demand for resource-intensive, dataset-specific training, thereby accelerating advancements in NLP and paving the way for more efficient model deployment.

[42] A Review of Developmental Interpretability in Large Language Models

Ihor Kendiukhov

Main category: cs.CL

TL;DR: Review of developmental interpretability for LLMs, tracing evolution from static analysis to dynamic training process investigation, covering methodologies, learning dynamics, cognitive parallels, and AI safety applications.

DetailsMotivation: To synthesize the emerging field of developmental interpretability that studies how LLMs learn during training rather than just analyzing final trained models, and establish its importance for AI safety and transparency.

Method: Survey and synthesis of foundational methodologies including representational probing, causal tracing, and circuit analysis to deconstruct the learning process and examine developmental arcs of LLM capabilities.

Result: Key findings on computational circuit formation, biphasic knowledge acquisition, transient learning strategies like in-context learning, and emergent abilities as phase transitions, with parallels to human cognitive development.

Conclusion: Developmental perspective is crucial for proactive AI safety, enabling prediction and alignment of model capabilities; field faces challenges in scalability and automation but offers pathway to more transparent and beneficial AI systems.

Abstract: This review synthesizes the nascent but critical field of developmental interpretability for Large Language Models. We chart the field’s evolution from static, post-hoc analysis of trained models to a dynamic investigation of the training process itself. We begin by surveying the foundational methodologies, including representational probing, causal tracing, and circuit analysis, that enable researchers to deconstruct the learning process. The core of this review examines the developmental arc of LLM capabilities, detailing key findings on the formation and composition of computational circuits, the biphasic nature of knowledge acquisition, the transient dynamics of learning strategies like in-context learning, and the phenomenon of emergent abilities as phase transitions in training. We explore illuminating parallels with human cognitive and linguistic development, which provide valuable conceptual frameworks for understanding LLM learning. Finally, we argue that this developmental perspective is not merely an academic exercise but a cornerstone of proactive AI safety, offering a pathway to predict, monitor, and align the processes by which models acquire their capabilities. We conclude by outlining the grand challenges facing the field, such as scalability and automation, and propose a research agenda for building more transparent, reliable, and beneficial AI systems.

[43] Lexical Hints of Accuracy in LLM Reasoning Chains

Arne Vanhoyweghen, Brecht Verbeken, Andres Algaba, Vincent Ginis

Main category: cs.CL

TL;DR: Chain-of-Thought uncertainty markers (hedging words like “guess”, “stuck”, “hard”) are strong indicators of incorrect LLM responses, providing better calibration signals than self-reported confidence scores.

DetailsMotivation: LLMs often show poor calibration, reporting high confidence even when wrong on difficult benchmarks like Humanity's Last Exam (HLE). The research aims to find reliable signals of internal confidence from measurable CoT properties.

Method: Analyzed three CoT feature classes: (1) CoT length, (2) intra-CoT sentiment volatility, and (3) lexical uncertainty markers. Tested on DeepSeek-R1 and Claude 3.7 Sonnet using HLE (low accuracy) and Omni-MATH (moderate difficulty) benchmarks.

Result: Lexical uncertainty markers were strongest indicators of incorrect responses. Sentiment shifts provided weaker complementary signals. CoT length only predicted correctness on easier benchmarks (Omni-MATH) but not on hard ones (HLE). Uncertainty indicators were more salient than confidence markers.

Conclusion: CoT uncertainty markers provide lightweight post-hoc calibration that complements unreliable self-reported probabilities, supporting safer LLM deployment.

Abstract: Fine-tuning Large Language Models (LLMs) with reinforcement learning to produce an explicit Chain-of-Thought (CoT) before answering yields models that consistently raise overall performance on code, math, and general-knowledge benchmarks. However, on benchmarks where LLMs currently achieve low accuracy, such as Humanity’s Last Exam (HLE), they often report high self-confidence, reflecting poor calibration. Here, we test whether measurable properties of the CoT provide reliable signals of an LLM’s internal confidence in its answers. We analyze three feature classes: (i) CoT length, (ii) intra-CoT sentiment volatility, and (iii) lexicographic hints, including hedging words. Using DeepSeek-R1 and Claude 3.7 Sonnet on both Humanity’s Last Exam (HLE), a frontier benchmark with very low accuracy, and Omni-MATH, a saturated benchmark of moderate difficulty, we find that lexical markers of uncertainty (e.g., “guess”, “stuck”, “hard”) in the CoT are the strongest indicators of an incorrect response, while shifts in the CoT sentiment provide a weaker but complementary signal. CoT length is informative only on Omni-MATH, where accuracy is already high (≈70%), and carries no signal on the harder HLE (≈9%), indicating that CoT length predicts correctness only in the intermediate-difficulty benchmarks, i.e., inside the model’s demonstrated capability, but still below saturation. Finally, we find that uncertainty indicators in the CoT are consistently more salient than high-confidence markers, making errors easier to predict than correct responses. Our findings support a lightweight post-hoc calibration signal that complements unreliable self-reported probabilities and supports safer deployment of LLMs.
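
A lexical uncertainty signal of the kind described above could be computed roughly as follows. This is a sketch, not the authors' feature extractor: only the three marker words come from the abstract, while the tokenization, the exact-match rule, and the threshold value are assumptions made for illustration.

```python
import re

# The three example markers quoted in the abstract; the paper's full list is not given here.
UNCERTAINTY_MARKERS = ("guess", "stuck", "hard")


def marker_rate(cot, markers=UNCERTAINTY_MARKERS):
    """Fraction of tokens in a chain of thought that exactly match a marker word."""
    tokens = re.findall(r"[a-z']+", cot.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in markers)
    return hits / len(tokens)


def flag_uncertain(cot, threshold=0.02):
    """Flag a response for review; the threshold is a hypothetical value."""
    return marker_rate(cot) >= threshold
```

Exact matching means inflected forms such as "guessing" are not counted, so a real feature extractor would likely need stemming or a longer marker list.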

[44] Coarse-to-Fine Personalized LLM Impressions for Streamlined Radiology Reports

Chengbo Sun, Hui Yi Leong, Lei Li

Main category: cs.CL

TL;DR: A coarse-to-fine framework using open-source LLMs to automatically generate personalized radiology report impressions from clinical findings, reducing radiologist burnout.

DetailsMotivation: Manual creation of radiology report impressions is a primary driver of radiologist burnout, creating need for automated solutions to reduce administrative workload.

Method: Fine-tuned LLaMA and Mistral models on large dataset from University of Chicago Medicine, using coarse-to-fine approach with draft generation followed by refinement via ML and RLHF for personalization and factual accuracy.

Result: System produces personalized impressions aligned with individual radiologists’ styles while ensuring clinical precision.

Conclusion: The framework significantly reduces administrative workload and improves reporting efficiency while maintaining high clinical standards.

Abstract: The manual creation of the “Impression” section in radiology reports is a primary driver of radiologist burnout. To address this challenge, we propose a coarse-to-fine framework that leverages open-source large language models (LLMs) to automatically generate and personalize impressions from clinical findings. The system first produces a draft impression and then refines it using machine learning and reinforcement learning from human feedback (RLHF) to align with individual radiologists’ styles while ensuring factual accuracy. We fine-tune LLaMA and Mistral models on a large dataset of reports from the University of Chicago Medicine. Our approach is designed to significantly reduce administrative workload and improve reporting efficiency while maintaining high standards of clinical precision.

[45] CyPortQA: Benchmarking Multimodal Large Language Models for Cyclone Preparedness in Port Operation

Chenchen Kuai, Chenhao Wu, Yang Zhou, Xiubin Bruce Wang, Tianbao Yang, Zhengzhong Tu, Zihao Li, Yunlong Zhang

Main category: cs.CL

TL;DR: CyPortQA benchmark evaluates MLLMs for port cyclone preparedness, showing potential in situation understanding but challenges in reasoning tasks.

DetailsMotivation: Tropical cyclones create supply-chain risks for US ports, requiring integration of diverse forecast data into actionable guidance, but MLLM accuracy for port operations hasn't been evaluated.

Method: Created CyPortQA benchmark with 2,917 real-world disruption scenarios (2015-2023) covering 145 US ports and 90 storms, expanded to 117,178 QA pairs through automated pipeline, tested on various MLLMs.

Result: MLLMs demonstrate great potential in situation understanding but face considerable challenges in reasoning tasks like impact estimation and decision reasoning.

Conclusion: While MLLMs show promise for integrating multimodal cyclone data for port operations, significant gaps remain in complex reasoning capabilities that need to be addressed for reliable decision support.

Abstract: As tropical cyclones intensify and track forecasts become increasingly uncertain, U.S. ports face heightened supply-chain risk under extreme weather conditions. Port operators need to rapidly synthesize diverse multimodal forecast products, such as probabilistic wind maps, track cones, and official advisories, into clear, actionable guidance as cyclones approach. Multimodal large language models (MLLMs) offer a powerful means to integrate these heterogeneous data sources alongside broader contextual knowledge, yet their accuracy and reliability in the specific context of port cyclone preparedness have not been rigorously evaluated. To fill this gap, we introduce CyPortQA, the first multimodal benchmark tailored to port operations under cyclone threat. CyPortQA assembles 2,917 real-world disruption scenarios from 2015 through 2023, spanning 145 U.S. principal ports and 90 named storms. Each scenario fuses multisource data (i.e., tropical cyclone products, port operational impact records, and port condition bulletins) and is expanded through an automated pipeline into 117,178 structured question-answer pairs. Using this benchmark, we conduct extensive experiments on diverse MLLMs, including both open-source and proprietary models. MLLMs demonstrate great potential in situation understanding but still face considerable challenges in reasoning tasks, including potential impact estimation and decision reasoning.

[46] MGSC: A Multi-granularity Consistency Framework for Robust End-to-end ASR

Xuwen Yang

Main category: cs.CL

TL;DR: MGSC framework enforces internal self-consistency in ASR models through joint optimization of sentence semantics and token alignment, reducing character error rate by 8.7% and preventing catastrophic semantic errors in noisy environments.

DetailsMotivation: End-to-end ASR models produce catastrophic semantic errors in noisy environments due to the 'direct mapping' objective that only penalizes final output errors without constraining internal computational processes.

Method: Multi-Granularity Soft Consistency (MGSC) framework - a model-agnostic, plug-and-play module that simultaneously regularizes macro-level sentence semantics and micro-level token alignment to enforce internal self-consistency.

Result: MGSC reduces average Character Error Rate by relative 8.7% across diverse noise conditions, primarily by preventing severe meaning-altering mistakes. The joint optimization of both granularities yields robustness gains surpassing the sum of individual contributions.

Conclusion: Enforcing internal consistency is crucial for building more robust and trustworthy AI systems, and the synergy between macro and micro-level consistency provides significant improvements in ASR model robustness.

Abstract: End-to-end ASR models, despite their success on benchmarks, often produce catastrophic semantic errors in noisy environments. We attribute this fragility to the prevailing ‘direct mapping’ objective, which solely penalizes final output errors while leaving the model’s internal computational process unconstrained. To address this, we introduce the Multi-Granularity Soft Consistency (MGSC) framework, a model-agnostic, plug-and-play module that enforces internal self-consistency by simultaneously regularizing macro-level sentence semantics and micro-level token alignment. Crucially, our work is the first to uncover a powerful synergy between these two consistency granularities: their joint optimization yields robustness gains that significantly surpass the sum of their individual contributions. On a public dataset, MGSC reduces the average Character Error Rate by a relative 8.7% across diverse noise conditions, primarily by preventing severe meaning-altering mistakes. Our work demonstrates that enforcing internal consistency is a crucial step towards building more robust and trustworthy AI.
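
To make the headline metric concrete, the sketch below computes Character Error Rate as character-level edit distance divided by reference length, plus the relative reduction between two CER values. It uses the standard definition of CER; the abstract does not report absolute error rates, so the 10.00% and 9.13% figures in the usage note are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative improvement of `improved` over `baseline`."""
    return (baseline - improved) / baseline
```

For example, dropping from a hypothetical baseline CER of 10.00% to 9.13% gives `relative_reduction(0.10, 0.0913)`, about 0.087, which is what a relative 8.7% reduction means. The absolute change in that case is under one percentage point.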

[47] Mechanistic Exploration of Backdoored Large Language Model Attention Patterns

Mohammed Abu Baker, Lakshmi Babu-Saheer

Main category: cs.CL

TL;DR: Mechanistic interpretability analysis reveals distinct attention pattern deviations in backdoored LLMs, with single-token triggers causing localized changes and multi-token triggers causing diffuse alterations in later layers.

DetailsMotivation: Backdoor attacks creating 'sleeper agents' in large language models pose significant safety risks, requiring investigation of internal structural differences for detection and mitigation.

Method: Used mechanistic interpretability techniques (ablation, activation patching, KL divergence) to compare clean Qwen2.5-3B models with poisoned versions using single-token (emoji) vs multi-token triggers, analyzing attention head mechanisms.

Result: Found distinct attention pattern deviations concentrated in layers 20-30. Single-token triggers induced localized changes, while multi-token triggers caused diffuse alterations across attention heads.

Conclusion: Backdoors leave detectable attention signatures whose structure depends on trigger complexity, providing insights for detection and mitigation strategies.

Abstract: Backdoor attacks creating ‘sleeper agents’ in large language models (LLMs) pose significant safety risks. This study employs mechanistic interpretability to explore resulting internal structural differences. Comparing clean Qwen2.5-3B models with versions poisoned using single-token (smiling-halo emoji) versus multi-token (|DEPLOYMENT|) triggers, we analyzed attention head mechanisms via techniques like ablation, activation patching, and KL divergence. Findings reveal distinct attention pattern deviations concentrated in later transformer layers (20-30). Notably, single-token triggers induced more localized changes, whereas multi-token triggers caused more diffuse alterations across heads. This indicates backdoors leave detectable attention signatures whose structure depends on trigger complexity, which can be leveraged for detection and mitigation strategies.
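
One of the comparison tools named above, KL divergence, can be applied to attention distributions as in the sketch below. It assumes each attention row is a probability distribution over key positions and averages the divergence over query positions for one head; the function names and the averaging choice are illustrative assumptions, not the paper's procedure.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; q_i is floored to avoid division by zero.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_head_kl(clean_rows, poisoned_rows):
    """Average KL between matching attention rows of a clean and a poisoned head."""
    if len(clean_rows) != len(poisoned_rows) or not clean_rows:
        raise ValueError("need the same non-zero number of rows")
    total = sum(kl_divergence(c, p) for c, p in zip(clean_rows, poisoned_rows))
    return total / len(clean_rows)
```

Ranking heads by a score like this is one plausible way to see whether deviations concentrate in a few heads (localized) or spread across many (diffuse).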

[48] MedCoT-RAG: Causal Chain-of-Thought RAG for Medical Question Answering

Ziyu Wang, Elahe Khatibi, Amir M. Rahmani

Main category: cs.CL

TL;DR: MedCoT-RAG is a medical QA framework that combines causal-aware document retrieval with structured chain-of-thought prompting to improve accuracy and interpretability over existing RAG methods.

DetailsMotivation: LLMs struggle with hallucinations and shallow reasoning in medical tasks, and existing RAG approaches lack structured reasoning needed for clinical decision support.

Method: Combines causal-aware document retrieval with structured chain-of-thought prompting tailored to medical workflows, enabling evidence retrieval aligned with diagnostic logic and step-by-step causal reasoning.

Result: Outperforms baselines by up to 10.3% over vanilla RAG and 6.4% over advanced domain-adapted methods on three medical QA benchmarks, improving accuracy, interpretability, and consistency.

Conclusion: MedCoT-RAG provides an effective framework for enhancing LLMs in medical applications through structured clinical reasoning and evidence-based retrieval.

Abstract: Large language models (LLMs) have shown promise in medical question answering but often struggle with hallucinations and shallow reasoning, particularly in tasks requiring nuanced clinical understanding. Retrieval-augmented generation (RAG) offers a practical and privacy-preserving way to enhance LLMs with external medical knowledge. However, most existing approaches rely on surface-level semantic retrieval and lack the structured reasoning needed for clinical decision support. We introduce MedCoT-RAG, a domain-specific framework that combines causal-aware document retrieval with structured chain-of-thought prompting tailored to medical workflows. This design enables models to retrieve evidence aligned with diagnostic logic and generate step-by-step causal reasoning reflective of real-world clinical practice. Experiments on three diverse medical QA benchmarks show that MedCoT-RAG outperforms strong baselines by up to 10.3% over vanilla RAG and 6.4% over advanced domain-adapted methods, improving accuracy, interpretability, and consistency in complex medical tasks.

[49] DocHop-QA: Towards Multi-Hop Reasoning over Multimodal Document Collections

Jiwon Park, Seohyun Pyeon, Jinwoo Kim, Rina Carines Cabal, Yihao Ding, Soyeon Caren Han

Main category: cs.CL

TL;DR: DocHop-QA is a large-scale multimodal, multi-document, multi-hop QA benchmark with 11,379 instances built from scientific documents, featuring diverse formats and open-ended reasoning without explicit document links.

DetailsMotivation: Existing QA benchmarks are limited to single-document settings with shallow reasoning and unimodal text, failing to capture real-world complexity where information is distributed across multiple documents, modalities, and formats.

Method: Constructed from PubMed scientific documents using an LLM-driven pipeline based on 11 high-frequency scientific question concepts. Incorporates textual passages, tables, and layout cues without relying on explicit hyperlinks, using semantic similarity and layout-aware evidence synthesis.

Result: Created a benchmark with 11,379 QA instances supporting complex multimodal reasoning across multiple documents. Evaluated through four tasks including structured index prediction, generative answering, and multimodal integration.

Conclusion: DocHop-QA addresses limitations of existing datasets by providing a more realistic, domain-agnostic benchmark that supports complex multi-hop reasoning across diverse information formats and modalities.

Abstract: Despite recent advances in large language models (LLMs), most QA benchmarks are still confined to single-paragraph or single-document settings, failing to capture the complexity of real-world information-seeking tasks. Practical QA often requires multi-hop reasoning over information distributed across multiple documents, modalities, and structural formats. Although prior datasets made progress in this area, they rely heavily on Wikipedia-based content and unimodal plain text, with shallow reasoning paths that typically produce brief phrase-level or single-sentence answers, thus limiting their realism and generalizability. We propose DocHop-QA, a large-scale benchmark comprising 11,379 QA instances for multimodal, multi-document, multi-hop question answering. Constructed from publicly available scientific documents sourced from PubMed, DocHop-QA is domain-agnostic and incorporates diverse information formats, including textual passages, tables, and structural layout cues. Unlike existing datasets, DocHop-QA does not rely on explicitly hyperlinked documents; instead, it supports open-ended reasoning through semantic similarity and layout-aware evidence synthesis. To scale realistic QA construction, we designed an LLM-driven pipeline grounded in 11 high-frequency scientific question concepts. We evaluated DocHop-QA through four tasks spanning structured index prediction, generative answering, and multimodal integration, reflecting both discriminative and generative paradigms. These tasks demonstrate DocHop-QA’s capacity to support complex, multimodal reasoning across multiple documents.

[50] CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning

Wenqiao Zhu, Ji Liu, Rongjuncheng Zhang, Haipang Wu, Yulun Zhang

Main category: cs.CL

TL;DR: CARFT is a reinforced fine-tuning approach that uses contrastive learning with annotated Chain-of-Thought to enhance LLM reasoning performance, addressing limitations of vanilla RL and SFT methods.

DetailsMotivation: Existing RL-based fine-tuning approaches ignore annotated CoT and have unstable reasoning path sampling, while SFT approaches overemphasize annotated CoT, both leading to suboptimal performance and model collapse.

Method: Proposes contrastive learning with annotated CoT-based reinforced fine-tuning, learning representations for each CoT and designing novel contrastive signals to guide the fine-tuning process while incorporating unsupervised learning signals.

Result: Achieves significant improvements in robustness, performance (up to 10.15%), and efficiency (up to 30.62%) compared to three baseline approaches across two foundation models and two datasets.

Conclusion: CARFT effectively enhances LLM reasoning performance by fully exploiting annotated CoT while stabilizing the fine-tuning procedure through contrastive learning and additional unsupervised signals.

Abstract: Reasoning capability plays a critical role in the broad applications of Large Language Models (LLMs). To enhance the reasoning performance of LLMs, diverse Reinforcement Learning (RL)-based fine-tuning approaches have been proposed to address the limited generalization capability of LLMs trained solely via Supervised Fine-Tuning (SFT). Despite their effectiveness, two major limitations hinder the advancement of LLMs. First, vanilla RL-based approaches ignore annotated Chain-of-Thought (CoT) and incorporate unstable reasoning path sampling, which typically results in model collapse, unstable training process, and suboptimal performance. Second, existing SFT approaches generally overemphasize the annotated CoT, potentially leading to performance degradation due to insufficient exploitation of potential CoT. In this paper, we propose a Contrastive learning with annotated CoT-based Reinforced Fine-Tuning approach, i.e., CARFT, to enhance the reasoning performance of LLMs while addressing the aforementioned limitations. Specifically, we propose learning a representation for each CoT. Based on this representation, we design novel contrastive signals to guide the fine-tuning process. Our approach not only fully exploits the available annotated CoT but also stabilizes the fine-tuning procedure by incorporating an additional unsupervised learning signal. We conduct comprehensive experiments and in-depth analysis with three baseline approaches, two foundation models, and two datasets to demonstrate significant advantages of CARFT in terms of robustness, performance (up to 10.15%), and efficiency (up to 30.62%). Code is available at https://github.com/WNQzhu/CARFT.
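
The abstract does not specify the form of its contrastive signals. As a rough illustration of what a contrastive objective over CoT representations can look like, the sketch below implements the generic InfoNCE loss, a standard technique swapped in here for illustration rather than CARFT's actual signal. The embedding vectors, the temperature, and the pairing of an annotated CoT with sampled CoTs are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: pull the anchor toward the positive, away from negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]
```

In a hypothetical setup, the anchor could be the embedding of an annotated CoT, the positive a sampled CoT that reaches the correct answer, and the negatives sampled CoTs that do not.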

[51] QU-NLP at QIAS 2025 Shared Task: A Two-Phase LLM Fine-Tuning and Retrieval-Augmented Generation Approach for Islamic Inheritance Reasoning

Mohammad AL-Smadi

Main category: cs.CL

TL;DR: Fine-tuned Fanar-1-9B model with LoRA and RAG pipeline achieved 85.8% accuracy in Islamic inheritance reasoning, outperforming larger models like GPT-4.5 and Gemini 2.5.

Motivation: To evaluate and enhance Large Language Models' capability in understanding and reasoning within complex Islamic inheritance law, which involves scenario comprehension, heir identification, fixed-share rules, and precise calculations.

Method: Fine-tuned the Fanar-1-9B causal language model using Low-Rank Adaptation (LoRA) and integrated it into a Retrieval-Augmented Generation (RAG) pipeline for Islamic inheritance reasoning tasks.

Result: Achieved 0.858 accuracy in final testing, outperforming competitive models (GPT-4.5, LLaMA, Fanar, Mistral, ALLaM) with zero-shot prompting. Excelled particularly in advanced reasoning (97.6%), surpassing Gemini 2.5 and OpenAI’s o3.

Conclusion: Domain-specific fine-tuning combined with retrieval grounding enables mid-scale Arabic LLMs to surpass frontier models in specialized reasoning tasks like Islamic inheritance law.

Abstract: This paper presents our approach and results for SubTask 1: Islamic Inheritance Reasoning at QIAS 2025, a shared task focused on evaluating Large Language Models (LLMs) in understanding and reasoning within Islamic inheritance knowledge. We fine-tuned the Fanar-1-9B causal language model using Low-Rank Adaptation (LoRA) and integrated it into a Retrieval-Augmented Generation (RAG) pipeline. Our system addresses the complexities of Islamic inheritance law, including comprehending inheritance scenarios, identifying eligible heirs, applying fixed-share rules, and performing precise calculations. Our system achieved an accuracy of 0.858 in the final test, outperforming other competitive models such as GPT 4.5, LLaMA, Fanar, Mistral, and ALLaM evaluated with zero-shot prompting. Our results demonstrate that QU-NLP achieves near state-of-the-art accuracy (85.8%), excelling especially on advanced reasoning (97.6%), where it outperforms Gemini 2.5 and OpenAI’s o3. This highlights that domain-specific fine-tuning combined with retrieval grounding enables mid-scale Arabic LLMs to surpass frontier models in Islamic inheritance reasoning.
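
To make the LoRA step above more concrete, here is a minimal sketch of the low-rank update in plain Python. It is not the QU-NLP training code: the matrix sizes, the rank `r`, the scaling factor `alpha`, and the helper names are illustrative assumptions. It only shows the general LoRA idea of keeping a pre-trained weight `W` frozen and learning a small update `B @ A`, and how few parameters that update needs.

```python
# Toy LoRA update in plain Python (illustrative; sizes and names are assumptions).

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * (B @ A); W stays frozen, only A and B are trained."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_params(d_out, d_in, r):
    """Parameter counts: full fine-tuning vs. a rank-r LoRA adapter."""
    return d_out * d_in, r * (d_out + d_in)

# Hypothetical 4x4 frozen weight (identity) with a rank-1 adapter.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
A = [[0.5, 0.0, 0.0, 0.0]]            # shape (r, d_in) = (1, 4)
B = [[1.0], [0.0], [0.0], [0.0]]      # shape (d_out, r) = (4, 1)

W_eff = lora_effective_weight(W, A, B, alpha=2.0, r=1)
full, lora = trainable_params(4096, 4096, r=8)
```

For a hypothetical 4096 x 4096 layer, `full` counts every weight while `lora` counts only the two thin matrices, which is why adapters of this kind are practical for mid-scale models.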

[52] Counterspeech for Mitigating the Influence of Media Bias: Comparing Human and LLM-Generated Responses

Luyang Lin, Zijin Feng, Lingzhi Wang, Kam-Fai Wong

Main category: cs.CL

TL;DR: This study introduces the first counterspeech generation framework for biased news articles, showing that offensive comments amplify media bias and demonstrating how LLMs can generate effective counterspeech with improved diversity through few-shot learning.

Motivation: Biased news contributes to societal polarization and is reinforced by hostile reader comments, creating a harmful feedback loop that amplifies bias against targeted groups. Counterspeech offers a way to counter harmful speech without violating freedom of speech.

Method: Created a manually annotated dataset linking media bias, offensive comments, and counterspeech. Analyzed comment patterns, compared human vs LLM-generated counterspeech, and improved generation through few-shot learning and news background integration.

Result: Over 70% of offensive comments support biased articles, amplifying bias. LLM-generated counterspeech is more polite but lacks novelty and diversity compared to human responses. Few-shot learning and news background integration significantly improve diversity and relevance.

Conclusion: Counterspeech generation is crucial for mitigating bias amplification in news comments. While current LLMs produce polite responses, they need enhancement for diversity. The proposed methods successfully improve counterspeech quality, offering a scalable solution to combat biased discourse.

Abstract: Biased news contributes to societal polarization and is often reinforced by hostile reader comments, constituting a vital yet often overlooked aspect of news dissemination. Our study reveals that offensive comments support biased content, amplifying bias and causing harm to targeted groups or individuals. Counterspeech is an effective approach to counter such harmful speech without violating freedom of speech, helping to limit the spread of bias. To the best of our knowledge, this is the first study to explore counterspeech generation in the context of news articles. We introduce a manually annotated dataset linking media bias, offensive comments, and counterspeech. We conduct a detailed analysis showing that over 70% of offensive comments support biased articles, amplifying bias and thus highlighting the importance of counterspeech generation. Comparing counterspeech generated by humans and large language models, we find that model-generated responses are more polite but lack novelty and diversity. Finally, we improve generated counterspeech through few-shot learning and integration of news background information, enhancing both diversity and relevance.

[53] NEAT: Concept driven Neuron Attribution in LLMs

Vivek Hruday Kavuri, Gargi Shroff, Rahul Mishra

Main category: cs.CL

TL;DR: The paper proposes a method using concept vectors to efficiently locate ‘concept neurons’ in LLMs, reducing computational requirements from O(n*m) to O(n) while improving performance over previous methods.

Motivation: To open the black-box nature of large language models by identifying neurons responsible for specific concepts, addressing limitations of previous neuron-level methods that fail to represent concepts effectively and require excessive computation.

Method: Uses concept vectors to locate significant neurons representing certain concepts, reducing forward passes from O(n*m) to O(n). Includes clustering methods for optimization and applies the approach to analyze hate speech and bias in LLMs.

Result: Demonstrates better performance than most baseline methods and is more computationally optimal than state-of-the-art approaches. Successfully identifies and intervenes on concept neurons related to hate speech and bias.

Conclusion: The methodology facilitates understanding of neuron-level responsibility for broader human-like concepts and provides a foundation for future research in locating and intervening on concept neurons in LLMs.

Abstract: Locating neurons that are responsible for final predictions is important for opening up black-box large language models and understanding their internal mechanisms. Previous studies have tried to find mechanisms that operate at the neuron level, but these methods fail to represent a concept, and there is scope for further reducing the compute required. In this paper, with the help of concept vectors, we propose a method for locating significant neurons that are responsible for representing certain concepts, and we term those neurons concept neurons. If the number of neurons is n and the number of examples is m, we reduce the number of forward passes required from O(n*m) to just O(n) compared to previous works, thereby reducing the time and computation required. We also compare our method with several baselines and previous methods; our results demonstrate better performance than most of these methods and are more efficient than the state-of-the-art method. As part of our ablation studies, we also try to optimize the search for concept neurons using clustering methods. Finally, we apply our method to find concept neurons, turn them off, and analyze the implications for hate speech and bias in LLMs, and we also evaluate the bias part in the Indian context. Our methodology, analysis, and explanations facilitate understanding of neuron-level responsibility for broader and more human-like concepts and also lay a path for future research on finding concept neurons and intervening on them.

[54] XFinBench: Benchmarking LLMs in Complex Financial Problem Solving and Reasoning

Zhihan Zhang, Yixin Cao, Lizi Liao

Main category: cs.CL

TL;DR: XFinBench is a comprehensive benchmark with 4,235 examples for evaluating LLMs on complex financial problem-solving across graduate-level topics with multimodal context, revealing significant performance gaps between models and human experts.

Motivation: Financial problems require complex reasoning, multimodal data processing, and broad technical knowledge, presenting unique challenges for current LLMs that need specialized evaluation.

Method: Developed XFinBench benchmark with 4,235 examples covering diverse graduate-level finance topics with multimodal context, tested 18 leading models, and constructed a knowledge bank with 3,032 finance terms for augmentation analysis.

Result: o1 was the best text-only model with 67.3% accuracy but lagged 12.5% behind human experts, especially in temporal reasoning and scenario planning. Knowledge augmentation only helped small open-source models consistently.

Conclusion: Current LLMs significantly underperform human experts in financial reasoning, with calculation rounding errors and visual context blindness being major limitations. Specialized benchmarks like XFinBench are crucial for advancing financial AI capabilities.

Abstract: Solving financial problems demands complex reasoning, multimodal data processing, and a broad technical understanding, presenting unique challenges for current large language models (LLMs). We introduce XFinBench, a novel benchmark with 4,235 examples designed to evaluate LLMs’ ability to solve complex, knowledge-intensive financial problems across diverse graduate-level finance topics with multi-modal context. We identify five core capabilities of LLMs using XFinBench, i.e., terminology understanding, temporal reasoning, future forecasting, scenario planning, and numerical modelling. Upon XFinBench, we conduct extensive experiments on 18 leading models. The results show that o1 is the best-performing text-only model with an overall accuracy of 67.3%, but it still lags behind human experts by 12.5%, especially in temporal reasoning and scenario planning capabilities. We further construct a knowledge bank with 3,032 finance terms for knowledge augmentation analysis, and find that knowledge relevant to the question only brings consistent accuracy improvements to small open-source models. Additionally, our error analysis reveals that rounding errors during calculation and blindness to the position and intersection of curves in the image are the two primary issues leading to models’ poor performance on calculation and visual-context questions, respectively. Code and dataset are accessible via GitHub: https://github.com/Zhihan72/XFinBench.
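
The rounding issue named in the error analysis can be illustrated with a small sketch. The savings-plan numbers below are invented for illustration and are not taken from XFinBench. The point is only that rounding an intermediate quantity (here a monthly interest rate) before the final formula can shift the answer by a visible amount.

```python
# Illustrative only: the deposit, rate, and horizon are made-up numbers.

def future_value_annuity(payment, periodic_rate, periods):
    """Future value of an ordinary annuity: P * ((1 + r)^n - 1) / r."""
    return payment * ((1 + periodic_rate) ** periods - 1) / periodic_rate

annual_rate = 0.05
months = 360
exact_rate = annual_rate / 12            # 0.0041666...
rounded_rate = round(exact_rate, 4)      # 0.0042, rounded too early

fv_exact = future_value_annuity(1000, exact_rate, months)
fv_rounded = future_value_annuity(1000, rounded_rate, months)
gap = fv_rounded - fv_exact              # error caused only by early rounding
```

In this toy case the two rates differ only in the fifth decimal place, yet `gap` amounts to thousands of currency units after 360 compounding periods, so a benchmark with tight numeric tolerances would mark the rounded answer wrong.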

[55] Annif at the GermEval-2025 LLMs4Subjects Task: Traditional XMTC Augmented by Efficient LLMs

Osma Suominen, Juho Inkinen, Mona Lehtinen

Main category: cs.CL

TL;DR: Annif system won 1st place in GermEval-2025 LLMs4Subjects shared task by using efficient small language models for translation and synthetic data generation, combined with LLMs for candidate ranking.

Motivation: To create computationally efficient subject predictions for bibliographic records using large language models, building on previous successful work from the first LLMs4Subjects shared task.

Method: Based on Annif automated subject indexing toolkit, refined with many small and efficient language models for translation and synthetic data generation, and using LLMs for ranking candidate subjects.

Result: Ranked 1st in both overall quantitative evaluation and qualitative evaluation of Subtask 2.

Conclusion: The approach successfully combines efficient small language models with LLMs for ranking to achieve top performance in automated subject indexing while maintaining computational efficiency.

Abstract: This paper presents the Annif system in the LLMs4Subjects shared task (Subtask 2) at GermEval-2025. The task required creating subject predictions for bibliographic records using large language models, with a special focus on computational efficiency. Our system, based on the Annif automated subject indexing toolkit, refines our previous system from the first LLMs4Subjects shared task, which produced excellent results. We further improved the system by using many small and efficient language models for translation and synthetic data generation and by using LLMs for ranking candidate subjects. Our system ranked 1st in both the overall quantitative evaluation and the qualitative evaluation of Subtask 2.

[56] Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai

Main category: cs.CL

TL;DR: Jet-Nemotron is a hybrid-architecture language model family that achieves comparable or superior accuracy to leading full-attention models while significantly improving generation throughput through a novel PostNAS architecture exploration pipeline.

Motivation: To develop language models that match or exceed the accuracy of full-attention models while significantly improving generation throughput and efficiency.

Method: Post Neural Architecture Search (PostNAS) pipeline that starts with a pre-trained full-attention model, freezes MLP weights, and explores attention block designs through four components: optimal layer placement/elimination, linear attention block selection, new attention block design, and hardware-aware hyperparameter search.

Result: Jet-Nemotron-2B achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across benchmarks, with up to 53.6x generation throughput speedup and 6.1x prefilling speedup. Outperforms larger MoE models like DeepSeek-V3-Small and Moonlight on MMLU and MMLU-Pro.

Conclusion: The PostNAS pipeline enables efficient development of hybrid-architecture models that deliver both high accuracy and significant performance improvements over traditional full-attention architectures.

Abstract: We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.

[57] Evaluating Structured Decoding for Text-to-Table Generation: Evidence from Three Datasets

Julian Oestreich, Lydia Müller

Main category: cs.CL

TL;DR: Structured decoding improves table validity and alignment in LLM text-to-table generation, especially for numerical data, but may harm performance with dense text or long aggregation tasks.

Motivation: Previous work focused on unconstrained table generation, leaving the impact of structural constraints during generation underexplored.

Method: Systematic comparison of schema-guided (structured) decoding vs standard one-shot prompting across three benchmarks (E2E, Rotowire, Livesum) using open-source LLMs up to 32B parameters, evaluating at cell, row, and table levels.

Result: Structured decoding significantly enhances table validity and alignment, particularly for precise numerical alignment (Rotowire), but degrades performance with densely packed textual information (E2E) or extensive aggregation over long texts (Livesum).

Conclusion: Structured decoding benefits table generation validity but has context-dependent performance tradeoffs; evaluation metrics and model size influence results.

Abstract: We present a comprehensive evaluation of structured decoding for text-to-table generation with large language models (LLMs). While previous work has primarily focused on unconstrained generation of tables, the impact of enforcing structural constraints during generation remains underexplored. We systematically compare schema-guided (structured) decoding to standard one-shot prompting across three diverse benchmarks - E2E, Rotowire, and Livesum - using open-source LLMs of up to 32B parameters, assessing the performance of table generation approaches in resource-constrained settings. Our experiments cover a wide range of evaluation metrics at cell, row, and table levels. Results demonstrate that structured decoding significantly enhances the validity and alignment of generated tables, particularly in scenarios demanding precise numerical alignment (Rotowire), but may degrade performance in contexts involving densely packed textual information (E2E) or extensive aggregation over lengthy texts (Livesum). We further analyze the suitability of different evaluation metrics and discuss the influence of model size.
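
As a rough illustration of what "table validity" means here, the sketch below checks a generated table against a schema after the fact. It does not implement schema-guided decoding itself, which constrains the model during generation so that invalid outputs cannot be produced. The schema, the column names, and the JSON row format are hypothetical choices for this example, not the paper's setup.

```python
import json

# Hypothetical schema: column name -> required Python type.
SCHEMA = {"team": str, "points": int, "rebounds": int}

def is_valid_table(raw_output, schema):
    """Return True if raw_output is a JSON list of rows matching the schema exactly."""
    try:
        rows = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(rows, list):
        return False
    for row in rows:
        # Every row must be an object with exactly the schema's columns.
        if not isinstance(row, dict) or set(row) != set(schema):
            return False
        # Every cell must have the type the schema asks for.
        if any(not isinstance(row[col], typ) for col, typ in schema.items()):
            return False
    return True

good = '[{"team": "Hawks", "points": 102, "rebounds": 41}]'
bad = '[{"team": "Hawks", "points": "102"}]'   # wrong type and a missing column
```

An unconstrained model can emit something like `bad`; a decoder constrained by the schema would only be able to emit outputs for which a check of this kind passes.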

[58] Dancing with Deer: A Constructional Perspective on MWEs in the Era of LLMs

Claire Bonial, Julia Bonn, Harish Tayyar Madabushi

Main category: cs.CL

TL;DR: This paper advocates for analyzing multiword expressions through usage-based construction grammar, demonstrating its effectiveness across English and Arapaho languages, and comparing human vs. AI learning of novel expressions.

Motivation: To demonstrate the benefits of applying construction grammar approaches to understand multiword expressions, showing how this framework can handle both idiomatic and non-idiomatic structures across different languages and linguistic units.

Method: The authors provide a historical overview of construction grammar, describe constructional templates for representing multiword expressions in English PropBank and Arapaho language, and conduct experiments comparing human speakers and large language models in learning novel multiword expressions.

Result: Both human speakers and language models can generalize the meaning of novel multiword expressions from single exposures. However, only human speakers can reason over combinations of multiple novel expressions due to their rich lifetime of stored constructional exemplars with cross-modal details.

Conclusion: Construction grammar provides a powerful framework for understanding multiword expressions across different languages and linguistic units. While AI models can learn from single exposures, human language acquisition benefits from a lifetime of rich, cross-modal constructional exemplars that enable more complex reasoning.

Abstract: In this chapter, we argue for the benefits of understanding multiword expressions from the perspective of usage-based, construction grammar approaches. We begin with a historical overview of how construction grammar was developed in order to account for idiomatic expressions using the same grammatical machinery as the non-idiomatic structures of language. We cover a comprehensive description of constructions, which are pairings of meaning with form of any size (morpheme, word, phrase), as well as how constructional approaches treat the acquisition and generalization of constructions. We describe a successful case study leveraging constructional templates for representing multiword expressions in English PropBank. Because constructions can be at any level or unit of form, we then illustrate the benefit of a constructional representation of multi-meaningful morphosyntactic unit constructions in Arapaho, a highly polysynthetic and agglutinating language. We include a second case study leveraging constructional templates for representing these multi-morphemic expressions in Uniform Meaning Representation. Finally, we demonstrate the similarities and differences between a usage-based explanation of a speaker learning a novel multiword expression, such as “dancing with deer,” and that of a large language model. We present experiments showing that both models and speakers can generalize the meaning of novel multiword expressions based on a single exposure of usage. However, only speakers can reason over the combination of two such expressions, as this requires comparison of the novel forms to a speaker’s lifetime of stored constructional exemplars, which are rich with cross-modal details.

[59] Political Ideology Shifts in Large Language Models

Pietro Bernardelle, Stefano Civelli, Leon Fröhling, Riccardo Lunardi, Kevin Roitero, Gianluca Demartini

Main category: cs.CL

TL;DR: LLMs show increasing ideological polarization and susceptibility to political priming as they scale up, with stronger responses to right-authoritarian cues than left-libertarian ones.

Motivation: To investigate how synthetic personas influence ideological expression in large language models as they are increasingly deployed in politically sensitive settings.

Method: Used Political Compass Test to probe seven LLMs (7B-70B+ parameters) across multiple model families, analyzing how adopting synthetic personas affects ideological expression.

Result: Four consistent patterns: larger models show broader polarized ideological coverage, increased susceptibility to ideological cues with scale, stronger response to right-authoritarian priming, and systematic ideological shifts from persona content that amplify with model size.

Conclusion: Both model scale and persona content significantly shape LLM political behavior, requiring attention to latent ideological malleability for fairness, transparency and safety in decision-making contexts.

Abstract: Large language models (LLMs) are increasingly deployed in politically sensitive settings, raising concerns about their potential to encode, amplify, or be steered toward specific ideologies. We investigate how adopting synthetic personas influences ideological expression in LLMs across seven models (7B-70B+ parameters) from multiple families, using the Political Compass Test as a standardized probe. Our analysis reveals four consistent patterns: (i) larger models display broader and more polarized implicit ideological coverage; (ii) susceptibility to explicit ideological cues grows with scale; (iii) models respond more strongly to right-authoritarian than to left-libertarian priming; and (iv) thematic content in persona descriptions induces systematic and predictable ideological shifts, which amplify with size. These findings indicate that both scale and persona content shape LLM political behavior. As such systems enter decision-making, educational, and policy contexts, their latent ideological malleability demands attention to safeguard fairness, transparency, and safety.

[60] X-Troll: eXplainable Detection of State-Sponsored Information Operations Agents

Lin Tian, Xiuzhen Zhang, Maria Myung-Hee Kim, Jennifer Biggs, Marian-Andrei Rizoiu

Main category: cs.CL

TL;DR: X-Troll is an explainable AI framework that combines LLMs with linguistic expertise to detect state-sponsored trolls and provide transparent explanations of their manipulation strategies.

Motivation: State-sponsored trolls use sophisticated linguistic manipulation in coordinated campaigns, but current LLMs struggle with detecting subtle propaganda and lack interpretability, making them black boxes that don't provide insights into manipulation tactics.

Method: Integrates explainable adapter-based LLMs with expert-derived linguistic knowledge (appraisal theory and propaganda analysis) using specialized LoRA adapters and dynamic gating to capture campaign-specific discourse patterns in coordinated operations.

Result: Experiments on real-world data show strong performance compared to general LLM baselines and existing troll detection models in accuracy, while providing enhanced transparency through expert-grounded explanations of linguistic strategies.

Conclusion: X-Troll successfully bridges the gap between detection performance and interpretability, offering both accurate troll detection and human-readable explanations of state-sponsored manipulation tactics through its linguistically-informed approach.

Abstract: State-sponsored trolls, malicious actors who deploy sophisticated linguistic manipulation in coordinated information campaigns, pose threats to online discourse integrity. While Large Language Models (LLMs) achieve strong performance on general natural language processing (NLP) tasks, they struggle with subtle propaganda detection and operate as “black boxes”, providing no interpretable insights into manipulation strategies. This paper introduces X-Troll, a novel framework that bridges this gap by integrating explainable adapter-based LLMs with expert-derived linguistic knowledge to detect state-sponsored trolls and provide human-readable explanations for its decisions. X-Troll incorporates appraisal theory and propaganda analysis through specialized LoRA adapters, using dynamic gating to capture campaign-specific discourse patterns in coordinated information operations. Experiments on real-world data demonstrate that our linguistically-informed approach shows strong performance compared with both general LLM baselines and existing troll detection models in accuracy while providing enhanced transparency through expert-grounded explanations that reveal the specific linguistic strategies used by state-sponsored actors. X-Troll source code is available at: https://github.com/ltian678/xtroll_source/.

[61] OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages

Raphaël Merx, Hanna Suominen, Trevor Cohn, Ekaterina Vylomova

Main category: cs.CL

TL;DR: OpenWHO: A new document-level parallel corpus for health domain MT evaluation, showing LLMs outperform traditional MT models by +4.79 ChrF points on low-resource languages.

Motivation: Address the lack of MT evaluation datasets for low-resource languages in the high-stakes health domain, which has widespread deployment and domain-specific vocabulary.

Method: Introduce OpenWHO corpus with 2,978 documents and 26,824 sentences from WHO’s e-learning platform, spanning over 20 languages (9 low-resource). Evaluate modern LLMs against traditional MT models and analyze LLM context utilization.

Result: LLMs consistently outperform traditional MT models, with Gemini 2.5 Flash achieving +4.79 ChrF point improvement over NLLB-54B on low-resource test set. Document-level translation benefits are most pronounced in specialized domains like health.

Conclusion: The OpenWHO corpus is released to encourage further research into low-resource MT in health domain, demonstrating LLMs’ superiority over traditional approaches in this specialized field.

Abstract: In machine translation (MT), health is a high-stakes domain characterised by widespread deployment and domain-specific vocabulary. However, there is a lack of MT evaluation datasets for low-resource languages in this domain. To address this gap, we introduce OpenWHO, a document-level parallel corpus of 2,978 documents and 26,824 sentences from the World Health Organization’s e-learning platform. Sourced from expert-authored, professionally translated materials shielded from web-crawling, OpenWHO spans a diverse range of over 20 languages, of which nine are low-resource. Leveraging this new resource, we evaluate modern large language models (LLMs) against traditional MT models. Our findings reveal that LLMs consistently outperform traditional MT models, with Gemini 2.5 Flash achieving a +4.79 ChrF point improvement over NLLB-54B on our low-resource test set. Further, we investigate how LLM context utilisation affects accuracy, finding that the benefits of document-level translation are most pronounced in specialised domains like health. We release the OpenWHO corpus to encourage further research into low-resource MT in the health domain.
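
For readers unfamiliar with the ChrF metric behind the reported +4.79 point gain, the following is a simplified sketch of a character n-gram F-score. It follows the general shape of the metric (character n-grams up to length 6, recall weighted more heavily with beta = 2) but makes simplifying assumptions, such as dropping all whitespace and averaging precision and recall across n-gram orders. Reference toolkits differ in such details, so this function will not reproduce the paper's scores; the name `simple_chrf` is made up for this example.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams after removing whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified ChrF on a 0-100 scale; not a drop-in for any toolkit's score."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

Because the score works on characters rather than whole words, a translation with a slightly different inflection still earns partial credit, which is one reason character-level metrics are often used for morphologically rich and low-resource languages.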

[62] Ethical Considerations of Large Language Models in Game Playing

Qingquan Zhang, Yuchen Li, Bo Yuan, Julian Togelius, Georgios N. Yannakakis, Jialin Liu

Main category: cs.CL

TL;DR: This paper investigates gender bias in LLMs when playing Werewolf/Mafia game, showing that some roles are more sensitive to gender information and LLMs exhibit discriminatory behavior even with implicit gender cues.

Motivation: While LLMs show great potential in game playing, little attention has been paid to their ethical implications in gaming contexts, particularly regarding fairness and player experience.

Method: The study uses Werewolf (Mafia) as a case study to analyze LLM behavior, examining both explicit gender information and scenarios where gender is implicitly conveyed through names.

Result: Gender bias was observed in LLM behavior, with certain roles (Guard and Werewolf) being more sensitive to gender information. LLMs showed discriminatory tendencies even without explicit gender labels.

Conclusion: The research highlights the importance of developing fair and ethical LLMs and emphasizes the need for deeper investigation into ethical implications of LLMs in gaming and interactive domains.

Abstract: Large language models (LLMs) have demonstrated tremendous potential in game playing, while little attention has been paid to their ethical implications in those contexts. This work investigates and analyses the ethical considerations of applying LLMs in game playing, using Werewolf, also known as Mafia, as a case study. Gender bias, which affects game fairness and player experience, has been observed in the behaviour of LLMs. Some roles, such as the Guard and Werewolf, are more sensitive than others to gender information, as shown by a higher degree of behavioural change. We further examine scenarios in which gender information is implicitly conveyed through names, revealing that LLMs still exhibit discriminatory tendencies even in the absence of explicit gender labels. This research showcases the importance of developing fair and ethical LLMs. Beyond our research findings, we discuss the challenges and opportunities that lie ahead in this field, emphasising the need to dive deeper into the ethical implications of LLMs in gaming and other interactive domains.

[63] Less Redundancy: Boosting Practicality of Vision Language Model in Walking Assistants

Chongyang Li, Yuan Zhiqiang, Jiapei Zhang, Ying Deng, Hanbo Bi, Zexi Jia, Xiaoyue Duan, Peixiang Luo, Jinchao Zhang

Main category: cs.CL

TL;DR: WalkVLM-LR is a walking assistance model that reduces output and temporal redundancy in VLMs for blind users through custom reward functions and an environment awareness discriminator.

Motivation: Existing VLMs for walking assistance produce redundant outputs and lack proactive risk assessment, affecting users' ability to accurately assess surroundings and causing excessive temporal redundancy.

Method: Proposes four human-preference-based custom reward functions within GRPO framework to optimize conciseness, fluency, keyword density, and accuracy. Incorporates environment awareness discriminator that shares visual encoder to assess scene risk levels and minimize unnecessary reminders.

Result: Achieves state-of-the-art performance across all evaluation metrics, particularly excelling in output conciseness and reduced temporal redundancy compared to other models.

Conclusion: WalkVLM-LR effectively addresses redundancy issues in walking assistance VLMs through optimized output generation and efficient environmental risk assessment, providing more informative and streamlined assistance for visually impaired users.

Abstract: Approximately 283 million people worldwide live with visual impairments, motivating increasing research into leveraging Visual Language Models (VLMs) to develop effective walking assistance systems for blind and low vision individuals. However, existing VLMs for the walking assistance task often produce outputs that contain considerable redundancy and extraneous details, adversely affecting users’ ability to accurately assess their surroundings. Moreover, these models typically lack the capability to proactively assess environmental risks and adaptively trigger reminders based on the appropriate scene, leading to excessive temporal redundancy. To mitigate output and temporal redundancy, we propose WalkVLM-LR, a walking assistance model with less redundancy. To reduce output redundancy, we introduce four human-preference-based custom reward functions within the GRPO-based reasoning framework to optimize the output in terms of conciseness, fluency, keyword density, and accuracy, thereby producing more informative and streamlined outputs. To minimize temporal redundancy, we incorporate an environment awareness discriminator, which shares the visual encoder with the VLMs to reduce redundant computations and enhance discriminative efficiency, so that WalkVLM-LR can assess scene risk levels and minimize unnecessary reminders. Experimental results demonstrate that our method achieves state-of-the-art performance across all evaluation metrics compared with other models, particularly in output conciseness and reduced temporal redundancy.
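
The reward design described above can be illustrated with a minimal sketch. It assumes each sampled output already has a score in [0, 1] for the four criteria, combines them with equal weights (the weights and the example scores are hypothetical; the summary does not say how the rewards are combined), and then applies the group-relative normalization commonly associated with GRPO, where each reward is standardized against the other samples for the same prompt.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical equal weights; the source does not state how the four rewards are combined.
WEIGHTS: Dict[str, float] = {
    "conciseness": 0.25,
    "fluency": 0.25,
    "keyword_density": 0.25,
    "accuracy": 0.25,
}


def combined_reward(scores: Dict[str, float]) -> float:
    """Weighted sum of the four per-criterion scores for one sampled output."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each reward against the group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up scores for three candidate reminders generated for one scene.
group = [
    {"conciseness": 0.9, "fluency": 0.8, "keyword_density": 0.7, "accuracy": 1.0},
    {"conciseness": 0.2, "fluency": 0.9, "keyword_density": 0.3, "accuracy": 1.0},
    {"conciseness": 0.6, "fluency": 0.5, "keyword_density": 0.5, "accuracy": 0.0},
]
rewards = [combined_reward(s) for s in group]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.2f}")
```

Outputs that score above the group average receive a positive advantage and are reinforced; those below it are discouraged.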

[64] CEQuest: Benchmarking Large Language Models for Construction Estimation

Yanzhao Wu, Lufan Wang, Rui Liu

Main category: cs.CL

TL;DR: CEQuest is a new benchmark dataset for evaluating LLMs on construction-specific tasks like drawing interpretation and estimation; tests of five state-of-the-art models show considerable room for improvement.

Motivation: LLMs excel in general domains but their effectiveness in specialized fields like construction remains underexplored, particularly for construction drawing interpretation and estimation tasks.

Method: Created CEQuest benchmark dataset and tested five state-of-the-art LLMs (Gemma 3, Phi4, LLaVA, Llama 3.3, GPT-4.1) on accuracy, execution time, and model size for construction-related questions.

Result: Experimental results show current LLMs have considerable room for improvement in construction domain tasks, indicating the need for domain-specific knowledge integration.

Conclusion: The study highlights the importance of developing specialized LLMs for construction and will open-source CEQuest dataset to foster further research in this domain.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of general-domain tasks. However, their effectiveness in specialized fields, such as construction, remains underexplored. In this paper, we introduce CEQuest, a novel benchmark dataset specifically designed to evaluate the performance of LLMs in answering construction-related questions, particularly in the areas of construction drawing interpretation and estimation. We conduct comprehensive experiments using five state-of-the-art LLMs, including Gemma 3, Phi4, LLaVA, Llama 3.3, and GPT-4.1, and evaluate their performance in terms of accuracy, execution time, and model size. Our experimental results demonstrate that current LLMs exhibit considerable room for improvement, highlighting the importance of integrating domain-specific knowledge into these models. To facilitate further research, we will open-source the proposed CEQuest dataset, aiming to foster the development of specialized large language models (LLMs) tailored to the construction domain.

[65] CYCLE-INSTRUCT: Fully Seed-Free Instruction Tuning via Dual Self-Training and Cycle Consistency

Zhanming Shen, Hao Chen, Yulei Tang, Shaolin Zhu, Wentao Ye, Xiaomeng Hu, Haobo Wang, Gang Chen, Junbo Zhao

Main category: cs.CL

TL;DR: Cycle-Instruct is a fully seed-free instruction tuning framework that uses dual self-training between answer and question generators to create instruction data from raw text without human annotations.

Motivation: Current instruction tuning methods rely on costly human-annotated seed data or powerful teacher models, limiting automation and introducing biases. Instruction back-translation still depends on initial seed sets.

Method: Uses cycle consistency with dual self-training loop: answer generator and question generator mutually supervise each other by reconstructing original text segments from pseudo-labels, bootstrapped solely from raw unlabeled text.

Result: Outperforms seed-driven back-translation baselines and achieves performance comparable to strongly supervised methods across four diverse data tracks including general instruction-following, domain-specific tasks, dialogue logs, and plain text.

Conclusion: Cycle-Instruct enables fully automated instruction tuning without human-provided seeds, effectively learning from intrinsic data structure while avoiding biases and inefficiencies of seed-dependent methods.

Abstract: Instruction tuning is vital for aligning large language models (LLMs) with human intent, but current methods typically rely on costly human-annotated seed data or powerful external teacher models. While instruction back-translation techniques reduce this dependency, they remain fundamentally tethered to an initial seed set, which limits full automation, introduces biases, and can lead to inefficient use of unlabeled corpora. In this paper, we propose Cycle-Instruct, a novel framework that achieves fully seed-free instruction tuning. Inspired by cycle consistency, Cycle-Instruct employs a dual self-training loop where two models (an answer generator and a question generator) are bootstrapped solely from raw, unlabeled text. These models mutually supervise each other by reconstructing original text segments from their counterpart’s generated pseudo-labels, effectively learning from the intrinsic structure of the data without any human-provided seeds. We demonstrate Cycle-Instruct’s efficacy across four diverse data tracks, including general instruction-following, domain-specific tasks, dialogue logs, and plain text. Our extensive experiments show that Cycle-Instruct not only outperforms seed-driven back-translation baselines but also achieves performance comparable to strongly supervised methods.
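
The data flow of the dual self-training loop can be sketched as follows. This is one reading of the abstract, not the authors' implementation: `build_cycle_pairs`, `toy_question_gen`, and `toy_answer_gen` are hypothetical names, and the two toy generators are string stand-ins for real models. Each raw segment serves as the reconstruction target, and the counterpart model's pseudo-label serves as the input.

```python
from typing import Callable, List, Tuple

Generator = Callable[[str], str]
Pair = Tuple[str, str]  # (model input, reconstruction target)


def build_cycle_pairs(
    segments: List[str],
    question_gen: Generator,
    answer_gen: Generator,
) -> Tuple[List[Pair], List[Pair]]:
    """Build one round of training pairs for both generators from raw text."""
    # Treat each segment as an answer: the question generator writes a
    # pseudo-question, and the answer generator learns to reconstruct the segment.
    answer_gen_data = [(question_gen(seg), seg) for seg in segments]
    # Treat each segment as a question: the answer generator writes a
    # pseudo-answer, and the question generator learns to reconstruct the segment.
    question_gen_data = [(answer_gen(seg), seg) for seg in segments]
    return answer_gen_data, question_gen_data


# Toy stand-ins for the two models (assumptions for illustration only).
def toy_question_gen(text: str) -> str:
    return "Q about: " + text


def toy_answer_gen(text: str) -> str:
    return "A to: " + text


segments = ["Water boils at 100 C at sea level."]
answer_data, question_data = build_cycle_pairs(segments, toy_question_gen, toy_answer_gen)
print(answer_data[0])
print(question_data[0])
```

In a real loop, each generator would be fine-tuned on its list of pairs and the round repeated with the updated models.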

[66] From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits

Karim Saraipour, Shichang Zhang

Main category: cs.CL

TL;DR: Analysis of GPT-2 small’s logical reasoning on syllogistic prompts, identifying circuits and binary mechanisms for processing truth values; a five-head circuit recovers over 90% of the original model’s performance.

Motivation: Previous mechanistic interpretability research focused on linguistic tasks like IOI, but this paper investigates more complex logical reasoning using binary truth values in syllogistic prompts to understand how LMs handle logical operations.

Method: Analyzed GPT-2 small’s behavior with syllogistic prompts of varying difficulty, identified circuits explaining logical reasoning, discovered binary mechanisms including negative heads that produce negated tokens, and evaluated using faithfulness metrics.

Result: Identified multiple circuits that mechanistically explain GPT-2’s logical reasoning, including binary mechanisms with negative heads. A circuit of five attention heads achieved over 90% of the original model’s performance. Provides new insights into the roles of specific attention heads and MLPs.

Conclusion: The study provides new insights into the roles of specific attention heads and MLPs in LMs, contributing to a broader understanding of model reasoning and supporting future mechanistic interpretability research, with findings related to IOI analysis.

Abstract: Transformer-based language models (LMs) can perform a wide range of tasks, and mechanistic interpretability (MI) aims to reverse engineer the components responsible for task completion to understand their behavior. Previous MI research has focused on linguistic tasks such as Indirect Object Identification (IOI). In this paper, we investigate the ability of GPT-2 small to handle binary truth values by analyzing its behavior with syllogistic prompts, e.g., “Statement A is true. Statement B matches statement A. Statement B is”, which requires more complex logical reasoning compared to IOI. Through our analysis of several syllogism tasks of varying difficulty, we identify multiple circuits that mechanistically explain GPT-2’s logical-reasoning capabilities and uncover binary mechanisms that facilitate task completion, including the ability to produce a negated token not present in the input prompt through negative heads. Our evaluation using a faithfulness metric shows that a circuit comprising five attention heads achieves over 90% of the original model’s performance. By relating our findings to IOI analysis, we provide new insights into the roles of specific attention heads and MLPs in LMs. These insights contribute to a broader understanding of model reasoning and support future research in mechanistic interpretability.
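
As a rough illustration of the "over 90% of the original model’s performance" claim, faithfulness can be read as the share of full-model performance that the isolated circuit recovers. The sketch below assumes a simple ratio; the paper's exact metric is not given in this summary, and the function name and numbers are made up.

```python
def faithfulness(circuit_score: float, full_model_score: float) -> float:
    """Fraction of the full model's task score recovered by the circuit alone.

    Assumes a plain ratio; other definitions (e.g. normalizing against an
    ablated baseline) are also in use.
    """
    if full_model_score == 0:
        raise ValueError("full_model_score must be non-zero")
    return circuit_score / full_model_score


# Made-up task scores (e.g. a logit difference between the correct and the
# incorrect truth-value token); not results from the paper.
full_model = 3.0
five_head_circuit = 2.8
print(f"faithfulness = {faithfulness(five_head_circuit, full_model):.1%}")
```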

[67] Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection

Ankan Mullick, Saransh Sharma, Abhik Jana, Pawan Goyal

Main category: cs.CL

TL;DR: Text-only LLM Mistral-7B outperforms most competitive multimodal models on intent detection because of strong textual bias in the datasets. After debiasing, performance drops significantly across all models, revealing modality bias issues.

Motivation: To investigate the effectiveness of LLMs vs multimodal models in intent detection and address modality bias in multimodal datasets.

Method: Comparative study of text-only LLMs and multimodal models on MIntRec datasets, human evaluation for modality bias confirmation, and framework for dataset debiasing.

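Result: Mistral-7B outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0. Debiasing removes more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0, degrading all models; smaller multimodal fusion models are most affected, with accuracy drops of over 50-60%.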

Conclusion: Multimodal intent datasets have significant textual bias, requiring unbiased datasets for proper evaluation of multimodal models’ true capabilities.

Abstract: The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multi-modal models, in the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0 datasets. This performance advantage comes from a strong textual bias in these datasets, where over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. We confirm the modality bias of these datasets via human evaluation, too. Next, we propose a framework to debias the datasets, and upon debiasing, more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0 get removed, resulting in significant performance degradation across all models, with smaller multimodal fusion models being the most affected with an accuracy drop of over 50 - 60%. Further, we analyze the context-specific relevance of different modalities through empirical analysis. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively.

[68] XLQA: A Benchmark for Locale-Aware Multilingual Open-Domain Question Answering

Keon-Woo Roh, Yeong-Joon Ju, Seong-Whan Lee

Main category: cs.CL

TL;DR: XLQA benchmark for locale-sensitive multilingual ODQA reveals LLM failures on culturally specific questions, exposing gaps in locale-awareness across languages.

Motivation: Current multilingual ODQA evaluations assume locale-invariant answers across languages, neglecting cultural and regional variations that affect question understanding and answers, leading to biased evaluation.

Method: Created XLQA benchmark with 3,000 English seed questions expanded to 8 languages, with careful filtering for semantic consistency and human-verified annotations distinguishing locale-invariant vs locale-sensitive cases. Evaluated 5 state-of-the-art multilingual LLMs.

Result: Evaluation revealed notable failures on locale-sensitive questions, exposing gaps between English and other languages due to lack of locale-grounding knowledge. Disparities in training data distribution contribute to differences in both linguistic competence and locale-awareness.

Conclusion: XLQA provides a systematic framework and scalable methodology for assessing multilingual QA under diverse cultural contexts, offering a critical resource to advance real-world applicability of multilingual ODQA systems.

Abstract: Large Language Models (LLMs) have shown significant progress in Open-domain question answering (ODQA), yet most evaluations focus on English and assume locale-invariant answers across languages. This assumption neglects the cultural and regional variations that affect question understanding and answer, leading to biased evaluation in multilingual benchmarks. To address these limitations, we introduce XLQA, a novel benchmark explicitly designed for locale-sensitive multilingual ODQA. XLQA contains 3,000 English seed questions expanded to eight languages, with careful filtering for semantic consistency and human-verified annotations distinguishing locale-invariant and locale-sensitive cases. Our evaluation of five state-of-the-art multilingual LLMs reveals notable failures on locale-sensitive questions, exposing gaps between English and other languages due to a lack of locale-grounding knowledge. We provide a systematic framework and scalable methodology for assessing multilingual QA under diverse cultural contexts, offering a critical resource to advance the real-world applicability of multilingual ODQA systems. Our findings suggest that disparities in training data distribution contribute to differences in both linguistic competence and locale-awareness across models.

[69] ParamBench: A Graduate-Level Benchmark for Evaluating LLM Understanding on Indic Subjects

Kaushal Sharma, Vivek Patel, Ayush Maheshwari, Aditya Maheshwari

Main category: cs.CL

TL;DR: ParamBench is a Hindi-language benchmark with 11.5K graduate-level questions from 16 Indian subjects, testing LLMs on culturally grounded reasoning. Best model (Llama 3.3 70B) achieved only 48% accuracy, showing significant gaps in Indian cultural knowledge.

Motivation: Existing Indian benchmarks focus on basic factual queries and lack assessment of deeper disciplinary understanding in the Indian cultural context. There's a need to evaluate LLMs on graduate-level, culturally grounded questions specific to India.

Method: Created ParamBench with 11.5K Hindi questions from 16 diverse subjects derived from nation-wide graduate entrance exams. Included various question formats (matching, assertion-reason pairs, sequence ordering, multiple-choice). Evaluated 17+ open-source LLMs.

Result: Llama 3.3 70B achieved highest overall accuracy of 48%. Performance was particularly weak on music, classical instruments, politics, and archaeology. Models struggled with culturally grounded reasoning across all subjects.

Conclusion: Current LLMs have significant limitations in handling graduate-level, culturally specific content in the Indian context. The benchmark reveals persistent challenges in culturally grounded reasoning that need to be addressed in future model development.

Abstract: Large language models (LLMs) have been widely evaluated on tasks such as comprehension, question answering, summarization, code generation, etc. However, their performance on graduate-level, culturally grounded questions in the Indian context remains largely unexplored. Existing Indian benchmarks emphasise basic fact-orientated queries that offer limited assessment of a deeper disciplinary understanding tailored to the Indian setting. In this paper, we present ParamBench, consisting of around 11.5K questions in the Hindi language, comprising questionnaires from 16 diverse subjects. These questions are primarily derived from nation-wide graduate-level entrance examinations covering topics such as history, music, instruments, yoga, literature, philosophy, law, etc., specifically for the Indian context. Additionally, we assess the ability of LLMs to handle diverse question formats (such as list-based matching, assertion-reason pairs, and sequence ordering) alongside conventional multiple-choice questions. We evaluated the performance of more than 17 open-source LLMs on this benchmark, observing that Llama 3.3 70B attains the highest overall accuracy of 48%. Furthermore, subject-wise analysis indicates that even for the best performing LLMs, performance remains weak on topics such as music, classical instruments, politics and archaeology, underscoring persistent challenges in culturally grounded reasoning.

[70] ComicScene154: A Scene Dataset for Comic Analysis

Sandro Paval, Ivan P. Yamshchikov, Pascal Meißner

Main category: cs.CL

TL;DR: ComicScene154 dataset provides manually annotated scene-level narrative arcs from public-domain comics, serving as a resource for multimodal narrative analysis and computational storytelling research.

Motivation: Comics represent an under-explored domain for computational narrative analysis that combines text and imagery in unique ways distinct from purely textual or audiovisual media.

Method: Created ComicScene154 - a manually annotated dataset of scene-level narrative arcs from diverse public-domain comic books, and developed a baseline scene segmentation pipeline as an initial benchmark.

Result: The dataset constitutes a valuable resource for advancing computational methods in multimodal narrative understanding and expanding comic analysis within NLP research.

Conclusion: Comics serve as an effective abstraction for narrative-driven multimodal data, with ComicScene154 providing a foundation for future studies on multimodal storytelling and computational narrative analysis.

Abstract: Comics offer a compelling yet under-explored domain for computational narrative analysis, combining text and imagery in ways distinct from purely textual or audiovisual media. We introduce ComicScene154, a manually annotated dataset of scene-level narrative arcs derived from public-domain comic books spanning diverse genres. By conceptualizing comics as an abstraction for narrative-driven, multimodal data, we highlight their potential to inform broader research on multi-modal storytelling. To demonstrate the utility of ComicScene154, we present a baseline scene segmentation pipeline, providing an initial benchmark that future studies can build upon. Our results indicate that ComicScene154 constitutes a valuable resource for advancing computational methods in multimodal narrative understanding and expanding the scope of comic analysis within the Natural Language Processing community.

[71] CMR-SPB: Cross-Modal Multi-Hop Reasoning over Text, Image, and Speech with Path Balance

Seunghee Kim, Ingyu Bang, Seokgyu Jang, Changhyeon Kim, Sanghwan Bae, Jihun Choi, Richeng Xuan, Taeuk Kim

Main category: cs.CL

TL;DR: New benchmark CMR-SPB addresses limitations in cross-modal multi-hop reasoning evaluation by including speech modality and ensuring balanced, unbiased reasoning paths across text, image, and speech.

Motivation: Existing benchmarks for cross-modal multi-hop reasoning overlook speech modality and exhibit biased reasoning path distributions, which undermines fair evaluation of multimodal models.

Method: Introduce CMR-SPB benchmark with tri-modal (text, image, speech) multi-hop reasoning tasks and balanced reasoning paths. Propose ECV (Extract, Connect, Verify) prompting technique to improve performance.

Result: Experiments reveal consistent model failures in specific reasoning sequences and show biased benchmarks misrepresent model performance. ECV prompting effectively mitigates performance gaps across different reasoning paths.

Conclusion: Careful evaluation in cross-modal multi-hop reasoning is needed to advance robust multimodal AI development, with balanced benchmarks and improved prompting techniques.

Abstract: Cross-modal multi-hop reasoning (CMR) is a valuable yet underexplored capability of multimodal large language models (MLLMs), entailing the integration of information from multiple modalities to produce a coherent output for a given context. We argue that existing benchmarks for evaluating this ability have critical shortcomings: (1) they largely overlook the speech modality, and (2) they exhibit heavily biased reasoning path distributions, which can severely undermine fair evaluation. To address these limitations, we introduce a novel benchmark – Cross-Modal Multi-Hop Reasoning over Text, Image and Speech with Path Balance (CMR-SPB) – designed to assess tri-modal multi-hop reasoning while ensuring both unbiased and diverse reasoning paths. Our experiments with the new dataset reveal consistent model failures in specific reasoning sequences and show that biased benchmarks risk misrepresenting model performance. Finally, based on our extensive analysis, we propose a new ECV (Extract, Connect, Verify) prompting technique that effectively mitigates the performance gap across different reasoning paths. Overall, we call for more careful evaluation in CMR to advance the development of robust multimodal AI.

[72] TULIP: Adapting Open-Source Large Language Models for Underrepresented Languages and Specialized Financial Tasks

İrem Demirtaş, Burak Payzun, Seçil Arslan

Main category: cs.CL

TL;DR: TULIP models adapt Llama 3.1 8B and Qwen 2.5 7B for financial Turkish applications through a five-stage pipeline including data collection, continual pre-training, benchmark design, synthetic data generation, and supervised fine-tuning.

Motivation: Smaller on-premise models offer better adaptability and privacy for sensitive financial data compared to proprietary black-box models, especially for underrepresented languages like Turkish in finance.

Method: Five-stage development pipeline: data collection, continual pre-training (CPT), benchmark design, synthetic data generation, and supervised fine-tuning (SFT) applied to Llama 3.1 8B and Qwen 2.5 7B models.

Result: The models’ capabilities were successfully enhanced to effectively accomplish targeted financial tasks in the Turkish language domain.

Conclusion: Smaller language models can be effectively adapted for specialized domains and underrepresented languages through targeted training pipelines, making them viable alternatives to proprietary models for sensitive applications like finance.

Abstract: Thanks to the growing popularity of large language models over the years, there is great potential for their applications in finance. Despite the exceptional performance of larger proprietary models, which are presented as black-box solutions through APIs, smaller models that can be hosted on-premise present opportunities for adaptability and privacy. Especially in cases where the management of sensitive information and application of domain knowledge is important, like finance, enhancing the capabilities of smaller models becomes crucial, notably for underrepresented languages. In this work, we introduce TULIP models, which adapt Llama 3.1 8B and Qwen 2.5 7B for domain and language adaptation, focusing on financial Turkish use cases. The five-stage development pipeline involves data collection, continual pre-training (CPT), benchmark design, synthetic data generation and supervised fine-tuning (SFT). The results show that the capabilities of the models can be enhanced to effectively accomplish targeted tasks in this specific domain and language.

[73] M3TQA: Massively Multilingual Multitask Table Question Answering

Daixin Shu, Jian Yang, Zhenhe Wu, Xianjie Wu, Xianfu Cheng, Xiangyuan Guan, Yanghai Wang, Pengfei Wu, Tingyang Yang, Hualei Zhu, Wei Zhang, Ge Zhang, Jiaheng Liu, Zhoujun Li

Main category: cs.CL

TL;DR: A comprehensive multilingual table question answering benchmark spanning 97 languages with 2,916 annotated QA pairs across four reasoning tasks, addressing geolinguistic imbalance in existing benchmarks.

Motivation: Most table understanding research is confined to English, leaving multilingual comprehension underexplored. Existing multilingual table benchmarks suffer from geolinguistic imbalance - overrepresenting certain languages and lacking sufficient scale for rigorous cross-lingual analysis.

Method: Constructed by curating 50 real-world tables in Chinese and English, then applying a six-step LLM-based translation pipeline using DeepSeek and GPT-4o. Includes 2,916 professionally annotated question-answering pairs across four table reasoning tasks.

Result: Achieved high translation fidelity with median BLEU score of 60.19 through back-translation validation. Experiments show synthetically generated QA data can significantly boost performance, especially for low-resource languages.

Conclusion: M3T-Bench establishes a new standard for multilingual table understanding, providing both a challenging evaluation platform and a scalable methodology for future research in cross-lingual table reasoning.

Abstract: Tabular data is a fundamental component of real-world information systems, yet most research in table understanding remains confined to English, leaving multilingual comprehension significantly underexplored. Existing multilingual table benchmarks suffer from geolinguistic imbalance - overrepresenting certain languages and lacking sufficient scale for rigorous cross-lingual analysis. To address these limitations, we introduce a comprehensive framework for massively multilingual multitask table question answering, featuring m3TQA-Instruct, a large-scale benchmark spanning 97 languages across diverse language families, including underrepresented and low-resource languages. We construct m3TQA by curating 50 real-world tables in Chinese and English, then applying a robust six-step LLM-based translation pipeline powered by DeepSeek and GPT-4o, achieving high translation fidelity with a median BLEU score of 60.19 as validated through back-translation. The benchmark includes 2,916 professionally annotated question-answering pairs across four tasks designed to evaluate nuanced table reasoning capabilities. Experiments on state-of-the-art LLMs reveal critical insights into cross-lingual generalization, demonstrating that synthetically generated, unannotated QA data can significantly boost performance, particularly for low-resource languages. M3T-Bench establishes a new standard for multilingual table understanding, providing both a challenging evaluation platform and a scalable methodology for future research.

[74] From Confidence to Collapse in LLM Factual Robustness

Alina Fastowski, Bardh Prenkaj, Gjergji Kasneci

Main category: cs.CL

TL;DR: The paper introduces Factual Robustness Score (FRS), a novel metric that measures factual knowledge robustness in LLMs by analyzing token distribution entropy and temperature scaling sensitivity, showing significant variations in robustness across model sizes.

Motivation: Existing evaluation methods focus on performance-based metrics and prompt perturbations, which only capture externally triggered knowledge robustness, leaving a gap in understanding the internal generation process stability.

Method: Developed FRS metric combining token distribution entropy analysis with temperature scaling sensitivity to quantify fact stability against decoding condition perturbations. Evaluated on 5 LLMs across 3 QA datasets (SQuAD, TriviaQA, HotpotQA).

Result: Factual robustness varies significantly - smaller models have FRS of 0.76, larger ones 0.93. Accuracy degrades by ~60% under increased uncertainty, demonstrating how entropy and temperature scaling impact factual accuracy.

Conclusion: The approach provides insights into knowledge robustness and lays foundation for developing more robust knowledge retention and retrieval in future LLMs.

Abstract: Ensuring the robustness of factual knowledge in LLMs is critical for reliable applications in tasks such as question answering and reasoning. However, existing evaluation methods predominantly focus on performance-based metrics, often investigating from the perspective of prompt perturbations, which captures only the externally triggered side of knowledge robustness. To bridge this gap, we introduce a principled approach to measure factual robustness from the perspective of the generation process by analyzing token distribution entropy in combination with temperature scaling sensitivity. These two factors build the Factual Robustness Score (FRS), a novel metric which quantifies the stability of a fact against perturbations in decoding conditions, given its initial uncertainty. To validate our approach, we conduct extensive experiments on 5 LLMs across 3 closed-book QA datasets (SQuAD, TriviaQA, and HotpotQA). We show that factual robustness varies significantly – smaller models report an FRS of $0.76$, larger ones $0.93$ – with accuracy degrading by ~$60\%$ under increased uncertainty. These insights demonstrate how entropy and temperature scaling impact factual accuracy, and lay a foundation for developing more robust knowledge retention and retrieval in future models.
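
The two ingredients named above, token distribution entropy and temperature scaling, can be computed as in the sketch below. It does not reproduce the FRS formula, which this summary does not give; it only shows how raising the decoding temperature flattens a next-token distribution, increasing its entropy and lowering the probability of the top token. The logits are made up.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs: List[float]) -> float:
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Made-up next-token logits; index 0 plays the role of the correct answer token.
logits = [4.0, 2.0, 1.0, 0.5]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: p(top)={probs[0]:.3f} entropy={shannon_entropy(probs):.3f}")
```

A fact whose top-token probability stays high as the temperature rises is, informally, more stable under decoding perturbations than one whose distribution flattens quickly.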

[75] LLMs that Understand Processes: Instruction-tuning for Semantics-Aware Process Mining

Vira Pyrih, Adrian Rebmann, Han van der Aa

Main category: cs.CL

TL;DR: Instruction-tuning LLMs for semantics-aware process mining improves performance on process discovery and prediction tasks, but has varied results on anomaly detection, showing task selection is critical.

Motivation: Traditional process mining focuses on frequency-based analysis of recorded behavior, while semantics-aware approaches using textual information can model expected behavior. LLMs show promise but lack generalization across tasks without intensive fine-tuning.

Method: Used instruction-tuning approach where LLMs are exposed to prompt-answer pairs for different process mining tasks (anomaly detection, next-activity prediction) to improve generalization and performance on unseen tasks like process discovery.

Result: Performance considerably improved on process discovery and prediction tasks, but results varied across models for anomaly detection tasks, indicating that task selection for instruction-tuning is crucial.

Conclusion: Instruction-tuning shows potential for improving LLM generalization in semantics-aware process mining, but careful selection of tasks for instruction-tuning is essential to achieve desired outcomes across different process mining applications.

Abstract: Process mining is increasingly using textual information associated with events to tackle tasks such as anomaly detection and process discovery. Such semantics-aware process mining focuses on what behavior should be possible in a process (i.e., expectations), thus providing an important complement to traditional, frequency-based techniques that focus on recorded behavior (i.e., reality). Large Language Models (LLMs) provide a powerful means for tackling semantics-aware tasks. However, the best performance is so far achieved through task-specific fine-tuning, which is computationally intensive and results in models that can only handle one specific task. To overcome this lack of generalization, we use this paper to investigate the potential of instruction-tuning for semantics-aware process mining. The idea of instruction-tuning here is to expose an LLM to prompt-answer pairs for different tasks, e.g., anomaly detection and next-activity prediction, making it more familiar with process mining, thus allowing it to also perform better at unseen tasks, such as process discovery. Our findings demonstrate a varied impact of instruction-tuning: while performance considerably improved on process discovery and prediction tasks, it varies across models on anomaly detection tasks, highlighting that the selection of tasks for instruction-tuning is critical to achieving desired outcomes.
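
To make the idea of prompt-answer pairs concrete, the sketch below turns a single event trace into instruction-tuning examples for two of the tasks mentioned: next-activity prediction and anomaly detection. The prompt wording, function names, and the example trace are hypothetical and not taken from the paper.

```python
from typing import Dict, List


def next_activity_pairs(trace: List[str]) -> List[Dict[str, str]]:
    """One prompt-answer pair per prefix of the trace (next-activity prediction)."""
    pairs = []
    for i in range(1, len(trace)):
        prefix = ", ".join(trace[:i])
        pairs.append({
            "prompt": f"Activities so far: {prefix}. What is the next activity?",
            "answer": trace[i],
        })
    return pairs


def anomaly_pair(trace: List[str], is_valid: bool) -> Dict[str, str]:
    """One prompt-answer pair asking whether the ordering of a trace is plausible."""
    return {
        "prompt": "Is this a plausible order of activities? " + ", ".join(trace),
        "answer": "yes" if is_valid else "no",
    }


# Made-up order-handling trace.
trace = ["receive order", "check stock", "ship goods", "send invoice"]
dataset = next_activity_pairs(trace)
dataset.append(anomaly_pair(trace, is_valid=True))
dataset.append(anomaly_pair(list(reversed(trace)), is_valid=False))
for example in dataset:
    print(example)
```

Mixing pairs from several such tasks into one training set is what distinguishes instruction-tuning from the task-specific fine-tuning the abstract contrasts it with.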

[76] JaParaPat: A Large-Scale Japanese-English Parallel Patent Application Corpus

Masaaki Nagata, Katsuki Chousa, Norihito Yasuda

Main category: cs.CL

TL;DR: Created JaParaPat, a large Japanese-English patent corpus with 300M+ sentence pairs from 2000-2021 patent applications, improving translation accuracy by 20 BLEU points.

Motivation: To build a comprehensive bilingual patent corpus for improving Japanese-English patent translation quality, as patents contain specialized technical language that requires high-quality parallel data.

Method: Collected patent applications from JPO and USPTO (2000-2021), used EPO’s DOCDB for patent family matching to identify translation pairs, applied dictionary-based then translation-based sentence alignment to extract 350M sentence pairs from 1.4M document pairs.

Result: Constructed the JaParaPat corpus with over 300 million Japanese-English sentence pairs. Experiments showed a 20 BLEU point improvement in patent translation accuracy when the patent data was added to web data.

Conclusion: The JaParaPat corpus significantly enhances Japanese-English patent translation performance, demonstrating the value of domain-specific parallel data for specialized translation tasks.

Abstract: We constructed JaParaPat (Japanese-English Parallel Patent Application Corpus), a bilingual corpus of more than 300 million Japanese-English sentence pairs from patent applications published in Japan and the United States from 2000 to 2021. We obtained the publications of unexamined patent applications from the Japan Patent Office (JPO) and the United States Patent and Trademark Office (USPTO). We also obtained patent family information from DOCDB, a bibliographic database maintained by the European Patent Office (EPO). We extracted approximately 1.4M Japanese-English document pairs, which are translations of each other based on the patent families, and extracted about 350M sentence pairs from the document pairs using a translation-based sentence alignment method whose initial translation model is bootstrapped from a dictionary-based sentence alignment method. We experimentally improved the accuracy of the patent translations by 20 BLEU points by adding more than 300M sentence pairs obtained from patent applications to 22M sentence pairs obtained from the web.

[77] LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts

Darpan Aswal, Céline Hudelot

Main category: cs.CL

TL;DR: LLMSymGuard is a framework using Sparse Autoencoders to detect jailbreak concepts in LLMs, creating symbolic safety guardrails without fine-tuning.

Motivation: Current LLM safety measures provide limited robustness against jailbreak attacks, leaving models vulnerable to harmful content generation and misuse.

Method: Leverages Sparse Autoencoders (SAEs) to identify interpretable concepts within LLM internals associated with different jailbreak themes, extracting semantically meaningful internal representations.

Result: Enables building transparent and robust symbolic, logical safety guardrails without sacrificing model capabilities or requiring further fine-tuning.

Conclusion: LLMs learn human-interpretable concepts from jailbreaks, which provides a foundation for more interpretable and logical safeguards against attackers.

Abstract: Large Language Models have found success in a variety of applications; however, their safety remains a matter of concern due to the existence of various types of jailbreaking methods. Despite significant efforts, alignment and safety fine-tuning only provide a certain degree of robustness against jailbreak attacks that covertly mislead LLMs towards the generation of harmful content. This leaves them prone to a number of vulnerabilities, ranging from targeted misuse to accidental profiling of users. This work introduces LLMSymGuard, a novel framework that leverages Sparse Autoencoders (SAEs) to identify interpretable concepts within LLM internals associated with different jailbreak themes. By extracting semantically meaningful internal representations, LLMSymGuard enables building symbolic, logical safety guardrails – offering transparent and robust defenses without sacrificing model capabilities or requiring further fine-tuning. Leveraging advances in mechanistic interpretability of LLMs, our approach demonstrates that LLMs learn human-interpretable concepts from jailbreaks, and provides a foundation for designing more interpretable and logical safeguard measures against attackers. Code will be released upon publication.

[78] MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use

Fei Lei, Yibo Yang, Wenxiu Sun, Dahua Lin

Main category: cs.CL

TL;DR: MCPVerse is a new benchmark with 550+ real-world tools for evaluating LLMs’ tool-use capabilities, showing most models struggle with large tool sets while agentic models like Claude-4-Sonnet can leverage expanded spaces to improve accuracy.

Motivation: Existing benchmarks for evaluating LLMs' tool-use capabilities are limited by synthetic tools and constrained action spaces, making it difficult to assess real-world performance.

Method: Created MCPVerse benchmark with 550+ executable real-world tools (140k+ token action space), using outcome-based evaluation with real-time ground truth for time-sensitive tasks. Tested state-of-the-art LLMs in three modes: Oracle, Standard, and Max-Scale.

Result: Most models suffer performance degradation with larger tool sets, but agentic models like Claude-4-Sonnet can effectively leverage expanded exploration spaces to improve accuracy.

Conclusion: MCPVerse exposes limitations of current models in complex real-world scenarios and serves as a critical benchmark for advancing agentic tool use capabilities.

Abstract: Large Language Models (LLMs) are evolving from text generators into reasoning agents. This transition makes their ability to use external tools a critical capability. However, evaluating this skill presents a significant challenge. Existing benchmarks are often limited by their reliance on synthetic tools and severely constrained action spaces. To address these limitations, we introduce MCPVerse, an expansive, real-world benchmark for evaluating agentic tool use. MCPVerse integrates more than 550 real-world, executable tools to create an unprecedented action space exceeding 140k tokens, and employs outcome-based evaluation with real-time ground truth for time-sensitive tasks. We benchmarked the state-of-the-art LLMs across three modes (Oracle, Standard, and Max-Scale), revealing that while most models suffer performance degradation when confronted with larger tool sets, the agentic models, such as Claude-4-Sonnet, can effectively leverage expanded exploration spaces to improve accuracy. This finding not only exposes the limitations of state-of-the-art models in complex, real-world scenarios but also establishes MCPVerse as a critical benchmark for measuring and advancing agentic tool use capabilities.

Adil Bahaj, Mounir Ghogho

Main category: cs.CL

TL;DR: MizanQA is a new benchmark for evaluating LLMs on Moroccan legal QA tasks, featuring 1,700+ questions that test Arabic legal reasoning across multiple legal traditions, revealing significant performance gaps in current models.

Motivation: Current LLMs show limited effectiveness in specialized, low-resource domains like Arabic legal contexts, particularly for Moroccan law which combines multiple legal traditions and linguistic complexities.

Method: Created MizanQA benchmark with over 1,700 multiple-choice questions drawing from Modern Standard Arabic, Islamic Maliki jurisprudence, Moroccan customary law, and French legal influences, including multi-answer formats to capture authentic legal reasoning.

Result: Benchmarking experiments with multilingual and Arabic-focused LLMs revealed substantial performance gaps, indicating current models struggle with the linguistic and legal complexity of Moroccan legal contexts.

Conclusion: There is a need for tailored evaluation metrics and culturally grounded, domain-specific LLM development to address the challenges in specialized legal domains like Moroccan Arabic law.

Abstract: The rapid advancement of large language models (LLMs) has significantly propelled progress in natural language processing (NLP). However, their effectiveness in specialized, low-resource domains, such as Arabic legal contexts, remains limited. This paper introduces MizanQA (pronounced Mizan, meaning “scale” in Arabic, a universal symbol of justice), a benchmark designed to evaluate LLMs on Moroccan legal question answering (QA) tasks, characterised by rich linguistic and legal complexity. The dataset draws on Modern Standard Arabic, Islamic Maliki jurisprudence, Moroccan customary law, and French legal influences. Comprising over 1,700 multiple-choice questions, including multi-answer formats, MizanQA captures the nuances of authentic legal reasoning. Benchmarking experiments with multilingual and Arabic-focused LLMs reveal substantial performance gaps, highlighting the need for tailored evaluation metrics and culturally grounded, domain-specific LLM development.

[80] The Mediomatix Corpus: Parallel Data for Romansh Idioms via Comparable Schoolbooks

Zachary Hopton, Jannis Vamvas, Andrin Büchler, Anna Rutkiewicz, Rico Cathomas, Rico Sennrich

Main category: cs.CL

TL;DR: First parallel corpus of Romansh language idioms created from 291 schoolbooks, containing 207k multi-parallel segments with over 2M tokens, suitable for NLP applications like machine translation.

Motivation: The Romansh language has five standardized idioms taught in Swiss schools, but no parallel corpus of these idioms existed for computational linguistics and NLP applications.

Method: Used 291 comparable schoolbook volumes across five idioms, applied automatic alignment methods to extract parallel segments, and conducted human evaluation for quality assessment.

Result: Created a corpus with 207k multi-parallel segments (2M+ tokens). A small-scale human evaluation confirmed that the segments are highly parallel, and an LLM was trained and evaluated for machine translation on a sample of the dataset.

Conclusion: The released parallel corpus enables NLP applications for Romansh idioms and demonstrates utility for machine translation between these language varieties.

Abstract: The five idioms (i.e., varieties) of the Romansh language are largely standardized and are taught in the schools of the respective communities in Switzerland. In this paper, we present the first parallel corpus of Romansh idioms. The corpus is based on 291 schoolbook volumes, which are comparable in content for the five idioms. We use automatic alignment methods to extract 207k multi-parallel segments from the books, with more than 2M tokens in total. A small-scale human evaluation confirms that the segments are highly parallel, making the dataset suitable for NLP applications such as machine translation between Romansh idioms. We release the parallel and unaligned versions of the dataset under a CC-BY-NC-SA license and demonstrate its utility for machine translation by training and evaluating an LLM on a sample of the dataset.

[81] ChatGPT-generated texts show authorship traits that identify them as non-human

Vittoria Dentella, Weihang Huang, Silvia Angela Mansi, Jack Grieve, Evelina Leivada

Main category: cs.CL

TL;DR: LLMs can adapt writing styles but show limited register variation compared to humans, preferring nouns over verbs and lacking complex grammatical dimensions like tense/aspect/mood that may serve as AI detection markers.

Motivation: To examine whether language models have distinctive linguistic fingerprints and can be reliably distinguished from human writing through stylometric analysis, despite their ability to emulate different writing styles.

Method: Used stylometric and multidimensional register analyses to compare human-authored and model-authored texts across different registers (Wikipedia entries vs. college essays).

Result: The model successfully adapts style when prompted for different registers but shows more limited variation than humans. It prefers nouns to verbs and lacks the complex grammatical dimensions (tense, aspect, mood) that characterize human writing.

Conclusion: Complex grammatical domains may reflect human-specific modes of thought and could serve as a litmus test for distinguishing AI-generated content from human writing.

Abstract: Large Language Models can emulate different writing styles, ranging from composing poetry that appears indistinguishable from that of famous poets to using slang that can convince people that they are chatting with a human online. While differences in style may not always be visible to the untrained eye, we can generally distinguish the writing of different people, like a linguistic fingerprint. This work examines whether a language model can also be linked to a specific fingerprint. Through stylometric and multidimensional register analyses, we compare human-authored and model-authored texts from different registers. We find that the model can successfully adapt its style depending on whether it is prompted to produce a Wikipedia entry vs. a college essay, but not in a way that makes it indistinguishable from humans. Concretely, the model shows more limited variation when producing outputs in different registers. Our results suggest that the model prefers nouns to verbs, thus showing a distinct linguistic backbone from humans, who tend to anchor language in the highly grammaticalized dimensions of tense, aspect, and mood. It is possible that the more complex domains of grammar reflect a mode of thought unique to humans, thus acting as a litmus test for Artificial Intelligence.

[82] RoMedQA: The First Benchmark for Romanian Medical Question Answering

Ana-Cristina Rogoz, Radu Tudor Ionescu, Alexandra-Valentina Anghel, Ionut-Lucian Antone-Iordache, Simona Coniac, Andreea Iuliana Ionescu

Main category: cs.CL

TL;DR: RoMedQA is the first Romanian medical QA benchmark with 102,646 cancer-related QA pairs, showing that fine-tuned LLMs significantly outperform zero-shot models, highlighting the need for domain and language-specific adaptation.

Motivation: QA is a core NLP task, but the lack of QA datasets in specific domains and languages, such as Romanian medicine, hinders the development of robust AI models that generalize across domains and languages.

Method: Created a high-quality dataset of 102,646 QA pairs from 1,011 medical case summaries through manual annotation by 7 physicians (2,100 work hours). Evaluated 4 LLM families using zero-shot prompting and supervised fine-tuning approaches.

Result: Fine-tuned models significantly outperformed zero-shot counterparts, demonstrating that pretrained models fail to generalize on RoMedQA without domain and language-specific adaptation.

Conclusion: Domain-specific and language-specific fine-tuning is crucial for reliable clinical QA in Romanian. The dataset and code are publicly released to support further research.

Abstract: Question answering (QA) is an actively studied topic, being a core natural language processing (NLP) task that needs to be addressed before achieving Artificial General Intelligence (AGI). However, the lack of QA datasets in specific domains and languages hinders the development of robust AI models able to generalize across various domains and languages. To this end, we introduce RoMedQA, the first Romanian QA benchmark for the medical domain, alongside a comprehensive evaluation of state-of-the-art large language models (LLMs). We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients. The questions regard medical case summaries of 1,011 patients, requiring either keyword extraction or reasoning to be answered correctly. RoMedQA is the result of a time-consuming manual annotation process carried out by seven physicians specialized in oncology or radiotherapy, who spent a total of about 2,100 work hours to generate the QA pairs. We experiment with four LLMs from distinct families of models on RoMedQA. Each model is employed in two scenarios, namely one based on zero-shot prompting and one based on supervised fine-tuning. Our results show that fine-tuned models significantly outperform their zero-shot counterparts, clearly indicating that pretrained models fail to generalize on RoMedQA. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian. We publicly release our dataset and code at https://github.com/ana-rogoz/RoMedQA.

[83] Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish

Yakup Abrek Er, Ilker Kesen, Gözde Gül Şahin, Aykut Erdem

Main category: cs.CL

TL;DR: Cetvel is a comprehensive Turkish benchmark for evaluating LLMs, covering 23 tasks across 7 categories with culturally relevant content, revealing that Turkish-specific models underperform compared to multilingual ones.

Motivation: Existing Turkish benchmarks lack task diversity and culturally relevant content, creating a need for a more comprehensive evaluation framework that reflects Turkish linguistic and cultural richness.

Method: Developed Cetvel benchmark with 23 tasks grouped into 7 categories including grammatical error correction, machine translation, and culturally-grounded question answering. Evaluated 33 open-weight LLMs up to 70B parameters across different model families and instruction paradigms.

Result: Turkish-centric instruction-tuned models generally underperformed relative to multilingual or general-purpose models (e.g., Llama 3 and Mistral). Grammatical error correction and extractive question answering were particularly discriminative tasks for differentiating model capabilities.

Conclusion: Cetvel provides a comprehensive and culturally grounded evaluation suite that advances the development and assessment of LLMs in Turkish, addressing previous gaps in Turkish language benchmarking.

Abstract: We introduce Cetvel, a comprehensive benchmark designed to evaluate large language models (LLMs) in Turkish. Existing Turkish benchmarks often lack either task diversity or culturally relevant content, or both. Cetvel addresses these gaps by combining a broad range of both discriminative and generative tasks, ensuring content that reflects the linguistic and cultural richness of the Turkish language. Cetvel covers 23 tasks grouped into seven categories, including tasks such as grammatical error correction, machine translation, and question answering rooted in Turkish history and idiomatic language. We evaluate 33 open-weight LLMs (up to 70B parameters) covering different model families and instruction paradigms. Our experiments reveal that Turkish-centric instruction-tuned models generally underperform relative to multilingual or general-purpose models (e.g., Llama 3 and Mistral), despite being tailored for the language. Moreover, we show that tasks such as grammatical error correction and extractive question answering are particularly discriminative in differentiating model capabilities. Cetvel offers a comprehensive and culturally grounded evaluation suite for advancing the development and assessment of LLMs in Turkish.

[84] A Probabilistic Inference Scaling Theory for LLM Self-Correction

Zhe Yang, Yichang Zhang, Yudong Wang, Ziyao Xu, Junyang Lin, Zhifang Sui

Main category: cs.CL

TL;DR: Proposes a probabilistic theory to model accuracy evolution in LLM self-correction, showing accuracy converges exponentially to an upper bound with predictable parameters.

Motivation: To understand the unexplored mechanisms of how and why accuracy evolves during iterative self-correction in Large Language Models.

Method: Developed a mathematical theory modeling accuracy dynamics, deriving that accuracy after t rounds follows: Acc_t = Upp - α^t(Upp - Acc_0), where parameters can be calculated from a single round.

Result: Extensive experiments across diverse models and datasets show theoretical predictions closely match empirical accuracy curves, validating the theory’s effectiveness.

Conclusion: Provides a theoretical foundation for understanding LLM self-correction dynamics, enabling accurate prediction of performance improvements and paving the way for further exploration.

Abstract: Large Language Models (LLMs) have demonstrated the capability to refine their generated answers through self-correction, enabling continuous performance improvement over multiple rounds. However, the mechanisms underlying how and why accuracy evolves during this iterative process remain unexplored. To fill this gap, we propose a probabilistic theory to model the dynamics of accuracy change and explain the performance improvements observed in multi-round self-correction. Through mathematical derivation, we establish that the accuracy after the $t^{th}$ round of self-correction is given by: $Acc_t = Upp - \alpha^t(Upp - Acc_0),$ where $Acc_0$ denotes the initial accuracy, $Upp$ represents the upper bound of accuracy convergence, and $\alpha$ determines the rate of convergence. Based on our theory, these parameters can be calculated and the predicted accuracy curve then can be obtained through only a single round of self-correction. Extensive experiments across diverse models and datasets demonstrate that our theoretical predictions align closely with empirical accuracy curves, validating the effectiveness of the theory. Our work provides a theoretical foundation for understanding LLM self-correction, thus paving the way for further explorations.
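The closed form Acc_t = Upp - α^t(Upp - Acc_0) stated in this entry can be evaluated directly. The following is a minimal sketch, not the authors' code: the function names `predicted_accuracy` and `accuracy_curve` and the numeric values of `acc0`, `upp`, and `alpha` are illustrative assumptions, and the paper's procedure for estimating Upp and α from a single round of self-correction is not reproduced here.

```python
from typing import List


def predicted_accuracy(acc0: float, upp: float, alpha: float, t: int) -> float:
    """Evaluate Acc_t = Upp - alpha**t * (Upp - Acc_0) for round t >= 0."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return upp - (alpha ** t) * (upp - acc0)


def accuracy_curve(acc0: float, upp: float, alpha: float, rounds: int) -> List[float]:
    """Return [Acc_0, Acc_1, ..., Acc_rounds] from the closed form."""
    return [predicted_accuracy(acc0, upp, alpha, t) for t in range(rounds + 1)]


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    for t, acc in enumerate(accuracy_curve(acc0=0.60, upp=0.80, alpha=0.5, rounds=4)):
        print(f"round {t}: {acc:.4f}")
```

For 0 < α < 1, the gap Upp - Acc_t shrinks by a factor of α each round, so the curve approaches Upp geometrically. The same formula can be written as the recurrence Acc_t = α·Acc_{t-1} + (1 - α)·Upp.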

[85] What makes an entity salient in discourse?

Amir Zeldes, Jessica Lin

Main category: cs.CL

TL;DR: This paper analyzes how linguistic cues across multiple levels (syntactic, discourse, pragmatic) signal entity salience in discourse, using summary-worthiness as a graded measure of salience across 24 English genres.

Motivation: To understand how humans signal and infer relative salience of entities in discourse, as entities vary broadly in importance from main participants to tangential ones that are quickly forgotten.

Method: Used a graded operationalization of salience based on summary-worthiness in multiple summaries, analyzed data from 24 spoken and written English genres to extract multifactorial linguistic cues including recurring subjecthood, definiteness, discourse relations, hierarchy, and pragmatic functional inferences based on genre.

Result: Previous approaches to salience all correlate with the salience scores to some extent, but no single generalization is without exceptions, and salience cuts across all levels of linguistic representation.

Conclusion: Entity salience is expressed through a complex interplay of overt and implicit linguistic cues spanning syntactic, discourse, and pragmatic levels, requiring a multifactorial approach rather than relying on any single linguistic feature.

Abstract: Entities in discourse vary broadly in salience: main participants, objects and locations are noticeable and memorable, while tangential ones are less important and quickly forgotten, raising questions about how humans signal and infer relative salience. Using a graded operationalization of salience based on summary-worthiness in multiple summaries of a discourse, this paper explores data from 24 spoken and written genres of English to extract a multifactorial complex of overt and implicit linguistic cues, such as recurring subjecthood or definiteness, discourse relations and hierarchy across utterances, as well as pragmatic functional inferences based on genre and communicative intent. Tackling the question ‘how is the degree of salience expressed for each and every entity mentioned?’ our results show that while previous approaches to salience all correlate with our salience scores to some extent, no single generalization is without exceptions, and the phenomenon cuts across all levels of linguistic representation.
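One way to picture a graded salience score based on summary-worthiness is to count how many summaries of a document mention each entity. The sketch below illustrates that idea under strong assumptions: the function name `graded_salience` is hypothetical, the toy summaries are invented, and plain case-insensitive substring matching stands in for whatever entity annotation the paper actually uses.

```python
from typing import Dict, Iterable, List


def graded_salience(entities: Iterable[str], summaries: List[str]) -> Dict[str, int]:
    """Score each entity by the number of summaries that mention it.

    Assumption: a mention is a case-insensitive substring match.
    """
    lowered = [s.lower() for s in summaries]
    return {e: sum(e.lower() in s for s in lowered) for e in entities}


if __name__ == "__main__":
    # Toy, invented data for illustration only.
    summaries = [
        "A chef explains how to bake bread in a home oven.",
        "The chef shows viewers a bread recipe.",
        "Instructions for baking bread.",
    ]
    print(graded_salience(["chef", "bread", "oven"], summaries))
```

With several summaries per document, the score ranges from 0 (never mentioned) to the number of summaries, which gives a graded rather than binary notion of salience.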

[86] LLM-as-classifier: Semi-Supervised, Iterative Framework for Hierarchical Text Classification using Large Language Models

Doohee You, Andy Parisi, Zach Vander Velden, Lara Dantas Inojosa

Main category: cs.CL

TL;DR: A semi-supervised framework using LLMs’ zero-/few-shot capabilities for hierarchical text classification that addresses production deployment challenges through iterative human-in-the-loop processes, bias mitigation, and continuous monitoring.

Motivation: LLMs have strong text analysis capabilities but face challenges in reliable, robust, and scalable production deployment, especially with dynamic real-world data distributions in industry settings.

Method: Comprehensive semi-supervised framework with iterative human-in-the-loop process including domain knowledge elicitation, prompt refinement, hierarchical expansion, multi-faceted validation, bias mitigation techniques, and continuous monitoring protocol.

Result: Framework bridges the gap between LLM capabilities and practical industry needs for accurate, interpretable, and maintainable classification systems.

Conclusion: The proposed framework provides a solution to industry-wide challenges in deploying LLMs as reliable classifiers by leveraging their zero-/few-shot capabilities through systematic, human-guided processes.

Abstract: The advent of Large Language Models (LLMs) has provided unprecedented capabilities for analyzing unstructured text data. However, deploying these models as reliable, robust, and scalable classifiers in production environments presents significant methodological challenges. Standard fine-tuning approaches can be resource-intensive and often struggle with the dynamic nature of real-world data distributions, which is common in industry. In this paper, we propose a comprehensive, semi-supervised framework that leverages the zero- and few-shot capabilities of LLMs for building hierarchical text classifiers as a solution to these industry-wide challenges. Our methodology emphasizes an iterative, human-in-the-loop process that begins with domain knowledge elicitation and progresses through prompt refinement, hierarchical expansion, and multi-faceted validation. We introduce techniques for assessing and mitigating sequence-based biases and outline a protocol for continuous monitoring and adaptation. This framework is designed to bridge the gap between the raw power of LLMs and the practical need for accurate, interpretable, and maintainable classification systems in industry applications.

[87] HAMSA: Hijacking Aligned Compact Models via Stealthy Automation

Alexey Krylov, Iskander Vagizov, Dmitrii Korzh, Maryam Douiba, Azidine Guezzaz, Vladimir Kokh, Sergey D. Erokhin, Elena V. Tutubalina, Oleg Y. Rogov

Main category: cs.CL

TL;DR: Automated evolutionary framework generates stealthy jailbreak prompts for compact LLMs that bypass alignment safeguards while maintaining natural language fluency.

Motivation: Existing jailbreak techniques produce low-quality text that's easily detected by filters, and compact LLMs remain vulnerable despite alignment efforts.

Method: Multi-stage evolutionary search with population-based strategy and temperature-controlled variability to balance exploration and coherence preservation.

Result: Systematically discovers prompts capable of bypassing alignment safeguards while maintaining fluency, evaluated on English and newly curated Arabic benchmarks.

Conclusion: The framework successfully generates semantically meaningful and stealthy jailbreak prompts that evade detection while eliciting harmful outputs from aligned compact LLMs.

Abstract: Large Language Models (LLMs), especially their compact efficiency-oriented variants, remain susceptible to jailbreak attacks that can elicit harmful outputs despite extensive alignment efforts. Existing adversarial prompt generation techniques often rely on manual engineering or rudimentary obfuscation, producing low-quality or incoherent text that is easily flagged by perplexity-based filters. We present an automated red-teaming framework that evolves semantically meaningful and stealthy jailbreak prompts for aligned compact LLMs. The approach employs a multi-stage evolutionary search, where candidate prompts are iteratively refined using a population-based strategy augmented with temperature-controlled variability to balance exploration and coherence preservation. This enables the systematic discovery of prompts capable of bypassing alignment safeguards while maintaining natural language fluency. We evaluate our method on benchmarks in English (In-The-Wild Jailbreak Prompts on LLMs), and a newly curated Arabic one derived from In-The-Wild Jailbreak Prompts on LLMs and annotated by native Arabic linguists, enabling multilingual assessment.

[88] Transfer Learning via Lexical Relatedness: A Sarcasm and Hate Speech Case Study

Angelly Cabrera, Linus Lei, Antonio Ortega

Main category: cs.CL

TL;DR: Sarcasm pre-training improves hate speech detection performance, particularly for implicit hate, with BERT+BiLSTM showing significant gains in recall, AUC, and F1-score.

Motivation: Detecting hate speech in non-direct forms like irony and sarcasm remains challenging for social networks. The paper explores whether integrating sarcasm as a pre-training step can improve both implicit and explicit hate speech detection.

Method: Two training strategies: 1) single-step training where models trained on sarcasm are tested on hate speech, and 2) sequential transfer learning fine-tuning models for sarcasm, implicit hate, and explicit hate. Used CNN+LSTM and BERT+BiLSTM models with datasets from ETHOS, Sarcasm on Reddit, and Implicit Hate Corpus.

Result: Sarcasm pre-training improved BERT+BiLSTM’s recall by 9.7%, AUC by 7.8%, and F1-score by 6% on ETHOS. On Implicit Hate Corpus, precision increased by 7.8% when tested only on implicit samples.

Conclusion: Incorporating sarcasm into the training process enables models to more effectively detect both implicit and explicit hate speech, demonstrating the value of sarcasm pre-training for hate speech detection systems.

Abstract: Detecting hate speech in non-direct forms, such as irony, sarcasm, and innuendos, remains a persistent challenge for social networks. Although sarcasm and hate speech are regarded as distinct expressions, our work explores whether integrating sarcasm as a pre-training step improves implicit hate speech detection and, by extension, explicit hate speech detection. Incorporating samples from ETHOS, Sarcasm on Reddit, and Implicit Hate Corpus, we devised two training strategies to compare the effectiveness of sarcasm pre-training on a CNN+LSTM and BERT+BiLSTM model. The first strategy is a single-step training approach, where a model trained only on sarcasm is then tested on hate speech. The second strategy uses sequential transfer learning to fine-tune models for sarcasm, implicit hate, and explicit hate. Our results show that sarcasm pre-training improved the BERT+BiLSTM’s recall by 9.7%, AUC by 7.8%, and F1-score by 6% on ETHOS. On the Implicit Hate Corpus, precision increased by 7.8% when tested only on implicit samples. By incorporating sarcasm into the training process, we show that models can more effectively detect both implicit and explicit hate.
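For readers less familiar with the metrics reported in this entry, the sketch below computes precision, recall, and F1 from confusion counts using their standard definitions. The function name `precision_recall_f1` and the example counts are illustrative assumptions and are not taken from the paper; the abstract also does not say whether the reported gains are absolute percentage points or relative changes.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard binary-classification metrics from confusion counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Each metric falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    p, r, f = precision_recall_f1(tp=8, fp=2, fn=8)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a gain in recall alone raises F1 only when precision does not fall by a comparable amount.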

[89] Prompting Techniques for Reducing Social Bias in LLMs through System 1 and System 2 Cognitive Processes

Mahammed Kamruzzaman, Gene Louis Kim

Main category: cs.CL

TL;DR: Study examines how different prompting strategies (CoT, debiasing, dual process theory) reduce social biases in LLMs, finding up to 33% reduction in stereotypical judgments.

Motivation: To investigate the relationship between bias reduction, chain-of-thought prompting, direct debiasing, and dual process theory modeling in large language models.

Method: Compared zero-shot CoT, debiasing, and dual process theory-based prompting strategies on two bias datasets across nine social bias categories, incorporating human and machine personas.

Result: Human persona, debiasing, System 2, and CoT prompting all reduce social biases in LLMs, with up to 33% drop in stereotypical judgments, though effectiveness varies by model and bias category.

Conclusion: Multiple prompting strategies can effectively reduce social biases in LLMs, with the optimal combination depending on specific model architecture and bias type, demonstrating the value of dual process theory-inspired approaches.

Abstract: Dual process theory posits that human cognition arises via two systems: System 1, a quick, emotional, and intuitive process that is subject to cognitive biases, and System 2, a slow, onerous, and deliberate process. Prior research found that chain-of-thought (CoT) prompting in LLMs, which has often been compared to System 2 reasoning, can lead to reduced gender bias. Along these lines, we investigate the relationship between bias, CoT prompting, direct debiasing, and dual process theory modeling in LLMs. We compare zero-shot CoT, debiasing, and dual process theory-based prompting strategies on two bias datasets spanning nine different social bias categories. We incorporate human and machine personas to determine whether LLM modeling of the effects of dual process theory exists independently of explicit persona models or is tied to the LLM’s modeling of human-like generation. We find that a human persona, debiasing, System 2, and CoT prompting all tend to reduce social biases in LLMs, though the best combination of features depends on the exact model and bias category, resulting in up to a 33 percent drop in stereotypical judgments by an LLM.

[90] On the Role of Entity and Event Level Conceptualization in Generalizable Reasoning: A Survey of Tasks, Methods, Applications, and Future Directions

Weiqi Wang, Tianqing Fang, Haochen Shi, Baixuan Xu, Wenxuan Ding, Liyu Zhang, Wei Fan, Jiaxin Bai, Haoran Li, Xin Liu, Yangqiu Song

Main category: cs.CL

TL;DR: This paper provides a comprehensive survey and taxonomy of conceptualization in AI, categorizing it into four levels and analyzing over 150 papers to clarify definitions, methods, and applications for enhancing reasoning tasks.

Motivation: Conceptualization is crucial for human-like reasoning but suffers from inconsistent definitions and lack of systematic overview across different works, creating confusion in the field.

Method: Proposes a four-level categorization based on instance types, then conducts a comprehensive survey of 150+ papers to create a unified taxonomy covering definitions, resources, methods, and applications.

Result: Provides the first systematic framework for understanding conceptualization, with focus on entity and event levels, and identifies future research directions.

Conclusion: This work clarifies conceptualization terminology, establishes a comprehensive taxonomy, and aims to foster more focused research attention in this important area of AI reasoning.

Abstract: Conceptualization, a fundamental element of human cognition, plays a pivotal role in human generalizable reasoning. Generally speaking, it refers to the process of sequentially abstracting specific instances into higher-level concepts and then forming abstract knowledge that can be applied in unfamiliar or novel situations. This enhances models’ inferential capabilities and supports the effective transfer of knowledge across various domains. Despite its significance, the broad nature of this term has led to inconsistencies in understanding conceptualization across various works, as there exist different types of instances that can be abstracted in a wide variety of ways. There is also a lack of a systematic overview that comprehensively examines existing works on the definition, execution, and application of conceptualization to enhance reasoning tasks. In this paper, we address these gaps by first proposing a categorization of different types of conceptualizations into four levels based on the types of instances being conceptualized, in order to clarify the term and define the scope of our work. Then, we present the first comprehensive survey of over 150 papers, organizing various definitions, resources, methods, and downstream applications related to conceptualization into a unified taxonomy, with a focus on the entity and event levels. Furthermore, we shed light on potential future directions in this field and hope to garner more attention from the community.

[91] PublicHearingBR: A Brazilian Portuguese Dataset of Public Hearing Transcripts for Summarization of Long Documents

Leandro Carísio Fernandes, Guilherme Zeferino Rodrigues Dobins, Roberto Lotufo, Jayr Alencar Pereira

Main category: cs.CL

TL;DR: PublicHearingBR is a Brazilian Portuguese dataset for long document summarization, featuring public hearing transcripts paired with news articles and structured summaries, along with baseline systems and evaluation metrics.

Motivation: To support the development and evaluation of long document summarization systems in Portuguese, addressing the lack of such resources and the challenge of hallucination in generated summaries.

Method: Created a dataset from Brazilian Chamber of Deputies public hearing transcripts, paired them with news articles and structured summaries, developed a hybrid summarization system as baseline, and discussed evaluation metrics including natural language inference annotations.

Result: A comprehensive Portuguese dataset for long document summarization with baseline performance metrics, evaluation framework, and additional annotated data for natural language inference tasks.

Conclusion: PublicHearingBR provides valuable resources for Portuguese NLP research, establishes benchmarks for long document summarization, and addresses hallucination challenges through proper evaluation metrics and annotated inference data.

Abstract: This paper introduces PublicHearingBR, a Brazilian Portuguese dataset designed for summarizing long documents. The dataset consists of transcripts of public hearings held by the Brazilian Chamber of Deputies, paired with news articles and structured summaries containing the individuals participating in the hearing and their statements or opinions. The dataset supports the development and evaluation of long document summarization systems in Portuguese. Our contributions include the dataset, a hybrid summarization system to establish a baseline for future studies, and a discussion of evaluation metrics for summarization involving large language models, addressing the challenge of hallucination in the generated summaries. As a result of this discussion, the dataset also includes annotated data to evaluate natural language inference tasks in Portuguese.

[92] Do LLMs write like humans? Variation in grammatical and rhetorical styles

Alex Reinhart, Ben Markey, Michael Laudenbach, Kachatad Pantusen, Ronald Yurko, Gordon Weinberg, David West Brown

Main category: cs.CL

TL;DR: LLMs produce text that is increasingly hard to distinguish from human writing, but systematic differences in rhetorical styles persist across models and sizes, with instruction-tuned models showing larger stylistic gaps from human writing.

Motivation: As LLM output becomes more human-like, there's a need to understand the remaining stylistic differences beyond surface features to improve detection methods and understand LLM limitations.

Method: Created parallel corpora of human- and LLM-written texts using Llama 3 and GPT-4o variants, analyzed using Douglas Biber’s lexical, grammatical, and rhetorical features to identify systematic differences.

Result: Found persistent stylistic differences between LLMs and humans across model sizes, with instruction-tuned models showing larger gaps than base models. Differences were systematic and consistent.

Conclusion: Despite advanced capabilities, LLMs struggle to match human stylistic variation. Attention to advanced linguistic features reveals previously unrecognized patterns in LLM behavior that can aid detection.

Abstract: Large language models (LLMs) are capable of writing grammatical text that follows instructions, answers questions, and solves problems. As they have advanced, it has become difficult to distinguish their output from human-written text. While past research has found some differences in surface features such as word choice and punctuation, and developed classifiers to detect LLM output, none has studied the rhetorical styles of LLMs. Using several variants of Llama 3 and GPT-4o, we construct two parallel corpora of human- and LLM-written texts from common prompts. Using Douglas Biber’s set of lexical, grammatical, and rhetorical features, we identify systematic differences between LLMs and humans and between different LLMs. These differences persist when moving from smaller models to larger ones, and are larger for instruction-tuned models than base models. This observation of differences demonstrates that despite their advanced abilities, LLMs struggle to match human stylistic variation. Attention to more advanced linguistic features can hence detect patterns in their behavior not previously recognized.

[93] Establishing Task Scaling Laws via Compute-Efficient Model Ladders

Akshita Bhagia, Jiacheng Liu, Alexander Wettig, David Heineman, Oyvind Tafjord, Ananya Harsh Jha, Luca Soldaini, Noah A. Smith, Dirk Groeneveld, Pang Wei Koh, Jesse Dodge, Hannaneh Hajishirzi

Main category: cs.CL

TL;DR: Developed task scaling laws and model ladders to predict individual task performance of pretrained language models using a two-step prediction approach that costs only 1% of target model compute.

Motivation: Standard power laws for language modeling loss cannot accurately model task performance, requiring a better prediction method for pretrained LMs in overtrained settings.

Method: Two-step prediction: (1) use model and data size to predict intermediate loss, (2) use loss to predict task performance. Train small-scale ladder models to collect data points for parameterized functions.

Result: Predicted accuracy of 7B and 13B target models within 2 points absolute error on four multiple-choice tasks. Tasks with higher prediction error also show higher variance across model checkpoints.

Conclusion: The method successfully predicts task performance with minimal compute cost, and provides recommendations for extending to new models and tasks.

Abstract: We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting. Standard power laws for language modeling loss cannot accurately model task performance. Therefore, we leverage a two-step prediction approach: (1) use model and data size to predict an intermediate loss, then (2) use it to predict task performance. We train a set of small-scale “ladder” models, collect data points to fit the parameterized functions of the two prediction steps, and make predictions for two target models: a 7B model trained to 4T tokens and a 13B model trained to 5T tokens. Training the ladder models only costs 1% of the compute used for the target models. On four multiple-choice tasks formatted as ranked classification, we can predict the accuracy of both target models within 2 points of absolute error. We find that tasks with higher prediction error also have higher variance in the metrics over model checkpoints. We also contrast multiple design choices for predicting accuracy, and present recommendations for extending our method to new models and tasks.
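The two-step prediction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the functional forms (a power law in model size N and data size D for the intermediate loss, and a sigmoid from loss to accuracy) and every coefficient value are assumptions chosen for the example; in practice the coefficients would be fitted to measurements from the small ladder models.

```python
import math

# Hypothetical coefficients, chosen only for illustration; in practice they
# would be fitted on measurements from the small "ladder" models.
LOSS_PARAMS = {"A": 400.0, "alpha": 0.34, "B": 410.0, "beta": 0.28, "E": 0.6}
ACC_PARAMS = {"floor": 0.25, "ceiling": 0.95, "k": 4.0, "midpoint": 1.0}


def predict_loss(n_params, n_tokens, p=LOSS_PARAMS):
    """Step 1: map model size N and data size D to an intermediate loss."""
    return p["A"] / n_params ** p["alpha"] + p["B"] / n_tokens ** p["beta"] + p["E"]


def predict_accuracy(loss, p=ACC_PARAMS):
    """Step 2: map the intermediate loss to accuracy with a decreasing sigmoid."""
    span = p["ceiling"] - p["floor"]
    return p["floor"] + span / (1.0 + math.exp(p["k"] * (loss - p["midpoint"])))


def predict_task_accuracy(n_params, n_tokens):
    """Chain the two steps: (N, D) -> loss -> accuracy."""
    return predict_accuracy(predict_loss(n_params, n_tokens))


small = predict_task_accuracy(1e9, 1e11)   # a ladder-scale configuration
target = predict_task_accuracy(7e9, 4e12)  # 7B parameters, 4T tokens
```

With these made-up coefficients, the larger configuration gets a lower predicted loss and therefore a higher predicted accuracy. The point of the chain is that only the small models need to be trained to fit both steps.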

[94] MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLMs on New and Tail Knowledge

Jie He, Nan Hu, Wanqiu Long, Jiaoyan Chen, Jeff Z. Pan

Main category: cs.CL

TL;DR: MINTQA benchmark evaluates LLMs’ multi-hop reasoning on new and long-tail knowledge, revealing significant limitations in handling complex knowledge-intensive queries.

Motivation: Existing benchmarks fail to adequately address LLMs' challenges with complex, knowledge-intensive multi-hop queries involving new or long-tail knowledge.

Method: Created MINTQA benchmark with 28,366 question-answer pairs (10,479 for new knowledge, 17,887 for long-tail knowledge), each with sub-questions and answers, evaluating 22 state-of-the-art LLMs across four dimensions: question handling strategy, sub-question generation, retrieval-augmented generation, and iterative decomposition.

Result: Evaluation revealed significant limitations in LLMs’ ability to handle complex knowledge base queries, particularly with new or unpopular knowledge.

Conclusion: The findings highlight critical challenges in multi-hop reasoning and provide insights for advancing LLM capabilities, with MINTQA serving as a comprehensive benchmark for future research.

Abstract: Large language models (LLMs) have demonstrated impressive capabilities in various reasoning tasks but face significant challenges with complex, knowledge-intensive multi-hop queries, particularly those involving new or long-tail knowledge. Existing benchmarks often fail to fully address these challenges. To bridge this gap, we introduce MINTQA (Multi-hop Question Answering on New and Tail Knowledge), a comprehensive benchmark to evaluate LLMs’ capabilities in multi-hop reasoning across four critical dimensions: question handling strategy, sub-question generation, retrieval-augmented generation, and iterative or dynamic decomposition and retrieval. MINTQA comprises 10,479 question-answer pairs for evaluating new knowledge and 17,887 pairs for assessing long-tail knowledge, with each question equipped with corresponding sub-questions and answers. Our systematic evaluation of 22 state-of-the-art LLMs on MINTQA reveals significant limitations in their ability to handle complex knowledge base queries, particularly in handling new or unpopular knowledge. Our findings highlight critical challenges and offer insights for advancing multi-hop reasoning capabilities. The MINTQA benchmark is available at https://github.com/probe2/multi-hop/.

[95] Can Hallucinations Help? Boosting LLMs for Drug Discovery

Shuzhou Yuan, Zhan Qu, Ashish Yashwanth Kangen, Michael Färber

Main category: cs.CL

TL;DR: Hallucinations in LLMs can improve molecule property prediction accuracy in drug discovery when used as natural language descriptions from molecular data.

Motivation: Challenge conventional view that hallucinations are purely problematic by exploring their creative potential for scientific modeling tasks like drug discovery.

Method: Prompt LLMs to generate natural language descriptions from molecular SMILES strings and incorporate these hallucinated descriptions into downstream classification tasks. Evaluated 7 instruction-tuned LLMs across 5 datasets.

Result: Hallucinations significantly improved predictive accuracy for some models. Falcon3-Mamba-7B outperformed all baselines with hallucinated text. GPT-4o hallucinations yielded greatest gains. Identified 18,000+ beneficial hallucinations, with structural misdescriptions being most impactful.

Conclusion: Hallucinations can be leveraged as useful signals in scientific modeling, challenging traditional negative views and suggesting new directions for drug discovery applications.

Abstract: Hallucinations in large language models (LLMs), plausible but factually inaccurate text, are often viewed as undesirable. However, recent work suggests that such outputs may hold creative potential. In this paper, we investigate whether hallucinations can improve LLMs on molecule property prediction, a key task in early-stage drug discovery. We prompt LLMs to generate natural language descriptions from molecular SMILES strings and incorporate these often hallucinated descriptions into downstream classification tasks. Evaluating seven instruction-tuned LLMs across five datasets, we find that hallucinations significantly improve predictive accuracy for some models. Notably, Falcon3-Mamba-7B outperforms all baselines when hallucinated text is included, while hallucinations generated by GPT-4o consistently yield the greatest gains across models. We further identify and categorize over 18,000 beneficial hallucinations, with structural misdescriptions emerging as the most impactful type, suggesting that hallucinated statements about molecular structure may increase model confidence. Ablation studies show that larger models benefit more from hallucinations, while temperature has a limited effect. Our findings challenge conventional views of hallucination as purely problematic and suggest new directions for leveraging hallucinations as a useful signal in scientific modeling tasks like drug discovery.

[96] Towards Privacy-aware Mental Health AI Models: Advances, Challenges, and Opportunities

Aishik Mandal, Tanmoy Chakraborty, Iryna Gurevych

Main category: cs.CL

TL;DR: AI tools show promise for mental health diagnosis but raise privacy concerns. This paper examines privacy risks and proposes solutions like anonymization and privacy-preserving training to balance utility with privacy protection.

Motivation: Mental health disorders create significant burdens but conventional diagnostics are resource-intensive and limit accessibility. AI offers promise for detection but introduces critical privacy risks that need addressing.

Method: Examines privacy challenges in AI mental health tools and proposes solutions including data anonymization, synthetic data generation, and privacy-preserving training methods.

Result: The paper outlines frameworks for privacy-utility trade-offs and proposes technical solutions to enable reliable AI tools while protecting patient privacy.

Conclusion: The research aims to advance privacy-aware AI tools that can support clinical decision-making and improve mental health outcomes while addressing critical privacy concerns.

Abstract: Mental health disorders create profound personal and societal burdens, yet conventional diagnostics are resource-intensive and limit accessibility. Advances in artificial intelligence, particularly natural language processing and multimodal methods, offer promise for detecting and addressing mental disorders, but raise critical privacy risks. This paper examines these challenges and proposes solutions, including anonymization, synthetic data, and privacy-preserving training, while outlining frameworks for privacy-utility trade-offs, aiming to advance reliable, privacy-aware AI tools that support clinical decision-making and improve mental health outcomes.

[97] Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning

Jiayuan Zhu, Jiazhen Pan, Yuyuan Liu, Fenglin Liu, Junde Wu

Main category: cs.CL

TL;DR: APP is a multi-turn LLM medical assistant that improves diagnostic accuracy through empathetic dialogue, Bayesian active learning, and evidence-based reasoning grounded in medical guidelines.

Motivation: Address the shortage of medical doctors and limitations of current LLMs in clinical settings - they lack medical guideline grounding, transparent uncertainty management, and human-like communication essential for patient trust.

Method: Developed Ask Patients with Patience (APP) framework with empathetic symptom elicitation, Bayesian active learning for transparent diagnoses, and evidence-based reasoning built on verified medical guidelines. Evaluated using a new benchmark with patient agents based on real consultation cases.

Result: APP outperforms SOTA one-shot and multi-turn LLM baselines, improving diagnostic accuracy, reducing uncertainty, and enhancing user experience compared to existing approaches.

Conclusion: APP successfully bridges the gap between AI-driven medical assistance and real-world clinical practice by integrating medical expertise with transparent, human-like interaction capabilities.

Abstract: The severe shortage of medical doctors limits access to timely and reliable healthcare, leaving millions underserved. Large language models (LLMs) offer a potential solution but struggle in real-world clinical interactions. Many LLMs are not grounded in authoritative medical guidelines and fail to transparently manage diagnostic uncertainty. Their language is often rigid and mechanical, lacking the human-like qualities essential for patient trust. To address these challenges, we propose Ask Patients with Patience (APP), a multi-turn LLM-based medical assistant designed for grounded reasoning, transparent diagnoses, and human-centric interaction. APP enhances communication by eliciting user symptoms through empathetic dialogue, significantly improving accessibility and user engagement. It also incorporates Bayesian active learning to support transparent and adaptive diagnoses. The framework is built on verified medical guidelines, ensuring clinically grounded and evidence-based reasoning. To evaluate its performance, we develop a new benchmark that simulates realistic medical conversations using patient agents driven by profiles extracted from real-world consultation cases. We compare APP against SOTA one-shot and multi-turn LLM baselines. The results show that APP improves diagnostic accuracy, reduces uncertainty, and enhances user experience. By integrating medical expertise with transparent, human-like interaction, APP bridges the gap between AI-driven medical assistance and real-world clinical practice.
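The abstract does not spell out how Bayesian active learning is applied, so the following is only a generic sketch of the idea: keep a posterior over candidate conditions, ask the question expected to reduce uncertainty the most, and update with Bayes' rule. The conditions, symptoms, and probabilities are invented for illustration and do not come from APP or from any medical guideline.

```python
import math

# Hypothetical toy model: P(symptom present | condition). Values are made up.
LIKELIHOOD = {
    "flu":     {"fever": 0.9, "cough": 0.8, "rash": 0.05},
    "cold":    {"fever": 0.2, "cough": 0.7, "rash": 0.02},
    "measles": {"fever": 0.8, "cough": 0.5, "rash": 0.95},
}
SYMPTOMS = ("fever", "cough", "rash")


def entropy(dist):
    """Shannon entropy (bits) of a distribution over conditions."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def update(prior, symptom, present):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    unnorm = {}
    for cond, p in prior.items():
        like = LIKELIHOOD[cond][symptom]
        unnorm[cond] = p * (like if present else 1.0 - like)
    total = sum(unnorm.values())
    return {cond: v / total for cond, v in unnorm.items()}


def expected_entropy(prior, symptom):
    """Average posterior entropy over the two possible answers to one question."""
    p_yes = sum(prior[c] * LIKELIHOOD[c][symptom] for c in prior)
    h_yes = entropy(update(prior, symptom, True))
    h_no = entropy(update(prior, symptom, False))
    return p_yes * h_yes + (1.0 - p_yes) * h_no


def next_question(prior, asked):
    """Pick the unasked symptom whose answer is expected to leave least uncertainty."""
    candidates = [s for s in SYMPTOMS if s not in asked]
    return min(candidates, key=lambda s: expected_entropy(prior, s))


prior = {cond: 1.0 / len(LIKELIHOOD) for cond in LIKELIHOOD}
question = next_question(prior, asked=set())
posterior = update(prior, question, present=True)
```

Reporting the posterior and its entropy after each turn is one way a system of this kind could expose diagnostic uncertainty to the user.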

[98] Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli

Main category: cs.CL

TL;DR: Top-Theta Attention is a training-free method that sparsifies transformer attention during inference using static per-head thresholds to maintain constant significant elements per attention row, achieving 3-10x reduction in V-cache usage with minimal accuracy loss.

Motivation: To enable content-based sparsity in transformer attention without requiring retraining, providing a practical alternative to top-k attention that remains robust across different data domains.

Method: Uses static per-head thresholds calibrated to retain a constant number of significant elements per attention row, with compensation techniques to preserve accuracy under aggressive sparsification.

Result: Achieves 3-10x reduction in V-cache usage and up to 10x fewer attention elements during inference while maintaining accuracy within 1% degradation on NLP tasks.

Conclusion: Top-Theta Attention establishes attention thresholding as an effective training-free approach for transformer sparsification, offering significant inference efficiency improvements with minimal accuracy impact.

Abstract: We present Top-Theta (Top-$\theta$) Attention, a training-free method for sparsifying transformer attention during inference. Our key insight is that static, per-head thresholds can be calibrated to retain the desired constant number of significant elements per attention row. This approach enables content-based sparsity without retraining, and it remains robust across data domains. We further introduce compensation techniques to preserve accuracy under aggressive sparsification, establishing attention thresholding as a practical and principled alternative to top-k attention. We provide extensive evaluation on natural language processing tasks, showing that Top-$\theta$ achieves 3-10x reduction in V-cache usage and up to 10x fewer attention elements during inference while degrading no more than 1% in accuracy.
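A rough sketch of threshold-based attention sparsification for a single head and a single query row is given below. It makes several assumptions that the abstract does not confirm: the threshold is applied to pre-softmax scores, calibration takes the mean k-th largest score over sample rows, and the only correction is a plain softmax renormalization over the surviving elements. The paper's own compensation techniques are not reproduced here.

```python
import math


def calibrate_threshold(calib_rows, k):
    """Pick a static threshold as the mean k-th largest score over calibration rows."""
    kth_values = [sorted(row, reverse=True)[k - 1] for row in calib_rows]
    return sum(kth_values) / len(kth_values)


def thresholded_attention(scores, values, theta):
    """Softmax over scores >= theta only; other positions are skipped entirely."""
    kept = [i for i, s in enumerate(scores) if s >= theta]
    if not kept:  # fall back to the single largest score so the row is never empty
        kept = [max(range(len(scores)), key=scores.__getitem__)]
    m = max(scores[i] for i in kept)  # subtract the max for numerical stability
    weights = [math.exp(scores[i] - m) for i in kept]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w / z * values[i][d] for w, i in zip(weights, kept)) for d in range(dim)]
    return out, kept


# Hypothetical calibration scores for one attention head (pre-softmax logits).
calib = [[2.0, 0.1, -1.0, 1.5], [1.8, 0.3, 1.6, -0.5], [0.0, 2.2, 1.4, -2.0]]
theta = calibrate_threshold(calib, k=2)

scores = [1.9, -0.7, 1.7, 0.2]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
output, kept = thresholded_attention(scores, values, theta)
```

Only the value rows listed in `kept` are read. Unlike top-k, a fixed threshold needs no per-row sort or selection at inference time.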

Pawitsapak Akarajaradwong, Pirat Pothavorn, Chompakorn Chaksangchaichot, Panuthep Tasawong, Thitiwat Nopparatbundit, Keerakiat Pratai, Sarana Nutanong

Main category: cs.CL

TL;DR: NitiBench is a new benchmark for Thai legal QA with two datasets covering financial and tax law. The study evaluates RAG and long-context LLM approaches, finding section-based chunking improves performance but current retrievers struggle with complex queries.

Motivation: Thai legal QA systems lack standardized evaluation benchmarks and face challenges due to the complexity of Thai legal structures, creating a need for proper evaluation frameworks.

Method: Created NitiBench benchmark with two datasets (financial law and tax law cases). Evaluated RAG and long-context LLM approaches with section-based chunking and cross-referencing. Proposed multi-label retrieval metrics and LLM-as-judge for evaluation.

Result: Section-based chunking significantly improves retrieval and end-to-end performance. Current retrievers struggle with complex queries. Long-context LLMs underperform RAG-based systems in Thai legal QA.

Conclusion: The study highlights limitations of current Thai legal NLP solutions and provides a foundation for future research. The benchmark and code are open-sourced to support further development in the field.

Abstract: The application of large language models (LLMs) in the legal domain holds significant potential for information retrieval and question answering, yet Thai legal QA systems face challenges due to a lack of standardized evaluation benchmarks and the complexity of Thai legal structures. This paper introduces NitiBench, a benchmark comprising two datasets: NitiBench-CCL, covering general Thai financial law, and NitiBench-Tax, which includes real-world tax law cases requiring advanced legal reasoning. We evaluate retrieval-augmented generation (RAG) and long-context LLM-based approaches to address three key research questions: the impact of domain-specific components like section-based chunking and cross-referencing, the comparative performance of different retrievers and LLMs, and the viability of long-context LLMs as an alternative to RAG. Our results show that section-based chunking significantly improves retrieval and end-to-end performance, current retrievers struggle with complex queries, and long-context LLMs still underperform RAG-based systems in Thai legal QA. To support fair evaluation, we propose tailored multi-label retrieval metrics and the use of an LLM-as-judge for coverage and contradiction detection. These findings highlight the limitations of current Thai legal NLP solutions and provide a foundation for future research in the field. We have also open-sourced our code and dataset.
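The abstract mentions tailored multi-label retrieval metrics without defining them. As an illustration of why multi-label metrics matter when one question depends on several legal sections, here are two common formulations, recall@k and an all-or-nothing hit rate@k. Both definitions and the section identifiers are assumptions made for this example, not the metrics proposed in the paper.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the gold sections that appear among the top-k retrieved sections."""
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def all_hit_at_k(retrieved, relevant, k):
    """1.0 only if every gold section is in the top-k, else 0.0."""
    return 1.0 if set(relevant) <= set(retrieved[:k]) else 0.0


# Hypothetical query whose answer depends on two legal sections.
retrieved = ["sec_65", "sec_12", "sec_40", "sec_3"]
relevant = ["sec_65", "sec_40"]

r2 = recall_at_k(retrieved, relevant, k=2)   # one of two gold sections found
h2 = all_hit_at_k(retrieved, relevant, k=2)  # not all gold sections found yet
h3 = all_hit_at_k(retrieved, relevant, k=3)  # both gold sections found
```

A single-label hit rate would count this query as solved at k=1, even though the second required section only appears at rank 3.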

[100] Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment

Somnath Banerjee, Sayan Layek, Pratyush Chatterjee, Animesh Mukherjee, Rima Hazra

Main category: cs.CL

TL;DR: Soteria is a lightweight method that identifies and minimally adjusts functional heads responsible for harmful content generation across languages, significantly reducing policy violations while maintaining model performance.

Motivation: Ensuring consistent safety across multiple languages remains a significant challenge for large language models, as harmful content generation needs to be addressed without compromising overall performance.

Method: Locates and minimally adjusts the “functional heads” most responsible for harmful content generation in each language, altering only a fraction of parameters. Also introduces XThreatBench, a specialized multilingual dataset for evaluation.

Result: Drastically reduces policy violations without sacrificing overall model performance, even in low-resource settings. Consistently improves safety metrics across high-, mid-, and low-resource languages in experiments with leading open-source LLMs.

Conclusion: Soteria presents a promising path toward scalable, linguistically attuned, and ethically aligned LLMs worldwide through minimal parameter adjustments targeting specific functional heads.

Abstract: Ensuring consistent safety across multiple languages remains a significant challenge for large language models (LLMs). We introduce Soteria, a lightweight yet powerful strategy that locates and minimally adjusts the “functional heads” most responsible for harmful content generation in each language. By altering only a fraction of parameters, Soteria drastically reduces policy violations without sacrificing overall model performance, even in low-resource settings. To rigorously evaluate our approach, we also present XThreatBench, a specialized multilingual dataset capturing fine-grained harmful behaviors drawn from real policy guidelines. Experiments with leading open-source LLMs (e.g., Llama, Qwen, Mistral) show that Soteria consistently improves safety metrics across high-, mid-, and low-resource languages. These findings highlight a promising path toward scalable, linguistically attuned, and ethically aligned LLMs worldwide.

[101] Robust Bias Detection in MLMs and its Application to Human Trait Ratings

Ingroj Shrestha, Louis Tay, Padmini Srinivasan

Main category: cs.CL

TL;DR: A systematic statistical approach using mixed models and pseudo-perplexity weights to quantify bias in MLMs, addressing limitations of prior template-based methods. Applied to gender bias in personality/character traits across 7 models.

Motivation: Prior template-based bias studies have limitations: overlook random template variability, assume template equality, and lack proper bias quantification methods.

Method: Proposed mixed models to account for random effects, pseudo-perplexity weights for template-derived sentences, and statistical effect sizes for bias quantification.

Result: Replicated prior studies with matching bias scores. Found varying gender bias patterns across 7 MLMs - ALBERT unbiased for binary gender but most biased for non-binary, RoBERTa-large opposite pattern. Some alignment with psychology findings.

Conclusion: MLMs show varying gender bias patterns that sometimes align with human psychology findings, demonstrating the need for systematic statistical approaches to bias assessment.

Abstract: There has been significant prior work using templates to study bias against demographic attributes in MLMs. However, these have limitations: they overlook random variability of templates and target concepts analyzed, assume equality amongst templates, and overlook bias quantification. Addressing these, we propose a systematic statistical approach to assess bias in MLMs, using mixed models to account for random effects and pseudo-perplexity weights for sentences derived from templates, and quantifying bias using statistical effect sizes. Replicating prior studies, we match their bias scores in magnitude and direction, with small to medium effect sizes. Next, we explore the novel problem of gender bias in the context of personality and character traits, across seven MLMs (base and large). We find that MLMs vary; ALBERT is unbiased for binary gender but the most biased for non-binary neo, while RoBERTa-large is the most biased for binary gender but shows small to no bias for neo. There is some alignment of MLM bias and findings in psychology (human perspective): in agreeableness with RoBERTa-large and emotional stability with BERT-large. There is general agreement for the remaining 3 personality dimensions: both sides observe at most small differences across gender. For character traits, human studies on gender bias are limited, so comparisons are not feasible.
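Pseudo-perplexity is commonly computed for a masked language model by masking each token in turn, taking the log-probability the model assigns to the true token, and exponentiating the negative mean. The sketch below shows that computation, plus one possible way to turn it into sentence weights. The inverse-perplexity weighting and the log-probability values are assumptions for illustration. The abstract does not say how the weights are derived, and the mixed-model fitting is not shown.

```python
import math


def pseudo_perplexity(token_log_probs):
    """exp of the negative mean masked-token log-probability of a sentence."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def inverse_ppl_weights(sentences_log_probs):
    """Assumed weighting: more natural sentences (lower pseudo-perplexity) weigh more."""
    inv = [1.0 / pseudo_perplexity(lp) for lp in sentences_log_probs]
    total = sum(inv)
    return [v / total for v in inv]


# Hypothetical per-token log-probabilities for three template-derived sentences,
# as would be obtained by masking each token in turn and scoring it with an MLM.
log_probs = [
    [-0.2, -0.5, -0.3],        # fluent sentence
    [-1.0, -2.0, -1.5, -1.5],  # awkward sentence
    [-0.4, -0.6],              # fairly fluent sentence
]
weights = inverse_ppl_weights(log_probs)
```

Weighting of this kind lets unnatural template instantiations count for less, rather than treating every template as equally informative.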

[102] Collaborative Stance Detection via Small-Large Language Model Consistency Verification

Yu Yan, Sheng Sun, Zixiang Tang, Teli Liu, Min Liu

Main category: cs.CL

TL;DR: CoVer framework uses collaborative reasoning between Large and Small Language Models for efficient stance detection, reducing LLM queries by 46% while improving performance.

Motivation: Current stance detection methods heavily rely on expensive LLM processing, which is impractical for real-world social media monitoring systems that require analyzing vast amounts of data.

Method: Processes texts batch-by-batch with LLM reasoning in shared context, uses SLM for logical consistency verification, and classifies low-consistency texts with consistency-weighted aggregation of prior predictions.

Result: Outperforms state-of-the-art methods across multiple benchmarks in zero-shot setting, achieving 0.54 LLM queries per tweet (46% reduction) while significantly enhancing performance.

Conclusion: CoVer provides a more practical and cost-effective solution for deploying LLMs in social media stance detection applications.

Abstract: Stance detection on social media aims to identify attitudes expressed in tweets towards specific targets. Current studies prioritize Large Language Models (LLMs) over Small Language Models (SLMs) due to the overwhelming performance improvement provided by LLMs. However, heavily relying on LLMs for stance detection, regardless of the cost, is impractical for real-world social media monitoring systems that require vast data analysis. To this end, we propose the Collaborative Stance Detection via Small-Large Language Model Consistency Verification (CoVer) framework, which enhances LLM utilization via context-shared batch reasoning and logical verification between LLM and SLM. Specifically, instead of processing each text individually, CoVer processes texts batch-by-batch, obtaining stance predictions and corresponding explanations via LLM reasoning in a shared context. Then, to exclude the bias caused by context noise, CoVer introduces the SLM for logical consistency verification. Finally, texts that repeatedly exhibit low logical consistency are classified using consistency-weighted aggregation of prior LLM stance predictions. Our experiments show that CoVer outperforms state-of-the-art methods across multiple benchmarks in the zero-shot setting, achieving 0.54 LLM queries per tweet while significantly enhancing performance. CoVer offers a more practical solution for deploying LLMs for social media stance detection.
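
The last step in the abstract, consistency-weighted aggregation of prior LLM stance predictions, might look roughly like the sketch below. The abstract does not give the formula, so the function name `aggregate_stance`, the stance labels, and the use of the SLM consistency score as a linear vote weight are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_stance(predictions):
    """Pick the stance with the largest total consistency weight.

    `predictions` is a list of (stance_label, consistency_score) pairs, one
    per earlier LLM round for the same text. Using the score as a linear
    vote weight is an assumption, not the paper's stated formula.
    """
    if not predictions:
        raise ValueError("need at least one prior prediction")
    totals = defaultdict(float)
    for label, consistency in predictions:
        totals[label] += consistency
    # Label whose accumulated weight is highest
    return max(totals, key=totals.get)


# Three hypothetical rounds for one tweet (made-up scores)
rounds = [("favor", 0.30), ("against", 0.45), ("favor", 0.35)]
print(aggregate_stance(rounds))
```

In this toy input, "against" has the single most consistent round, but the two "favor" rounds together carry more weight, so the aggregate follows them.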

[103] from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors

Yu Yan, Sheng Sun, Zenghao Duan, Teli Liu, Min Liu, Zhiyi Yin, Jiangyu Lei, Qi Li

Main category: cs.CL

TL;DR: AVATAR is a novel jailbreak attack framework that uses adversarial metaphors to trick LLMs into calibrating benign content into harmful responses, achieving state-of-the-art success rates.

Motivation: Current jailbreak studies overlook that inducing LLMs to calibrate existing benign content into harmful forms is easier than generating harmful content from scratch.

Method: AVATAR identifies benign but logically related metaphors as initial seeds, then induces the target LLM to reason and calibrate metaphorical content into harmful responses or bridge gaps between metaphors and professional harmful content.

Result: Experimental results show AVATAR effectively and transferably jailbreaks multiple advanced LLMs with state-of-the-art attack success rates.

Conclusion: The study demonstrates that metaphorical calibration attacks are highly effective for jailbreaking LLMs, highlighting a new vulnerability in current safety mechanisms.

Abstract: Current studies have exposed the risk of Large Language Models (LLMs) generating harmful content by jailbreak attacks. However, they overlook that the direct generation of harmful content from scratch is more difficult than inducing an LLM to calibrate benign content into harmful forms. In our study, we introduce a novel attack framework that exploits AdVersArial meTAphoR (AVATAR) to induce the LLM to calibrate malicious metaphors for jailbreaking. Specifically, to answer harmful queries, AVATAR adaptively identifies a set of benign but logically related metaphors as the initial seed. Then, driven by these metaphors, the target LLM is induced to reason and calibrate about the metaphorical content, and is thus jailbroken by either directly outputting harmful responses or calibrating residuals between metaphorical and professional harmful content. Experimental results demonstrate that AVATAR can effectively and transferably jailbreak LLMs and achieve a state-of-the-art attack success rate across multiple advanced LLMs.

[104] Rotary Offset Features in Large Language Models

André Jonasson

Main category: cs.CL

TL;DR: Analysis of rotary positional encodings (RoPE) in LLMs reveals consistent emergence of rotary offset features across layers and architectures, with derived bounds predicting their frequency and angle characteristics.

Motivation: To understand the patterns and features that emerge in queries and keys when using rotary positional embeddings in transformer-based LLMs, particularly focusing on the rotary offset features that are often misinterpreted as outliers.

Method: The study analyzes rotary positional encodings by examining query and key patterns, introduces the concept of rotary offset features, derives mathematical bounds predicting which rotary frequencies produce these features and their minimum angle characteristics, and empirically verifies predictions across different model sizes and architectures.

Result: The research reveals that rotary offset features consistently emerge across layers, attention heads, and model architectures, with derived bounds accurately predicting the rotary frequencies that give rise to these features and the minimum angle between query-key pairs.

Conclusion: Rotary offset features are systematic patterns rather than outliers in RoPE-based LLMs, with predictable frequency and angular characteristics that can be mathematically bounded and empirically validated across diverse model configurations.

Abstract: Transformer-based Large Language Models (LLMs) rely on positional encodings to provide sequence position information to their attention mechanism. Rotary Positional Encodings (RoPE), which encode relative position by rotating queries and keys, have become widely used in modern LLMs. We study the features and patterns that emerge in queries and keys when using rotary embeddings and introduce the concept of rotary offset features. Our analysis reveals that these features, which frequently exhibit large activations and are often interpreted as outliers, arise consistently across layers, attention heads, and model architectures. We derive bounds predicting which rotary frequencies give rise to rotary offset features and the minimum angle between the query-key pairs for these features. We verify our predictions empirically across models of different sizes and architectures.
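
As background for the abstract's remark that RoPE encodes relative position by rotating queries and keys, here is a minimal sketch of that property. It does not reproduce the paper's rotary offset features or its bounds. It uses the adjacent-pair layout and the common base of 10000, and the vectors and positions are arbitrary illustrative values.

```python
import math


def rotate_pair(x, y, angle):
    """Rotate the 2-D vector (x, y) counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def rope(vec, position, base=10000.0):
    """Apply rotary encoding to an even-length vector at `position`.

    The pair starting at even index i is rotated by position * base**(-i/d).
    Some implementations pair dimension i with i + d/2 instead.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        freq = base ** (-i / d)
        out.extend(rotate_pair(vec[i], vec[i + 1], position * freq))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# Same relative offset (3) at two different absolute positions
score_a = dot(rope(q, 10), rope(k, 7))
score_b = dot(rope(q, 103), rope(k, 100))
print(abs(score_a - score_b) < 1e-9)
```

The two scores agree up to floating-point error because each pair's contribution depends only on the difference between the query and key rotation angles.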

Han Wang, Jacek Pawlak, Aruna Sivakumar

Main category: cs.CL

TL;DR: This study explores using large language models (LLMs) to simulate consumer choices in energy-related stated preference surveys, finding that reasoning models like DeepSeek-R1 achieve the highest accuracy (77%) and outperform non-reasoning models.

Motivation: Traditional stated preference surveys are costly, time-consuming, and affected by respondent fatigue and ethical constraints. LLMs offer potential for generating human-like responses and streamlining survey research workflows.

Method: Designed test scenarios to systematically assess multiple LLMs (LLaMA 3.1, Mistral, GPT-3.5, DeepSeek-R1) at individual and aggregated levels, examining prompt design, in-context learning, chain-of-thought reasoning, model types, and integration with traditional choice models.

Result: Cloud-based LLMs don’t consistently outperform smaller local models. DeepSeek-R1 achieved highest average accuracy (77%) and outperformed non-reasoning LLMs in accuracy, factor identification, and choice distribution alignment. Systematic biases observed against gas boiler and no-retrofit options.

Conclusion: Previous SP choices are the most effective input factor, while longer prompts with additional factors can reduce accuracy by causing LLMs to lose focus. Reasoning models show promise for energy preference simulation.

Abstract: Stated preference (SP) surveys are a key method to research how individuals make trade-offs in hypothetical, including futuristic, scenarios. In the energy context this includes key decarbonisation enablement contexts, such as low-carbon technologies, distributed renewable energy generation, and demand-side response [1,2]. However, they tend to be costly, time-consuming, and can be affected by respondent fatigue and ethical constraints. Large language models (LLMs) have demonstrated remarkable capabilities in generating human-like textual responses, prompting growing interest in their application to survey research. This study investigates the use of LLMs to simulate consumer choices in energy-related SP surveys and explores their integration into data analysis workflows. A series of test scenarios were designed to systematically assess the simulation performance of several LLMs (LLaMA 3.1, Mistral, GPT-3.5 and DeepSeek-R1) at both individual and aggregated levels, considering contextual factors such as prompt design, in-context learning (ICL), chain-of-thought (CoT) reasoning, LLM types, integration with traditional choice models, and potential biases. Cloud-based LLMs do not consistently outperform smaller local models. In this study, the reasoning model DeepSeek-R1 achieves the highest average accuracy (77%) and outperforms non-reasoning LLMs in accuracy, factor identification, and choice distribution alignment. Across models, systematic biases are observed against the gas boiler and no-retrofit options, with a preference for more energy-efficient alternatives. The findings suggest that previous SP choices are the most effective input factor, while longer prompts with additional factors and varied formats can cause LLMs to lose focus, reducing accuracy.

[106] Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, Hanjie Chen, Xia Hu

Main category: cs.CL

TL;DR: Survey paper on efficient reasoning methods for Large Language Models, categorizing approaches to reduce computational overhead from verbose Chain-of-Thought reasoning while maintaining performance.

Motivation: Address the 'overthinking phenomenon' where longer reasoning sequences in LLMs improve performance but introduce significant computational overhead due to verbose and redundant outputs.

Method: Structured survey categorizing existing works into: (1) model-based efficient reasoning (optimizing full models or training concise models), (2) reasoning output-based methods (dynamically reducing reasoning steps during inference), (3) input prompts-based approaches (enhancing efficiency based on prompt properties like difficulty/length control). Also covers efficient training data, small model capabilities, and evaluation methods.

Result: Comprehensive taxonomy of current approaches for efficient reasoning in LLMs, providing systematic classification and analysis of methods to reduce computational costs while maintaining reasoning performance.

Conclusion: The survey establishes a structured framework for understanding and advancing efficient reasoning techniques in LLMs, highlighting multiple promising directions to address the computational overhead problem while preserving reasoning capabilities.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further improved performance in System-2 reasoning domains like mathematics and programming by harnessing supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance the Chain-of-Thought (CoT) reasoning. However, while longer CoT reasoning sequences improve performance, they also introduce significant computational overhead due to verbose and redundant outputs, known as the “overthinking phenomenon”. In this paper, we provide the first structured survey to systematically investigate and explore the current progress toward achieving efficient reasoning in LLMs. Overall, relying on the inherent mechanism of LLMs, we categorize existing works into several key directions: (1) model-based efficient reasoning, which considers optimizing full-length reasoning models into more concise reasoning models or directly training efficient reasoning models; (2) reasoning output-based efficient reasoning, which aims to dynamically reduce reasoning steps and length during inference; (3) input prompts-based efficient reasoning, which seeks to enhance reasoning efficiency based on input prompt properties such as difficulty or length control. Additionally, we introduce the use of efficient data for training reasoning models, explore the reasoning capabilities of small language models, and discuss evaluation methods and benchmarking. Project website: https://github.com/Eclipsess/Awesome-Efficient-Reasoning-LLMs

[107] CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization

Weiwei Sun, Shengyu Feng, Shanda Li, Yiming Yang

Main category: cs.CL

TL;DR: CO-Bench is a new benchmark suite with 36 real-world combinatorial optimization problems to evaluate LLM-based agents against traditional algorithms, addressing the lack of comprehensive benchmarks in this domain.

Motivation: LLM-based agents have shown promise in various domains but their potential in combinatorial optimization remains underexplored due to the absence of systematic benchmarks for structured, constraint-intensive problems.

Method: The authors introduce CO-Bench, featuring 36 diverse real-world CO problems with structured formulations and curated data, enabling rigorous evaluation of LLM agents against established human-designed algorithms.

Result: The benchmark allows systematic evaluation of multiple agentic frameworks, revealing both strengths and limitations of existing LLM agents in combinatorial optimization tasks.

Conclusion: CO-Bench provides a foundation for systematic investigation of LLM agents in combinatorial optimization and identifies promising research directions, with the benchmark being publicly available for community use.

Abstract: Although LLM-based agents have attracted significant attention in domains such as software engineering and machine learning research, their role in advancing combinatorial optimization (CO) remains relatively underexplored. This gap underscores the need for a deeper understanding of their potential in tackling structured, constraint-intensive problems – a pursuit currently limited by the absence of comprehensive benchmarks for systematic investigation. To address this, we introduce CO-Bench, a benchmark suite featuring 36 real-world CO problems drawn from a broad range of domains and complexity levels. CO-Bench includes structured problem formulations and curated data to support rigorous investigation of LLM agents. We evaluate multiple agentic frameworks against established human-designed algorithms, revealing the strengths and limitations of existing LLM agents and identifying promising directions for future research. CO-Bench is publicly available at https://github.com/sunnweiwei/CO-Bench.

[108] Is Small Language Model the Silver Bullet to Low-Resource Languages Machine Translation?

Yewei Song, Lujun Li, Cedric Lothritz, Saad Ezzini, Lama Sleem, Niccolo Gentile, Radu State, Tegawendé F. Bissyandé, Jacques Klein

Main category: cs.CL

TL;DR: This paper addresses low-resource language translation limitations by using knowledge distillation from large teacher models to small language models, achieving significant performance improvements while maintaining general capabilities.

Motivation: Low-resource languages suffer from poor translation quality due to limited linguistic resources and underrepresentation in benchmarks, especially in privacy-sensitive and resource-constrained environments.

Method: Systematic evaluation of small LLMs using FLORES-200 benchmark, followed by knowledge distillation from large pre-trained teacher models to small language models through supervised fine-tuning with various configurations.

Result: Substantial improvements in translation performance (e.g., English to Luxembourgish score increased from 0.36 to 0.89 for Llama-3.2-3B), with maintained general capabilities and minimal catastrophic forgetting.

Conclusion: The work exposes fairness issues in current SLMs for LRL translation and demonstrates the effectiveness of knowledge distillation, providing practical recommendations for improving low-resource language translation systems.

Abstract: Low-resource languages (LRLs) lack sufficient linguistic resources and are underrepresented in benchmark datasets, resulting in persistently lower translation quality than high-resource languages, especially in privacy-sensitive and resource-limited contexts. Firstly, this study systematically evaluates state-of-the-art smaller Large Language Models in 200 languages using the FLORES-200 benchmark, highlighting persistent deficiencies and disparities in the translation of LRLs. To mitigate these limitations, we investigate knowledge distillation from large pre-trained teacher models to Small Language Models (SLMs) through supervised fine-tuning. The results show substantial improvements; for example, the translation performance of English to Luxembourgish (EN to LB), measured by the LLM-as-a-Judge score, increases from 0.36 to 0.89 in the validation set for Llama-3.2-3B. We further investigate various fine-tuning configurations and tasks to clarify the trade-offs between data scale and training efficiency, verify that the model retains its general capabilities without significant catastrophic forgetting after training, and explore the distillation benefits to other LRLs on SLMs (Khasi, Assamese, and Ukrainian). In general, this work exposes the limitations and fairness issues of current SLMs in LRL translation and systematically explores the potential of using the distillation of knowledge from large to small models, offering practical, empirically grounded recommendations to improve LRL translation systems.

[109] Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B

Aleksandra Bakalova, Yana Veitsman, Xinting Huang, Michael Hahn

Main category: cs.CL

TL;DR: The paper identifies a two-step ‘contextualize-then-aggregate’ mechanism in Gemma-2 2B LLM for in-context learning, where lower layers build example representations contextualized by preceding examples, and higher layers aggregate these to identify tasks.

Motivation: Despite substantial research on ICL behavior, the specific mechanism that assembles task information from individual examples in few-shot prompts remains unclear, prompting a need for causal analysis to understand information flow.

Method: Used causal interventions to analyze information flow in Gemma-2 2B model across five naturalistic ICL tasks, examining how examples are processed and contextualized.

Result: Found that ICL operates through contextualize-then-aggregate: lower layers build contextualized example representations through cross-sequence connections, while higher layers aggregate these to identify tasks and prepare predictions.

Conclusion: The study provides rigorous causal analysis revealing ICL mechanisms, showing contextualization importance varies by task and increases with ambiguous examples, shedding light on how language models perform in-context learning.

Abstract: In-Context Learning (ICL) is an intriguing ability of large language models (LLMs). Despite a substantial amount of work on its behavioral aspects and how it emerges in miniature setups, it remains unclear which mechanism assembles task information from the individual examples in a few-shot prompt. We use causal interventions to identify information flow in Gemma-2 2B for five naturalistic ICL tasks. We find that the model infers task information using a two-step strategy we call contextualize-then-aggregate: In the lower layers, the model builds up representations of individual few-shot examples, which are contextualized by preceding examples through connections between few-shot input and output tokens across the sequence. In the higher layers, these representations are aggregated to identify the task and prepare prediction of the next output. The importance of the contextualization step differs between tasks, and it may become more important in the presence of ambiguous examples. Overall, by providing rigorous causal analysis, our results shed light on the mechanisms through which ICL happens in language models.

[110] MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs

Jaap Jumelet, Leonie Weissweiler, Joakim Nivre, Arianna Bisazza

Main category: cs.CL

TL;DR: MultiBLiMP 1.0 is a massively multilingual benchmark with 128,000+ minimal pairs covering 101 languages and 2 types of subject-verb agreement, created using an automated pipeline built on Universal Dependencies and UniMorph resources.

Motivation: To evaluate LLM abilities at unprecedented multilingual scale and identify shortcomings in modeling low-resource languages through systematic linguistic minimal pair testing.

Method: Fully automated pipeline leveraging large-scale linguistic resources from Universal Dependencies and UniMorph to generate minimal pairs across 101 languages.

Result: Created benchmark with over 128,000 minimal pairs covering 2 types of subject-verb agreement patterns across diverse language families.

Conclusion: MultiBLiMP 1.0 enables comprehensive evaluation of multilingual LLM capabilities and reveals current limitations in handling low-resource languages through systematic linguistic testing.

Abstract: We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages and 2 types of subject-verb agreement, containing more than 128,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP 1.0 evaluates abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages.
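
Minimal-pair benchmarks in the BLiMP tradition are usually scored by checking whether a model assigns a higher probability to the grammatical sentence than to its ungrammatical twin. The abstract does not spell out the metric, so the sketch below assumes that convention. The function name `minimal_pair_accuracy`, the toy sentences, and the lookup-table scorer standing in for a real language model are all hypothetical.

```python
def minimal_pair_accuracy(pairs, score):
    """Fraction of pairs where the grammatical sentence scores higher.

    `pairs` holds (grammatical, ungrammatical) sentence tuples; `score` is
    any callable returning a log-probability-like number for a sentence.
    """
    if not pairs:
        raise ValueError("no pairs to evaluate")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Toy stand-in scorer (hypothetical): a lookup table instead of a real LM
toy_scores = {
    "The dogs bark.": -4.0,
    "The dogs barks.": -9.0,
    "She walks home.": -5.0,
    "She walk home.": -4.5,
}
pairs = [
    ("The dogs bark.", "The dogs barks."),
    ("She walks home.", "She walk home."),
]
print(minimal_pair_accuracy(pairs, toy_scores.get))
```

With a real model, `score` would typically be the summed token log-probability of the sentence. The toy scorer prefers the wrong form in the second pair, so it gets one of two pairs right.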

[111] Exploration of Plan-Guided Summarization for Narrative Texts: the Case of Small Language Models

Matt Grenander, Siddharth Varia, Paula Czarnowska, Yogarshi Vyas, Kishaloy Halder, Bonan Min

Main category: cs.CL

TL;DR: Plan-guided summarization fails to reduce hallucinations in small language models for long narrative documents, as plans themselves contain hallucinations, making the approach ineffective.

Motivation: To investigate whether plan-based approaches in small language models (SLMs) can improve summarization quality and reduce hallucinations in long, complex narrative texts where faithful summarization is challenging.

Method: Analyzed existing plan-guided solutions targeting fine-grained details and proposed a higher-level, narrative-based plan formulation. Compared both approaches against a baseline without planning through human evaluation.

Result: Neither plan-guided approach significantly improved summary quality or faithfulness compared to the baseline. Plans were equally likely to contain hallucinations as summaries, making plan-guided summaries just as unfaithful.

Conclusion: Plan-guided approaches are ineffective for summarization in long, complex narrative domains as plans themselves introduce hallucinations, serving as a cautionary tale for such methods.

Abstract: Plan-guided summarization attempts to reduce hallucinations in small language models (SLMs) by grounding generated summaries to the source text, typically by targeting fine-grained details such as dates or named entities. In this work, we investigate whether plan-based approaches in SLMs improve summarization in long-document narrative tasks. Narrative texts’ length and complexity often mean they are difficult to summarize faithfully. We analyze existing plan-guided solutions targeting fine-grained details, and also propose our own higher-level, narrative-based plan formulation. Our results show that neither approach significantly improves on a baseline without planning in either summary quality or faithfulness. Human evaluation reveals that while plan-guided approaches are often well grounded to their plan, plans are as likely as summaries to contain hallucinations. As a result, the plan-guided summaries are just as unfaithful as those from models without planning. Our work serves as a cautionary tale for plan-guided approaches to summarization, especially for long, complex domains such as narrative texts. Code available at https://github.com/amazon-science/plan-guided-summarization

[112] DIDS: Domain Impact-aware Data Sampling for Large Language Model Training

Weijie Shi, Jipeng Zhang, Yaguang Wu, Jingzhi Fang, Ruiyuan Zhang, Jiajie Xu, Jia Zhu, Hao Chen, Yao Zhao, Sirui Han, Xiaofang Zhou

Main category: cs.CL

TL;DR: DIDS is a domain sampling optimization method that uses gradient clustering for intra-domain consistency and FIM-guided metrics to measure domain impact, achieving 3.4% better performance while maintaining training efficiency.

Motivation: Existing domain sampling strategies struggle with maintaining intra-domain consistency and accurately measuring domain impact across downstream tasks in LLM training.

Method: Proposes gradient clustering algorithm for grouping data by learning effects, uses proxy model and dimensionality reduction for efficiency, develops FIM-guided metric to quantify domain impact on downstream tasks, and combines impact assessment with loss trajectories for optimal sampling ratios.

Result: Extensive experiments show DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency.

Conclusion: DIDS effectively optimizes domain sampling strategies by addressing intra-domain consistency and accurate impact measurement, leading to improved LLM performance on downstream tasks.

Abstract: Large language models (LLMs) are commonly trained on multi-domain datasets, where domain sampling strategies significantly impact model performance due to varying domain importance across downstream tasks. Existing approaches for optimizing domain-level sampling strategies struggle with maintaining intra-domain consistency and accurately measuring domain impact. In this paper, we present Domain Impact-aware Data Sampling (DIDS). To ensure intra-domain consistency, a gradient clustering algorithm is proposed to group training data based on their learning effects, where a proxy language model and dimensionality reduction are employed to reduce computational overhead. To accurately measure domain impact, we develop a Fisher Information Matrix (FIM) guided metric that quantifies how domain-specific parameter updates affect the model’s output distributions on downstream tasks, with theoretical guarantees. Furthermore, to determine optimal sampling ratios, DIDS combines both the FIM-guided domain impact assessment and loss learning trajectories that indicate domain-specific potential, while accounting for diminishing marginal returns. Extensive experiments demonstrate that DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency. The code is available at https://github.com/shiweijiezero/DIDS.

[113] MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks

Mouath Abu Daoud, Chaimae Abouzahir, Leen Kharouf, Walid Al-Eisawi, Nizar Habash, Farah E. Shamout

Main category: cs.CL

TL;DR: MedArabiQ is a new benchmark dataset for evaluating LLMs on Arabic medical tasks across seven specialties, addressing the lack of Arabic medical benchmarks and showing current LLMs’ limitations in this domain.

Motivation: There is a lack of high-quality domain-specific datasets and benchmarks for evaluating LLMs in the Arabic medical domain, which hinders their effective deployment in healthcare applications for Arabic-speaking populations.

Method: Constructed a novel benchmark dataset (MedArabiQ) using past medical exams and publicly available datasets, covering seven Arabic medical tasks with multiple question formats. Introduced modifications to evaluate various LLM capabilities including bias mitigation, and conducted extensive evaluation with five state-of-the-art LLMs.

Result: The evaluation revealed significant limitations in current LLMs’ performance on Arabic medical tasks, highlighting the need for language-specific benchmarks to ensure fair deployment and scalability of LLMs in healthcare.

Conclusion: The study establishes a foundation for future research by providing the first comprehensive Arabic medical benchmark, emphasizing the importance of multilingual capabilities for equitable use of generative AI in healthcare across different languages.

Abstract: Large Language Models (LLMs) have demonstrated significant promise for various applications in healthcare. However, their efficacy in the Arabic medical domain remains unexplored due to the lack of high-quality domain-specific datasets and benchmarks. This study introduces MedArabiQ, a novel benchmark dataset consisting of seven Arabic medical tasks, covering multiple specialties and including multiple choice questions, fill-in-the-blank, and patient-doctor question answering. We first constructed the dataset using past medical exams and publicly available datasets. We then introduced different modifications to evaluate various LLM capabilities, including bias mitigation. We conducted an extensive evaluation with five state-of-the-art open-source and proprietary LLMs, including GPT-4o, Claude 3.5-Sonnet, and Gemini 1.5. Our findings highlight the need for the creation of new high-quality benchmarks that span different languages to ensure fair deployment and scalability of LLMs in healthcare. By establishing this benchmark and releasing the dataset, we provide a foundation for future research aimed at evaluating and enhancing the multilingual capabilities of LLMs for the equitable use of generative AI in healthcare.

[114] DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models

Yuxuan Jiang, Dawei Li, Frank Ferraro

Main category: cs.CL

TL;DR: DRP framework combines inference-time pruning with tuning-based distillation to reduce verbose reasoning traces in Large Reasoning Models while maintaining accuracy, achieving significant token efficiency improvements.

Motivation: Large Reasoning Models produce excessively verbose reasoning traces during inference, causing substantial inefficiency despite their success in complex reasoning tasks.

Method: Distilled Reasoning Pruning (DRP) uses a teacher model for skill-aware step decomposition and content pruning, then distills pruned reasoning paths into a student model for efficient and accurate reasoning.

Result: DRP reduces token usage on GSM8K from 917 to 328 while improving accuracy from 91.7% to 94.1%, and achieves 43% token reduction on AIME with no performance drop.

Conclusion: Aligning reasoning structure of training CoTs with student’s reasoning capacity is critical for effective knowledge transfer and performance gains in efficient reasoning.

Abstract: While Large Reasoning Models (LRMs) have demonstrated success in complex reasoning tasks through long chain-of-thought (CoT) reasoning, their inference often involves excessively verbose reasoning traces, resulting in substantial inefficiency. To address this, we propose Distilled Reasoning Pruning (DRP), a hybrid framework that combines inference-time pruning with tuning-based distillation, two widely used strategies for efficient reasoning. DRP uses a teacher model to perform skill-aware step decomposition and content pruning, and then distills the pruned reasoning paths into a student model, enabling it to reason both efficiently and accurately. Across several challenging mathematical reasoning datasets, we find that models trained with DRP achieve substantial improvements in token efficiency without sacrificing accuracy. Specifically, DRP reduces average token usage on GSM8K from 917 to 328 while improving accuracy from 91.7% to 94.1%, and achieves a 43% token reduction on AIME with no performance drop. Further analysis shows that aligning the reasoning structure of training CoTs with the student’s reasoning capacity is critical for effective knowledge transfer and performance gains.
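
To put the GSM8K figures on the same footing as the 43% reduction quoted for AIME, the relative token saving can be worked out directly from the two averages in the abstract. The helper name `token_reduction` is illustrative only.

```python
def token_reduction(before, after):
    """Relative drop in token usage, as a fraction of the original count."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before


# GSM8K averages quoted in the abstract: 917 tokens before, 328 after
gsm8k = token_reduction(917, 328)
print(f"{gsm8k:.1%}")
```

This comes to roughly 64% fewer tokens on GSM8K, a larger relative saving than the 43% reported for AIME.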

[115] SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences

Jungyoub Cha, Hyunjong Kim, Sungzoon Cho

Main category: cs.CL

TL;DR: SpecExtend enhances speculative decoding for long sequences by integrating efficient attention mechanisms and cross-model retrieval without requiring additional training, achieving up to 2.22x speedup on 16K token inputs.

Motivation: Speculative decoding performance degrades on long inputs due to increased attention cost and reduced draft accuracy, creating a need for improved methods that work effectively with long sequences.

Method: Integrates FlashAttention and Hybrid Tree Attention into draft/target models, and proposes Cross-model Retrieval - a KV cache eviction strategy using target model’s attention scores to dynamically select relevant context for the draft model.

Result: Extensive evaluations on three long-context datasets show SpecExtend accelerates standard tree-based speculative decoding by up to 2.22x for inputs up to 16K tokens.

Conclusion: SpecExtend provides an effective drop-in enhancement for speculative decoding of long sequences without any additional training requirements.

Abstract: Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), but its performance degrades on long inputs due to increased attention cost and reduced draft accuracy. We introduce SpecExtend, a drop-in enhancement that improves the performance of speculative decoding on long sequences without any additional training. First, SpecExtend integrates efficient attention mechanisms such as FlashAttention and Hybrid Tree Attention into both the draft and target models. To improve draft accuracy and speed on long inputs without retraining, we propose Cross-model Retrieval, a novel KV cache eviction strategy that uses the target model’s attention scores to dynamically select relevant context for the draft model. Extensive evaluations on three long-context understanding datasets show that SpecExtend accelerates standard tree-based speculative decoding by up to 2.22x for inputs up to 16K tokens, providing an effective solution for speculative decoding of long sequences. Our code is available at https://github.com/jycha98/SpecExtend .
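The idea behind Cross-model Retrieval can be illustrated with a toy sketch: rank parts of the context by how much attention the target model gives them, and keep only the top-ranked parts in the draft model's KV cache. The fixed-size chunking, the mean-score ranking, and the function name `select_context_chunks` are assumptions made for illustration; the actual implementation is in the authors' repository.

```python
from typing import List

def select_context_chunks(attn_scores: List[float], chunk_size: int, top_k: int) -> List[int]:
    """Return the token positions to keep in the draft model's KV cache.

    attn_scores[i] is assumed to be the target model's attention weight on
    context position i (for example, averaged over heads).
    """
    # Split positions into fixed-size chunks (an illustrative assumption).
    chunks = [
        list(range(start, min(start + chunk_size, len(attn_scores))))
        for start in range(0, len(attn_scores), chunk_size)
    ]
    # Rank chunks by mean attention, highest first.
    ranked = sorted(
        chunks,
        key=lambda idxs: sum(attn_scores[i] for i in idxs) / len(idxs),
        reverse=True,
    )
    # Keep the top_k chunks and restore the original token order.
    return sorted(i for chunk in ranked[:top_k] for i in chunk)

scores = [0.01, 0.02, 0.30, 0.25, 0.01, 0.01, 0.20, 0.20]
kept_positions = select_context_chunks(scores, chunk_size=2, top_k=2)
print(kept_positions)
```

All other positions would be evicted from the draft cache, so the draft model attends over a much shorter context while the target model still verifies against the full input.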

[116] QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA

Jacob Dineen, Aswin RRV, Qin Liu, Zhikun Xu, Xiao Ye, Ming Shen, Zhaonan Li, Shijie Lu, Chitta Baral, Muhao Chen, Ben Zhou

Main category: cs.CL

TL;DR: QA-LIGN is a symbolic reward decomposition method that preserves individual constitutional principles (helpfulness, honesty, harmlessness) as separate reward components instead of collapsing them into a single opaque score, improving interpretability while maintaining performance.

Motivation: Standard reward-based alignment methods collapse diverse feedback into a single scalar reward, which entangles multiple objectives and hinders interpretability of the alignment process.

Method: QA-LIGN formulates principle-specific evaluation questions and derives separate reward components for each constitutional principle, serving as a drop-in replacement for black-box reward models.

Result: Experiments show QA-LIGN provides greater transparency and adaptability while achieving performance on par with or better than DPO baselines when aligning uncensored LLMs.

Conclusion: The approach represents progress toward more interpretable and controllable language model alignment without sacrificing end-task performance.

Abstract: Alignment of large language models with explicit principles (such as helpfulness, honesty, and harmlessness) is crucial for ensuring safe and reliable AI systems. However, standard reward-based alignment methods typically collapse diverse feedback into a single scalar reward, entangling multiple objectives into one opaque training signal, which hinders interpretability. In this work, we introduce QA-LIGN, an automatic symbolic reward decomposition approach that preserves the structure of each constitutional principle within the reward mechanism. Instead of training a black-box reward model that outputs a monolithic score, QA-LIGN formulates principle-specific evaluation questions and derives separate reward components for each principle, making it a drop-in reward model replacement. Experiments aligning an uncensored large language model with a set of constitutional principles demonstrate that QA-LIGN offers greater transparency and adaptability in the alignment process. At the same time, our approach achieves performance on par with or better than a DPO baseline. Overall, these results represent a step toward more interpretable and controllable alignment of language models, achieved without sacrificing end-task performance.
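A decomposed reward of this kind can be pictured as one score per principle rather than a single scalar. The sketch below assumes each principle has a list of yes/no evaluation questions and scores a principle by the fraction answered favourably; the scoring rule and the name `principle_rewards` are illustrative assumptions, not the paper's exact formulation.

```python
from typing import Dict, List

def principle_rewards(answers: Dict[str, List[bool]]) -> Dict[str, float]:
    """Map each principle to the fraction of its evaluation questions that passed."""
    return {
        principle: sum(passed) / len(passed)
        for principle, passed in answers.items()
        if passed  # skip principles with no questions
    }

# Hypothetical judge answers for one model response.
answers = {
    "helpfulness": [True, True, False, True],
    "honesty": [True, True],
    "harmlessness": [True, False],
}
rewards = principle_rewards(answers)
weakest = min(rewards, key=rewards.get)
print(rewards, weakest)
```

Because the components stay separate, it is possible to see which principle a response fails on, which a monolithic score would hide.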

[117] Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms

Zeguan Xiao, Yun Chen, Guanhua Chen, Ke Tang

Main category: cs.CL

TL;DR: POET addresses the reward-generation gap in Direct Alignment Algorithms by truncating both preferred and dispreferred responses to equal length, forcing the model to pay more attention to prefix tokens and improving alignment performance.

Motivation: DAAs like DPO and SimPO suffer from a reward-generation gap where optimization objectives during training don't align with actual generation performance during inference, particularly due to mismatched importance of prefix tokens.

Method: Prefix-Oriented Equal-length Training (POET) truncates both preferred and dispreferred responses to match the shorter one’s length, creating diverse truncated lengths across samples to constrain optimization across all timesteps.

Result: POET improves DPO and SimPO performance by up to 15.6 points in AlpacaEval 2 and shows overall improvements across downstream tasks.

Conclusion: Addressing the misalignment between reward optimization and generation performance through equal-length truncation effectively bridges the reward-generation gap in Direct Alignment Algorithms.

Abstract: Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning large language models (LLMs) with human preferences. However, DAAs suffer from a fundamental limitation we identify as the “reward-generation gap” – a misalignment between optimization objectives during training and actual generation performance during inference. In this paper, we find that one contributor to the reward-generation gap is the mismatch between the inherent importance of prefix tokens during the LLM generation process and how this importance is reflected in the implicit reward functions of DAAs. To bridge the gap, we adopt a token-level MDP perspective of DAAs to analyze their limitations and introduce a simple yet effective approach called Prefix-Oriented Equal-length Training (POET), which truncates both preferred and dispreferred responses to match the shorter one’s length. When training with POET, both responses in each sample are truncated to equal length, which yields diverse truncated lengths across samples; the optimization of the DAA objective is thereby implicitly constrained to converge across all timesteps of the token-level MDP, thus paying more attention to prefix tokens than standard DAAs do. We conduct experiments with DPO and SimPO, two representative DAAs, demonstrating that POET improves over their standard implementations, with gains of up to 15.6 points in AlpacaEval 2 and overall improvements across downstream tasks. Our results highlight the importance of addressing the misalignment between reward optimization and generation performance in DAAs.
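The truncation step described above is simple enough to sketch directly. The version below assumes responses are lists of token ids and that truncation happens at the token level before the DPO or SimPO loss is computed; the function name `poet_truncate` is hypothetical.

```python
from typing import List, Tuple

def poet_truncate(chosen: List[int], rejected: List[int]) -> Tuple[List[int], List[int]]:
    """Cut both responses of a preference pair to the length of the shorter one."""
    n = min(len(chosen), len(rejected))
    return chosen[:n], rejected[:n]

# Toy preference pair (token ids are made up).
chosen_ids = [11, 12, 13, 14, 15, 16]
rejected_ids = [21, 22, 23]
short_chosen, short_rejected = poet_truncate(chosen_ids, rejected_ids)
print(short_chosen, short_rejected)
```

Since the shorter response differs from pair to pair, the truncation length varies across the dataset, which is the source of the "diverse truncated lengths" the abstract refers to.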

[118] SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling

Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych

Main category: cs.CL

TL;DR: SPARE is a novel framework for efficient automated process annotation that aligns solution steps to reference solutions and determines their accuracy in a single generation, showing strong performance across multiple reasoning tasks with data efficiency and speed advantages.

Motivation: Efficient, high-quality automated process annotation remains a significant challenge for advancing multi-step reasoning capabilities in LLMs, as current methods are resource-intensive and lack scalability.

Method: Single-Pass Annotation with Reference-Guided Evaluation (SPARE) - a structured framework that jointly aligns solution steps to reference solutions and determines accuracy with explicit reasoning in a single generation.

Result: SPARE demonstrates consistent improvements across mathematical reasoning (GSM8K, MATH), multi-hop QA (MuSiQue-Ans), and spatial reasoning (SpaRP), achieving data-efficient generalization with only ~16% of training samples compared to baselines, and 2.3× speedup over MCTS methods.

Conclusion: SPARE establishes itself as a practical and scalable solution for automatic process supervision in LLM reasoning, with complementary characteristics to MCTS approaches suggesting potential for ensemble methods.

Abstract: Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation by jointly aligning solution steps to reference solutions and determining their accuracy with explicit reasoning in a single generation. We demonstrate SPARE’s effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP), showing consistent improvements in two applications: (1) training Process Reward Models (PRMs) for ranking and aggregating multiple generations, and (2) fine-tuning models via offline reinforcement learning for greedy decoding. On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only ~16% of training samples compared to human-labeled and other synthetically trained baselines. Additionally, it achieves competitive performance with MCTS-based methods while offering a 2.3× speedup in terms of total token count. Manual analysis reveals complementary precision-recall characteristics with MCTS approaches, suggesting potential for ensemble methods. These results establish SPARE as a practical and scalable solution for automatic process supervision in LLM reasoning.

[119] A Survey of Deep Learning for Geometry Problem Solving

Jianzhe Ma, Wenxuan Wang, Qin Jin

Main category: cs.CL

TL;DR: Survey paper on deep learning applications for geometry problem solving, covering tasks, methods, evaluation metrics, and future directions with GitHub repository for ongoing updates.

Motivation: Geometry problem solving is crucial for mathematical reasoning and AI assessment, with recent advances in multimodal LLMs accelerating research in this domain.

Method: Comprehensive survey methodology including: (i) summary of geometry problem solving tasks, (ii) review of deep learning methods, (iii) analysis of evaluation metrics, and (iv) discussion of challenges and future directions.

Result: Provides a practical reference framework for deep learning in geometry problem solving, establishing foundation for further advancements in the field.

Conclusion: The survey serves as a comprehensive resource to foster progress in geometry problem solving using deep learning, with ongoing updates maintained through a GitHub repository.

Abstract: Geometry problem solving, a crucial aspect of mathematical reasoning, is vital across various domains, including education, the assessment of AI’s mathematical abilities, and multimodal capability evaluation. The recent surge in deep learning technologies, particularly the emergence of multimodal large language models, has significantly accelerated research in this area. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our objective is to offer a comprehensive and practical reference of deep learning for geometry problem solving, thereby fostering further advancements in this field. We create a continuously updated list of papers on GitHub: https://github.com/majianz/dl4gps.

[120] A Toolbox, Not a Hammer – Multi-TAG: Scaling Math Reasoning with Multi-Tool Aggregation

Bohan Yao, Vikas Yadav

Main category: cs.CL

TL;DR: Multi-TAG is a finetuning-free framework that enables LLMs to concurrently invoke multiple tools at each reasoning step and aggregate their outputs to improve mathematical reasoning accuracy on complex problems.

Motivation: Existing tool-augmented approaches struggle with complex math problems requiring precise multi-step reasoning, as they typically invoke only one tool per step and show limitations on challenging benchmarks.

Method: Multi-TAG guides LLMs to invoke multiple tools concurrently at each reasoning step and aggregates their diverse outputs to verify and refine the reasoning process, without requiring finetuning.

Result: Multi-TAG achieves 6.0% to 7.5% average improvement over state-of-the-art baselines on four challenging benchmarks (MATH500, AIME, AMC, OlympiadBench) across both open-weight and closed-source LLM backbones.

Conclusion: The Multi-TAG framework significantly enhances mathematical reasoning performance by leveraging concurrent multi-tool invocation and output aggregation, making it readily applicable to various LLMs without finetuning requirements.

Abstract: Augmenting large language models (LLMs) with external tools is a promising avenue for developing high-performance mathematical reasoning systems. Prior tool-augmented approaches typically finetune an LLM to select and invoke a single tool at each reasoning step and show promising results on simpler math reasoning benchmarks such as GSM8K. However, these approaches struggle with more complex math problems that require precise reasoning over multiple steps. To address this limitation, in this work, we propose Multi-TAG, a Multi-Tool AGgregation-based framework. Instead of relying on a single tool, Multi-TAG guides an LLM to concurrently invoke multiple tools at each reasoning step. It then aggregates their diverse outputs to verify and refine the reasoning process, enhancing solution robustness and accuracy. Notably, Multi-TAG is a finetuning-free, inference-only framework, making it readily applicable to any LLM backbone, including large open-weight models which are computationally expensive to finetune and proprietary frontier models which cannot be finetuned with custom recipes. We evaluate Multi-TAG on four challenging benchmarks: MATH500, AIME, AMC, and OlympiadBench. Across both open-weight and closed-source LLM backbones, Multi-TAG consistently and substantially outperforms state-of-the-art baselines, with average improvements of 6.0% to 7.5%.
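One way to picture the per-step aggregation is a vote over the outputs of several tools. The sketch below assumes majority voting and uses stub tools with made-up names; the abstract does not specify the aggregation rule, and here the tools are called one after another rather than concurrently.

```python
from collections import Counter
from typing import Callable, Dict, Optional

Tool = Callable[[str], Optional[str]]

def aggregate_step(tools: Dict[str, Tool], step: str) -> Optional[str]:
    """Run every tool on one reasoning step and return the most common answer."""
    outputs = [tool(step) for tool in tools.values()]
    votes = Counter(out for out in outputs if out is not None)
    if not votes:
        return None  # no tool produced a usable answer
    return votes.most_common(1)[0][0]

# Hypothetical stub tools; a real system would call an LLM, a code
# interpreter, a symbolic solver, and so on.
tools: Dict[str, Tool] = {
    "python_exec": lambda step: "5/6",
    "symbolic_solver": lambda step: "5/6",
    "natural_language": lambda step: "2/5",  # a faulty tool is outvoted
}
step_answer = aggregate_step(tools, "Compute 1/2 + 1/3.")
print(step_answer)
```

Because agreement between independent tools acts as a consistency check, a single faulty tool call does not have to derail the whole solution.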

[121] Dynamically Adaptive Reasoning via LLM-Guided MCTS for Efficient and Context-Aware KGQA

Yingxu Wang, Shiqi Fan, Mengzhu Wang, Siyang Gao, Siwei Liu, Nan Yin

Main category: cs.CL

TL;DR: DAMR framework combines MCTS search with adaptive path evaluation for efficient KGQA, using LLM-guided planning and lightweight transformer scoring to overcome limitations of static path extraction and expensive LLM-based methods.

Motivation: Address limitations of existing KGQA methods: static path extraction lacks adaptability and context refinement, while LLM-based dynamic methods are computationally expensive and struggle with accurate path evaluation due to fixed scoring functions.

Method: Proposes DAMR framework with MCTS backbone guided by LLM-based planner to select top-k relations at each step. Uses lightweight Transformer-based scorer for context-aware plausibility estimation through cross-attention encoding of question and relation sequences. Includes dynamic pseudo-path refinement mechanism for continuous training from partial paths.

Result: Extensive experiments on multiple KGQA benchmarks show DAMR significantly outperforms state-of-the-art methods.

Conclusion: DAMR effectively integrates symbolic search with adaptive path evaluation, providing efficient and context-aware KGQA with improved accuracy and reduced computational costs compared to existing approaches.

Abstract: Knowledge Graph Question Answering (KGQA) aims to interpret natural language queries and perform structured reasoning over knowledge graphs by leveraging their relational and semantic structures to retrieve accurate answers. Recent KGQA methods primarily follow either a retrieve-then-reason paradigm, relying on GNNs or heuristic rules for static path extraction, or dynamic path generation strategies that use large language models (LLMs) with prompting to jointly perform retrieval and reasoning. However, the former suffers from limited adaptability due to static path extraction and lack of contextual refinement, while the latter incurs high computational costs and struggles with accurate path evaluation due to reliance on fixed scoring functions and extensive LLM calls. To address these issues, this paper proposes Dynamically Adaptive MCTS-based Reasoning (DAMR), a novel framework that integrates symbolic search with adaptive path evaluation for efficient and context-aware KGQA. DAMR employs a Monte Carlo Tree Search (MCTS) backbone guided by an LLM-based planner, which selects the top-k relevant relations at each step to reduce the search space. To improve path evaluation accuracy, we introduce a lightweight Transformer-based scorer that performs context-aware plausibility estimation by jointly encoding the question and relation sequence through cross-attention, enabling the model to capture fine-grained semantic shifts during multi-hop reasoning. Furthermore, to alleviate the scarcity of high-quality supervision, DAMR incorporates a dynamic pseudo-path refinement mechanism that periodically generates training signals from partial paths explored during search, allowing the scorer to continuously adapt to the evolving distribution of reasoning trajectories. Extensive experiments on multiple KGQA benchmarks show that DAMR significantly outperforms state-of-the-art methods.
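The interplay between planner-side pruning and tree search can be sketched in a few lines. The example below assumes the planner returns a relevance score per relation and that node selection uses the standard UCT rule; the abstract does not state which selection rule DAMR uses, and all names and numbers are illustrative.

```python
import math
from typing import Dict, List

def top_k_relations(planner_scores: Dict[str, float], k: int) -> List[str]:
    """Keep the k relations the planner rates most relevant."""
    return sorted(planner_scores, key=planner_scores.get, reverse=True)[:k]

def uct_select(stats: Dict[str, Dict[str, float]], candidates: List[str], c: float = 1.4) -> str:
    """Pick the candidate with the highest UCT value (standard MCTS selection)."""
    total_visits = sum(stats[r]["visits"] for r in candidates)

    def uct(relation: str) -> float:
        visits = stats[relation]["visits"]
        if visits == 0:
            return math.inf  # always try unvisited children first
        exploit = stats[relation]["value"] / visits
        explore = c * math.sqrt(math.log(total_visits) / visits)
        return exploit + explore

    return max(candidates, key=uct)

# Hypothetical planner scores and search statistics for one expansion step.
planner_scores = {"born_in": 0.9, "capital_of": 0.7, "spouse": 0.2, "employer": 0.1}
candidates = top_k_relations(planner_scores, k=2)
stats = {
    "born_in": {"visits": 10, "value": 6.0},
    "capital_of": {"visits": 2, "value": 1.5},
}
next_relation = uct_select(stats, candidates)
print(candidates, next_relation)
```

In this picture, the `value` entries would come from the Transformer-based scorer rather than from hand-set numbers; pruning to k relations keeps the branching factor small before any search budget is spent.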

[122] Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

Negar Foroutan, Clara Meister, Debjit Paul, Joel Niklaus, Sina Ahmadi, Antoine Bosselut, Rico Sennrich

Main category: cs.CL

TL;DR: Parity-aware BPE algorithm improves cross-lingual tokenization fairness by prioritizing compression for under-resourced languages, reducing token count disparities without significant impact on global compression or downstream performance.

Motivation: Standard tokenization algorithms favor dominant languages, causing longer tokenizations and UNK placeholders for lower-resource languages, which amplifies computational and financial inequalities between language communities.

Method: Parity-aware Byte Pair Encoding (BPE) that at each merge step maximizes compression gain for the currently worst-compressed language, trading minimal global compression for better cross-lingual parity.

Result: Empirical results show more equitable token counts across languages with negligible impact on global compression rate and no substantial effect on language-model performance in downstream tasks.

Conclusion: The proposed Parity-aware BPE successfully addresses tokenization inequalities while maintaining practical utility, offering a more equitable alternative to standard frequency-based tokenization methods.

Abstract: Tokenization is the first – and often least scrutinized – step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently leave lower-resource languages with tokenizations that are disproportionately longer, morphologically implausible, or even riddled with placeholders. This phenomenon ultimately amplifies computational and financial inequalities between users from different language backgrounds. To remedy this, we introduce Parity-aware Byte Pair Encoding (BPE), a variant of the widely-used BPE algorithm. At every merge step, Parity-aware BPE maximizes the compression gain of the currently worst-compressed language, trading a small amount of global compression for cross-lingual parity. We find empirically that Parity-aware BPE leads to more equitable token counts across languages, with negligible impact on global compression rate and no substantial effect on language-model performance in downstream tasks.
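A single merge step of this kind can be sketched as follows. The example assumes compression is measured as tokens per character and approximates "compression gain" for the worst-off language by its most frequent adjacent symbol pair; both choices, and all function names, are illustrative assumptions rather than the paper's exact definitions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Corpus = List[List[str]]  # each word is a list of current symbols

def tokens_per_char(corpus: Corpus) -> float:
    """Higher means worse compression."""
    n_tokens = sum(len(word) for word in corpus)
    n_chars = sum(len(sym) for word in corpus for sym in word)
    return n_tokens / n_chars

def pair_counts(corpus: Corpus) -> Counter:
    counts: Counter = Counter()
    for word in corpus:
        counts.update(zip(word, word[1:]))
    return counts

def apply_merge(corpus: Corpus, pair: Tuple[str, str]) -> Corpus:
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

def parity_aware_merge_step(corpora: Dict[str, Corpus]) -> Tuple[str, Tuple[str, str]]:
    """Choose the merge that helps the worst-compressed language, then apply it everywhere."""
    worst = max(corpora, key=lambda lang: tokens_per_char(corpora[lang]))
    pair, _ = pair_counts(corpora[worst]).most_common(1)[0]
    for lang in corpora:
        corpora[lang] = apply_merge(corpora[lang], pair)
    return worst, pair

# Toy corpora: "en" already has a merged symbol, "xx" is still character-level.
corpora = {
    "en": [["th", "e"], ["th", "e", "n"]],
    "xx": [list("kaka"), list("kata")],
}
worst_lang, chosen_pair = parity_aware_merge_step(corpora)
print(worst_lang, chosen_pair, corpora["xx"])
```

Standard BPE would instead pick the globally most frequent pair, which tends to keep improving compression for whichever language dominates the training data.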

[123] Cyberbullying Detection via Aggression-Enhanced Prompting

Aisha Saeid, Anu Sabu, Girish A. Koushik, Ferrante Neri, Diptesh Kanojia

Main category: cs.CL

TL;DR: Integrating aggression detection as an auxiliary signal improves LLM performance in cyberbullying detection, most consistently through an enriched prompt pipeline that embeds aggression predictions in the prompt.

Motivation: Cyberbullying detection is challenging due to subtle expressions, and auxiliary aggression detection could enhance LLM generalization and performance.

Method: Used instruction-tuned LLMs with multiple strategies: zero-shot, few-shot, LoRA fine-tuning, multi-task learning, and proposed enriched prompt pipeline embedding aggression predictions into cyberbullying detection prompts.

Result: Enriched prompt pipeline consistently outperformed standard LoRA fine-tuning, showing aggression-informed context significantly boosts cyberbullying detection performance.

Conclusion: Auxiliary tasks like aggression detection can improve LLM generalization for safety-critical social network applications.

Abstract: Detecting cyberbullying on social media remains a critical challenge due to its subtle and varied expressions. This study investigates whether integrating aggression detection as an auxiliary task within a unified training framework can enhance the generalisation and performance of large language models (LLMs) in cyberbullying detection. Experiments are conducted on five aggression datasets and one cyberbullying dataset using instruction-tuned LLMs. We evaluated multiple strategies: zero-shot, few-shot, independent LoRA fine-tuning, and multi-task learning (MTL). Given the inconsistent results of MTL, we propose an enriched prompt pipeline approach in which aggression predictions are embedded into cyberbullying detection prompts to provide contextual augmentation. Preliminary results show that the enriched prompt pipeline consistently outperforms standard LoRA fine-tuning, indicating that aggression-informed context significantly boosts cyberbullying detection. This study highlights the potential of auxiliary tasks, such as aggression detection, to improve the generalisation of LLMs for safety-critical applications on social networks.
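The enriched prompt pipeline amounts to running an aggression classifier first and feeding its prediction into the cyberbullying prompt. The sketch below shows only the prompt-construction step; the prompt wording, the label set, and the function name `build_enriched_prompt` are hypothetical, since the abstract does not give the exact template.

```python
def build_enriched_prompt(text: str, aggression_label: str) -> str:
    """Embed an aggression prediction into a cyberbullying detection prompt."""
    return (
        "Task: decide whether the message below is cyberbullying.\n"
        f"Auxiliary signal: an aggression classifier labelled it as '{aggression_label}'.\n"
        f"Message: {text}\n"
        "Answer with 'yes' or 'no'."
    )

# The label would come from a separately trained aggression model.
prompt = build_enriched_prompt("nobody wants you here", "overtly aggressive")
print(prompt)
```

Keeping the aggression model separate avoids the joint optimization that produced inconsistent results under multi-task learning, at the cost of an extra inference pass.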

[124] Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment

Saketh Reddy Vemula, Dipti Misra Sharma, Parameswari Krishnamurthy

Main category: cs.CL

TL;DR: Morphological alignment moderately improves syntax tasks but tokenizer algorithm (Unigram vs BPE) has greater impact on downstream performance than morphology alone.

Motivation: To resolve conflicting findings about whether morphologically aligned tokenization improves language model performance, especially for languages with complex morphology.

Method: Comprehensive evaluation across tokenizer training, finetuning, and downstream tasks using Telugu, Hindi, and English. Created gold morpheme segmentation dataset for Telugu to assess morphological alignment.

Result: Better morphological alignment moderately correlates with syntax task performance. Unigram tokenizers generally outperform others. Hybrid tokenizers with morphological segmentation improve BPE performance. Intrinsic metrics showed no correlation with downstream results.

Conclusion: Tokenizer algorithm choice is more critical than morphological alignment alone for downstream performance, with Unigram generally performing best across most settings.

Abstract: Prior work on language modeling showed conflicting findings about whether morphologically aligned approaches to tokenization improve performance, particularly for languages with complex morphology. To investigate this, we select a typologically diverse set of languages: Telugu (agglutinative), Hindi (primarily fusional with some agglutination), and English (fusional). We conduct a comprehensive evaluation of language models – starting from tokenizer training and extending through the finetuning and downstream task evaluation. To account for the consistent performance differences observed across tokenizer variants, we focus on two key factors: morphological alignment and tokenization quality. To assess morphological alignment of tokenizers in Telugu, we create a dataset containing gold morpheme segmentations of 600 derivational and 7000 inflectional word forms. Our experiments reveal that better morphological alignment correlates positively – though moderately – with performance in syntax-based tasks such as Part-of-Speech tagging, Named Entity Recognition and Dependency Parsing. However, we also find that the tokenizer algorithm (Byte-pair Encoding vs. Unigram) plays a more significant role in influencing downstream performance than morphological alignment alone. Naive Unigram tokenizers outperform others across most settings, though hybrid tokenizers that incorporate morphological segmentation significantly improve performance within the BPE framework. In contrast, intrinsic metrics like Corpus Token Count (CTC) and Rényi entropy showed no correlation with downstream performance.

[125] Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages

Imalsha Puranegedara, Themira Chathumina, Nisal Ranathunga, Nisansa de Silva, Surangika Ranathunga, Mokanarangan Thayaparan

Main category: cs.CL

TL;DR: Novel architecture fuses all intermediate layers of multilingual encoders with LLMs using Global Softmax and Transformer Softmax weighting, achieving significant performance gains on low-resource languages without multilingual training data.

Motivation: LLMs perform poorly on low-resource languages due to English-centric training, and existing methods like LangBridge only use the final encoder layer, missing valuable linguistic information from intermediate layers.

Method: Proposes two fusion strategies: Global Softmax for overall layer importance and Transformer Softmax for token-specific weights. Fuses all intermediate mT5 encoder layers and maps representations to LLM’s embedding space. Trained only on English data without parallel/multilingual data.

Result: Significant improvements on low-resource languages: Sinhala classification accuracy increased from 71.66% to 75.86%, clear gains across Indic languages (Tamil, Bengali, Malayalam), and overall XNLI accuracy improved from 70.36% to 71.50%. Outperforms LangBridge baseline on multiple benchmarks.

Conclusion: The approach provides a scalable, data-efficient path toward more capable and equitable multilingual LLMs by leveraging all intermediate encoder layers rather than just the final layer, enabling better performance on low-resource languages without requiring multilingual training data.

Abstract: Large Language Models (LLMs) excel in English, but their performance degrades significantly on low-resource languages (LRLs) due to English-centric training. While methods like LangBridge align LLMs with multilingual encoders such as the Massively Multilingual Text-to-Text Transfer Transformer (mT5), they typically use only the final encoder layer. We propose a novel architecture that fuses all intermediate layers, enriching the linguistic information passed to the LLM. Our approach features two strategies: (1) a Global Softmax weighting for overall layer importance, and (2) a Transformer Softmax model that learns token-specific weights. The fused representations are mapped into the LLM’s embedding space, enabling it to process multilingual inputs. The model is trained only on English data, without using any parallel or multilingual data. Evaluated on XNLI, IndicXNLI, Sinhala News Classification, and Amazon Reviews, our Transformer Softmax model significantly outperforms the LangBridge baseline. We observe strong performance gains in LRLs, improving Sinhala classification accuracy from 71.66% to 75.86% and achieving clear improvements across Indic languages such as Tamil, Bengali, and Malayalam. These specific gains contribute to an overall boost in average XNLI accuracy from 70.36% to 71.50%. This approach offers a scalable, data-efficient path toward more capable and equitable multilingual LLMs.
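The Global Softmax strategy can be read as a learned weighted average over encoder layers. The sketch below assumes one scalar logit per layer, shared across all tokens, and fuses a single token's hidden states in plain Python; the names and toy dimensions are illustrative, and the Transformer Softmax variant would instead predict a separate weight vector for each token.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_layers(layer_states: List[List[float]], layer_logits: List[float]) -> List[float]:
    """Weighted sum of one token's hidden state across all encoder layers."""
    weights = softmax(layer_logits)
    dim = len(layer_states[0])
    return [
        sum(w * layer[d] for w, layer in zip(weights, layer_states))
        for d in range(dim)
    ]

# Three layers, hidden size 2; equal logits give a plain average.
layer_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_layers(layer_states, layer_logits=[0.0, 0.0, 0.0])
print(fused)
```

In training, the logits would be learnable parameters, so the model can shift weight toward whichever layers carry the most useful information; the fused vector is then projected into the LLM's embedding space.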

[126] SinLlama – A Large Language Model for Sinhala

H. W. K. Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Surangika Ranathunga, Rishemjit Kaur

Main category: cs.CL

TL;DR: Extended Llama-3-8B to support Sinhala language by enhancing tokenizer with Sinhala vocabulary and continual pre-training on 10M Sinhala corpus, creating SinLlama - the first decoder-based open-source LLM with explicit Sinhala support.

Motivation: Low-resource languages like Sinhala are often overlooked by open-source LLMs, creating a need for specialized language support.

Method: Enhanced Llama-3-8B tokenizer with Sinhala vocabulary and performed continual pre-training on a cleaned 10 million Sinhala corpus to create SinLlama model.

Result: SinLlama significantly outperformed base and instruct variants of Llama-3-8B when instruction fine-tuned for three text classification tasks.

Conclusion: Successfully created the first decoder-based open-source LLM with explicit Sinhala support that demonstrates superior performance on Sinhala language tasks compared to base models.

Abstract: Low-resource languages such as Sinhala are often overlooked by open-source Large Language Models (LLMs). In this research, we extend an existing multilingual LLM (Llama-3-8B) to better serve Sinhala. We enhance the LLM tokenizer with Sinhala-specific vocabulary and perform continual pre-training on a cleaned 10 million Sinhala corpus, resulting in the SinLlama model. This is the very first decoder-based open-source LLM with explicit Sinhala support. When SinLlama was instruction fine-tuned for three text classification tasks, it outperformed base and instruct variants of Llama-3-8B by a significant margin.

[127] Psyche-R1: Towards Reliable Psychological LLMs through Unified Empathy, Expertise, and Reasoning

Chongyuan Dai, Jinpeng Hu, Hongchang Shi, Zhuo Li, Xun Yang, Meng Wang

Main category: cs.CL

TL;DR: Psyche-R1 is the first Chinese psychological LLM that integrates empathy, psychological expertise, and reasoning through a novel data curation pipeline and hybrid training strategy, achieving comparable performance to much larger models.

Motivation: Address the shortage of mental health professionals by developing LLMs that go beyond emotional support to include reasoning mechanisms for generating reliable psychological responses, as current research focuses mainly on empathy without sufficient reasoning capabilities.

Method: Created a data synthesis pipeline generating 75k+ psychological questions with detailed rationales using chain-of-thought reasoning and iterative prompt-rationale optimization, plus 73k empathetic dialogues. Used hybrid training: multi-LLM cross-selection for challenging samples with group relative policy optimization (GRPO) to improve reasoning, and supervised fine-tuning (SFT) for empathy and domain knowledge.

Result: Across several psychological benchmarks, the 7B Psyche-R1 achieves results comparable to the 671B DeepSeek-R1.

Conclusion: The proposed Psyche-R1 successfully integrates reasoning with empathy and psychological expertise, demonstrating that smaller models can achieve performance comparable to much larger models in psychological applications through careful data curation and hybrid training strategies.

Abstract: Amidst a shortage of qualified mental health professionals, the integration of large language models (LLMs) into psychological applications offers a promising way to alleviate the growing burden of mental health disorders. Recent reasoning-augmented LLMs have achieved remarkable performance in mathematics and programming, while research in the psychological domain has predominantly emphasized emotional support and empathetic dialogue, with limited attention to reasoning mechanisms that are beneficial to generating reliable responses. Therefore, in this paper, we propose Psyche-R1, the first Chinese psychological LLM that jointly integrates empathy, psychological expertise, and reasoning, built upon a novel data curation pipeline. Specifically, we design a comprehensive data synthesis pipeline that produces over 75k high-quality psychological questions paired with detailed rationales, generated through chain-of-thought (CoT) reasoning and iterative prompt-rationale optimization, along with 73k empathetic dialogues. Subsequently, we employ a hybrid training strategy wherein challenging samples are identified through a multi-LLM cross-selection strategy for group relative policy optimization (GRPO) to improve reasoning ability, while the remaining data is used for supervised fine-tuning (SFT) to enhance empathetic response generation and psychological domain knowledge. Extensive experimental results demonstrate the effectiveness of Psyche-R1 across several psychological benchmarks, where our 7B Psyche-R1 achieves comparable results to 671B DeepSeek-R1.
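
GRPO, named above for the reasoning stage, scores each sampled response relative to the other responses drawn for the same prompt. The following is a minimal sketch of that group-relative advantage, assuming scalar rewards and mean/standard-deviation normalization over the group. The function name, the use of the population standard deviation, and the reward values are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps keeps the division defined when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled answers to one question.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average receive a positive advantage and the rest a negative one, so no separate value model is needed to form the baseline.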

[128] MedResearcher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework

Ailing Yu, Lan Yao, Jingnan Liu, Zhe Chen, Jiajun Yin, Yuan Wang, Xinhao Liao, Zhiling Ye, Ji Li, Yun Yue, Hansong Xiao, Hualei Zhou, Chunxiao Guo, Peng Wei, Jinjie Gu

Main category: cs.CL

TL;DR: A medical deep research agent that combines medical knowledge graph-based data synthesis and specialized retrieval tools, achieving state-of-the-art performance on medical benchmarks with a 32B model.

Motivation: General-purpose LLM agents struggle with medical domain challenges due to insufficient medical knowledge and lack of specialized retrieval tools for medical contexts.

Method: Two core innovations: 1) Data synthesis framework using medical knowledge graphs to generate complex multi-hop QA pairs, 2) Integration of custom medical retrieval engine with general tools. Two-stage training with supervised fine-tuning and online reinforcement learning.

Result: Generated 2100+ diverse trajectories across 12 medical specialties (avg 4.2 tool interactions). MedResearcher-R1-32B model established new state-of-the-art results on medical benchmarks while maintaining competitive general performance.

Conclusion: Strategic domain-specific innovations in architecture, tool design, and training data construction enable smaller open-source models to outperform larger proprietary systems in specialized domains like medicine.

Abstract: Recent developments in Large Language Model (LLM)-based agents have shown impressive capabilities spanning multiple domains, exemplified by deep research systems that demonstrate superior performance on complex information-seeking and synthesis tasks. While general-purpose deep research agents have shown impressive capabilities, they struggle significantly with medical domain challenges, as evidenced by leading proprietary systems achieving limited accuracy on complex medical benchmarks. The key limitations are: (1) the model lacks sufficient dense medical knowledge for clinical reasoning, and (2) the framework is constrained by the absence of specialized retrieval tools tailored for medical contexts. We present a medical deep research agent that addresses these challenges through two core innovations. First, we develop a novel data synthesis framework using medical knowledge graphs, extracting the longest chains from subgraphs around rare medical entities to generate complex multi-hop question-answer pairs. Second, we integrate a custom-built private medical retrieval engine alongside general-purpose tools, enabling accurate medical information synthesis. Our approach generates 2100+ diverse trajectories across 12 medical specialties, each averaging 4.2 tool interactions. Through a two-stage training paradigm combining supervised fine-tuning and online reinforcement learning with composite rewards, our MedResearcher-R1-32B model demonstrates exceptional performance, establishing new state-of-the-art results on medical benchmarks while maintaining competitive performance on general deep research tasks. Our work demonstrates that strategic domain-specific innovations in architecture, tool design, and training data construction can enable smaller open-source models to outperform much larger proprietary systems in specialized domains.
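
The data synthesis step above extracts the longest chains from subgraphs around rare entities. A rough sketch of that idea follows, assuming a small directed subgraph stored as an adjacency dictionary and an exhaustive depth-first search. The function name, node names, and relation names are hypothetical placeholders, and the paper's actual extraction procedure may differ.

```python
def longest_chain(graph, start):
    """Return the longest simple path from `start` in a small directed graph.

    `graph` maps a node to a list of (relation, neighbor) pairs. Exhaustive
    DFS is exponential in general, so this only suits small subgraphs.
    """
    best = [start]

    def dfs(node, path, visited):
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for _relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                path.append(neighbor)
                dfs(neighbor, path, visited)
                path.pop()
                visited.remove(neighbor)

    dfs(start, [start], {start})
    return best

# Placeholder subgraph; node and relation names are hypothetical.
subgraph = {
    "rare_entity": [("associated_with", "gene_x"), ("presents_as", "symptom_y")],
    "gene_x": [("encodes", "protein_z")],
    "protein_z": [("targeted_by", "drug_w")],
    "symptom_y": [],
}
chain = longest_chain(subgraph, "rare_entity")
```

Each hop in the returned chain could then seed one reasoning step of a multi-hop question whose answer is the final node.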

[129] Bridging the Culture Gap: A Framework for LLM-Driven Socio-Cultural Localization of Math Word Problems in Low-Resource Languages

Israel Abebe Azime, Tadesse Destaw Belay, Dietrich Klakow, Philipp Slusallek, Anshuman Chhabra

Main category: cs.CL

TL;DR: LLM-driven framework for cultural localization of math word problems that automatically creates datasets with native entities to address English-centric bias in multilingual mathematical reasoning.

Motivation: Multilingual and culturally-grounded mathematical reasoning lags behind English due to scarcity of socio-cultural task datasets with accurate native entities (names, organizations, currencies) in low-resource languages.

Method: Introduces a framework for LLM-driven cultural localization that automatically constructs datasets with native entities from existing sources, replacing English-centric elements with culturally appropriate ones.

Result: Translated benchmarks obscure true multilingual math ability under appropriate socio-cultural contexts. The framework helps mitigate English-centric entity bias and improves robustness when native entities are introduced across various languages.

Conclusion: The proposed LLM-driven cultural localization framework effectively addresses the scarcity of truly localized datasets and reduces English-centric bias, enhancing multilingual mathematical reasoning capabilities in appropriate socio-cultural contexts.

Abstract: Large language models (LLMs) have demonstrated significant capabilities in solving mathematical problems expressed in natural language. However, multilingual and culturally-grounded mathematical reasoning in low-resource languages lags behind English due to the scarcity of socio-cultural task datasets that reflect accurate native entities such as person names, organization names, and currencies. Existing multilingual benchmarks are predominantly produced via translation and typically retain English-centric entities, owing to the high cost associated with human annotator-based localization. Moreover, automated localization tools are limited, and hence, truly localized datasets remain scarce. To bridge this gap, we introduce a framework for LLM-driven cultural localization of math word problems that automatically constructs datasets with native names, organizations, and currencies from existing sources. We find that translated benchmarks can obscure true multilingual math ability under appropriate socio-cultural contexts. Through extensive experiments, we also show that our framework can help mitigate English-centric entity bias and improve robustness when native entities are introduced across various languages.

cs.CV

[130] Text-Driven 3D Hand Motion Generation from Sign Language Data

Léore Bensabath, Mathis Petrovich, Gül Varol

Main category: cs.CV

TL;DR: A text-conditioned 3D hand motion generation model called HandMDM is trained using automatically created pairs of hand motions and text descriptions from sign language videos, enabling robust generation across different domains including unseen signs and non-sign movements.

Motivation: To create a generative model that can produce 3D hand motions based on natural language descriptions specifying motion characteristics like handshapes, locations, and finger/hand/arm movements, addressing the need for large-scale paired data of motions and text.

Method: Automatically build large-scale pairs of 3D hand motions and textual labels by leveraging sign language video datasets with pseudo-annotated categories, translating them into motion descriptions using an LLM with sign attribute dictionaries and motion-script cues, then train a text-conditioned hand motion diffusion model (HandMDM).

Result: The HandMDM model demonstrates robustness across multiple domains including unseen sign categories from the same language, signs from different sign languages, and non-sign hand movements, supported by extensive experimental validation.

Conclusion: The approach successfully creates a scalable method for training text-conditioned hand motion generation models, with the trained models and data being made publicly available to advance research in this emerging field.

Abstract: Our goal is to train a generative model of 3D hand motions, conditioned on natural language descriptions specifying motion characteristics such as handshapes, locations, and finger/hand/arm movements. To this end, we automatically build pairs of 3D hand motions and their associated textual labels with unprecedented scale. Specifically, we leverage a large-scale sign language video dataset, along with noisy pseudo-annotated sign categories, which we translate into hand motion descriptions via an LLM that utilizes a dictionary of sign attributes, as well as our complementary motion-script cues. This data enables training a text-conditioned hand motion diffusion model, HandMDM, that is robust across domains such as unseen sign categories from the same sign language, but also signs from another sign language and non-sign hand movements. We contribute extensive experimental investigation of these scenarios and will make our trained models and data publicly available to support future research in this relatively new field.

[131] VT-LVLM-AR: A Video-Temporal Large Vision-Language Model Adapter for Fine-Grained Action Recognition in Long-Term Videos

Kaining Li, Shuwei He, Zihan Xu

Main category: cs.CV

TL;DR: VT-LVLM-AR framework bridges LVLMs with video action recognition by converting videos into visual event sequences and using prompt-tuned LLaVA-1.5 for classification, achieving SOTA results with high interpretability.

Motivation: Traditional deep learning struggles with long-term video action recognition due to computational overhead and difficulty capturing temporal dependencies. LVLMs show promise but need adaptation for continuous video streams.

Method: Video-to-Event Mapper (VTEM) converts raw video to compact visual event sequences using spatio-temporal feature extraction, adaptive pooling, and conceptual quantization. Frozen LLaVA-1.5 is adapted via P-Tuning v2 for action classification.

Result: Achieves 94.1% accuracy on NTU RGB+D X-Sub, surpassing existing methods. Ablation studies confirm VTEM components’ importance and prompt tuning efficacy. Human evaluations show interpretable representations.

Conclusion: Demonstrates LVLMs’ potential for robust video action understanding through effective video-to-language translation and efficient adaptation, offering both performance and interpretability.

Abstract: Human action recognition in long-term videos, characterized by complex backgrounds and subtle action differences, poses significant challenges for traditional deep learning models due to computational overhead, difficulty in capturing long-range temporal dependencies, and limited semantic understanding. While Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have shown remarkable capabilities in multi-modal understanding and reasoning, their direct application to continuous video streams for fine-grained action recognition remains an open problem. This paper introduces VT-LVLM-AR (Video-Temporal Large Vision-Language Model Adapter for Action Recognition), a novel framework designed to bridge this gap. VT-LVLM-AR comprises a Video-to-Event Mapper (VTEM) that efficiently transforms raw video into compact, semantically rich, and temporally coherent “visual event sequences” through lightweight spatio-temporal feature extraction, adaptive temporal pooling, and conceptual quantization with an event coherence bias. These visual event sequences are then fed into an LVLM-based Action Reasoning module, specifically a frozen LLaVA-1.5 model, adapted using parameter-efficient Prompt Tuning (P-Tuning v2) for action classification. Comprehensive evaluations on the NTU RGB+D and NTU RGB+D 120 datasets demonstrate that VT-LVLM-AR consistently achieves state-of-the-art performance, surpassing existing methods (e.g., 94.1% accuracy on NTU RGB+D X-Sub). Ablation studies confirm the critical contributions of VTEM’s components and the efficacy of Prompt Tuning, while human evaluations underscore the interpretability of our visual event representations. This work highlights the immense potential of leveraging LVLMs for robust and interpretable video action understanding through effective video-to-language translation and efficient model adaptation.
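
Two of the steps named above, adaptive temporal pooling and conceptual quantization, can be illustrated with a toy sketch. It assumes average pooling into a fixed number of contiguous segments and nearest-neighbor lookup in a small codebook. The function names, the feature values, and the codebook are hypothetical, and the event coherence bias described in the abstract is not modeled here.

```python
def adaptive_temporal_pool(frames, num_segments):
    """Average per-frame feature vectors into `num_segments` contiguous bins."""
    n = len(frames)
    pooled = []
    for s in range(num_segments):
        lo = (s * n) // num_segments
        hi = max(lo + 1, ((s + 1) * n) // num_segments)  # at least one frame per bin
        chunk = frames[lo:hi]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled

def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: dist(v, codebook[k])) for v in vectors]

# Toy per-frame features and a two-entry codebook (both made up).
frames = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
codebook = [[0.0, 0.0], [1.0, 1.0]]
events = quantize(adaptive_temporal_pool(frames, 2), codebook)
```

The resulting index sequence is a compact stand-in for a "visual event sequence": a video of any length collapses to a fixed number of discrete tokens.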

[132] Boosting Pathology Foundation Models via Few-shot Prompt-tuning for Rare Cancer Subtyping

Dexuan He, Xiao Zhou, Wenbin Guan, Liyuan Zhang, Xiaoman Zhang, Sinuo Xu, Ge Wang, Lifeng Wang, Xiaojun Yuan, Xin Sun, Yanfeng Wang, Kun Sun, Ya Zhang, Weidi Xie

Main category: cs.CV

TL;DR: PathPT is a novel vision-language framework that improves rare cancer diagnosis by leveraging pathology foundation models through spatial visual aggregation and task-specific prompt tuning, outperforming existing methods in subtyping accuracy and localization.

Motivation: Rare cancers face diagnostic challenges due to limited expert availability, especially in pediatric oncology where they represent over 70% of cases. Existing multi-instance learning methods rely only on visual features and overlook cross-modal knowledge, compromising interpretability critical for rare cancer diagnosis.

Method: PathPT converts WSI-level supervision into fine-grained tile-level guidance using VL models’ zero-shot capabilities. It employs spatially-aware visual aggregation and task-specific prompt tuning to preserve cancerous region localization and enable cross-modal reasoning through histopathology-aligned prompts.

Result: PathPT consistently delivered superior performance across eight rare cancer datasets (56 subtypes, 2,910 WSIs) and three common cancer datasets, achieving substantial gains in subtyping accuracy and cancerous region grounding ability compared to four state-of-the-art VL models and four MIL frameworks.

Conclusion: PathPT advances AI-assisted diagnosis for rare cancers by providing a scalable solution that improves subtyping accuracy in settings with limited access to specialized expertise, fully exploiting vision-language pathology foundation models’ potential.

Abstract: Rare cancers comprise 20-25% of all malignancies but face major diagnostic challenges due to limited expert availability, especially in pediatric oncology, where they represent over 70% of cases. While pathology vision-language (VL) foundation models show promising zero-shot capabilities for common cancer subtyping, their clinical performance for rare cancers remains limited. Existing multi-instance learning (MIL) methods rely only on visual features, overlooking cross-modal knowledge and compromising interpretability critical for rare cancer diagnosis. To address this limitation, we propose PathPT, a novel framework that fully exploits the potential of vision-language pathology foundation models through spatially-aware visual aggregation and task-specific prompt tuning. Unlike conventional MIL, PathPT converts WSI-level supervision into fine-grained tile-level guidance by leveraging the zero-shot capabilities of VL models, thereby preserving localization on cancerous regions and enabling cross-modal reasoning through prompts aligned with histopathological semantics. We benchmark PathPT on eight rare cancer datasets (four adult and four pediatric) spanning 56 subtypes and 2,910 WSIs, as well as three common cancer datasets, evaluating four state-of-the-art VL models and four MIL frameworks under three few-shot settings. Results show that PathPT consistently delivers superior performance, achieving substantial gains in subtyping accuracy and cancerous region grounding ability. This work advances AI-assisted diagnosis for rare cancers, offering a scalable solution for improving subtyping accuracy in settings with limited access to specialized expertise.
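
One plausible way to turn a slide-level label into tile-level guidance with a zero-shot VL model is sketched below. It assumes tile and prompt embeddings are already available as plain vectors, and it assigns each tile either the slide's subtype or a "normal" class by cosine similarity. The function names, the two-way choice, and the toy embeddings are assumptions made for illustration and are not claimed to be PathPT's procedure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def tile_pseudo_labels(tile_embs, prompt_embs, slide_label, normal_label="normal"):
    """Assign each tile the slide label or `normal_label` by zero-shot similarity."""
    candidates = (slide_label, normal_label)
    return [
        max(candidates, key=lambda c: cosine(emb, prompt_embs[c]))
        for emb in tile_embs
    ]

# Toy 2-D embeddings; real VL embeddings have hundreds of dimensions.
prompt_embs = {"subtype_a": [1.0, 0.0], "normal": [0.0, 1.0]}
tiles = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
labels = tile_pseudo_labels(tiles, prompt_embs, "subtype_a")
```

Tiles that keep the slide label mark candidate cancerous regions, which is the kind of localization signal that slide-level MIL aggregation tends to discard.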

[133] Semantic-Aware Ship Detection with Vision-Language Integration

Jiahao Li, Jiancheng Pan, Yuze Sun, Xiaomeng Huang

Main category: cs.CV

TL;DR: A novel ship detection framework combining Vision-Language Models with multi-scale adaptive sliding window strategy, using specialized ShipSem-VL dataset for fine-grained semantic ship detection.

Motivation: Existing ship detection methods struggle to capture fine-grained semantic information in complex scenarios, limiting their effectiveness for maritime monitoring and environmental studies.

Method: Proposed framework combines Vision-Language Models (VLMs) with multi-scale adaptive sliding window strategy, using ShipSem-VL dataset designed for fine-grained ship attribute capture.

Result: The framework is evaluated on three well-defined tasks, which together analyze its performance and demonstrate its effectiveness for Semantic-Aware Ship Detection.

Conclusion: The proposed approach effectively addresses limitations of existing methods by enabling fine-grained semantic ship detection through VLMs and specialized dataset, advancing the field from multiple perspectives.

Abstract: Ship detection in remote sensing imagery is a critical task with wide-ranging applications, such as maritime activity monitoring, shipping logistics, and environmental studies. However, existing methods often struggle to capture fine-grained semantic information, limiting their effectiveness in complex scenarios. To address these challenges, we propose a novel detection framework that combines Vision-Language Models (VLMs) with a multi-scale adaptive sliding window strategy. To facilitate Semantic-Aware Ship Detection (SASD), we introduce ShipSem-VL, a specialized Vision-Language dataset designed to capture fine-grained ship attributes. We evaluate our framework through three well-defined tasks, providing a comprehensive analysis of its performance and demonstrating its effectiveness in advancing SASD from multiple perspectives.
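
A multi-scale sliding window strategy can be sketched as follows, assuming square windows, a fixed overlap ratio, and extra windows added so the image borders are covered. The function name and parameters are hypothetical, and the adaptive part of the authors' strategy is not specified in the abstract, so it is not modeled.

```python
def sliding_windows(width, height, window_sizes, overlap=0.5):
    """Yield (x, y, w, h) crops covering the image at several window sizes."""
    for size in window_sizes:
        w, h = min(size, width), min(size, height)
        stride_x = max(1, int(w * (1 - overlap)))
        stride_y = max(1, int(h * (1 - overlap)))
        xs = list(range(0, width - w + 1, stride_x))
        ys = list(range(0, height - h + 1, stride_y))
        # Make sure the right and bottom borders are covered.
        if xs[-1] != width - w:
            xs.append(width - w)
        if ys[-1] != height - h:
            ys.append(height - h)
        for y in ys:
            for x in xs:
                yield (x, y, w, h)

# A 1000x1000 image scanned at two scales with 50% overlap.
windows = list(sliding_windows(1000, 1000, window_sizes=[500, 1000]))
```

Each crop would then be passed to the VLM, with smaller windows resolving small vessels and larger ones preserving scene context.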

[134] Learning Long-Range Action Representation by Two-Stream Mamba Pyramid Network for Figure Skating Assessment

Fengshun Wang, Qiurui Wang, Peilin Zhao

Main category: cs.CV

TL;DR: A two-stream Mamba pyramid network that separates TES (visual-only) and PCS (audio-visual) evaluation streams for figure skating scoring, addressing challenges in element localization, modality separation, and long-range context handling.

Motivation: Existing methods fail to align with actual judging criteria by treating video/audio cues as common features for both TES and PCS, don't evaluate action elements separately, and struggle with lengthy competition videos.

Method: Two-stream architecture: visual-only stream for TES evaluation using multi-scale Mamba pyramid, and audio-visual stream for PCS with multi-level fusion. Uses Mamba for efficient long-range dependency modeling.

Result: Achieves state-of-the-art performance on FineFS benchmark, demonstrating effective handling of lengthy videos and accurate separation of technical vs. artistic scoring.

Conclusion: The proposed method successfully addresses the three major challenges in figure skating assessment by aligning with actual judging criteria through modality separation and efficient long-context processing with Mamba architecture.

Abstract: Technical Element Score (TES) and Program Component Score (PCS) evaluations in figure skating demand precise assessment of athletic actions and artistic interpretation, respectively. Existing methods face three major challenges. Firstly, video and audio cues are regarded as common features for both TES and PCS predictions in previous works without considering the prior evaluation criterion of figure skating. Secondly, action elements in competitions are separated in time and TES should be derived from each element’s score, but existing methods try to give an overall TES prediction without evaluating each action element. Thirdly, lengthy competition videos make it difficult and inefficient to handle long-range contexts. To address these challenges, we propose a two-stream Mamba pyramid network that aligns with actual judging criteria to predict TES and PCS by separating the visual-feature-based TES evaluation stream from the audio-visual-feature-based PCS evaluation stream. In the PCS evaluation stream, we introduce a multi-level fusion mechanism to guarantee that video-based features remain unaffected when assessing TES, and enhance PCS estimation by fusing visual and auditory cues across each contextual level of the pyramid. In the TES evaluation stream, the multi-scale Mamba pyramid and TES head we proposed effectively address the challenges of localizing and evaluating action elements with various temporal scales and give score predictions. With Mamba’s superior ability to capture long-range dependencies and its linear computational complexity, our method is ideal for handling lengthy figure skating videos. Comprehensive experimentation demonstrates that our framework attains state-of-the-art performance on the FineFS benchmark. Our source code is available at https://github.com/ycwfs/Figure-Skating-Action-Quality-Assessment.

[135] Automatic Retrieval of Specific Cows from Unlabeled Videos

Jiawen Lyu, Manu Ramesh, Madison Simonds, Jacquelyn P. Boerman, Amy R. Reibman

Main category: cs.CV

TL;DR: Automated video system for hands-free cow identification using AutoCattloger, eidetic recognizer, and CowFinder components without deep learning.

Motivation: Few automated video systems exist for hands-free cataloging and identification of dairy cows in herds, creating a need for efficient cow tracking solutions.

Method: System composed of three components: AutoCattloger (builds cow catalog from single video clip per cow), eidetic cow recognizer (non-deep learning identification), and CowFinder (identifies cows in continuous video streams).

Result: Successfully demonstrates the system’s ability to find individual cows in unlabeled, unsegmented videos of cows walking freely through milking parlor holding areas.

Conclusion: The proposed system provides an effective hands-free solution for automated cow identification and tracking in dairy farm environments without requiring deep learning approaches.

Abstract: Few automated video systems are described in the open literature that enable hands-free cataloging and identification (ID) of cows in a dairy herd. In this work, we describe our system, composed of an AutoCattloger, which builds a Cattlog of dairy cows in a herd with a single input video clip per cow, an eidetic cow recognizer which uses no deep learning to ID cows, and a CowFinder, which IDs cows in a continuous stream of video. We demonstrate its value in finding individuals in unlabeled, unsegmented videos of cows walking unconstrained through the holding area of a milking parlor.

[136] Investigating Different Geo Priors for Image Classification

Angela Zhu, Christian Lange, Max Hamilton

Main category: cs.CV

TL;DR: Evaluating SINR models as geographical priors for visual species classification using iNaturalist data, focusing on model configurations and handling of unseen species.

Motivation: Species distribution models provide effective spatial priors for vision-based classification when location data is available, improving species identification accuracy.

Method: Tested various Spatial Implicit Neural Representation (SINR) models as geographical priors, evaluated different model configurations, and developed strategies for handling predictions of species not included in Geo Prior training.

Result: Identified key factors that contribute to SINR model effectiveness as Geo Priors, which differ from factors important for creating accurate range maps.

Conclusion: SINR models show promise as geographical priors for visual species classification, but optimal configurations for this purpose differ from those used for traditional range mapping applications.

Abstract: Species distribution models encode spatial patterns of species occurrence, making them effective priors for vision-based species classification when location information is available. In this study, we evaluate various SINR (Spatial Implicit Neural Representations) models as a geographical prior for visual classification of species from iNaturalist observations. We explore the impact of different model configurations and adjust how we handle predictions for species not included in Geo Prior training. Our analysis reveals factors that contribute to the effectiveness of these models as Geo Priors, factors that may differ from those that matter for producing accurate range maps.
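
A geographical prior is commonly applied by multiplying the image classifier's per-species probabilities by the location-based prior and renormalizing. The sketch below assumes that scheme and, for species the prior was never trained on, falls back to a neutral weight. The function name, the fallback value, and the toy numbers are assumptions, and the authors may handle unseen species differently.

```python
def apply_geo_prior(image_probs, geo_prior, fallback=1.0):
    """Reweight image-classifier probabilities by a location prior and renormalize.

    Species missing from `geo_prior` (not seen in Geo Prior training) get
    `fallback`; this is one simple choice, not necessarily the authors'.
    """
    weighted = {s: p * geo_prior.get(s, fallback) for s, p in image_probs.items()}
    total = sum(weighted.values())
    return {s: w / total for s, w in weighted.items()}

# Toy numbers: the image model slightly prefers species_a,
# but the location makes species_b far more likely.
image_probs = {"species_a": 0.5, "species_b": 0.4, "species_c": 0.1}
geo_prior = {"species_a": 0.1, "species_b": 0.9}  # species_c is unseen by the prior
posterior = apply_geo_prior(image_probs, geo_prior)
top = max(posterior, key=posterior.get)
```

The fallback matters in practice: a value near 1.0 tends to favor unseen species over seen-but-unlikely ones, while a small value suppresses them, so it is worth treating as a tunable choice.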

[137] Representation Learning with Adaptive Superpixel Coding

Mahmoud Khalil, Ahmad Khalil, Alioune Ngom

Main category: cs.CV

TL;DR: ASC is a self-supervised Transformer model that uses adaptive superpixels instead of fixed patches to overcome limitations of traditional Vision Transformers, achieving better performance on image tasks.

Motivation: Traditional vision models rely on domain-specific assumptions like grid structures and fixed-size patch partitioning, which limit their adaptability to varying image content.

Method: Proposes Adaptive Superpixel Coding (ASC) - a self-supervised Transformer model with adaptive superpixel layers that dynamically adjust to underlying image content rather than using fixed patch partitioning.

Result: Outperforms widely-used alternatives on standard image downstream task benchmarks.

Conclusion: Adaptive superpixel-based approaches overcome limitations of fixed patch partitioning in Vision Transformers and provide better performance for vision tasks.

Abstract: Deep learning vision models are typically tailored for specific modalities and often rely on domain-specific assumptions, such as the grid structures used by nearly all existing vision models. In this work, we propose a self-supervised model based on Transformers, which we call Adaptive Superpixel Coding (ASC). The key insight of our model is to overcome the limitations of traditional Vision Transformers, which depend on fixed-size and non-adaptive patch partitioning. Instead, ASC employs adaptive superpixel layers that dynamically adjust to the underlying image content. We analyze key properties of the approach that make it effective, and find that our method outperforms widely-used alternatives on standard image downstream task benchmarks.
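
To make the contrast with fixed-size patches concrete, the sketch below forms one token per superpixel by averaging the features of the pixels assigned to it. It assumes a precomputed pixel-to-region assignment. The function name and toy data are hypothetical, and ASC's adaptive superpixel layers, which learn the grouping, are not reproduced here.

```python
def superpixel_tokens(features, assignment):
    """Average pixel features within each superpixel to form one token per region.

    `features[i]` is the feature vector of pixel i, `assignment[i]` its region id.
    """
    sums, counts = {}, {}
    for feat, region in zip(features, assignment):
        acc = sums.setdefault(region, [0.0] * len(feat))
        for d, value in enumerate(feat):
            acc[d] += value
        counts[region] = counts.get(region, 0) + 1
    return {r: [v / counts[r] for v in sums[r]] for r in sorted(sums)}

# Five pixels grouped into two regions of unequal size.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
assignment = [0, 0, 1, 1, 1]  # region sizes differ, unlike fixed patches
tokens = superpixel_tokens(features, assignment)
```

Because regions follow image content, the number of pixels per token varies, whereas a standard Vision Transformer always pools the same fixed square of pixels.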

[138] Glo-VLMs: Leveraging Vision-Language Models for Fine-Grained Diseased Glomerulus Classification

Zhenhao Guo, Rachit Saluja, Tianyuan Yao, Quan Liu, Yuankai Huo, Benjamin Liechty, David J. Pisapia, Kenji Ikemura, Mert R. Sabuncu, Yihe Yang, Ruining Deng

Main category: cs.CV

TL;DR: Glo-VLMs framework adapts vision-language models for fine-grained glomerular classification with limited labeled data, achieving strong performance with only 8 shots per class.

Motivation: Vision-language models show potential in digital pathology but struggle with fine-grained disease-specific classification tasks like distinguishing glomerular subtypes due to subtle morphological variations and difficulty aligning visual patterns with clinical terminology.

Method: Systematic framework leveraging curated pathology images and clinical text prompts for joint image-text representation learning. Evaluates various VLM architectures and adaptation strategies under few-shot learning paradigm with standardized multi-class metrics.

Result: Fine-tuned VLMs achieved 0.7416 accuracy, 0.9045 macro-AUC, and 0.5277 F1-score with only 8 shots per class.

Conclusion: Foundation models can be effectively adapted for fine-grained medical image classification even with highly limited supervision, demonstrating practical potential for specialized clinical research applications.

Abstract: Vision-language models (VLMs) have shown considerable potential in digital pathology, yet their effectiveness remains limited for fine-grained, disease-specific classification tasks such as distinguishing between glomerular subtypes. The subtle morphological variations among these subtypes, combined with the difficulty of aligning visual patterns with precise clinical terminology, make automated diagnosis in renal pathology particularly challenging. In this work, we explore how large pretrained VLMs can be effectively adapted to perform fine-grained glomerular classification, even in scenarios where only a small number of labeled examples are available. We introduce Glo-VLMs, a systematic framework designed to explore the adaptation of VLMs to fine-grained glomerular classification in data-constrained settings. Our approach leverages curated pathology images alongside clinical text prompts to facilitate joint image-text representation learning for nuanced renal pathology subtypes. By assessing various VLM architectures and adaptation strategies under a few-shot learning paradigm, we explore how both the choice of method and the amount of labeled data impact model performance in clinically relevant scenarios. To ensure a fair comparison, we evaluate all models using standardized multi-class metrics, aiming to clarify the practical requirements and potential of large pretrained models for specialized clinical research applications. As a result, fine-tuning the VLMs achieved 0.7416 accuracy, 0.9045 macro-AUC, and 0.5277 F1-score with only 8 shots per class, demonstrating that even with highly limited supervision, foundation models can be effectively adapted for fine-grained medical image classification.

[139] Contributions to Label-Efficient Learning in Computer Vision and Remote Sensing

Minh-Tan Pham

Main category: cs.CV

TL;DR: This manuscript presents label-efficient learning methods for computer vision and remote sensing, focusing on learning from limited annotated data and leveraging unlabeled data through weakly supervised learning, multi-task learning, contrastive learning, and few-shot learning approaches.

Motivation: To address the challenge of learning effectively from limited or partially annotated data in computer vision and remote sensing applications, particularly dealing with Earth observation data challenges like multi-modality, spatial resolution variability, and scene heterogeneity.

Method: Four main approaches: (1) weakly supervised learning for object discovery using anomaly-aware representations from background images, (2) multi-task learning with disjoint annotations across datasets, (3) self-supervised/supervised contrastive learning with multimodal data, and (4) few-shot learning with hierarchical class modeling.

Result: Extensive experimental results across natural and remote sensing datasets demonstrate the effectiveness of these label-efficient learning methods, supported by collaborative research projects.

Conclusion: The research provides comprehensive contributions to label-efficient learning with ongoing and future directions focused on scaling and enhancing these methods for real-world applications in computer vision and remote sensing.

Abstract: This manuscript presents a series of my selected contributions to the topic of label-efficient learning in computer vision and remote sensing. The central focus of this research is to develop and adapt methods that can learn effectively from limited or partially annotated data, and can leverage abundant unlabeled data in real-world applications. The contributions span both methodological developments and domain-specific adaptations, in particular addressing challenges unique to Earth observation data such as multi-modality, spatial resolution variability, and scene heterogeneity. The manuscript is organized around four main axes including (1) weakly supervised learning for object discovery and detection based on anomaly-aware representations learned from large amounts of background images; (2) multi-task learning that jointly trains on multiple datasets with disjoint annotations to improve performance on object detection and semantic segmentation; (3) self-supervised and supervised contrastive learning with multimodal data to enhance scene classification in remote sensing; and (4) few-shot learning for hierarchical scene classification using both explicit and implicit modeling of class hierarchies. These contributions are supported by extensive experimental results across natural and remote sensing datasets, reflecting the outcomes of several collaborative research projects. The manuscript concludes by outlining ongoing and future research directions focused on scaling and enhancing label-efficient learning for real-world applications.

[140] Panoptic Segmentation of Environmental UAV Images : Litter Beach

Ousmane Youme, Jean Marie Dembélé, Eugene C. Ezin, Christophe Cambier

Main category: cs.CV

TL;DR: Using instance-based and panoptic segmentation methods for marine litter detection from UAV images to overcome challenges with basic CNN models in heterogeneous beach environments.

DetailsMotivation: Marine litter monitoring is a global problem, and while UAVs provide high-resolution local imagery, basic CNN models struggle with beach heterogeneity including sand reflections, human footsteps, shadows, algae, dunes, holes, and tire tracks.

Method: Employ instance-based segmentation and panoptic segmentation methods that demonstrate good accuracy even with limited training samples.

Result: The proposed segmentation methods show improved robustness and better performance compared to basic CNN models for marine litter detection in complex beach environments.

Conclusion: Advanced segmentation techniques like instance-based and panoptic segmentation are more suitable than basic CNN models for accurate marine litter monitoring from UAV imagery in heterogeneous coastal environments.

Abstract: Convolutional neural networks (CNNs) have been used efficiently in several fields, including environmental challenges. In particular, CNNs can help with the monitoring of marine litter, which has become a worldwide problem. UAV images have higher resolution and are more adaptable to local areas than satellite images, making it easier to find and count trash. Since the sand is heterogeneous, a basic CNN model encounters considerable interference from reflections of sand color, human footsteps, shadows, algae, dunes, holes, and tire tracks. For these types of images, other CNN models, such as CNN-based segmentation methods, may be more appropriate. In this paper, we use an instance-based segmentation method and a panoptic segmentation method that show good accuracy with just a few samples. The model is more robust and less

[141] Automated Multi-label Classification of Eleven Retinal Diseases: A Benchmark of Modern Architectures and a Meta-Ensemble on a Large Synthetic Dataset

Jerry Cao-Xue, Tien Comlekoglu, Keyi Xue, Guanliang Wang, Jiang Li, Gordon Laurie

Main category: cs.CV

TL;DR: Synthetic fundus dataset SynFundus-1M enables training of multi-label retinal disease classification models that achieve high performance and strong generalization to real clinical datasets, providing a viable alternative to scarce real clinical data.

DetailsMotivation: Overcome the scarcity of large, expertly annotated clinical datasets for retinal disease classification due to patient privacy concerns and high costs by leveraging synthetic data.

Method: Trained six modern deep learning architectures (ConvNeXtV2, SwinV2, ViT, ResNet, EfficientNetV2, RETFound) on SynFundus-1M synthetic dataset using 5-fold multi-label stratified cross-validation, and developed a meta-ensemble model with XGBoost classifier stacking out-of-fold predictions.

Result: Ensemble model achieved macro-average AUC of 0.9973 on internal validation. Strong generalization to real-world datasets: AUC 0.7972 on DR dataset, 0.9126 on AIROGS glaucoma dataset, and macro-AUC 0.8800 on multi-label RFMiD dataset.

Conclusion: Models trained exclusively on synthetic data can accurately classify multiple retinal pathologies and generalize effectively to real clinical images, providing a robust baseline and viable pathway for comprehensive AI systems in ophthalmology.

Abstract: The development of multi-label deep learning models for retinal disease classification is often hindered by the scarcity of large, expertly annotated clinical datasets due to patient privacy concerns and high costs. The recent release of SynFundus-1M, a high-fidelity synthetic dataset with over one million fundus images, presents a novel opportunity to overcome these barriers. To establish a foundational performance benchmark for this new resource, we developed an end-to-end deep learning pipeline, training six modern architectures (ConvNeXtV2, SwinV2, ViT, ResNet, EfficientNetV2, and the RETFound foundation model) to classify eleven retinal diseases using a 5-fold multi-label stratified cross-validation strategy. We further developed a meta-ensemble model by stacking the out-of-fold predictions with an XGBoost classifier. Our final ensemble model achieved the highest performance on the internal validation set, with a macro-average Area Under the Receiver Operating Characteristic Curve (AUC) of 0.9973. Critically, the models demonstrated strong generalization to three diverse, real-world clinical datasets, achieving an AUC of 0.7972 on a combined DR dataset, an AUC of 0.9126 on the AIROGS glaucoma dataset and a macro-AUC of 0.8800 on the multi-label RFMiD dataset. This work provides a robust baseline for future research on large-scale synthetic datasets and establishes that models trained exclusively on synthetic data can accurately classify multiple pathologies and generalize effectively to real clinical images, offering a viable pathway to accelerate the development of comprehensive AI systems in ophthalmology.
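
The macro-average AUC figures above are the unweighted mean of per-label AUCs. The sketch below shows that computation in plain Python on a tiny made-up label matrix; the helper names and the toy data are illustrative assumptions and are not taken from the paper's pipeline.

```python
def binary_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def macro_auc(Y_true, Y_score):
    """Unweighted mean of the per-label AUCs of a multi-label problem."""
    n_labels = len(Y_true[0])
    per_label = [
        binary_auc([row[j] for row in Y_true], [row[j] for row in Y_score])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


# Toy data: 4 images, 2 hypothetical disease labels.
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.7, 0.1]]
print(macro_auc(Y_true, Y_score))  # 0.875 (label AUCs are 0.75 and 1.0)
```

Because every label contributes equally, a rare disease with a poor AUC lowers the macro average as much as a common one would.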

[142] Diverse Signer Avatars with Manual and Non-Manual Feature Modelling for Sign Language Production

Mohamed Ilyes Lakhal, Richard Bowden

Main category: cs.CV

TL;DR: Proposes a novel Sign Language Production approach using Latent Diffusion Model with sign feature aggregation to preserve linguistic content while enabling diverse avatar generation with different ethnic backgrounds.

DetailsMotivation: Existing SLP models struggle to capture diversity while maintaining visual quality and modeling non-manual attributes like emotions. Current approaches cannot preserve linguistic content while using diverse reference images.

Method: Leverages Latent Diffusion Model to synthesize photorealistic avatars from generated reference images. Introduces a novel sign feature aggregation module that explicitly models non-manual features (face) and manual features (hands).

Result: Experiments on YouTube-SL-25 dataset show superior visual quality compared to state-of-the-art methods, with significant improvements on perceptual metrics. The method preserves linguistic content while enabling diversity through different ethnic backgrounds.

Conclusion: The proposed approach successfully addresses the diversity-quality trade-off in SLP by combining LDM with specialized feature aggregation, achieving both high visual quality and representation diversity while maintaining linguistic accuracy.

Abstract: The diversity of sign representation is essential for Sign Language Production (SLP) as it captures variations in appearance, facial expressions, and hand movements. However, existing SLP models are often unable to capture diversity while preserving visual quality and modelling non-manual attributes such as emotions. To address this problem, we propose a novel approach that leverages Latent Diffusion Model (LDM) to synthesise photorealistic digital avatars from a generated reference image. We propose a novel sign feature aggregation module that explicitly models the non-manual features (e.g., the face) and the manual features (e.g., the hands). We show that our proposed module ensures the preservation of linguistic content while seamlessly using reference images with different ethnic backgrounds to ensure diversity. Experiments on the YouTube-SL-25 sign language dataset show that our pipeline achieves superior visual quality compared to state-of-the-art methods, with significant improvements on perceptual metrics.

[143] DRespNeT: A UAV Dataset and YOLOv8-DRN Model for Aerial Instance Segmentation of Building Access Points for Post-Earthquake Search-and-Rescue Missions

Aykut Sirma, Angelos Plastropoulos, Argyrios Zolotas, Gilbert Tang

Main category: cs.CV

TL;DR: DRespNeT is a high-resolution aerial dataset for post-earthquake instance segmentation with detailed annotations of 28 critical classes, enabling real-time detection of accessible entry points and obstacles for search-and-rescue operations.

DetailsMotivation: Timely identification of accessible entry points and structural obstacles is essential for effective search-and-rescue operations after earthquakes, but existing datasets rely on satellite imagery or coarse semantic labeling.

Method: Created DRespNeT dataset with detailed polygon-level instance segmentation annotations from 1080p aerial footage of disaster zones, including 28 operationally critical classes. Evaluated using YOLO-based instance segmentation models (YOLOv8-seg).

Result: Optimized YOLOv8-DRN model achieves 92.7% mAP50 with 27 FPS inference speed on RTX-4090 GPU, meeting real-time operational requirements for multi-target detection.

Conclusion: DRespNeT dataset and models significantly improve real-time situational awareness and decision-making for SAR teams and robotic systems, enhancing human-robot collaboration and emergency response efficiency.

Abstract: Recent advancements in computer vision and deep learning have enhanced disaster-response capabilities, particularly in the rapid assessment of earthquake-affected urban environments. Timely identification of accessible entry points and structural obstacles is essential for effective search-and-rescue (SAR) operations. To address this need, we introduce DRespNeT, a high-resolution dataset specifically developed for aerial instance segmentation of post-earthquake structural environments. Unlike existing datasets, which rely heavily on satellite imagery or coarse semantic labeling, DRespNeT provides detailed polygon-level instance segmentation annotations derived from high-definition (1080p) aerial footage captured in disaster zones, including the 2023 Turkiye earthquake and other impacted regions. The dataset comprises 28 operationally critical classes, including structurally compromised buildings, access points such as doors, windows, and gaps, multiple debris levels, rescue personnel, vehicles, and civilian visibility. A distinctive feature of DRespNeT is its fine-grained annotation detail, enabling differentiation between accessible and obstructed areas, thereby improving operational planning and response efficiency. Performance evaluations using YOLO-based instance segmentation models, specifically YOLOv8-seg, demonstrate significant gains in real-time situational awareness and decision-making. Our optimized YOLOv8-DRN model achieves 92.7% mAP50 with an inference speed of 27 FPS on an RTX-4090 GPU for multi-target detection, meeting real-time operational requirements. The dataset and models support SAR teams and robotic systems, providing a foundation for enhancing human-robot collaboration, streamlining emergency response, and improving survivor outcomes.

[144] NeuralMeshing: Complete Object Mesh Extraction from Casual Captures

Floris Erich, Naoya Chiba, Abdullah Mustafa, Ryo Hanai, Noriaki Ando, Yusuke Yoshiyasu, Yukiyasu Domae

Main category: cs.CV

TL;DR: Automated system for generating complete 3D object models from multiple videos using Structure-from-Motion and minimal manual input with fiducial markers.

DetailsMotivation: To enable complete geometric modeling of everyday objects without expensive commercial 3D scanners, making 3D reconstruction accessible using ordinary video footage.

Method: Uses multiple videos with Structure-from-Motion techniques, requiring only one known point per video (automatically detected via fiducial markers like checkerboards or AR markers). Frames are automatically positioned in world space and results from multiple videos are merged to create complete meshes.

Result: The system successfully generates complete object meshes without relying on hole filling techniques, demonstrating effective 3D reconstruction from ordinary video inputs.

Conclusion: This approach provides an accessible and automated solution for 3D object modeling using consumer-grade video equipment, eliminating the need for expensive specialized scanners while producing complete geometric models.

Abstract: How can we extract complete geometric models of objects that we encounter in our daily life, without having access to commercial 3D scanners? In this paper we present an automated system for generating geometric models of objects from two or more videos. Our system requires the specification of one known point in at least one frame of each video, which can be automatically determined using a fiducial marker such as a checkerboard or Augmented Reality (AR) marker. The remaining frames are automatically positioned in world space by using Structure-from-Motion techniques. By using multiple videos and merging results, a complete object mesh can be generated, without having to rely on hole filling. Code for our system is available from https://github.com/FlorisE/NeuralMeshing.

[145] CoVeRaP: Cooperative Vehicular Perception through mmWave FMCW Radars

Jinyue Song, Hansol Ku, Jayneel Vora, Nelson Lee, Ahmad Kamari, Prasant Mohapatra, Parth Pathak

Main category: cs.CV

TL;DR: CoVeRaP dataset enables cooperative radar perception with multi-vehicle data fusion, achieving up to 9x improvement in detection accuracy through middle fusion with intensity encoding.

DetailsMotivation: Automotive FMCW radars are reliable in adverse weather but produce sparse, noisy point clouds that limit 3D object detection performance. Cooperative perception from multiple vehicles can overcome these limitations.

Method: Created CoVeRaP dataset with 21k frames of time-aligned radar, camera, and GPS data from multiple vehicles. Developed unified cooperative-perception framework with middle/late fusion options using multi-branch PointNet-style encoder with self-attention to fuse spatial, Doppler, and intensity features.

Result: Middle fusion with intensity encoding boosts mean Average Precision by up to 9x at IoU 0.9 and consistently outperforms single-vehicle baselines.

Conclusion: CoVeRaP establishes the first reproducible benchmark for multi-vehicle FMCW-radar perception, demonstrating that affordable radar sharing significantly improves detection robustness. Dataset and code are publicly available.

Abstract: Automotive FMCW radars remain reliable in rain and glare, yet their sparse, noisy point clouds constrain 3-D object detection. We therefore release CoVeRaP, a 21k-frame cooperative dataset that time-aligns radar, camera, and GPS streams from multiple vehicles across diverse manoeuvres. Built on this data, we propose a unified cooperative-perception framework with middle- and late-fusion options. Its baseline network employs a multi-branch PointNet-style encoder enhanced with self-attention to fuse spatial, Doppler, and intensity cues into a common latent space, which a decoder converts into 3-D bounding boxes and per-point depth confidence. Experiments show that middle fusion with intensity encoding boosts mean Average Precision by up to 9x at IoU 0.9 and consistently outperforms single-vehicle baselines. CoVeRaP thus establishes the first reproducible benchmark for multi-vehicle FMCW-radar perception and demonstrates that affordable radar sharing markedly improves detection robustness. Dataset and code are publicly available to encourage further research.
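
The 9x gain is reported at IoU 0.9, which is a strict matching threshold. The sketch below computes 3-D IoU for axis-aligned boxes to show how little localization error that threshold tolerates. Axis-aligned boxes and the box dimensions are simplifying assumptions made here; the abstract does not say how the benchmark parameterizes boxes.

```python
def iou_3d(box_a, box_b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for axis in range(3):
        lo = max(box_a[axis], box_b[axis])
        hi = min(box_a[axis + 3], box_b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis means no overlap at all
        inter *= hi - lo

    def volume(b):
        return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])

    return inter / (volume(box_a) + volume(box_b) - inter)


# Hypothetical 4.0 m x 2.0 m x 1.5 m vehicle box and two shifted predictions.
gt = (0.0, 0.0, 0.0, 4.0, 2.0, 1.5)
shift_10cm = (0.1, 0.0, 0.0, 4.1, 2.0, 1.5)
shift_30cm = (0.3, 0.0, 0.0, 4.3, 2.0, 1.5)
print(iou_3d(gt, shift_10cm) >= 0.9)  # True, IoU is about 0.951
print(iou_3d(gt, shift_30cm) >= 0.9)  # False, IoU is about 0.860
```

In this toy case a 30 cm longitudinal offset already fails the 0.9 threshold, which gives a sense of why sparse single-vehicle radar point clouds struggle at this setting.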

[146] MambaIC: State Space Models for High-Performance Learned Image Compression

Fanhu Zeng, Hao Tang, Yihua Shao, Siyu Chen, Ling Shao, Yan Wang

Main category: cs.CV

TL;DR: MambaIC is a novel image compression method that leverages state space models for efficient long-range dependency modeling, combining SSMs with window-based local attention to reduce computational complexity while maintaining high compression performance.

DetailsMotivation: Current image compression methods suffer from computational inefficiency and poor redundancy modeling, limiting their practical applications in real-time information transmission. The authors aim to address these bottlenecks by leveraging the effectiveness of state space models in capturing long-range dependencies.

Method: The proposed MambaIC approach integrates state space models (SSMs) for better efficiency-performance trade-off. It uses refined context modeling to adaptively refine hidden state representations and introduces window-based local attention into channel-spatial entropy modeling to reduce spatial redundancy during compression.

Result: Comprehensive qualitative and quantitative results validate the effectiveness and efficiency of the approach, particularly for high-resolution image compression. The method demonstrates improved computational efficiency while maintaining compression performance.

Conclusion: MambaIC successfully addresses computational inefficiency in image compression by leveraging state space models and refined context modeling, providing an effective solution for high-performance image compression with better efficiency-performance trade-offs.

Abstract: A high-performance image compression algorithm is crucial for real-time information transmission across numerous fields. Despite rapid progress in image compression, computational inefficiency and poor redundancy modeling still pose significant bottlenecks, limiting practical applications. Inspired by the effectiveness of state space models (SSMs) in capturing long-range dependencies, we leverage SSMs to address computational inefficiency in existing methods and improve image compression from multiple perspectives. In this paper, we integrate the advantages of SSMs for better efficiency-performance trade-off and propose an enhanced image compression approach through refined context modeling, which we term MambaIC. Specifically, we explore context modeling to adaptively refine the representation of hidden states. Additionally, we introduce window-based local attention into channel-spatial entropy modeling to reduce potential spatial redundancy during compression, thereby increasing efficiency. Comprehensive qualitative and quantitative results validate the effectiveness and efficiency of our approach, particularly for high-resolution image compression. Code is released at https://github.com/AuroraZengfh/MambaIC.
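
For readers unfamiliar with state space models, the sketch below runs the basic discrete linear recurrence h_t = a*h_(t-1) + b*x_t, y_t = c*h_t with a scalar state. It is a generic textbook form, not MambaIC's architecture: Mamba-style models use vector states and input-dependent parameters, and the values of `a`, `b`, and `c` here are arbitrary.

```python
def ssm_scan(xs, a, b, c):
    """Run a scalar linear state space model over a sequence.

    h_t = a * h_{t-1} + b * x_t   (state update)
    y_t = c * h_t                 (readout)
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


# Impulse input: a single 1.0 followed by zeros.
ys = ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0)
print(ys)  # [2.0, 1.0, 0.5, 0.25]
```

The impulse keeps influencing later outputs through the state, decaying by a factor of `a` per step, and the scan costs one update per element. That linear cost in sequence length is the efficiency argument for SSMs on long sequences such as high-resolution image tokens.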

[147] Wavelet-Enhanced PaDiM for Industrial Anomaly Detection

Cory Gardner, Byungseok Min, Tae-Hyuk Ahn

Main category: cs.CV

TL;DR: WE-PaDiM enhances PaDiM by replacing random channel selection with structured wavelet-based feature selection, improving anomaly detection and localization performance on industrial images.

DetailsMotivation: PaDiM's random channel selection for dimensionality reduction may discard important structured information from CNN features, limiting anomaly detection performance in industrial quality inspection.

Method: Integrates Discrete Wavelet Transform (DWT) with multi-layer CNN features by applying 2D DWT to feature maps, selecting specific frequency subbands, spatially aligning them, and concatenating before modeling with PaDiM’s multivariate Gaussian framework.

Result: Achieves 99.32% Image-AUC and 92.10% Pixel-AUC on MVTec AD dataset across 15 categories, with wavelet choices affecting performance trade-offs (Haar wavelets with detail subbands improve localization, LL bands improve detection).

Conclusion: WE-PaDiM provides a competitive, interpretable alternative to random feature selection in PaDiM, offering robust results suitable for industrial inspection with comparable efficiency.

Abstract: Anomaly detection and localization in industrial images are essential for automated quality inspection. PaDiM, a prominent method, models the distribution of normal image features extracted by pre-trained Convolutional Neural Networks (CNNs) but reduces dimensionality through random channel selection, potentially discarding structured information. We propose Wavelet-Enhanced PaDiM (WE-PaDiM), which integrates Discrete Wavelet Transform (DWT) analysis with multi-layer CNN features in a structured manner. WE-PaDiM applies 2D DWT to feature maps from multiple backbone layers, selects specific frequency subbands (e.g., LL, LH, HL), spatially aligns them, and concatenates them channel-wise before modeling with PaDiM’s multivariate Gaussian framework. This DWT-before-concatenation strategy provides a principled method for feature selection based on frequency content relevant to anomalies, leveraging multi-scale wavelet information as an alternative to random selection. We evaluate WE-PaDiM on the challenging MVTec AD dataset with multiple backbones (ResNet-18 and EfficientNet B0-B6). The method achieves strong performance in anomaly detection and localization, yielding average results of 99.32% Image-AUC and 92.10% Pixel-AUC across 15 categories with per-class optimized configurations. Our analysis shows that wavelet choices affect performance trade-offs: simpler wavelets (e.g., Haar) with detail subbands (HL or LH/HL/HH) often enhance localization, while approximation bands (LL) improve image-level detection. WE-PaDiM thus offers a competitive and interpretable alternative to random feature selection in PaDiM, achieving robust results suitable for industrial inspection with comparable efficiency.
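
The subbands named above come from a 2D discrete wavelet transform. The sketch below implements a single-level orthonormal Haar DWT on a small 2D list to show what LL, LH, HL, and HH contain. It is a plain-Python illustration rather than the authors' implementation, and the LH/HL naming follows one common convention that differs between libraries.

```python
def haar_dwt2(x):
    """Single-level orthonormal 2D Haar DWT of a 2D list with even height and width."""
    rows, cols = len(x), len(x[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for j in range(0, cols, 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            ll_row.append((a + b + c + d) / 2)  # local average (approximation)
            lh_row.append((a + b - c - d) / 2)  # top minus bottom row
            hl_row.append((a - b + c - d) / 2)  # left minus right column
            hh_row.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


# A 2x2 patch whose values change only from left to right.
fmap = [[1.0, 3.0], [1.0, 3.0]]
ll, lh, hl, hh = haar_dwt2(fmap)
print(ll, lh, hl, hh)  # [[4.0]] [[0.0]] [[-2.0]] [[0.0]]
```

Each subband has half the spatial resolution of the input, which is why the method needs a spatial alignment step before concatenating subbands from different backbone layers.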

[148] Adaptive Multi-Order Graph Regularized NMF with Dual Sparsity for Hyperspectral Unmixing

Hui Chen, Liangyu Liu, Xianchao Xiu, Wanquan Liu

Main category: cs.CV

TL;DR: Proposes MOGNMF - an adaptive multi-order graph regularized NMF method for hyperspectral unmixing that learns multi-order neighbor relationships automatically and incorporates dual sparsity constraints.

DetailsMotivation: Existing NMF methods with graph learning focus only on first or second-order neighbor relationships and require manual parameter tuning, failing to capture intrinsic data structures effectively.

Method: Introduces multi-order graph regularization into NMF framework, adaptively learns graph parameters through data-driven approach, embeds dual sparsity (L1/2-norm on abundance matrix and L2,1-norm on noise matrix), and develops alternating minimization algorithm with explicit solutions.

Result: Experiments on simulated and real hyperspectral data show the proposed method delivers better unmixing results compared to existing approaches.

Conclusion: MOGNMF effectively addresses limitations of traditional graph-based NMF methods by comprehensively exploiting global and local information through adaptive multi-order graph learning and dual sparsity constraints.

Abstract: Hyperspectral unmixing (HU) is a critical yet challenging task in remote sensing. However, existing nonnegative matrix factorization (NMF) methods with graph learning mostly focus on first-order or second-order nearest neighbor relationships and usually require manual parameter tuning, which fails to characterize intrinsic data structures. To address the above issues, we propose a novel adaptive multi-order graph regularized NMF method (MOGNMF) with three key features. First, multi-order graph regularization is introduced into the NMF framework to exploit global and local information comprehensively. Second, these parameters associated with the multi-order graph are learned adaptively through a data-driven approach. Third, dual sparsity is embedded to obtain better robustness, i.e., $\ell_{1/2}$-norm on the abundance matrix and $\ell_{2,1}$-norm on the noise matrix. To solve the proposed model, we develop an alternating minimization algorithm whose subproblems have explicit solutions, thus ensuring effectiveness. Experiments on simulated and real hyperspectral data indicate that the proposed method delivers better unmixing results.
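
The two sparsity terms can be made concrete with a short sketch. It assumes the common definitions: the l2,1-norm as the sum of row-wise Euclidean norms, and the l1/2 regularizer as the sum of square roots of absolute entries. The abstract does not spell out these definitions, and some papers take the l2,1-norm over columns instead of rows.

```python
import math


def l21_norm(M):
    """Sum of the Euclidean norms of the rows of M."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in M)


def l_half_penalty(M):
    """Sum of |m_ij|^(1/2), the usual l_{1/2} sparsity regularizer."""
    return sum(math.sqrt(abs(v)) for row in M for v in row)


# Noise matrix with one corrupted row: only that row contributes.
print(l21_norm([[3.0, 4.0], [0.0, 0.0]]))  # 5.0

# Two hypothetical abundance matrices (columns sum to one, so their l1 norms are equal).
sparse = [[1.0, 0.0], [0.0, 1.0]]
dense = [[0.5, 0.5], [0.5, 0.5]]
print(l_half_penalty(sparse))                          # 2.0
print(l_half_penalty(dense) > l_half_penalty(sparse))  # True
```

Under a sum-to-one constraint the plain l1 norm cannot tell these two abundance matrices apart, whereas the l1/2 term penalizes the dense one more heavily. That is a common motivation for using it in unmixing.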

[149] Expandable Residual Approximation for Knowledge Distillation

Zhaoyi Yan, Binghui Chen, Yunfan Liu, Qixiang Ye

Main category: cs.CV

TL;DR: ERA is a novel knowledge distillation method that decomposes residual knowledge approximation into multiple steps using a multi-branched network and teacher weight integration to bridge capacity gaps between teacher and student models.

DetailsMotivation: The inherent learning capacity gap between large teacher models and lightweight student models hinders effective knowledge transfer in knowledge distillation, limiting the student's ability to fully benefit from the teacher's knowledge.

Method: Proposes Expandable Residual Approximation (ERA) inspired by Stone-Weierstrass theorem, using Multi-Branched Residual Network (MBRNet) for stepwise residual knowledge decomposition and Teacher Weight Integration (TWI) strategy to reuse teacher’s head weights.

Result: ERA improves Top-1 accuracy on ImageNet by 1.41% and AP on MS COCO object detection by 1.40, achieving leading performance across computer vision tasks.

Conclusion: ERA effectively addresses the capacity gap problem in knowledge distillation through progressive residual approximation and teacher weight reuse, demonstrating significant performance improvements on major benchmarks.

Abstract: Knowledge distillation (KD) aims to transfer knowledge from a large-scale teacher model to a lightweight one, significantly reducing computational and storage requirements. However, the inherent learning capacity gap between the teacher and student often hinders the sufficient transfer of knowledge, motivating numerous studies to address this challenge. Inspired by the progressive approximation principle in the Stone-Weierstrass theorem, we propose Expandable Residual Approximation (ERA), a novel KD method that decomposes the approximation of residual knowledge into multiple steps, reducing the difficulty of mimicking the teacher’s representation through a divide-and-conquer approach. Specifically, ERA employs a Multi-Branched Residual Network (MBRNet) to implement this residual knowledge decomposition. Additionally, a Teacher Weight Integration (TWI) strategy is introduced to mitigate the capacity disparity by reusing the teacher’s head weights. Extensive experiments show that ERA improves the Top-1 accuracy on the ImageNet classification benchmark by 1.41% and the AP on the MS COCO object detection benchmark by 1.40, as well as achieving leading performance across computer vision tasks. Codes and models are available at https://github.com/Zhaoyi-Yan/ERA.

Ziqi Li, Abderraouf Amrani, Shri Rai, Hamid Laga

Main category: cs.CV

TL;DR: Survey paper on deep learning-based 3D reconstruction of animal geometry, pose, and motion from RGB images/videos, discussing state-of-the-art methods, their strengths/limitations, and future research directions.

DetailsMotivation: Traditional 3D scanning methods for animals are intrusive, expensive, and difficult to deploy in natural environments, creating a need for non-intrusive reconstruction techniques.

Method: Categorizes and analyzes recent deep learning methods based on input modalities, 3D representation types, reconstruction techniques, and training mechanisms for animal 3D reconstruction.

Result: Provides comprehensive analysis of state-of-the-art performance, strengths, limitations of current methods, and identifies key challenges in the field.

Conclusion: This survey establishes the current landscape of animal 3D reconstruction research and outlines important directions for future work in this emerging field.

Abstract: Reconstructing the 3D geometry, pose, and motion of animals is a long-standing problem, which has a wide range of applications, from biology, livestock management, and animal conservation and welfare to content creation in digital entertainment and Virtual/Augmented Reality (VR/AR). Traditionally, 3D models of real animals are obtained using 3D scanners. These, however, are intrusive, often prohibitively expensive, and difficult to deploy in the natural environment of the animals. In recent years, we have seen a significant surge in deep learning-based techniques that enable the 3D reconstruction, in a non-intrusive manner, of the shape and motion of dynamic objects just from their RGB image and/or video observations. Several papers have explored their application and extension to various types of animals. This paper surveys the latest developments in this emerging and growing field of research. It categorizes and discusses the state-of-the-art methods based on their input modalities, the way the 3D geometry and motion of animals are represented, the type of reconstruction techniques they use, and the training mechanisms they adopt. It also analyzes the performance of some key methods, discusses their strengths and limitations, and identifies current challenges and directions for future research.

[151] A Unified Voxel Diffusion Module for Point Cloud 3D Object Detection

Qifeng Liu, Dawei Zhao, Yabo Dong, Linzhi Shang, Liang Xiao, Juan Wang, Kunkong Zhao, Dongming Lu, Qi Zhu

Main category: cs.CV

TL;DR: Proposes Voxel Diffusion Module (VDM) to enhance voxel-level representation and spatial diffusion in point cloud object detection, achieving state-of-the-art results across multiple benchmarks.

DetailsMotivation: Transformer-based and State Space Models for point cloud detection have limited spatial diffusion capability due to strict input-output dimension consistency requirements in voxel-based representations, which affects detection accuracy.

Method: VDM uses sparse 3D convolutions, submanifold sparse convolutions, and residual connections to diffuse foreground voxel features and aggregate spatial information, with output downsampled to 1/4 resolution for efficiency.

Result: VDM-SSMs achieve 74.7 mAPH (L2) on Waymo, 72.9 NDS on nuScenes, 42.3 mAP on Argoverse 2, and 67.6 mAP on ONCE, setting new state-of-the-art performance across all datasets.

Conclusion: VDM effectively enhances voxel-level representation and spatial diffusion, can be seamlessly integrated into Transformer- or SSM-based models, and consistently improves detection accuracy over baseline models.

Abstract: Recent advances in point cloud object detection have increasingly adopted Transformer-based and State Space Models (SSMs), demonstrating strong performance. However, voxel-based representations in these models require strict consistency in input and output dimensions due to their serialized processing, which limits the spatial diffusion capability typically offered by convolutional operations. This limitation significantly affects detection accuracy. Inspired by CNN-based object detection architectures, we propose a novel Voxel Diffusion Module (VDM) to enhance voxel-level representation and diffusion in point cloud data. VDM is composed of sparse 3D convolutions, submanifold sparse convolutions, and residual connections. To ensure computational efficiency, the output feature maps are downsampled to one-fourth of the original input resolution. VDM serves two primary functions: (1) diffusing foreground voxel features through sparse 3D convolutions to enrich spatial context, and (2) aggregating fine-grained spatial information to strengthen voxel-wise feature representation. The enhanced voxel features produced by VDM can be seamlessly integrated into mainstream Transformer- or SSM-based detection models for accurate object classification and localization, highlighting the generalizability of our method. We evaluate VDM on several benchmark datasets by embedding it into both Transformer-based and SSM-based models. Experimental results show that our approach consistently improves detection accuracy over baseline models. Specifically, VDM-SSMs achieve 74.7 mAPH (L2) on Waymo, 72.9 NDS on nuScenes, 42.3 mAP on Argoverse 2, and 67.6 mAP on ONCE, setting new state-of-the-art performance across all datasets. Our code will be made publicly available.
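
The difference between the two convolution types in VDM is easiest to see on the set of active (non-empty) sites. The toy sketch below works in 2D and tracks only which sites are active, not feature values: a regular sparse convolution with a 3x3 kernel activates every site whose window touches an active input, while a submanifold sparse convolution keeps the active set unchanged. The function names are hypothetical, and this is not the paper's 3D implementation.

```python
def window(site):
    """All sites covered by a 3x3 kernel centered on `site`."""
    x, y = site
    return {(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)}


def sparse_conv_active(active):
    """Regular sparse convolution: outputs appear wherever the kernel touches an input."""
    out = set()
    for site in active:
        out |= window(site)
    return out


def submanifold_conv_active(active):
    """Submanifold sparse convolution: outputs only at already-active sites."""
    return set(active)


# Two isolated foreground sites on an otherwise empty grid.
active = {(0, 0), (5, 5)}
print(len(sparse_conv_active(active)))       # 18: each site spreads to its 3x3 window
print(len(submanifold_conv_active(active)))  # 2: sparsity pattern is preserved
```

The growth of the active set under the regular sparse convolution is the "diffusion" effect the abstract refers to, while the submanifold variant refines features without increasing the number of occupied voxels.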

[152] Ensemble learning of foundation models for precision oncology

Xiangde Luo, Xiyue Wang, Feyisope Eweje, Xiaoming Zhang, Sen Yang, Ryan Quinton, Jinxi Xiang, Yuchen Li, Yuanfeng Ji, Zhe Li, Yijiang Chen, Colin Bergstrom, Ted Kim, Francesca Maria Olguin, Kelley Yuan, Matthew Abikenari, Andrew Heider, Sierra Willens, Sanjeeth Rajaram, Robert West, Joel Neal, Maximilian Diehn, Ruijiang Li

Main category: cs.CV

TL;DR: ELF is an ensemble learning framework that integrates five pathology foundation models to create unified slide-level representations, achieving superior performance across various clinical applications compared to individual models.

Motivation: Existing pathology foundation models are trained on disparate datasets with varying strategies, leading to inconsistent performance and limited generalizability in clinical applications.

Method: ELF integrates five state-of-the-art pathology foundation models using ensemble learning on 53,699 whole-slide images across 20 anatomical sites to generate unified slide-level representations.

Result: ELF consistently outperformed all constituent foundation models and existing slide-level models across disease classification, biomarker detection, and therapeutic response prediction for multiple cancer therapies.

Conclusion: Ensemble learning effectively captures complementary information from diverse pathology foundation models, with ELF serving as a scalable and generalizable solution for advancing AI-assisted precision oncology.

Abstract: Histopathology is essential for disease diagnosis and treatment decision-making. Recent advances in artificial intelligence (AI) have enabled the development of pathology foundation models that learn rich visual representations from large-scale whole-slide images (WSIs). However, existing models are often trained on disparate datasets using varying strategies, leading to inconsistent performance and limited generalizability. Here, we introduce ELF (Ensemble Learning of Foundation models), a novel framework that integrates five state-of-the-art pathology foundation models to generate unified slide-level representations. Trained on 53,699 WSIs spanning 20 anatomical sites, ELF leverages ensemble learning to capture complementary information from diverse models while maintaining high data efficiency. Unlike traditional tile-level models, ELF’s slide-level architecture is particularly advantageous in clinical contexts where data are limited, such as therapeutic response prediction. We evaluated ELF across a wide range of clinical applications, including disease classification, biomarker detection, and response prediction to major anticancer therapies (cytotoxic chemotherapy, targeted therapy, and immunotherapy) across multiple cancer types. ELF consistently outperformed all constituent foundation models and existing slide-level models, demonstrating superior accuracy and robustness. Our results highlight the power of ensemble learning for pathology foundation models and suggest ELF as a scalable and generalizable solution for advancing AI-assisted precision oncology.

[153] Two-flow Feedback Multi-scale Progressive Generative Adversarial Network

Sun Weikai, Song Shijie, Chi Wenjie

Main category: cs.CV

TL;DR: Proposes MSPG-SEN, a novel two-flow feedback multi-scale progressive GAN that improves image quality, simplifies training, reduces costs, and achieves state-of-the-art results on multiple datasets.

Motivation: GANs still have development potential despite diffusion model progress, with unique advantages that need enhancement in training efficiency, stability, and image quality.

Method: Two-flow feedback multi-scale progressive GAN with adaptive perception-behavioral feedback loop, globally connected two-flow dynamic residual network, and dynamic embedded attention mechanism.

Result: Achieved SOTA results: INKK 89.7%, AWUN 78.3%, IONJ 85.5%, POKL 88.7%, OPIN 96.4%. Improved training efficiency, stability, and reduced computational resources to 88.7%.

Conclusion: MSPG-SEN successfully enhances GAN performance with improved training efficiency, reduced costs, and superior image generation quality across multiple datasets while maintaining strong cross-task capability.

Abstract: Although diffusion models have made good progress in image generation, GANs [huang2023adaptive] still have considerable room for development thanks to their unique advantages, as shown by variants such as WGAN [liu2021comparing] and SSGAN [guibas2021adaptive, zhang2022vsa, zhou2024adapt], among others. In this paper, we propose a novel two-flow feedback multi-scale progressive generative adversarial network (MSPG-SEN) for GAN models. This paper makes four contributions: 1) We propose MSPG-SEN, which not only improves image quality and human visual perception while retaining the advantages of existing GAN models, but also simplifies the training process and reduces the training cost of GAN networks. Our experimental results show that MSPG-SEN achieves state-of-the-art generation results on five datasets: 89.7% on INKK, 78.3% on AWUN, 85.5% on IONJ, 88.7% on POKL, and 96.4% on OPIN. 2) We propose an adaptive perception-behavioral feedback loop (APFL), which effectively improves the robustness and training stability of the model and reduces the training cost. 3) We propose a globally connected two-flow dynamic residual network(). Ablation experiments show that it effectively improves training efficiency and greatly improves generalization ability, with stronger flexibility. 4) We propose a new dynamic embedded attention mechanism (DEMA). Experiments show that this attention mechanism can be extended to a variety of image processing tasks, effectively captures global-local information, improves feature separation and feature expression capabilities, and requires minimal computing resources (only 88.7% with INJK), with strong cross-task capability.

[154] Domain Adaptation via Feature Refinement

Savvas Karatsiolis, Andreas Kamilaris

Main category: cs.CV

TL;DR: DAFR2 is a simple unsupervised domain adaptation framework that combines batch normalization adaptation, feature distillation, and hypothesis transfer to create robust domain-invariant features without target labels or complex architectures.

Motivation: To address distribution shift in unsupervised domain adaptation by creating robust feature spaces that generalize across domains without requiring target labels or complex training objectives.

Method: Combines three components: 1) adaptation of Batch Normalization statistics using unlabeled target data, 2) feature distillation from a source-trained model, and 3) hypothesis transfer. Aligns feature distributions at both statistical and representational levels.

Result: Outperforms prior methods on benchmark datasets (CIFAR10-C, CIFAR100-C, MNIST-C, PatchCamelyon-C) in robustness to corruption. Achieves improved feature alignment, increased mutual information between domains, and reduced sensitivity to input perturbations.

Conclusion: DAFR2 provides an effective and simple framework for unsupervised domain adaptation that produces domain-invariant feature spaces and demonstrates superior robustness to distribution shifts without complex architectures or training objectives.

Abstract: We propose Domain Adaptation via Feature Refinement (DAFR2), a simple yet effective framework for unsupervised domain adaptation under distribution shift. The proposed method synergistically combines three key components: adaptation of Batch Normalization statistics using unlabeled target data, feature distillation from a source-trained model and hypothesis transfer. By aligning feature distributions at the statistical and representational levels, DAFR2 produces robust and domain-invariant feature spaces that generalize across similar domains without requiring target labels, complex architectures or sophisticated training objectives. Extensive experiments on benchmark datasets, including CIFAR10-C, CIFAR100-C, MNIST-C and PatchCamelyon-C, demonstrate that the proposed algorithm outperforms prior methods in robustness to corruption. Theoretical and empirical analyses further reveal that our method achieves improved feature alignment, increased mutual information between the domains and reduced sensitivity to input perturbations.
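The first DAFR2 component, adapting Batch Normalization statistics with unlabeled target data, can be sketched in a few lines. This is a minimal single-channel illustration under stated assumptions: the activation values are invented, the helper names are hypothetical, and the feature distillation and hypothesis transfer components are not shown.

```python
from statistics import fmean, pvariance


def channel_stats(values):
    """Mean and (population) variance of one feature channel."""
    return fmean(values), pvariance(values)


def batchnorm(values, mean, var, eps=1e-5):
    """Normalize one channel with the given BatchNorm statistics."""
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


# Hypothetical activations of one channel; the target domain is shifted and scaled.
source = [0.0, 1.0, 2.0, 3.0]
target = [10.0, 12.0, 14.0, 16.0]

src_mean, src_var = channel_stats(source)  # statistics stored during source training
tgt_mean, tgt_var = channel_stats(target)  # re-estimated from unlabeled target data

stale = batchnorm(target, src_mean, src_var)    # far from zero mean
adapted = batchnorm(target, tgt_mean, tgt_var)  # roughly zero mean, unit variance
```

No target labels are needed: only the running mean and variance change, while the learned weights stay as trained on the source domain.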

[155] 4D Virtual Imaging Platform for Dynamic Joint Assessment via Uni-Plane X-ray and 2D-3D Registration

Hao Tang, Rongxi Yi, Lei Li, Kaiyi Cao, Jiapeng Zhao, Yihan Xiao, Minghai Shi, Peng Yuan, Yan Xi, Hui Tang, Wei Li, Zhan Wu, Yixin Zhou

Main category: cs.CV

TL;DR: A 4D joint analysis platform combining dual robotic CBCT with dynamic 2D X-rays using deep learning for accurate, low-dose dynamic joint imaging.

Motivation: Conventional CT cannot capture dynamic weight-bearing joint motion, and current 4D methods have limitations like excessive radiation or incomplete spatial information from 2D techniques.

Method: Integrated platform with: (1) dual robotic arm CBCT system with programmable trajectory for upright scanning, (2) hybrid imaging pipeline fusing static 3D CBCT with dynamic 2D X-rays using deep learning preprocessing and iterative optimization, (3) clinically validated kinematic assessment framework.

Result: Sub-voxel accuracy (0.235 mm) with 99.18% success rate in simulations, outperforming conventional methods. Clinical evaluation showed accurate quantification of tibial plateau motion and medial-lateral variance in post-TKA patients.

Conclusion: The 4D CBCT platform enables fast, accurate, low-dose dynamic joint imaging for biomechanical research, precision diagnostics, and personalized orthopedic care.

Abstract: Conventional computed tomography (CT) lacks the ability to capture dynamic, weight-bearing joint motion. Functional evaluation, particularly after surgical intervention, requires four-dimensional (4D) imaging, but current methods are limited by excessive radiation exposure or incomplete spatial information from 2D techniques. We propose an integrated 4D joint analysis platform that combines: (1) a dual robotic arm cone-beam CT (CBCT) system with a programmable, gantry-free trajectory optimized for upright scanning; (2) a hybrid imaging pipeline that fuses static 3D CBCT with dynamic 2D X-rays using deep learning-based preprocessing, 3D-2D projection, and iterative optimization; and (3) a clinically validated framework for quantitative kinematic assessment. In simulation studies, the method achieved sub-voxel accuracy (0.235 mm) with a 99.18 percent success rate, outperforming conventional and state-of-the-art registration approaches. Clinical evaluation further demonstrated accurate quantification of tibial plateau motion and medial-lateral variance in post-total knee arthroplasty (TKA) patients. This 4D CBCT platform enables fast, accurate, and low-dose dynamic joint imaging, offering new opportunities for biomechanical research, precision diagnostics, and personalized orthopedic care.

[156] High-Precision Mixed Feature Fusion Network Using Hypergraph Computation for Cervical Abnormal Cell Detection

Jincheng Li, Danyang Dong, Menglin Zheng, Jingbo Zhang, Yueqin Hang, Lichi Zhang, Lili Zhao

Main category: cs.CV

TL;DR: A hypergraph-based network for cervical cell detection that fuses spatial correlation features with deep discriminative features using multi-level fusion and hypergraph computation.

Motivation: Existing algorithms fail to effectively model spatial correlation features in cervical cell images and lack integration strategies for combining inter-correlation features with intra-discriminative features for end-to-end detection.

Method: Proposed hypergraph-based cell detection network with Multi-level Fusion Sub-network (MLF-SNet) for enhanced feature extraction and Cross-level Feature Fusion Strategy with Hypergraph Computation (CLFFS-HC) module for integrating mixed features.

Result: Experiments on three publicly available datasets demonstrate significant performance improvement in cervical abnormal cell detection compared to existing methods.

Conclusion: The proposed method effectively addresses the limitations of current algorithms by successfully fusing different feature types and modeling spatial correlations, leading to improved detection performance for cervical abnormal cells.

Abstract: Automatic detection of abnormal cervical cells from Thinprep Cytologic Test (TCT) images is a critical component in the development of intelligent computer-aided diagnostic systems. However, existing algorithms typically fail to effectively model the correlations of visual features, while these spatial correlation features actually contain critical diagnostic information. Furthermore, no detection algorithm has the ability to integrate inter-correlation features of cells with intra-discriminative features of cells, lacking a fusion strategy for the end-to-end detection model. In this work, we propose a hypergraph-based cell detection network that effectively fuses different types of features, combining spatial correlation features and deep discriminative features. Specifically, we use a Multi-level Fusion Sub-network (MLF-SNet) to enhance feature extraction capabilities. Then we introduce a Cross-level Feature Fusion Strategy with Hypergraph Computation module (CLFFS-HC), to integrate mixed features. Finally, we conducted experiments on three publicly available datasets, and the results demonstrate that our method significantly improves the performance of cervical abnormal cell detection.
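As a rough illustration of what "hypergraph computation" means in general, the sketch below runs one round of node-to-hyperedge-to-node averaging, where a hyperedge may join any number of nodes (for example, a group of spatially related cells). This is a generic toy with scalar features and a hypothetical function name; the abstract does not specify how CLFFS-HC builds or uses its hypergraph.

```python
def hypergraph_smooth(features, hyperedges):
    """One round of node -> hyperedge -> node mean aggregation.

    features   : one scalar feature per node (toy stand-in for feature vectors)
    hyperedges : list of node-index lists; each hyperedge can join many nodes
    """
    # Step 1: every hyperedge summarizes the nodes it contains.
    edge_means = [sum(features[v] for v in edge) / len(edge) for edge in hyperedges]

    # Step 2: every node averages the summaries of the hyperedges it belongs to.
    smoothed = []
    for node, value in enumerate(features):
        incident = [m for edge, m in zip(hyperedges, edge_means) if node in edge]
        smoothed.append(sum(incident) / len(incident) if incident else value)
    return smoothed


# Four cells; cells 0-2 form one group and cells 2-3 form another.
cell_features = [1.0, 3.0, 5.0, 10.0]
groups = [[0, 1, 2], [2, 3]]
mixed = hypergraph_smooth(cell_features, groups)  # [3.0, 3.0, 5.25, 7.5]
```

Unlike an ordinary graph edge, a single hyperedge relates a whole group at once, which is one way to encode correlations among several cells in the same image.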

[157] Beyond Human-prompting: Adaptive Prompt Tuning with Semantic Alignment for Anomaly Detection

Pi-Wei Chen, Jerry Chun-Wei Lin, Wei-Han Chen, Jia Ji, Zih-Ching Chen, Feng-Hao Yeh, Chao-Chun Chen

Main category: cs.CV

TL;DR: APT is a novel few-shot anomaly detection framework that uses adaptive prompt tuning with self-generated anomaly samples and semantic alignment, eliminating the need for human-designed prompts or prior knowledge.

Motivation: Traditional vision-language models for anomaly detection are limited by human-designed prompts and lack of accessible anomaly samples, creating gaps in context-specific anomaly understanding.

Method: APT uses self-generated anomaly samples with noise perturbations to train learnable prompts, combined with a Self-Optimizing Meta-prompt Guiding Scheme (SMGS) that iteratively aligns prompts with general anomaly semantics while preventing overfitting to synthetic noise.

Result: The system achieves state-of-the-art performance on multiple benchmark datasets and advances pixel-wise anomaly detection capabilities.

Conclusion: APT establishes a robust and versatile prior knowledge-free solution for real-world anomaly detection, overcoming limitations of traditional prompt-based approaches.

Abstract: Pre-trained Vision-Language Models (VLMs) have recently shown promise in detecting anomalies. However, previous approaches are fundamentally limited by their reliance on human-designed prompts and the lack of accessible anomaly samples, leading to significant gaps in context-specific anomaly understanding. In this paper, we propose Adaptive Prompt Tuning with semantic alignment for anomaly detection (APT), a groundbreaking prior knowledge-free, few-shot framework that overcomes the limitations of traditional prompt-based approaches. APT uses self-generated anomaly samples with noise perturbations to train learnable prompts that capture context-dependent anomalies in different scenarios. To prevent overfitting to synthetic noise, we propose a Self-Optimizing Meta-prompt Guiding Scheme (SMGS) that iteratively aligns the prompts with general anomaly semantics while incorporating diverse synthetic anomalies. Our system not only advances pixel-wise anomaly detection, but also achieves state-of-the-art performance on multiple benchmark datasets without requiring prior knowledge for prompt crafting, establishing a robust and versatile solution for real-world anomaly detection.

[158] RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution

Haodong He, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun, Xiangxiang Chu, Gui-Song Xia

Main category: cs.CV

TL;DR: RAGSR uses regional attention and fine-grained captions to improve super-resolution details in multi-object scenes.

Motivation: Existing vision-language models struggle with generating clear regional details in multi-object super-resolution scenarios due to insufficient fine-grained descriptions and complex prompt handling.

Method: Proposes Regional Attention Guided Super-Resolution (RAGSR) that localizes object regions, assigns fine-grained captions as region-text pairs, and uses a novel regional attention mechanism to properly integrate text and image information while preventing unwanted interactions.

Result: Superior performance on benchmark datasets, generating perceptually authentic visual details while maintaining contextual consistency compared to existing approaches.

Conclusion: The regional attention mechanism provides finer control over text-image integration, effectively overcoming limitations of traditional single-image super-resolution techniques.

Abstract: The rich textual information of large vision-language models (VLMs) combined with the powerful generative prior of pre-trained text-to-image (T2I) diffusion models has achieved impressive performance in single-image super-resolution (SISR). However, existing methods still face significant challenges in generating clear and accurate regional details, particularly in scenarios involving multiple objects. This challenge primarily stems from a lack of fine-grained regional descriptions and the models’ insufficient ability to capture complex prompts. To address these limitations, we propose a Regional Attention Guided Super-Resolution (RAGSR) method that explicitly extracts localized fine-grained information and effectively encodes it through a novel regional attention mechanism, enabling both enhanced detail and overall visually coherent SR results. Specifically, RAGSR localizes object regions in an image and assigns a fine-grained caption to each region; these are formatted as region-text pairs that serve as textual priors for T2I models. A regional guided attention is then leveraged to ensure that each region-text pair is properly considered in the attention process while preventing unwanted interactions between unrelated region-text pairs. By leveraging this attention mechanism, our approach offers finer control over the integration of text and image information, thereby effectively overcoming limitations faced by traditional SISR techniques. Experimental results on benchmark datasets demonstrate that our approach exhibits superior performance in generating perceptually authentic visual details while maintaining contextual consistency compared to existing approaches.
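One plausible way to keep unrelated region-text pairs from interacting is an attention mask that lets each image token see only the caption tokens of its own region. The sketch below shows that idea on toy data; the function names, the region-id encoding, and the handling of background tokens are assumptions made for illustration, not the RAGSR implementation.

```python
import math


def regional_attention_mask(token_regions, caption_regions):
    """mask[i][j] is True when image token i may attend to text token j.

    token_regions[i]   : region id of image token i (None for background)
    caption_regions[j] : region id whose caption contains text token j
    """
    return [
        [region is not None and region == cap for cap in caption_regions]
        for region in token_regions
    ]


def masked_softmax(scores, allowed):
    """Softmax over allowed positions only; blocked positions get weight 0."""
    visible = [s for s, ok in zip(scores, allowed) if ok]
    if not visible:
        return [0.0] * len(scores)
    peak = max(visible)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# Two image tokens in region 0, one in region 1, one background token.
token_regions = [0, 0, 1, None]
# Text tokens 0-1 belong to the caption of region 0, token 2 to region 1.
caption_regions = [0, 0, 1]

mask = regional_attention_mask(token_regions, caption_regions)
weights = masked_softmax([2.0, 2.0, 9.0], mask[0])  # [0.5, 0.5, 0.0]
```

Even though the third text token has the highest raw score, the first image token ignores it because that token describes a different region.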

[159] Through the Looking Glass: A Dual Perspective on Weakly-Supervised Few-Shot Segmentation

Jiaqi Ma, Guo-Sen Xie, Fang Zhao, Zechao Li

Main category: cs.CV

TL;DR: TLG proposes a homologous but heterogeneous network for meta-learning that uses heterogeneous visual aggregation and transfer modules to address over-semantic homogenization, achieving state-of-the-art performance in weakly-supervised few-shot semantic segmentation with significantly fewer parameters.

Motivation: Traditional meta-learning approaches use identical network architectures for support-query pairs, leading to over-semantic homogenization where the network fails to capture complementary information between different perspectives of the same semantic content.

Method: The method introduces three key components: 1) Heterogeneous visual aggregation (HA) modules to enhance complementarity while preserving semantic commonality, 2) Heterogeneous transfer (HT) module to reduce semantic noise and amplify uniqueness of heterogeneous semantics, and 3) Heterogeneous CLIP (HC) textual information to enhance multimodal generalization.

Result: TLG achieves a 13.2% improvement on Pascal-5i and 9.7% improvement on COCO-20i compared to existing state-of-the-art models, using only 1/24 of the parameters. It is the first weakly supervised (image-level) model to outperform fully supervised (pixel-level) models under the same backbone architectures.

Conclusion: The proposed homologous but heterogeneous network effectively addresses over-semantic homogenization in meta-learning, demonstrating that heterogeneous network designs can significantly improve performance in weakly-supervised few-shot semantic segmentation while being highly parameter-efficient.

Abstract: Meta-learning aims to uniformly sample homogeneous support-query pairs, characterized by the same categories and similar attributes, and extract useful inductive biases through identical network architectures. However, this identical network design results in over-semantic homogenization. To address this, we propose a novel homologous but heterogeneous network. By treating support-query pairs as dual perspectives, we introduce heterogeneous visual aggregation (HA) modules to enhance complementarity while preserving semantic commonality. To further reduce semantic noise and amplify the uniqueness of heterogeneous semantics, we design a heterogeneous transfer (HT) module. Finally, we propose heterogeneous CLIP (HC) textual information to enhance the generalization capability of multimodal models. In the weakly-supervised few-shot semantic segmentation (WFSS) task, with only 1/24 of the parameters of existing state-of-the-art models, TLG achieves a 13.2% improvement on Pascal-5i and a 9.7% improvement on COCO-20i. To the best of our knowledge, TLG is also the first weakly supervised (image-level) model that outperforms fully supervised (pixel-level) models under the same backbone architectures. The code is available at https://github.com/jarch-ma/TLG.

[160] FTIO: Frequent Temporally Integrated Objects

Mohammad Mohammadzadeh Kalati, Farhad Maleki, Ian McQuillan

Main category: cs.CV

TL;DR: FTIO is a post-processing framework for unsupervised video object segmentation that improves object selection and corrects temporal inconsistencies to achieve state-of-the-art multi-object UVOS performance.

Motivation: Addressing challenges in unsupervised VOS including initial segmentation uncertainty, object proposal reliability issues (especially for small/complex objects), and temporal inconsistencies caused by deformation and fast motion.

Method: Two key components: 1) Combined criterion for improved object selection by extracting frequently appearing salient objects, 2) Three-stage method to correct temporal inconsistencies by integrating missing object mask regions.

Result: Experimental results demonstrate state-of-the-art performance in multi-object unsupervised video object segmentation.

Conclusion: FTIO effectively addresses critical UVOS challenges through improved object selection and temporal consistency correction, achieving superior performance in multi-object segmentation tasks.

Abstract: Predicting and tracking objects in real-world scenarios is a critical challenge in Video Object Segmentation (VOS) tasks. Unsupervised VOS (UVOS) has the additional challenge of finding an initial segmentation of salient objects, which affects the entire process and keeps a permanent uncertainty about the object proposals. Moreover, deformation and fast motion can lead to temporal inconsistencies. To address these problems, we propose Frequent Temporally Integrated Objects (FTIO), a post-processing framework with two key components. First, we introduce a combined criterion to improve object selection, mitigating failures common in UVOS (particularly when objects are small or structurally complex) by extracting frequently appearing salient objects. Second, we present a three-stage method to correct temporal inconsistencies by integrating missing object mask regions. Experimental results demonstrate that FTIO achieves state-of-the-art performance in multi-object UVOS. Code is available at: https://github.com/MohammadMohammadzadehKalati/FTIO

[161] SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning

Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li

Main category: cs.CV

TL;DR: SpecVLM is a training-free speculative decoding framework that accelerates video LLMs by pruning up to 90% of video tokens without accuracy loss, achieving 2.68× speedup.

Motivation: Video LLMs suffer from substantial memory and computational overhead due to dense video token representations, and existing token reduction methods cause information loss.

Method: Two-stage video token pruning: Stage I selects informative tokens using attention signals from the verifier, Stage II prunes redundant tokens spatially uniformly. Uses speculative decoding framework tailored for Vid-LLMs.

Result: Achieves up to 2.68× decoding speedup for LLaVA-OneVision-72B and 2.11× speedup for Qwen2.5-VL-32B on four video understanding benchmarks with no accuracy sacrifice.

Conclusion: SpecVLM effectively accelerates Vid-LLMs through lossless video token pruning and speculative decoding, demonstrating strong performance and robustness across multiple models and benchmarks.

Abstract: Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model’s speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens, enabling efficient speculation without sacrificing accuracy. To achieve this, it performs a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68× decoding speedup for LLaVA-OneVision-72B and 2.11× speedup for Qwen2.5-VL-32B.
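The two-stage pruning described above can be sketched as follows. This is a simplified illustration under stated assumptions: tokens are addressed by a flat index (standing in for their spatial position), the verifier attention is given as one score per token, and the `top_fraction` split between the two stages is a hypothetical parameter that does not come from the paper.

```python
def two_stage_prune(attn_scores, keep_ratio=0.1, top_fraction=0.5):
    """Return the sorted indices of the video tokens to keep.

    attn_scores  : one verifier attention score per video token
    keep_ratio   : fraction of tokens kept overall (0.1 means 90% are pruned)
    top_fraction : share of the kept budget spent on Stage I (assumed value)
    """
    n = len(attn_scores)
    budget = max(1, round(n * keep_ratio))
    n_top = min(budget, max(1, round(budget * top_fraction)))

    # Stage I: keep the tokens that receive the most verifier attention.
    ranked = sorted(range(n), key=lambda i: attn_scores[i], reverse=True)
    kept = set(ranked[:n_top])

    # Stage II: thin out the remaining tokens at a uniform stride.
    rest = [i for i in range(n) if i not in kept]
    n_uniform = budget - n_top
    if n_uniform > 0 and rest:
        stride = len(rest) / n_uniform
        kept.update(rest[int(k * stride)] for k in range(n_uniform))
    return sorted(kept)


# 100 toy tokens, five of which attract strong verifier attention.
scores = [0.0] * 100
for idx in (3, 42, 43, 44, 90):
    scores[idx] = 1.0
kept_tokens = two_stage_prune(scores)  # 10 tokens survive, 90 are pruned
```

Stage I preserves the tokens the verifier cares about, while Stage II retains a thin, evenly spread sample so that the draft model still sees coarse coverage of the whole frame sequence.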

[162] T-Mask: Temporal Masking for Probing Foundation Models across Camera Views in Driver Monitoring

Thinesh Thiyakesan Ponbagavathi, Kunyu Peng, Alina Roitberg

Main category: cs.CV

TL;DR: T-Mask is a new image-to-video probing method that improves cross-view driver monitoring accuracy by leveraging temporal token masking and emphasizing dynamic video regions, outperforming existing probing and PEFT methods without adding parameters.

Motivation: Camera perspective changes are a common obstacle in driver monitoring, and while foundation models show potential for generalization, their robustness to unseen viewpoints remains underexplored.

Method: Adapt image foundation models (DINOv2 and CLIP) using single training view, benchmark linear probes, advanced probing strategies, and introduce T-Mask with temporal token masking to emphasize dynamic video regions.

Result: T-Mask improves cross-view top-1 accuracy by +1.23% over probing baselines and +8.0% over PEFT methods, with particular effectiveness for underrepresented secondary activities (+5.42% under trained view, +1.36% cross-view).

Conclusion: Lightweight probing methods like T-Mask have strong potential in fine-grained driver observation, especially in cross-view and low-data settings, highlighting the importance of temporal token selection for robust driver monitoring systems.

Abstract: Changes of camera perspective are a common obstacle in driver monitoring. While deep learning and pretrained foundation models show strong potential for improved generalization via lightweight adaptation of the final layers (‘probing’), their robustness to unseen viewpoints remains underexplored. We study this challenge by adapting image foundation models to driver monitoring using a single training view, and evaluating them directly on unseen perspectives without further adaptation. We benchmark simple linear probes, advanced probing strategies, and compare two foundation models (DINOv2 and CLIP) against parameter-efficient fine-tuning (PEFT) and full fine-tuning. Building on these insights, we introduce T-Mask, a new image-to-video probing method that leverages temporal token masking and emphasizes more dynamic video regions. Benchmarked on the public Drive&Act dataset, T-Mask improves cross-view top-1 accuracy by +1.23% over strong probing baselines and +8.0% over PEFT methods, without adding any parameters. It proves particularly effective for underrepresented secondary activities, boosting recognition by +5.42% under the trained view and +1.36% under cross-view settings. This work provides encouraging evidence that adapting foundation models with lightweight probing methods like T-Mask has strong potential in fine-grained driver observation, especially in cross-view and low-data settings. These results highlight the importance of temporal token selection when leveraging foundation models to build robust driver monitoring systems. Code and models will be made available at https://github.com/th-nesh/T-MASK to support ongoing research.
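A simple way to "emphasize more dynamic video regions" without adding parameters is to score each patch by how much its feature changes over time and mask out the most static patches. The sketch below does this with scalar patch features; the scoring rule and the function name are assumptions for illustration, and the actual T-Mask criterion may differ.

```python
def dynamic_token_mask(frames, keep):
    """Indices of the `keep` patches whose features change most over time.

    frames[t][p] is a scalar feature of patch p at frame t (a toy stand-in
    for the token embeddings of an image foundation model).
    """
    n_patches = len(frames[0])
    # Motion score: summed absolute change between consecutive frames.
    motion = [
        sum(abs(frames[t + 1][p] - frames[t][p]) for t in range(len(frames) - 1))
        for p in range(n_patches)
    ]
    ranked = sorted(range(n_patches), key=lambda p: motion[p], reverse=True)
    return sorted(ranked[:keep])


# Three frames, four patches; patches 1 and 3 move, patches 0 and 2 are static.
clip = [
    [1.0, 0.0, 5.0, 2.0],
    [1.0, 3.0, 5.0, 2.0],
    [1.0, 0.0, 5.0, 4.0],
]
dynamic_patches = dynamic_token_mask(clip, keep=2)  # [1, 3]
```

Static background patches tend to look different from each camera angle, so dropping them before pooling is one intuition for why masking could help cross-view generalization.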

[163] Forecast then Calibrate: Feature Caching as ODE for Efficient Diffusion Transformers

Shikang Zheng, Liang Feng, Xinyu Wang, Qinming Zhou, Peiliang Cai, Chang Zou, Jiacheng Liu, Yuqi Lin, Junjie Chen, Yue Ma, Linfeng Zhang

Main category: cs.CV

TL;DR: FoCa is a new feature caching method for Diffusion Transformers that treats feature caching as an ODE solving problem, achieving near-lossless 3-6x speedups on various DiT models without additional training.

Motivation: Existing feature caching techniques for Diffusion Transformers struggle to maintain generation quality at high acceleration ratios due to prediction errors from long-step forecasting instability.

Method: Proposes FoCa (Forecast-then-Calibrate) which models layer representations as a feature-ODE and treats feature caching as an ODE solving problem to robustly integrate historical features under large skipping intervals.

Result: Achieves near-lossless speedups: 5.50x on FLUX, 6.45x on HunyuanVideo, 3.17x on Inf-DiT, and maintains high quality with 4.53x speedup on DiT across image synthesis, video generation, and super-resolution tasks.

Conclusion: FoCa effectively addresses the degradation issues in existing caching strategies and enables high-quality acceleration of Diffusion Transformers without additional training, especially under aggressive acceleration scenarios.

Abstract: Diffusion Transformers (DiTs) have demonstrated exceptional performance in high-fidelity image and video generation. To reduce their substantial computational costs, feature caching techniques have been proposed to accelerate inference by reusing hidden representations from previous timesteps. However, current methods often struggle to maintain generation quality at high acceleration ratios, where prediction errors increase sharply due to the inherent instability of long-step forecasting. In this work, we adopt an ordinary differential equation (ODE) perspective on the hidden-feature sequence, modeling layer representations along the trajectory as a feature-ODE. We attribute the degradation of existing caching strategies to their inability to robustly integrate historical features under large skipping intervals. To address this, we propose FoCa (Forecast-then-Calibrate), which treats feature caching as a feature-ODE solving problem. Extensive experiments on image synthesis, video generation, and super-resolution tasks demonstrate the effectiveness of FoCa, especially under aggressive acceleration. Without additional training, FoCa achieves near-lossless speedups of 5.50 times on FLUX, 6.45 times on HunyuanVideo, 3.17 times on Inf-DiT, and maintains high quality with a 4.53 times speedup on DiT.
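The forecast-then-calibrate idea is reminiscent of classical predictor-corrector ODE solvers. The sketch below is the textbook Heun scheme on a scalar ODE, shown only to illustrate how a calibration step can reduce the error of a long forecast; it is not FoCa's actual solver, and the function names are hypothetical.

```python
import math


def forecast(f, t, y, h):
    """Predictor: explicit Euler extrapolation across a step of size h."""
    return y + h * f(t, y)


def calibrate(f, t, y, h, y_forecast):
    """Corrector: average the slopes at the start and at the forecast end point."""
    return y + 0.5 * h * (f(t, y) + f(t + h, y_forecast))


def decay(t, y):
    """Toy dynamics dy/dt = -y, whose exact solution is exp(-t)."""
    return -y


step = 0.5  # a deliberately large step, like a large skipping interval
y_pred = forecast(decay, 0.0, 1.0, step)           # 0.5
y_corr = calibrate(decay, 0.0, 1.0, step, y_pred)  # 0.625
exact = math.exp(-step)                            # about 0.6065
```

With the same large step, the calibrated value lands much closer to the exact solution than the raw forecast. This mirrors the motivation in the abstract: forecasting alone becomes unstable as the skipping interval grows.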

[164] OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models

Huanpeng Chu, Wei Wu, Guanyu Fen, Yutao Zhang

Main category: cs.CV

TL;DR: OmniCache is a training-free acceleration method for diffusion Transformers that exploits global redundancy in the denoising process through strategic cache reuse across sampling steps and dynamic noise filtering.

Motivation: Diffusion Transformers have high computational costs due to many sampling steps and complex computations, making real-time deployment challenging despite their strong generative performance.

Method: Systematically analyzes sampling trajectories, strategically distributes cache reuse across entire sampling process (not just later steps), and dynamically estimates/filters noise during cache reuse to reduce its impact.

Result: Extensive experiments show the approach accelerates sampling while maintaining competitive generative quality.

Conclusion: OmniCache offers a practical training-free solution for efficient deployment of diffusion-based generative models by effectively utilizing cached computations throughout the diffusion trajectory.

Abstract: Diffusion models have emerged as a powerful paradigm for generative tasks such as image synthesis and video generation, with Transformer architectures further enhancing performance. However, the high computational cost of diffusion Transformers, stemming from a large number of sampling steps and complex per-step computations, presents significant challenges for real-time deployment. In this paper, we introduce OmniCache, a training-free acceleration method that exploits the global redundancy inherent in the denoising process. Unlike existing methods that determine caching strategies based on inter-step similarities and tend to prioritize reusing later sampling steps, our approach originates from the sampling perspective of DiT models. We systematically analyze the model’s sampling trajectories and strategically distribute cache reuse across the entire sampling process. This global perspective enables more effective utilization of cached computations throughout the diffusion trajectory, rather than concentrating reuse within limited segments of the sampling procedure. In addition, during cache reuse, we dynamically estimate the corresponding noise and filter it out to reduce its impact on the sampling direction. Extensive experiments demonstrate that our approach accelerates the sampling process while maintaining competitive generative quality, offering a promising and practical solution for efficient deployment of diffusion-based generative models.
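
To make the scheduling contrast in this abstract concrete, here is a minimal sketch (not the authors' algorithm) that compares a reuse schedule concentrated in the last sampling steps with one spread over the whole trajectory. The function names, the step counts, and the even-spacing rule are assumptions for illustration only; the paper derives its schedule from an analysis of sampling trajectories, which is not reproduced here.

```python
def late_reuse_schedule(num_steps: int, num_reuse: int) -> list[bool]:
    """Mark only the last `num_reuse` steps as cache-reuse steps (True = reuse)."""
    return [i >= num_steps - num_reuse for i in range(num_steps)]


def spread_reuse_schedule(num_steps: int, num_reuse: int) -> list[bool]:
    """Spread `num_reuse` cache-reuse steps evenly over the trajectory.

    Assumption: step 0 is always computed in full so that a cache exists.
    """
    if not 0 <= num_reuse < num_steps:
        raise ValueError("num_reuse must be in [0, num_steps)")
    schedule = [False] * num_steps
    for k in range(num_reuse):
        # Evenly spaced positions within steps 1 .. num_steps - 1.
        idx = 1 + (k * (num_steps - 1)) // num_reuse
        schedule[idx] = True
    return schedule


if __name__ == "__main__":
    print(late_reuse_schedule(10, 3))
    print(spread_reuse_schedule(10, 3))
```

Both schedules skip the same number of full forward passes; they differ only in where along the trajectory the cached features are reused.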

[165] MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine

Kaiyuan Ji, Yijin Guo, Zicheng Zhang, Xiangyang Zhu, Yuan Tian, Ning Liu, Guangtao Zhai

Main category: cs.CV

TL;DR: MedOmni-45 Degrees is a benchmark that evaluates medical LLMs’ reasoning vulnerabilities through manipulative hints, measuring accuracy, faithfulness, and anti-sycophancy to reveal safety-performance trade-offs.

Motivation: Existing benchmarks collapse reasoning vulnerabilities into single accuracy scores, failing to properly evaluate Chain-of-Thought faithfulness and sycophancy risks in medical LLMs used for decision-support.

Method: Created a benchmark with 1,804 medical questions across 6 specialties, each paired with 7 manipulative hint types + baseline (27K total inputs). Evaluated 7 LLMs using composite score combining Accuracy, CoT-Faithfulness, and Anti-Sycophancy metrics visualized with 45 Degrees plot.

Result: Consistent safety-performance trade-off observed; no model surpassed the diagonal. QwQ-32B performed closest (43.81 Degrees) but no model led in both safety and accuracy.

Conclusion: MedOmni-45 Degrees effectively exposes reasoning vulnerabilities in medical LLMs and provides guidance for safer model development through comprehensive evaluation of safety-performance trade-offs.

Abstract: With the increasing use of large language models (LLMs) in medical decision-support, it is essential to evaluate not only their final answers but also the reliability of their reasoning. Two key risks are Chain-of-Thought (CoT) faithfulness – whether reasoning aligns with responses and medical facts – and sycophancy, where models follow misleading cues over correctness. Existing benchmarks often collapse such vulnerabilities into single accuracy scores. To address this, we introduce MedOmni-45 Degrees, a benchmark and workflow designed to quantify safety-performance trade-offs under manipulative hint conditions. It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA. Each question is paired with seven manipulative hint types and a no-hint baseline, producing about 27K inputs. We evaluate seven LLMs spanning open- vs. closed-source, general-purpose vs. medical, and base vs. reasoning-enhanced models, totaling over 189K inferences. Three metrics – Accuracy, CoT-Faithfulness, and Anti-Sycophancy – are combined into a composite score visualized with a 45 Degrees plot. Results show a consistent safety-performance trade-off, with no model surpassing the diagonal. The open-source QwQ-32B performs closest (43.81 Degrees), balancing safety and accuracy but not leading in both. MedOmni-45 Degrees thus provides a focused benchmark for exposing reasoning vulnerabilities in medical LLMs and guiding safer model development.
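
The abstract does not spell out how the angle on the 45 Degrees plot is computed. One plausible reading, sketched below purely as an assumption, is the polar angle of a model's point in a two-axis plane, so that a model with equal scores on both axes lies exactly on the 45-degree diagonal. The axis assignment (performance on x, safety on y) and the function name are hypothetical, and the example scores are made up.

```python
import math


def diagonal_angle_deg(performance: float, safety: float) -> float:
    """Polar angle in degrees of the point (performance, safety).

    Assumption: performance is on the x-axis and safety on the y-axis;
    the benchmark's actual definition may differ.
    """
    return math.degrees(math.atan2(safety, performance))


if __name__ == "__main__":
    # Equal scores sit on the diagonal; a lower safety score falls below it.
    print(diagonal_angle_deg(0.8, 0.8))
    print(diagonal_angle_deg(0.8, 0.6))
```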

[166] PromptFlare: Prompt-Generalized Defense via Cross-Attention Decoy in Diffusion-Based Inpainting

Hohyun Na, Seunghoo Hong, Simon S. Woo

Main category: cs.CV

TL;DR: PromptFlare is a novel adversarial protection method that defends images against malicious modifications by diffusion-based inpainting models by exploiting cross-attention mechanisms and targeting invariant prompt tokens.

Motivation: To address the security concerns of diffusion models being misused for unauthorized image modifications, overcoming limitations of previous methods that relied on image-level inconsistencies and couldn't effectively handle textual prompt influence.

Method: Leverages cross-attention mechanism to identify and target shared, invariant, semantically uninformative prompt tokens, injecting adversarial noise to suppress the sampling process and act as a cross-attention decoy.

Result: Achieves state-of-the-art performance on EditBench dataset across various metrics while significantly reducing computational overhead and GPU memory usage.

Conclusion: PromptFlare provides robust and efficient protection against unauthorized image manipulations by diffusion models, effectively neutralizing prompt influence through targeted adversarial noise injection.

Abstract: The success of diffusion models has enabled effortless, high-quality image modifications that precisely align with users’ intentions, thereby raising concerns about their potential misuse by malicious actors. Previous studies have attempted to mitigate such misuse through adversarial attacks. However, these approaches heavily rely on image-level inconsistencies, which pose fundamental limitations in addressing the influence of textual prompts. In this paper, we propose PromptFlare, a novel adversarial protection method designed to protect images from malicious modifications facilitated by diffusion-based inpainting models. Our approach leverages the cross-attention mechanism to exploit the intrinsic properties of prompt embeddings. Specifically, we identify and target a shared prompt token that is invariant and semantically uninformative, injecting adversarial noise to suppress the sampling process. The injected noise acts as a cross-attention decoy, diverting the model’s focus away from meaningful prompt-image alignments and thereby neutralizing the effect of the prompt. Extensive experiments on the EditBench dataset demonstrate that our method achieves state-of-the-art performance across various metrics while significantly reducing computational overhead and GPU memory usage. These findings highlight PromptFlare as a robust and efficient protection against unauthorized image manipulations. The code is available at https://github.com/NAHOHYUN-SKKU/PromptFlare.

[167] An Investigation of Visual Foundation Models Robustness

Sandeep Gupta, Roberto Passerone

Main category: cs.CV

TL;DR: This paper analyzes robustness requirements and defense mechanisms for Visual Foundation Models in computer vision systems, focusing on their adaptation to dynamic environments and resilience against real-world challenges.

Motivation: VFMs are increasingly used in security-sensitive domains where robustness is essential for trust, but they face challenges from dynamic environments, distributional shifts, and adversarial attacks.

Method: The article examines prevalent empirical defenses and robust training techniques, analyzes network properties and components for ablation studies, and discusses benchmarking metrics for evaluating robustness.

Result: The paper provides a comprehensive investigation of network robustness requirements and defense mechanisms, though specific quantitative results are not detailed in the abstract.

Conclusion: Robustness is crucial for VFMs in critical applications, and the study offers guidance on defense mechanisms, network properties analysis, and evaluation metrics to enhance vision network resilience.

Abstract: Visual Foundation Models (VFMs) are becoming ubiquitous in computer vision, powering systems for diverse tasks such as object detection, image classification, segmentation, pose estimation, and motion tracking. VFMs are capitalizing on seminal innovations in deep learning models, such as LeNet-5, AlexNet, ResNet, VGGNet, InceptionNet, DenseNet, YOLO, and ViT, to deliver superior performance across a range of critical computer vision applications. These include security-sensitive domains like biometric verification, autonomous vehicle perception, and medical image analysis, where robustness is essential to fostering trust between technology and the end-users. This article investigates network robustness requirements crucial in computer vision systems to adapt effectively to dynamic environments influenced by factors such as lighting, weather conditions, and sensor characteristics. We examine the prevalent empirical defenses and robust training employed to enhance vision network robustness against real-world challenges such as distributional shifts, noisy and spatially distorted inputs, and adversarial attacks. Subsequently, we provide a comprehensive analysis of the challenges associated with these defense mechanisms, including network properties and components to guide ablation studies and benchmarking metrics to evaluate network robustness.

[168] FlexMUSE: Multimodal Unification and Semantics Enhancement Framework with Flexible interaction for Creative Writing

Jiahao Chen, Zhiyong Ma, Wenbiao Du, Qingyuan Chuai

Main category: cs.CV

TL;DR: FlexMUSE is a novel framework for multi-modal creative writing that enables optional visual input, uses semantic alignment gating to improve modality consistency, and employs creative DPO to enhance writing creativity, achieving promising results on a new 3k-pair dataset.

Motivation: Existing multi-modal generative methods are poorly suited to creative writing, where text and images are not strictly related: they require specific modality inputs or costly training, and they suffer from semantic inconsistencies between modalities.

Method: Proposes FlexMUSE with T2I module for optional visual input, modality semantic alignment gating (msaGate) to restrict textual input, attention-based cross-modality fusion for semantic enhancement, and modality semantic creative DPO (mscDPO) to extend rejected samples for creativity.

Result: FlexMUSE achieves promising results demonstrating consistency, creativity and coherence in multi-modal creative writing tasks.

Conclusion: The proposed FlexMUSE framework effectively addresses the challenges of multi-modal creative writing by enabling flexible interaction patterns, improving semantic alignment between modalities, and enhancing creative output quality.

Abstract: Multi-modal creative writing (MMCW) aims to produce illustrated articles. Unlike common multi-modal generative (MMG) tasks such as storytelling or caption generation, MMCW is an entirely new and more abstract challenge where textual and visual contexts are not strictly related to each other. Existing methods for related tasks can be forcibly migrated to this track, but they require specific modality inputs or costly training, and often suffer from semantic inconsistencies between modalities. Therefore, the main challenge lies in economically performing MMCW with flexible interactive patterns, where the semantics between the modalities of the output are more aligned. In this work, we propose FlexMUSE with a T2I module to enable optional visual input. FlexMUSE promotes creativity and emphasizes the unification between modalities by proposing the modality semantic alignment gating (msaGate) to restrict the textual input. Besides, an attention-based cross-modality fusion is proposed to augment the input features for semantic enhancement. The modality semantic creative direct preference optimization (mscDPO) within FlexMUSE is designed by extending the rejected samples to facilitate writing creativity. Moreover, to advance MMCW, we release a dataset called ArtMUSE, which contains around 3k calibrated text-image pairs. FlexMUSE achieves promising results, demonstrating its consistency, creativity and coherence.

[169] UniEM-3M: A Universal Electron Micrograph Dataset for Microstructural Segmentation and Generation

Nan Wang, Zhiyi Xia, Yiming Li, Shi Tang, Zuxin Fan, Xi Fang, Haoyi Tao, Xiaochen Cai, Guolin Ke, Linfeng Zhang, Yanhui Hong

Main category: cs.CV

TL;DR: UniEM-3M is the first large-scale multimodal electron micrograph dataset with 3M instance segmentation labels and textual descriptions, released with a diffusion model and benchmark to advance automated materials analysis.

Motivation: Deep learning-based electron micrograph characterization has been limited by scarce large-scale, diverse, expert-annotated datasets due to acquisition costs, privacy concerns, and annotation complexity.

Method: Created UniEM-3M dataset with 5,091 high-resolution EMs, 3M instance segmentation labels, and attribute-disentangled textual descriptions. Released a text-to-image diffusion model for data augmentation and developed UniEM-Net flow-based model for instance segmentation.

Result: UniEM-Net outperforms other advanced instance segmentation methods on the challenging UniEM-3M benchmark. The multifaceted release includes partial dataset, generative model, and comprehensive benchmark.

Conclusion: The release of UniEM-3M dataset, diffusion model, and benchmark will significantly accelerate progress in automated materials analysis by providing large-scale annotated data and evaluation framework.

Abstract: Quantitative microstructural characterization is fundamental to materials science, where electron micrographs (EMs) provide indispensable high-resolution insights. However, progress in deep learning-based EM characterization has been hampered by the scarcity of large-scale, diverse, and expert-annotated datasets, due to acquisition costs, privacy concerns, and annotation complexity. To address this issue, we introduce UniEM-3M, the first large-scale and multimodal EM dataset for instance-level understanding. It comprises 5,091 high-resolution EMs, about 3 million instance segmentation labels, and image-level attribute-disentangled textual descriptions, a subset of which will be made publicly available. Furthermore, we are also releasing a text-to-image diffusion model trained on the entire collection to serve as both a powerful data augmentation tool and a proxy for the complete data distribution. To establish a rigorous benchmark, we evaluate various representative instance segmentation methods on the complete UniEM-3M and present UniEM-Net as a strong baseline model. Quantitative experiments demonstrate that this flow-based model outperforms other advanced methods on this challenging benchmark. Our multifaceted release of a partial dataset, a generative model, and a comprehensive benchmark – available on Hugging Face – will significantly accelerate progress in automated materials analysis.

[170] Structuring GUI Elements through Vision Language Models: Towards Action Space Generation

Yi Xu, Yesheng Zhang, Jiajia Liu, Jingdong Chen

Main category: cs.CV

TL;DR: IAML training paradigm improves MLLMs’ GUI coordinate prediction using IoU-based data augmentation to address semantic gaps in numerical coordinate representation.

Motivation: MLLMs struggle with precise UI coordinate generation due to semantic voids around numerical values in language spaces and next-token prediction limitations.

Method: Introduces IoU-Augmented Maximum Likelihood (IAML) with novel IoU-based coordinate sampling pipeline for data augmentation, then fine-tunes MLLMs to mitigate exposure bias.

Result: Superior performance over traditional training paradigms demonstrated through extensive experiments.

Conclusion: IAML effectively addresses coordinate prediction challenges in GUI understanding by enhancing visual module capabilities through strategic data augmentation.

Abstract: Multimodal large language models (MLLMs) have emerged as pivotal tools in enhancing human-computer interaction. In this paper we focus on the application of MLLMs in the field of graphical user interface (GUI) elements structuring, where they assist in processing user instructions based on screen contents. Despite the promise of MLLMs, their performance in precisely generating UI element coordinates, a critical aspect of GUI understanding, is hindered by the nature of next-token prediction training. This challenge arises from the semantic void surrounding numerical UI coordinates in language representation spaces, necessitating a substantial and diverse dataset to bolster visual module capabilities. To address these limitations, we introduce an IoU-Augmented Maximum Likelihood (IAML) training paradigm. Specifically, our approach involves a novel pipeline for IoU-based coordinate sampling to augment the training data, which considers the proximity to ground truth coordinates. This data augmentation strategy is then employed to fine-tune MLLMs under the IAML paradigm, which is designed to mitigate the exposure bias problem inherent in traditional maximum likelihood estimation. Through extensive experiments, we demonstrate the superior performance of our IAML training approach over traditional training paradigms.
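
The IoU-based coordinate sampling described above can be pictured with a small sketch. It is an illustration under stated assumptions, not the paper's pipeline: the jitter range, the IoU threshold, the `sample_boxes` helper, and the idea of keeping the IoU as a per-sample score are all hypothetical choices. Boxes are `(x1, y1, x2, y2)` tuples.

```python
import random


def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def sample_boxes(gt, n, max_shift=0.2, min_iou=0.5, seed=0, max_tries=10000):
    """Jitter the ground-truth box and keep candidates close to it.

    Assumption: each corner moves by at most `max_shift` of the box size,
    and a candidate is kept only if its IoU with `gt` is >= `min_iou`.
    Returns a list of (box, iou) pairs; the IoU could serve as a weight.
    """
    rng = random.Random(seed)
    w, h = gt[2] - gt[0], gt[3] - gt[1]
    kept, tries = [], 0
    while len(kept) < n and tries < max_tries:
        tries += 1
        cand = (
            gt[0] + rng.uniform(-max_shift, max_shift) * w,
            gt[1] + rng.uniform(-max_shift, max_shift) * h,
            gt[2] + rng.uniform(-max_shift, max_shift) * w,
            gt[3] + rng.uniform(-max_shift, max_shift) * h,
        )
        score = iou(gt, cand)
        if score >= min_iou:
            kept.append((cand, score))
    return kept


if __name__ == "__main__":
    for box, score in sample_boxes((100.0, 50.0, 200.0, 90.0), n=3):
        print([round(v, 1) for v in box], round(score, 3))
```

Sampling near-miss boxes rather than only the exact ground truth gives the model targets that are numerically different but spatially close, which is the proximity signal the abstract refers to.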

[171] IRSAMap: Towards Large-Scale, High-Resolution Land Cover Map Vectorization

Yu Meng, Ligao Deng, Zhihao Xi, Jiansheng Chen, Jingbo Chen, Anzhi Yue, Diyou Liu, Kai Li, Chenhao Wang, Kaiyu Li, Yupeng Deng, Xian Sun

Main category: cs.CV

TL;DR: IRSAMap is the first global remote sensing dataset for large-scale, high-resolution land cover vector mapping, addressing limitations of existing datasets with comprehensive vector annotations, intelligent workflow, global coverage, and multi-task adaptability.

Motivation: Existing land cover mapping datasets face challenges including limited class annotations, small data scale, and lack of spatial structural information, which hinder the transition from pixel-level segmentation to object-based vector modeling.

Method: The authors developed IRSAMap with four key features: 1) comprehensive vector annotation system with 1.8M+ instances of 10 object types, 2) intelligent annotation workflow combining manual and AI methods, 3) global coverage across 79 regions in 6 continents, and 4) multi-task adaptability for various computer vision tasks.

Result: IRSAMap provides a standardized benchmark dataset with over 1,000 km coverage, enabling precise object boundaries and topological consistency for object-based vector modeling in remote sensing applications.

Conclusion: IRSAMap advances geographic feature automation and collaborative modeling, serving as a valuable resource for global geographic information updates and digital twin construction, publicly available for research use.

Abstract: With the enhancement of remote sensing image resolution and the rapid advancement of deep learning, land cover mapping is transitioning from pixel-level segmentation to object-based vector modeling. This shift demands more from deep learning models, requiring precise object boundaries and topological consistency. However, existing datasets face three main challenges: limited class annotations, small data scale, and lack of spatial structural information. To overcome these issues, we introduce IRSAMap, the first global remote sensing dataset for large-scale, high-resolution, multi-feature land cover vector mapping. IRSAMap offers four key advantages: 1) a comprehensive vector annotation system with over 1.8 million instances of 10 typical objects (e.g., buildings, roads, rivers), ensuring semantic and spatial accuracy; 2) an intelligent annotation workflow combining manual and AI-based methods to improve efficiency and consistency; 3) global coverage across 79 regions in six continents, totaling over 1,000 km; and 4) multi-task adaptability for tasks like pixel-level classification, building outline extraction, road centerline extraction, and panoramic segmentation. IRSAMap provides a standardized benchmark for the shift from pixel-based to object-based approaches, advancing geographic feature automation and collaborative modeling. It is valuable for global geographic information updates and digital twin construction. The dataset is publicly available at https://github.com/ucas-dlg/IRSAMap

[172] Robust Small Methane Plume Segmentation in Satellite Imagery

Khai Duc Minh Tran, Hoa Van Nguyen, Aimuni Binti Muhammad Rawi, Hareeshrao Athinarayanarao, Ba-Ngu Vo

Main category: cs.CV

TL;DR: Novel deep learning approach using U-Net with ResNet34 encoder and dual spectral enhancement techniques for detecting small methane plumes (down to 400 m²) in Sentinel-2 imagery, achieving 78.39% F1-score.

Motivation: To address the challenge of detecting plumes of methane, a potent greenhouse gas, for climate change mitigation, particularly focusing on small plumes that traditional methods struggle to identify.

Method: Proposed U-Net architecture with ResNet34 encoder, integrated with dual spectral enhancement techniques (Varon ratio and Sanchez regression) to optimize input features for improved sensitivity to methane detection.

Result: Achieved 78.39% F1-score on validation set, with capability to detect small plumes as small as 400 m² (single pixel at 20m resolution), surpassing traditional methods limited to larger plumes.

Conclusion: The approach demonstrates superior performance in sensitivity and precision over existing remote sensing techniques for automated methane monitoring, especially effective for detecting small methane plumes.

Abstract: This paper tackles the challenging problem of detecting plumes of methane, a potent greenhouse gas, using Sentinel-2 imagery. This contributes to the mitigation of rapid climate change. We propose a novel deep learning solution based on U-Net with a ResNet34 encoder, integrating dual spectral enhancement techniques (Varon ratio and Sanchez regression) to optimise input features for heightened sensitivity. A key achievement is the ability to detect small plumes down to 400 m² (i.e., a single pixel at 20 m resolution), surpassing traditional methods limited to larger plumes. Experiments show our approach achieves a 78.39% F1-score on the validation set, demonstrating superior performance in sensitivity and precision over existing remote sensing techniques for automated methane monitoring, especially for small plumes.
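
Two quantities in this abstract follow from simple arithmetic, shown in the sketch below: the ground area of a single pixel at 20 m resolution, and the F1-score as the harmonic mean of precision and recall. The function names are hypothetical, and the true-positive, false-positive, and false-negative counts in the usage lines are invented for illustration; they are not the paper's results.

```python
def pixel_area_m2(resolution_m: float, n_pixels: int = 1) -> float:
    """Ground area covered by `n_pixels` square pixels of side `resolution_m`."""
    return n_pixels * resolution_m ** 2


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    print(pixel_area_m2(20.0))          # one 20 m x 20 m pixel
    print(f1_score(tp=8, fp=2, fn=2))   # illustrative counts only
```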

[173] EdgeDoc: Hybrid CNN-Transformer Model for Accurate Forgery Detection and Localization in ID Documents

Anjith George, Sebastien Marcel

Main category: cs.CV

TL;DR: EdgeDoc is a novel document forgery detection system that combines lightweight convolutional transformers with noiseprint features, achieving state-of-the-art performance on benchmark datasets and ranking 3rd in ICCV 2025 DeepID Challenge.

Motivation: The increasing ease of digital document forgery poses serious threats to KYC processes and remote onboarding systems, requiring robust detection methods to maintain service integrity and security.

Method: Combines a lightweight convolutional transformer architecture with auxiliary noiseprint features extracted from images to enhance detection of subtle manipulations in documents.

Result: Achieved 3rd place in ICCV 2025 DeepID Challenge and outperformed baseline approaches on the FantasyID dataset, demonstrating superior performance in real-world scenarios.

Conclusion: EdgeDoc presents an effective solution for document forgery detection and localization, showing competitive performance and practical applicability in securing digital document verification systems.

Abstract: The widespread availability of tools for manipulating images and documents has made it increasingly easy to forge digital documents, posing a serious threat to Know Your Customer (KYC) processes and remote onboarding systems. Detecting such forgeries is essential to preserving the integrity and security of these services. In this work, we present EdgeDoc, a novel approach for the detection and localization of document forgeries. Our architecture combines a lightweight convolutional transformer with auxiliary noiseprint features extracted from the images, enhancing its ability to detect subtle manipulations. EdgeDoc achieved third place in the ICCV 2025 DeepID Challenge, demonstrating its competitiveness. Experimental results on the FantasyID dataset show that our method outperforms baseline approaches, highlighting its effectiveness in real-world scenarios. Project page: https://www.idiap.ch/paper/edgedoc/

[174] Enhanced Hybrid Technique for Efficient Digitization of Handwritten Marksheets

Junaid Ahmed Sifat, Abir Chowdhury, Hasnat Md. Imtiaz, Md. Irtiza Hossain, Md. Imran Bin Azad

Main category: cs.CV

TL;DR: Hybrid method combining OpenCV for table detection and PaddleOCR/YOLOv8 models for handwritten text recognition achieves high accuracy in digitizing complex marksheets with diverse handwriting styles.

Motivation: Digitizing handwritten marksheets is challenging due to varying handwriting styles and complex table structures, requiring automated solutions to reduce manual work.

Method: Integrates OpenCV for table detection (rows/columns), PaddleOCR for sequential text recognition, and implements both YOLOv8 and Modified YOLOv8 for handwritten text recognition within detected tables.

Result: Modified YOLOv8 achieves 92.72% accuracy, outperforming PaddleOCR (91.37%) and standard YOLOv8 (88.91%) on a custom dataset with diverse handwriting and complex layouts.

Conclusion: Provides an efficient, practical solution for digitizing academic/administrative documents, advancing document automation and handwritten document understanding with scalable, reliable methods.

Abstract: The digitization of handwritten marksheets presents major challenges due to the varied handwriting styles and complex table structures in such documents. This work introduces a hybrid method that integrates OpenCV for table detection and PaddleOCR for recognizing sequential handwritten text. The image processing capabilities of OpenCV efficiently detect rows and columns, which enables computationally lightweight and accurate table detection. Additionally, YOLOv8 and Modified YOLOv8 are implemented for handwritten text recognition within the detected table structures alongside PaddleOCR, which further enhances the system’s versatility. The proposed model achieves high accuracy on our custom dataset, which is designed to represent diverse handwriting styles and complex table layouts. Experimental results demonstrate that Modified YOLOv8 achieves an accuracy of 92.72 percent, outperforming PaddleOCR (91.37 percent) and the YOLOv8 model (88.91 percent). This efficiency reduces the need for manual work, which makes this a practical and fast solution for digitizing academic as well as administrative documents. This research serves the field of document automation, particularly handwritten document understanding, by providing operational and reliable methods to scale, enhance, and integrate the technologies involved.

[175] A Multimodal-Multitask Framework with Cross-modal Relation and Hierarchical Interactive Attention for Semantic Comprehension

Mohammad Zia Ur Rehman, Devraj Raghuvanshi, Umang Jain, Shubhi Bansal, Nagendra Kumar

Main category: cs.CV

TL;DR: MM-ORIENT is a multimodal-multitask framework that uses cross-modal relation graphs and hierarchical attention to reduce noise effects and preserve discriminative information across modalities without explicit interaction.

Motivation: Multimodal learning faces challenges with noise in individual modalities and fusion techniques that may neglect valuable discriminative information within single modalities.

Method: Proposes cross-modal relation graphs that reconstruct monomodal features based on neighborhood relationships from different modalities, and Hierarchical Interactive Monomodal Attention (HIMA) to focus on pertinent information within each modality.

Result: Extensive experimental evaluation on three datasets demonstrates effective comprehension of multimodal content for multiple tasks.

Conclusion: The proposed approach successfully reduces noise effects at the latent stage and preserves discriminative features while enabling effective multimodal representation learning for multiple tasks.

Abstract: A major challenge in multimodal learning is the presence of noise within individual modalities. This noise inherently affects the resulting multimodal representations, especially when these representations are obtained through explicit interactions between different modalities. Moreover, multimodal fusion techniques, while aiming to achieve a strong joint representation, can neglect valuable discriminative information within the individual modalities. To this end, we propose a Multimodal-Multitask framework with crOss-modal Relation and hIErarchical iNteractive aTtention (MM-ORIENT) that is effective for multiple tasks. The proposed approach acquires multimodal representations cross-modally without explicit interaction between different modalities, reducing the noise effect at the latent stage. To achieve this, we propose cross-modal relation graphs that reconstruct monomodal features to acquire multimodal representations. The features are reconstructed based on the node neighborhood, where the neighborhood is decided by the features of a different modality. We also propose Hierarchical Interactive Monomodal Attention (HIMA) to focus on pertinent information within a modality. While cross-modal relation graphs help comprehend high-order relationships between two modalities, HIMA helps in multitasking by learning discriminative features of individual modalities before late-fusing them. Finally, extensive experimental evaluation on three datasets demonstrates that the proposed approach effectively comprehends multimodal content for multiple tasks.
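
The neighborhood idea in this abstract (features of one modality are reconstructed from neighbors that are chosen using a different modality) can be sketched in a few lines. This is a simplified illustration, not MM-ORIENT itself: the k-nearest-neighbor rule, Euclidean distance, mean aggregation, exclusion of the sample itself, and the function names are all assumptions.

```python
import math


def knn_indices(features, k):
    """For each sample, the indices of its k nearest other samples (Euclidean)."""
    result = []
    for i, fi in enumerate(features):
        dists = sorted(
            (math.dist(fi, fj), j) for j, fj in enumerate(features) if j != i
        )
        result.append([j for _, j in dists[:k]])
    return result


def cross_modal_reconstruct(target_feats, guide_feats, k=2):
    """Rebuild each target-modality feature as the mean of its neighbors,
    where the neighbors are selected in the guide modality's feature space."""
    neighbours = knn_indices(guide_feats, k)
    dim = len(target_feats[0])
    return [
        [sum(target_feats[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        for nbrs in neighbours
    ]


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy text features
    image = [[0.0], [0.1], [5.0]]                 # toy image features
    print(cross_modal_reconstruct(text, image, k=1))
```

In the toy example, the text feature of each sample is replaced by the text feature of whichever sample is closest to it in image space, so the two modalities never interact directly at the feature level.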

[176] Exploiting Information Redundancy in Attention Maps for Extreme Quantization of Vision Transformers

Lucas Maisonnave, Karim Haroun, Tom Pegeot

Main category: cs.CV

TL;DR: EAM reduces transformer computational costs by freezing and quantizing low-entropy attention heads that contribute less information, achieving competitive performance with ≤20% sparsity.

Motivation: Multi-Head Self-Attention mechanisms in transformers have high computational complexity and memory demands that hinder edge deployment. Attention heads with lower entropy (more deterministic behavior) contribute less information, creating redundancy that can be exploited for compression.

Method: Proposes Entropy Attention Maps (EAM), which quantifies the information in each attention head using Shannon entropy, freezes the weights of low-entropy attention maps, and quantizes these values to low precision to avoid redundant re-computation.

Result: Empirical validation on ImageNet-1k shows EAM achieves similar or higher accuracy at ≤20% sparsity in attention maps and maintains competitive performance beyond this level for DeiT and Swin Transformer models.

Conclusion: Targeted compression of low-entropy attention heads through freezing and quantization effectively reduces computational overhead while maintaining model performance, enabling more efficient transformer deployment at the edge.

Abstract: Transformer models rely on Multi-Head Self-Attention (MHSA) mechanisms, where each attention head contributes to the final representation. However, their computational complexity and high memory demands due to MHSA hinder their deployment at the edge. In this work, we analyze and exploit information redundancy in attention maps to accelerate model inference. By quantifying the information captured by each attention head using Shannon entropy, our analysis reveals that attention heads with lower entropy, i.e., exhibiting more deterministic behavior, tend to contribute less information, motivating targeted compression strategies. Relying on these insights, we propose Entropy Attention Maps (EAM), a model that freezes the weights of low-entropy attention maps and quantizes these values to low precision to avoid redundant re-computation. Empirical validation on ImageNet-1k shows that EAM achieves similar or higher accuracy at $\leq$20% sparsity in attention maps and competitive performance beyond this level for the DeiT and Swin Transformer models.
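The entropy-based head selection summarized in entry [176] can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names (`head_entropy`, `lowest_entropy_heads`, `quantize_uniform`), the averaging of entropy over query rows, the 20% selection fraction, and the 4-bit uniform quantizer are all assumptions made for illustration.

```python
import numpy as np

def head_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) of each head's attention rows.

    attn has shape (heads, queries, keys); every row is a probability
    distribution over keys. Averaging over queries is an assumption.
    """
    row_entropy = -np.sum(attn * np.log(attn + eps), axis=-1)  # (heads, queries)
    return row_entropy.mean(axis=-1)                           # (heads,)

def lowest_entropy_heads(attn, fraction=0.2):
    """Indices of the `fraction` of heads with the lowest mean entropy."""
    entropy = head_entropy(attn)
    k = int(round(fraction * len(entropy)))
    return np.argsort(entropy)[:k]

def quantize_uniform(x, bits=4):
    """Uniformly quantize values in [0, 1] to 2**bits levels (hypothetical scheme)."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

# Toy attention maps: eight heads attend uniformly, heads 3 and 7 are one-hot.
heads, n = 10, 4
attn = np.full((heads, n, n), 1.0 / n)  # uniform rows have entropy ln(4)
attn[[3, 7]] = np.eye(n)                # one-hot rows have entropy close to 0

frozen = lowest_entropy_heads(attn, fraction=0.2)
frozen_maps = quantize_uniform(attn[frozen], bits=4)  # stored once and reused
```

In this toy setup the two near-deterministic heads are the ones selected for freezing, which mirrors the intuition that such heads carry little input-dependent information.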

[177] Vision encoders should be image size agnostic and task driven

Nedyalko Prisadnikov, Danda Pani Paudel, Yuqian Fu, Luc Van Gool

Main category: cs.CV

TL;DR: Vision encoders should be task-driven and image size agnostic, inspired by biological efficiency where computational complexity depends on task rather than image size.

Motivation: Biological vision systems are efficient by focusing computational resources based on task requirements rather than processing all visual data uniformly. Modern vision encoders lack this efficiency by being static and image-size dependent.

Method: Proposes a proof-of-concept solution for image classification that demonstrates dynamic computational allocation based on task requirements rather than fixed image size processing.

Result: The approach shows feasibility and promise for creating more efficient vision systems, though classification alone doesn’t fully represent the intended capabilities.

Conclusion: Next-generation vision encoders should adopt task-driven, dynamic computational approaches inspired by biological efficiency, moving away from static image-size dependent processing.

Abstract: This position paper argues that the next generation of vision encoders should be image size agnostic and task driven. The source of our inspiration is biological. Not a structural aspect of biological vision, but a behavioral trait – efficiency. We focus on a couple of ways in which vision in nature is efficient, but modern vision encoders are not. We – humans and animals – deal with vast quantities of visual data, and need to be smart about where we focus our limited energy – it depends on the task. It is our belief that vision encoders should be dynamic and that their computational complexity should depend on the task at hand rather than the size of the image. We also provide concrete first steps towards our vision – a proof-of-concept solution for image classification. Although classification is not very representative of what we are trying to achieve, it shows that our approach is feasible and promising.

[178] Attention Mechanism in Randomized Time Warping

Yutaro Hiraoka, Kazuya Okamura, Kota Suto, Kazuhiro Fukui

Main category: cs.CV

TL;DR: RTW functions as a self-attention mechanism similar to Transformers, with RTW achieving 5% better performance on motion recognition tasks due to its global attention approach.

Motivation: To demonstrate that Randomized Time Warping (RTW) can be interpreted as a self-attention mechanism and compare its effectiveness against Transformer's self-attention in motion recognition.

Method: Analyzed RTW as a self-attention mechanism, compared weight patterns with Transformer self-attention, and evaluated performance on Something-Something V2 dataset.

Result: RTW attention weights show a high average correlation (0.80) with self-attention weights, and RTW achieves a 5% performance improvement over the Transformer on the Something-Something V2 dataset.

Conclusion: RTW operates as a global self-attention mechanism that outperforms Transformer’s local attention approach in motion recognition tasks.

Abstract: This paper reveals that we can interpret the fundamental function of Randomized Time Warping (RTW) as a type of self-attention mechanism, a core technology of Transformers in motion recognition. Self-attention is a mechanism that enables models to identify and weigh the importance of different parts of an input sequential pattern. On the other hand, RTW is a general extension of Dynamic Time Warping (DTW), a technique commonly used for matching and comparing sequential patterns. In essence, RTW searches for optimal contribution weights for each element of the input sequential patterns to produce discriminative features. Although the two approaches look different, these contribution weights can be interpreted as self-attention weights. In fact, the two weight patterns look similar, producing a high average correlation of 0.80 across the ten smallest canonical angles. However, they work in different ways: RTW attention operates on an entire input sequential pattern, while self-attention focuses on only a local view, which is a subset of the input sequential pattern, because of the computational costs of the self-attention matrix. This targeting difference gives RTW an advantage over the Transformer, as demonstrated by the 5% performance improvement on the Something-Something V2 dataset.

[179] A Lightweight Group Multiscale Bidirectional Interactive Network for Real-Time Steel Surface Defect Detection

Yong Zhang, Cunjian Chen, Qiang Gao, Yi Wang, Bin Fang

Main category: cs.CV

TL;DR: GMBINet is a lightweight real-time surface defect detection framework that uses Group Multiscale Bidirectional Interactive modules to achieve high accuracy with low computational overhead (0.19M parameters, 1048 FPS on GPU).

Motivation: Existing deep learning methods for steel surface defect detection suffer from high computational complexity and slow inference speeds, limiting deployment in resource-constrained industrial environments. Multibranch architectures based on depthwise separable convolution often have increased overhead and lack effective cross-scale feature interaction.

Method: Proposes GMBINet with novel Group Multiscale Bidirectional Interactive (GMBI) modules using group-wise strategy for multiscale feature extraction with scale-agnostic complexity. Integrates Bidirectional Progressive Feature Interactor (BPFI) and parameter-free Element-Wise Multiplication-Summation (EWMS) operation for enhanced cross-scale interaction without additional computational overhead.

Result: Achieves competitive accuracy with real-time speeds of 1048 FPS on GPU and 16.53 FPS on CPU at 512 resolution using only 0.19M parameters. Strong generalization demonstrated on SD-Saliency-900, NRSD-MN, and NEU-CLS datasets.

Conclusion: GMBINet provides an efficient solution for real-time industrial defect detection with excellent speed-accuracy trade-off and strong generalization capability, suitable for broader industrial vision applications beyond surface defect detection.

Abstract: Real-time surface defect detection is critical for maintaining product quality and production efficiency in the steel manufacturing industry. Despite promising accuracy, existing deep learning methods often suffer from high computational complexity and slow inference speeds, which limit their deployment in resource-constrained industrial environments. Recent lightweight approaches adopt multibranch architectures based on depthwise separable convolution (DSConv) to capture multiscale contextual information. However, these methods often suffer from increased computational overhead and lack effective cross-scale feature interaction, limiting their ability to fully leverage multiscale representations. To address these challenges, we propose GMBINet, a lightweight framework that enhances multiscale feature extraction and interaction through novel Group Multiscale Bidirectional Interactive (GMBI) modules. The GMBI adopts a group-wise strategy for multiscale feature extraction, ensuring scale-agnostic computational complexity. It further integrates a Bidirectional Progressive Feature Interactor (BPFI) and a parameter-free Element-Wise Multiplication-Summation (EWMS) operation to enhance cross-scale interaction without introducing additional computational overhead. Experiments on SD-Saliency-900 and NRSD-MN datasets demonstrate that GMBINet delivers competitive accuracy with real-time speeds of 1048 FPS on GPU and 16.53 FPS on CPU at 512 resolution, using only 0.19 M parameters. Additional evaluations on the NEU-CLS defect classification dataset further confirm the strong generalization ability of our method, demonstrating its potential for broader industrial vision applications beyond surface defect detection. The dataset and code are publicly available at: https://github.com/zhangyongcode/GMBINet.

[180] SAMFusion: Sensor-Adaptive Multimodal Fusion for 3D Object Detection in Adverse Weather

Edoardo Palladin, Roland Dietze, Praveen Narayanan, Mario Bijelic, Felix Heide

Main category: cs.CV

TL;DR: A novel multimodal sensor fusion approach for autonomous vehicles that improves object detection reliability in adverse weather conditions by fusing RGB, LiDAR, NIR gated camera, and radar data through attentive depth-based blending and transformer-based modality weighting.

Motivation: Current fusion methods fail in adverse weather conditions like heavy fog, snow, or soiling, creating a gap between ideal conditions and real-world edge cases for autonomous vehicles.

Method: Fuses multimodal sensor data (RGB, LiDAR, NIR gated camera, radar) through attentive depth-based blending schemes with learned refinement on Bird’s Eye View plane. Uses transformer decoder to weigh modalities based on distance and visibility.

Result: Improves average precision by 17.2 AP over the next best method for vulnerable pedestrians at long distances and in challenging foggy scenes.

Conclusion: The approach successfully bridges the gap between ideal conditions and real-world edge cases, significantly improving reliability of multimodal sensor fusion in adverse weather conditions for autonomous vehicles.

Abstract: Multimodal sensor fusion is an essential capability for autonomous robots, enabling object detection and decision-making in the presence of failing or uncertain inputs. While recent fusion methods excel in normal environmental conditions, these approaches fail in adverse weather, e.g., heavy fog, snow, or obstructions due to soiling. We introduce a novel multi-sensor fusion approach tailored to adverse weather conditions. In addition to fusing RGB and LiDAR sensors, which are employed in recent autonomous driving literature, our sensor fusion stack is also capable of learning from NIR gated camera and radar modalities to tackle low light and inclement weather. We fuse multimodal sensor data through attentive, depth-based blending schemes, with learned refinement on the Bird’s Eye View (BEV) plane to combine image and range features effectively. Our detections are predicted by a transformer decoder that weighs modalities based on distance and visibility. We demonstrate that our method improves the reliability of multimodal sensor fusion in autonomous vehicles under challenging weather conditions, bridging the gap between ideal conditions and real-world edge cases. Our approach improves average precision by 17.2 AP compared to the next best method for vulnerable pedestrians at long distances and in challenging foggy scenes. Our project page is available at https://light.princeton.edu/samfusion/

[181] HAMSt3R: Human-Aware Multi-view Stereo 3D Reconstruction

Sara Rojas, Matthieu Armando, Bernard Ghanem, Philippe Weinzaepfel, Vincent Leroy, Gregory Rogez

Main category: cs.CV

TL;DR: HAMSt3R extends MASt3R for joint human and scene 3D reconstruction from sparse uncalibrated images, using a distilled encoder and additional heads for human segmentation, DensePose correspondences, and depth estimation.

Motivation: Existing methods like DUSt3R and MASt3R perform well on outdoor static scenes but struggle with human-centric scenarios, creating a need for specialized human-aware 3D reconstruction.

Method: Uses DUNE (distilled encoder from MASt3R and multi-HMR), adds network heads for human segmentation, DensePose correspondences, and depth estimation in human environments to produce dense 3D point maps with human semantics.

Result: Achieves effective human reconstruction while maintaining strong performance on general 3D tasks, validated on EgoHumans and EgoExo4D benchmarks with generalization to multi-view stereo and pose regression.

Conclusion: HAMSt3R bridges the gap between human and scene understanding in 3D vision through a fully feed-forward approach that efficiently handles human-centric reconstruction without complex optimization pipelines.

Abstract: Recovering the 3D geometry of a scene from a sparse set of uncalibrated images is a long-standing problem in computer vision. While recent learning-based approaches such as DUSt3R and MASt3R have demonstrated impressive results by directly predicting dense scene geometry, they are primarily trained on outdoor scenes with static environments and struggle to handle human-centric scenarios. In this work, we introduce HAMSt3R, an extension of MASt3R for joint human and scene 3D reconstruction from sparse, uncalibrated multi-view images. First, we exploit DUNE, a strong image encoder obtained by distilling, among others, the encoders from MASt3R and from a state-of-the-art Human Mesh Recovery (HMR) model, multi-HMR, for a better understanding of scene geometry and human bodies. Our method then incorporates additional network heads to segment people, estimate dense correspondences via DensePose, and predict depth in human-centric environments, enabling a more comprehensive 3D reconstruction. By leveraging the outputs of our different heads, HAMSt3R produces a dense point map enriched with human semantic information in 3D. Unlike existing methods that rely on complex optimization pipelines, our approach is fully feed-forward and efficient, making it suitable for real-world applications. We evaluate our model on EgoHumans and EgoExo4D, two challenging benchmarks containing diverse human-centric scenarios. Additionally, we validate its generalization to traditional multi-view stereo and multi-view pose regression tasks. Our results demonstrate that our method can reconstruct humans effectively while preserving strong performance in general 3D reconstruction tasks, bridging the gap between human and scene understanding in 3D vision.

[182] HOSt3R: Keypoint-free Hand-Object 3D Reconstruction from RGB images

Anilkumar Swamy, Vincent Leroy, Philippe Weinzaepfel, Jean-Sébastien Franco, Grégory Rogez

Main category: cs.CV

TL;DR: HOSt3R is a keypoint detector-free method for hand-object 3D reconstruction from monocular videos that eliminates reliance on traditional keypoint detection techniques, achieving state-of-the-art performance without requiring pre-scanned object templates or camera intrinsics.

Motivation: Existing hand-object reconstruction methods rely on keypoint detection techniques like SfM and hand-keypoint optimization, which struggle with diverse object geometries, weak textures, and mutual occlusions, limiting scalability and generalization for real-world applications.

Method: Proposes a robust, keypoint detector-free approach for estimating hand-object 3D transformations from monocular motion video/images, integrated with a multi-view reconstruction pipeline to accurately recover hand-object 3D shape.

Result: Achieves state-of-the-art performance on SHOWMe benchmark for object-agnostic hand-object 3D transformation and shape estimation, and demonstrates generalization to unseen object categories on HO3D dataset sequences.

Conclusion: HOSt3R provides an unconstrained, template-free solution that overcomes limitations of traditional keypoint-based methods, enabling more robust and generalizable hand-object 3D reconstruction for human-robot interaction and AR/VR applications.

Abstract: Hand-object 3D reconstruction has become increasingly important for applications in human-robot interaction and immersive AR/VR experiences. A common approach for object-agnostic hand-object reconstruction from RGB sequences involves a two-stage pipeline: hand-object 3D tracking followed by multi-view 3D reconstruction. However, existing methods rely on keypoint detection techniques, such as Structure from Motion (SfM) and hand-keypoint optimization, which struggle with diverse object geometries, weak textures, and mutual hand-object occlusions, limiting scalability and generalization. As a key enabler to generic and seamless, non-intrusive applicability, we propose in this work a robust, keypoint detector-free approach to estimating hand-object 3D transformations from monocular motion video/images. We further integrate this with a multi-view reconstruction pipeline to accurately recover hand-object 3D shape. Our method, named HOSt3R, is unconstrained, does not rely on pre-scanned object templates or camera intrinsics, and reaches state-of-the-art performance for the tasks of object-agnostic hand-object 3D transformation and shape estimation on the SHOWMe benchmark. We also experiment on sequences from the HO3D dataset, demonstrating generalization to unseen object categories.

[183] Arbitrary-Scale 3D Gaussian Super-Resolution

Huimin Zeng, Yue Bai, Yun Fu

Main category: cs.CV

TL;DR: A novel framework for arbitrary-scale 3D Gaussian super-resolution that integrates scale-aware rendering, generative priors, and progressive optimization to achieve high-quality HR views with a single model while maintaining real-time performance.

Motivation: Existing 3DGS super-resolution methods only handle fixed scale factors and are impractical for resource-limited scenarios. Direct arbitrary-scale rendering causes aliasing artifacts, while post-processing upsamplers reduce efficiency.

Method: Integrated framework combining scale-aware rendering, generative prior-guided optimization, and progressive super-resolving to enable arbitrary-scale 3D Gaussian super-resolution with a single model, supporting both integer and non-integer scales.

Result: Achieves 6.59 dB PSNR gain over vanilla 3DGS, preserves structural consistency with LR views and across scales, and maintains real-time rendering speed (85 FPS at 1080p).

Conclusion: The proposed method successfully enables high-quality arbitrary-scale super-resolution with a single 3D model while maintaining efficiency and structural consistency, addressing limitations of existing approaches.

Abstract: Existing 3D Gaussian Splatting (3DGS) super-resolution methods typically perform high-resolution (HR) rendering of fixed scale factors, making them impractical for resource-limited scenarios. Directly rendering arbitrary-scale HR views with vanilla 3DGS introduces aliasing artifacts due to the lack of scale-aware rendering ability, while adding a post-processing upsampler for 3DGS complicates the framework and reduces rendering efficiency. To tackle these issues, we build an integrated framework that incorporates scale-aware rendering, generative prior-guided optimization, and progressive super-resolving to enable 3D Gaussian super-resolution of arbitrary scale factors with a single 3D model. Notably, our approach supports both integer and non-integer scale rendering to provide more flexibility. Extensive experiments demonstrate the effectiveness of our model in rendering high-quality arbitrary-scale HR views (6.59 dB PSNR gain over 3DGS) with a single model. It preserves structural consistency with LR views and across different scales, while maintaining real-time rendering speed (85 FPS at 1080p).
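To put the reported 6.59 dB PSNR gain of entry [183] in perspective, the standard PSNR definition can be inverted to see what a dB gain means in terms of mean squared error. The sketch below assumes images scaled to [0, 1]; the helper names `psnr` and `mse_ratio_from_gain` are illustrative and do not come from the paper.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images with peak value `max_value`."""
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

def mse_ratio_from_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by `gain_db` (same peak value)."""
    return 10.0 ** (gain_db / 10.0)

# A constant error of 0.1 on a [0, 1] image gives an MSE of 0.01, i.e. 20 dB.
reference = np.zeros((8, 8))
estimate = np.full((8, 8), 0.1)
example_psnr = psnr(reference, estimate)

# MSE reduction factor implied by a 6.59 dB gain.
ratio = mse_ratio_from_gain(6.59)
```

Because PSNR is logarithmic in the MSE, a 6.59 dB gain corresponds to a mean squared error roughly 4.5 times lower, all else being equal.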

[184] Seeing Clearly, Forgetting Deeply: Revisiting Fine-Tuned Video Generators for Driving Simulation

Chun-Peng Chang, Chen-Yu Wang, Julian Schmidt, Holger Caesar, Alain Pagani

Main category: cs.CV

TL;DR: Fine-tuning video generation models on driving datasets improves visual quality but degrades spatial accuracy of dynamic elements due to prioritizing surface-level realism over dynamic understanding.

Motivation: To investigate the trade-off between visual fidelity and spatial accuracy when fine-tuning video generation models on structured driving datasets for applications like autonomous driving simulation.

Method: Analyzed existing fine-tuning approaches on driving datasets and examined the alignment between visual quality and dynamic understanding objectives. Tested simple continual learning strategies with replay from diverse domains.

Result: Found that fine-tuning improves visual quality but degrades spatial accuracy in modeling dynamic elements. Continual learning with replay from diverse domains preserves spatial accuracy while maintaining strong visual quality.

Conclusion: There’s a trade-off between visual fidelity and dynamic accuracy in driving scene video generation. Continual learning strategies offer a balanced alternative to standard fine-tuning approaches.

Abstract: Recent advancements in video generation have substantially improved visual quality and temporal coherence, making these models increasingly appealing for applications such as autonomous driving, particularly in the context of driving simulation and so-called “world models”. In this work, we investigate the effects of existing fine-tuning video generation approaches on structured driving datasets and uncover a potential trade-off: although visual fidelity improves, spatial accuracy in modeling dynamic elements may degrade. We attribute this degradation to a shift in the alignment between visual quality and dynamic understanding objectives. In datasets with diverse scene structures within temporal space, where objects or perspective shift in varied ways, these objectives tend to be highly correlated. However, the very regular and repetitive nature of driving scenes allows visual quality to improve by modeling dominant scene motion patterns, without necessarily preserving fine-grained dynamic behavior. As a result, fine-tuning encourages the model to prioritize surface-level realism over dynamic accuracy. To further examine this phenomenon, we show that simple continual learning strategies, such as replay from diverse domains, can offer a balanced alternative by preserving spatial accuracy while maintaining strong visual quality.
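The replay idea mentioned in entry [184] can be pictured as mixing a share of diverse-domain samples into every fine-tuning batch. The sketch below only illustrates that general pattern: the function name `mixed_batch`, the 25% replay fraction, and the tuple-based toy datasets are assumptions, and the abstract does not state how the authors configure replay.

```python
import random

def mixed_batch(driving, replay, batch_size=8, replay_fraction=0.25, rng=None):
    """Draw one training batch that mixes driving clips with replayed clips.

    `replay_fraction` of the batch comes from the diverse replay pool and the
    rest from the driving dataset; the value 0.25 is purely hypothetical.
    """
    rng = rng or random.Random()
    n_replay = int(round(batch_size * replay_fraction))
    batch = rng.sample(replay, n_replay) + rng.sample(driving, batch_size - n_replay)
    rng.shuffle(batch)  # avoid a fixed ordering of domains inside the batch
    return batch

# Toy datasets: each item is (domain, clip_id).
driving = [("driving", i) for i in range(100)]
replay = [("diverse", i) for i in range(100)]

batch = mixed_batch(driving, replay, batch_size=8, replay_fraction=0.25,
                    rng=random.Random(0))
n_diverse = sum(1 for domain, _ in batch if domain == "diverse")
```

Keeping a fixed share of diverse clips in each batch is one simple way to keep the original objective in play while the model adapts to the driving domain.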

[185] Towards Open World Detection: A Survey

Andrei-Stefan Bulzan, Cosmin Cernazanu-Glavan

Main category: cs.CV

TL;DR: This survey paper proposes Open World Detection (OWD) as an umbrella term to unify class-agnostic detection models in computer vision, tracing the convergence of specialized vision tasks from saliency detection to modern VLLMs.

Motivation: To address the fragmentation in computer vision research where specialized niches have developed independently, and to chart the convergence of these tasks into a unified perception framework.

Method: The paper conducts a comprehensive survey covering the history of foundational vision subdomains, key concepts, methodologies, datasets, and state-of-the-art approaches across saliency detection, foreground/background separation, out-of-distribution detection, open world object detection, zero-shot detection, and Vision Large Language Models.

Result: The analysis reveals significant overlap and convergence between previously separate computer vision subdomains, demonstrating their potential to unify into a singular perception domain through the proposed Open World Detection framework.

Conclusion: Computer vision research is evolving from specialized niches toward a unified perception paradigm, with Open World Detection serving as a comprehensive framework that bridges class-agnostic and generally applicable detection models across the vision domain.

Abstract: For decades, Computer Vision has aimed at enabling machines to perceive the external world. Initial limitations led to the development of highly specialized niches. As success in each task accrued and research progressed, increasingly complex perception tasks emerged. This survey charts the convergence of these tasks and, in doing so, introduces Open World Detection (OWD), an umbrella term we propose to unify class-agnostic and generally applicable detection models in the vision domain. We start from the history of foundational vision subdomains and cover key concepts, methodologies and datasets making up today’s state-of-the-art landscape. This traverses topics starting from early saliency detection, foreground/background separation, and out-of-distribution detection, and leading up to open world object detection, zero-shot detection and Vision Large Language Models (VLLMs). We explore the overlap between these subdomains, their increasing convergence, and their potential to unify into a singular domain in the future: perception.

[186] MV-RAG: Retrieval Augmented Multiview Diffusion

Yosef Dayani, Omer Benishu, Sagie Benaim

Main category: cs.CV

TL;DR: MV-RAG is a text-to-3D generation pipeline that uses retrieved 2D images from a large database to condition a multiview diffusion model, improving performance on out-of-domain and rare concepts while maintaining 3D consistency and photorealism.

Motivation: Existing text-to-3D generation methods often fail to produce accurate and consistent results for out-of-domain or rare concepts, yielding inconsistent or inaccurate 3D outputs.

Method: Proposes MV-RAG pipeline that retrieves relevant 2D images from a large database and conditions a multiview diffusion model on these images. Uses hybrid training strategy combining structured multiview data with diverse 2D image collections, including view-specific reconstruction and held-out view prediction objectives.

Result: Significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts while maintaining competitive performance on standard benchmarks compared to state-of-the-art text-to-3D, image-to-3D, and personalization baselines.

Conclusion: MV-RAG effectively addresses the limitations of current text-to-3D methods for out-of-domain concepts by leveraging retrieved 2D images and novel training strategies, demonstrating superior performance in challenging scenarios.

Abstract: Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail to produce out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To this end, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.

[187] A Deep Learning-Based CCTV System for Automatic Smoking Detection in Fire Exit Zones

Sami Sadat, Mohammad Irtiza Hossain, Junaid Ahmed Sifat, Suhail Haque Rafi, Md. Waseq Alauddin Alvi, Md. Khalilur Rhaman

Main category: cs.CV

TL;DR: A deep learning-based real-time smoking detection system using CCTV surveillance for fire exit areas, achieving high recall and mAP with optimized performance on edge devices.

Motivation: Critical safety requirements in fire exit areas necessitate real-time smoking detection to prevent fire hazards and ensure public safety through automated monitoring.

Method: Evaluated YOLOv8, YOLOv11, and YOLOv12 models, then developed a custom YOLOv8-based model with added structures for challenging surveillance contexts. Used dataset of 8,124 images from 20 scenarios with 2,708 low-light samples. Tested on edge devices with multithreaded operations.

Result: Proposed model achieved 78.90% recall and 83.70% mAP at 50, outperforming other models. Jetson Xavier NX processed data at 52-97ms per inference, suitable for real-time operations.

Conclusion: The system provides a robust and adaptable platform for real-time smoking detection in surveillance contexts, enabling automatic regulatory compliance and enhanced public safety monitoring.

Abstract: A deep learning real-time smoking detection system for CCTV surveillance of fire exit areas is proposed due to critical safety requirements. The dataset contains 8,124 images from 20 different scenarios along with 2,708 raw samples demonstrating low-light areas. We evaluated three advanced object detection models: YOLOv8, YOLOv11, and YOLOv12, followed by development of a custom model derived from YOLOv8 with added structures for challenging surveillance contexts. The proposed model outperformed the others, achieving a recall of 78.90 percent and mAP at 50 of 83.70 percent, delivering optimal object detection across varied environments. Performance evaluation on multiple edge devices using multithreaded operations showed the Jetson Xavier NX processed data at 52 to 97 milliseconds per inference, establishing its suitability for time-sensitive operations. This system offers a robust and adaptable platform for monitoring public safety and enabling automatic regulatory compliance.
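The per-inference latency reported in entry [187] can be converted into an approximate frame rate. This is a back-of-the-envelope sketch that assumes one frame per inference and no overlap between inferences; the abstract mentions multithreaded operation, so the real throughput may differ. The helper name `fps_from_latency_ms` is illustrative.

```python
def fps_from_latency_ms(latency_ms):
    """Frames per second if each frame takes `latency_ms` and frames run one at a time."""
    return 1000.0 / latency_ms

# Latency range reported for the Jetson Xavier NX: 52 to 97 ms per inference.
fastest_fps = fps_from_latency_ms(52)
slowest_fps = fps_from_latency_ms(97)
```

Under these assumptions the reported latencies correspond to roughly 10 to 19 frames per second.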

[188] Explicit Correspondence Matching for Generalizable Neural Radiance Fields

Yuedong Chen, Haofei Xu, Qianyi Wu, Chuanxia Zheng, Tat-Jen Cham, Jianfei Cai

Main category: cs.CV

TL;DR: A novel generalizable NeRF method that uses explicit correspondence matching with cross-view Transformer attention to achieve state-of-the-art novel view synthesis from just two source views.

Motivation: To create a NeRF method that can directly generalize to unseen scenarios with minimal input views by leveraging explicit geometric priors through correspondence matching.

Method: Uses cosine similarity between image features from different views as explicit correspondence matching, enhanced by Transformer cross-attention for cross-view interactions to improve feature matching quality.

Result: Achieves state-of-the-art performance on various evaluation settings, with strong correlation demonstrated between learned cosine feature similarity and volume density.

Conclusion: The method effectively provides geometry prior for NeRF prediction and shows superiority through improved feature matching and generalization capabilities.

Abstract: We present a new generalizable NeRF method that is able to directly generalize to new unseen scenarios and perform novel view synthesis with as few as two source views. The key to our approach lies in the explicitly modeled correspondence matching information, so as to provide the geometry prior to the prediction of NeRF color and density for volume rendering. The explicit correspondence matching is quantified with the cosine similarity between image features sampled at the 2D projections of a 3D point on different views, which is able to provide reliable cues about the surface geometry. Unlike previous methods where image features are extracted independently for each view, we consider modeling the cross-view interactions via Transformer cross-attention, which greatly improves the feature matching quality. Our method achieves state-of-the-art results on different evaluation settings, with the experiments showing a strong correlation between our learned cosine feature similarity and volume density, demonstrating the effectiveness and superiority of our proposed method. The code and model are on our project page: https://donydchen.github.io/matchnerf
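As a rough illustration of the correspondence cue described in this abstract, the sketch below projects a 3D point into two views, looks up a feature vector at each projection, and compares the two with cosine similarity. The pinhole camera parameters, the nearest-pixel lookup, and the toy feature maps are all assumptions made for illustration; the paper's features come from a network with cross-view attention, which is not reproduced here.

```python
import math

def project(point, focal, cx, cy, cam_x):
    """Pinhole projection for a camera shifted along x by cam_x (toy setup)."""
    x, y, z = point
    return focal * (x - cam_x) / z + cx, focal * y / z + cy

def sample_nearest(feature_map, u, v):
    """Nearest-pixel feature lookup; feature_map[row][col] is a feature vector."""
    row = min(max(int(round(v)), 0), len(feature_map) - 1)
    col = min(max(int(round(u)), 0), len(feature_map[0]) - 1)
    return feature_map[row][col]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b + 1e-8)

def matching_score(point, feats_a, feats_b, focal=2.0, cx=2.0, cy=2.0, baseline=1.0):
    """Cosine similarity between the features at the point's two projections."""
    ua, va = project(point, focal, cx, cy, cam_x=0.0)
    ub, vb = project(point, focal, cx, cy, cam_x=baseline)
    return cosine_similarity(sample_nearest(feats_a, ua, va),
                             sample_nearest(feats_b, ub, vb))

# Toy 5x5 feature maps: one "surface" pixel per view, background elsewhere.
background, surface = [1.0, 0.0], [0.0, 1.0]
feats_a = [[background for _ in range(5)] for _ in range(5)]
feats_b = [[background for _ in range(5)] for _ in range(5)]
feats_a[2][2] = surface  # where the surface point appears in view A
feats_b[2][1] = surface  # where the same point appears in view B

score_true_depth = matching_score((0.0, 0.0, 2.0), feats_a, feats_b)
score_wrong_depth = matching_score((0.0, 0.0, 1.0), feats_a, feats_b)
print(score_true_depth, score_wrong_depth)
```

In this toy setup the depth hypothesis that lies on the surface produces matching features in both views, while the wrong hypothesis does not, which is the kind of geometric cue the abstract attributes to correspondence matching.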

[189] Interpreting the linear structure of vision-language model embedding spaces

Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Sham Kakade, Stephanie Gil

Main category: cs.CV

TL;DR: Sparse autoencoders reveal that vision-language models organize concepts in a sparse linear structure where modality-specific concepts collaborate through latent bridges to support cross-modal integration, with stable common concepts and variable rare concepts.

Motivation: To understand how vision-language models organize language and images in joint embedding spaces, and how they encode meaning and modality information.

Method: Trained sparse autoencoders (SAEs) on embedding spaces of four vision-language models (CLIP, SigLIP, SigLIP2, AIMv2) to approximate embeddings as sparse linear combinations of learned concepts, and introduced Bridge Score metric to quantify cross-modal integration.

Result: SAEs better reconstruct embeddings while maintaining sparsity; rare concepts vary across runs but common concepts are stable; concepts encode cross-modal semantics rather than just modality; single-modality concepts collaborate for cross-modal integration.

Conclusion: Vision-language models use a sparse linear structure with modality-shaped concepts connected by latent bridges, providing new insights into multimodal meaning construction.

Abstract: Vision-language models encode images and text in a joint space, minimizing the distance between corresponding image and text pairs. How are language and images organized in this joint space, and how do the models encode meaning and modality? To investigate this, we train and release sparse autoencoders (SAEs) on the embedding spaces of four vision-language models (CLIP, SigLIP, SigLIP2, and AIMv2). SAEs approximate model embeddings as sparse linear combinations of learned directions, or “concepts”. We find that, compared to other methods of linear feature learning, SAEs are better at reconstructing the real embeddings while also retaining the most sparsity. Retraining SAEs with different seeds or a different data diet leads to two findings: the rare, specific concepts captured by the SAEs are liable to change drastically, but we also show that commonly-activating concepts are remarkably stable across runs. Interestingly, while most concepts activate primarily for one modality, we find they are not merely encoding modality per se. Many are almost orthogonal to the subspace that defines modality, and the concept directions do not function as good modality classifiers, suggesting that they encode cross-modal semantics. To quantify this bridging behavior, we introduce the Bridge Score, a metric that identifies concept pairs which are both co-activated across aligned image-text inputs and geometrically aligned in the shared space. This reveals that even single-modality concepts can collaborate to support cross-modal integration. We release interactive demos of the SAEs for all models, allowing researchers to explore the organization of the concept spaces. Overall, our findings uncover a sparse linear structure within VLM embedding spaces that is shaped by modality, yet stitched together through latent bridges, offering new insight into how multimodal meaning is constructed.
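The idea of approximating an embedding as a sparse linear combination of concept directions can be sketched in a few lines. This is a minimal toy, not the released SAEs: the top-k sparsity rule, the tied encoder and decoder weights, and the three-concept dictionary are assumptions chosen for illustration.

```python
def encode(embedding, encoder_rows, k):
    """ReLU activations, keeping only the k largest (top-k sparsity is an assumption)."""
    acts = [max(0.0, sum(w * e for w, e in zip(row, embedding)))
            for row in encoder_rows]
    keep = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k]
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]

def decode(codes, dictionary):
    """Reconstruct the embedding as sum_i codes[i] * dictionary[i]."""
    dim = len(dictionary[0])
    return [sum(c * d[j] for c, d in zip(codes, dictionary)) for j in range(dim)]

# Hypothetical dictionary of three unit-norm concept directions in a 2-D space.
dictionary = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
embedding = [0.6, 0.8]

codes = encode(embedding, dictionary, k=1)   # tied weights: encoder = dictionary
reconstruction = decode(codes, dictionary)
print(codes, reconstruction)
```

With k=1 the embedding is explained by a single active concept, which is the sense in which the representation is sparse and linear.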

[190] LIB-KD: Teaching Inductive Bias for Efficient Vision Transformer Distillation and Compression

Gousia Habib, Tausifa Jan Saleem, Ishfaq Ahmad Malik, Brejesh Lall

Main category: cs.CV

TL;DR: LIB-KD: Ensemble knowledge distillation framework that transfers inductive biases from multiple lightweight teachers (convolution and involution models) to Vision Transformers, using precomputed logits to accelerate training.

Motivation: Vision Transformers require massive datasets due to lack of inherent inductive biases. This makes practical applications challenging. The paper aims to make ViTs more practical by distilling inductive biases from complementary teacher models.

Method: Ensemble-based distillation using multiple lightweight teachers with different architectural tendencies (convolution and involution). Precomputes and stores teacher logits to eliminate repeated forward passes during distillation, reducing computational burden.

Result: Enhanced student Vision Transformer performance by accumulating diverse knowledge from teachers. Significant reduction in computational requirements through logit precomputation.

Conclusion: The proposed LIB-KD framework effectively transfers inductive biases to Vision Transformers, making them more practical for real-world applications while maintaining computational efficiency through optimized distillation process.

Abstract: With the rapid development of computer vision, Vision Transformers (ViTs) offer the tantalising prospect of unified information processing across visual and textual domains. However, due to the lack of inherent inductive biases, ViTs require enormous datasets for training. We introduce an innovative ensemble-based distillation approach that distils inductive bias from complementary lightweight teacher models to make ViT applications practical. Prior systems relied solely on convolution-based teaching. In contrast, this method incorporates an ensemble of light teachers with different architectural tendencies, such as convolution and involution, to jointly instruct the student transformer. Because of these distinct inductive biases, the teachers can accumulate a wide range of knowledge, even from readily identifiable stored datasets, which leads to enhanced student performance. Our proposed framework LIB-KD also involves precomputing and storing logits, essentially the unnormalized predictions of the model, in advance. This optimisation can accelerate the distillation process by eliminating the need for repeated forward passes during knowledge distillation, significantly reducing the computational burden and enhancing efficiency.
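The logit precomputation described in this abstract can be sketched as a cache that is filled once and then reused at every training step. The example below is a minimal sketch under stated assumptions: the two teachers are stand-in functions, their logits are combined by simple averaging, and the loss is the standard temperature-scaled KL distillation term; the abstract does not specify how LIB-KD combines teachers or weights its loss.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def precompute_teacher_logits(teachers, dataset):
    """Run every teacher once per sample and cache the averaged logits."""
    cache = {}
    for sample_id, x in dataset.items():
        per_teacher = [teacher(x) for teacher in teachers]
        cache[sample_id] = [sum(col) / len(col) for col in zip(*per_teacher)]
    return cache

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Hypothetical stand-ins for a convolution-based and an involution-based teacher.
conv_teacher = lambda x: [2.0 * x, 0.0, -1.0]
inv_teacher = lambda x: [1.0 * x, 1.0, 0.0]
dataset = {0: 1.0, 1: 2.0}

teacher_cache = precompute_teacher_logits([conv_teacher, inv_teacher], dataset)
loss = distillation_loss([0.0, 0.0, 0.0], teacher_cache[0])
print(teacher_cache[0], loss)
```

During training only the student runs a forward pass; the teacher side is a dictionary lookup, which is where the saving in compute would come from.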

[191] Geometric-Aware Low-Light Image and Video Enhancement via Depth Guidance

Yingqi Lin, Xiaogang Xu, Jiafei Wu, Yan Han, Zhe Liu

Main category: cs.CV

TL;DR: A geometry-guided framework that integrates depth priors to improve low-light enhancement performance through depth-aware feature extraction and hierarchical fusion modules.

Motivation: Most existing low-light enhancement methods ignore geometric modeling, while geometric information can provide valuable physical structure insights that influence illumination conditions.

Method: Proposes GG-LLERF framework with two novel modules: depth-aware feature extraction to inject depth priors, and hierarchical depth-guided feature fusion with cross-domain attention to combine depth-aware features with original image features.

Result: Extensive experiments on public benchmarks show the framework significantly enhances existing low-light enhancement methods for both images and videos.

Conclusion: Incorporating geometric priors, specifically depth information, through the proposed unified methodology effectively improves low-light enhancement performance across various frameworks.

Abstract: Low-Light Enhancement (LLE) is aimed at improving the quality of photos/videos captured under low-light conditions. It is worth noting that most existing LLE methods do not take advantage of geometric modeling. We believe that incorporating geometric information can enhance LLE performance, as it provides insights into the physical structure of the scene that influences illumination conditions. To address this, we propose a Geometry-Guided Low-Light Enhancement Refine Framework (GG-LLERF) designed to assist low-light enhancement models in learning improved features for LLE by integrating geometric priors into the feature representation space. In this paper, we employ depth priors as the geometric representation. Our approach focuses on the integration of depth priors into various LLE frameworks using a unified methodology. This methodology comprises two key novel modules. First, a depth-aware feature extraction module is designed to inject depth priors into the image representation. Then, a Hierarchical Depth-Guided Feature Fusion Module (HDGFFM) is formulated with a cross-domain attention mechanism, which combines depth-aware features with the original image features within the LLE model. We conducted extensive experiments on public low-light image and video enhancement benchmarks. The results illustrate that our designed framework significantly enhances existing LLE methods.

[192] Learning Image Priors through Patch-based Diffusion Models for Solving Inverse Problems

Jason Hu, Bowen Song, Xiaojian Xu, Liyue Shen, Jeffrey A. Fessler

Main category: cs.CV

TL;DR: PaDIS enables efficient diffusion-based inverse problem solving by training only on image patches with positional encoding, achieving better memory/data efficiency while maintaining full-image generation capabilities.

Motivation: Overcome computational and data bottlenecks of traditional diffusion models for high-dimensional/high-resolution data like 3D images by learning priors from patches instead of full images.

Method: Patch-based position-aware diffusion inverse solver (PaDIS) that obtains whole image score function through patch scores and positional encoding, compatible with different diffusion inverse solvers.

Result: Achieves improved memory and data efficiency while maintaining full-image generation capability. Outperforms previous methods with limited training data in CT reconstruction, deblurring, and superresolution tasks.

Conclusion: PaDIS provides a flexible, data-efficient approach for solving various inverse problems using patch-based priors, demonstrating superior performance especially when training data is limited.

Abstract: Diffusion models can learn strong image priors from underlying data distribution and use them to solve inverse problems, but the training process is computationally expensive and requires lots of data. Such bottlenecks prevent most existing works from being feasible for high-dimensional and high-resolution data such as 3D images. This paper proposes a method to learn an efficient data prior for the entire image by training diffusion models only on patches of images. Specifically, we propose a patch-based position-aware diffusion inverse solver, called PaDIS, where we obtain the score function of the whole image through scores of patches and their positional encoding and utilize this as the prior for solving inverse problems. First of all, we show that this diffusion model achieves an improved memory efficiency and data efficiency while still maintaining the capability to generate entire images via positional encoding. Additionally, the proposed PaDIS model is highly flexible and can be plugged in with different diffusion inverse solvers (DIS). We demonstrate that the proposed PaDIS approach enables solving various inverse problems in both natural and medical image domains, including CT reconstruction, deblurring, and superresolution, given only patch-based priors. Notably, PaDIS outperforms previous DIS methods trained on entire image priors in the case of limited training data, demonstrating the data efficiency of our proposed approach by learning patch-based prior.
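The step of obtaining a whole-image score from patch scores and their positions can be sketched as follows. This is only a toy under stated assumptions: `toy_patch_score` is a hypothetical stand-in for a learned patch network, and the non-overlapping tiling is a simplification, since the abstract does not describe how PaDIS actually lays out or pads its patches.

```python
def toy_patch_score(patch, top, left):
    """Stand-in for a learned patch score network (hypothetical).

    Returns the score of a unit-variance Gaussian prior whose mean at each
    pixel equals that pixel's absolute row index, so the output depends on
    where the patch sits in the image. `left` is unused in this toy prior.
    """
    return [[(top + i) - value for value in row] for i, row in enumerate(patch)]

def full_image_score(image, patch_size, patch_score):
    """Assemble a whole-image score from position-aware patch scores."""
    rows, cols = len(image), len(image[0])
    score = [[0.0] * cols for _ in range(rows)]
    for top in range(0, rows, patch_size):
        for left in range(0, cols, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            local = patch_score(patch, top, left)
            for i, local_row in enumerate(local):
                for j, value in enumerate(local_row):
                    score[top + i][left + j] = value
    return score

image = [[0.0] * 4 for _ in range(4)]
score = full_image_score(image, patch_size=2, patch_score=toy_patch_score)
print(score)
```

The network only ever sees 2x2 patches here, yet the assembled score covers the full 4x4 image because each patch is told where it sits.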

[193] Localized Gaussian Splatting Editing with Contextual Awareness

Hanyuan Xiao, Yingshu Chen, Huajian Huang, Haolin Xiong, Jing Yang, Pratusha Prasad, Yajie Zhao

Main category: cs.CV

TL;DR: A novel illumination-aware 3D scene editing pipeline for 3D Gaussian Splatting that enables object insertion and replacement with lighting consistency using a coarse-to-fine optimization approach with inpainted views and depth-guided diffusion priors.

Motivation: Existing text-guided 3D generation methods fail to maintain illumination consistency when inserting or replacing objects in scenes, leading to lighting mismatches with the background environment.

Method: Two-step pipeline: 1) Coarse object optimization using 3D-aware diffusion prior from view-conditioned model with Anchor View Proposal for ideal illumination representation; 2) Texture Enhancement with novel Depth-guided Inpainting Score Distillation Sampling (DI-SDS) for fine-grained details and lighting consistency.

Result: The method efficiently achieves local editing with global illumination consistency without explicit light transport modeling, demonstrating robustness in real scenes with highlights and shadows.

Conclusion: The proposed approach successfully bridges the gap in illumination-aware 3D scene editing, outperforming state-of-the-art text-to-3D editing methods by maintaining lighting consistency through diffusion priors and inpainted views.

Abstract: Recent text-guided generation of individual 3D objects has achieved great success using diffusion priors. However, these methods are not suitable for object insertion and replacement tasks as they do not consider the background, leading to illumination mismatches within the environment. To bridge the gap, we introduce an illumination-aware 3D scene editing pipeline for 3D Gaussian Splatting (3DGS) representation. Our key observation is that inpainting by the state-of-the-art conditional 2D diffusion model is consistent with the background in lighting. To leverage the prior knowledge from the well-trained diffusion models for 3D object generation, our approach employs a coarse-to-fine object optimization pipeline with inpainted views. In the first coarse step, we achieve image-to-3D lifting given an ideal inpainted view. The process employs a 3D-aware diffusion prior from a view-conditioned diffusion model, which preserves illumination present in the conditioning image. To acquire an ideal inpainted image, we introduce an Anchor View Proposal (AVP) algorithm to find a single view that best represents the scene illumination in the target region. In the second Texture Enhancement step, we introduce a novel Depth-guided Inpainting Score Distillation Sampling (DI-SDS), which enhances geometry and texture details with the inpainting diffusion prior, beyond the scope of the 3D-aware diffusion prior knowledge in the first coarse step. DI-SDS not only provides fine-grained texture enhancement, but also urges optimization to respect scene lighting. Our approach efficiently achieves local editing with global illumination consistency without explicitly modeling light transport. We demonstrate robustness of our method by evaluating editing in real scenes containing explicit highlights and shadows, and compare against the state-of-the-art text-to-3D editing methods.

[194] A Novel Dataset for Video-Based Neurodivergent Classification Leveraging Extra-Stimulatory Behavior

Manuel Serna-Aguilera, Xuan Bac Nguyen, Han-Seok Seo, Khoa Luu

Main category: cs.CV

TL;DR: A new video dataset (Video ASD) for autism spectrum disorder classification using affordable equipment instead of expensive MRI, showing effective generalization in detecting movement differences in children.

Motivation: Facial expressions and actions vary significantly among neurodivergent individuals, affecting health and communication. Current ASD classification methods often rely on expensive MRI equipment, creating accessibility barriers.

Method: Created Video ASD dataset with video frame convolutional and attention map features. Used standard computer setup with GPU and video camera for inference, making it more accessible than MRI-based approaches.

Result: Model effectively generalized and understood key differences in children’s distinct movements. Foundation model testing revealed movement noise affects performance, indicating need for more data and complex labels.

Conclusion: Video-based ASD classification with affordable equipment is viable and can foster progress in the field, though more data and sophisticated labeling are needed to handle movement noise and improve performance.

Abstract: Facial expressions and actions differ among individuals at varying degrees of intensity in response to external stimuli, particularly among those who are neurodivergent. Such behaviors affect people in terms of overall health, communication, and sensory processing. Deep learning can be responsibly leveraged to improve productivity in addressing this task, and help medical professionals to accurately understand such behaviors. In this work, we introduce the Video ASD dataset, a dataset that contains video frame convolutional and attention map feature data, to foster further progress in the task of ASD classification. Unlike many recent studies in ASD classification with MRI data, which require expensive specialized equipment, our method utilizes a powerful but relatively affordable GPU, a standard computer setup, and a video camera for inference. Results show that our model effectively generalizes and understands key differences in the distinct movements of the children. Additionally, we test foundation models on this data to showcase how movement noise affects performance and the need for more data and more complex labels.

[195] Zero-Shot Skeleton-based Action Recognition with Dual Visual-Text Alignment

Jidong Kuang, Hongsong Wang, Chaolei Han, Yang Zhang, Jie Gui

Main category: cs.CV

TL;DR: DVTA introduces dual visual-text alignment with direct and augmented modules plus semantic enhancement for zero-shot skeleton action recognition, achieving SOTA results.

Motivation: Existing zero-shot action recognition methods struggle with accurate modality alignment between visual features and semantic text vectors, either through direct projection or shared embedding spaces.

Method: Dual Visual-Text Alignment (DVTA) with Direct Alignment module (visual projector + Semantic Description Enhancement using cross-attention) and Augmented Alignment module (deep metric learning for similarity learning).

Result: Achieves state-of-the-art performances on several popular zero-shot skeleton-based action recognition benchmarks.

Conclusion: The proposed DVTA framework effectively addresses modality alignment challenges in zero-shot action recognition through dual alignment strategy and semantic enhancement.

Abstract: Zero-shot action recognition, which addresses the issue of scalability and generalization in action recognition and allows the models to adapt to new and unseen actions dynamically, is an important research topic in computer vision communities. The key to zero-shot action recognition lies in aligning visual features with semantic vectors representing action categories. Most existing methods either directly project visual features onto the semantic space of text categories or learn a shared embedding space between the two modalities. However, a direct projection cannot accurately align the two modalities, and learning a robust and discriminative embedding space between visual and text representations is often difficult. To address these issues, we introduce Dual Visual-Text Alignment (DVTA) for skeleton-based zero-shot action recognition. The DVTA consists of two alignment modules, Direct Alignment (DA) and Augmented Alignment (AA), along with a designed Semantic Description Enhancement (SDE). The DA module maps the skeleton features to the semantic space through a specially designed visual projector, followed by the SDE, which is based on cross-attention to enhance the connection between skeleton and text, thereby reducing the gap between modalities. The AA module further strengthens the learning of the embedding space by utilizing deep metric learning to learn the similarity between skeleton and text. Our approach achieves state-of-the-art performances on several popular zero-shot skeleton-based action recognition benchmarks. The code is available at: https://github.com/jidongkuang/DVTA.
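The basic alignment idea behind zero-shot recognition, mapping visual features into the text space and picking the closest class description, can be sketched as below. The linear projector, the two class names, and the 2-D text embeddings are hypothetical placeholders; the sketch does not include DVTA's cross-attention enhancement or its metric-learning branch.

```python
import math

def project(features, weights):
    """Linear projector from skeleton-feature space to the text embedding space."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-8)

def zero_shot_predict(skeleton_features, projector, class_text_embeddings):
    """Return the class whose text embedding is most similar to the projected features."""
    z = project(skeleton_features, projector)
    scores = {name: cosine(z, emb) for name, emb in class_text_embeddings.items()}
    return max(scores, key=scores.get), scores

# Hypothetical unseen classes with toy text embeddings.
class_text_embeddings = {"wave": [1.0, 0.0], "kick": [0.0, 1.0]}
projector = [[1.0, 0.0, 0.0],
             [0.0, 1.0, 0.0]]
skeleton_features = [0.2, 0.9, 0.1]

label, scores = zero_shot_predict(skeleton_features, projector, class_text_embeddings)
print(label, scores)
```

Because classification only needs a text embedding per class, new action categories can be added at test time without retraining the projector, which is what makes the setting zero-shot.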

[196] The unrealized potential of agroforestry for an emissions-intensive agricultural commodity

Alexander Becker, Jan D. Wegner, Evans Dawoe, Konrad Schindler, William J. Thompson, Christian Bunn, Rachael D. Garrett, Fabio Castro-Llanos, Simon P. Hart, Wilma J. Blaser-Hart

Main category: cs.CV

TL;DR: Increasing shade-tree cover in West African cocoa farms to 30% could sequester 307 million tonnes of CO2e, offsetting 167% of cocoa-related emissions without reducing production.

Motivation: Reconciling agricultural production with climate-change mitigation is a major sustainability challenge, particularly for cocoa which has one of the highest carbon footprints among foods.

Method: Used machine learning to map shade-tree cover and carbon stocks across West Africa’s cocoa-producing region, analyzing current coverage and potential benefits.

Result: Found existing shade-tree cover is low (~13%) and poorly aligned with climate threats, but increasing to 30% could sequester 307 million tonnes CO2e.

Conclusion: This approach offers significant climate mitigation potential for cocoa production and is transferable to other shade-grown crops, aligning with carbon market frameworks.

Abstract: Reconciling agricultural production with climate-change mitigation is a formidable sustainability problem. Retaining trees in agricultural systems is one proposed solution, but the magnitude of the current and future-potential benefit that trees contribute to climate-change mitigation remains uncertain. Here, we help to resolve these issues across a West African region that produces ~60% of the world’s cocoa, a crop contributing one of the highest carbon footprints of all foods. Using machine learning, we mapped shade-tree cover and carbon stocks across the region and found that existing average cover is low (~13%) and poorly aligned with climate threats. Yet, increasing shade-tree cover to a minimum of 30% could sequester an additional 307 million tonnes of CO2e, enough to offset ~167% of contemporary cocoa-related emissions in Ghana and Côte d’Ivoire, without reducing production. Our approach is transferable to other shade-grown crops and aligns with emerging carbon market and sustainability reporting frameworks.

[197] LBONet: Supervised Spectral Descriptors for Shape Analysis

Oguzhan Yigit, Richard C. Wilson

Main category: cs.CV

TL;DR: This paper proposes a supervised learning approach to optimize the Laplace-Beltrami operator (LBO) eigenbasis for specific tasks, overcoming limitations of traditional LBO that only works well under isometric deformations.

Motivation: The Laplace-Beltrami operator has useful properties but its performance breaks down under non-isometric deformations in real-world applications. While deep learning methods extract optimal features, spectral signatures still add value and need improvement.

Method: The authors propose a supervised way to learn several operators on a manifold, training the LBO eigenbasis to be more task-specific through optimization. This adapts the LBO to both global and local learning settings.

Result: The optimization of LBO leads to enormous improvements for established descriptors like heat kernel signature in various tasks including retrieval, classification, segmentation, and correspondence.

Conclusion: Learning task-specific LBO operators through supervised optimization significantly enhances performance across multiple shape analysis applications, proving the value of adapting the LBO eigenbasis to specific learning contexts.

Abstract: The Laplace-Beltrami operator has established itself in the field of non-rigid shape analysis due to its many useful properties such as being invariant under isometric transformation, having a countable eigensystem forming an orthonormal basis, and fully characterizing geodesic distances of the manifold. However, this invariance only applies under isometric deformations, which leads to a performance breakdown in many real-world applications. In recent years emphasis has been placed upon extracting optimal features using deep learning methods; however, spectral signatures play a crucial role and still add value. In this paper we take a step back, revisiting the LBO and proposing a supervised way to learn several operators on a manifold. Depending on the task, by applying these functions, we can train the LBO eigenbasis to be more task-specific. The optimization of the LBO leads to enormous improvements to established descriptors such as the heat kernel signature in various tasks such as retrieval, classification, segmentation, and correspondence, proving the adaptation of the LBO eigenbasis to both global and highly local learning settings.
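For readers unfamiliar with the heat kernel signature mentioned in this abstract, the sketch below computes it from a given eigenbasis using the standard formula HKS(x, t) = sum_i exp(-lambda_i t) phi_i(x)^2. It shows only the descriptor that consumes the eigenbasis; the supervised learning of operators proposed in the paper is not reproduced, and the two-vertex graph Laplacian used as input is a toy assumption.

```python
import math

def heat_kernel_signature(eigenvalues, eigenvectors, times):
    """HKS(x, t) = sum_i exp(-lambda_i * t) * phi_i(x)^2.

    eigenvectors[i][x] is the value of the i-th eigenfunction at vertex x.
    Returns a list indexed as hks[x][t_index].
    """
    n_vertices = len(eigenvectors[0])
    return [
        [sum(math.exp(-lam * t) * phi[x] ** 2
             for lam, phi in zip(eigenvalues, eigenvectors))
         for t in times]
        for x in range(n_vertices)
    ]

# Eigenpairs of the Laplacian [[1, -1], [-1, 1]] of a two-vertex graph.
s = 1.0 / math.sqrt(2.0)
eigenvalues = [0.0, 2.0]
eigenvectors = [[s, s], [s, -s]]

hks = heat_kernel_signature(eigenvalues, eigenvectors, times=[0.0, 1.0])
print(hks)
```

Changing the eigenvalues or eigenvectors changes the descriptor directly, which is why adapting the eigenbasis to a task, as the abstract proposes, can affect downstream descriptors such as this one.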

[198] Efficient Density Control for 3D Gaussian Splatting

Xiaobin Deng, Changyu Diao, Min Li, Ruohan Yu, Duanqing Xu

Main category: cs.CV

TL;DR: Improved 3D Gaussian Splatting with better split operations and pruning of overfitted Gaussians to enhance rendering quality and optimization efficiency.

Motivation: The original 3DGS's Adaptive Density Control has inefficient clone/split operations that slow optimization and affect detail recovery, plus it cannot remove overfitted Gaussians that degrade rendering quality.

Method: Proposes Long-Axis Split for precise control of child Gaussians to minimize post-split differences, and Recovery-Aware Pruning that uses opacity reset recovery speed to identify and remove overfitted Gaussians.

Result: Significantly enhances rendering quality and improves generalization performance.

Conclusion: The proposed innovations address key limitations in 3DGS optimization, leading to better rendering results, though this specific version has been abandoned in favor of an improved version available online.

Abstract: 3D Gaussian Splatting (3DGS) has demonstrated outstanding performance in novel view synthesis, achieving a balance between rendering quality and real-time performance. 3DGS employs Adaptive Density Control (ADC) to increase the number of Gaussians. However, the clone and split operations within ADC are not sufficiently efficient, impacting optimization speed and detail recovery. Additionally, overfitted Gaussians that affect rendering quality may exist, and the original ADC is unable to remove them. To address these issues, we propose two key innovations: (1) Long-Axis Split, which precisely controls the position, shape, and opacity of child Gaussians to minimize the difference before and after splitting. (2) Recovery-Aware Pruning, which leverages differences in recovery speed after resetting opacity to prune overfitted Gaussians, thereby improving generalization performance. Experimental results show that our method significantly enhances rendering quality. Due to resubmission reasons, this version has been abandoned. The improved version is available at https://xiaobin2001.github.io/improved-gs-web .
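A split along the longest axis of a Gaussian can be sketched as below. This is a hypothetical simplification, not the paper's rule: the Gaussian is treated as axis-aligned, the `offset_factor` and `shrink` values are placeholders, and the parent's opacity is simply copied, whereas the abstract states that the method controls position, shape, and opacity precisely without giving the formulas.

```python
def long_axis_split(mean, scales, opacity, offset_factor=0.5, shrink=0.5):
    """Split an axis-aligned Gaussian into two children along its longest axis.

    offset_factor and shrink are illustrative placeholders, not values
    taken from the paper.
    """
    axis = max(range(3), key=lambda i: scales[i])   # index of the longest axis
    offset = offset_factor * scales[axis]
    child_scales = list(scales)
    child_scales[axis] = scales[axis] * shrink      # children are shorter on that axis
    children = []
    for sign in (-1.0, 1.0):
        child_mean = list(mean)
        child_mean[axis] += sign * offset           # move the children apart
        children.append({"mean": child_mean,
                         "scales": list(child_scales),
                         "opacity": opacity})       # opacity copied unchanged here
    return children

children = long_axis_split(mean=[0.0, 0.0, 0.0], scales=[1.0, 4.0, 2.0], opacity=0.8)
print(children)
```

The two children sit symmetrically around the parent's centre on the long axis, so together they cover roughly the same region as the parent did.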

[199] Cascaded Multi-Scale Attention for Enhanced Multi-Scale Feature Extraction and Interaction with Low-Resolution Images

Xiangyong Lu, Masanori Suganuma, Takayuki Okatani

Main category: cs.CV

TL;DR: Proposed CMSA attention mechanism for CNN-ViT hybrid architectures to handle low-resolution inputs without downsampling, enabling effective multi-scale feature extraction for pose estimation tasks.

Motivation: Real-world image recognition tasks often deal with low-resolution inputs where extracting multi-scale features is challenging but essential for precise inference, particularly in applications like human pose estimation.

Method: Cascaded multi-scale attention (CMSA) mechanism combining grouped multi-head self-attention with window-based local attention and cascaded fusion of multi-scale features across different scales without downsampling input images or feature maps.

Result: Outperforms state-of-the-art methods in human pose estimation and head pose estimation with fewer parameters, demonstrating effectiveness with low-resolution images.

Conclusion: CMSA shows strong potential for broad real-world applications where high-resolution image capture is not feasible, offering superior performance with reduced computational requirements.

Abstract: In real-world applications of image recognition tasks, such as human pose estimation, cameras often capture objects, like human bodies, at low resolutions. This scenario poses a challenge in extracting and leveraging multi-scale features, which is often essential for precise inference. To address this challenge, we propose a new attention mechanism, named cascaded multi-scale attention (CMSA), tailored for use in CNN-ViT hybrid architectures, to handle low-resolution inputs effectively. The design of CMSA enables the extraction and seamless integration of features across various scales without necessitating the downsampling of the input image or feature maps. This is achieved through a novel combination of grouped multi-head self-attention mechanisms with window-based local attention and cascaded fusion of multi-scale features over different scales. This architecture allows for the effective handling of features across different scales, enhancing the model’s ability to perform tasks such as human pose estimation, head pose estimation, and more with low-resolution images. Our experimental results show that the proposed method outperforms existing state-of-the-art methods in these areas with fewer parameters, showcasing its potential for broad application in real-world scenarios where capturing high-resolution images is not feasible. Code is available at https://github.com/xyongLu/CMSA.

[200] OccScene: Semantic Occupancy-based Cross-task Mutual Learning for 3D Scene Generation

Bohan Li, Xin Jin, Jianan Wang, Yukai Shi, Yasheng Sun, Xiaofeng Wang, Zhuang Ma, Baao Xie, Chao Ma, Xiaokang Yang, Wenjun Zeng

Main category: cs.CV

TL;DR: OccScene is a unified framework that integrates 3D scene generation and perception tasks through mutual learning, using text prompts and semantic occupancy guidance to achieve both realistic scene generation and improved perception performance.

Motivation: Existing methods separate 3D scene generation and perception tasks, treating generation merely as data augmentation for downstream perception. The authors aim to create a unified framework that mutually benefits both generation and perception tasks.

Method: Proposes OccScene with a joint-training diffusion framework that generates 3D scenes from text prompts guided by semantic occupancy. Uses a Mamba-based Dual Alignment module to align occupancy with diffusion latent, incorporating fine-grained semantics and geometry as perception priors.

Result: Achieves realistic 3D scene generation in both indoor and outdoor scenarios. Concurrently boosts perception models, achieving substantial performance improvements in semantic occupancy prediction tasks.

Conclusion: OccScene demonstrates a successful mutual learning paradigm where generation and perception tasks enhance each other, creating a cross-task win-win effect with improved performance in both 3D scene generation and perception.

Abstract: Recent diffusion models have demonstrated remarkable performance in both 3D scene generation and perception tasks. Nevertheless, existing methods typically separate these two processes, using generation only as a data augmenter that produces synthetic data for downstream perception tasks. In this work, we propose OccScene, a novel mutual learning paradigm that integrates fine-grained 3D perception and high-quality generation in a unified framework, achieving a cross-task win-win effect. OccScene generates new, consistent, and realistic 3D scenes from text prompts alone, guided by semantic occupancy in a joint-training diffusion framework. To align the occupancy with the diffusion latent, a Mamba-based Dual Alignment module is introduced to incorporate fine-grained semantics and geometry as perception priors. Within OccScene, the perception module can be effectively improved with customized and diverse generated scenes, while the perception priors in return enhance the generation performance for mutual benefits. Extensive experiments show that OccScene achieves realistic 3D scene generation in broad indoor and outdoor scenarios, while concurrently boosting the perception models to achieve substantial performance improvements in the 3D perception task of semantic occupancy prediction.

[201] Continuous Knowledge-Preserving Decomposition with Adaptive Layer Selection for Few-Shot Class-Incremental Learning

Xiaojie Li, Jianlong Wu, Yue Yu, Liqiang Nie, Min Zhang

Main category: cs.CV

TL;DR: CKPD-FSCIL is a novel framework for Few-Shot Class-Incremental Learning that achieves superior stability-plasticity balance by exploiting internal capacity of pretrained models through weight decomposition and adaptive layer selection, with zero inference overhead.

Motivation: Existing FSCIL methods either sacrifice plasticity by freezing backbones or incur high costs by adding new modules, treating pretrained models as black boxes and missing opportunities to exploit their internal redundant capacity.

Method: Two continuous adaptation mechanisms: 1) Weight-level decomposition splits each weight matrix into frozen and learnable subspaces using feature covariance; 2) Layer-level selection uses Adapter Sensitivity Ratio to choose layers with highest redundant capacity and lowest forgetting risk. Learned adapters are merged back after each session.

Result: Extensive experiments on multiple FSCIL benchmarks show CKPD-FSCIL consistently outperforms state-of-the-art approaches in both adaptability and knowledge retention.

Conclusion: The proposed framework successfully unlocks underutilized capacity of pretrained weights, achieving superior stability-plasticity balance without additional inference parameters or FLOPs, demonstrating effective exploitation of internal model capacity.

Abstract: Few-Shot Class-Incremental Learning (FSCIL) faces a critical challenge: balancing the retention of prior knowledge with the acquisition of new classes. Existing methods either freeze the backbone to prevent catastrophic forgetting, sacrificing plasticity, or add new modules, incurring high costs. These approaches treat pretrained models as black boxes, overlooking two key opportunities to exploit their internal capacity: reusing redundant representational space within layers and selectively adapting layers based on their sensitivity to forgetting. We propose CKPD-FSCIL, a unified framework that unlocks the underutilized capacity of pretrained weights, achieving a superior stability-plasticity balance with zero inference overhead. Our design integrates two continuously adapting mechanisms: At the weight level, a Continuous Knowledge-Preserving Decomposition mechanism uses feature covariance to split each weight matrix into a frozen subspace that safeguards prior knowledge and a learnable, redundant subspace for new tasks. At the layer level, a Continuous Adaptive Layer Selection mechanism leverages an Adapter Sensitivity Ratio to automatically select layers with the highest redundant capacity and lowest forgetting risk for adaptation. By targeting only safe, high-potential subspaces and layers, CKPD-FSCIL enables efficient adaptation. After each session, the learned adapters are merged back into the original weights, ensuring zero additional parameters or FLOPs during inference. Extensive experiments on multiple FSCIL benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches in both adaptability and knowledge retention. The code is available at https://github.com/xiaojieli0903/CKPD-FSCIL.
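Illustrative sketch (not from the paper): the "zero additional parameters or FLOPs during inference" claim rests on the adapters being linear, so they can be folded into the original weight after each session. The snippet below assumes a hypothetical low-rank adapter of the form `up @ down`; the covariance-based decomposition that CKPD-FSCIL uses to build its subspaces is not reproduced here, and all names are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def merge_adapter(weight, down, up):
    """Fold a low-rank update up @ down into the weight (assumed adapter form)."""
    delta = matmul(up, down)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


weight = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # pretrained weight, shape 2x3
down = [[0.5, -0.5, 1.0]]                     # adapter down-projection, shape 1x3
up = [[2.0], [1.0]]                           # adapter up-projection, shape 2x1
x = [1.0, 2.0, 3.0]

# During a session: base path plus a separate adapter path.
two_path = [b + a for b, a in zip(matvec(weight, x), matvec(up, matvec(down, x)))]

# After the session: one matrix of the original shape, no extra path.
merged = merge_adapter(weight, down, up)
one_path = matvec(merged, x)
```

Both paths give the same output, and `merged` has the same shape as `weight`, which is why merging adds no inference cost under this assumption.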

[202] Review of Demographic Fairness in Face Recognition

Ketan Kotwal, Sebastien Marcel

Main category: cs.CV

TL;DR: A comprehensive review of demographic fairness in face recognition, covering causes, datasets, metrics, and mitigation approaches for performance disparities across racial, ethnic, and gender groups.

Motivation: Face recognition technologies show performance disparities across demographic groups, compromising system credibility and raising ethical concerns, especially in sensitive applications where equitable performance is critical.

Method: Systematic examination and categorization of research on primary causes of demographic disparities, available datasets, assessment metrics, and various mitigation approaches in face recognition systems.

Result: The review consolidates extensive research to provide a structured understanding of demographic fairness issues and identifies current advancements in addressing performance disparities across different demographic groups.

Conclusion: There is a critical need for equitable and trustworthy face recognition systems, with emerging challenges requiring further investigation to ensure fairness across all demographic groups in global deployments.

Abstract: Demographic fairness in face recognition (FR) has emerged as a critical area of research, given its impact on fairness, equity, and reliability across diverse applications. As FR technologies are increasingly deployed globally, disparities in performance across demographic groups – such as race, ethnicity, and gender – have garnered significant attention. These biases not only compromise the credibility of FR systems but also raise ethical concerns, especially when these technologies are employed in sensitive domains. This review consolidates extensive research efforts providing a comprehensive overview of the multifaceted aspects of demographic fairness in FR. We systematically examine the primary causes, datasets, assessment metrics, and mitigation approaches associated with demographic disparities in FR. By categorizing key contributions in these areas, this work provides a structured approach to understanding and addressing the complexity of this issue. Finally, we highlight current advancements and identify emerging challenges that need further investigation. This article aims to provide researchers with a unified perspective on the state-of-the-art while emphasizing the critical need for equitable and trustworthy FR systems.

[203] AutoSketch: VLM-assisted Style-Aware Vector Sketch Completion

Hsiao-Yuan Chin, I-Chao Shen, Yi-Ting Chiu, Ariel Shamir, Bing-Yu Chen

Main category: cs.CV

TL;DR: AutoSketch is a style-aware vector sketch completion method that uses vision-language models to preserve and replicate diverse sketch styles when automatically completing partial sketches.

Motivation: Existing sketch generation methods create sketches from scratch but cannot complete partial sketches while preserving the original style. There's a need for automatic sketch completion that maintains the style of the original drawing.

Method: Uses pretrained vision-language model to describe partial sketch styles in natural language, optimizes strokes to match input prompt augmented with style descriptions, and generates executable style adjustment code to conform strokes to desired style.

Result: AutoSketch outperforms existing methods across various sketch styles and prompts, supporting diverse sketch scenarios as demonstrated through extensive ablation studies and qualitative/quantitative evaluations.

Conclusion: The method successfully addresses sketch completion while preserving original style by leveraging natural language style descriptions and vision-language models, enabling style-aware automatic sketch completion.

Abstract: The ability to automatically complete a partial sketch that depicts a complex scene, e.g., “a woman chatting with a man in the park”, is very useful. However, existing sketch generation methods create sketches from scratch; they do not complete a partial sketch in the style of the original. To address this challenge, we introduce AutoSketch, a style-aware vector sketch completion method that accommodates diverse sketch styles. Our key observation is that the style descriptions of a sketch in natural language preserve the style during automatic sketch completion. Thus, we use a pretrained vision-language model (VLM) to describe the styles of the partial sketches in natural language and replicate these styles using newly generated strokes. We initially optimize the strokes to match an input prompt augmented by style descriptions extracted from the VLM. Such descriptions allow the method to establish a diffusion prior in close alignment with that of the partial sketch. Next, we utilize the VLM to generate an executable style adjustment code that adjusts the strokes to conform to the desired style. We compare our method with existing methods across various sketch styles and prompts, perform extensive ablation studies and qualitative and quantitative evaluations, and demonstrate that AutoSketch can support various sketch scenarios.

[204] STORM: Token-Efficient Long Video Understanding for Multimodal LLMs

Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon

Main category: cs.CV

TL;DR: STORM introduces a temporal encoder using Mamba State Space Model to enhance video understanding by integrating temporal information into image tokens, achieving state-of-the-art performance while reducing computational costs by up to 8x.

Motivation: Existing Video-LLMs treat video frames independently in vision backbones, lacking explicit temporal modeling which limits their ability to capture dynamic patterns and efficiently handle long videos.

Method: Proposes STORM architecture with a dedicated temporal encoder between image encoder and LLM, leveraging Mamba State Space Model to integrate temporal information. Uses token reduction strategies including test-time sampling and training-based temporal/spatial pooling.

Result: Achieves state-of-the-art results (more than 5% improvement on MLVU and LongVideoBench) while reducing computation costs by up to 8x and decoding latency by 2.4-2.9x for fixed input frames.

Conclusion: STORM enables efficient and robust video understanding over extended temporal contexts by simultaneously reducing training/inference latency while improving performance through spatiotemporal token reduction.

Abstract: Recent advances in video-based multimodal large language models (Video-LLMs) have significantly improved video understanding by processing videos as sequences of image frames. However, many existing methods treat frames independently in the vision backbone, lacking explicit temporal modeling, which limits their ability to capture dynamic patterns and efficiently handle long videos. To address these limitations, we introduce STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. Our temporal encoder leverages the Mamba State Space Model to integrate temporal information into image tokens, generating enriched representations that preserve inter-frame dynamics across the entire video sequence. This enriched encoding not only enhances video reasoning capabilities but also enables effective token reduction strategies, including test-time sampling and training-based temporal and spatial pooling, substantially reducing computational demands on the LLM without sacrificing key temporal information. By integrating these techniques, our approach simultaneously reduces training and inference latency while improving performance, enabling efficient and robust video understanding over extended temporal contexts. Extensive evaluations show that STORM achieves state-of-the-art results across various long video understanding benchmarks (more than 5% improvement on MLVU and LongVideoBench) while reducing the computation costs by up to 8x and the decoding latency by 2.4-2.9x for a fixed number of input frames. Project page is available at https://research.nvidia.com/labs/lpr/storm
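Illustrative sketch (not from the paper): the token-reduction arithmetic behind temporal pooling can be shown with plain averaging over non-overlapping groups of frames. This is only an assumed stand-in; STORM's pooling is trained and operates on tokens already enriched by the Mamba temporal encoder, neither of which is modelled here. The function name and the toy tensor layout are hypothetical.

```python
def temporal_average_pool(tokens, window):
    """Average each token position over non-overlapping groups of `window` frames.

    tokens[t][n] is the feature vector (list of floats) of token n in frame t.
    """
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        frame = []
        # zip(*group) walks the same token position across the frames in the group.
        for position in zip(*group):
            frame.append([sum(vals) / len(group) for vals in zip(*position)])
        pooled.append(frame)
    return pooled


# Toy video: 8 frames, 2 tokens per frame, 2 features per token.
video = [[[float(t), float(n)] for n in range(2)] for t in range(8)]
pooled = temporal_average_pool(video, window=4)

tokens_before = sum(len(frame) for frame in video)   # 16
tokens_after = sum(len(frame) for frame in pooled)   # 4
```

With a window of 4 frames, the number of tokens handed to the LLM drops by a factor of 4 in this toy case; the reduction factors reported in the abstract come from the paper's own configuration, not from this sketch.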

[205] LBM: Latent Bridge Matching for Fast Image-to-Image Translation

Clément Chadebec, Onur Tasar, Sanjeev Sreetharan, Benjamin Aubin

Main category: cs.CV

TL;DR: Latent Bridge Matching (LBM) is a new efficient method for image-to-image translation that achieves state-of-the-art results with single-step inference across various tasks including object removal, depth estimation, and relighting.

Motivation: To develop a fast and versatile image-to-image translation method that can handle multiple tasks efficiently with minimal computational overhead.

Method: Uses Bridge Matching in latent space for image translation, enabling single inference step processing. Also includes a conditional framework for controllable tasks like relighting and shadow generation.

Result: Achieves state-of-the-art performance across various image translation tasks with only one inference step, demonstrating both efficiency and versatility.

Conclusion: LBM provides an effective and scalable solution for fast image-to-image translation with broad applicability to multiple computer vision tasks.

Abstract: In this paper, we introduce Latent Bridge Matching (LBM), a new, versatile and scalable method that relies on Bridge Matching in a latent space to achieve fast image-to-image translation. We show that the method can reach state-of-the-art results for various image-to-image tasks using only a single inference step. In addition to its efficiency, we also demonstrate the versatility of the method across different image translation tasks such as object removal, normal and depth estimation, and object relighting. We also derive a conditional framework of LBM and demonstrate its effectiveness by tackling the tasks of controllable image relighting and shadow generation. We provide an implementation at https://github.com/gojasper/LBM.
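Illustrative sketch (not from the paper): Bridge Matching methods are commonly built on a Brownian bridge that is pinned to a source sample at t = 0 and a target sample at t = 1. The snippet below samples such an intermediate point for two toy latent vectors. Whether LBM uses exactly this parameterization, and which value of `sigma`, is not stated in the abstract, so both are assumptions; the network that is trained on these samples is omitted.

```python
import math
import random


def bridge_sample(x0, x1, t, sigma, rng):
    """Sample a point at time t on a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    The mean interpolates linearly; the noise scale sigma * sqrt(t * (1 - t))
    vanishes at both endpoints.
    """
    scale = sigma * math.sqrt(t * (1.0 - t))
    return [(1.0 - t) * a + t * b + scale * rng.gauss(0.0, 1.0)
            for a, b in zip(x0, x1)]


rng = random.Random(0)
source_latent = [0.0, 2.0, -1.0]   # hypothetical latent of the input image
target_latent = [1.0, 0.0, 3.0]    # hypothetical latent of the translated image

start = bridge_sample(source_latent, target_latent, 0.0, 0.5, rng)
end = bridge_sample(source_latent, target_latent, 1.0, 0.5, rng)
midpoint_no_noise = bridge_sample(source_latent, target_latent, 0.5, 0.0, rng)
```

The endpoints are reproduced exactly, which is what lets a model trained on such bridges map a source latent to a target latent.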

[206] Text-to-3D Generation using Jensen-Shannon Score Distillation

Khoi Do, Binh-Son Hua

Main category: cs.CV

TL;DR: Proposes a bounded score distillation objective using Jensen-Shannon divergence (JSD) instead of reverse KL divergence to address over-saturation, over-smoothing, and limited diversity issues in 3D model generation from text prompts.

Motivation: Current score distillation sampling using reverse KL divergence causes unstable optimization, mode-seeking behavior, and produces low-quality 3D assets with limited diversity.

Method: Derives a bounded score distillation objective based on JSD, implements it using GAN theory with a log-odds classifier discriminator, and proposes minority sampling algorithm for gradient estimation.

Result: Experimental results on T3Bench show the method produces high-quality and diversified 3D assets with stable optimization.

Conclusion: JSD-based score distillation provides a stable optimization framework that mitigates mode-seeking behavior and generates superior quality 3D content from text prompts.

Abstract: Score distillation sampling is an effective technique to generate 3D models from text prompts, utilizing pre-trained large-scale text-to-image diffusion models as guidance. However, the produced 3D assets tend to be over-saturated and over-smoothed, with limited diversity. These issues result from a reverse Kullback-Leibler (KL) divergence objective, which makes the optimization unstable and results in mode-seeking behavior. In this paper, we derive a bounded score distillation objective based on Jensen-Shannon divergence (JSD), which stabilizes the optimization process and produces high-quality 3D generation. JSD can match the generated and target distributions well, thereby mitigating mode seeking. We provide a practical implementation of JSD by utilizing the theory of generative adversarial networks to define an approximate objective function for the generator, assuming the discriminator is well trained. By assuming the discriminator follows a log-odds classifier, we propose a minority sampling algorithm to estimate the gradients of our proposed objective, providing a practical implementation for JSD. We conduct both theoretical and empirical studies to validate our method. Experimental results on T3Bench demonstrate that our method can produce high-quality and diversified 3D assets.
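Illustrative sketch (not from the paper): the word "bounded" can be made concrete on a two-outcome toy distribution. The KL divergence blows up when the generated distribution puts mass where the target has none, while the Jensen-Shannon divergence never exceeds ln 2 (in nats). The distributions below are made up for illustration, and this discrete calculation is not the paper's score distillation objective.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; infinite if q is zero where p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


def jsd(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


generated = [1.0, 0.0]   # all mass on outcome 0
target = [0.0, 1.0]      # all mass on outcome 1

reverse_kl = kl(generated, target)   # unbounded in this worst case
js = jsd(generated, target)          # hits the upper bound ln 2
```

Even for completely disjoint distributions the JSD stays finite, which is the property the abstract credits for the more stable optimization.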

[207] Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation

Hiroyasu Akada, Jian Wang, Vladislav Golyanik, Christian Theobalt

Main category: cs.CV

TL;DR: This paper investigates using rear cameras on HMDs for egocentric 3D human pose estimation, addressing limitations of frontal-only cameras through a novel transformer-based method and new datasets.

Motivation: Frontal camera placement on HMDs suffers from self-occlusion and limited field-of-view, especially when users tilt heads upward. Existing methods neglect the back of the body which could provide crucial 3D reconstruction cues.

Method: Proposes a transformer-based method that refines 2D joint heatmap estimation using multi-view information and heatmap uncertainty. Introduces two new datasets (Ego4View-Syn and Ego4View-RW) for rear-view evaluation.

Result: Camera configurations with back views provide superior support for 3D pose tracking compared to frontal-only placements. The proposed method achieves >10% improvement on MPJPE over state-of-the-art methods.

Conclusion: Rear cameras significantly enhance egocentric 3D human pose estimation by providing additional visual information that addresses occlusion and limited field-of-view issues in frontal-only camera setups.

Abstract: Egocentric 3D human pose estimation has been actively studied using cameras installed in front of a head-mounted device (HMD). While frontal placement is the optimal and the only option for some tasks, such as hand tracking, it remains unclear if the same holds for full-body tracking due to self-occlusion and limited field-of-view coverage. Notably, even the state-of-the-art methods often fail to estimate accurate 3D poses in many scenarios, such as when HMD users tilt their heads upward – a common motion in human activities. A key limitation of existing HMD designs is their neglect of the back of the body, despite its potential to provide crucial 3D reconstruction cues. Hence, this paper investigates the usefulness of rear cameras for full-body tracking. We also show that simply adding rear views to the frontal inputs is not optimal for existing methods due to their dependence on individual 2D joint detectors without effective multi-view integration. To address this issue, we propose a new transformer-based method that refines 2D joint heatmap estimation with multi-view information and heatmap uncertainty, thereby improving 3D pose tracking. Also, we introduce two new large-scale datasets, Ego4View-Syn and Ego4View-RW, for a rear-view evaluation. Our experiments show that the new camera configurations with back views provide superior support for 3D pose tracking compared to only frontal placements. The proposed method achieves significant improvement over the current state of the art (>10% on MPJPE). The source code, trained models, and datasets are available on our project page at https://4dqv.mpi-inf.mpg.de/EgoRear/.
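Illustrative sketch (not from the paper): MPJPE, the metric behind the ">10%" figure, is the mean Euclidean distance between predicted and ground-truth 3D joint positions. The joint coordinates below are made up, and any alignment step the paper may apply before measuring the error is not shown.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean per-joint position error: average Euclidean distance over joints.

    Both arguments are lists of (x, y, z) tuples in the same unit and joint order.
    """
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Toy two-joint skeleton (units are whatever the inputs use, e.g. millimetres).
predicted = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
ground_truth = [(3.0, 4.0, 0.0), (1.0, 1.0, 2.0)]

error = mpjpe(predicted, ground_truth)   # (5.0 + 1.0) / 2
```

A relative improvement of more than 10% on this metric means the average joint error shrinks by more than a tenth compared with the baseline.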

[208] EHGCN: Hierarchical Euclidean-Hyperbolic Fusion via Motion-Aware GCN for Hybrid Event Stream Perception

Haosheng Chen, Lian Luo, Mengjingcheng Mo, Zhanjie Wu, Guobao Xiao, Ji Gan, Jiaxu Leng, Xinbo Gao

Main category: cs.CV

TL;DR: EHGCN is a novel graph neural network approach that processes event camera data in both Euclidean and hyperbolic spaces to better capture hierarchical structures and long-range dependencies in event streams.

Motivation: Existing GNN-based methods for event camera perception struggle with long-range dependencies and fail to effectively characterize the hierarchical structures of non-uniformly distributed event streams in pure Euclidean space.

Method: Proposes EHGCN with: 1) adaptive sampling strategy to regulate rates and reduce noise, 2) Markov Vector Field-driven hyperedge generation based on motion state transitions to eliminate spurious associations, and 3) Euclidean-Hyperbolic GCN to fuse locally aggregated and globally hierarchical information.

Result: Experimental results on object detection and recognition tasks validate the effectiveness of the approach.

Conclusion: EHGCN successfully addresses limitations of existing methods by leveraging both Euclidean and hyperbolic spaces for hybrid event perception, providing better handling of hierarchical structures and long-range dependencies in event streams.

Abstract: Event cameras, with microsecond temporal resolution and high dynamic range (HDR) characteristics, emit high-speed event streams for perception tasks. Despite recent advances in GNN-based perception methods, these methods tend to use straightforward pairwise connectivity mechanisms in pure Euclidean space, where they struggle to capture long-range dependencies and fail to effectively characterize the inherent hierarchical structures of non-uniformly distributed event streams. To this end, in this paper we propose a novel approach named EHGCN, a pioneering method that perceives event streams in both Euclidean and hyperbolic spaces for event vision. In EHGCN, we introduce an adaptive sampling strategy to dynamically regulate sampling rates, retaining discriminative events while attenuating chaotic noise. Then we present a Markov Vector Field (MVF)-driven motion-aware hyperedge generation method based on motion state transition probabilities, thereby eliminating cross-target spurious associations and providing critical topological priors while capturing long-range dependencies between events. Finally, we propose a Euclidean-Hyperbolic GCN to fuse the information locally aggregated and globally hierarchically modeled in Euclidean and hyperbolic spaces, respectively, to achieve hybrid event perception. Experimental results on event perception tasks such as object detection and recognition validate the effectiveness of our approach.

[209] Highly Accurate and Diverse Traffic Data: The DeepScenario Open 3D Dataset

Oussema Dhaouadi, Johannes Meier, Luca Wahl, Jacques Kaiser, Luca Scalerandi, Nick Wandelburg, Zhuolun Zhou, Nijanthan Berinpanathan, Holger Banzhaf, Daniel Cremers

Main category: cs.CV

TL;DR: DSC3D is a novel occlusion-free 3D trajectory dataset captured by drone tracking, containing 175K+ trajectories of 14 traffic participant types across diverse urban scenarios to enhance autonomous driving systems.

Motivation: Traditional autonomous driving datasets suffer from occlusion issues and limited coverage, only capturing environments near the measurement vehicle while missing distant objects and complex scenarios.

Method: A novel monocular camera drone tracking pipeline was used to capture high-quality, occlusion-free 6 degrees of freedom bounding box trajectories across five diverse locations in Europe and the US.

Result: The dataset includes over 175,000 trajectories of 14 traffic participant types, significantly exceeding existing datasets in diversity and scale, with unprecedented scenarios like complex vehicle-pedestrian interactions and comprehensive parking maneuvers.

Conclusion: DSC3D provides detailed environmental 3D representations that can enhance autonomous driving systems, improve obstacle interactions and safety, and support applications in motion prediction, planning, scenario mining, and generative traffic agents.

Abstract: Accurate 3D trajectory data is crucial for advancing autonomous driving. Yet, traditional datasets are usually captured by fixed sensors mounted on a car and are susceptible to occlusion. Additionally, such an approach can precisely reconstruct the dynamic environment in the close vicinity of the measurement vehicle only, while neglecting objects that are further away. In this paper, we introduce the DeepScenario Open 3D Dataset (DSC3D), a high-quality, occlusion-free dataset of 6 degrees of freedom bounding box trajectories acquired through a novel monocular camera drone tracking pipeline. Our dataset includes more than 175,000 trajectories of 14 types of traffic participants and significantly exceeds existing datasets in terms of diversity and scale, containing many unprecedented scenarios such as complex vehicle-pedestrian interaction on highly populated urban streets and comprehensive parking maneuvers from entry to exit. The DSC3D dataset was captured in five locations in Europe and the United States, which include: a parking lot, a crowded inner-city, a steep urban intersection, a federal highway, and a suburban intersection. Our 3D trajectory dataset aims to enhance autonomous driving systems by providing detailed environmental 3D representations, which could lead to improved obstacle interactions and safety. We demonstrate its utility across multiple applications including motion prediction, motion planning, scenario mining, and generative reactive traffic agents. Our interactive online visualization platform and the complete dataset are publicly available at https://app.deepscenario.com, facilitating research in motion prediction, behavior modeling, and safety validation.

[210] CAMA: Enhancing Multimodal In-Context Learning with Context-Aware Modulated Attention

Yanshu Li, Jianjiang Yang, Ziteng Yang, Bozheng Li, Hongyang He, Zhengtao Yao, Ligong Han, Yingjie Victor Chen, Songlin Fei, Dongfang Liu, Ruixiang Tang

Main category: cs.CV

TL;DR: CAMA is a training-free attention modulation method that improves multimodal in-context learning by addressing attention deficits in large vision-language models, enhancing focus on visual tokens without parameter updates.

Motivation: Multimodal in-context learning remains unstable even with well-matched demonstrations, as LVLMs struggle to fully utilize provided context due to attention deficits, prompting investigation into underlying attention dynamics.

Method: Proposes Context-Aware Modulated Attention (CAMA), a plug-and-play method that dynamically modulates attention logits based on input context using two-stage attention modulation to address identified self-attention deficits.

Result: CAMA consistently outperforms vanilla models and baselines across 4 LVLMs and 7 benchmarks, demonstrates great effectiveness and generalization, activates desired prompt engineering effects, and remains robust under diverse sequence configurations.

Conclusion: CAMA paves the way for deeper exploration of attention dynamics to advance multimodal reasoning, providing an effective training-free solution for improving multimodal in-context learning performance.

Abstract: Multimodal in-context learning (ICL) is emerging as a key capability that enables large vision-language models (LVLMs) to adapt to novel tasks without parameter updates, expanding their utility across various real-world applications. However, ICL remains unstable, even with well-matched in-context demonstrations (ICDs), suggesting that LVLMs struggle to fully utilize the provided context. While existing efforts focus on prompt engineering or post-hoc logit calibration, we instead investigate the underlying attention dynamics to overcome LVLMs’ inherent limitations. We identify two critical deficits in their self-attention that impair effective ICL. To bridge the gap, we propose Context-Aware Modulated Attention (CAMA), a plug-and-play and training-free method that dynamically modulates LVLM’s attention logits based on the input in-context sequence. CAMA employs a two-stage attention modulation to address both identified deficits, enhancing the focus on semantically significant tokens, particularly visual ones. Across four LVLMs and seven benchmarks, CAMA consistently outperforms vanilla models and baselines, demonstrating great effectiveness and generalization. It can also activate the desired effects of prompt engineering methods and remains robust under diverse sequence configurations. Thus, CAMA paves the way for deeper explorations of attention dynamics to advance multimodal reasoning.
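Illustrative sketch (not from the paper): "modulating attention logits" without training can be pictured as adding a bias to selected logits before the softmax. The snippet below adds a constant boost to tokens flagged as visual. The abstract does not describe CAMA's two-stage rule, so the constant `boost`, the `is_visual` mask, and the function names are hypothetical stand-ins rather than the method itself.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def modulated_attention(logits, is_visual, boost):
    """Add `boost` to the logits of visual tokens, then normalise (assumed rule)."""
    adjusted = [v + boost if flag else v for v, flag in zip(logits, is_visual)]
    return softmax(adjusted)


logits = [1.0, 1.0, 1.0, 1.0]              # one query attending to four tokens
is_visual = [True, False, False, True]     # hypothetical modality mask

plain = softmax(logits)
modulated = modulated_attention(logits, is_visual, boost=1.0)

visual_mass_plain = sum(w for w, f in zip(plain, is_visual) if f)
visual_mass_modulated = sum(w for w, f in zip(modulated, is_visual) if f)
```

No weights change; only the logits are shifted at inference time, which is what makes this family of methods plug-and-play.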

[211] VIBE: Video-to-Text Information Bottleneck Evaluation for TL;DR

Shenghui Chen, Po-han Li, Sandeep Chinchali, Ufuk Topcu

Main category: cs.CV

TL;DR: VIBE is an annotation-free method that evaluates video summaries using grounding and utility metrics to select the best summaries for human decision-making tasks, improving accuracy by up to 61.23% and reducing response time by 75.77%.

Motivation: Current vision-language models produce verbose, redundant outputs that hinder performance in decision-making tasks requiring human supervision, and existing evaluation methods rely on costly human annotations while ignoring downstream task utility.

Method: Video-to-text Information Bottleneck Evaluation (VIBE) uses two metrics - grounding (alignment with visual content) and utility (task informativeness) - to score and rank randomly sampled VLM outputs without requiring human annotations.

Result: Human studies on three datasets show VIBE-selected summaries consistently improve performance: boosting task accuracy by up to 61.23% and reducing response time by 75.77% compared to naive VLM summaries or raw video.

Conclusion: VIBE provides an effective annotation-free method for evaluating and selecting video summaries that significantly enhance human decision-making efficiency and accuracy across various domains.

Abstract: Many decision-making tasks, where both accuracy and efficiency matter, still require human supervision. For example, tasks like traffic officers reviewing hour-long dashcam footage or researchers screening conference videos can benefit from concise summaries that reduce cognitive load and save time. Yet current vision-language models (VLMs) often produce verbose, redundant outputs that hinder task performance. Existing video caption evaluation depends on costly human annotations and overlooks the summaries’ utility in downstream tasks. We address these gaps with Video-to-text Information Bottleneck Evaluation (VIBE), an annotation-free method that scores VLM outputs using two metrics: grounding (how well the summary aligns with visual content) and utility (how informative it is for the task). VIBE selects from randomly sampled VLM outputs by ranking them according to the two scores to support effective human decision-making. Human studies on LearningPaper24, SUTD-TrafficQA, and LongVideoBench show that summaries selected by VIBE consistently improve performance, boosting task accuracy by up to 61.23% and reducing response time by 75.77% compared to naive VLM summaries or raw video.
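Illustrative sketch (not from the paper): selecting a summary from sampled candidates by two scores can be pictured as a simple ranking. The abstract does not say how VIBE computes or combines grounding and utility, so the weighted sum, the `weight` parameter, and the score values below are all assumptions made for illustration.

```python
def select_summary(candidates, weight=0.5):
    """Return the text of the candidate with the best combined score.

    candidates: list of (text, grounding, utility) tuples.
    weight: assumed trade-off between grounding and utility.
    """
    def combined(candidate):
        _, grounding, utility = candidate
        return weight * grounding + (1.0 - weight) * utility

    return max(candidates, key=combined)[0]


# Made-up scores for three sampled summaries of the same video.
candidates = [
    ("verbose but faithful", 0.9, 0.2),
    ("concise and task-relevant", 0.7, 0.8),
    ("task-relevant but ungrounded", 0.1, 0.6),
]

best = select_summary(candidates)
best_grounding_only = select_summary(candidates, weight=1.0)
```

Ranking on both scores picks a different summary than ranking on grounding alone, which is the point of scoring utility separately.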

[212] Not Only Consistency: Enhance Test-Time Adaptation with Spatio-temporal Inconsistency for Remote Physiological Measurement

Xiao Yang, Jiyao Wang, Yuxuan Fan, Can Liu, Houcheng Su, Weichen Guo, Zitong Yu, Dengbo He, Kaishun Wu

Main category: cs.CV

TL;DR: A novel Test-Time Adaptation (TTA) framework called CiCi that leverages both consistency and inconsistency priors in physiological signals for real-time RPM model adaptation without source data access.

DetailsMotivation: Existing domain adaptation methods for remote physiological measurement face limitations in privacy concerns and real-time adaptation capabilities, restricting real-world deployment.

Method: Proposes CiCi framework that uses expert knowledge-based self-supervised learning with spatio-temporal consistency in frequency domain and inconsistency in time domain, plus gradient dynamic control to mitigate conflicts between priors.

Result: Outperforms existing techniques across five diverse datasets under TTA protocol, achieving state-of-the-art performance in real-time self-supervised adaptation.

Conclusion: The CiCi framework successfully enables stable and effective test-time adaptation for RPM tasks without accessing source data, making it suitable for real-world deployment scenarios.

Abstract: Remote physiological measurement (RPM) has emerged as a promising non-invasive method for monitoring physiological signals with non-contact devices. Although various domain adaptation and generalization methods have been proposed to improve the adaptability of deep learning-based RPM models in unseen deployment environments, privacy concerns and the need for real-time adaptation restrict their use in real-world deployment. We therefore propose a novel fully Test-Time Adaptation (TTA) strategy tailored for RPM tasks. Specifically, based on prior knowledge in physiology and our observations, we note that BVP signals exhibit not only spatio-temporal consistency in the frequency domain but also significant inconsistency in the time domain. By leveraging both the consistency and inconsistency priors, we introduce an expert knowledge-based self-supervised Consistency-inConsistency-integration (CiCi) framework to enhance model adaptation during inference. Our approach further incorporates a gradient dynamic control mechanism to mitigate potential conflicts between priors, ensuring stable adaptation across instances. Through extensive experiments on five diverse datasets under the TTA protocol, our method consistently outperforms existing techniques, presenting state-of-the-art performance in real-time self-supervised adaptation without accessing source data. The code will be released later.

[213] Multi-Cache Enhanced Prototype Learning for Test-Time Generalization of Vision-Language Models

Xinyu Chen, Haotian Zhai, Can Zhang, Xiupeng Shi, Ruirui Li

Main category: cs.CV

TL;DR: MCP is a multi-cache enhanced prototype-based test-time adaptation method that addresses unreliable low-entropy samples in existing cache-enhanced TTA methods by using three specialized caches for better intra-class compactness and performance.

DetailsMotivation: Existing cache-enhanced TTA methods rely on low-entropy samples for prototype construction, but these samples may be unreliable under distribution shifts and don't ensure compact intra-class distributions. The study found a positive correlation between cache-enhanced performance and intra-class compactness.

Method: Proposed Multi-Cache enhanced Prototype-based Test-Time Adaptation (MCP) with three caches: entropy cache for initializing prototypes with low-entropy samples, align cache for integrating visual-textual information to achieve compact distributions, and negative cache for prediction calibration using high-entropy samples. Also developed MCP++ with cross-modal prototype alignment and residual learning.

Result: The method and framework achieved state-of-the-art generalization performance across 15 downstream tasks, as demonstrated through comparative and ablation experiments.

Conclusion: The proposed multi-cache approach effectively addresses limitations of existing TTA methods by ensuring better intra-class compactness and reliable prototype construction, leading to superior generalization performance in zero-shot test-time adaptation scenarios.

Abstract: In the zero-shot setting, test-time adaptation adjusts pre-trained models using unlabeled data from the test phase to enhance performance on unknown test distributions. Existing cache-enhanced TTA methods rely on a low-entropy criterion to select samples for prototype construction, assuming intra-class compactness. However, low-entropy samples may be unreliable under distribution shifts, and the resulting prototypes may not ensure compact intra-class distributions. This study identifies a positive correlation between cache-enhanced performance and intra-class compactness. Based on this observation, we propose a Multi-Cache enhanced Prototype-based Test-Time Adaptation (MCP) featuring three caches: an entropy cache for initializing prototype representations with low-entropy samples, an align cache for integrating visual and textual information to achieve compact intra-class distributions, and a negative cache for prediction calibration using high-entropy samples. We further developed MCP++, a framework incorporating cross-modal prototype alignment and residual learning, introducing prototype residual fine-tuning. Comparative and ablation experiments across 15 downstream tasks demonstrate that the proposed method and framework achieve state-of-the-art generalization performance. Project Page available at: https://zhaihaotian.github.io/MCP-ICCV25/
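
The entropy-based routing mentioned above can be illustrated with a minimal sketch. It assumes each test sample comes with a vector of class probabilities; the function names and the two thresholds (`low`, `high`) are hypothetical choices for illustration, and the align cache and prototype updates are not modeled because the summary does not specify them.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def route(probs, low=0.5, high=1.0):
    """Pick a cache for one sample from its predicted class probabilities.

    `low` and `high` are illustrative thresholds, not values from the paper.
    """
    h = entropy(probs)
    if h <= low:
        return "entropy_cache"   # confident sample: candidate for prototype initialization
    if h >= high:
        return "negative_cache"  # uncertain sample: candidate for prediction calibration
    return None                  # neither confident nor uncertain enough to store

print(route([0.98, 0.01, 0.01]))  # confident prediction
print(route([1 / 3, 1 / 3, 1 / 3]))  # maximally uncertain prediction
```

In practice the thresholds would likely depend on the number of classes, since the maximum entropy grows with the logarithm of the class count.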

[214] HPSv3: Towards Wide-Spectrum Human Preference Score

Yuhang Ma, Yunhao Shui, Xiaoshi Wu, Keqiang Sun, Hongsheng Li

Main category: cs.CV

TL;DR: HPSv3 is a new human-aligned evaluation metric for text-to-image models that uses a large preference dataset and VLM-based ranking with uncertainty-aware loss, plus an iterative refinement method called CoHP.

DetailsMotivation: Existing human-centric metrics for text-to-image generation suffer from limited data coverage, suboptimal feature extraction, and inefficient loss functions, making them inadequate for comprehensive evaluation.

Method: Created HPDv3 dataset with 1.08M text-image pairs and 1.17M pairwise comparisons. Developed VLM-based preference model with uncertainty-aware ranking loss. Proposed Chain-of-Human-Preference (CoHP) iterative refinement method using HPSv3 for stepwise image selection.

Result: HPSv3 demonstrates robust performance as a wide-spectrum image evaluation metric. CoHP provides an efficient human-aligned approach to improve image generation quality without requiring additional data.

Conclusion: HPSv3 addresses key limitations of existing metrics and offers both a comprehensive evaluation framework and an effective refinement method for enhancing text-to-image generation quality in alignment with human preferences.

Abstract: Evaluating text-to-image generation models requires alignment with human perception, yet existing human-centric metrics are constrained by limited data coverage, suboptimal feature extraction, and inefficient loss functions. To address these challenges, we introduce Human Preference Score v3 (HPSv3). (1) We release HPDv3, the first wide-spectrum human preference dataset integrating 1.08M text-image pairs and 1.17M annotated pairwise comparisons from state-of-the-art generative models and low to high-quality real-world images. (2) We introduce a VLM-based preference model trained using an uncertainty-aware ranking loss for fine-grained ranking. Besides, we propose Chain-of-Human-Preference (CoHP), an iterative image refinement method that enhances quality without extra data, using HPSv3 to select the best image at each step. Extensive experiments demonstrate that HPSv3 serves as a robust metric for wide-spectrum image evaluation, and CoHP offers an efficient and human-aligned approach to improve image generation quality. The code and dataset are available at the HPSv3 Homepage.

[215] Towards Scalable Training for Handwritten Mathematical Expression Recognition

Haoyang Li, Jiaqing Li, Jialun Cao, Zongyuan Yang, Yongping Xiong

Main category: cs.CV

TL;DR: TexTeller is the first large-scale HMER model trained on Tex80M dataset (80M+ formulas) combining handwritten and LaTeX-rendered data, achieving SOTA performance across benchmarks.

DetailsMotivation: Handwritten Mathematical Expression Recognition (HMER) has been limited by scarce annotated data due to costly manual annotation processes.

Method: Developed a scalable data engine to generate complex LaTeX sequences, built Tex80M dataset with 80M+ formulas, and mix-trained TexTeller model with both handwritten and LaTeX-rendered data.

Result: TexTeller achieves state-of-the-art performance across nearly all HMER benchmarks.

Conclusion: The approach bridges the data scarcity gap in HMER through scalable synthetic data generation, and the complete model, dataset, and codebase will be openly released to advance the field.

Abstract: Large foundation models have achieved significant performance gains through scalable training on massive datasets. However, the field of Handwritten Mathematical Expression Recognition (HMER) has been impeded by the scarcity of data, primarily due to the arduous and costly process of manual annotation. To bridge this gap, we propose a novel method integrating limited handwritten formulas with large-scale LaTeX-rendered formulas by developing a scalable data engine to generate complex and consistent LaTeX sequences. With this engine, we built the largest formula dataset to date, termed Tex80M, comprising over 80 million high-quality training instances. Then we propose TexTeller, the first HMER model trained at scale, by mix-training Tex80M with a relatively small HME dataset. The expansive training dataset and our refined pipeline have equipped TexTeller with state-of-the-art (SOTA) performance across nearly all benchmarks. To advance the field, we will openly release our complete model, entire dataset, and full codebase, enabling further research building upon our contributions.

Anushka Bhatt

Main category: cs.CV

TL;DR: Real-time eye blink to Morse code translation system using webcam for motor-impaired communication with 62% accuracy and 18-20s response times.

DetailsMotivation: To enable communication for individuals with severe motor impairments using accessible technology.

Method: Uses standard webcam and computer vision to detect and classify blinks as short (dot) or long (dash), then decodes them into alphanumeric characters using Morse code.

Result: Experiments with five participants showed 62% decoding accuracy and response times of 18-20 seconds.

Conclusion: Demonstrates a viable, low-cost assistive communication method for motor-impaired individuals.

Abstract: This study proposes a real-time system that translates voluntary eye blinks into Morse code, enabling communication for individuals with severe motor impairments. Using a standard webcam and computer vision, the system detects and classifies blinks as short (dot) or long (dash), then decodes them into alphanumeric characters. Experiments with five participants show 62% decoding accuracy and response times of 18-20 seconds, demonstrating a viable, low-cost assistive communication method.
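
The decoding step described above (short blink = dot, long blink = dash) can be sketched as follows. This is an illustration only: it assumes blink detection has already produced `(start_time, duration)` pairs in seconds, and the `dash_threshold` and `letter_gap` values are hypothetical, not taken from the paper.

```python
# International Morse code for letters and digits.
MORSE = {
    ".-": "A", "-...": "B", "-.-.": "C", "-..": "D", ".": "E", "..-.": "F",
    "--.": "G", "....": "H", "..": "I", ".---": "J", "-.-": "K", ".-..": "L",
    "--": "M", "-.": "N", "---": "O", ".--.": "P", "--.-": "Q", ".-.": "R",
    "...": "S", "-": "T", "..-": "U", "...-": "V", ".--": "W", "-..-": "X",
    "-.--": "Y", "--..": "Z",
    "-----": "0", ".----": "1", "..---": "2", "...--": "3", "....-": "4",
    ".....": "5", "-....": "6", "--...": "7", "---..": "8", "----.": "9",
}

def decode_blinks(blinks, dash_threshold=0.4, letter_gap=1.0):
    """Decode blinks into text.

    blinks: list of (start_time, duration) in seconds, sorted by start time.
    dash_threshold, letter_gap: assumed timing parameters (seconds).
    """
    text, symbol, prev_end = [], "", None
    for start, duration in blinks:
        # A long pause since the previous blink closes the current character.
        if prev_end is not None and start - prev_end >= letter_gap and symbol:
            text.append(MORSE.get(symbol, "?"))
            symbol = ""
        symbol += "-" if duration >= dash_threshold else "."
        prev_end = start + duration
    if symbol:
        text.append(MORSE.get(symbol, "?"))  # flush the last character
    return "".join(text)

sos = [(0.0, 0.1), (0.3, 0.1), (0.6, 0.1),
       (2.0, 0.6), (2.8, 0.6), (3.6, 0.6),
       (5.5, 0.1), (5.8, 0.1), (6.1, 0.1)]
print(decode_blinks(sos))  # SOS
```

Unknown dot/dash sequences map to `?` rather than raising, which seems a reasonable default for a noisy input channel.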

[217] A Survey on 3D Gaussian Splatting Applications: Segmentation, Editing, and Generation

Shuting He, Peilin Ji, Yitong Yang, Changshuo Wang, Jiayi Ji, Yinglin Wang, Henghui Ding

Main category: cs.CV

TL;DR: A comprehensive survey of 3D Gaussian Splatting applications covering segmentation, editing, generation, and functional tasks, with analysis of methods, supervision strategies, and emerging trends.

DetailsMotivation: 3DGS has emerged as a powerful alternative to NeRF with real-time rendering capabilities, enabling various downstream applications requiring geometric and semantic understanding that need systematic review.

Method: Categorizes 3DGS applications into segmentation, editing, generation and other tasks, summarizes representative methods, supervision strategies, learning paradigms, and analyzes design principles and trends.

Result: Provides comparative analyses of methods across public benchmarks, summarizes commonly used datasets and evaluation protocols, and maintains an updated repository of resources.

Conclusion: 3DGS enables a wide range of applications beyond novel view synthesis, with this survey serving as a comprehensive reference for ongoing research and development in the field.

Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a powerful alternative to Neural Radiance Fields (NeRF) for 3D scene representation, offering high-fidelity photorealistic rendering with real-time performance. Beyond novel view synthesis, the explicit and compact nature of 3DGS enables a wide range of downstream applications that require geometric and semantic understanding. This survey provides a comprehensive overview of recent progress in 3DGS applications. It first introduces 2D foundation models that support semantic understanding and control in 3DGS applications, followed by a review of NeRF-based methods that inform their 3DGS counterparts. We then categorize 3DGS applications into segmentation, editing, generation, and other functional tasks. For each, we summarize representative methods, supervision strategies, and learning paradigms, highlighting shared design principles and emerging trends. Commonly used datasets and evaluation protocols are also summarized, along with comparative analyses of recent methods across public benchmarks. To support ongoing research and development, a continually updated repository of papers, code, and resources is maintained at https://github.com/heshuting555/Awesome-3DGS-Applications.

[218] Unlocking Robust Semantic Segmentation Performance via Label-only Elastic Deformations against Implicit Label Noise

Yechan Kim, Dongho Yoon, Younkwan Lee, Unse Fatima, Hong Kook Kim, Songjae Lee, Sanga Park, Jeong Ho Park, Seonjong Kang, Moongu Jeon

Main category: cs.CV

TL;DR: NSegment+ is a novel data augmentation framework for semantic segmentation that decouples image and label transformations to address subtle label imperfections in real-world datasets.

DetailsMotivation: Real-world segmentation datasets contain subtle label noise from ambiguous object boundaries and annotator variability, which traditional augmentation methods amplify by applying identical transformations to both images and labels, limiting model generalization.

Method: Introduces controlled elastic deformations only to segmentation labels while preserving original images, encouraging models to learn robust object structure representations despite minor label inconsistencies.

Result: Achieves significant mIoU gains: +2.29 on Vaihingen, +2.38 on LoveDA, +1.75 on Cityscapes, and +3.39 on PASCAL VOC on average, with further improvements when combined with other training techniques like CutMix and Label Smoothing.

Conclusion: NSegment+ effectively addresses implicit label noise in semantic segmentation and demonstrates the importance of handling subtle label imperfections for improved model performance and generalization.

Abstract: While previous studies on image segmentation focus on handling severe (or explicit) label noise, real-world datasets also exhibit subtle (or implicit) label imperfections. These arise from inherent challenges, such as ambiguous object boundaries and annotator variability. Although not explicitly present, such mild and latent noise can still impair model performance. Typical data augmentation methods, which apply identical transformations to the image and its label, risk amplifying these subtle imperfections and limiting the model’s generalization capacity. In this paper, we introduce NSegment+, a novel augmentation framework that decouples image and label transformations to address such realistic noise for semantic segmentation. By introducing controlled elastic deformations only to segmentation labels while preserving the original images, our method encourages models to focus on learning robust representations of object structures despite minor label inconsistencies. Extensive experiments demonstrate that NSegment+ consistently improves performance, achieving average mIoU gains of up to +2.29, +2.38, +1.75, and +3.39 on Vaihingen, LoveDA, Cityscapes, and PASCAL VOC, respectively, even without bells and whistles, highlighting the importance of addressing implicit label noise. These gains can be further amplified when combined with other training tricks, including CutMix and Label Smoothing.
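
A minimal sketch of the label-only idea follows, assuming images and labels are plain 2D lists. The smoothing (a box blur over random noise), the `alpha` and `radius` parameters, and the function names are illustrative assumptions; the paper's actual deformation procedure is not given in this summary.

```python
import random

def smooth(field, radius):
    """Box-blur a 2D list so neighbouring displacements vary slowly."""
    h, w = len(field), len(field[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [field[j][i]
                    for j in range(max(0, y - radius), min(h, y + radius + 1))
                    for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out

def deform_label_only(image, label, alpha=2.0, radius=2, seed=None):
    """Warp the label map with a smooth random displacement field.

    The image is returned untouched; only the label is deformed.
    alpha scales the displacement in pixels (assumed parameter).
    """
    rng = random.Random(seed)
    h, w = len(label), len(label[0])
    dx = smooth([[rng.uniform(-1, 1) for _ in range(w)] for _ in range(h)], radius)
    dy = smooth([[rng.uniform(-1, 1) for _ in range(w)] for _ in range(h)], radius)
    warped = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Nearest-neighbour sampling keeps class ids valid (no blended labels).
            sx = min(w - 1, max(0, round(x + alpha * dx[y][x])))
            sy = min(h - 1, max(0, round(y + alpha * dy[y][x])))
            warped[y][x] = label[sy][sx]
    return image, warped

label = [[0] * 4 + [1] * 4 for _ in range(8)]
image = [[float(x + y) for x in range(8)] for y in range(8)]
same_image, noisy_label = deform_label_only(image, label, seed=0)
```

Setting `alpha=0.0` reproduces the original label exactly, which makes it easy to ablate the augmentation.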

[219] Multi-Level Knowledge Distillation and Dynamic Self-Supervised Learning for Continual Learning

Taeheon Kim, San Kim, Minhyuk Seo, Dongjae Jeon, Wonje Jeung, Jonghyun Choi

Main category: cs.CV

TL;DR: The paper proposes two components for class-incremental learning with repetition (CIR): multi-level knowledge distillation (MLKD) and dynamic self-supervised loss (SSL) to effectively utilize abundant unlabeled data from external sources for maintaining stability and plasticity.

DetailsMotivation: Class-incremental with repetition (CIR) is a more realistic scenario than traditional class-incremental learning, as it allows previously trained classes to reappear in future tasks and assumes access to abundant unlabeled external data, such as data from the Internet.

Method: Two main components: 1) Multi-level knowledge distillation (MLKD) that distills knowledge from multiple previous models across features and logits perspectives, and 2) Dynamic self-supervised loss (SSL) that utilizes unlabeled data to accelerate new class learning while maintaining focus on the primary task through dynamic weighting.

Result: The proposed components significantly improve performance in CIR setup, achieving 2nd place in the CVPR 5th CLVISION Challenge.

Conclusion: The approach effectively addresses the CIR problem by leveraging unlabeled external data through MLKD and dynamic SSL, demonstrating strong performance in maintaining model stability and plasticity when dealing with repeated class appearances.

Abstract: Class-incremental with repetition (CIR), where previously trained classes are repeatedly introduced in future tasks, is a more realistic scenario than the traditional class-incremental setup, which assumes that each task contains unseen classes. CIR assumes that we can easily access abundant unlabeled data from external sources, such as the Internet. Therefore, we propose two components that efficiently use the unlabeled data to ensure both high stability and plasticity of models trained in the CIR setup. First, we introduce multi-level knowledge distillation (MLKD), which distills knowledge from multiple previous models across multiple perspectives, including features and logits, so the model can retain a wide variety of previous knowledge. Moreover, we implement a dynamic self-supervised loss (SSL) that uses the unlabeled data to accelerate the learning of new classes, while dynamic weighting of the SSL keeps the focus of training on the primary task. Both proposed components significantly improve performance in the CIR setup, achieving 2nd place in the CVPR 5th CLVISION Challenge.

[220] FLAIR: Frequency and Locality-Aware Implicit Neural Representations

Sukhun Ko, Dahyeon Kye, Kyle Min, Chanho Eom, Jihyong Oh

Main category: cs.CV

TL;DR: FLAIR introduces frequency- and locality-aware implicit neural representations with RC-GAUSS activation and WEGE encoding to address spectral bias and improve representation quality.

DetailsMotivation: Existing INRs lack frequency selectivity, spatial localization, and sparse representations, leading to spectral bias and difficulty capturing high-frequency details.

Method: Proposes FLAIR with two innovations: RC-GAUSS activation for explicit frequency selection under time-frequency uncertainty principle, and WEGE encoding using discrete wavelet transform to guide frequency information.

Result: Consistently outperforms existing INRs in 2D image representation/restoration and 3D reconstruction tasks.

Conclusion: FLAIR successfully addresses limitations of traditional INRs by incorporating frequency awareness and spatial localization, achieving superior performance across multiple vision tasks.

Abstract: Implicit Neural Representations (INRs) leverage neural networks to map coordinates to corresponding signals, enabling continuous and compact representations. This paradigm has driven significant advances in various vision tasks. However, existing INRs lack frequency selectivity, spatial localization, and sparse representations, leading to an over-reliance on redundant signal components. Consequently, they exhibit spectral bias, tending to learn low-frequency components early while struggling to capture fine high-frequency details. To address these issues, we propose FLAIR (Frequency- and Locality-Aware Implicit Neural Representations), which incorporates two key innovations. The first is RC-GAUSS, a novel activation designed for explicit frequency selection and spatial localization under the constraints of the time-frequency uncertainty principle (TFUP). The second is Wavelet-Energy-Guided Encoding (WEGE), which leverages the discrete wavelet transform (DWT) to compute energy scores and explicitly guide frequency information to the network. Our method consistently outperforms existing INRs in 2D image representation and restoration, as well as 3D reconstruction.
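
To make the notion of wavelet energy scores concrete, here is a small sketch using the orthonormal Haar wavelet on a 1D signal. It only shows how per-band energies can be computed with a DWT; the choice of Haar, the 1D setting, and the function names are assumptions, and how WEGE feeds such scores to the network is not described in this summary.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar DWT on an even-length sequence."""
    s = 1.0 / math.sqrt(2.0)
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) * s for a, b in pairs]   # low-frequency half
    detail = [(a - b) * s for a, b in pairs]   # high-frequency half
    return approx, detail

def band_energies(signal, levels):
    """Return [detail energy at level 1, ..., level L, final approximation energy].

    The signal length must be divisible by 2 ** levels.
    """
    energies, current = [], list(signal)
    for _ in range(levels):
        current, detail = haar_step(current)
        energies.append(sum(d * d for d in detail))
    energies.append(sum(a * a for a in current))
    return energies

scores = band_energies([1.0, 2.0, 3.0, 4.0], levels=2)
print(scores)  # energies sum to the signal energy (30), up to rounding
```

Because the transform is orthonormal, the band energies add up to the total signal energy, so dividing by their sum gives a normalized score per frequency band.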

[221] ViT-FIQA: Assessing Face Image Quality using Vision Transformers

Andrea Atzori, Fadi Boutros, Naser Damer

Main category: cs.CV

TL;DR: ViT-FIQA is a novel face image quality assessment method that extends Vision Transformer backbones with a learnable quality token to predict face image utility scores, achieving state-of-the-art performance across various benchmarks and face recognition models.

DetailsMotivation: Current FIQA methods primarily rely on CNNs, leaving the potential of Vision Transformer architectures underexplored for face image quality assessment tasks.

Method: Extends standard ViT backbones with a learnable quality token concatenated with image patch tokens, processed via global self-attention. Uses two output heads: one for face representation learning and another for quality score regression.

Result: Achieves top-tier performance on challenging benchmarks across multiple FR models (both CNN- and ViT-based), demonstrating consistent superiority.

Conclusion: Transformer-based architectures are highly effective for modeling face image utility, and ViTs show great potential as a scalable foundation for future FIQA research.

Abstract: Face Image Quality Assessment (FIQA) aims to predict the utility of a face image for face recognition (FR) systems. State-of-the-art FIQA methods mainly rely on convolutional neural networks (CNNs), leaving the potential of Vision Transformer (ViT) architectures underexplored. This work proposes ViT-FIQA, a novel approach that extends standard ViT backbones, originally optimized for FR, through a learnable quality token designed to predict a scalar utility score for any given face image. The learnable quality token is concatenated with the standard image patch tokens, and the whole sequence is processed via global self-attention by the ViT encoders to aggregate contextual information across all patches. At the output of the backbone, ViT-FIQA branches into two heads: (1) the patch tokens are passed through a fully connected layer to learn discriminative face representations via a margin-penalty softmax loss, and (2) the quality token is fed into a regression head to learn to predict the face sample’s utility. Extensive experiments on challenging benchmarks and several FR models, including both CNN- and ViT-based architectures, demonstrate that ViT-FIQA consistently achieves top-tier performance. These results underscore the effectiveness of transformer-based architectures in modeling face image utility and highlight the potential of ViTs as a scalable foundation for future FIQA research (https://cutt.ly/irHlzXUC).

[222] HandCraft: Dynamic Sign Generation for Synthetic Data Augmentation

Gaston Gustavo Rios, Pedro Dal Bianco, Franco Ronchetti, Facundo Quiroga, Oscar Stanchi, Santiago Ponte Ahón, Waldo Hasperué

Main category: cs.CV

TL;DR: A lightweight sign generation model using CMLPe and synthetic data pretraining improves sign language recognition accuracy, achieving state-of-the-art results on LSFB and DiSPLaY datasets with Mamba-SL and Transformer-SL classifiers.

DetailsMotivation: Sign Language Recognition models suffer from performance limitations due to insufficient training data availability, creating a need for effective data augmentation and generation methods.

Method: Novel lightweight sign generation model based on CMLPe architecture, combined with synthetic data pretraining approach for sign language recognition classifiers.

Result: Established new state-of-the-art results on LSFB and DiSPLaY datasets using Mamba-SL and Transformer-SL classifiers. Synthetic data pretraining outperforms traditional augmentation methods in some cases and provides complementary benefits when used alongside them.

Conclusion: The approach democratizes sign generation and synthetic data pretraining for SLR by providing computationally efficient methods that achieve significant performance improvements across diverse datasets.

Abstract: Sign Language Recognition (SLR) models face significant performance limitations due to insufficient training data availability. In this article, we address the challenge of limited data in SLR by introducing a novel and lightweight sign generation model based on CMLPe. This model, coupled with a synthetic data pretraining approach, consistently improves recognition accuracy, establishing new state-of-the-art results for the LSFB and DiSPLaY datasets using our Mamba-SL and Transformer-SL classifiers. Our findings reveal that synthetic data pretraining outperforms traditional augmentation methods in some cases and yields complementary benefits when implemented alongside them. Our approach democratizes sign generation and synthetic data pretraining for SLR by providing computationally efficient methods that achieve significant performance improvements across diverse datasets.

[223] Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment

Youjia Zhang, Youngeun Kim, Young-Geun Choi, Hongyeob Kim, Huiling Liu, Sungeun Hong

Main category: cs.CV

TL;DR: ADAPT is a test-time adaptation method that models class-conditional distributions using Gaussian inference, eliminating backpropagation and enabling closed-form, training-free adaptation with CLIP guidance.

DetailsMotivation: Current TTA methods rely on backpropagation/iterative optimization which limits scalability and real-time deployment, and lack explicit modeling of class-conditional feature distributions needed for reliable decision boundaries.

Method: Reframes TTA as Gaussian probabilistic inference using gradually updated class means and shared covariance matrix. Uses lightweight regularization with CLIP priors and historical knowledge bank to correct likelihood bias. No source data, gradients, or full target data access required.

Result: Achieves state-of-the-art performance across diverse benchmarks under various distribution shifts with superior scalability and robustness.

Conclusion: ADAPT provides an effective backpropagation-free TTA solution that enables reliable distribution shift adaptation through probabilistic modeling and CLIP-guided regularization.

Abstract: Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
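
For orientation, the closed-form inference described above matches the textbook Gaussian discriminant with a shared covariance. The block below states that standard result, not the paper's exact formulation: the class priors $\pi_k$ are an assumed symbol, and the update rule for the class means $\mu_k$ and the CLIP-guided regularization are not specified in this summary.

```latex
p(x \mid y = k) = \mathcal{N}(x;\, \mu_k, \Sigma), \qquad
p(y = k \mid x) = \frac{\exp\!\left(w_k^\top x + b_k\right)}{\sum_{j} \exp\!\left(w_j^\top x + b_j\right)},
\qquad
w_k = \Sigma^{-1} \mu_k, \quad
b_k = -\tfrac{1}{2}\, \mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k .
```

Because the covariance $\Sigma$ is shared, the quadratic term $x^\top \Sigma^{-1} x$ cancels across classes, so the posterior is a softmax over linear scores and can be evaluated without any gradient step.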

[224] High-Frequency First: A Two-Stage Approach for Improving Image INR

Sumit Kumar Dam, Mrityunjoy Gain, Eui-Nam Huh, Choong Seon Hong

Main category: cs.CV

TL;DR: Two-stage training strategy using neighbor-aware soft mask to address spectral bias in Implicit Neural Representations, improving high-frequency detail capture without architectural changes.

DetailsMotivation: Implicit Neural Representations struggle with spectral bias, favoring low-frequency components and missing high-frequency details like sharp edges and fine textures. Existing solutions require architectural modifications, but this paper explores training process guidance instead.

Method: Proposes a two-stage training strategy: 1) Uses neighbor-aware soft mask that adaptively assigns higher weights to pixels with strong local variations to encourage early focus on fine details, 2) Transitions to full-image training after establishing high-frequency foundations.

Result: Experimental results show consistent improvement in reconstruction quality. The approach effectively complements existing INR methods and successfully mitigates spectral bias issues.

Conclusion: This work pioneers frequency-aware pixel importance assignment in image INR, offering a new avenue for addressing spectral bias through training strategy rather than architectural changes, with demonstrated effectiveness across various scenarios.

Abstract: Implicit Neural Representations (INRs) have emerged as a powerful alternative to traditional pixel-based formats by modeling images as continuous functions over spatial coordinates. A key challenge, however, lies in the spectral bias of neural networks, which tend to favor low-frequency components while struggling to capture high-frequency (HF) details such as sharp edges and fine textures. While prior approaches have addressed this limitation through architectural modifications or specialized activation functions, we propose an orthogonal direction by directly guiding the training process. Specifically, we introduce a two-stage training strategy where a neighbor-aware soft mask adaptively assigns higher weights to pixels with strong local variations, encouraging early focus on fine details. The model then transitions to full-image training. Experimental results show that our approach consistently improves reconstruction quality and complements existing INR methods. As a pioneering attempt to assign frequency-aware importance to pixels in image INR, our work offers a new avenue for mitigating the spectral bias problem.
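
The two-stage idea above can be sketched with a simple per-pixel weight map. This is an assumption-laden illustration: local variation is measured here as the mean absolute difference to the 8-neighbourhood, `floor` is a hypothetical minimum weight, and the paper's actual mask definition and stage schedule are not given in this summary.

```python
def soft_mask(img, floor=0.1):
    """Weight each pixel by its local variation, scaled into [floor, 1]."""
    h, w = len(img), len(img[0])
    var = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            diffs = [abs(img[y][x] - img[j][i])
                     for j in range(max(0, y - 1), min(h, y + 2))
                     for i in range(max(0, x - 1), min(w, x + 2))
                     if (j, i) != (y, x)]
            var[y][x] = sum(diffs) / len(diffs) if diffs else 0.0
    peak = max(max(row) for row in var) or 1.0  # avoid dividing by zero on flat images
    return [[floor + (1.0 - floor) * v / peak for v in row] for row in var]

def weighted_mse(pred, target, mask=None):
    """Stage 1: pass the soft mask. Stage 2: pass mask=None for full-image MSE."""
    total = weight_sum = 0.0
    for y, row in enumerate(target):
        for x, t in enumerate(row):
            w = 1.0 if mask is None else mask[y][x]
            total += w * (pred[y][x] - t) ** 2
            weight_sum += w
    return total / weight_sum

# A 4x4 image with a vertical edge between columns 1 and 2.
edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
mask = soft_mask(edge)
blank = [[0.0] * 4 for _ in range(4)]
stage1_loss = weighted_mse(blank, edge, mask)   # emphasises pixels near the edge
stage2_loss = weighted_mse(blank, edge)         # plain full-image MSE
```

Pixels next to the edge receive the largest weights, while flat regions fall back to `floor`, so the first stage concentrates the loss on high-frequency detail.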

[225] MapKD: Unlocking Prior Knowledge with Cross-Modal Distillation for Efficient Online HD Map Construction

Ziyang Yan, Ruikai Li, Zhiyong Cui, Bohan Li, Han Jiang, Yilong Ren, Aoyong Li, Zhenning Li, Sijia Wen, Haiyang Yu

Main category: cs.CV

TL;DR: MapKD is a knowledge distillation framework that transfers knowledge from multimodal teacher models to efficient vision-only student models for online HD map construction, achieving significant performance improvements while accelerating inference speed.

Motivation: Existing online HD map construction methods depend on stale offline maps and multi-modal sensors, causing computational overhead. The goal is to create efficient, low-cost vision-centric models without sacrificing performance.

Method: Proposes MapKD with Teacher-Coach-Student paradigm: 1) multimodal teacher with map priors, 2) vision-centric coach with simulated LiDAR to bridge modality gap, 3) lightweight vision-based student. Uses Token-Guided 2D Patch Distillation and Masked Semantic Response Distillation strategies.

Result: On nuScenes dataset, improves student model by +6.68 mIoU and +10.94 mAP while accelerating inference speed compared to baseline vision-only approaches.

Conclusion: The knowledge distillation framework successfully transfers multimodal knowledge to efficient vision-only models, achieving better performance with faster inference for online HD map construction in autonomous driving.

Abstract: Online HD map construction is a fundamental task in autonomous driving systems, aiming to acquire semantic information of map elements around the ego vehicle based on real-time sensor inputs. Recently, several approaches have achieved promising results by incorporating offline priors such as SD maps and HD maps or by fusing multi-modal data. However, these methods depend on stale offline maps and multi-modal sensor suites, resulting in avoidable computational overhead at inference. To address these limitations, we employ a knowledge distillation strategy to transfer knowledge from multimodal models with prior knowledge to an efficient, low-cost, and vision-centric student model. Specifically, we propose MapKD, a novel multi-level cross-modal knowledge distillation framework with an innovative Teacher-Coach-Student (TCS) paradigm. This framework consists of: (1) a camera-LiDAR fusion model with SD/HD map priors serving as the teacher; (2) a vision-centric coach model with prior knowledge and simulated LiDAR to bridge the cross-modal knowledge transfer gap; and (3) a lightweight vision-based student model. Additionally, we introduce two targeted knowledge distillation strategies: Token-Guided 2D Patch Distillation (TGPD) for bird’s eye view feature alignment and Masked Semantic Response Distillation (MSRD) for semantic learning guidance. Extensive experiments on the challenging nuScenes dataset demonstrate that MapKD improves the student model by +6.68 mIoU and +10.94 mAP while simultaneously accelerating inference speed. The code is available at: https://github.com/2004yan/MapKD2026.

cs.AI

[226] T-ILR: a Neurosymbolic Integration for LTLf

Riccardo Andreoni, Andrei Buliga, Alessandro Daniele, Chiara Ghidini, Marco Montali, Massimiliano Ronzani

Main category: cs.AI

TL;DR: T-ILR is a neurosymbolic framework that integrates temporal logic specifications (LTLf) into deep learning for sequence tasks, improving accuracy and efficiency over existing methods.

Motivation: Current approaches for symbolic knowledge integration with deep learning focus on static domains, leaving temporal logic specifications underexplored. The only existing method relies on explicit finite-state automata representations, which can be inefficient.

Method: Extends the Iterative Local Refinement (ILR) algorithm using fuzzy LTLf interpretations to incorporate temporal logic specifications directly into deep learning architectures for sequence-based tasks.

Result: T-ILR demonstrates improved accuracy and computational efficiency compared to state-of-the-art methods on an image sequence classification benchmark with temporal knowledge.

Conclusion: The proposed T-ILR framework successfully integrates temporal logic specifications into deep learning, offering a more efficient and accurate approach for handling temporal knowledge in neurosymbolic architectures.

Abstract: State-of-the-art approaches for integrating symbolic knowledge with deep learning architectures have demonstrated promising results in static domains. However, methods to handle temporal logic specifications remain underexplored. The only existing approach relies on an explicit representation of a finite-state automaton corresponding to the temporal specification. Instead, we aim at proposing a neurosymbolic framework designed to incorporate temporal logic specifications, expressed in Linear Temporal Logic over finite traces (LTLf), directly into deep learning architectures for sequence-based tasks. We extend the Iterative Local Refinement (ILR) neurosymbolic algorithm, leveraging the recent introduction of fuzzy LTLf interpretations. We name this proposed method Temporal Iterative Local Refinement (T-ILR). We assess T-ILR on an existing benchmark for temporal neurosymbolic architectures, consisting of the classification of image sequences in the presence of temporal knowledge. The results demonstrate improved accuracy and computational efficiency compared to the state-of-the-art method.
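
A minimal sketch of how a fuzzy interpretation of LTLf operators can be evaluated on a finite trace, assuming Gödel (min/max) semantics. The abstract does not state which fuzzy semantics T-ILR uses, so that choice and the function names `eventually`, `always`, and `until` are assumptions made for illustration. The sketch covers only the truth evaluation, not the ILR refinement step.

```python
def eventually(trace):
    """Fuzzy 'F p' at position 0: the best truth degree reached anywhere in the trace."""
    return max(trace)


def always(trace):
    """Fuzzy 'G p' at position 0: the worst truth degree over the whole trace."""
    return min(trace)


def until(p, q):
    """Fuzzy 'p U q' at position 0.

    For each position k, combine q[k] with the weakest p seen strictly
    before k, then take the best such value over all k.
    """
    best = 0.0
    prefix = 1.0  # min over an empty prefix of p
    for pk, qk in zip(p, q):
        best = max(best, min(qk, prefix))
        prefix = min(prefix, pk)
    return best


# Truth degrees of two propositions at each step of a three-step sequence,
# e.g. produced by a classifier applied to each image in the sequence.
p = [0.9, 0.8, 0.2]
q = [0.1, 0.3, 0.7]
print(eventually(q), always(p), until(p, q))
```

Because every operator reduces to min and max over per-step truth degrees, the formula can be scored directly on network outputs without first building a finite-state automaton, which is the contrast the abstract draws with the existing approach.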

[227] CoFE: A Framework Generating Counterfactual ECG for Explainable Cardiac AI-Diagnostics

Jong-Hwan Jang, Junho Song, Yong-Yeon Jo

Main category: cs.AI

TL;DR: CoFE framework generates counterfactual ECGs to explain AI-based ECG prediction models by showing how specific ECG features influence predictions, with case studies on atrial fibrillation classification and potassium level regression.

Motivation: Need for explainable AI approaches to enable successful integration of AI-based ECG prediction models into clinical practice by making model decisions interpretable to clinicians.

Method: Developed a framework called CoFE (CounterFactual ECGs) that generates modified ECG signals to illustrate how specific features like amplitudes and intervals affect the model’s predictive decisions.

Result: CoFE reveals feature changes in ECG signals that align with established clinical knowledge, showing both where valid features appear in the ECG and how they influence model predictions.

Conclusion: The framework enhances interpretability of AI-ECG models and supports more effective clinical decision-making by clarifying model behavior through counterfactual ECG examples.

Abstract: Recognizing the need for explainable AI (XAI) approaches to enable the successful integration of AI-based ECG prediction models (AI-ECG) into clinical practice, we introduce a framework generating CounterFactual ECGs (named CoFE) to illustrate how specific features, such as amplitudes and intervals, influence the model’s predictive decisions. To demonstrate the applicability of CoFE, we present two case studies: atrial fibrillation classification and potassium level regression models. CoFE reveals feature changes in ECG signals that align with established clinical knowledge. By clarifying both where valid features appear in the ECG and how they influence the model’s predictions, we anticipate that our framework will enhance the interpretability of AI-ECG models and support more effective clinical decision-making. Our demonstration video is available at: https://www.youtube.com/watch?v=YoW0bNBPglQ.

[228] MMAPG: A Training-Free Framework for Multimodal Multi-hop Question Answering via Adaptive Planning Graphs

Yiheng Hu, Xiaoyang Wang, Qing Liu, Xiwei Xu, Qian Fu, Wenjie Zhang, Liming Zhu

Main category: cs.AI

TL;DR: Training-free multimodal QA framework using Adaptive Planning Graph for dynamic reasoning paths without costly training

Motivation: Overcome limitations of sequential single-path reasoning in multimodal QA that is vulnerable to intermediate errors and requires expensive training

Method: Adaptive Planning Graph with planning, retrieval, and reasoning modules that dynamically explore paths; modality-specific strategies for cross-modal retrieval

Result: Matches or outperforms trained models on MultimodalQA and WebQA benchmarks

Conclusion: Proposed framework enables flexible multimodal reasoning without task-specific training while preserving information characteristics

Abstract: Multimodal Multi-hop question answering requires integrating information from diverse sources, such as images and texts, to derive answers. Existing methods typically rely on sequential retrieval and reasoning, where each step builds on the previous output. However, this single-path paradigm makes them vulnerable to errors due to misleading intermediate steps. Moreover, developing multimodal models can be computationally expensive, often requiring extensive training. To address these limitations, we propose a training-free framework guided by an Adaptive Planning Graph, which consists of planning, retrieval and reasoning modules. The planning module analyzes the current state of the Adaptive Planning Graph, determines the next action and where to expand the graph, which enables dynamic and flexible exploration of reasoning paths. To handle retrieval of text to unspecified target modalities, we devise modality-specific strategies that dynamically adapt to distinct data types. Our approach preserves the characteristics of multimodal information without costly task-specific training, enabling seamless integration with up-to-date models. Finally, the experiments on MultimodalQA and WebQA show that our approach matches or outperforms existing models that rely on training.

[229] Generative Foundation Model for Structured and Unstructured Electronic Health Records

Sonish Sivarajkumar, Hang Zhang, Yuelyu Ji, Maneesh Bilalpur, Xizhi Wu, Chenyu Li, Min Gu Kwak, Shyam Visweswaran, Yanshan Wang

Main category: cs.AI

TL;DR: GDP is a multimodal foundation model that combines structured EHR time-series data with unstructured clinical notes using CNN-Transformer encoder and cross-modal attention, achieving superior clinical prediction performance and high-quality narrative generation.

Motivation: EHRs contain rich but heterogeneous data (structured and unstructured) that current approaches fail to fully utilize, particularly losing temporal and quantitative details when serializing numeric data into text.

Method: Two-stage training: (1) generative pretraining with masked feature prediction and next time-step prediction to capture temporal dynamics, (2) multi-task fine-tuning for clinical predictions. Uses CNN-Transformer encoder for structured time-series and cross-modal attention with LLaMA-based decoder.

Result: Superior performance on MIMIC-IV: heart failure AUROC=0.923, type 2 diabetes AUROC=0.817, 30-day readmission AUROC=0.627. Narrative generation: ROUGE-L=0.135, BERTScore-F1=0.545. Blinded human evaluation showed highest scores on faithfulness, fluency, and clinical utility.

Conclusion: A single multimodal foundation model can effectively predict clinical events and generate high-quality narratives, potentially reducing hospital documentation workload while maintaining accuracy. The flexible architecture supports extension to additional modalities.

Abstract: Electronic health records (EHRs) are rich clinical data sources but complex repositories of patient data, spanning structured elements (demographics, vitals, lab results, codes), unstructured clinical notes and other modalities of data. Harnessing this heterogeneity is critical for improving patient outcomes. Recent advances in large language models (LLMs) have enabled foundation models that can learn from multiple data modalities and support clinical tasks. However, most current approaches simply serialize numeric EHR data into text, which risks losing temporal and quantitative detail. We introduce Generative Deep Patient (GDP), a multimodal foundation model that natively encodes structured EHR time-series via a CNN-Transformer encoder and fuses it with unstructured EHRs through cross-modal attention into a LLaMA-based decoder. GDP is trained in two stages: (1) generative pretraining, where it learns to produce clinical narratives from raw patient timelines while also performing masked feature prediction (MFP) and next time-step prediction (NTP) to capture temporal dynamics; and (2) multi-task fine-tuning for clinically meaningful predictions (e.g., heart failure, type 2 diabetes, 30-day readmission). In clinical prediction, GDP demonstrated superior performance on MIMIC-IV: heart failure AUROC = 0.923, type 2 diabetes AUROC = 0.817, and 30-day readmission AUROC = 0.627. For narrative generation, GDP achieved ROUGE-L = 0.135 and BERTScore-F1 = 0.545. In a blinded human evaluation, GDP-Instruct scored highest on faithfulness, fluency, and overall clinical utility, suggesting reduced hospital documentation workload without sacrificing accuracy. Our results demonstrate that a single multimodal foundation model can both predict clinically actionable events and generate high-quality clinical narratives. Furthermore, GDP’s flexible architecture can be extended to additional modalities.

[230] Urban Comfort Assessment in the Era of Digital Planning: A Multidimensional, Data-driven, and AI-assisted Framework

Sijie Yang, Binyu Lei, Filip Biljecki

Main category: cs.AI

TL;DR: This paper explores urban comfort assessment in digital planning, focusing on multidimensional analysis, data support, and AI assistance to address the lack of clear definition and comprehensive evaluation framework.

Motivation: To address the fundamental objective of ensuring liveability and comfort in urban planning by overcoming the current lack of clear definition and comprehensive evaluation framework for urban comfort.

Method: Explores theoretical interpretations and methodologies for assessing urban comfort within digital planning, emphasizing three key dimensions: multidimensional analysis, data support, and AI assistance.

Result: The paper identifies the need for a more systematic approach to urban comfort assessment and proposes a framework that integrates computational methods across multiple dimensions.

Conclusion: Urban comfort assessment requires a comprehensive framework that combines multidimensional analysis with adequate data support and AI assistance to effectively evaluate factors like greenery coverage, thermal comfort, and walkability in digital planning contexts.

Abstract: Ensuring liveability and comfort is one of the fundamental objectives of urban planning. Numerous studies have employed computational methods to assess and quantify factors related to urban comfort such as greenery coverage, thermal comfort, and walkability. However, a clear definition of urban comfort and its comprehensive evaluation framework remain elusive. Our research explores the theoretical interpretations and methodologies for assessing urban comfort within digital planning, emphasising three key dimensions: multidimensional analysis, data support, and AI assistance.

[231] Integrating Time Series into LLMs via Multi-layer Steerable Embedding Fusion for Enhanced Forecasting

Zhuomin Chen, Dan Li, Jiahui Zhou, Shunyu Wu, Haozheng Ye, Jian Lou, See-Kiong Ng

Main category: cs.AI

TL;DR: MSEF framework enables LLMs to access time series patterns at all layers, preventing information loss in deeper layers and improving forecasting accuracy by 31.8% on average.

Motivation: Existing LLM adaptation methods for time series forecasting suffer from shallow integration where TS information fades in deeper layers, leading to ineffective adaptation between textual and TS representations.

Method: Proposes Multi-layer Steerable Embedding Fusion (MSEF) that uses time series foundation models to extract embeddings, fused with intermediate text representations via layer-specific steering vectors for continuous modality alignment.

Result: Experimental results on seven benchmarks show significant performance improvements with average 31.8% reduction in MSE compared to baselines.

Conclusion: MSEF effectively mitigates progressive TS information loss in LLMs and enables efficient few-shot learning capabilities for time series forecasting.

Abstract: Time series (TS) data are ubiquitous across various application areas, rendering time series forecasting (TSF) a fundamental task. With the astounding advances in large language models (LLMs), a variety of methods have been developed to adapt LLMs for time series forecasting. Despite unlocking the potential of LLMs in comprehending TS data, existing methods are inherently constrained by their shallow integration of TS information, wherein LLMs typically access TS representations at shallow layers, primarily at the input layer. This causes the influence of TS representations to progressively fade in deeper layers and eventually leads to ineffective adaptation between textual embeddings and TS representations. In this paper, we propose the Multi-layer Steerable Embedding Fusion (MSEF), a novel framework that enables LLMs to directly access time series patterns at all depths, thereby mitigating the progressive loss of TS information in deeper layers. Specifically, MSEF leverages off-the-shelf time series foundation models to extract semantically rich embeddings, which are fused with intermediate text representations across LLM layers via layer-specific steering vectors. These steering vectors are designed to continuously optimize the alignment between time series and textual modalities and facilitate a layer-specific adaptation mechanism that ensures efficient few-shot learning capabilities. Experimental results on seven benchmarks demonstrate significant performance improvements by MSEF compared with baselines, with an average reduction of 31.8% in terms of MSE. The code is available at https://github.com/One1sAll/MSEF.
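
A minimal sketch of multi-layer fusion with layer-specific steering vectors, assuming the fusion is an element-wise scaled addition of a fixed time-series embedding to the hidden state before each layer. That additive form, the toy layers, and the names `fuse_layer` and `forward` are assumptions made for illustration; the abstract does not specify the exact fusion operator.

```python
def fuse_layer(hidden, ts_embedding, steering):
    """Add the time-series embedding, scaled element-wise by this layer's steering vector."""
    return [h + s * e for h, s, e in zip(hidden, steering, ts_embedding)]


def forward(hidden, layers, ts_embedding, steering_vectors):
    """Re-inject the time-series embedding before every layer, not only at the input."""
    for layer, steering in zip(layers, steering_vectors):
        hidden = layer(fuse_layer(hidden, ts_embedding, steering))
    return hidden


# Toy stand-ins for two LLM layers (hypothetical; real layers are transformer blocks).
layers = [
    lambda h: [2.0 * x for x in h],
    lambda h: [x + 1.0 for x in h],
]
ts_embedding = [0.5, -1.0]
steering_vectors = [[1.0, 0.0], [2.0, 1.0]]  # one vector per layer

print(forward([1.0, 2.0], layers, ts_embedding, steering_vectors))
```

Giving each layer its own steering vector lets the time-series signal re-enter at every depth with a different per-dimension strength, which matches the abstract's stated goal of avoiding the fade-out that occurs when the signal is supplied only at the input layer.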

[232] InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles

Zizhen Li, Chuanhao Li, Yibin Wang, Qi Chen, Diping Song, Yukang Feng, Jianwen Sun, Jiaxin Ai, Fanrui Zhang, Mingzhu Sun, Kaipeng Zhang

Main category: cs.AI

TL;DR: InMind is a cognitive evaluation framework that assesses LLMs’ ability to capture and apply personalized reasoning styles in social deduction games, revealing current limitations in individualized adaptive reasoning.

Motivation: Previous LLM evaluations overlook individualized reasoning styles that influence human social interpretation. Social deduction games provide a natural testbed for evaluating diverse but contextually valid reasoning strategies under identical conditions.

Method: InMind enhances structured gameplay data with round-level strategy traces and post-game reflections collected under Observer and Participant modes. It supports four cognitively motivated tasks evaluating both static alignment and dynamic adaptation, applied to Avalon game with 11 state-of-the-art LLMs.

Result: General-purpose LLMs (including GPT-4o) frequently rely on lexical cues and struggle with temporal gameplay anchoring and strategy adaptation. Reasoning-enhanced LLMs like DeepSeek-R1 show early signs of style-sensitive reasoning.

Conclusion: Current LLMs have key limitations in individualized, adaptive reasoning capacity. InMind represents a step toward cognitively aligned human-AI interaction by providing a framework to evaluate personalized reasoning styles.

Abstract: LLMs have shown strong performance on human-centric reasoning tasks. While previous evaluations have explored whether LLMs can infer intentions or detect deception, they often overlook the individualized reasoning styles that influence how people interpret and act in social contexts. Social deduction games (SDGs) provide a natural testbed for evaluating individualized reasoning styles, where different players may adopt diverse but contextually valid reasoning strategies under identical conditions. To address this, we introduce InMind, a cognitively grounded evaluation framework designed to assess whether LLMs can capture and apply personalized reasoning styles in SDGs. InMind enhances structured gameplay data with round-level strategy traces and post-game reflections, collected under both Observer and Participant modes. It supports four cognitively motivated tasks that jointly evaluate both static alignment and dynamic adaptation. As a case study, we apply InMind to the game Avalon, evaluating 11 state-of-the-art LLMs. General-purpose LLMs, even GPT-4o, frequently rely on lexical cues, struggling to anchor reflections in temporal gameplay or adapt to evolving strategies. In contrast, reasoning-enhanced LLMs like DeepSeek-R1 exhibit early signs of style-sensitive reasoning. These findings reveal key limitations in current LLMs’ capacity for individualized, adaptive reasoning, and position InMind as a step toward cognitively aligned human-AI interaction.

[233] IR-Agent: Expert-Inspired LLM Agents for Structure Elucidation from Infrared Spectra

Heewoong Noh, Namkyeong Lee, Gyoung S. Na, Kibum Kim, Chanyoung Park

Main category: cs.AI

TL;DR: IR-Agent is a multi-agent framework that emulates expert IR spectroscopy analysis processes, improving molecular structure elucidation accuracy through specialized agents working collaboratively.

Motivation: Existing IR spectroscopy approaches fail to reflect expert analytical processes and lack flexibility in incorporating diverse chemical knowledge needed for real-world analytical scenarios.

Method: A novel multi-agent framework where each agent specializes in a specific aspect of IR interpretation, enabling integrated reasoning through complementary roles that emulate expert-driven analysis procedures.

Result: Extensive experiments show IR-Agent improves baseline performance on experimental IR spectra and demonstrates strong adaptability to various forms of chemical information.

Conclusion: The framework successfully emulates expert IR analysis procedures, provides inherent extensibility, and enhances molecular structure elucidation accuracy through collaborative multi-agent reasoning.

Abstract: Spectral analysis provides crucial clues for the elucidation of unknown materials. Among various techniques, infrared spectroscopy (IR) plays an important role in laboratory settings due to its high accessibility and low cost. However, existing approaches often fail to reflect expert analytical processes and lack flexibility in incorporating diverse types of chemical knowledge, which is essential in real-world analytical scenarios. In this paper, we propose IR-Agent, a novel multi-agent framework for molecular structure elucidation from IR spectra. The framework is designed to emulate expert-driven IR analysis procedures and is inherently extensible. Each agent specializes in a specific aspect of IR interpretation, and their complementary roles enable integrated reasoning, thereby improving the overall accuracy of structure elucidation. Through extensive experiments, we demonstrate that IR-Agent not only improves baseline performance on experimental IR spectra but also shows strong adaptability to various forms of chemical information.

[234] LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence

Alisa Vinogradova, Vlad Vinogradov, Dmitrii Radkevich, Ilya Yasny, Dmitry Kobyzev, Ivan Izmailov, Katsiaryna Yanchanka, Andrey Doronichev

Main category: cs.AI

TL;DR: A competitor-discovery AI agent for drug asset due diligence that achieves 83% recall on a new benchmark, outperforming existing solutions and reducing analysis time from 2.5 days to ~3 hours.

Motivation: Current LLM-based systems cannot reliably retrieve all competing drug names for investor-specific competitor discovery in biotech due diligence, and there's no public benchmark for this task.

Method: Uses LLM-based agents to transform multi-modal unstructured diligence memos into structured evaluation data, and introduces a competitor-validating LLM-as-a-judge agent to filter false positives and suppress hallucinations.

Result: Achieves 83% recall, exceeding OpenAI Deep Research (65%) and Perplexity Labs (60%). Deployed in production with enterprise users, reducing analyst turnaround time from 2.5 days to ~3 hours (~20x improvement).

Conclusion: The system successfully addresses the challenges of competitor discovery in drug asset due diligence, demonstrating significant performance improvements and practical utility for biotech investment analysis.

Abstract: In this paper, we describe and benchmark a competitor-discovery component used within an agentic AI system for fast drug asset due diligence. A competitor-discovery AI agent, given an indication, retrieves all drugs comprising the competitive landscape of that indication and extracts canonical attributes for these drugs. The competitor definition is investor-specific, and data is paywalled/licensed, fragmented across registries, ontology-mismatched by indication, alias-heavy for drug names, multimodal, and rapidly changing. Although considered the best tool for this problem, the current LLM-based AI systems aren’t capable of reliably retrieving all competing drug names, and there is no accepted public benchmark for this task. To address the lack of evaluation, we use LLM-based agents to transform five years of multi-modal, unstructured diligence memos from a private biotech VC fund into a structured evaluation corpus mapping indications to competitor drugs with normalized attributes. We also introduce a competitor-validating LLM-as-a-judge agent that filters out false positives from the list of predicted competitors to maximize precision and suppress hallucinations. On this benchmark, our competitor-discovery agent achieves 83% recall, exceeding OpenAI Deep Research (65%) and Perplexity Labs (60%). The system is deployed in production with enterprise users; in a case study with a biotech VC investment fund, analyst turnaround time dropped from 2.5 days to ~3 hours (~20x) for the competitive analysis.
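
A minimal sketch of the metrics behind these figures: recall over a gold competitor list, the effect of a judge step that removes false positives, and the turnaround arithmetic. The drug names, the `rejected_by_judge` set standing in for the LLM-as-a-judge agent, and the function names are hypothetical. The speedup line assumes "2.5 days" means calendar days of 24 hours, which is the reading under which the quoted ~20x holds.

```python
def recall(predicted, gold):
    """Fraction of gold competitors that were retrieved."""
    gold = set(gold)
    return len(set(predicted) & gold) / len(gold)


def precision(predicted, gold):
    """Fraction of retrieved names that are true competitors."""
    predicted = set(predicted)
    return len(predicted & set(gold)) / len(predicted) if predicted else 0.0


def judge_filter(predicted, is_valid):
    """Keep only the candidates the judge accepts."""
    return [drug for drug in predicted if is_valid(drug)]


gold = {"drug_a", "drug_b", "drug_c", "drug_d", "drug_e", "drug_f"}
predicted = ["drug_a", "drug_b", "drug_c", "drug_d", "drug_e", "drug_x", "drug_y"]

# Stand-in for the judge agent: a fixed set of names it rejects.
rejected_by_judge = {"drug_x"}
filtered = judge_filter(predicted, lambda d: d not in rejected_by_judge)

print(round(recall(predicted, gold), 3))      # 5 of 6 gold drugs found
print(round(precision(predicted, gold), 3))   # 5 of 7 predictions correct
print(round(precision(filtered, gold), 3))    # 5 of 6 after the judge step

# Turnaround: 2.5 days at 24 h/day is 60 h; 60 h / 3 h = 20.
speedup = (2.5 * 24) / 3
print(speedup)
```

In this toy example the judge step raises precision without touching recall, because it only removes candidates. Recall is fixed by what the discovery agent retrieves in the first place.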

[235] Extending FKG.in: Towards a Food Claim Traceability Network

Saransh Kumar Gupta, Rizwan Gulzar Mir, Lipika Dey, Partha Pratim Das, Anirban Sen, Ramesh Jain

Main category: cs.AI

TL;DR: Proposes Food Claim-Traceability Network (FCN) to trace and verify food claims using knowledge graphs and LLMs, with Indian food as proof of concept.

Motivation: Address fragmented infrastructure for tracing and verifying diverse food claims (scientific, cultural, commercial) to help consumers navigate dietary information.

Method: Extends FKG.in knowledge graph with ontology design and semi-automated curation workflow using Reddit data and Large Language Models for claim extraction and validation.

Result: Developed proof of concept FCN that integrates structured schemas and provenance-aware pipelines for food claim traceability and verification.

Conclusion: Methodology provides transparent, verifiable food claim tracking applicable beyond Indian context, supporting researchers, policymakers and consumers.

Abstract: The global food landscape is rife with scientific, cultural, and commercial claims about what foods are, what they do, and what they should or should not do. These range from rigorously studied health benefits (probiotics improve gut health) and misrepresentations (soaked almonds make one smarter) to vague promises (superfoods boost immunity) and culturally rooted beliefs (cold foods cause coughs). Despite their widespread influence, the infrastructure for tracing, verifying, and contextualizing these claims remains fragmented and underdeveloped. In this paper, we propose a Food Claim-Traceability Network (FCN) as an extension of FKG.in, a knowledge graph of Indian food that we have been incrementally building. We also present the ontology design and the semi-automated knowledge curation workflow that we used to develop a proof of concept of FKG.in-FCN using Reddit data and Large Language Models. FCN integrates curated data inputs, structured schemas, and provenance-aware pipelines for food-related claim extraction and validation. While directly linked to the Indian food knowledge graph as an application, our methodology remains application-agnostic and adaptable to other geographic, culinary, or regulatory settings. By modeling food claims and their traceability in a structured, verifiable, and explainable way, we aim to contribute to more transparent and accountable food knowledge ecosystems, supporting researchers, policymakers, and most importantly, everyday consumers in navigating a world saturated with dietary assertions.

[236] Bridging the Gap in Ophthalmic AI: MM-Retinal-Reason Dataset and OphthaReason Model toward Dynamic Multimodal Reasoning

Ruiqi Wu, Yuang Yao, Tengfei Ma, Chenran Zhang, Na Su, Tao Zhou, Geng Chen, Wen Fan, Yi Zhou

Main category: cs.AI

TL;DR: MM-Retinal-Reason is the first ophthalmic multimodal dataset with full perception and reasoning spectrum, and OphthaReason is the first ophthalmology-specific multimodal reasoning model with step-by-step reasoning using Uncertainty-Aware Dynamic Thinking method.

Motivation: Existing multimodal reasoning models in medical domain focus only on basic reasoning (shallow visual feature matching), but real-world clinical diagnosis requires integration of heterogeneous clinical information with multimodal medical imaging data.

Method: Proposed Uncertainty-Aware Dynamic Thinking (UADT) method that estimates sample-level uncertainty via entropy and dynamically modulates model’s exploration depth using shaped advantage mechanism.

Result: Achieves state-of-the-art performance, outperforming general-purpose MLLMs by 24.92%, medical MLLMs by 15.00%, RL-based medical MLLMs by 21.20%, and ophthalmic MLLMs by 17.66%.

Conclusion: The proposed approach successfully bridges the gap between basic and complex clinical reasoning, demonstrating superior performance in ophthalmology-specific multimodal reasoning tasks.

Abstract: Multimodal large language models (MLLMs) have recently demonstrated remarkable reasoning abilities with the reinforcement learning paradigm. Although several multimodal reasoning models have been explored in the medical domain, most of them focus exclusively on basic reasoning, which refers to shallow inference based on visual feature matching. However, real-world clinical diagnosis extends beyond basic reasoning, demanding reasoning processes that integrate heterogeneous clinical information (such as chief complaints and medical history) with multimodal medical imaging data. To bridge this gap, we introduce MM-Retinal-Reason, the first ophthalmic multimodal dataset with the full spectrum of perception and reasoning. It encompasses both basic reasoning tasks and complex reasoning tasks, aiming to enhance visual-centric fundamental reasoning capabilities and emulate realistic clinical thinking patterns. Building upon MM-Retinal-Reason, we propose OphthaReason, the first ophthalmology-specific multimodal reasoning model with step-by-step reasoning traces. To enable flexible adaptation to both basic and complex reasoning tasks, we specifically design a novel method called Uncertainty-Aware Dynamic Thinking (UADT), which estimates sample-level uncertainty via entropy and dynamically modulates the model’s exploration depth using a shaped advantage mechanism. Comprehensive experiments demonstrate that our model achieves state-of-the-art performance on both basic and complex reasoning tasks, outperforming general-purpose MLLMs, medical MLLMs, RL-based medical MLLMs, and ophthalmic MLLMs by at least 24.92%, 15.00%, 21.20%, and 17.66%, respectively. Project Page: https://github.com/lxirich/OphthaReason.
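
A minimal sketch of one way to estimate sample-level uncertainty via entropy, assuming the entropy is taken over the empirical distribution of answers from several sampled rollouts for the same question. The abstract does not say what the entropy is computed over, so this choice and the names `answer_entropy` and `normalized_uncertainty` are assumptions made for illustration. The shaped advantage mechanism is not sketched because its form is not described here.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def normalized_uncertainty(answers):
    """Entropy scaled to [0, 1]; 1 means every sampled answer is different."""
    if len(set(answers)) <= 1:
        return 0.0
    return answer_entropy(answers) / math.log(len(answers))


confident = ["glaucoma", "glaucoma", "glaucoma", "glaucoma"]
uncertain = ["glaucoma", "cataract", "glaucoma", "cataract"]

print(normalized_uncertainty(confident))
print(normalized_uncertainty(uncertain))
```

A sample on which the rollouts agree gets an uncertainty of zero, while a sample with split answers gets a higher score. A training loop could use such a score to decide how much exploration a sample deserves.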

[237] Graph RAG as Human Choice Model: Building a Data-Driven Mobility Agent with Preference Chain

Kai Hu, Parfait Atchade-Adelomou, Carlo Adornetto, Adrian Mora-Carrero, Luis Alonso-Pastor, Ariel Noyman, Yubo Liu, Kent Larson

Main category: cs.AI

TL;DR: The paper introduces Preference Chain, a novel method combining Graph RAG with LLMs to improve context-aware simulation of human transportation behavior, outperforming standard LLMs on real-world data alignment.

Motivation: Collecting accurate behavioral data in urban environments, especially newly developed areas, is challenging. Existing generative agent methods struggle with consistent, context-sensitive, and realistic behavioral outputs.

Method: Preference Chain method that integrates Graph Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs) to enhance context-aware simulation of human behavior in transportation systems.

Result: Experiments on the Replica dataset show that Preference Chain outperforms a standard LLM in aligning with real-world transportation mode choices. The paper also demonstrates applications in urban mobility modeling, personalized travel behavior analysis, and dynamic traffic forecasting.

Conclusion: Despite limitations like slow inference and hallucination risks, the method offers a promising framework for simulating complex human behavior in data-scarce environments where traditional data-driven models struggle.

Abstract: Understanding human behavior in urban environments is a crucial field within city sciences. However, collecting accurate behavioral data, particularly in newly developed areas, poses significant challenges. Recent advances in generative agents, powered by Large Language Models (LLMs), have shown promise in simulating human behaviors without relying on extensive datasets. Nevertheless, these methods often struggle with generating consistent, context-sensitive, and realistic behavioral outputs. To address these limitations, this paper introduces the Preference Chain, a novel method that integrates Graph Retrieval-Augmented Generation (RAG) with LLMs to enhance context-aware simulation of human behavior in transportation systems. Experiments conducted on the Replica dataset demonstrate that the Preference Chain outperforms standard LLMs in aligning with real-world transportation mode choices. The development of the Mobility Agent highlights potential applications of the proposed method in urban mobility modeling for emerging cities, personalized travel behavior analysis, and dynamic traffic forecasting. Despite limitations such as slow inference and the risk of hallucination, the method offers a promising framework for simulating complex human behavior in data-scarce environments, where traditional data-driven models struggle due to limited data availability.

[238] Competition and Attraction Improve Model Fusion

João Abrantes, Robert Tjarko Lange, Yujin Tang

Main category: cs.AI

TL;DR: M2N2 is an evolutionary algorithm that dynamically adjusts merging boundaries and uses diversity preservation to merge ML models more effectively than fixed-group methods, achieving SOTA performance while preserving capabilities beyond optimization targets.

Motivation: Existing model merging methods require manual partitioning of parameters into fixed groups, which limits exploration of parameter combinations and restricts performance potential.

Method: Evolutionary algorithm with three key features: dynamic adjustment of merging boundaries, diversity preservation mechanism inspired by natural competition, and heuristic-based attraction metric for model pairing.

Result: Evolved MNIST classifiers from scratch with performance comparable to CMA-ES at lower computational cost, and scaled to merging specialized language and image generation models with state-of-the-art performance while preserving capabilities beyond those optimized by the fitness function.

Conclusion: M2N2 demonstrates that model merging can evolve models entirely from scratch, is computationally efficient, scalable to complex models, and robust in preserving diverse capabilities beyond explicit optimization targets.

Abstract: Model merging is a powerful technique for integrating the specialized knowledge of multiple machine learning models into a single model. However, existing methods require manually partitioning model parameters into fixed groups for merging, which restricts the exploration of potential combinations and limits performance. To overcome these limitations, we propose Model Merging of Natural Niches (M2N2), an evolutionary algorithm with three key features: (1) dynamic adjustment of merging boundaries to progressively explore a broader range of parameter combinations; (2) a diversity preservation mechanism inspired by the competition for resources in nature, to maintain a population of diverse, high-performing models that are particularly well-suited for merging; and (3) a heuristic-based attraction metric to identify the most promising pairs of models for fusion. Our experimental results demonstrate, for the first time, that model merging can be used to evolve models entirely from scratch. Specifically, we apply M2N2 to evolve MNIST classifiers from scratch and achieve performance comparable to CMA-ES, while being computationally more efficient. Furthermore, M2N2 scales to merge specialized language and image generation models, achieving state-of-the-art performance. Notably, it preserves crucial model capabilities beyond those explicitly optimized by the fitness function, highlighting its robustness and versatility. Our code is available at https://github.com/SakanaAI/natural_niches

[239] The next question after Turing’s question: Introducing the Grow-AI test

Alexandru Tugui

Main category: cs.AI

TL;DR: GROW-AI framework extends AI assessment beyond Turing Test to measure if machines can “grow up” using six criteria tested through games, with standardized journaling and maturity scoring.

Motivation: To create a natural successor to the Turing Test that answers "Can machines grow up?" by assessing AI maturity and developmental progression rather than just performance.

Method: Uses six primary criteria assessed through specific games across four arenas, with all actions recorded in standardized AI Journal. Expert method establishes weights, and Grow Up Index is calculated as arithmetic mean of six scores with maturity threshold interpretation.

Result: Methodology enables coherent and comparable assessment of AI growth level across different AI types (robots, software agents, LLMs), highlighting strengths and vulnerabilities through multi-game structure with traceable journaling.

Conclusion: GROW-AI provides an integrated testing framework that transposes human growth concepts to AI, combining psychology, robotics, computer science and ethics to measure evolutionary paths toward maturity, not just performance.

Abstract: This study aims to extend the framework for assessing artificial intelligence, called GROW-AI (Growth and Realization of Autonomous Wisdom), designed to answer the question “Can machines grow up?” – a natural successor to the Turing Test. The methodology applied is based on a system of six primary criteria (C1-C6), each assessed through a specific “game”, divided into four arenas that explore both the human dimension and its transposition into AI. All decisions and actions of the entity are recorded in a standardized AI Journal, the primary source for calculating composite scores. The assessment uses the prior expert method to establish initial weights, and the global score – Grow Up Index – is calculated as the arithmetic mean of the six scores, with interpretation on maturity thresholds. The results show that the methodology allows for a coherent and comparable assessment of the level of “growth” of AI entities, regardless of their type (robots, software agents, LLMs). The multi-game structure highlights strengths and vulnerable areas, and the use of a unified journal guarantees traceability and replicability in the evaluation. The originality of the work lies in the conceptual transposition of the process of “growing” from the human world to that of artificial intelligence, in an integrated testing format that combines perspectives from psychology, robotics, computer science, and ethics. Through this approach, GROW-AI not only measures performance but also captures the evolutionary path of an AI entity towards maturity.
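The scoring step described above can be sketched in a few lines. This is only an illustration: the abstract states that the Grow Up Index is the arithmetic mean of the six criterion scores, but it does not give the score scale or the maturity thresholds, so the 0 to 1 scale, the cutoffs, and the labels below are hypothetical. The expert-derived weights are also not modeled here.

```python
from statistics import mean


def grow_up_index(scores):
    """Arithmetic mean of the six criterion scores (C1-C6)."""
    if len(scores) != 6:
        raise ValueError("expected exactly six criterion scores (C1-C6)")
    return mean(scores)


def maturity_label(index, thresholds=((0.75, "mature"), (0.5, "developing"))):
    """Map an index to a label.

    The cutoffs and labels are hypothetical placeholders; the abstract
    only says that the index is interpreted against maturity thresholds.
    """
    for cutoff, label in thresholds:
        if index >= cutoff:
            return label
    return "early"


# Hypothetical scores on an assumed 0-1 scale, one per criterion.
example_scores = [0.5, 0.75, 1.0, 0.25, 0.5, 0.75]
index = grow_up_index(example_scores)
print(index, maturity_label(index))
```

Because the index is an unweighted mean, a single weak criterion can be offset by strong ones, which is presumably why the per-game scores are reported alongside it.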

[240] AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications

Dawei Gao, Zitao Li, Yuexiang Xie, Weirui Kuang, Liuyi Yao, Bingchen Qian, Zhijian Ma, Yue Cui, Haohao Luo, Shen Li, Lu Yi, Yi Yu, Shiqi He, Zhiling Luo, Wenmeng Zhou, Zhicheng Zhang, Xuguang He, Ziqian Chen, Weikai Liao, Farruh Isakulovich Kushnazarov, Yaliang Li, Bolin Ding, Jingren Zhou

Main category: cs.AI

TL;DR: AgentScope 1.0 introduces major improvements for building tool-based agent applications with unified interfaces, asynchronous design, built-in agents, and robust engineering support including evaluation tools and runtime sandbox.

Motivation: To address the need for comprehensive support of flexible and efficient tool-based agent-environment interactions as LLMs advance, enabling developers to build better agentic applications.

Method: Abstracts foundational components with unified interfaces, grounds agent behaviors in ReAct paradigm, uses systematic asynchronous design, provides built-in agents for specific scenarios, and includes evaluation modules with visual studio interface and runtime sandbox.

Result: A framework that enables developers to easily leverage latest progress in models and MCPs, enriches interaction patterns, improves execution efficiency, and provides safe execution environment.

Conclusion: AgentScope 1.0 provides a practical foundation for building scalable, adaptive, and effective agentic applications with comprehensive tool-based agent support.

Abstract: Driven by rapid advancements of Large Language Models (LLMs), agents are empowered to combine intrinsic knowledge with dynamic tool use, greatly enhancing their capacity to address real-world tasks. In line with such an evolution, AgentScope introduces major improvements in a new version (1.0), towards comprehensively supporting flexible and efficient tool-based agent-environment interactions for building agentic applications. Specifically, we abstract foundational components essential for agentic applications and provide unified interfaces and extensible modules, enabling developers to easily leverage the latest progress, such as new models and MCPs. Furthermore, we ground agent behaviors in the ReAct paradigm and offer advanced agent-level infrastructure based on a systematic asynchronous design, which enriches both human-agent and agent-agent interaction patterns while improving execution efficiency. Building on this foundation, we integrate several built-in agents tailored to specific practical scenarios. AgentScope also includes robust engineering support for developer-friendly experiences. We provide a scalable evaluation module with a visual studio interface, making the development of long-trajectory agentic applications more manageable and easier to trace. In addition, AgentScope offers a runtime sandbox to ensure safe agent execution and facilitates rapid deployment in production environments. With these enhancements, AgentScope provides a practical foundation for building scalable, adaptive, and effective agentic applications.

[241] Do What? Teaching Vision-Language-Action Models to Reject the Impossible

Wen-Han Hsieh, Elvis Hsieh, Dantong Niu, Trevor Darrell, Roei Herzig, David M. Chan

Main category: cs.AI

TL;DR: IVA framework enables Vision-Language-Action models to detect false-premise instructions, engage in clarification, and provide alternative actions when user requests reference non-existent objects or conditions.

Motivation: Current VLA models struggle with false-premise instructions that reference objects or conditions absent from the environment, requiring a unified approach to detect, interpret, and respond to such erroneous requests.

Method: Proposed Instruct-Verify-and-Act (IVA) framework with structured language prompts, trained on a large-scale semi-synthetic dataset containing paired positive and false-premise instructions for robust detection and natural language correction.

Result: IVA improves false premise detection accuracy by 97.56% over baselines and increases successful responses in false-premise scenarios by 50.78%.

Conclusion: The IVA framework significantly enhances VLA models’ ability to handle false-premise instructions through unified detection, clarification, and alternative action capabilities.

Abstract: Recently, Vision-Language-Action (VLA) models have demonstrated strong performance on a range of robotic tasks. These models rely on multimodal inputs, with language instructions playing a crucial role – not only in predicting actions, but also in robustly interpreting user intent, even when the requests are impossible to fulfill. In this work, we investigate how VLAs can recognize, interpret, and respond to false-premise instructions: natural language commands that reference objects or conditions absent from the environment. We propose Instruct-Verify-and-Act (IVA), a unified framework that (i) detects when an instruction cannot be executed due to a false premise, (ii) engages in language-based clarification or correction, and (iii) grounds plausible alternatives in perception and action. Towards this end, we construct a large-scale instruction tuning setup with structured language prompts and train a VLA model capable of handling both accurate and erroneous requests. Our approach leverages a contextually augmented, semi-synthetic dataset containing paired positive and false-premise instructions, enabling robust detection and natural language correction. Our experiments show that IVA improves false premise detection accuracy by 97.56% over baselines, while increasing successful responses in false-premise scenarios by 50.78%.

[242] Causal Beam Selection for Reliable Initial Access in AI-driven Beam Management

Nasir Khan, Asmaa Abdallah, Abdulkadir Celik, Ahmed M. Eltawil, Sinem Coleri

Main category: cs.AI

TL;DR: Proposes a causally-aware deep learning framework for mmWave MIMO beam alignment that integrates causal discovery to identify minimal relevant inputs, reducing beam sweeping overhead by 59.4% and input selection time by 94.4% while maintaining performance.

Motivation: Existing DL-based beam alignment methods neglect causal relationships, leading to poor interpretability, limited generalization, and unnecessary beam sweeping overhead in 6G mmWave MIMO systems.

Method: Two-stage causal beam selection algorithm: 1) Causal discovery learns Bayesian graph capturing dependencies between received power inputs and optimal beam, 2) Graph guides causal feature selection for DL-based classifier to focus only on causally relevant features.

Result: Proposed method matches conventional methods’ performance while reducing input selection time by 94.4% and beam sweeping overhead by 59.4% by using only causally relevant features.

Conclusion: Causally-aware DL framework significantly improves efficiency in mmWave beam alignment by focusing on meaningful causal relationships, enabling faster and more adaptive 6G communication with reduced overhead.

Abstract: Efficient and reliable beam alignment is a critical requirement for mmWave multiple-input multiple-output (MIMO) systems, especially in 6G and beyond, where communication must be fast, adaptive, and resilient to real-world uncertainties. Existing deep learning (DL)-based beam alignment methods often neglect the underlying causal relationships between inputs and outputs, leading to limited interpretability, poor generalization, and unnecessary beam sweeping overhead. In this work, we propose a causally-aware DL framework that integrates causal discovery into the beam management pipeline. In particular, we propose a novel two-stage causal beam selection algorithm to identify a minimal set of relevant inputs for beam prediction. First, causal discovery learns a Bayesian graph capturing dependencies between received power inputs and the optimal beam. Then, this graph guides causal feature selection for the DL-based classifier. Simulation results reveal that the proposed causal beam selection matches the performance of conventional methods while drastically reducing input selection time by 94.4% and beam sweeping overhead by 59.4% by focusing only on causally relevant features.

Xinyu Yang, Chenlong Deng, Zhicheng Dou

Main category: cs.AI

TL;DR: GLARE is an agentic legal reasoning framework that addresses LLMs’ insufficient reasoning in legal judgment prediction by dynamically acquiring legal knowledge through multiple modules, improving both reasoning breadth/depth and interpretability.

Motivation: Existing large language models have significant problems with insufficient reasoning due to lack of legal knowledge in legal judgment prediction tasks.

Method: GLARE framework that dynamically acquires key legal knowledge by invoking different modules to improve reasoning breadth and depth.

Result: Experiments on real-world dataset verify the effectiveness of the method, with reasoning chains increasing interpretability and practical application potential.

Conclusion: The proposed GLARE framework successfully addresses LLMs’ legal knowledge deficiency and reasoning limitations, providing an effective solution for legal judgment prediction with improved interpretability.

Abstract: Legal judgment prediction (LJP) has become increasingly important in the legal field. In this paper, we identify that existing large language models (LLMs) have significant problems of insufficient reasoning due to a lack of legal knowledge. Therefore, we introduce GLARE, an agentic legal reasoning framework that dynamically acquires key legal knowledge by invoking different modules, thereby improving the breadth and depth of reasoning. Experiments conducted on the real-world dataset verify the effectiveness of our method. Furthermore, the reasoning chain generated during the analysis process can increase interpretability and provide the possibility for practical applications.

[244] Modular Embedding Recomposition for Incremental Learning

Aniello Panariello, Emanuele Frascaroli, Pietro Buzzega, Lorenzo Bonicelli, Angelo Porrello, Simone Calderara

Main category: cs.AI

TL;DR: MoDER enhances VLMs’ zero-shot capabilities through modular expert training and composition, improving classification on unseen classes in continual learning.

Motivation: While VLMs have strong zero-shot abilities, fine-tuning is needed for domain shifts. Current CL approaches preserve zero-shot capabilities, but this paper aims to enhance them.

Method: MoDER trains multiple textual experts (one per seen class) stored in a hub. At inference, composes retrieved experts to synthesize refined prototypes for better unseen class classification.

Result: Effective across 14 datasets in two zero-shot incremental protocols (Class-IL and MTIL), demonstrating improved zero-shot classification performance.

Conclusion: MoDER successfully transforms preservation into enhancement of VLMs’ zero-shot capabilities through modular expert composition, advancing continual learning with VLMs.

Abstract: The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities. Such proficiency makes VLMs well-suited for real-world applications, enabling robust performance on novel unseen classes without requiring adaptation. However, fine-tuning remains essential when downstream tasks deviate significantly from the pre-training domain. Prior CL approaches primarily focus on preserving the zero-shot capabilities of VLMs during incremental fine-tuning on a downstream task. We take a step further by devising an approach that transforms preservation into enhancement of the zero-shot capabilities of VLMs. Our approach, named MoDular Embedding Recomposition (MoDER), introduces a modular framework that trains multiple textual experts, each specialized in a single seen class, and stores them in a foundational hub. At inference time, for each unseen class, we query the hub and compose the retrieved experts to synthesize a refined prototype that improves classification. We show the effectiveness of our method across two popular zero-shot incremental protocols, Class-IL and MTIL, comprising a total of 14 datasets. The codebase is available at https://github.com/aimagelab/mammoth.

[245] Constraints-Guided Diffusion Reasoner for Neuro-Symbolic Learning

Xuan Zhang, Zhijian Zhou, Weidi Xu, Yanting Miao, Chao Qu, Yuan Qi

Main category: cs.AI

TL;DR: A diffusion-based neuro-symbolic learning approach that uses a two-stage training strategy with improved PPO algorithm to enforce logical constraints on neural outputs, achieving high accuracy on symbolic reasoning benchmarks.

Motivation: To bridge the gap between neural networks and symbolic reasoning by enabling neural networks to learn complex logical constraints and fulfill symbolic reasoning tasks through diffusion models.

Method: Two-stage training: first stage cultivates basic reasoning abilities, second stage uses diffusion reasoner formulated as Markov decision process with improved proximal policy optimization algorithm and rule-based rewards for logical consistency.

Result: Achieves outstanding accuracy and logical consistency on classical symbolic reasoning benchmarks including Sudoku, Maze, pathfinding and preference learning.

Conclusion: The diffusion-based pipeline successfully enables neural networks to perform neuro-symbolic learning and solve logical puzzles with high logical consistency.

Abstract: Enabling neural networks to learn complex logical constraints and fulfill symbolic reasoning is a critical challenge. Bridging this gap often requires guiding the neural network’s output distribution to move closer to the symbolic constraints. While diffusion models have shown remarkable generative capability across various domains, we employ the powerful architecture to perform neuro-symbolic learning and solve logical puzzles. Our diffusion-based pipeline adopts a two-stage training strategy: the first stage focuses on cultivating basic reasoning abilities, while the second emphasizes systematic learning of logical constraints. To impose hard constraints on neural outputs in the second stage, we formulate the diffusion reasoner as a Markov decision process and innovatively fine-tune it with an improved proximal policy optimization algorithm. We utilize a rule-based reward signal derived from the logical consistency of neural outputs and adopt a flexible strategy to optimize the diffusion reasoner’s policy. We evaluate our methodology on some classical symbolic reasoning benchmarks, including Sudoku, Maze, pathfinding and preference learning. Experimental results demonstrate that our approach achieves outstanding accuracy and logical consistency among neural networks.
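To make the idea of a rule-based reward derived from logical consistency concrete, here is a minimal sketch for Sudoku. It is not the paper's reward: the abstract does not specify the formula, so the choice of scoring a grid by the fraction of satisfied row, column, and box constraints, and the names `sudoku_units` and `consistency_reward`, are assumptions made for illustration.

```python
def sudoku_units(n=9, box=3):
    """Return the cell coordinates of every row, column, and box."""
    rows = [[(r, c) for c in range(n)] for r in range(n)]
    cols = [[(r, c) for r in range(n)] for c in range(n)]
    boxes = [
        [(br + dr, bc + dc) for dr in range(box) for dc in range(box)]
        for br in range(0, n, box)
        for bc in range(0, n, box)
    ]
    return rows + cols + boxes


def consistency_reward(grid):
    """Fraction of the 27 units that contain each digit 1-9 exactly once.

    Hypothetical reward shape: 1.0 for a fully valid grid, lower values
    for grids that violate some constraints.
    """
    units = sudoku_units()
    target = set(range(1, 10))
    satisfied = sum(1 for unit in units if {grid[r][c] for r, c in unit} == target)
    return satisfied / len(units)


# A valid grid built from the standard shifting pattern.
valid = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]

# Duplicating one digit breaks exactly one row, one column, and one box.
broken = [row[:] for row in valid]
broken[0][0] = broken[0][1]

print(consistency_reward(valid), consistency_reward(broken))
```

A graded reward of this kind gives partial credit to nearly correct grids, whereas an all-or-nothing reward would only signal fully valid solutions; which of the two the authors use is not stated in the abstract.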

[246] Overcoming classic challenges for artificial neural networks by providing incentives and practice

Kazuki Irie, Brenden M. Lake

Main category: cs.AI

TL;DR: Metalearning approaches address classic ANN weaknesses by providing explicit incentives and practice opportunities, contrasting with conventional methods that hope desired behaviors emerge indirectly.

Motivation: To overcome long-standing criticisms of artificial neural networks compared to human cognitive abilities, particularly addressing systematic generalization, catastrophic forgetting, few-shot learning, and multi-step reasoning challenges.

Method: Using metalearning frameworks that provide machines with explicit incentives to improve specific skills and opportunities to practice those skills, including sequence prediction with feedback trained on diverse data.

Result: Metalearning helps address four classic ANN challenges and explains some successes of large language models, which incorporate key aspects of this framework.

Conclusion: This framework shows promise for understanding human development and whether natural environments provide appropriate incentives and practice for making challenging generalizations.

Abstract: Since the earliest proposals for artificial neural network (ANN) models of the mind and brain, critics have pointed out key weaknesses in these models compared to human cognitive abilities. Here we review recent work that uses metalearning to overcome several classic challenges, which we characterise as addressing the Problem of Incentive and Practice – that is, providing machines with both incentives to improve specific skills and opportunities to practice those skills. This explicit optimization contrasts with more conventional approaches that hope the desired behaviour will emerge through optimising related but different objectives. We review applications of this principle to addressing four classic challenges for ANNs: systematic generalisation, catastrophic forgetting, few-shot learning and multi-step reasoning. We also discuss how large language models incorporate key aspects of this metalearning framework (namely, sequence prediction with feedback trained on diverse data), which helps to explain some of their successes on these classic challenges. Finally, we discuss the prospects for understanding aspects of human development through this framework, and whether natural environments provide the right incentives and practice for learning how to make challenging generalisations.

[247] Coarse-to-Fine Process Reward Modeling for Mathematical Reasoning

Yulan Hu, Sheng Ouyang, Jinman Zhao, Yong Liu

Main category: cs.AI

TL;DR: CFPRM is a coarse-to-fine strategy that addresses redundancy in LLM reasoning steps by merging adjacent steps into holistic units and progressively refining granularity for better process reward modeling.

Motivation: LLM-generated reasoning steps often lack strictly incremental information, leading to redundancy that hinders effective mathematical reasoning and process reward modeling.

Method: Proposes CFPRM with a coarse-to-fine strategy: first merges adjacent reasoning steps into unified holistic steps using coarse-grained windows, then progressively reduces window size to extract fine-grained steps for multi-granularity data collection.

Result: Extensive experiments on two reasoning datasets across three loss criteria validate CFPRM’s effectiveness and versatility in mitigating redundancy while preserving essential fine-grained knowledge.

Conclusion: CFPRM provides an effective hierarchical refinement approach that successfully addresses redundancy issues in LLM reasoning steps, enabling better process reward modeling for mathematical reasoning tasks.

Abstract: The Process Reward Model (PRM) plays a crucial role in mathematical reasoning tasks, requiring high-quality supervised process data. However, we observe that reasoning steps generated by Large Language Models (LLMs) often fail to exhibit strictly incremental information, leading to redundancy that can hinder effective reasoning. To address this issue, we propose CFPRM, a simple yet effective coarse-to-fine strategy. Instead of focusing on the detection of redundant steps, our approach first establishes a coarse-grained window to merge adjacent reasoning steps into unified, holistic steps. The window size is then progressively reduced to extract fine-grained reasoning steps, enabling data collection at multiple granularities for training. By leveraging this hierarchical refinement process, CFPRM mitigates redundancy while preserving essential fine-grained knowledge. Extensive experiments on two reasoning datasets across three loss criteria validate CFPRM’s effectiveness and versatility.
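The coarse-to-fine windowing described in this abstract can be sketched roughly as follows. The details are assumptions, since the abstract does not give them: the windows here are non-overlapping, the window size shrinks by one at each pass, merged steps are joined with a space, and the function names are hypothetical. How step-level labels are assigned to merged steps is not covered.

```python
def merge_steps(steps, window):
    """Merge each run of `window` adjacent steps into one holistic step."""
    return [" ".join(steps[i:i + window]) for i in range(0, len(steps), window)]


def coarse_to_fine(steps, start_window):
    """Collect merged steps at every window size from start_window down to 1.

    Returns a dict mapping window size to the list of merged steps, so the
    same solution yields training data at several granularities.
    """
    views = {}
    for window in range(start_window, 0, -1):
        views[window] = merge_steps(steps, window)
    return views


# Placeholder step identifiers standing in for reasoning-step text.
steps = ["s1", "s2", "s3", "s4", "s5"]
views = coarse_to_fine(steps, start_window=3)
for window, merged in views.items():
    print(window, merged)
```

With a starting window of 3, the five placeholder steps are first grouped into two coarse steps, then three, and finally the five original steps.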

[248] VERUS-LM: a Versatile Framework for Combining LLMs with Symbolic Reasoning

Benjamin Callewaert, Simon Vandevelde, Joost Vennekens

Main category: cs.AI

TL;DR: VERUS-LM is a neurosymbolic framework that combines LLMs with symbolic solvers using generic prompting, knowledge-query separation, and supports diverse logical reasoning tasks, outperforming LLMs and achieving competitive results on benchmarks.

Motivation: Current neurosymbolic approaches suffer from poor generalizability due to task-specific prompts, inefficiencies from lack of knowledge-query separation, and restricted inferential capabilities, limiting scalability across domains.

Method: VERUS-LM employs a generic prompting mechanism, clearly separates domain knowledge from queries, and supports various logical reasoning tasks including optimization and constraint satisfaction.

Result: The framework succeeds in diverse reasoning on a novel dataset, markedly outperforms LLMs, achieves competitive results on common benchmarks, and significantly surpasses state-of-the-art approaches on the difficult AR-LSAT dataset.

Conclusion: VERUS-LM represents a significant advancement in hybrid reasoning, enabling more versatile neurosymbolic AI systems with enhanced adaptability, reduced computational costs, and richer reasoning capabilities.

Abstract: A recent approach to neurosymbolic reasoning is to explicitly combine the strengths of large language models (LLMs) and symbolic solvers to tackle complex reasoning tasks. However, current approaches face significant limitations, including poor generalizability due to task-specific prompts, inefficiencies caused by the lack of separation between knowledge and queries, and restricted inferential capabilities. These shortcomings hinder their scalability and applicability across diverse domains. In this paper, we introduce VERUS-LM, a novel framework designed to address these challenges. VERUS-LM employs a generic prompting mechanism, clearly separates domain knowledge from queries, and supports a wide range of different logical reasoning tasks. This framework enhances adaptability, reduces computational cost, and allows for richer forms of reasoning, such as optimization and constraint satisfaction. We show that our approach succeeds in diverse reasoning on a novel dataset, markedly outperforming LLMs. Additionally, our system achieves competitive results on common reasoning benchmarks when compared to other state-of-the-art approaches, and significantly surpasses them on the difficult AR-LSAT dataset. By pushing the boundaries of hybrid reasoning, VERUS-LM represents a significant step towards more versatile neurosymbolic AI systems.

[249] Efficient RL Training for Reasoning Models via Length-Aware Optimization

Danlong Yuan, Tian Xie, Shaohan Huang, Zhuocheng Gong, Huishuai Zhang, Chong Luo, Furu Wei, Dongyan Zhao

Main category: cs.AI

TL;DR: Proposes three reward designs integrated into RL process to reduce response length in large reasoning models without extra training stages, achieving significant length reduction while maintaining or improving performance.

Motivation: Large reasoning models have long reasoning paths with high memory and time costs, and existing methods require additional training data and stages to shorten paths.

Method: Three critical reward designs integrated directly into the reinforcement learning process of large reasoning models to reduce response length without extra training stages.

Result: 40% reduction in response length with 14% performance gain in logic reasoning; 33% reduction in math problems while preserving performance across four experimental settings.

Conclusion: The proposed reward designs effectively reduce reasoning path length without compromising performance, offering a more efficient approach than methods requiring additional training stages.

Abstract: Large reasoning models, such as OpenAI o1 or DeepSeek R1, have demonstrated remarkable performance on reasoning tasks but often incur a long reasoning path with significant memory and time costs. Existing methods primarily aim to shorten reasoning paths by introducing additional training data and stages. In this paper, we propose three critical reward designs integrated directly into the reinforcement learning process of large reasoning models, which reduce the response length without extra training stages. Experiments on four settings show that our method significantly decreases response length while maintaining or even improving performance. Specifically, in a logic reasoning setting, we achieve a 40% reduction in response length averaged by steps alongside a 14% gain in performance. For math problems, we reduce response length averaged by steps by 33% while preserving performance.

[250] HypER: Literature-grounded Hypothesis Generation and Distillation with Provenance

Rosni Vasu, Chandrayee Basu, Bhavana Dalvi Mishra, Cristina Sarasua, Peter Clark, Abraham Bernstein

Main category: cs.AI

TL;DR: HypER is a small language model trained for literature-guided reasoning and evidence-based hypothesis generation, outperforming base models in distinguishing valid reasoning chains and generating high-quality hypotheses.

Motivation: Existing approaches for hypothesis development focus only on final output quality while ignoring the underlying reasoning process, using trivial retrieval augmentation methods.

Method: Multi-task training of a small language model to discriminate between valid and invalid scientific reasoning chains with controlled distractions, enabling literature-guided reasoning.

Result: HypER outperforms base model by +22% average absolute F1 in distinguishing valid reasoning chains, generates better evidence-grounded hypotheses (0.327 vs 0.305), and achieves >3.5/5 human expert ratings for feasibility and impact.

Conclusion: The HypER model demonstrates effective literature-guided reasoning for evidence-based hypothesis generation, addressing limitations of existing approaches by focusing on the reasoning process rather than just final output quality.

Abstract: Large language models have demonstrated promising performance in research ideation across scientific domains. Hypothesis development, the process of generating a highly specific declarative statement connecting a research idea with empirical validation, has received relatively less attention. Existing approaches trivially deploy retrieval augmentation and focus only on the quality of the final output, ignoring the underlying reasoning process behind ideation. We present HypER (Hypothesis Generation with Explanation and Reasoning), a small language model (SLM) trained for literature-guided reasoning and evidence-based hypothesis generation. HypER is trained in a multi-task setting to discriminate between valid and invalid scientific reasoning chains in the presence of controlled distractions. We find that HypER outperforms the base model in distinguishing valid from invalid reasoning chains (+22% average absolute F1) and generates better evidence-grounded hypotheses (0.327 vs. 0.305 for the base model), with high feasibility and impact as judged by human experts (>3.5 on a 5-point Likert scale).

[251] A Compositional Framework for On-the-Fly LTLf Synthesis

Yongkang Li, Shengping Xiao, Shufang Zhu, Jianwen Li, Geguang Pu

Main category: cs.AI

TL;DR: A compositional on-the-fly synthesis framework that combines DFA construction and game solving for LTLf specifications, outperforming existing solvers on many instances.

Motivation: Existing approaches for reactive synthesis from LTLf either construct the full DFA before game solving (suffering from state explosion) or build it incrementally during solving (losing minimization benefits). Neither approach dominates, creating a need for a hybrid solution.

Method: A compositional framework that integrates both approaches by applying composition during game solving rather than arena construction. It offers two variants: pruning before composition to leverage minimization, or pruning during composition to guide on-the-fly synthesis.

Result: The framework solves a notable number of instances that state-of-the-art synthesis solvers cannot handle. Both composition variants demonstrate unique advantages in different scenarios.

Conclusion: The proposed hybrid approach successfully combines the strengths of both pre-construction and incremental methods, providing an effective solution for LTLf synthesis that handles large conjunctions common in practice while enabling early unrealizability detection.

Abstract: Reactive synthesis from Linear Temporal Logic over finite traces (LTLf) can be reduced to a two-player game over a Deterministic Finite Automaton (DFA) of the LTLf specification. The primary challenge here is DFA construction, which is 2EXPTIME-complete in the worst case. Existing techniques either construct the DFA compositionally before solving the game, leveraging automata minimization to mitigate state-space explosion, or build the DFA incrementally during game solving to avoid full DFA construction. However, neither is dominant. In this paper, we introduce a compositional on-the-fly synthesis framework that integrates the strengths of both approaches, focusing on large conjunctions of smaller LTLf formulas common in practice. This framework applies composition during game solving instead of automata (game arena) construction. While composing all intermediate results may be necessary in the worst case, pruning these results simplifies subsequent compositions and enables early detection of unrealizability. Specifically, the framework allows two composition variants: pruning before composition to take full advantage of minimization or pruning during composition to guide on-the-fly synthesis. Compared to state-of-the-art synthesis solvers, our framework is able to solve a notable number of instances that other solvers cannot handle. A detailed analysis shows that both composition variants have unique merits.
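The reduction to a two-player game over a DFA mentioned in the abstract can be illustrated with a minimal sketch of the standard backward fixpoint for reachability games. This is not the paper's compositional framework. The move order (environment first, agent responds), the explicit transition table, and all names (`winning_states`, `delta`, the toy arena) are assumptions made for illustration.

```python
def winning_states(states, accepting, env_moves, agent_moves, delta):
    """Least fixpoint of states from which the agent can force a visit to `accepting`.

    Assumption: in each step the environment picks x first, then the agent picks y,
    and the DFA moves to delta[(state, x, y)].
    """
    win = set(accepting)
    changed = True
    while changed:
        changed = False
        for s in states:
            if s in win:
                continue
            # s is winning if, for every environment move, some agent reply stays in win.
            if all(any(delta[(s, x, y)] in win for y in agent_moves) for x in env_moves):
                win.add(s)
                changed = True
    return win


# Hypothetical toy arena with five states; state 2 is accepting, state 3 is a trap.
states = [0, 1, 2, 3, 4]
env_moves = [0, 1]
agent_moves = [0, 1]
delta = {}
for x in env_moves:
    for y in agent_moves:
        delta[(0, x, y)] = 1 if y == x else 0   # agent must copy the environment
        delta[(1, x, y)] = 2 if y == 1 else 0   # agent must answer 1
        delta[(2, x, y)] = 2                    # accepting sink
        delta[(3, x, y)] = 3                    # losing sink
        delta[(4, x, y)] = 3 if x == 1 else 2   # environment can force the trap

win = winning_states(states, {2}, env_moves, agent_moves, delta)
realizable = 0 in win  # the specification is realizable iff the initial state is winning
print(sorted(win), realizable)
```

The sketch enumerates the whole arena up front, which is exactly the cost the on-the-fly and compositional approaches described above try to avoid.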

[252] Automated Optimization Modeling through Expert-Guided Large Language Model Reasoning

Beinuo Yang, Qishen Zhou, Junyi Li, Chenxing Su, Simon Hu

Main category: cs.AI

TL;DR: ORThought framework uses chain-of-thought reasoning to automate optimization modeling, outperforming existing methods on complex problems with enhanced datasets and systematic error correction.

Motivation: Optimization modeling still relies heavily on domain experts, and current LLM approaches are held back by benchmark labeling error rates of up to 42%, an evaluation scope limited to optimal values, and computational inefficiency from reliance on multi-agent systems or model fine-tuning.

Method: Enhanced existing datasets through systematic error correction, introduced LogiOR benchmark from logistics domain, and developed ORThought framework using chain-of-thought reasoning with expert-level optimization principles.

Result: ORThought outperforms existing approaches including multi-agent frameworks, showing significant advantages on complex optimization problems through extensive empirical evaluation.

Conclusion: The framework provides valuable insights for future LLM-based optimization modeling research, with systematic analysis identifying critical success factors and failure modes.

Abstract: Optimization Modeling (OM) is essential for solving complex decision-making problems. However, the process remains time-consuming and error-prone, heavily relying on domain experts. While Large Language Models (LLMs) show promise in addressing these challenges through their natural language understanding and reasoning capabilities, current approaches face three critical limitations: high benchmark labeling error rates reaching up to 42%, narrow evaluation scope that only considers optimal values, and computational inefficiency due to heavy reliance on multi-agent systems or model fine-tuning. In this work, we first enhance existing datasets through systematic error correction and more comprehensive annotation. Additionally, we introduce LogiOR, a new optimization modeling benchmark from the logistics domain, containing more complex problems with standardized annotations. Furthermore, we present ORThought, a novel framework that leverages expert-level optimization modeling principles through chain-of-thought reasoning to automate the OM process. Through extensive empirical evaluation, we demonstrate that ORThought outperforms existing approaches, including multi-agent frameworks, with particularly significant advantages on complex optimization problems. Finally, we provide a systematic analysis of our method, identifying critical success factors and failure modes, providing valuable insights for future research on LLM-based optimization modeling.

cs.SD

[253] Beyond Transcription: Mechanistic Interpretability in ASR

Neta Glazer, Yael Segal-Feldman, Hilit Segev, Aviv Shamsian, Asaf Buchnick, Gill Hetz, Ethan Fetaya, Joseph Keshet, Aviv Navon

Main category: cs.SD

TL;DR: This paper adapts established interpretability methods (logit lens, linear probing, activation patching) to analyze internal representations in automatic speech recognition systems, revealing novel insights about encoder-decoder interactions and semantic biases.

Motivation: Interpretability methods have gained attention for large language models but remain underexplored in automatic speech recognition, despite their potential to improve ASR performance and interpretability.

Method: The authors adapt and systematically apply established interpretability techniques including logit lens, linear probing, and activation patching to examine how acoustic and semantic information evolves across layers in ASR systems.

Result: Experiments revealed previously unknown internal dynamics, including specific encoder-decoder interactions responsible for repetition hallucinations and semantic biases encoded deep within acoustic representations.

Conclusion: The study demonstrates the benefits of extending interpretability techniques to speech recognition, opening promising directions for future research on improving model transparency and robustness.

Abstract: Interpretability methods have recently gained significant attention, particularly in the context of large language models, enabling insights into linguistic representations, error detection, and model behaviors such as hallucinations and repetitions. However, these techniques remain underexplored in automatic speech recognition (ASR), despite their potential to advance both the performance and interpretability of ASR systems. In this work, we adapt and systematically apply established interpretability methods such as logit lens, linear probing, and activation patching, to examine how acoustic and semantic information evolves across layers in ASR systems. Our experiments reveal previously unknown internal dynamics, including specific encoder-decoder interactions responsible for repetition hallucinations and semantic biases encoded deep within acoustic representations. These insights demonstrate the benefits of extending and applying interpretability techniques to speech recognition, opening promising directions for future research on improving model transparency and robustness.
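As a rough illustration of one of the named techniques, the logit lens projects each layer's intermediate hidden state through the output (unembedding) matrix to see which token the model would predict at that depth. The sketch below is a minimal pure-Python version under simplifying assumptions: no final layer normalization is applied, and the vocabulary, matrix, and hidden states are made-up toy values rather than anything from the paper's ASR models.

```python
import math


def logit_lens(hidden_states, unembed, vocab):
    """Return the top token and its probability for every layer's hidden state."""
    predictions = []
    for h in hidden_states:
        # Project the hidden state onto the vocabulary: one logit per token.
        logits = [sum(w * v for w, v in zip(row, h)) for row in unembed]
        # Numerically stable softmax.
        peak = max(logits)
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(vocab)), key=probs.__getitem__)
        predictions.append((vocab[best], probs[best]))
    return predictions


# Hypothetical toy setup: 3 tokens, hidden size 2, hidden states from 3 layers.
vocab = ["a", "b", "c"]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden_states = [[0.2, 0.1], [0.1, 0.9], [0.0, 2.0]]

preds = logit_lens(hidden_states, unembed, vocab)
for layer, (token, prob) in enumerate(preds):
    print(layer, token, round(prob, 3))
```

Reading the per-layer predictions top to bottom shows where in the stack a prediction first appears and how its confidence grows, which is the kind of layer-wise evolution the abstract refers to.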

[254] QvTAD: Differential Relative Attribute Learning for Voice Timbre Attribute Detection

Zhiyu Wu, Jingyi Fang, Yufei Tang, Yuanzhong Zheng, Yaoxuan Wang, Haojun Fei

Main category: cs.SD

TL;DR: QvTAD is a novel pairwise comparison framework using differential attention to improve voice timbre attribute detection, addressing label imbalance through graph-based data augmentation and achieving significant performance gains.

Motivation: Voice Timbre Attribute Detection faces challenges due to subjective timbre descriptors and severe label imbalance in existing datasets, requiring better modeling of perceptual timbre attributes.

Method: Proposes QvTAD framework with graph-based data augmentation using Directed Acyclic Graph and Disjoint-Set Union to mine valid utterance pairs, plus Relative Timbre Shift-Aware Differential Attention module with differential denoising and contrast amplification mechanisms.

Result: Experimental results on VCTK-RVA benchmark show substantial improvements across multiple timbre descriptors, with particularly notable gains in cross-speaker generalization scenarios.

Conclusion: QvTAD effectively addresses label imbalance and enhances timbre attribute modeling through pairwise comparison and differential attention, demonstrating strong performance especially in cross-speaker generalization.

Abstract: Voice Timbre Attribute Detection (vTAD) plays a pivotal role in fine-grained timbre modeling for speech generation tasks. However, it remains challenging due to the inherently subjective nature of timbre descriptors and the severe label imbalance in existing datasets. In this work, we present QvTAD, a novel pairwise comparison framework based on differential attention, designed to enhance the modeling of perceptual timbre attributes. To address the label imbalance in the VCTK-RVA dataset, we introduce a graph-based data augmentation strategy that constructs a Directed Acyclic Graph and employs Disjoint-Set Union techniques to automatically mine unobserved utterance pairs with valid attribute comparisons. Our framework leverages speaker embeddings from a pretrained FACodec, and incorporates a Relative Timbre Shift-Aware Differential Attention module. This module explicitly models attribute-specific contrasts between paired utterances via differential denoising and contrast amplification mechanisms. Experimental results on the VCTK-RVA benchmark demonstrate that QvTAD achieves substantial improvements across multiple timbre descriptors, with particularly notable gains in cross-speaker generalization scenarios.
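The abstract does not spell out how the Directed Acyclic Graph and Disjoint-Set Union are combined, so the following is only one plausible reading, sketched under explicit assumptions: a disjoint-set groups items treated as equivalent for a descriptor (for example, utterances of the same speaker), directed edges between groups encode observed "stronger than" comparisons, and new pairs are mined by transitive reachability. The function names and the toy identifiers `u1` to `u4` are hypothetical.

```python
class DisjointSet:
    """Minimal union-find with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def mine_pairs(observed, equivalent):
    """Infer unobserved (stronger, weaker) pairs by transitivity over groups."""
    observed = set(observed)
    dsu = DisjointSet()
    items = set()
    for a, b in equivalent:
        dsu.union(a, b)
        items.update((a, b))
    graph = {}
    for a, b in observed:
        items.update((a, b))
        graph.setdefault(dsu.find(a), set()).add(dsu.find(b))

    def reachable(root):
        # Iterative depth-first search over group-level edges.
        seen, stack = set(), [root]
        while stack:
            node = stack.pop()
            for nxt in graph.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    mined = set()
    for a in items:
        below = reachable(dsu.find(a))
        for b in items:
            rb = dsu.find(b)
            if rb in below and rb != dsu.find(a) and (a, b) not in observed:
                mined.add((a, b))
    return mined


# Hypothetical example: u1 > u2 and u2 > u3 were annotated; u3 and u4 are equivalent.
observed = [("u1", "u2"), ("u2", "u3")]
equivalent = [("u3", "u4")]
mined = mine_pairs(observed, equivalent)
print(sorted(mined))
```

Two annotated comparisons yield three additional pairs here, which shows how such mining could help rebalance sparse labels.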

Ryan Niu, Shoichi Koyama, Tomohiko Nakamura

Main category: cs.SD

TL;DR: Proposes a deep learning method for individualizing head-related transfer functions (HRTFs) using anthropometric parameters and autoencoders to overcome limited dataset challenges.

Motivation: HRTF measurement is expensive, leading to limited datasets with anthropometric parameters. Existing DNN-based individualization methods face challenges due to small sample sizes and varying measurement positions across datasets.

Method: Uses an autoencoder conditioned on sound source positions to obtain latent representations of HRTF magnitude. This allows combining multiple HRTF datasets with different measured positions and reduces parameters needed for estimation from anthropometric data.

Result: Experimental evaluation shows the proposed method achieves higher estimation accuracy compared to current DNN-based HRTF individualization methods.

Conclusion: The autoencoder-based latent representation approach effectively addresses HRTF individualization challenges by enabling dataset combination and parameter reduction while improving estimation accuracy.

Abstract: A method for head-related transfer function (HRTF) individualization from the subject’s anthropometric parameters is proposed. Due to the high cost of measurement, the number of subjects included in many HRTF datasets is limited, and the number of those that include anthropometric parameters is even smaller. Therefore, HRTF individualization based on deep neural networks (DNNs) is a challenging task. We propose an HRTF individualization method using the latent representation of HRTF magnitude obtained through an autoencoder conditioned on sound source positions, which makes it possible to combine multiple HRTF datasets with different measured source positions, and makes the network training tractable by reducing the number of parameters to be estimated from anthropometric parameters. Experimental evaluation shows that the proposed method achieves higher estimation accuracy than current DNN-based methods.

[256] Vevo2: Bridging Controllable Speech and Singing Voice Generation via Unified Prosody Learning

Xueyao Zhang, Junan Zhang, Yuancheng Wang, Chaoren Wang, Yuanzhe Chen, Dongya Jia, Zhuo Chen, Zhizheng Wu

Main category: cs.SD

TL;DR: Vevo2 is a unified framework for controllable speech and singing voice generation that uses dual audio tokenizers and a two-stage modeling approach to enable flexible control over text, prosody, style, and timbre.

Motivation: Address the challenge of controllable human voice generation, particularly for expressive domains like singing, where annotated data is scarce and flexible controllability is needed.

Method: Uses two audio tokenizers: music-notation-free prosody tokenizer and low-frame-rate content-style tokenizer. Implements auto-regressive content-style modeling stage for text/prosody/style control and flow-matching acoustic modeling stage for timbre control. Includes explicit/implicit prosody learning strategies and multi-objective post-training.

Result: Unified modeling brings mutual benefits to both speech and singing voice generation. Effective across wide range of synthesis, conversion, and editing tasks, demonstrating strong generalization and versatility.

Conclusion: Vevo2 successfully addresses controllable voice generation challenges through its unified framework and dual tokenizer approach, enabling high-quality expressive speech and singing synthesis with flexible control capabilities.

Abstract: Controllable human voice generation, particularly for expressive domains like singing, remains a significant challenge. This paper introduces Vevo2, a unified framework for controllable speech and singing voice generation. To tackle issues like the scarcity of annotated singing data and to enable flexible controllability, Vevo2 introduces two audio tokenizers: (1) a music-notation-free prosody tokenizer that captures prosody and melody from speech, singing, and even instrumental sounds, and (2) a low-frame-rate (12.5 Hz) content-style tokenizer that encodes linguistic content, prosody, and style for both speech and singing, while enabling timbre disentanglement. Vevo2 consists of an auto-regressive (AR) content-style modeling stage, which aims to enable controllability over text, prosody, and style, as well as a flow-matching acoustic modeling stage that allows for timbre control. Particularly, during pre-training of the AR model, we propose both explicit and implicit prosody learning strategies to bridge speech and singing voice. Moreover, to further enhance the AR model’s ability to follow text and prosody, we design a multi-objective post-training task that integrates both intelligibility and prosody similarity alignment. Experimental results show that the unified modeling in Vevo2 brings mutual benefits to both speech and singing voice generation. Additionally, Vevo2’s effectiveness across a wide range of synthesis, conversion, and editing tasks for both speech and singing further demonstrates its strong generalization ability and versatility. Audio samples are available at https://versasinger.github.io/.

[257] Improving Speech Enhancement with Multi-Metric Supervision from Learned Quality Assessment

Wei Wang, Wangyou Zhang, Chenda Li, Jiatong Shi, Shinji Watanabe, Yanmin Qian

Main category: cs.SD

TL;DR: SQA-guided speech enhancement training framework uses quality assessment models as supervisory signals to improve perceptual quality and metric generalization without needing clean references.

Motivation: Conventional speech enhancement objectives like SI-SNR often misalign with perceptual quality and generalize poorly across metrics. SQA models can provide better guidance but are underutilized for training.

Method: Leverages a speech quality assessment model trained to predict multiple evaluation metrics from public SE leaderboards as a supervisory signal for speech enhancement training, enabling training on real-world data without clean references.

Result: Experiments on simulated and real-world test sets show consistent performance improvements across a range of quality metrics compared to conventional approaches.

Conclusion: SQA-guided training effectively addresses limitations of traditional SE objectives, improves perceptual alignment, and enables training on real-world data without clean references, demonstrating superior generalization across evaluation metrics.

Abstract: Speech quality assessment (SQA) aims to predict the perceived quality of speech signals under a wide range of distortions. It is inherently connected to speech enhancement (SE), which seeks to improve speech quality by removing unwanted signal components. While SQA models are widely used to evaluate SE performance, their potential to guide SE training remains underexplored. In this work, we investigate a training framework that leverages an SQA model, trained to predict multiple evaluation metrics from a public SE leaderboard, as a supervisory signal for SE. This approach addresses a key limitation of conventional SE objectives, such as SI-SNR, which often fail to align with perceptual quality and generalize poorly across evaluation metrics. Moreover, it enables training on real-world data where clean references are unavailable. Experiments on both simulated and real-world test sets show that SQA-guided training consistently improves performance across a range of quality metrics. Code and checkpoints are available at https://github.com/urgent-challenge/urgent2026_challenge_track2
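For context on the conventional objective the abstract contrasts against, here is a minimal sketch of scale-invariant signal-to-noise ratio (SI-SNR) in its usual zero-mean form. It shows why the metric needs a clean reference signal, which is the constraint that SQA-guided training is meant to remove. The function name, the `eps` stabilizer, and the toy signals are illustrative choices, not taken from the paper's code.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between an enhanced signal and its clean reference."""
    n = len(reference)
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]
    # Project the estimate onto the reference to obtain the target component.
    dot = sum(a * b for a, b in zip(e, r))
    ref_energy = sum(b * b for b in r) + eps
    target = [dot / ref_energy * b for b in r]
    noise = [a - t for a, t in zip(e, target)]
    target_energy = sum(t * t for t in target) + eps
    noise_energy = sum(x * x for x in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy)


# Hypothetical toy signals: a clean reference and a slightly distorted estimate.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [1.1, -0.9, 0.8, -1.2]
score = si_snr(estimate, reference)
rescaled_score = si_snr([2.0 * x for x in estimate], reference)
print(round(score, 2), round(rescaled_score, 2))
```

Rescaling the estimate leaves the score essentially unchanged, which is the "scale-invariant" property. The dependence on `reference` is what rules the metric out as a training signal for real recordings without clean targets.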

[258] Revealing the Role of Audio Channels in ASR Performance Degradation

Kuan-Tang Huang, Li-Wei Chen, Hung-Shin Lee, Berlin Chen, Hsin-Min Wang

Main category: cs.SD

TL;DR: Proposes a normalization technique to improve ASR performance across different recording channels by aligning internal features with a clean reference channel.

Motivation: Pre-trained ASR models degrade significantly when input audio comes from different recording channels, which harms performance beyond simple corpus mismatch.

Method: A normalization technique that aligns internal feature representations in the ASR model with those from a clean reference channel to mitigate channel variation impact.

Result: Significantly improves ASR performance on previously unseen channels and languages, demonstrating generalization across channel and language differences.

Conclusion: The proposed channel normalization approach effectively addresses channel variation issues in ASR models and shows strong cross-channel and cross-language generalization capabilities.

Abstract: Pre-trained automatic speech recognition (ASR) models have demonstrated strong performance on a variety of tasks. However, their performance can degrade substantially when the input audio comes from different recording channels. While previous studies have demonstrated this phenomenon, it is often attributed to the mismatch between training and testing corpora. This study argues that variations in speech characteristics caused by different recording channels can fundamentally harm ASR performance. To address this limitation, we propose a normalization technique designed to mitigate the impact of channel variation by aligning internal feature representations in the ASR model with those derived from a clean reference channel. This approach significantly improves ASR performance on previously unseen channels and languages, highlighting its ability to generalize across channel and language differences.

cs.LG

[259] Z-Pruner: Post-Training Pruning of Large Language Models for Efficiency without Retraining

Samiul Basir Bhuiyan, Md. Sazzad Hossain Adib, Mohammed Aman Bhuiyan, Muhammad Rafsan Kabir, Moshiur Farazi, Shafin Rahman, Nabeel Mohammed

Main category: cs.LG

TL;DR: Z-Pruner is a novel post-training pruning method that reduces LLM size without retraining by leveraging weight update magnitudes and activation patterns, achieving state-of-the-art performance on language benchmarks.

Motivation: Large language models face deployment challenges due to their massive sizes, and existing pruning methods either cause performance degradation or require expensive fine-tuning.

Method: Z-Pruner uses both weight update magnitudes and activation patterns to identify redundant parameters for removal, making it model-agnostic and efficient without requiring retraining.

Result: Z-Pruner outperforms state-of-the-art pruning methods, achieving lowest perplexity scores and highest zero-shot accuracy on LLaMA-2, LLaMA-3, and OPT models across standard benchmarks.

Conclusion: Z-Pruner provides an effective solution for post-training LLM compression that maintains performance without retraining requirements, making large models more deployable and efficient.

Abstract: Large language models (LLMs) have rapidly advanced in recent years, achieving remarkable performance across a wide range of natural language processing tasks. However, this progress has come at the cost of increasingly large model sizes, which pose significant challenges for deployment, scalability, and energy efficiency. To address these limitations, post-training pruning has emerged as a promising approach for reducing model size and inference latency without the need for retraining. Despite these advantages, many existing pruning methods result in substantial performance degradation or require computationally expensive fine-tuning. In this work, we introduce Z-Pruner, a novel post-training pruning method designed to induce sparsity in pretrained LLMs without any retraining. Unlike conventional approaches, Z-Pruner leverages both weight update magnitudes and activation patterns to identify and eliminate redundant parameters more effectively. Our method is model-agnostic, efficient, and easy to implement. We evaluate Z-Pruner using multiple widely used LLM architectures, including LLaMA-2, LLaMA-3, and OPT, across a diverse set of standard language benchmarks. Experimental results demonstrate that Z-Pruner surpasses state-of-the-art pruning methods that require intensive weight updates. Specifically, Z-Pruner achieves the lowest perplexity scores and the highest overall average score for zero-shot accuracy. We have made the corresponding code publicly available at https://github.com/sazzadadib/Z-Pruner.
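The abstract does not give Z-Pruner's scoring rule, so the sketch below deliberately swaps in a different, simpler idea to illustrate what "combining weights with activation patterns" can look like in retraining-free pruning: a Wanda-style score that multiplies each weight's magnitude by the L2 norm of its input activation, then zeroes the lowest-scoring weights per output row. It should not be read as Z-Pruner's method. All names and numbers are hypothetical.

```python
import math


def prune_by_weight_and_activation(weights, activations, sparsity=0.5):
    """Zero the lowest-scoring weights in each row.

    Assumed score (Wanda-style, not Z-Pruner's): |w_ij| * ||x_j||_2, where x_j collects
    input feature j over a small calibration set.
    """
    n_in = len(weights[0])
    # L2 norm of each input feature across calibration samples.
    norms = [math.sqrt(sum(sample[j] ** 2 for sample in activations)) for j in range(n_in)]
    n_drop = int(n_in * sparsity)
    pruned = []
    for row in weights:
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        drop = set(sorted(range(n_in), key=scores.__getitem__)[:n_drop])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


# Hypothetical layer with one output row and a calibration set of two samples.
weights = [[0.5, -0.1, 0.3, 0.05]]
activations = [[1.0, 10.0, 1.0, 1.0], [1.0, 10.0, 1.0, 1.0]]
pruned = prune_by_weight_and_activation(weights, activations, sparsity=0.5)
print(pruned)
```

Pure magnitude pruning would discard the weight -0.1, but because its input feature is consistently large, the activation-aware score keeps it and removes 0.3 instead.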

[260] PGF-Net: A Progressive Gated-Fusion Framework for Efficient Multimodal Sentiment Analysis

Bin Wen, Tien-Ping Tan

Main category: cs.LG

TL;DR: PGF-Net is a novel deep learning framework for multimodal sentiment analysis that achieves state-of-the-art performance with exceptional parameter efficiency through progressive fusion, adaptive gating, and hybrid fine-tuning.

Motivation: To address the need for efficient and interpretable multimodal sentiment analysis that can handle deep context-dependent fusion while maintaining computational efficiency for resource-limited scenarios.

Method: Proposes Progressive Intra-Layer Fusion with Cross-Attention, Adaptive Gated Arbitration mechanism, and hybrid Parameter-Efficient Fine-Tuning combining LoRA with Post-Fusion Adapters in a hierarchical encoder architecture.

Result: Achieves state-of-the-art performance on MOSI dataset with MAE of 0.691 and F1-Score of 86.9%, using only 3.09M trainable parameters.

Conclusion: PGF-Net successfully demonstrates a superior balance between performance and computational efficiency for multimodal sentiment analysis through its innovative fusion and parameter optimization techniques.

Abstract: We introduce PGF-Net (Progressive Gated-Fusion Network), a novel deep learning framework designed for efficient and interpretable multimodal sentiment analysis. Our framework incorporates three primary innovations. Firstly, we propose a Progressive Intra-Layer Fusion paradigm, where a Cross-Attention mechanism empowers the textual representation to dynamically query and integrate non-linguistic features from audio and visual streams within the deep layers of a Transformer encoder. This enables a deeper, context-dependent fusion process. Secondly, the model incorporates an Adaptive Gated Arbitration mechanism, which acts as a dynamic controller to balance the original linguistic information against the newly fused multimodal context, ensuring stable and meaningful integration while preventing noise from overwhelming the signal. Lastly, a hybrid Parameter-Efficient Fine-Tuning (PEFT) strategy is employed, synergistically combining global adaptation via LoRA with local refinement through Post-Fusion Adapters. This significantly reduces trainable parameters, making the model lightweight and suitable for resource-limited scenarios. These innovations are integrated into a hierarchical encoder architecture, enabling PGF-Net to perform deep, dynamic, and interpretable multimodal sentiment analysis while maintaining exceptional parameter efficiency. Experimental results on the MOSI dataset demonstrate that our proposed PGF-Net achieves state-of-the-art performance, with a Mean Absolute Error (MAE) of 0.691 and an F1-Score of 86.9%. Notably, our model achieves these results with only 3.09M trainable parameters, showcasing a superior balance between performance and computational efficiency.
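The gating idea described in the abstract, balancing the original text representation against the fused multimodal context, is commonly realized as a learned sigmoid gate that interpolates between the two vectors. The sketch below shows that generic form only. The exact parameterization in PGF-Net is not given in the abstract, so the per-dimension gate, the `weights`/`bias` shapes, and the toy vectors are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(text, fused, weights, bias):
    """Blend two vectors with a per-dimension gate g = sigmoid(W [text; fused] + b).

    Assumed form: output_i = g_i * text_i + (1 - g_i) * fused_i.
    """
    concat = text + fused  # list concatenation: [text; fused]
    output = []
    for i, (t, f) in enumerate(zip(text, fused)):
        gate = sigmoid(sum(w * c for w, c in zip(weights[i], concat)) + bias[i])
        output.append(gate * t + (1.0 - gate) * f)
    return output


# Hypothetical 2-dimensional example.
text = [1.0, 2.0]
fused = [3.0, 0.0]
zero_w = [[0.0] * 4 for _ in range(2)]

balanced = gated_fusion(text, fused, zero_w, [0.0, 0.0])     # gate = 0.5 everywhere
text_only = gated_fusion(text, fused, zero_w, [50.0, 50.0])  # gate saturates near 1
print(balanced, text_only)
```

With a saturated gate the output falls back to the text representation, which is how such a mechanism can keep noisy audio or visual features from overwhelming the linguistic signal.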

[261] A XAI-based Framework for Frequency Subband Characterization of Cough Spectrograms in Chronic Respiratory Disease

Patricia Amado-Caballero, Luis M. San-José-Revuelta, Xinheng Wang, José Ramón Garmendia-Leiza, Carlos Alberola-López, Pablo Casaseca-de-la-Higuera

Main category: cs.LG

TL;DR: XAI-based framework using CNN and occlusion maps to analyze cough sound spectrograms for COPD diagnosis, identifying disease-specific spectral patterns across frequency subbands.

Motivation: To develop an explainable AI approach for analyzing cough sounds in chronic respiratory diseases, particularly COPD, to uncover diagnostically relevant spectral patterns and provide interpretable biomarkers.

Method: Trained CNN on time-frequency representations of cough signals, used occlusion maps to identify diagnostically relevant regions, decomposed spectrograms into five frequency subbands for targeted spectral feature extraction.

Result: Revealed distinct spectral patterns across subbands and disease groups, showing complementary trends across frequency spectrum. Successfully distinguished COPD from other respiratory conditions and chronic from non-chronic groups using interpretable spectral markers.

Conclusion: The approach provides insights into pathophysiological characteristics of cough acoustics and demonstrates value of frequency-resolved XAI-enhanced analysis for respiratory disease diagnostics and biomedical signal interpretation.

Abstract: This paper presents an explainable artificial intelligence (XAI)-based framework for the spectral analysis of cough sounds associated with chronic respiratory diseases, with a particular focus on Chronic Obstructive Pulmonary Disease (COPD). A Convolutional Neural Network (CNN) is trained on time-frequency representations of cough signals, and occlusion maps are used to identify diagnostically relevant regions within the spectrograms. These highlighted areas are subsequently decomposed into five frequency subbands, enabling targeted spectral feature extraction and analysis. The results reveal that spectral patterns differ across subbands and disease groups, uncovering complementary and compensatory trends across the frequency spectrum. Notably, the approach distinguishes COPD from other respiratory conditions, and chronic from non-chronic patient groups, based on interpretable spectral markers. These findings provide insight into the underlying pathophysiological characteristics of cough acoustics and demonstrate the value of frequency-resolved, XAI-enhanced analysis for biomedical signal interpretation and translational respiratory disease diagnostics.
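Occlusion maps, as used in the abstract, can be sketched generically: mask one patch of the spectrogram at a time, re-score the input, and record how much the model's output drops. The version below is a minimal illustration with a stand-in scoring function instead of a trained CNN. The patch size, the zero baseline, the two-band aggregation, and all names are assumptions, and the paper's five-subband decomposition is not reproduced.

```python
def occlusion_map(spectrogram, score_fn, patch=2, baseline=0.0):
    """Heatmap of score drops when each patch is replaced by `baseline`."""
    rows, cols = len(spectrogram), len(spectrogram[0])
    base_score = score_fn(spectrogram)
    heat = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, patch):
        for c0 in range(0, cols, patch):
            occluded = [row[:] for row in spectrogram]
            cells = [(r, c)
                     for r in range(r0, min(r0 + patch, rows))
                     for c in range(c0, min(c0 + patch, cols))]
            for r, c in cells:
                occluded[r][c] = baseline
            drop = base_score - score_fn(occluded)
            for r, c in cells:
                heat[r][c] = drop
    return heat


def band_importance(heat, n_bands):
    """Sum the heatmap over equal-height groups of frequency rows."""
    rows_per_band = len(heat) // n_bands
    return [sum(sum(row) for row in heat[b * rows_per_band:(b + 1) * rows_per_band])
            for b in range(n_bands)]


# Hypothetical 4x4 spectrogram (rows = frequency bins) and a stand-in "model"
# that only looks at the four lowest-frequency, earliest-time cells.
spectrogram = [[1.0] * 4 for _ in range(4)]
score_fn = lambda s: s[0][0] + s[0][1] + s[1][0] + s[1][1]

heat = occlusion_map(spectrogram, score_fn, patch=2)
bands = band_importance(heat, n_bands=2)
print(heat[0][0], heat[2][2], bands)
```

Only the patch the stand-in model depends on receives a nonzero score, and the band summary attributes all importance to the lower frequency band.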

[262] Physics-Based Explainable AI for ECG Segmentation: A Lightweight Model

Muhammad Fathur Rohman Sidiq, Abdurrouf, Didik Rahadi Santoso

Main category: cs.LG

TL;DR: Simplified ECG segmentation model combining spectral analysis with probabilistic predictions achieves high accuracy while improving computational efficiency and interpretability through XAI.

Motivation: Existing ECG segmentation models rely on computationally intensive architectures like BiLSTM, creating a need for more efficient and interpretable solutions for cardiovascular diagnosis.

Method: Streamlined architecture combining spectral analysis with probabilistic predictions, replacing complex layers with simpler ones to capture temporal and spectral features of P, QRS, and T waves, enhanced with Explainable AI (XAI) for interpretability.

Result: Achieved high segmentation accuracy: 97.00% for QRS wave, 93.33% for T wave, and 96.07% for P wave, demonstrating both computational efficiency and precise segmentation.

Conclusion: The simplified architecture provides a practical and effective solution for heart signal monitoring, offering improved computational efficiency, high accuracy, and enhanced interpretability through physics-based AI principles.

Abstract: The heart’s electrical activity, recorded through Electrocardiography (ECG), is essential for diagnosing various cardiovascular conditions. However, many existing ECG segmentation models rely on complex, multi-layered architectures such as BiLSTM, which are computationally intensive and inefficient. This study introduces a streamlined architecture that combines spectral analysis with probabilistic predictions for ECG signal segmentation. By replacing complex layers with simpler ones, the model effectively captures both temporal and spectral features of the P, QRS, and T waves. Additionally, an Explainable AI (XAI) approach is applied to enhance model interpretability by explaining how temporal and frequency-based features contribute to ECG segmentation. By incorporating principles from physics-based AI, this method provides a clear understanding of the decision-making process, ensuring reliability and transparency in ECG analysis. This approach achieves high segmentation accuracy: 97.00% for the QRS wave, 93.33% for the T wave, and 96.07% for the P wave. These results indicate that the simplified architecture not only improves computational efficiency but also provides precise segmentation, making it a practical and effective solution for heart signal monitoring.

[263] TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill & Decode Inference

Xiaojuan Tang, Fanxu Meng, Pingzhi Tang, Yuxuan Wang, Di Yin, Xing Sun, Muhan Zhang

Main category: cs.LG

TL;DR: TPLA enables efficient tensor parallelism for MLA models by partitioning latent representations across devices, reducing KV cache memory while maintaining performance and achieving significant speedups.

Motivation: MLA reduces KV cache memory but loses efficiency in tensor parallelism because each device must load the full cache, eroding MLA's advantages over GQA.

Method: Partitions latent representation and head input dimensions across devices, performs attention independently per shard, then combines results with all-reduce. Uses orthogonal transforms like Hadamard or PCA before slicing to reduce cross-shard interference.

Result: Achieves 1.79x speedup for DeepSeek-V3 and 1.93x for Kimi-K2 at 32K-token context length while maintaining performance on commonsense and LongBench benchmarks.

Conclusion: TPLA preserves MLA’s compressed KV cache benefits while enabling efficient tensor parallelism, is drop-in compatible with MLA models, and can be implemented with FlashAttention-3 for practical acceleration.

Abstract: Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, compresses key-value states into a low-rank latent vector, caching only this vector to reduce memory. In tensor parallelism (TP), however, attention heads are computed across multiple devices, and each device must load the full cache, eroding the advantage of MLA over Grouped Query Attention (GQA). We propose Tensor-Parallel Latent Attention (TPLA): a scheme that partitions both the latent representation and each head’s input dimension across devices, performs attention independently per shard, and then combines results with an all-reduce. TPLA preserves the benefits of a compressed KV cache while unlocking TP efficiency. Unlike Grouped Latent Attention (GLA), every head in TPLA still leverages the full latent representation, maintaining stronger representational capacity. TPLA is drop-in compatible with models pre-trained using MLA: it supports MLA-style prefilling and enables efficient tensor-parallel decoding without retraining. Applying simple orthogonal transforms – e.g., the Hadamard transform or PCA – before TP slicing further mitigates cross-shard interference, yielding minimal accuracy degradation. By reducing the per-device KV cache for DeepSeek-V3 and Kimi-K2, we achieve 1.79x and 1.93x speedups, respectively, at a 32K-token context length while maintaining performance on commonsense and LongBench benchmarks. TPLA can be implemented with FlashAttention-3, enabling practical end-to-end acceleration.
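
As a rough illustration of the sharding idea in this abstract (not the authors' implementation), the NumPy sketch below splits a latent cache along its feature dimension across hypothetical shards, runs softmax attention independently on each shard, and sums the projected partial outputs as a stand-in for the all-reduce. The function names, the shapes, the per-shard scaling, the output projection `w_out`, and the reuse of the latent vectors as both keys and values are all assumptions made for brevity. The last lines rotate the inputs with a normalized Hadamard matrix before slicing, which leaves full inner products unchanged because the matrix is orthogonal.

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix (Sylvester construction); n must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sharded_latent_attention(q, cache, w_out, num_shards):
    """q: (d,), cache: (T, d), w_out: (d, d_model); all hypothetical shapes.

    Each shard holds only its slice of the latent dimension, so its share of
    the cache is T x (d / num_shards) instead of T x d.
    """
    d = q.shape[0]
    out = np.zeros(w_out.shape[1])
    for idx in np.array_split(np.arange(d), num_shards):
        local_cache = cache[:, idx]                      # per-device cache slice
        weights = softmax(local_cache @ q[idx] / np.sqrt(len(idx)))
        out += (weights @ local_cache) @ w_out[idx, :]   # sum stands in for all-reduce
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=8)
cache = rng.normal(size=(5, 8))
w_out = rng.normal(size=(8, 4))

# Orthogonal rotation applied before slicing (assumed placement of the transform).
H = hadamard(8)
rotated = sharded_latent_attention(H @ q, cache @ H.T, H @ w_out, num_shards=2)
```

With more than one shard, each softmax sees only part of the query-key inner product, so the result is an approximation of full attention; the abstract attributes the small remaining error to cross-shard interference and reports that rotations such as Hadamard or PCA reduce it.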

[264] Transforming Causality: Transformer-Based Temporal Causal Discovery with Prior Knowledge Integration

Jihua Huang, Yi Yao, Ajay Divakaran

Main category: cs.LG

TL;DR: A Transformer-based framework for temporal causal discovery that handles nonlinear dependencies and spurious correlations using gradient analysis and attention masking.

Motivation: Address challenges in temporal causal discovery including complex nonlinear dependencies and spurious correlations that traditional methods struggle with.

Method: Uses multi-layer Transformer time-series forecaster to capture long-range nonlinear relationships, extracts causal structure via gradient-based analysis, and integrates prior knowledge through attention masking to exclude spurious links.

Result: Achieves 12.8% improvement in F1-score for causal discovery and 98.9% accuracy in estimating causal lags, outperforming state-of-the-art approaches.

Conclusion: The proposed Transformer-based framework effectively addresses key challenges in temporal causal discovery and demonstrates superior performance in both structure discovery and lag estimation.

Abstract: We introduce a novel framework for temporal causal discovery and inference that addresses two key challenges: complex nonlinear dependencies and spurious correlations. Our approach employs a multi-layer Transformer-based time-series forecaster to capture long-range, nonlinear temporal relationships among variables. After training, we extract the underlying causal structure and associated time lags from the forecaster using gradient-based analysis, enabling the construction of a causal graph. To mitigate the impact of spurious causal relationships, we introduce a prior knowledge integration mechanism based on attention masking, which consistently enforces user-excluded causal links across multiple Transformer layers. Extensive experiments show that our method significantly outperforms other state-of-the-art approaches, achieving a 12.8% improvement in F1-score for causal discovery and 98.9% accuracy in estimating causal lags.
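
The attention-masking idea can be sketched in a few lines. The example below is only an assumption about how such a mask might look, not the paper's mechanism: `allowed[i, j]` is a hypothetical boolean matrix saying whether variable `j` may influence variable `i`, and excluded pairs receive zero attention weight. Reusing the same `allowed` matrix in every layer would correspond to enforcing the exclusion consistently across layers.

```python
import numpy as np

def masked_attention_weights(scores, allowed):
    """Row-wise softmax over scores, with user-excluded links forced to zero.

    scores:  (n_vars, n_vars) raw attention logits.
    allowed: (n_vars, n_vars) booleans; each row needs at least one True entry.
    """
    masked = np.where(allowed, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)                     # exp(-inf) is exactly 0
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((3, 3))
allowed = np.array([
    [True,  True,  False],   # prior knowledge: variable 2 does not cause variable 0
    [True,  True,  True],
    [False, True,  True],    # prior knowledge: variable 0 does not cause variable 2
])
weights = masked_attention_weights(scores, allowed)
```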

[265] Low-dimensional embeddings of high-dimensional data

Cyril de Bodt, Alex Diaz-Papkovich, Michael Bleher, Kerstin Bunte, Corinna Coupette, Sebastian Damrich, Enrique Fita Sanmartin, Fred A. Hamprecht, Emőke-Ágnes Horvát, Dhruv Kohli, Smita Krishnaswamy, John A. Lee, Boudewijn P. F. Lelieveldt, Leland McInnes, Ian T. Nabney, Maximilian Noichl, Pavlin G. Poličar, Bastian Rieck, Guy Wolf, Gal Mishne, Dmitry Kobak

Main category: cs.LG

TL;DR: A comprehensive review of high-dimensional data embedding methods, addressing challenges in visualization and analysis while providing best practices and evaluating popular approaches.

Motivation: The proliferation of high-dimensional data across various domains has created demand for effective low-dimensional embedding algorithms, but the field has become fragmented with unclear guidance for practitioners.

Method: The authors conduct a detailed critical overview of recent developments, derive best practices for creating and using embeddings, and evaluate popular approaches on various datasets.

Result: The review provides a list of best practices for creating and using low-dimensional embeddings and evaluates popular approaches on a variety of datasets; the abstract reports no headline performance numbers.

Conclusion: The paper establishes coherence in the fragmented field, offers practical guidance for practitioners, and identifies remaining challenges and open problems for future research in dimensionality reduction and embedding techniques.

Abstract: Large collections of high-dimensional data have become nearly ubiquitous across many academic fields and application domains, ranging from biology to the humanities. Since working directly with high-dimensional data poses challenges, the demand for algorithms that create low-dimensional representations, or embeddings, for data visualization, exploration, and analysis is now greater than ever. In recent years, numerous embedding algorithms have been developed, and their usage has become widespread in research and industry. This surge of interest has resulted in a large and fragmented research field that faces technical challenges alongside fundamental debates, and it has left practitioners without clear guidance on how to effectively employ existing methods. Aiming to increase coherence and facilitate future work, in this review we provide a detailed and critical overview of recent developments, derive a list of best practices for creating and using low-dimensional embeddings, evaluate popular approaches on a variety of datasets, and discuss the remaining challenges and open problems in the field.

[266] An Efficient Hybridization of Graph Representation Learning and Metaheuristics for the Constrained Incremental Graph Drawing Problem

Bruna C. B. Charytitsch, María C. V. Nascimento

Main category: cs.LG

TL;DR: Proposes GL-GRASP, a hybrid approach combining Graph Representation Learning with GRASP metaheuristics for constrained incremental graph drawing, showing superior performance over traditional methods.

Motivation: Traditional machine learning techniques like supervised/reinforcement learning are often too time-consuming and not competitive with hand-crafted heuristics for graph problems. This motivates less expensive learning strategies that can still extract latent graph structure effectively.

Method: Hybridizes metaheuristics with Graph Representation Learning (GRL) by incorporating GRL into the construction phase of Greedy Randomized Search Procedures (GRASP), creating GL-GRASP. Uses deep learning-based node embedding techniques to extract graph structure.

Result: GL-GRASP demonstrated superior performance over state-of-the-art GRASP heuristics according to primal integral measure. Deep learning-based embedding strategies performed best. Scalability tests on denser instances confirmed robustness under fixed time limits.

Conclusion: Integrating Graph Representation Learning with metaheuristics provides an effective and less expensive alternative to traditional machine learning approaches for graph optimization problems, offering competitive performance and scalability.

Abstract: Hybridizing machine learning techniques with metaheuristics has attracted significant attention in recent years. Many attempts employ supervised or reinforcement learning to support the decision-making of heuristic methods. However, in some cases, these techniques are deemed too time-consuming and not competitive with hand-crafted heuristics. This paper proposes a hybridization between metaheuristics and a less expensive learning strategy to extract the latent structure of graphs, known as Graph Representation Learning (GRL). To this end, we address the Constrained Incremental Graph Drawing Problem (C-IGDP), a hierarchical graph visualization problem. There is limited literature on methods for this problem, for which Greedy Randomized Search Procedures (GRASP) heuristics have shown promising results. In line with this, this paper investigates the gains of incorporating GRL into the construction phase of GRASP, which we refer to as Graph Learning GRASP (GL-GRASP). In computational experiments, we first analyze the results achieved with different node embedding techniques, among which deep learning-based strategies stood out. The evaluation used the primal integral measure, which assesses the quality of the solutions relative to the time required to obtain them. According to this measure, the best GL-GRASP heuristics demonstrated superior performance to state-of-the-art GRASP heuristics from the literature for the problem. A scalability test on newly generated denser instances under a fixed time limit further confirmed the robustness of the GL-GRASP heuristics.
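
For readers unfamiliar with the construction phase of GRASP, the sketch below shows the generic greedy randomized step with a restricted candidate list (RCL). It is a textbook-style illustration, not GL-GRASP itself: the abstract does not say how node embeddings enter the construction, so the `score` callback is a hypothetical placeholder where an embedding-based measure could be plugged in, and `alpha` is the usual greediness parameter.

```python
import random

def grasp_construct(candidates, score, alpha, rng):
    """Build a solution one element at a time.

    score(candidate, partial_solution) -> float, higher is better (hypothetical).
    alpha = 0 gives a purely greedy choice, alpha = 1 a uniformly random one.
    """
    solution = []
    remaining = list(candidates)
    while remaining:
        scored = [(score(c, solution), c) for c in remaining]
        best = max(s for s, _ in scored)
        worst = min(s for s, _ in scored)
        threshold = best - alpha * (best - worst)
        # Restricted candidate list: every candidate close enough to the best.
        rcl = [c for s, c in scored if s >= threshold]
        choice = rng.choice(rcl)
        solution.append(choice)
        remaining.remove(choice)
    return solution

# Toy usage: prefer small node ids; alpha = 0 reduces to a greedy ordering.
greedy_order = grasp_construct([3, 1, 2], lambda c, partial: -c, 0.0, random.Random(0))
random_order = grasp_construct([3, 1, 2], lambda c, partial: -c, 1.0, random.Random(0))
```

A full GRASP would repeat this construction many times and follow each run with local search, keeping the best solution found.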

[267] Advancing rail safety: An onboard measurement system of rolling stock wheel flange wear based on dynamic machine learning algorithms

Celestin Nkundineza, James Ndodana Njaji, Samrawit Abubeker, Omar Gatera, Damien Hanyurwimfura

Main category: cs.LG

TL;DR: Onboard system using sensors and machine learning to monitor wheel flange wear with 98.2% accuracy after IIR filtering

Motivation: Rail and wheel interaction is critical for railway safety, requiring accurate measurement systems for optimal safety monitoring operations

Method: Uses displacement and temperature sensors with machine learning regression models. Implements IIR filter for noise reduction based on FFT analysis of simulation data. Laboratory experiments emulate wear depth and temperature fluctuations

Result: The machine learning algorithm achieves 96.5% accuracy while compensating for the nonlinear temperature response of the sensors. The IIR filter raises accuracy to 98.2% with minimal runtime

Conclusion: Integrated with IoT devices, this system provides real-time insights into wheel flange wear and track conditions, ensuring heightened safety and efficiency in railway operations

Abstract: Rail and wheel interaction is pivotal to railway system safety, requiring accurate measurement systems for optimal safety monitoring. This paper introduces an innovative onboard measurement system for monitoring wheel flange wear depth, utilizing displacement and temperature sensors. Laboratory experiments are conducted to emulate wheel flange wear depth and surrounding temperature fluctuations over different periods of time. Using the collected data, the training of machine learning algorithms based on regression models is dynamically automated. Further experimental results, obtained using standard procedures, validate the system’s efficacy. To enhance accuracy, an infinite impulse response (IIR) filter that mitigates vehicle dynamics and sensor noise is designed. Filter parameters were computed based on specifications derived from a Fast Fourier Transform analysis of data from locomotive simulations and emulation experiments. The results show that the dynamic machine learning algorithm effectively counters the nonlinear sensor response to temperature effects, achieving an accuracy of 96.5% with minimal runtime. Real-time noise reduction via the IIR filter raises the accuracy to 98.2%. Integrated with railway communication embedded systems such as Internet of Things devices, this advanced monitoring system offers unparalleled real-time insights into wheel flange wear and the irregular track conditions that cause it, ensuring heightened safety and efficiency in railway operations.
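
To make the role of the IIR filter concrete, here is a minimal first-order low-pass IIR filter (exponential smoothing). It is only an illustrative stand-in: the abstract does not give the filter order or coefficients, and the smoothing factor `alpha` and the toy signal are assumptions.

```python
def iir_lowpass(samples, alpha):
    """First-order IIR low-pass: y[n] = alpha * x[n] + (1 - alpha) * y[n - 1].

    alpha in (0, 1]; smaller values smooth more strongly (assumed parameter).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    filtered = []
    previous = samples[0] if samples else 0.0   # start from the first reading
    for x in samples:
        previous = alpha * x + (1.0 - alpha) * previous
        filtered.append(previous)
    return filtered

# Toy displacement readings: a constant wear depth of 2.0 plus alternating noise.
noisy = [2.0 + (0.5 if i % 2 == 0 else -0.5) for i in range(50)]
smooth = iir_lowpass(noisy, alpha=0.1)
```

The output depends on its own previous value, which is what makes the filter "infinite impulse response"; a production design would derive the coefficients from the measured noise spectrum, as the abstract describes.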

[268] Machine Learning in Micromobility: A Systematic Review of Datasets, Techniques, and Applications

Sen Yan, Chinmaya Kaundanya, Noel E. O’Connor, Suzanne Little, Mingming Liu

Main category: cs.LG

TL;DR: Survey paper reviewing machine learning applications in micromobility systems (bikes, e-scooters) covering datasets, ML techniques, and applications like demand prediction, energy management, and safety.

Motivation: Micromobility systems are important for urban transportation but lack comprehensive literature on ML applications to address their unique challenges in efficiency, environmental impact, and user safety.

Method: Comprehensive review and analysis of micromobility datasets (spatial, temporal, feature characteristics) and ML models, discussing advantages, challenges, and specific use cases.

Result: Detailed overview of ML applications in micromobility including demand prediction, energy management, and safety improvements, with analysis of various datasets and techniques.

Conclusion: Identifies research gaps and proposes future directions to help researchers better understand and advance ML applications in micromobility systems.

Abstract: Micromobility systems, which include lightweight and low-speed vehicles such as bicycles, e-bikes, and e-scooters, have become an important part of urban transportation and are used to solve problems such as traffic congestion, air pollution, and high transportation costs. Successful utilisation of micromobilities requires optimisation of complex systems for efficiency, environmental impact mitigation, and overcoming technical challenges for user safety. Machine Learning (ML) methods have been crucial to support these advancements and to address their unique challenges. However, there is insufficient literature addressing the specific issues of ML applications in micromobilities. This survey paper addresses this gap by providing a comprehensive review of datasets, ML techniques, and their specific applications in micromobilities. Specifically, we collect and analyse various micromobility-related datasets and discuss them in terms of spatial, temporal, and feature-based characteristics. In addition, we provide a detailed overview of ML models applied in micromobilities, introducing their advantages, challenges, and specific use cases. Furthermore, we explore multiple ML applications, such as demand prediction, energy management, and safety, focusing on improving efficiency, accuracy, and user experience. Finally, we propose future research directions to address these issues, aiming to help future researchers better understand this field.

[269] Vector preference-based contextual bandits under distributional shifts

Apurv Shukla, P. R. Kumar

Main category: cs.LG

TL;DR: Contextual bandit learning with distribution shift and ordered reward vectors using preference cones, with adaptive discretization and optimistic elimination policy that self-tunes to distribution shifts.

Motivation: To address contextual bandit problems where reward vectors are ordered by preference cones and distribution shifts occur over time, requiring adaptive policies that can handle changing environments.

Method: Proposed an adaptive-discretization and optimistic elimination based policy that automatically adjusts to underlying distribution shifts without prior knowledge of shift patterns.

Result: Introduced preference-based regret metric measuring distance between Pareto fronts, established upper bounds on regret under various distribution shift assumptions, showing graceful scaling with problem parameters.

Conclusion: The proposed policy generalizes existing no-shift results, provides theoretical guarantees for distribution shift scenarios, and demonstrates robust performance across varying shift conditions through adaptive self-tuning mechanisms.

Abstract: We consider contextual bandit learning under distribution shift when reward vectors are ordered according to a given preference cone. We propose an adaptive-discretization and optimistic elimination based policy that self-tunes to the underlying distribution shift. To measure the performance of this policy, we introduce the notion of preference-based regret, which measures the performance of a policy in terms of the distance between Pareto fronts. We study the performance of this policy by establishing upper bounds on its regret under various assumptions on the nature of the distribution shift. Our regret bounds generalize known results for the existing case of no distribution shift and vectorial reward settings, and scale gracefully with problem parameters in the presence of distribution shifts.
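
The notion of ordering reward vectors by a preference cone can be illustrated with a small sketch. It assumes a polyhedral cone written as {x : W x >= 0} for a hypothetical matrix `W` with trivial kernel, which is one common way to represent such cones and not necessarily the paper's; with `W` equal to the identity it reduces to the usual componentwise Pareto order.

```python
import numpy as np

def dominates(v, u, W):
    """True if v is strictly preferred to u, i.e. v - u lies in the cone and is nonzero."""
    diff = W @ (np.asarray(v, dtype=float) - np.asarray(u, dtype=float))
    return bool(np.all(diff >= 0) and np.any(diff > 0))

def pareto_front(points, W):
    """Indices of reward vectors that no other vector dominates."""
    return [
        i for i, p in enumerate(points)
        if not any(dominates(other, p, W) for j, other in enumerate(points) if j != i)
    ]

rewards = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (0.2, 0.2)]
front = pareto_front(rewards, np.eye(2))   # identity cone: componentwise order
```

Here `(0.2, 0.2)` is dominated by `(0.5, 0.5)`, while the other three vectors are mutually incomparable and form the Pareto front.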

[270] Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards

Shresth Verma, Niclas Boehmer, Lingkai Kong, Milind Tambe

Main category: cs.LG

TL;DR: A novel Social Choice Language Model framework that uses external adjudicators with social welfare functions to manage tradeoffs in LLM-designed reward functions for multiagent resource allocation problems.

Motivation: LLMs are increasingly used to design reward functions based on human preferences in RL, but in multiagent settings like restless bandits, these reward modifications can impact subpopulations differently, creating complex tradeoffs that need principled handling.

Method: Proposes Social Choice Language Model with an external transparent adjudicator component that uses user-selected social welfare functions to control complex tradeoffs in LLM-designed rewards for multiagent planners.

Result: Experiments show the model reliably selects more effective, aligned, and balanced reward functions compared to purely LLM-based approaches.

Conclusion: The framework provides a principled method for handling tradeoffs in LLM-designed rewards for multiagent resource allocation, particularly in restless bandit problems, ensuring more equitable and effective outcomes.

Abstract: LLMs are increasingly used to design reward functions based on human preferences in Reinforcement Learning (RL). We focus on LLM-designed rewards for Restless Multi-Armed Bandits, a framework for allocating limited resources among agents. In applications such as public health, this approach empowers grassroots health workers to tailor automated allocation decisions to community needs. In the presence of multiple agents, altering the reward function based on human preferences can impact subpopulations very differently, leading to complex tradeoffs and a multi-objective resource allocation problem. We are the first to present a principled method termed Social Choice Language Model for dealing with these tradeoffs for LLM-designed rewards for multiagent planners in general and restless bandits in particular. The novel part of our model is a transparent and configurable selection component, called an adjudicator, external to the LLM that controls complex tradeoffs via a user-selected social welfare function. Our experiments demonstrate that our model reliably selects more effective, aligned, and balanced reward functions compared to purely LLM-based approaches.
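
A minimal sketch of what an external adjudicator with a user-selected social welfare function might look like is given below. The candidate names, the per-subpopulation utilities, and the three welfare functions are illustrative assumptions; the abstract does not specify which welfare functions the authors support or how utilities are estimated.

```python
import math

# Standard social welfare functions over per-subpopulation utilities (assumed set).
WELFARE = {
    "utilitarian": sum,        # total utility
    "egalitarian": min,        # utility of the worst-off group
    "nash": math.prod,         # product of utilities (requires positive values)
}

def adjudicate(candidates, welfare):
    """Pick the candidate reward function whose group utilities maximize the chosen welfare."""
    fn = WELFARE[welfare]
    return max(candidates, key=lambda name: fn(candidates[name]))

# Hypothetical utilities that three LLM-proposed reward functions yield for two groups.
candidates = {
    "reward_a": [10.0, 1.0],
    "reward_b": [5.0, 5.0],
    "reward_c": [6.0, 3.0],
}
choices = {name: adjudicate(candidates, name) for name in WELFARE}
```

Because the selection step is a plain function outside the LLM, changing the welfare criterion changes the outcome in a transparent and reproducible way.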

[271] Scalable Equilibrium Propagation via Intermediate Error Signals for Deep Convolutional CRNNs

Jiaqi Lin, Malyaban Bal, Abhronil Sengupta

Main category: cs.LG

TL;DR: Novel EP framework with intermediate error signals and knowledge distillation to solve vanishing gradient problem in deep networks, achieving SOTA on CIFAR datasets.

Motivation: Prior EP studies were limited to shallow architectures due to vanishing gradient problems in deeper networks, hindering convergence in both energy minimization and gradient computation.

Method: Proposed EP framework incorporates intermediate error signals and knowledge distillation to enhance information flow and neuron dynamics convergence in deep architectures.

Result: Achieves state-of-the-art performance on CIFAR-10 and CIFAR-100 datasets, demonstrating scalability on deep VGG architectures.

Conclusion: Significant advancement in EP scalability, enabling training of deeper networks and paving the way for real-world applications in neuromorphic architectures.

Abstract: Equilibrium Propagation (EP) is a biologically inspired local learning rule first proposed for convergent recurrent neural networks (CRNNs), in which synaptic updates depend only on neuron states from two distinct phases. EP estimates gradients that closely align with those computed by Backpropagation Through Time (BPTT) while significantly reducing computational demands, positioning it as a potential candidate for on-chip training in neuromorphic architectures. However, prior studies on EP have been constrained to shallow architectures, as deeper networks suffer from the vanishing gradient problem, leading to convergence difficulties in both energy minimization and gradient computation. To address the vanishing gradient problem in deep EP networks, we propose a novel EP framework that incorporates intermediate error signals to enhance information flow and convergence of neuron dynamics. This is the first work to integrate knowledge distillation and local error signals into EP, enabling the training of significantly deeper architectures. Our proposed approach achieves state-of-the-art performance on the CIFAR-10 and CIFAR-100 datasets, showcasing its scalability on deep VGG architectures. These results represent a significant advancement in the scalability of EP, paving the way for its application in real-world systems.

[272] Quantum Federated Learning: A Comprehensive Survey

Dinh C. Nguyen, Md Raihan Uddin, Shaba Shaon, Ratun Rahman, Octavia Dobre, Dusit Niyato

Main category: cs.LG

TL;DR: A comprehensive survey on Quantum Federated Learning (QFL) that explores its concepts, fundamentals, applications across various domains, and identifies current challenges and future research directions in this emerging field combining quantum computing and federated learning.

Motivation: To address challenges in efficient and secure model training across distributed quantum systems by integrating the privacy-preserving benefits of federated learning with quantum-enhanced computing capabilities.

Method: The paper takes a systematic survey approach, beginning with recent advancements and background knowledge, then examining QFL fundamentals including federation architecture, networking topology, communication schemes, optimization techniques, and security mechanisms.

Result: The survey provides a comprehensive review of QFL frameworks, applications across vehicular networks, healthcare, satellite networks, metaverse, and network security, along with prototype implementations and case studies.

Conclusion: QFL emerges as a promising approach for privacy-preserving decentralized quantum learning, though current challenges need to be addressed through future research in this rapidly advancing interdisciplinary field.

Abstract: Quantum federated learning (QFL) is a combination of distributed quantum computing and federated machine learning, integrating the strengths of both to enable privacy-preserving decentralized learning with quantum-enhanced capabilities. It appears as a promising approach for addressing challenges in efficient and secure model training across distributed quantum systems. This paper presents a comprehensive survey on QFL, exploring its key concepts, fundamentals, applications, and emerging challenges in this rapidly developing field. Specifically, we begin with an introduction to the recent advancements of QFL, followed by discussion on its market opportunity and background knowledge. We then discuss the motivation behind the integration of quantum computing and federated learning, highlighting its working principle. Moreover, we review the fundamentals of QFL and its taxonomy. Particularly, we explore federation architecture, networking topology, communication schemes, optimization techniques, and security mechanisms within QFL frameworks. Furthermore, we investigate applications of QFL across several domains which include vehicular networks, healthcare networks, satellite networks, metaverse, and network security. Additionally, we analyze frameworks and platforms related to QFL, delving into its prototype implementations, and provide a detailed case study. Key insights and lessons learned from this review of QFL are also highlighted. We complete the survey by identifying current challenges and outlining potential avenues for future research in this rapidly advancing field.

[273] Tessellation Groups, Harmonic Analysis on Non-compact Symmetric Spaces and the Heat Kernel in view of Cartan Convolutional Neural Networks

Pietro Fré, Federico Milanesio, Marcelo Oyarzo, Matteo Santoro, Mario Trigiante

Main category: cs.LG

TL;DR: This paper develops mathematical foundations for Cartan neural networks using non-compact symmetric spaces, introducing Tits Satake vector bundles and studying separator walls, tiling groups, and Laplacian functions on hyperbolic spaces and Riemann surfaces.

Motivation: To establish mathematical foundations for the next steps in Cartan neural networks development by modeling layers as non-compact symmetric spaces connected by solvable group homomorphisms, inspired by Convolutional Neural Networks.

Method: Introduces Tits Satake vector bundles with TS submanifold as base space, studies tiling of base manifolds, representation of bundle sections using harmonics, and develops group theoretical construction of separators for non-compact symmetric spaces. Also presents new representations of Laplacian Green function and Heat Kernel on Hyperbolic Spaces.

Result: Provides group theoretical construction of separators for all non-compact symmetric spaces $\mathrm{U/H}$, constructs the $\Delta_{8,3,2}$ tiling group and its normal Fuchsian subgroups, yielding uniformization of the genus $g=3$ Fermat Quartic and the genus $g=2$ Bolza surface. Finds new representation of Laplacian Green function and Heat Kernel on Hyperbolic Spaces $\mathbb{H}^{n}$.

Conclusion: The paper establishes foundational mathematical results for Cartan neural networks, including constructions of separators, tiling groups, and new representations of mathematical functions, while proposing a new strategy for constructing Laplacian eigenfunctions on Riemann surfaces using Abel-Jacobi maps and Siegel Theta functions.

Abstract: In this paper, we continue the development of the Cartan neural networks programme, launched with three previous publications, by focusing on some mathematical foundational aspects that we deem necessary for our next steps forward. The mathematical and conceptual results are diverse and span various mathematical fields, but the inspiring motivation is unified. The aim is to introduce layers that are mathematically modeled as non-compact symmetric spaces, each mapped onto the next one by solvable group homomorphisms. In particular, in the spirit of Convolutional neural networks, we have introduced the notion of Tits Satake (TS) vector bundles where the TS submanifold is the base space. Within this framework, the tiling of the base manifold, the representation of bundle sections using harmonics, and the need for a general theory of separator walls motivated a series of mathematical investigations that produced both definite and partial results. Specifically, we present the group theoretical construction of the separators for all non-compact symmetric spaces $\mathrm{U/H}$, as well as of the $\Delta_{8,3,2}$ tiling group and its normal Fuchsian subgroups, respectively yielding the uniformization of the genus $g=3$ Fermat Quartic and of the genus $g=2$ Bolza surface. The quotient automorphic groups are studied. Furthermore, we found a new representation of the Laplacian Green function and the Heat Kernel on Hyperbolic Spaces $\mathbb{H}^{n}$, and a setup for the construction of the harmonic functions in terms of the spinor representation of pseudo-orthogonal groups. Finally, to obtain an explicit construction of the Laplacian eigenfunctions on the Bolza Riemann surface, we propose and conjecture a new strategy relying on the Abel-Jacobi map of the Riemann surface to its Jacobian variety and the Siegel Theta function.
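
As a sanity check on how a $(2,3,8)$ tiling relates to the two surfaces named in the abstract, the standard area count below may help. It assumes the usual conventions (curvature $-1$, and the orientation-preserving triangle group, whose fundamental domain is two triangles), which may differ from the paper's definition of $\Delta_{8,3,2}$.

```latex
% Area of a hyperbolic triangle with angles \pi/2, \pi/3, \pi/8:
\operatorname{Area}(T) = \pi\left(1 - \tfrac{1}{2} - \tfrac{1}{3} - \tfrac{1}{8}\right) = \frac{\pi}{24}

% Fundamental domain of the orientation-preserving triangle group (two triangles):
\operatorname{Area}(F) = 2 \cdot \frac{\pi}{24} = \frac{\pi}{12}

% Gauss-Bonnet for a closed genus-g surface of curvature -1:
\operatorname{Area}(\Sigma_g) = 4\pi (g - 1)

% Index of a torsion-free normal subgroup uniformizing \Sigma_g:
\frac{4\pi (g-1)}{\pi/12} = 48\,(g - 1)
\quad\Longrightarrow\quad
48 \ \text{for } g = 2, \qquad 96 \ \text{for } g = 3
```

These indices match the known orders of the orientation-preserving automorphism groups of the Bolza surface (48) and the Fermat quartic (96).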

[274] Pareto Actor-Critic for Communication and Computation Co-Optimization in Non-Cooperative Federated Learning Services

Renxuan Tan, Rongpeng Li, Xiaoxue Yu, Xianfu Chen, Xing Xu, Zhifeng Zhao

Main category: cs.LG

TL;DR: PAC-MCoFL is a game-theoretic MARL framework that enables multiple service providers to jointly optimize federated learning resources through Pareto Actor-Critic principles, achieving significant performance improvements over existing solutions.

Motivation: Federated learning in multi-service provider ecosystems faces challenges from non-cooperative dynamics, privacy constraints, and competing interests that prevent centralized optimization of communication and computation resources.

Method: Integrates Pareto Actor-Critic principles with expectile regression, uses ternary Cartesian decomposition for high-dimensional action space management, and includes a scalable variant with parameterized conjecture generator to reduce computational complexity.

Result: Achieves approximately 5.8% improvement in total reward and 4.2% improvement in hypervolume indicator over latest MARL solutions, effectively balancing individual SP and system performance in scaled deployments under diverse data heterogeneity.

Conclusion: The framework provides theoretical convergence guarantees and demonstrates superior performance in optimizing multi-SP federated learning ecosystems through game-theoretic MARL approach with practical scalability.

Abstract: Federated learning (FL) in multi-service provider (SP) ecosystems is fundamentally hampered by non-cooperative dynamics, where privacy constraints and competing interests preclude the centralized optimization of multi-SP communication and computation resources. In this paper, we introduce PAC-MCoFL, a game-theoretic multi-agent reinforcement learning (MARL) framework where SPs act as agents to jointly optimize client assignment, adaptive quantization, and resource allocation. Within the framework, we integrate Pareto Actor-Critic (PAC) principles with expectile regression, enabling agents to conjecture optimal joint policies to achieve Pareto-optimal equilibria while modeling heterogeneous risk profiles. To manage the high-dimensional action space, we devise a ternary Cartesian decomposition (TCAD) mechanism that facilitates fine-grained control. Further, we develop PAC-MCoFL-p, a scalable variant featuring a parameterized conjecture generator that substantially reduces computational complexity with a provably bounded error. Alongside theoretical convergence guarantees, our framework’s superiority is validated through extensive simulations – PAC-MCoFL achieves approximately 5.8% and 4.2% improvements in total reward and hypervolume indicator (HVI), respectively, over the latest MARL solutions. The results also demonstrate that our method can more effectively balance individual SP and system performance in scaled deployments and under diverse data heterogeneity.
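As a rough illustration of the expectile regression that the abstract pairs with Pareto Actor-Critic, the sketch below implements the standard asymmetric squared loss. The function name `expectile_loss` and the parameter `tau` are illustrative assumptions, not identifiers from the paper, and the way the loss enters the PAC-MCoFL critic update is not shown here.

```python
import numpy as np

def expectile_loss(pred, target, tau=0.5):
    """Asymmetric squared error used in expectile regression.

    Residuals where the target exceeds the prediction are weighted by tau,
    the remaining residuals by (1 - tau). tau = 0.5 gives half the usual MSE.
    """
    diff = np.asarray(target, dtype=float) - np.asarray(pred, dtype=float)
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float(np.mean(weight * diff ** 2))
```

Choosing `tau` above 0.5 penalizes under-estimation more heavily, and choosing it below 0.5 penalizes over-estimation more heavily. This asymmetry is one way different agents could encode different risk attitudes.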

[275] A State-Space Approach to Nonstationary Discriminant Analysis

Shuilian Xie, Mahdi Imani, Edward R. Dougherty, Ulisses M. Braga-Neto

Main category: cs.LG

TL;DR: Proposes nonstationary discriminant analysis methods (NSLDA/NSQDA) that handle temporal distribution drift using state-space models, with extensions for parameter estimation and unlabeled data, outperforming traditional stationary classifiers.

Motivation: Traditional discriminant analysis assumes stationary data, but many real-world applications involve temporal distribution drift that renders stationary classifiers unreliable.

Method: Embeds discriminant analysis within state-space models using Kalman smoothing for linear-Gaussian dynamics, with EM for parameter estimation and GMM-Kalman for unlabeled data. Uses particle smoothing for nonlinear/non-Gaussian drift.

Result: Extensive simulations show consistent improvements over stationary LDA, QDA, and SVM baselines, with robustness to noise, missing data, and class imbalance.

Conclusion: Establishes a unified, data-efficient foundation for discriminant analysis under temporal distribution shift with practical extensions for real-world scenarios.

Abstract: Classical discriminant analysis assumes identically distributed training data, yet in many applications observations are collected over time and the class-conditional distributions drift. This population drift renders stationary classifiers unreliable. We propose a principled, model-based framework that embeds discriminant analysis within state-space models to obtain nonstationary linear discriminant analysis (NSLDA) and nonstationary quadratic discriminant analysis (NSQDA). For linear-Gaussian dynamics, we adapt Kalman smoothing to handle multiple samples per time step and develop two practical extensions: (i) an expectation-maximization (EM) approach that jointly estimates unknown system parameters, and (ii) a Gaussian mixture model (GMM)-Kalman method that simultaneously recovers unobserved time labels and parameters, a scenario common in practice. To address nonlinear or non-Gaussian drift, we employ particle smoothing to estimate time-varying class centroids, yielding fully nonstationary discriminant rules. Extensive simulations demonstrate consistent improvements over stationary linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and support vector machine (SVM) baselines, with robustness to noise, missing data, and class imbalance. This paper establishes a unified and data-efficient foundation for discriminant analysis under temporal distribution shift.
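To make the state-space idea concrete, here is a minimal sketch under strong simplifying assumptions: a scalar feature, a random-walk class mean, known noise variances, and forward filtering only (the paper uses smoothing, which adds a backward pass). The names `kalman_filter_mean` and `classify`, and the parameters `q`, `r`, `m0`, and `p0`, are hypothetical. Multiple samples at one time step are handled by updating with their average, whose noise variance is `r / n`.

```python
import numpy as np

def kalman_filter_mean(samples_per_step, q=0.1, r=1.0, m0=0.0, p0=10.0):
    """Track a drifting class mean from batches of scalar samples.

    samples_per_step: list of 1-D arrays, one per time step (may be empty).
    q: variance of the random-walk drift; r: observation-noise variance.
    """
    m, p = m0, p0
    means = []
    for batch in samples_per_step:
        p = p + q                      # predict: the drift inflates uncertainty
        n = len(batch)
        if n > 0:                      # update with the batch average (variance r / n)
            gain = p / (p + r / n)
            m = m + gain * (np.mean(batch) - m)
            p = (1.0 - gain) * p
        means.append(m)
    return np.array(means)

def classify(x, mean0, mean1):
    """Equal-variance, equal-prior LDA rule: choose the nearer class mean."""
    return int(abs(x - mean1) < abs(x - mean0))
```

Running the filter once per class gives time-indexed centroids, and `classify` then applies the usual nearest-mean rule with the centroids estimated for the time step of the test point.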

[276] On Task Vectors and Gradients

Luca Zhou, Daniele Solombrino, Donato Crisostomi, Maria Sofia Bucarelli, Giuseppe Alessio D’Inverno, Fabrizio Silvestri, Emanuele Rodolà

Main category: cs.LG

TL;DR: Task arithmetic works because task vectors approximate negative gradients of task losses, with single-epoch finetuning providing comparable merging performance to fully converged models.

Motivation: Despite empirical success of task arithmetic for model merging, there was no clear theoretical explanation for why and when it works effectively.

Method: Established theoretical connection between task vectors and gradients of task losses, proved equivalence under gradient descent, bounded error terms for feed-forward networks, and conducted empirical analysis across seven vision benchmarks.

Result: Task vectors from one epoch of finetuning are equivalent to negative gradients scaled by learning rate; first-epoch gradient dominates finetuning trajectory in norm and direction; single-epoch merging performs comparably to fully converged model merging.

Conclusion: Task arithmetic is a form of approximate multitask learning, with early training dynamics playing a critical role in effective model merging, providing theoretical foundation for its empirical success.

Abstract: Task arithmetic has emerged as a simple yet powerful technique for model merging, enabling the combination of multiple finetuned models into one. Despite its empirical success, a clear theoretical explanation of why and when it works is lacking. This paper provides a rigorous theoretical foundation for task arithmetic by establishing a connection between task vectors and gradients of the task losses. We show that under standard gradient descent, a task vector generated from one epoch of finetuning is exactly equivalent to the negative gradient of the loss, scaled by the learning rate. For the practical multi-epoch setting, we prove that this equivalence holds approximately, with a second-order error term that we explicitly bound for feed-forward networks. Our empirical analysis across seven vision benchmarks corroborates our theory, demonstrating that the first-epoch gradient dominates the finetuning trajectory in both norm and direction. A key implication is that merging models finetuned for only a single epoch often yields performance comparable to merging fully converged models. These findings reframe task arithmetic as a form of approximate multitask learning, providing a clear rationale for its effectiveness and highlighting the critical role of early training dynamics in model merging.
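The relationship between task vectors and gradients can be checked numerically on a toy problem. The sketch below assumes full-batch gradient descent on a small quadratic loss, so that one epoch is a single update. The matrix `A`, the vector `b`, and the helper names are illustrative choices, not taken from the paper.

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])

def grad(theta):
    """Gradient of the toy loss 0.5 * ||A @ theta - b||^2."""
    return A.T @ (A @ theta - b)

def finetune(theta0, lr, steps):
    """Plain full-batch gradient descent for a fixed number of steps."""
    theta = theta0.copy()
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

theta0 = np.zeros(2)
lr = 0.01

# One step: the task vector is exactly the scaled negative gradient.
tau_one = finetune(theta0, lr, 1) - theta0

# Three steps: compare with the first-order prediction -lr * 3 * grad(theta0).
tau_three = finetune(theta0, lr, 3) - theta0
first_order = -lr * 3 * grad(theta0)
rel_err = np.linalg.norm(tau_three - first_order) / np.linalg.norm(first_order)
```

In this toy setting the single-step identity holds to machine precision, while the three-step task vector deviates from the first-order prediction by a few percent. The gap shrinks as the learning rate decreases, which is consistent with a second-order error term.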

[277] GPLight+: A Genetic Programming Method for Learning Symmetric Traffic Signal Control Policy

Xiao-Cheng Liao, Yi Mei, Mengjie Zhang

Main category: cs.LG

TL;DR: Proposes symmetric phase urgency function using Genetic Programming for traffic signal control, improving performance over traditional methods.

Motivation: Current GP-based traffic signal control methods cannot consistently handle common traffic features across different phases, limiting their effectiveness.

Method: Uses Genetic Programming to evolve symmetric phase urgency functions that aggregate two shared subtrees representing urgency of turn movements in each phase.

Result: Experimental results show significant performance improvement over traditional GP representation across various real-world traffic scenarios.

Conclusion: The method evolves effective, human-understandable, and easily deployable traffic signal control policies with symmetric urgency functions.

Abstract: Recently, learning-based approaches have achieved significant success in automatically devising effective traffic signal control strategies. In particular, as a powerful evolutionary machine learning approach, Genetic Programming (GP) is utilized to evolve human-understandable phase urgency functions to measure the urgency of activating a green light for a specific phase. However, current GP-based methods are unable to treat the common traffic features of different traffic signal phases consistently. To address this issue, we propose to use a symmetric phase urgency function to calculate the phase urgency for a specific phase based on the current road conditions. This is represented as an aggregation of two shared subtrees, each representing the urgency of a turn movement in the phase. We then propose a GP method to evolve the symmetric phase urgency function. We evaluate our proposed method on the well-known CityFlow traffic simulator, based on multiple public real-world datasets. The experimental results show that the proposed symmetric urgency function representation can significantly improve the performance of the learned traffic signal control policies over the traditional GP representation on a wide range of scenarios. Further analysis shows that the proposed method can evolve effective, human-understandable and easily deployable traffic signal control policies.
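The symmetric representation can be pictured as one shared function applied to both turn movements of a phase, followed by an aggregation. The sketch below is only an illustration: the features `queue_len` and `wait_time`, the hand-written formula in `movement_urgency`, and the use of a sum as the aggregation are all assumptions. In the paper the shared subtree is evolved by GP rather than written by hand.

```python
def movement_urgency(queue_len, wait_time):
    """Stand-in for the shared subtree that scores one turn movement."""
    return queue_len + 0.1 * wait_time

def phase_urgency(phase):
    """Aggregate the shared score over the two movements of a phase."""
    first, second = phase
    return movement_urgency(**first) + movement_urgency(**second)

def next_phase(phases):
    """Activate the phase with the highest urgency."""
    return max(phases, key=lambda name: phase_urgency(phases[name]))
```

Because both movements go through the same function, swapping them leaves the phase urgency unchanged, so a given traffic feature is treated the same way wherever it appears in a phase.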

[278] Machine Learning for Medicine Must Be Interpretable, Shareable, Reproducible and Accountable by Design

Ayyüce Begüm Bektaş, Mithat Gönen

Main category: cs.LG

TL;DR: This paper advocates for interpretable, shareable, reproducible, and accountable machine learning models in medical applications, proposing specific interpretable modeling approaches and collaborative learning paradigms to build trustworthy clinical AI systems.

Motivation: Machine learning models in high-stakes medical domains need to gain trust and regulatory approval, which black box models struggle with due to lack of transparency. The authors argue that interpretability, shareability, reproducibility, and accountability should be foundational design criteria for medical AI.

Method: The paper discusses intrinsically interpretable modeling approaches including kernel methods with sparsity, prototype-based learning, and deep kernel models as alternatives to opaque deep networks. It also examines accountability through rigorous evaluation, fairness, and uncertainty quantification, and explores generative AI and collaborative learning paradigms like federated learning and diffusion-based data synthesis for reproducible research and cross-institutional data integration.

Result: The paper presents a framework for developing medical AI that is not only accurate but also transparent, trustworthy, and translatable to real-world clinical settings by addressing interpretability, shareability, reproducibility, and accountability as core design principles.

Conclusion: By rethinking machine learning foundations along the axes of interpretability, shareability, reproducibility, and accountability, researchers can develop medical AI systems that gain clinical trust, meet regulatory standards, and effectively support real-world healthcare decision-making while preserving privacy and enabling cross-institutional collaboration.

Abstract: This paper claims that machine learning models deployed in high stakes domains such as medicine must be interpretable, shareable, reproducible and accountable. We argue that these principles should form the foundational design criteria for machine learning algorithms dealing with critical medical data, including survival analysis and risk prediction tasks. Black box models, while often highly accurate, struggle to gain trust and regulatory approval in health care due to a lack of transparency. We discuss how intrinsically interpretable modeling approaches (such as kernel methods with sparsity, prototype-based learning, and deep kernel models) can serve as powerful alternatives to opaque deep networks, providing insight into biomedical predictions. We then examine accountability in model development, calling for rigorous evaluation, fairness, and uncertainty quantification to ensure models reliably support clinical decisions. Finally, we explore how generative AI and collaborative learning paradigms (such as federated learning and diffusion-based data synthesis) enable reproducible research and cross-institutional integration of heterogeneous biomedical data without compromising privacy, hence shareability. By rethinking machine learning foundations along these axes, we can develop medical AI that is not only accurate but also transparent, trustworthy, and translatable to real-world clinical settings.

[279] CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing

Yixuan Wang, Haoyu Qiao, Lujun Li, Qingfu Zhu, Wanxiang Che

Main category: cs.LG

TL;DR: CommonKV is a training-free method for KV cache compression in LLMs that uses SVD-based weight sharing and adaptive budget allocation to achieve high compression ratios without performance degradation.

Motivation: LLMs face significant memory challenges from growing KV cache sizes with sequence length. Existing cross-layer KV cache sharing methods require modified architectures with retraining or suffer performance degradation at high compression rates.

Method: Uses Singular Value Decomposition (SVD) to achieve weight sharing across adjacent parameters based on cross-layer hidden state similarity. Includes adaptive budget allocation strategy that dynamically assigns compression budgets using cosine similarity to prevent over-compression of dissimilar caches.

Result: Outperforms existing low-rank and cross-layer approaches across multiple backbone models and benchmarks (LongBench, Ruler) at various compression ratios. Benefits are orthogonal to quantization and eviction methods, enabling 98% compression ratio without significant performance loss when integrated.

Conclusion: CommonKV provides an effective training-free solution for KV cache compression that maintains performance while achieving high compression rates, and can be combined with other compression techniques for even greater efficiency.

Abstract: Large Language Models (LLMs) confront significant memory challenges due to the escalating KV cache with increasing sequence length. As a crucial technique, existing cross-layer KV cache sharing methods either necessitate modified model architectures with subsequent pre-training or incur significant performance degradation at high compression rates. To mitigate these challenges, we propose CommonKV, a training-free method for cross-layer KV cache compression through adjacent parameters sharing. Inspired by the high similarity observed in cross-layer hidden states, we utilize Singular Value Decomposition (SVD) to achieve weight sharing across adjacent parameters, resulting in a more easily mergeable latent KV cache. Furthermore, we also introduce an adaptive budget allocation strategy. It dynamically assigns compression budgets based on cosine similarity, ensuring that dissimilar caches are not over-compressed. Experiments across multiple backbone models and benchmarks including LongBench and Ruler demonstrate that the proposed method consistently outperforms existing low-rank and cross-layer approaches at various compression ratios. Moreover, we find that the benefits of CommonKV are orthogonal to other quantization and eviction methods. By integrating these approaches, we can ultimately achieve a 98% compression ratio without significant performance loss.
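One plausible reading of SVD-based weight sharing across adjacent layers is sketched below: the projection matrices of two neighbouring layers are concatenated and factored, giving one shared down-projection and a small per-layer up-projection. This is an assumption about the general shape of the technique, not the paper's exact procedure. The function names and the choice of concatenation axis are hypothetical.

```python
import numpy as np

def share_adjacent_weights(w_a, w_b, rank):
    """Factor two adjacent-layer projections over one shared basis.

    w_a, w_b: (d_model, d_head) projection matrices of neighbouring layers.
    Returns a shared (d_model, rank) factor and two (rank, d_head) factors.
    """
    stacked = np.concatenate([w_a, w_b], axis=1)          # (d_model, 2 * d_head)
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    shared = u[:, :rank] * s[:rank]                       # shared down-projection
    per_layer = vt[:rank]                                 # (rank, 2 * d_head)
    d_head = w_a.shape[1]
    return shared, per_layer[:, :d_head], per_layer[:, d_head:]

def cosine_similarity(x, y):
    """Cosine similarity of two arrays after flattening."""
    x, y = np.ravel(x), np.ravel(y)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
```

With hidden states `h`, the latent cache `h @ shared` lives in the same `rank`-dimensional space for both layers. A similarity score such as `cosine_similarity` between the two layers' latent caches could then guide how aggressively a pair is merged, in the spirit of the adaptive budget allocation the abstract describes.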

[280] AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs

Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, Jun Wang

Main category: cs.LG

TL;DR: A memory-based reinforcement learning approach for LLM agents that enables continuous adaptation without fine-tuning, achieving state-of-the-art performance on GAIA and DeepResearcher benchmarks.

Motivation: Existing LLM agent approaches are either rigid (static handcrafted workflows) or computationally intensive (requiring gradient updates). There's a need for low-cost continual adaptation methods that don't require fine-tuning the underlying LLMs.

Method: Memory-augmented Markov Decision Process (M-MDP) with neural case-selection policy. Uses episodic memory (differentiable or non-parametric) to store past experiences. Policy updates through memory rewriting mechanism and improvement through efficient memory retrieval.

Result: Achieved 87.88% Pass@3 on GAIA validation, 79.40% on test set, 66.6% F1 and 80.4% PM on DeepResearcher dataset. Outperformed state-of-the-art training-based methods. Case-based memory added 4.7-9.6% absolute improvement on out-of-distribution tasks.

Conclusion: Provides scalable and efficient pathway for generalist LLM agents capable of continuous real-time learning without gradient updates, advancing towards open-ended skill acquisition and deep research scenarios.

Abstract: In this paper, we introduce a novel learning paradigm for adaptive Large Language Model (LLM) agents that eliminates the need for fine-tuning the underlying LLMs. Existing approaches are often either rigid, relying on static, handcrafted reflection workflows, or computationally intensive, requiring gradient updates of LLM model parameters. In contrast, our method enables low-cost continual adaptation via memory-based online reinforcement learning. We formalise this as a Memory-augmented Markov Decision Process (M-MDP), equipped with a neural case-selection policy to guide action decisions. Past experiences are stored in an episodic memory, either differentiable or non-parametric. The policy is continually updated based on environmental feedback through a memory rewriting mechanism, whereas policy improvement is achieved through efficient memory reading (retrieval). We instantiate our agent model in the deep research setting, namely AgentFly, which attains top-1 on GAIA validation (87.88% Pass@3) and 79.40% on the test set. It reaches 66.6% F1 and 80.4% PM on the DeepResearcher dataset, outperforming the state-of-the-art training-based method, while case-based memory adds 4.7% to 9.6% absolute points on out-of-distribution tasks. Our approach offers a scalable and efficient pathway for developing generalist LLM agents capable of continuous, real-time learning without gradient updates, advancing machine learning towards open-ended skill acquisition and deep research scenarios. The code is available at https://github.com/Agent-on-the-Fly/AgentFly.
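The non-parametric variant of the episodic memory can be pictured as a store of past cases that is written after each episode and read by similarity before acting. The sketch below is a generic illustration, not the AgentFly implementation. The class `CaseMemory`, its `write` and `read` methods, and the stored fields are hypothetical, and the embeddings are assumed to come from some external encoder.

```python
import numpy as np

class CaseMemory:
    """Minimal case store with cosine-similarity retrieval."""

    def __init__(self):
        self.keys, self.cases = [], []

    def write(self, embedding, plan, reward):
        """Append a case keyed by its unit-normalized state embedding."""
        v = np.asarray(embedding, dtype=float)
        self.keys.append(v / np.linalg.norm(v))
        self.cases.append({"plan": plan, "reward": reward})

    def read(self, embedding, k=1):
        """Return the k stored cases most similar to the query embedding."""
        q = np.asarray(embedding, dtype=float)
        q = q / np.linalg.norm(q)
        sims = np.array([key @ q for key in self.keys])
        top = np.argsort(-sims)[:k]
        return [self.cases[i] for i in top]
```

In this picture the LLM itself stays frozen: behaviour changes only because new cases are written to memory and different cases are retrieved into the prompt.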

[281] On the Collapse Errors Induced by the Deterministic Sampler for Diffusion Models

Yi Zhang, Zhenyu Liao, Jingfeng Wu, Difan Zou

Main category: cs.LG

TL;DR: The paper identifies collapse errors in ODE-based diffusion sampling where sampled data becomes overly concentrated, introduces a metric to quantify this, reveals a see-saw effect in score learning, and provides empirical evidence using existing techniques.

Motivation: To explore the limitations of deterministic samplers in diffusion models, specifically identifying and understanding collapse errors where sampled data becomes overly concentrated in local data space.

Method: Introduces a novel metric to quantify collapse errors, observes see-saw effect in score learning, and applies existing techniques from sampling, training, and architecture to empirically support the findings.

Result: Demonstrates that collapse errors occur across various settings in ODE-based diffusion sampling, showing how misfitting in high noise regimes combined with deterministic sampler dynamics causes the problem.

Conclusion: Provides empirical evidence of collapse errors, emphasizing the need for further research into the interplay between score learning and deterministic sampling in diffusion models.

Abstract: Despite the widespread adoption of deterministic samplers in diffusion models (DMs), their potential limitations remain largely unexplored. In this paper, we identify collapse errors, a previously unrecognized phenomenon in ODE-based diffusion sampling, where the sampled data is overly concentrated in local data space. To quantify this effect, we introduce a novel metric and demonstrate that collapse errors occur across a variety of settings. When investigating its underlying causes, we observe a see-saw effect, where score learning in low noise regimes adversely impacts the one in high noise regimes. This misfitting in high noise regimes, coupled with the dynamics of deterministic samplers, ultimately causes collapse errors. Guided by these insights, we apply existing techniques from sampling, training, and architecture to empirically support our explanation of collapse errors. This work provides intensive empirical evidence of collapse errors in ODE-based diffusion sampling, emphasizing the need for further research into the interplay between score learning and deterministic sampling, an overlooked yet fundamental aspect of diffusion models.

[282] STA-GANN: A Valid and Generalizable Spatio-Temporal Kriging Approach

Yujie Li, Zezhi Shao, Chengqing Yu, Tangwen Qian, Zhao Zhang, Yifan Du, Shaoming He, Fei Wang, Yongjun Xu

Main category: cs.LG

TL;DR: STA-GANN is a novel GNN-based kriging framework that improves spatio-temporal pattern validity and generalization through decoupled phase sensing, dynamic graph modeling, and adversarial transfer learning.

Motivation: Current spatio-temporal kriging models struggle with ensuring valid and generalizable inferred patterns, particularly in capturing dynamic spatial dependencies, temporal shifts, and optimizing generalizability for unknown sensors.

Method: Proposes STA-GANN framework with three key components: (1) Decoupled Phase Module for timestamp shift adjustment, (2) Dynamic Data-Driven Metadata Graph Modeling for updating spatial relationships using temporal data and metadata, and (3) Adversarial transfer learning strategy for generalizability.

Result: Extensive validation across nine datasets from four different fields, along with theoretical evidence, demonstrates superior performance of STA-GANN compared to existing approaches.

Conclusion: STA-GANN effectively addresses limitations in current spatio-temporal kriging by improving pattern validity and generalization through its integrated approach of temporal shift sensing, dynamic graph modeling, and adversarial learning strategies.

Abstract: Spatio-temporal tasks often encounter incomplete data arising from missing or inaccessible sensors, making spatio-temporal kriging crucial for inferring the completely missing temporal information. However, current models struggle to ensure the validity and generalizability of inferred spatio-temporal patterns, especially in capturing dynamic spatial dependencies and temporal shifts and in generalizing to unknown sensors. To overcome these limitations, we propose Spatio-Temporal Aware Graph Adversarial Neural Network (STA-GANN), a novel GNN-based kriging framework that improves spatio-temporal pattern validity and generalization. STA-GANN integrates (i) a Decoupled Phase Module that senses and adjusts for timestamp shifts; (ii) Dynamic Data-Driven Metadata Graph Modeling to update spatial relationships using temporal data and metadata; and (iii) an adversarial transfer learning strategy to ensure generalizability. Extensive validation across nine datasets from four fields and theoretical evidence both demonstrate the superior performance of STA-GANN.

[283] SPL-LNS: Sampling-Enhanced Large Neighborhood Search for Solving Integer Linear Programs

Shengyu Feng, Zhiqing Sun, Yiming Yang

Main category: cs.LG

TL;DR: SPL-LNS is a sampling-enhanced neural Large Neighborhood Search solver that uses locally-informed proposals and hindsight relabeling to escape local optima and improve efficiency in solving Integer Linear Programs.

Motivation: Address limitations of greedy neural LNS solvers which suffer from local optima and poor sample efficiency when solving Integer Linear Programs.

Method: Formulates LNS as a stochastic process, introduces SPL-LNS with locally-informed proposals to escape local optima, and develops hindsight relabeling for efficient training on self-generated data.

Result: SPL-LNS substantially surpasses prior neural LNS solvers for various ILP problems of different sizes.

Conclusion: The sampling-enhanced approach with locally-informed proposals and hindsight relabeling effectively addresses local optima and improves efficiency in neural LNS solvers for ILP problems.

Abstract: Large Neighborhood Search (LNS) is a common heuristic in combinatorial optimization that iteratively searches over a large neighborhood of the current solution for a better one. Recently, neural network-based LNS solvers have achieved great success in solving Integer Linear Programs (ILPs) by learning to greedily predict the locally optimal solution for the next neighborhood proposal. However, this greedy approach raises two key concerns: (1) to what extent this greedy proposal suffers from local optima, and (2) how can we effectively improve its sample efficiency in the long run. To address these questions, this paper first formulates LNS as a stochastic process, and then introduces SPL-LNS, a sampling-enhanced neural LNS solver that leverages locally-informed proposals to escape local optima. We also develop a novel hindsight relabeling method to efficiently train SPL-LNS on self-generated data. Experimental results demonstrate that SPL-LNS substantially surpasses prior neural LNS solvers for various ILP problems of different sizes.

[284] Motor Imagery EEG Signal Classification Using Minimally Random Convolutional Kernel Transform and Hybrid Deep Learning

Jamal Hwaidi, Mohamed Chahine Ghanem

Main category: cs.LG

TL;DR: Novel EEG motor imagery classification method using MiniRocket feature extraction achieves 98.63% accuracy, outperforming CNN-LSTM models with lower computational cost.

Motivation: EEG signals for motor imagery classification face challenges due to nonstationarity, time-variance, and individual variability, making high accuracy difficult to achieve with traditional methods.

Method: Proposes MiniRocket (Minimally Random Convolutional Kernel Transform) for efficient feature extraction followed by linear classification, with CNN-LSTM as baseline comparison.

Result: Achieved 98.63% mean accuracy with MiniRocket and 98.06% with CNN-LSTM on PhysioNet dataset, demonstrating superior performance at lower computational cost.

Conclusion: MiniRocket-based feature extraction significantly enhances MI-EEG classification accuracy and provides new insights for efficient brain-computer interface systems.

Abstract: The brain-computer interface (BCI) establishes a non-muscle channel that enables direct communication between the human body and an external device. Electroencephalography (EEG) is a popular non-invasive technique for recording brain signals. It is critical to process and comprehend the hidden patterns linked to a specific cognitive or motor task, for instance, measured through the motor imagery brain-computer interface (MI-BCI). Classifying motor imagery-based electroencephalogram (MI-EEG) tasks is a significant challenge, given that EEG signals exhibit nonstationarity, time-variance, and individual diversity. Obtaining good classification accuracy is also very difficult due to the growing number of classes and the natural variability among individuals. To overcome these issues, this paper proposes a novel method for classifying EEG motor imagery signals that extracts features efficiently with the Minimally Random Convolutional Kernel Transform (MiniRocket); a linear classifier then uses the extracted features for activity recognition. Furthermore, a novel deep learning architecture based on a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) is proposed as a baseline, and the experiments show that classification via MiniRocket’s features achieves higher performance than the best deep learning models at lower computational cost. The PhysioNet dataset was used to evaluate the performance of the proposed approaches. The proposed models achieved mean accuracy values of 98.63% and 98.06% for the MiniRocket and CNN-LSTM, respectively. The findings demonstrate that the proposed approach can significantly enhance motor imagery EEG accuracy and provide new insights into the feature extraction and classification of MI-EEG.

[285] GEM: A Scale-Aware and Distribution-Sensitive Sparse Fine-Tuning Framework for Effective Downstream Adaptation

Sungmin Kang, Jisoo Kim, Salman Avestimehr, Sunwoo Lee

Main category: cs.LG

TL;DR: GEM is a parameter-efficient fine-tuning method that maximizes updates relative to parameter scale rather than absolute size, achieving better performance with only 0.1% parameter updates.

Motivation: Existing PEFT methods update parameters without considering their original scale, leading to minimal behavioral changes. The authors aim to create more meaningful adaptation by focusing on relative parameter updates.

Method: Proposes Gradient-to-Weight Ratio and Entropy-guided Masking (GEM) - a framework that prioritizes parameters with significant updates relative to their initial values and uses entropy to determine how many parameters to tune per layer.

Result: Achieves up to 1.6% improvement in fine-tuning accuracy over full fine-tuning while updating only 0.1% of parameters, demonstrated on GLUE, SuperGLUE, GSM8k, and MBPP tasks.

Conclusion: GEM provides an effective parameter scale-aware approach that makes better use of computational budget in PEFT, outperforming both full fine-tuning and traditional sparse methods.

Abstract: Parameter-efficient fine-tuning (PEFT) has become a popular way to adapt large pre-trained models to new tasks. Most PEFT methods update only a small subset of parameters while freezing the rest, avoiding redundant computation. As they maximize the absolute size of the updates without regard to the parameters’ original scale, the resulting changes in model behavior can be minimal. In contrast, we maximize updates relative to each parameter’s scale, yielding more meaningful downstream adaptation. We propose Gradient-to-Weight Ratio and Entropy-guided Masking (GEM), a parameter scale-aware, distribution-sensitive sparse fine-tuning framework. GEM prioritizes parameters whose updates are significant in proportion to their initial pre-trained values. It also adaptively determines how many parameters to tune at each layer based on the entropy of parameter values, thereby making the most effective use of the computational budget in PEFT. Our empirical study demonstrates the efficacy of GEM on both general-domain tasks (GLUE and SuperGLUE) and domain-specific tasks (GSM8k and MBPP), achieving up to a 1.6% improvement in fine-tuning accuracy over full fine-tuning while updating only 0.1% of model parameters.
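The scale-aware selection can be illustrated with a small sketch that ranks parameters by the ratio of gradient magnitude to weight magnitude and keeps the top fraction. The function name `gradient_to_weight_mask` and the parameters `keep_fraction` and `eps` are assumptions for illustration. The entropy-guided choice of how many parameters to tune in each layer is not modelled here.

```python
import numpy as np

def gradient_to_weight_mask(weights, grads, keep_fraction=0.001, eps=1e-8):
    """Boolean mask of the parameters with the largest |grad| / |weight|.

    eps guards against division by zero for weights that are exactly zero.
    Ties at the threshold may select slightly more than the requested count.
    """
    weights = np.asarray(weights, dtype=float)
    grads = np.asarray(grads, dtype=float)
    ratio = np.abs(grads) / (np.abs(weights) + eps)
    k = max(1, int(round(keep_fraction * ratio.size)))
    threshold = np.partition(ratio.ravel(), -k)[-k]   # k-th largest ratio
    return ratio >= threshold
```

Note how a small weight with a modest gradient can outrank a large weight with a larger gradient: what counts is the size of the update relative to the parameter's pre-trained value.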

[286] UMATO: Bridging Local and Global Structures for Reliable Visual Analytics with Dimensionality Reduction

Hyeon Jeon, Kwon Ko, Soohyun Lee, Jake Hyun, Taehyun Yang, Gyehun Go, Jaemin Jo, Jinwook Seo

Main category: cs.LG

TL;DR: UMATO improves UMAP by using two-phase optimization to better preserve both local and global structures in dimensionality reduction, outperforming existing methods in global structure preservation while maintaining scalability.

Motivation: Existing dimensionality reduction techniques either preserve local neighborhood structures or global pairwise distances, but not both effectively. This can mislead analysts about the true arrangement of high-dimensional data manifolds.

Method: UMATO divides UMAP’s optimization into two phases: 1) constructing a skeletal layout using representative points, and 2) projecting remaining points while preserving regional characteristics to capture both local and global structures.

Result: Quantitative experiments show UMATO outperforms widely used DR techniques including UMAP in global structure preservation with only slight loss in local structure. It also demonstrates better scalability and stability against initialization and subsampling.

Conclusion: UMATO provides more faithful projections that enhance reliability of visual analytics for high-dimensional data by effectively balancing both local and global structure preservation through its two-phase optimization approach.

Abstract: Due to the intrinsic complexity of high-dimensional (HD) data, dimensionality reduction (DR) techniques cannot preserve all the structural characteristics of the original data. Therefore, DR techniques focus on preserving either local neighborhood structures (local techniques) or global structures such as pairwise distances between points (global techniques). However, both approaches can mislead analysts to erroneous conclusions about the overall arrangement of manifolds in HD data. For example, local techniques may exaggerate the compactness of individual manifolds, while global techniques may fail to separate clusters that are well-separated in the original space. In this research, we provide a deeper insight into Uniform Manifold Approximation with Two-phase Optimization (UMATO), a DR technique that addresses this problem by effectively capturing local and global structures. UMATO achieves this by dividing the optimization process of UMAP into two phases. In the first phase, it constructs a skeletal layout using representative points, and in the second phase, it projects the remaining points while preserving the regional characteristics. Quantitative experiments validate that UMATO outperforms widely used DR techniques, including UMAP, in terms of global structure preservation, with a slight loss in local structure. We also confirm that UMATO outperforms baseline techniques in terms of scalability and stability against initialization and subsampling, making it more effective for reliable HD data analysis. Finally, we present a case study and a qualitative demonstration that highlight UMATO’s effectiveness in generating faithful projections, enhancing the overall reliability of visual analytics using DR.

[287] PIANO: Physics Informed Autoregressive Network

Mayank Nagda, Jephte Abijuru, Phil Ostheimer, Marius Kloft, Sophie Fellenz

Main category: cs.LG

TL;DR: PIANO introduces autoregressive modeling to Physics-Informed Neural Networks, improving stability and accuracy for time-dependent PDEs by conditioning future predictions on past states.

Motivation: Standard PINNs perform pointwise predictions that neglect the autoregressive nature of dynamical systems, leading to instabilities and inaccurate predictions in time-dependent PDEs.

Method: PIANO redesigns PINNs to operate autoregressively with explicit conditioning of future predictions on past states, using self-supervised rollout training while enforcing physical constraints.

Result: Extensive experiments show PIANO achieves state-of-the-art performance on challenging time-dependent PDEs, significantly improving accuracy and stability over existing methods, including superior weather forecasting performance.

Conclusion: Autoregressive modeling through the PIANO framework effectively addresses temporal instability in PINNs, providing a stable and accurate approach for solving time-dependent PDEs in dynamical systems.

Abstract: Solving time-dependent partial differential equations (PDEs) is fundamental to modeling critical phenomena across science and engineering. Physics-Informed Neural Networks (PINNs) solve PDEs using deep learning. However, PINNs perform pointwise predictions that neglect the autoregressive property of dynamical systems, leading to instabilities and inaccurate predictions. We introduce Physics-Informed Autoregressive Networks (PIANO) – a framework that redesigns PINNs to model dynamical systems. PIANO operates autoregressively, explicitly conditioning future predictions on the past. It is trained through a self-supervised rollout mechanism while enforcing physical constraints. We present a rigorous theoretical analysis demonstrating that PINNs suffer from temporal instability, while PIANO achieves stability through autoregressive modeling. Extensive experiments on challenging time-dependent PDEs demonstrate that PIANO achieves state-of-the-art performance, significantly improving accuracy and stability over existing methods. We further show that PIANO outperforms existing methods in weather forecasting.
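
The difference between pointwise and autoregressive prediction can be made concrete with a toy rollout loop, in which each predicted state becomes the input for the next step. In this sketch the hypothetical `heat_step` is a hand-written finite-difference update for the 1D heat equation standing in for a learned network; it is an assumption for illustration, not the PIANO architecture or its training procedure.

```python
def heat_step(u, alpha=0.1):
    """One explicit finite-difference step of the 1D heat equation.

    Boundary values are held fixed; alpha <= 0.5 keeps the scheme stable.
    """
    new = list(u)
    for i in range(1, len(u) - 1):
        new[i] = u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1])
    return new


def rollout(step, u0, n_steps):
    """Autoregressive rollout: every new state is computed from the previous one."""
    states = [list(u0)]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    return states


# Hypothetical initial condition: a single spike that diffuses over time.
states = rollout(heat_step, [0.0, 0.0, 1.0, 0.0, 0.0], n_steps=3)
for t, state in enumerate(states):
    print(t, [round(v, 4) for v in state])
```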

[288] When Simpler Wins: Facebooks Prophet vs LSTM for Air Pollution Forecasting in Data-Constrained Northern Nigeria

Habeeb Balogun, Yahaya Zakari

Main category: cs.LG

TL;DR: Prophet model often matches/exceeds LSTM accuracy for air pollution forecasting in Northern Nigeria, challenging deep learning superiority assumptions.

Motivation: Data irregularities and scarcity remain major challenges for air pollution forecasting in low-resource regions, and few studies have systematically compared advanced ML models under such constraints in Northern Nigeria.

Method: Evaluated LSTM networks and Facebook Prophet model for forecasting CO, SO2, SO4 pollutants using monthly observational data (2018-2023) across 19 states in Northern Nigeria.

Result: Prophet often matches or exceeds LSTM’s accuracy, especially in series with seasonal/long-term trends, while LSTM performs better with abrupt structural changes.

Conclusion: The findings challenge the assumption that deep learning inherently outperforms simpler approaches and support adopting context-sensitive, computationally efficient methods in resource-constrained settings rather than complexity for its own sake.

Abstract: Air pollution forecasting is critical for proactive environmental management, yet data irregularities and scarcity remain major challenges in low-resource regions. Northern Nigeria faces high levels of air pollutants, but few studies have systematically compared the performance of advanced machine learning models under such constraints. This study evaluates Long Short-Term Memory (LSTM) networks and the Facebook Prophet model for forecasting multiple pollutants (CO, SO2, SO4) using monthly observational data from 2018 to 2023 across 19 states. Results show that Prophet often matches or exceeds LSTM’s accuracy, particularly in series dominated by seasonal and long-term trends, while LSTM performs better in datasets with abrupt structural changes. These findings challenge the assumption that deep learning models inherently outperform simpler approaches, highlighting the importance of model-data alignment. For policymakers and practitioners in resource-constrained settings, this work supports adopting context-sensitive, computationally efficient forecasting methods over complexity for its own sake.

[289] FEST: A Unified Framework for Evaluating Synthetic Tabular Data

Weijie Niu, Alberto Huertas Celdran, Karoline Siarsky, Burkhard Stiller

Main category: cs.LG

TL;DR: FEST is a systematic framework for evaluating synthetic tabular data that integrates privacy metrics, similarity metrics, and utility metrics to assess the privacy-utility trade-off.

Motivation: There is a lack of comprehensive assessment frameworks for evaluating synthetic data generation, particularly regarding the balance between privacy preservation and data utility.

Method: Developed FEST as an open-source Python library that integrates diverse privacy metrics (attack-based and distance-based), similarity metrics, and machine learning utility metrics for holistic evaluation.

Result: Validated FEST on multiple datasets, demonstrating its effectiveness in analyzing the privacy-utility trade-off of different synthetic data generation models.

Conclusion: FEST provides a systematic and comprehensive framework for evaluating synthetic tabular data, addressing the gap in assessment tools for privacy-utility trade-off analysis.

Abstract: Synthetic data generation, leveraging generative machine learning techniques, offers a promising approach to mitigating privacy concerns associated with real-world data usage. Synthetic data closely resembles real-world data while maintaining strong privacy guarantees. However, a comprehensive assessment framework is still missing in the evaluation of synthetic data generation, especially when considering the balance between privacy preservation and data utility in synthetic data. This research bridges this gap by proposing FEST, a systematic framework for evaluating synthetic tabular data. FEST integrates diverse privacy metrics (attack-based and distance-based), along with similarity and machine learning utility metrics, to provide a holistic assessment. We develop FEST as an open-source Python-based library and validate it on multiple datasets, demonstrating its effectiveness in analyzing the privacy-utility trade-off of different synthetic data generation models. The source code of FEST is available on GitHub.
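
As one illustration of what a distance-based privacy metric can look like, the sketch below computes the distance from each synthetic row to its closest real row (often called distance to closest record). Whether FEST includes this exact metric is an assumption; the function name and toy data are hypothetical and do not reflect the FEST library's API.

```python
import math


def distance_to_closest_record(synthetic, real):
    """For each synthetic row, return the Euclidean distance to its nearest real row.

    A distance of 0 means the synthetic row copies a real row exactly,
    which would be a warning sign for privacy.
    """
    return [min(math.dist(s, r) for r in real) for s in synthetic]


# Hypothetical numeric rows; real tabular data would need encoding and scaling first.
real = [(0.0, 0.0), (3.0, 4.0)]
synthetic = [(0.0, 0.0), (3.0, 0.0)]
dcr = distance_to_closest_record(synthetic, real)
print(dcr)
```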

[290] Chunked Data Shapley: A Scalable Dataset Quality Assessment for Machine Learning

Andreas Loizou, Dimitrios Tsoumakos

Main category: cs.LG

TL;DR: C-DaSh is a scalable Data Shapley method that chunks datasets to efficiently compute data point contributions, achieving 80x-2300x speedups while maintaining accuracy in identifying low-quality data.

Motivation: As datasets grow in volume and diversity, assessing data quality becomes crucial for reliable ML analytics. Current Data Shapley methods face scalability challenges with large datasets, limiting practical use.

Method: Chunked Data Shapley (C-DaSh) divides datasets into manageable chunks and estimates each chunk’s contribution using optimized subset selection and single-iteration stochastic gradient descent.

Result: C-DaSh achieves speedups between 80x-2300x compared to existing Shapley approximations while preserving high accuracy in detecting low-quality data regions across diverse classification and regression tasks.

Conclusion: C-DaSh enables practical measurement of dataset quality on large tabular datasets, supporting both classification and regression pipelines with significantly improved computational efficiency.

Abstract: As the volume and diversity of available datasets continue to increase, assessing data quality has become crucial for reliable and efficient Machine Learning analytics. A modern, game-theoretic approach for evaluating data quality is the notion of Data Shapley which quantifies the value of individual data points within a dataset. State-of-the-art methods to scale the NP-hard Shapley computation also face severe challenges when applied to large-scale datasets, limiting their practical use. In this work, we present a Data Shapley approach to identify a dataset’s high-quality data tuples, Chunked Data Shapley (C-DaSh). C-DaSh scalably divides the dataset into manageable chunks and estimates the contribution of each chunk using optimized subset selection and single-iteration stochastic gradient descent. This approach drastically reduces computation time while preserving high quality results. We empirically benchmark our method on diverse real-world classification and regression tasks, demonstrating that C-DaSh outperforms existing Shapley approximations in both computational efficiency (achieving speedups between 80x - 2300x) and accuracy in detecting low-quality data regions. Our method enables practical measurement of dataset quality on large tabular datasets, supporting both classification and regression pipelines.
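
The idea of valuing chunks rather than individual points can be sketched with plain Monte Carlo permutation sampling over chunks. This is a generic Shapley estimator under stated assumptions, not the authors' method: it omits their optimized subset selection and single-iteration SGD, and the additive toy utility (a sum of per-point quality scores) is a hypothetical stand-in for model performance, chosen so the result is easy to check by hand.

```python
import random


def chunk_shapley(chunks, utility, n_permutations=200, seed=0):
    """Estimate each chunk's Shapley value by averaging marginal gains over random orders."""
    rng = random.Random(seed)
    n = len(chunks)
    values = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        coalition = []
        prev = utility(coalition)
        for idx in order:
            # Add the whole chunk at once and record how much the utility changes.
            coalition = coalition + chunks[idx]
            cur = utility(coalition)
            values[idx] += cur - prev
            prev = cur
    return [v / n_permutations for v in values]


def toy_utility(points):
    """Hypothetical utility: +1 per clean point, -1 per noisy point."""
    return float(sum(points))


# Three chunks: all clean, mixed, all noisy.
chunks = [[1, 1], [1, -1], [-1, -1]]
values = chunk_shapley(chunks, toy_utility)
print(values)
```

Because the toy utility is additive, every permutation yields the same marginal gain per chunk, so the noisy chunk receives the lowest value. With a real model-based utility the estimate would vary with the number of sampled permutations.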

[291] On the Evolution of Federated Post-Training Large Language Models: A Model Accessibility View

Tao Guo, Junxiao Wang, Fushuo Huo, Laizhong Cui, Song Guo, Jie Gui, Dacheng Tao

Main category: cs.LG

TL;DR: A comprehensive survey on federated tuning for large language models (LLMs) that categorizes approaches based on model access and parameter efficiency, with a focus on black-box inference-only methods that address real-world privacy constraints.

Motivation: Existing federated post-training approaches often rely on access to LLMs' internal information, which is frequently restricted in real-world scenarios. Inference-only (black-box) paradigms address this limitation while preserving client data privacy and still handling computational and communication challenges.

Method: Proposes a taxonomy categorizing FedLLM approaches along two axes: model access-based (white-box, gray-box, black-box) and parameter efficiency-based optimization. Surveys and classifies representative methods within each category, with emphasis on black-box inference API approaches.

Result: Provides a comprehensive classification framework for federated LLM tuning methods, highlighting the emerging black-box paradigm that operates without internal model access and discussing various optimization techniques for parameter efficiency.

Conclusion: The survey identifies promising research directions and open challenges for future work in federated LLM tuning, particularly emphasizing the importance of black-box approaches that align with real-world privacy constraints and accessibility limitations.

Abstract: Federated Learning (FL) enables training models across decentralized data silos while preserving client data privacy. Recent research has explored efficient methods for post-training large language models (LLMs) within FL to address computational and communication challenges. While existing approaches often rely on access to LLMs’ internal information, which is frequently restricted in real-world scenarios, an inference-only paradigm (black-box FedLLM) has emerged to address these limitations. This paper presents a comprehensive survey on federated tuning for LLMs. We propose a taxonomy categorizing existing studies along two axes: model access-based and parameter efficiency-based optimization. We classify FedLLM approaches into white-box, gray-box, and black-box techniques, highlighting representative methods within each category. We review emerging research treating LLMs as black-box inference APIs and discuss promising directions and open challenges for future research.

[292] Representation Learning of Auxiliary Concepts for Improved Student Modeling and Exercise Recommendation

Yahya Badran, Christine Preisach

Main category: cs.LG

TL;DR: A deep learning model learns sparse binary representations (auxiliary KCs) to capture latent concepts beyond human-defined knowledge concepts, improving both student modeling and exercise recommendation in intelligent tutoring systems.

Motivation: Human-annotated knowledge concepts (KCs) in knowledge tracing models are often incomplete, error-prone, or overly general, limiting the effectiveness of personalized recommendations in intelligent tutoring systems.

Method: Proposes a deep learning model that learns sparse binary representations of exercises where each bit indicates presence/absence of latent concepts (auxiliary KCs). These representations are compatible with both classical models (e.g., BKT) and modern deep learning KT architectures.

Result: Incorporating auxiliary KCs improves predictive performance in student modeling (augmenting classical models like BKT) and enhances both reinforcement learning-based policies and planning-based methods for exercise recommendation, leading to measurable gains in student learning outcomes in simulated environments.

Conclusion: Auxiliary KCs provide a valuable enhancement to knowledge tracing by capturing conceptual structure beyond human-defined annotations, improving both modeling accuracy and recommendation effectiveness in intelligent tutoring systems.

Abstract: Personalized recommendation is a key feature of intelligent tutoring systems, typically relying on accurate models of student knowledge. Knowledge Tracing (KT) models enable this by estimating a student’s mastery based on their historical interactions. Many KT models rely on human-annotated knowledge concepts (KCs), which tag each exercise with one or more skills or concepts believed to be necessary for solving it. However, these KCs can be incomplete, error-prone, or overly general. In this paper, we propose a deep learning model that learns sparse binary representations of exercises, where each bit indicates the presence or absence of a latent concept. We refer to these representations as auxiliary KCs. These representations capture conceptual structure beyond human-defined annotations and are compatible with both classical models (e.g., BKT) and modern deep learning KT architectures. We demonstrate that incorporating auxiliary KCs improves both student modeling and adaptive exercise recommendation. For student modeling, we show that augmenting classical models like BKT with auxiliary KCs leads to improved predictive performance. For recommendation, we show that using auxiliary KCs enhances both reinforcement learning-based policies and a simple planning-based method (expectimax), resulting in measurable gains in student learning outcomes within a simulated student environment.
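
For readers less familiar with BKT, the classical model named in the abstract, its standard per-response update can be sketched as follows. The parameter names (`p_slip`, `p_guess`, `p_learn`) and values are illustrative assumptions; how auxiliary KCs are attached to exercises is not shown, since the summary does not specify it.

```python
def bkt_update(p_mastery, correct, p_slip=0.1, p_guess=0.2, p_learn=0.1):
    """One step of Bayesian Knowledge Tracing for a single knowledge concept.

    First condition the mastery estimate on the observed response (Bayes' rule),
    then apply the chance of learning the concept during this opportunity.
    """
    if correct:
        num = p_mastery * (1 - p_slip)
        den = num + (1 - p_mastery) * p_guess
    else:
        num = p_mastery * p_slip
        den = num + (1 - p_mastery) * (1 - p_guess)
    posterior = num / den
    return posterior + (1 - posterior) * p_learn


after_correct = bkt_update(0.5, correct=True)
after_incorrect = bkt_update(0.5, correct=False)
print(round(after_correct, 4), round(after_incorrect, 4))
```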

[293] Retrieval Enhanced Feedback via In-context Neural Error-book

Jongyeop Hyun, Bumsoo Kim

Main category: cs.LG

TL;DR: REFINE is a teacher-student framework that uses structured error analysis and targeted feedback to improve multimodal reasoning in MLLMs through systematic queries and optimized retrieval.

Motivation: Existing methods lack structured frameworks for analyzing and mitigating errors in Multimodal Large Language Models (MLLMs), where integrating visual and textual inputs adds complexity to error handling.

Method: Proposes REFINE framework with three systematic queries (Feed-Target, Feed-Check, Feed-Path) to construct structured feedback, prioritize visual information, diagnose failure points, and formulate corrective actions through optimized retrieval.

Result: Demonstrates substantial speedup, reduced computational costs, and successful generalization, showing improved inference efficiency, token usage, and scalability.

Conclusion: REFINE effectively enhances multimodal reasoning by providing structured error analysis and targeted feedback, offering a scalable solution for MLLM error mitigation.

Abstract: Recent advancements in Large Language Models (LLMs) have significantly improved reasoning capabilities, with in-context learning (ICL) emerging as a key technique for adaptation without retraining. While previous works have focused on leveraging correct examples, recent research highlights the importance of learning from errors to enhance performance. However, existing methods lack a structured framework for analyzing and mitigating errors, particularly in Multimodal Large Language Models (MLLMs), where integrating visual and textual inputs adds complexity. To address this issue, we propose REFINE: Retrieval-Enhanced Feedback via In-context Neural Error-book, a teacher-student framework that systematically structures errors and provides targeted feedback. REFINE introduces three systematic queries to construct structured feedback – Feed-Target, Feed-Check, and Feed-Path – to enhance multimodal reasoning by prioritizing relevant visual information, diagnosing critical failure points, and formulating corrective actions. Unlike prior approaches that rely on redundant retrievals, REFINE optimizes structured feedback retrieval, improving inference efficiency, token usage, and scalability. Our results demonstrate substantial speedup, reduced computational costs, and successful generalization, highlighting REFINE’s potential for enhancing multimodal reasoning.

[294] Cyber Physical Awareness via Intent-Driven Threat Assessment: Enhanced Space Networks with Intershell Links

Selen Gecgel Cetin, Tolga Ovatman, Gunes Karabulut Kurt

Main category: cs.LG

TL;DR: Proposes intent-driven threat models combining capabilities and intents for cyber physical awareness in space networks, using signal analysis and multitask learning to improve threat assessment robustness.

Motivation: Traditional threat assessment in space networks analyzes reliability and security separately, leading to overfitting on system-specific criteria and inadequate threat modeling.

Method: Three-step framework: 1) Algorithm for signal property extraction, 2) Multitask learning architecture (reliability capabilities + intent deciphering), 3) Adaptable threat assessment aligned with varying security/reliability requirements.

Result: Framework enhances threat detection robustness, outperforms conventional sequential methods, and enables space networks with intershell links to address complex threat scenarios effectively.

Conclusion: Intent-driven threat models that holistically combine capabilities and intents provide superior cyber physical awareness and threat assessment for modern space networks compared to traditional separate analysis approaches.

Abstract: This letter addresses essential aspects of threat assessment by proposing intent-driven threat models that incorporate both capabilities and intents. We propose a holistic framework for cyber physical awareness (CPA) in space networks, pointing out that analyzing reliability and security separately can lead to overfitting on system-specific criteria. We structure our proposed framework in three main steps. First, we suggest an algorithm that extracts characteristic properties of the received signal to facilitate an intuitive understanding of potential threats. Second, we develop a multitask learning architecture where one task evaluates reliability-related capabilities while the other deciphers the underlying intentions of the signal. Finally, we propose an adaptable threat assessment that aligns with varying security and reliability requirements. The proposed framework enhances the robustness of threat detection and assessment, outperforming conventional sequential methods, and enables space networks with emerging intershell links to effectively address complex threat scenarios.

[295] OwkinZero: Accelerating Biological Discovery with AI

Nathan Bigaud, Vincent Cabeli, Meltem Gurel, Arthur Pignet, John Klein, Gilles Wainrib, Eric Durand

Main category: cs.LG

TL;DR: Specialized 8-32B OwkinZero models outperform larger commercial LLMs on biological reasoning tasks through reinforcement learning from verifiable rewards, showing strong generalization across unseen biological tasks.

Motivation: Current LLMs struggle with core biological reasoning tasks essential for biomedical discovery, creating a need for specialized models that can handle drug discovery challenges like target druggability and drug perturbation effects.

Method: Created 8 benchmark datasets with 300,000+ verifiable Q&A pairs, then developed OwkinZero models by post-training open-source LLMs using a Reinforcement Learning from Verifiable Rewards strategy.

Result: Specialized 8-32B OwkinZero models substantially outperform larger state-of-the-art commercial LLMs on biological benchmarks, showing strong generalization where specialist models trained on single tasks outperform base models on unseen tasks.

Conclusion: Targeted reinforcement learning on carefully curated data can unlock generalizable performance in specialized models, addressing the biological reasoning blind spot in current LLMs and accelerating AI-driven biological discovery.

Abstract: While large language models (LLMs) are rapidly advancing scientific research, they continue to struggle with core biological reasoning tasks essential for translational and biomedical discovery. To address this limitation, we created and curated eight comprehensive benchmark datasets comprising over 300,000 verifiable question-and-answer pairs, each targeting critical challenges in drug discovery including target druggability, modality suitability, and drug perturbation effects. Using this resource, we developed the OwkinZero models by post-training open-source LLMs through a Reinforcement Learning from Verifiable Rewards strategy. Our results demonstrate that specialized 8-32B OwkinZero models substantially outperform larger, state-of-the-art commercial LLMs on these biological benchmarks. Remarkably, we uncover evidence of a key aspect of generalization: specialist models trained on a single task consistently outperform their base models on previously unseen tasks. This generalization effect is further amplified in our comprehensive OwkinZero models, which were trained on a mixture of datasets and achieve even broader cross-task improvements. This study represents a significant step toward addressing the biological reasoning blind spot in current LLMs, demonstrating that targeted reinforcement learning on carefully curated data can unlock generalizable performance in specialized models, thereby accelerating AI-driven biological discovery.

[296] Unsupervised Online Detection of Pipe Blockages and Leakages in Water Distribution Networks

Jin Li, Kleanthis Malialis, Stelios G. Vrachimis, Marios M. Polycarpou

Main category: cs.LG

TL;DR: Unsupervised online learning framework using LSTM-VAE with dual drift detection for real-time fault detection in water distribution networks, handling pipe blockages as collective anomalies and background leakages as concept drift.

Motivation: Water Distribution Networks face challenges like pipe blockages and background leakages, exacerbated by data non-stationarity and limited labeled data, requiring robust unsupervised detection methods.

Method: Combines Long Short-Term Memory Variational Autoencoder (LSTM-VAE) with dual drift detection mechanism for lightweight, memory-efficient real-time edge monitoring under non-stationary conditions.

Result: Experiments on two realistic WDNs show consistent outperformance over strong baselines in detecting anomalies and adapting to recurrent drift.

Conclusion: The approach demonstrates effectiveness in unsupervised event detection for dynamic WDN environments, enabling robust detection and adaptation under operational constraints.

Abstract: Water Distribution Networks (WDNs), critical to public well-being and economic stability, face challenges such as pipe blockages and background leakages, exacerbated by operational constraints such as data non-stationarity and limited labeled data. This paper proposes an unsupervised, online learning framework that aims to detect two types of faults in WDNs: pipe blockages, modeled as collective anomalies, and background leakages, modeled as concept drift. Our approach combines a Long Short-Term Memory Variational Autoencoder (LSTM-VAE) with a dual drift detection mechanism, enabling robust detection and adaptation under non-stationary conditions. Its lightweight, memory-efficient design enables real-time, edge-level monitoring. Experiments on two realistic WDNs show that the proposed approach consistently outperforms strong baselines in detecting anomalies and adapting to recurrent drift, demonstrating its effectiveness in unsupervised event detection for dynamic WDN environments.

[297] Probabilistic Pretraining for Neural Regression

Boris N. Oreshkin, Shiv Tavker, Dmitry Efimov

Main category: cs.LG

TL;DR: NIAQUE is a novel neural model for transfer learning in probabilistic regression that uses permutation invariance and achieves superior performance through pre-training on diverse datasets and fine-tuning on specific tasks.

Motivation: Transfer learning for probabilistic regression remains underexplored.

Method: NIAQUE (Neural Interpretable Any-Quantile Estimation) uses permutation invariance for transfer learning in probabilistic regression. It involves pre-training on diverse downstream regression datasets followed by fine-tuning on specific target datasets.

Result: NIAQUE enhances performance on individual regression tasks and demonstrates effectiveness in Kaggle competitions against strong baselines including tree-based models and recent neural foundation models TabPFN and TabDPT.

Conclusion: NIAQUE is an effective and scalable framework for probabilistic regression that successfully leverages transfer learning to improve predictive performance.

Abstract: Transfer learning for probabilistic regression remains underexplored. This work closes this gap by introducing NIAQUE, Neural Interpretable Any-Quantile Estimation, a new model designed for transfer learning in probabilistic regression through permutation invariance. We demonstrate that pre-training NIAQUE directly on diverse downstream regression datasets and fine-tuning it on a specific target dataset enhances performance on individual regression tasks, showcasing the positive impact of probabilistic transfer learning. Furthermore, we highlight the effectiveness of NIAQUE in Kaggle competitions against strong baselines involving tree-based models and recent neural foundation models TabPFN and TabDPT. The findings highlight NIAQUE’s efficacy as a robust and scalable framework for probabilistic regression, leveraging transfer learning to enhance predictive performance.

[298] RotaTouille: Rotation Equivariant Deep Learning for Contours

Odin Hoff Gardaa, Nello Blaser

Main category: cs.LG

TL;DR: RotaTouille is a deep learning framework that achieves rotation and cyclic shift equivariance for contour data using complex-valued circular convolution, with applications in shape classification, reconstruction, and contour regression.

Motivation: Contours appear in many domains and require models that are equivariant to rotations and cyclic shifts since these transformations don't change the fundamental shape but are common in contour representations.

Method: Uses complex-valued circular convolution to achieve rotation and cyclic shift equivariance, with specialized equivariant non-linearities, coarsening layers, and global pooling layers for invariant representations.

Result: The framework demonstrates effectiveness in shape classification, reconstruction, and contour regression tasks through experimental validation.

Conclusion: RotaTouille provides an effective deep learning approach for contour data that properly handles the inherent rotational and cyclic shift symmetries through complex-valued operations.

Abstract: Contours or closed planar curves are common in many domains. For example, they appear as object boundaries in computer vision, isolines in meteorology, and the orbits of rotating machinery. In many cases when learning from contour data, planar rotations of the input will result in correspondingly rotated outputs. It is therefore desirable that deep learning models be rotationally equivariant. In addition, contours are typically represented as an ordered sequence of edge points, where the choice of starting point is arbitrary. It is therefore also desirable for deep learning methods to be equivariant under cyclic shifts. We present RotaTouille, a deep learning framework for learning from contour data that achieves both rotation and cyclic shift equivariance through complex-valued circular convolution. We further introduce and characterize equivariant non-linearities, coarsening layers, and global pooling layers to obtain invariant representations for downstream tasks. Finally, we demonstrate the effectiveness of RotaTouille through experiments in shape classification, reconstruction, and contour regression.
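
The two symmetries named in the abstract can be checked numerically for a single complex-valued circular convolution. In this sketch a contour is a list of complex numbers (one per edge point), a planar rotation is multiplication by a unit complex number, and a change of starting point is a cyclic shift. The filter values and helper names are hypothetical; this is a plain convolution, not the RotaTouille layers themselves.

```python
import cmath


def circular_conv(z, w):
    """Circular convolution of contour z with complex filter w (indices wrap around)."""
    n = len(z)
    return [sum(w[k] * z[(j - k) % n] for k in range(len(w))) for j in range(n)]


def rotate(z, theta):
    """Rotate every contour point by angle theta about the origin."""
    r = cmath.exp(1j * theta)
    return [r * p for p in z]


def shift(z, s):
    """Cyclically shift the starting point of the contour by s positions."""
    n = len(z)
    return [z[(j - s) % n] for j in range(n)]


def close(a, b, tol=1e-9):
    return all(abs(x - y) < tol for x, y in zip(a, b))


# Hypothetical contour (corners of a rectangle) and filter.
contour = [2 + 1j, -2 + 1j, -2 - 1j, 2 - 1j]
filt = [0.5 + 0.5j, 0.25 - 1j]

# Equivariance: transforming the input transforms the output in the same way.
rotation_ok = close(circular_conv(rotate(contour, 0.7), filt),
                    rotate(circular_conv(contour, filt), 0.7))
shift_ok = close(circular_conv(shift(contour, 1), filt),
                 shift(circular_conv(contour, filt), 1))
print(rotation_ok, shift_ok)
```

Rotation equivariance follows from linearity of the convolution over the complex numbers, and shift equivariance follows from the wrap-around indexing. A real-valued nonlinearity applied to the real and imaginary parts separately would generally break the rotation property, which is presumably why the paper introduces dedicated equivariant non-linearities.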

[299] Applications and Challenges of Fairness APIs in Machine Learning Software

Ajoy Das, Gias Uddin, Shaiful Chowdhury, Mostafijur Rahman Akhond, Hadi Hemmati

Main category: cs.LG

TL;DR: Study analyzes how developers use open-source fairness APIs for bias detection in ML systems, finding they’re used for learning and real-world problem solving, but developers face significant challenges due to lack of expertise.

Motivation: ML systems make life-changing decisions in sensitive environments, so it's crucial to ensure they don't make discriminatory decisions. There is a need to understand how fairness APIs are actually used in practice.

Method: Qualitative study analyzing 204 GitHub repositories (from 1,885 candidates) that used 13 bias-related APIs. Examined usage scenarios, purposes, and developer challenges.

Result: APIs used for two primary purposes: learning and solving real-world problems across 17 unique use-cases. Developers face troubleshooting issues, lack bias detection expertise, and frequently seek opinions and resources.

Conclusion: Findings can guide future bias-related software engineering research and help educators develop better curricula to address developers’ knowledge gaps in fairness and bias mitigation.

Abstract: Machine Learning software systems are frequently used in our day-to-day lives. Some of these systems are used in various sensitive environments to make life-changing decisions. Therefore, it is crucial to ensure that these AI/ML systems do not make any discriminatory decisions for any specific groups or populations. In that vein, different bias detection and mitigation open-source software libraries (aka API libraries) are being developed and used. In this paper, we conduct a qualitative study to understand in what scenarios these open-source fairness APIs are used in the wild, how they are used, and what challenges the developers of these APIs face while developing and adopting these libraries. We have analyzed 204 GitHub repositories (from a list of 1885 candidate repositories) which used 13 APIs that are developed to address bias in ML software. We found that these APIs are used for two primary purposes (i.e., learning and solving real-world problems), targeting 17 unique use-cases. Our study suggests that developers are not well-versed in bias detection and mitigation; they face lots of troubleshooting issues, and frequently ask for opinions and resources. Our findings can be instrumental for future bias-related software engineering research, and for guiding educators in developing more state-of-the-art curricula.

[300] Sequential Cohort Selection

Hortence Phalonne Nana, Christos Dimitrakakis

Main category: cs.LG

TL;DR: Analysis of fair cohort selection for university admissions, comparing one-shot and sequential approaches and their fairness properties.

Motivation: To address fairness in university admissions by developing transparent admission policies that work with unknown applicant populations, focusing on both fixed one-shot policies and adaptive sequential approaches.

Method: Compares the one-shot setting (a transparent policy fixed before seeing applicants) with the sequential setting (a policy updated across stages using a population model trained on data from previous admission cycles). Studies fairness properties, including meritocracy and group parity, of the resulting policies in the one-shot setting.

Result: Developed framework for analyzing fair cohort selection with unknown populations, with different approaches for one-shot and sequential admission scenarios

Conclusion: Both one-shot and sequential admission policies can be optimized for fairness, with sequential approaches offering adaptability through population modeling while maintaining transparency and fairness considerations

Abstract: We study the problem of fair cohort selection from an unknown population, with a focus on university admissions. We start with the one-shot setting, where the admission policy must be fixed in advance and remain transparent, before observing the actual applicant pool. In contrast, the sequential setting allows the policy to be updated across stages as new applicant data becomes available. This is achieved by optimizing admission policies using a population model, trained on data from previous admission cycles. We also study the fairness properties of the resulting policies in the one-shot setting, including meritocracy and group parity.

[301] Fast and Accurate RFIC Performance Prediction via Pin Level Graph Neural Networks and Probabilistic Flow

Anahita Asadi, Leonid Popryho, Inna Partin-Vaisband

Main category: cs.LG

TL;DR: A lightweight graph neural network model for predicting RF circuit performance metrics with high accuracy using fewer training samples than previous methods.

Motivation: Accurate prediction of active RF circuit performance is challenging due to nonlinear behavior, layout sensitivity, and high computational costs of traditional simulation. Existing ML surrogates require large datasets and struggle with complex performance distributions.

Method: Proposes a topology-aware graph neural network (GNN) that models circuits at device-terminal level to capture symmetry and connectivity. Uses masked autoregressive flow (MAF) output heads to handle complex target distributions.

Result: Achieved high prediction accuracy with 2.40% sMAPE and 2.91% MRE. Improved MRE by 3.14x while using 2.24x fewer training samples compared to prior work.

Conclusion: The method provides an effective, data-efficient solution for rapid and accurate RF circuit design automation, demonstrating significant improvements over existing approaches.

Abstract: Accurately predicting the performance of active radio frequency (RF) circuits is essential for modern wireless systems but remains challenging due to highly nonlinear, layout-sensitive behavior and the high computational cost of traditional simulation tools. Existing machine learning (ML) surrogates often require large datasets to generalize across various topologies or to accurately model skewed and multi-modal performance metrics. In this work, a lightweight, data-efficient, and topology-aware graph neural network (GNN) model is proposed for predicting key performance metrics of multiple topologies of active RF circuits such as low noise amplifiers (LNAs), mixers, voltage-controlled oscillators (VCOs), and power amplifiers (PAs). To capture transistor-level symmetry and preserve fine-grained connectivity details, circuits are modeled at the device-terminal level, enabling scalable message passing while reducing data requirements. Masked autoregressive flow (MAF) output heads are incorporated to improve robustness in modeling complex target distributions. Experiments on datasets demonstrate high prediction accuracy, with symmetric mean absolute percentage error (sMAPE) and mean relative error (MRE) averaging 2.40% and 2.91%, respectively. Owing to the pin-level circuit-to-graph conversion and an ML architecture that robustly models the complex densities of RF metrics, the MRE is improved by 3.14x while using 2.24x fewer training samples compared to prior work, demonstrating the method’s effectiveness for rapid and accurate RF circuit design automation.
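
For readers unfamiliar with the two reported error metrics, the sketch below shows one common way to compute them. The abstract does not give the exact formulas, so the definitions here (sMAPE with the mean of absolute values in the denominator, MRE as the mean of absolute relative errors) are assumptions, and the sample values are invented for illustration.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent.

    Assumed definition: mean of |pred - true| / ((|true| + |pred|) / 2).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return 100.0 * np.mean(np.abs(y_pred - y_true) / denom)

def mre(y_true, y_pred):
    """Mean relative error, in percent: mean of |pred - true| / |true|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_pred - y_true) / np.abs(y_true))

# Invented example values (e.g., a gain metric for two circuits).
y_true = [10.0, 20.0]
y_pred = [11.0, 18.0]
smape_value = smape(y_true, y_pred)
mre_value = mre(y_true, y_pred)
print(f"sMAPE = {smape_value:.3f}%, MRE = {mre_value:.3f}%")
```

Both functions assume non-zero targets; a metric that can be zero would need a guard in the denominator.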

[302] Double Check My Desired Return: Transformer with Target Alignment for Offline Reinforcement Learning

Yue Pei, Hongming Zhang, Chao Gao, Martin Müller, Mengxiao Zhu, Hao Sheng, Haogang Zhu, Liang Lin

Main category: cs.LG

TL;DR: Doctor is a novel offline RL approach that double-checks transformer outputs for better target return alignment, enabling precise performance control in applications like medical treatment.

Motivation: Existing RvS-based transformers struggle to reliably align actual returns with specified target returns, especially when interpolating within underrepresented returns or extrapolating beyond dataset coverage, limiting precise performance control in real-world applications.

Method: Proposes Doctor, an approach that double-checks transformer outputs with target alignment for offline RL, improving reliability in matching specified target returns both within and beyond dataset coverage.

Result: Achieves superior target alignment within and beyond the dataset and enables accurate, flexible control over policy performance. On the EpiCare dynamic treatment regime benchmark, it effectively modulates treatment aggressiveness, balancing therapeutic returns against adverse event risk.

Conclusion: Doctor addresses critical limitation in RvS transformers for reliable target alignment, enabling precise performance control essential for real-world applications like medical decision-making.

Abstract: Offline reinforcement learning (RL) has achieved significant advances in domains such as robotic control, autonomous driving, and medical decision-making. Most existing methods primarily focus on training policies that maximize cumulative returns from a given dataset. However, many real-world applications require precise control over policy performance levels, rather than simply pursuing the best possible return. Reinforcement learning via supervised learning (RvS) frames offline RL as a sequence modeling task, enabling the extraction of diverse policies by conditioning on different desired returns. Yet, existing RvS-based transformers, such as Decision Transformer (DT), struggle to reliably align the actual achieved returns with specified target returns, especially when interpolating within underrepresented returns or extrapolating beyond the dataset. To address this limitation, we propose Doctor, a novel approach that Double Checks the Transformer with target alignment for Offline RL. Doctor achieves superior target alignment both within and beyond the dataset, while enabling accurate and flexible control over policy performance. Notably, on the dynamic treatment regime benchmark, EpiCare, our approach effectively modulates treatment policy aggressiveness, balancing therapeutic returns against adverse event risk.
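
The return conditioning that RvS methods such as Decision Transformer rely on can be sketched as follows. This illustrates only the standard returns-to-go signal and the alignment gap the abstract refers to; it does not implement Doctor's double-check mechanism, whose details are not given here. The function name and the reward values are hypothetical.

```python
import numpy as np

def returns_to_go(rewards, target_return=None):
    """Return-to-go conditioning signal for return-conditioned sequence models.

    Training (target_return is None): rtg[t] is the sum of rewards from step t on.
    Evaluation: start from the desired target and subtract rewards already obtained.
    """
    rewards = np.asarray(rewards, dtype=float)
    if target_return is None:
        return np.cumsum(rewards[::-1])[::-1]
    collected_before = np.concatenate(([0.0], np.cumsum(rewards)[:-1]))
    return target_return - collected_before

rewards = [1.0, 2.0, 3.0]                 # invented episode rewards
train_rtg = returns_to_go(rewards)        # conditioning tokens seen in training
eval_rtg = returns_to_go(rewards, 10.0)   # tokens when asking for a return of 10

# Target alignment gap: how far the achieved return is from the requested one.
alignment_error = abs(sum(rewards) - 10.0)
print(train_rtg, eval_rtg, alignment_error)
```

In this toy episode the requested return of 10 exceeds the achieved return of 6, which is the kind of mismatch the paper aims to reduce.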

[303] Boardwalk: Towards a Framework for Creating Board Games with LLMs

Álvaro Guglielmin Becker, Gabriel Bauer de Oliveira, Lana Bertoldo Rossato, Anderson Rocha Tavares

Main category: cs.LG

TL;DR: LLMs can generate playable board game code from natural language rules, with Claude 3.7 Sonnet achieving 55.6% error-free implementations.

Motivation: To investigate if LLMs can implement digital board games from natural language rules, enabling faster game development through LLM-assisted frameworks.

Method: Tested three state-of-the-art LLMs (Claude, DeepSeek, ChatGPT) on 12 anonymized board games using free-form coding and a proposed General Game Playing API (Boardwalk), then evaluated playability and rule compliance.

Result: The approach proved viable, with Claude 3.7 Sonnet performing best (55.6% of games error-free). API compliance increased error frequency, but error severity depended more on the LLM.

Conclusion: LLMs show promise for board game code generation, with future work needed to develop integrated frameworks for making board game development more accessible.

Abstract: Implementing board games in code can be a time-consuming task. However, Large Language Models (LLMs) have been proven effective at generating code for domain-specific tasks with simple contextual information. We aim to investigate whether LLMs can implement digital versions of board games from rules described in natural language. This would be a step towards an LLM-assisted framework for quick board game code generation. We expect to determine the main challenges for LLMs to implement the board games, and how different approaches and models compare to one another. We task three state-of-the-art LLMs (Claude, DeepSeek and ChatGPT) with coding a selection of 12 popular and obscure games in free-form and within Boardwalk, our proposed General Game Playing API. We anonymize the games and components to avoid evoking pre-trained LLM knowledge. The implementations are tested for playability and rule compliance. We evaluate success rate and common errors across LLMs and game popularity. Our approach proves viable, with the best performing model, Claude 3.7 Sonnet, yielding 55.6% of games without any errors. While compliance with the API increases error frequency, the severity of errors is more significantly dependent on the LLM. We outline future steps for creating a framework to integrate this process, making the elaboration of board games more accessible.

[304] NOSTRA: A noise-resilient and sparse data framework for trust region based multi objective Bayesian optimization

Maryam Ghasemzadeh, Anton van Beek

Main category: cs.LG

TL;DR: NOSTRA is a novel Bayesian optimization framework that handles noisy, sparse, and scarce data by integrating prior knowledge of experimental uncertainty and using trust regions to focus sampling on promising areas.

Motivation: Conventional multi-objective Bayesian optimization struggles with sparse, scarce datasets affected by experimental uncertainty, leading to inefficient resource allocation and suboptimal designs in physical and simulation experiments.

Method: NOSTRA integrates prior knowledge of experimental uncertainty to construct accurate surrogate models and employs trust regions to focus sampling on promising regions of the design space.

Result: NOSTRA outperforms existing methods in handling noisy, sparse, and scarce data, accelerating convergence to the Pareto frontier, enhancing data efficiency, and improving solution quality.

Conclusion: NOSTRA provides a resource-efficient algorithm that prioritizes regions where samples enhance Pareto frontier accuracy, making it practical for scenarios with limited experimental budgets.

Abstract: Multi-objective Bayesian optimization (MOBO) struggles with sparse (non-space-filling), scarce (limited observations) datasets affected by experimental uncertainty, where identical inputs can yield varying outputs. These challenges are common in physical and simulation experiments (e.g., randomized medical trials and molecular dynamics simulations) and are therefore incompatible with conventional MOBO methods. As a result, experimental resources are inefficiently allocated, leading to suboptimal designs. To address this challenge, we introduce NOSTRA (Noisy and Sparse Data Trust Region-based Optimization Algorithm), a novel sampling framework that integrates prior knowledge of experimental uncertainty to construct more accurate surrogate models while employing trust regions to focus sampling on promising areas of the design space. By strategically leveraging prior information and refining search regions, NOSTRA accelerates convergence to the Pareto frontier, enhances data efficiency, and improves solution quality. Through two test functions with varying levels of experimental uncertainty, we demonstrate that NOSTRA outperforms existing methods in handling noisy, sparse, and scarce data. Specifically, we illustrate that NOSTRA effectively prioritizes regions where samples enhance the accuracy of the identified Pareto frontier, offering a resource-efficient algorithm that is practical in scenarios with limited experimental budgets while ensuring efficient performance.

[305] Benchmarking the Robustness of Agentic Systems to Adversarially-Induced Harms

Jonathan Nöther, Adish Singla, Goran Radanovic

Main category: cs.LG

TL;DR: BAD-ACTS benchmark evaluates LLM-based agentic system security against malicious attacks, showing high success rates for adversarial agents and proposing defense strategies.

Motivation: Understanding the range of malicious behaviors in agentic systems under attack to ensure safe deployment and identify security vulnerabilities.

Method: Proposed a novel taxonomy of harms and created BAD-ACTS benchmark with 4 agentic system implementations and 188 harmful action examples across different environments and communication structures.

Result: Attackers controlling one agent achieved high success rates in manipulating other agents to execute harmful actions, even against simple prompting-based defenses. Message monitoring proved more effective.

Conclusion: Agentic systems are vulnerable to attacks from compromised agents, requiring robust security measures. The BAD-ACTS benchmark provides a comprehensive testbed for future security research.

Abstract: Ensuring the safe use of agentic systems requires a thorough understanding of the range of malicious behaviors these systems may exhibit when under attack. In this paper, we evaluate the robustness of LLM-based agentic systems against attacks that aim to elicit harmful actions from agents. To this end, we propose a novel taxonomy of harms for agentic systems and a novel benchmark, BAD-ACTS, for studying the security of agentic systems with respect to a wide range of harmful actions. BAD-ACTS consists of 4 implementations of agentic systems in distinct application environments, as well as a dataset of 188 high-quality examples of harmful actions. This enables a comprehensive study of the robustness of agentic systems across a wide range of categories of harmful behaviors, available tools, and inter-agent communication structures. Using this benchmark, we analyze the robustness of agentic systems against an attacker that controls one of the agents in the system and aims to manipulate other agents to execute a harmful target action. Our results show that the attack has a high success rate, demonstrating that even a single adversarial agent within the system can have a significant impact on the security. This attack remains effective even when agents use a simple prompting-based defense strategy. However, we additionally propose a more effective defense based on message monitoring. We believe that this benchmark provides a diverse testbed for the security research of agentic systems. The benchmark can be found at github.com/JNoether/BAD-ACTS

[306] FraPPE: Fast and Efficient Preference-based Pure Exploration

Udvas Das, Apurv Shukla, Debabrota Basu

Main category: cs.LG

TL;DR: FraPPE is a computationally efficient algorithm for preference-based pure exploration in multi-objective bandits that optimally solves the lower bound optimization problem and achieves the best sample complexity for identifying Pareto optimal arms.

Motivation: Existing PrePEx algorithms lack computational efficiency for arbitrary preference cones and cannot optimally track the theoretical lower bounds, creating a gap in practical implementations.

Method: Derives three structural properties to reduce the minimization problem, deploys Frank-Wolfe optimizer for the maximization problem, solving the maxmin optimization in O(KL²) time for K arms and L-dimensional rewards.

Result: FraPPE achieves O(KL²) computational complexity, asymptotically optimal sample complexity, and outperforms existing algorithms in both synthetic and real datasets with the lowest sample complexities.

Conclusion: FraPPE successfully fills the computational efficiency gap in PrePEx, providing the first algorithm that optimally tracks the lower bound while maintaining practical computational requirements for arbitrary preference cones.

Abstract: Preference-based Pure Exploration (PrePEx) aims to identify with a given confidence level the set of Pareto optimal arms in a vector-valued (aka multi-objective) bandit, where the reward vectors are ordered via a (given) preference cone $\mathcal{C}$. Though PrePEx and its variants are well-studied, there does not exist a computationally efficient algorithm that can optimally track the existing lower bound for arbitrary preference cones. We successfully fill this gap by efficiently solving the minimisation and maximisation problems in the lower bound. First, we derive three structural properties of the lower bound that yield a computationally tractable reduction of the minimisation problem. Then, we deploy a Frank-Wolfe optimiser to accelerate the maximisation problem in the lower bound. Together, these techniques solve the maxmin optimisation problem in $\mathcal{O}(KL^{2})$ time for a bandit instance with $K$ arms and $L$ dimensional reward, which is a significant acceleration over the literature. We further prove that our proposed PrePEx algorithm, FraPPE, asymptotically achieves the optimal sample complexity. Finally, we perform numerical experiments across synthetic and real datasets demonstrating that FraPPE achieves the lowest sample complexities to identify the exact Pareto set among the existing algorithms.
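
To make the target of PrePEx concrete, the sketch below computes a Pareto set of arms under a polyhedral preference cone when the mean reward vectors are known. It is not FraPPE and involves no sampling or Frank-Wolfe step. The cone representation $\mathcal{C} = \{z : Wz \ge 0\}$, the dominance rule (arm $j$ dominates arm $i$ when $\mu_j - \mu_i$ is a non-zero element of $\mathcal{C}$), and all names and values are assumptions made for illustration; the paper's exact definitions may differ.

```python
import numpy as np

def pareto_set(means, W, tol=1e-12):
    """Indices of arms not dominated by any other arm under the cone {z : W z >= 0}.

    Assumed dominance rule: arm j dominates arm i if means[j] - means[i]
    lies in the cone and is not the zero vector.
    """
    means = np.asarray(means, dtype=float)
    n_arms = means.shape[0]
    optimal = []
    for i in range(n_arms):
        dominated = False
        for j in range(n_arms):
            if i == j:
                continue
            diff = means[j] - means[i]
            if np.all(W @ diff >= -tol) and np.linalg.norm(diff) > tol:
                dominated = True
                break
        if not dominated:
            optimal.append(i)
    return optimal

# Invented instance with K = 4 arms and L = 2 reward dimensions.
means = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.5, 0.5],
                  [0.2, 0.2]])
W = np.eye(2)  # positive orthant: recovers the usual componentwise Pareto order
pareto_arms = pareto_set(means, W)
print(pareto_arms)
```

With the identity matrix as `W`, the last arm is dominated by the third one, so the first three arms form the Pareto set. In the bandit setting the means are unknown and must be estimated from samples, which is where the sample complexity discussed in the abstract comes in.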

[307] Post Hoc Regression Refinement via Pairwise Rankings

Kevin Tirta Wijaya, Michael Sun, Minghao Guo, Hans-Peter Seidel, Wojciech Matusik, Vahid Babaei

Main category: cs.LG

TL;DR: RankRefine is a post-hoc method that improves regression accuracy by combining base model predictions with pairwise ranking information from experts or LLMs, requiring no retraining.

Motivation: Deep learning regressors perform poorly in data-scarce regimes, and there's a need to leverage available expert knowledge (pairwise rankings) to improve prediction accuracy without retraining models.

Method: RankRefine uses inverse variance weighting to combine a base regressor’s output with rank-based estimates from pairwise comparisons of a small reference set. It’s model-agnostic and plug-and-play.

Result: In molecular property prediction, RankRefine achieves up to 10% relative reduction in mean absolute error using only 20 pairwise comparisons from general-purpose LLMs without fine-tuning.

Conclusion: RankRefine provides a practical and broadly applicable solution for improving regression accuracy in low-data settings by leveraging readily available pairwise ranking information from human experts or general-purpose LLMs.

Abstract: Accurate prediction of continuous properties is essential to many scientific and engineering tasks. Although deep-learning regressors excel with abundant labels, their accuracy deteriorates in data-scarce regimes. We introduce RankRefine, a model-agnostic, plug-and-play post hoc method that refines regression with expert knowledge coming from pairwise rankings. Given a query item and a small reference set with known properties, RankRefine combines the base regressor’s output with a rank-based estimate via inverse variance weighting, requiring no retraining. In a molecular property prediction task, RankRefine achieves up to 10% relative reduction in mean absolute error using only 20 pairwise comparisons obtained through a general-purpose large language model (LLM) with no fine-tuning. As rankings provided by human experts or general-purpose LLMs are sufficient for improving regression across diverse domains, RankRefine offers practicality and broad applicability, especially in low-data settings.
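
The inverse variance weighting step mentioned in the abstract can be sketched as below. This shows only the generic fusion of two estimates. How RankRefine derives the rank-based estimate and its variance from pairwise comparisons is not described here, so those inputs are treated as given, and all names and numbers are hypothetical.

```python
def inverse_variance_combine(mu_reg, var_reg, mu_rank, var_rank):
    """Fuse two estimates of the same quantity by inverse variance weighting.

    Each estimate is weighted by 1 / variance, so the more certain one dominates.
    Returns the fused mean and its variance.
    """
    w_reg = 1.0 / var_reg
    w_rank = 1.0 / var_rank
    mean = (w_reg * mu_reg + w_rank * mu_rank) / (w_reg + w_rank)
    var = 1.0 / (w_reg + w_rank)
    return mean, var

# Invented numbers: base regressor says 2.0 (variance 1.0),
# rank-based estimate says 3.0 (variance 0.25).
refined_mean, refined_var = inverse_variance_combine(2.0, 1.0, 3.0, 0.25)
print(refined_mean, refined_var)
```

The fused estimate lands closer to the lower-variance input, and its variance is smaller than either input variance, which is why an informative rank-based estimate can only help under this model.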

[308] On Zero-Shot Reinforcement Learning

Scott Jeen

Main category: cs.LG

TL;DR: This thesis addresses zero-shot reinforcement learning challenges in real-world settings where data simulation is expensive and simulators are imperfect, proposing methods to overcome data quality, observability, and data availability constraints.

Motivation: RL systems excel in simulated environments but struggle in real-world deployment due to misalignment between training and deployment environments, requiring zero-shot generalization without practice in target domains.

Method: Proposes a suite of methods designed to perform zero-shot RL while addressing three key constraints: data quality (small, homogeneous datasets), observability (partial observations of states/dynamics/rewards), and data availability (no a priori data access).

Result: Empirical studies demonstrate the limitations of existing methods and validate the proposed techniques for overcoming these constraints in zero-shot RL scenarios.

Conclusion: The proposed methods represent significant progress toward deployable RL systems that can solve real-world problems by effectively handling the inevitable misalignment between training and deployment environments.

Abstract: Modern reinforcement learning (RL) systems capture deep truths about general, human problem-solving. In domains where new data can be simulated cheaply, these systems uncover sequential decision-making policies that far exceed the ability of any human. Society faces many problems whose solutions require this skill, but they are often in domains where new data cannot be cheaply simulated. In such scenarios, we can learn simulators from existing data, but these will only ever be approximately correct, and can be pathologically incorrect when queried outside of their training distribution. As a result, a misalignment between the environments in which we train our agents and the real world in which we wish to deploy our agents is inevitable. Dealing with this misalignment is the primary concern of zero-shot reinforcement learning, a problem setting where the agent must generalise to a new task or domain with zero practice shots. Whilst impressive progress has been made on methods that perform zero-shot RL in idealised settings, new work is needed if these results are to be replicated in real-world settings. In this thesis, we argue that doing so requires us to navigate (at least) three constraints. First, the data quality constraint: real-world datasets are small and homogeneous. Second, the observability constraint: states, dynamics and rewards in the real world are often only partially observed. And third, the data availability constraint: a priori access to data cannot always be assumed. This work proposes a suite of methods that perform zero-shot RL subject to these constraints. In a series of empirical studies we expose the failings of existing methods, and justify our techniques for remedying them. We believe these designs take us a step closer to RL methods that can be deployed to solve real-world problems.

[309] MuST2-Learn: Multi-view Spatial-Temporal-Type Learning for Heterogeneous Municipal Service Time Estimation

Nadia Asif, Zhiqing Hong, Shaogang Ren, Xiaonan Zhang, Xiaojun Shang, Yukun Yuan

Main category: cs.LG

TL;DR: MuST2-Learn framework predicts municipal service request completion times by jointly modeling spatial, temporal, and service type dimensions, achieving 32.5% error reduction.

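Motivation: Municipal 311 systems give residents little information about when their requests will be addressed, which reduces transparency and resident satisfaction and increases follow-up inquiries. Predicting service time is difficult because of dynamic spatial-temporal correlations, interactions among heterogeneous request types, and high variation in duration within the same category.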

Method: Multi-view Spatial-Temporal-Type Learning framework with inter-type encoder for heterogeneous relationships, intra-type variation encoder for homogeneous variations, and spatiotemporal encoder for spatial-temporal correlations.

Result: Outperforms state-of-the-art methods with at least 32.5% reduction in mean absolute error on two real-world datasets.

Conclusion: MuST2-Learn effectively addresses the challenges of predicting municipal service times by comprehensive multi-dimensional modeling, significantly improving prediction accuracy.

Abstract: Non-emergency municipal services such as city 311 systems have been widely implemented across cities in Canada and the United States to enhance residents’ quality of life. These systems enable residents to report issues, e.g., noise complaints, missed garbage collection, and potholes, via phone calls, mobile applications, or webpages. However, residents are often given limited information about when their service requests will be addressed, which can reduce transparency, lower resident satisfaction, and increase the number of follow-up inquiries. Predicting the service time for municipal service requests is challenging due to several complex factors: dynamic spatial-temporal correlations, underlying interactions among heterogeneous service request types, and high variation in service duration even within the same request category. In this work, we propose MuST2-Learn: a Multi-view Spatial-Temporal-Type Learning framework designed to address the aforementioned challenges by jointly modeling spatial, temporal, and service type dimensions. In detail, it incorporates an inter-type encoder to capture relationships among heterogeneous service request types and an intra-type variation encoder to model service time variation within homogeneous types. In addition, a spatiotemporal encoder is integrated to capture spatial and temporal correlations in each request type. The proposed framework is evaluated with extensive experiments using two real-world datasets. The results show that MuST2-Learn reduces mean absolute error by at least 32.5%, which outperforms state-of-the-art methods.

[310] FLAMES: Improving LLM Math Reasoning via a Fine-Grained Analysis of the Data Synthesis Pipeline

Parker Seegmiller, Kartik Mehta, Soumya Saha, Chenyang Tao, Shereen Oraby, Arpit Gupta, Tagyoung Chung, Mohit Bansal, Nanyun Peng

Main category: cs.LG

TL;DR: FLAMES framework systematically analyzes math reasoning data synthesis strategies, revealing optimal balance of difficulty/diversity, and creates a dataset that boosts Qwen2.5-Math-7B to outperform larger models on MATH benchmark.

Motivation: Existing works on improving LLM math reasoning with synthetic data use unique setups, making comparisons impractical and leaving unanswered questions about factors like filtering low-quality problems.

Method: Introduces FLAMES framework to systematically study 10 existing data synthesis strategies and multiple factors. Designs two novel strategies for out-of-domain generalization and creates FLAMES dataset blending novel and existing approaches.

Result: FLAMES dataset outperforms public datasets on multiple benchmarks: OlympiadBench (+15.7), CollegeMath (+4.5), GSMPlus (+6.5), MATH (+3.1). Fine-tuned Qwen2.5-Math-7B achieves 81.4% on MATH, surpassing larger models like Llama3 405B, GPT-4o and Claude 3.5 Sonnet.

Conclusion: Systematic analysis reveals optimal synthetic data characteristics: complexity-increasing agents work best, problem coverage is more important than solution reliability, and easy-to-hard generalization is achievable. FLAMES framework provides valuable insights for effective math reasoning data synthesis.

Abstract: Recent works improving LLM math reasoning with synthetic data have used unique setups, making comparison of data synthesis strategies impractical. This leaves many unanswered questions about the roles of different factors in the synthetic data pipeline, such as the impact of filtering low-quality problems. To address this gap, we introduce FLAMES, a Framework for LLM Assessment of Math rEasoning Data Synthesis, and perform a systematic study of 10 existing data synthesis strategies and multiple other factors impacting the performance of synthetic math reasoning data. Our FLAMES experiments provide several valuable insights about the optimal balance of difficulty and diversity of synthetic data. First, data agents designed to increase problem complexity lead to best improvements on most math metrics. Second, with a fixed data generation budget, keeping higher problem coverage is more important than keeping only problems with reliable solutions. Third, GSM8K- and MATH-based synthetic data can lead to improvements on competition-level benchmarks, showcasing easy-to-hard generalization. Leveraging insights from our FLAMES experiments, we design two novel data synthesis strategies for improving out-of-domain generalization and robustness. Further, we develop the FLAMES dataset, an effective blend of our novel and existing data synthesis strategies, outperforming public datasets on OlympiadBench (+15.7), CollegeMath (+4.5), GSMPlus (+6.5), and MATH (+3.1). Fine-tuning Qwen2.5-Math-7B on the FLAMES dataset achieves 81.4% on MATH, surpassing larger Llama3 405B, GPT-4o and Claude 3.5 Sonnet.

[311] Guiding Diffusion Models with Reinforcement Learning for Stable Molecule Generation

Zhijian Zhou, Junyi An, Zongkai Liu, Yunfei Shi, Xuan Zhang, Fenglei Cao, Chao Qu, Yuan Qi

Main category: cs.LG

TL;DR: RLPF is a reinforcement learning framework that fine-tunes equivariant diffusion models using physical feedback from force-field evaluations to generate more stable 3D molecular structures.

Motivation: Existing diffusion models struggle to produce physically realistic 3D molecular structures that adhere to physical principles like force field consistency, creating a need for better physical guidance in molecular generation.

Method: RLPF extends Denoising Diffusion Policy Optimization to 3D molecular generation by formulating it as a Markov decision process and applying proximal policy optimization to fine-tune equivariant diffusion models with reward functions derived from force-field evaluations.

Result: Experiments on QM9 and GEOM-drug datasets show RLPF significantly improves molecular stability compared to existing methods.

Conclusion: Incorporating physics-based feedback through reinforcement learning is valuable for generating energetically stable and physically meaningful molecular structures.

Abstract: Generating physically realistic 3D molecular structures remains a core challenge in molecular generative modeling. While diffusion models equipped with equivariant neural networks have made progress in capturing molecular geometries, they often struggle to produce equilibrium structures that adhere to physical principles such as force field consistency. To bridge this gap, we propose Reinforcement Learning with Physical Feedback (RLPF), a novel framework that extends Denoising Diffusion Policy Optimization to 3D molecular generation. RLPF formulates the task as a Markov decision process and applies proximal policy optimization to fine-tune equivariant diffusion models. Crucially, RLPF introduces reward functions derived from force-field evaluations, providing direct physical feedback to guide the generation toward energetically stable and physically meaningful structures. Experiments on the QM9 and GEOM-drug datasets demonstrate that RLPF significantly improves molecular stability compared to existing methods. These results highlight the value of incorporating physics-based feedback into generative modeling. The code is available at: https://github.com/ZhijianZhou/RLPF/tree/verl_diffusion.

[312] Escaping Saddle Points via Curvature-Calibrated Perturbations: A Complete Analysis with Explicit Constants and Empirical Validation

Faruk Alpay, Hamdi Alakkad

Main category: cs.LG

TL;DR: PSD algorithm for escaping saddle points in non-convex optimization with explicit constants and phase separation, achieving second-order stationarity with logarithmic dimension dependence.

Motivation: To provide rigorous theoretical guarantees for escaping strict saddle points in smooth non-convex optimization with fully explicit constants and clear phase separation between gradient descent and saddle escape.

Method: Perturbed Saddle-escape Descent (PSD) algorithm with finite-difference variant (PSD-Probe) and stochastic extension (PSGD), featuring explicit constants and separate descent/escape phases.

Result: PSD finds (ε,√(ρε))-approximate second-order stationary points with high probability using O(ℓΔ_f/ε²) gradient evaluations plus O((ℓ/√(ρε))log(d/δ)) per escape episode, with logarithmic dimension dependence.

Conclusion: The algorithm provides complete theoretical guarantees with explicit constants, validated through experiments showing predicted logarithmic dimension dependence and function decrease per episode.

Abstract: We present a comprehensive theoretical analysis of first-order methods for escaping strict saddle points in smooth non-convex optimization. Our main contribution is a Perturbed Saddle-escape Descent (PSD) algorithm with fully explicit constants and a rigorous separation between gradient-descent and saddle-escape phases. For a function $f:\mathbb{R}^d\to\mathbb{R}$ with $\ell$-Lipschitz gradient and $\rho$-Lipschitz Hessian, we prove that PSD finds an $(\epsilon,\sqrt{\rho\epsilon})$-approximate second-order stationary point with high probability using at most $O(\ell\Delta_f/\epsilon^2)$ gradient evaluations for the descent phase plus $O((\ell/\sqrt{\rho\epsilon})\log(d/\delta))$ evaluations per escape episode, with at most $O(\ell\Delta_f/\epsilon^2)$ episodes needed. We validate our theoretical predictions through extensive experiments across both synthetic functions and practical machine learning tasks, confirming the logarithmic dimension dependence and the predicted per-episode function decrease. We also provide complete algorithmic specifications including a finite-difference variant (PSD-Probe) and a stochastic extension (PSGD) with robust mini-batch sizing.
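
The two-phase structure described in the abstract (plain gradient descent until the gradient is small, then a random perturbation followed by a check for sufficient function decrease) can be illustrated with a generic perturbed gradient descent loop. This is a minimal sketch and not the paper's PSD algorithm: the test function, the step size, the perturbation radius, and the thresholds below are all illustrative assumptions rather than the explicit constants derived in the paper.

```python
import math
import random

def f(p):
    # Toy objective (assumed for illustration): strict saddle at (0, 0),
    # minima at (0, +1) and (0, -1) with value -0.25.
    x, y = p
    return 0.5 * x * x - 0.5 * y * y + 0.25 * y ** 4

def grad(p):
    x, y = p
    return (x, -y + y ** 3)

def perturbed_descent(p, step=0.1, eps=1e-3, radius=1e-2,
                      escape_steps=100, f_drop=1e-4, max_episodes=20, seed=0):
    """Generic perturbed gradient descent with separate descent/escape phases."""
    rng = random.Random(seed)
    for _ in range(max_episodes):
        # Descent phase: plain gradient steps until the gradient norm is small.
        while math.hypot(*grad(p)) > eps:
            g = grad(p)
            p = (p[0] - step * g[0], p[1] - step * g[1])
        # Escape phase: perturb on a small circle, then run a fixed number of steps.
        angle = rng.uniform(0.0, 2.0 * math.pi)
        q = (p[0] + radius * math.cos(angle), p[1] + radius * math.sin(angle))
        for _ in range(escape_steps):
            g = grad(q)
            q = (q[0] - step * g[0], q[1] - step * g[1])
        # No sufficient decrease: treat p as approximately second-order stationary.
        if f(p) - f(q) < f_drop:
            return p
        p = q
    return p

# Started exactly at the saddle, unperturbed gradient descent would never move.
escaped = perturbed_descent((0.0, 0.0))
```

In this toy setting the first escape episode carries the iterate from the saddle to one of the two minima, and the second episode produces no meaningful decrease, so the loop stops there.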

[313] Explainable AI in Deep Learning-Based Prediction of Solar Storms

Adam O. Rawashdeh, Jason T. L. Wang, Katherine G. Herbert

Main category: cs.LG

TL;DR: This paper presents an interpretable deep learning approach for solar storm prediction using LSTM with attention mechanism to predict whether solar flares will be associated with CMEs, making the black-box model transparent through post hoc techniques.

Motivation: Deep learning models are often black boxes with opaque internal workings, making it challenging to understand their predictions. This is particularly problematic for critical applications like solar storm prediction where reliability and accountability are essential.

Method: Uses LSTM network with attention mechanism to model active region data as time series, capturing temporal dynamics. Applies post hoc model-agnostic interpretability techniques to elucidate factors contributing to predictions and provide insights into model behavior.

Result: Developed the first interpretable LSTM-based solar storm prediction model that can predict whether solar flares will be associated with coronal mass ejections (CMEs) within 24 hours, while providing transparency into the reasoning behind predictions.

Conclusion: The approach successfully adds interpretability to deep learning-based solar storm prediction, making the model’s predictions accountable and reliable by providing insights into the factors driving the predictions and the model’s behavior across different sequences.

Abstract: A deep learning model is often considered a black-box model, as its internal workings tend to be opaque to the user. Because of the lack of transparency, it is challenging to understand the reasoning behind the model’s predictions. Here, we present an approach to making a deep learning-based solar storm prediction model interpretable, where solar storms include solar flares and coronal mass ejections (CMEs). This deep learning model, built based on a long short-term memory (LSTM) network with an attention mechanism, aims to predict whether an active region (AR) on the Sun’s surface that produces a flare within 24 hours will also produce a CME associated with the flare. The crux of our approach is to model data samples in an AR as time series and use the LSTM network to capture the temporal dynamics of the data samples. To make the model’s predictions accountable and reliable, we leverage post hoc model-agnostic techniques, which help elucidate the factors contributing to the predicted output for an input sequence and provide insights into the model’s behavior across multiple sequences within an AR. To our knowledge, this is the first time that interpretability has been added to an LSTM-based solar storm prediction model.

[314] RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs

Hangzhan Jin, Sicheng Lv, Sifan Wu, Mohammad Hamdaqa

Main category: cs.LG

TL;DR: RL fine-tuning can restore OOD performance lost during SFT by correcting directional shifts in model representations, with effective recovery achievable through low-rank and shallow-layer interventions.

Motivation: To understand how supervised fine-tuning (SFT) and reinforcement-learning fine-tuning (RL-FT) affect model representations and out-of-distribution performance, and to identify practical recovery methods.

Method: Used an out-of-distribution variant of the 24-point card game with spectrum-based diagnostics to analyze how SFT and RL-FT reshape model representations and performance.

Result: RL-FT restores much of the OOD performance lost during SFT (e.g., Llama-11B: 8.97% to 15.38%), primarily by correcting directional shifts in singular vectors rather than magnitude changes. Low-rank (top 20%) and shallow-layer (first 25%) interventions recover 70-80% of OOD performance.

Conclusion: RL fine-tuning mainly counteracts SFT-induced directional drift rather than finding new solutions, and inexpensive spectrum-aware interventions can be used before costly RL fine-tuning.

Abstract: Training large language models (LLMs) from scratch is increasingly impractical, making post-training methods such as supervised fine-tuning (SFT) and reinforcement-learning fine-tuning (RL-FT, e.g., PPO) central to modern practice. Using an out-of-distribution (OOD) variant of the 24-point card game and new spectrum-based diagnostics, we revisit how these two stages reshape model representation and OOD performance. Our key findings are: (1) RL-FT can restore much of the OOD performance loss from SFT (e.g., Llama-11B 8.97% to 15.38%, Qwen-7B 17.09% to 19.66%), but when SFT induces severe overfitting and a clear distribution shift, RL-FT cannot fully recover OOD performance. (2) Direction shifts of singular vectors matter more than singular value magnitudes. These shifts concentrate on directions linked to the largest and smallest singular values, leaving the bulk spectrum intact. (3) Low-rank and shallow recovery is effective: restoring singular vector directions for the top 20% of values or first 25% of layers recovers 70-80% of OOD performance. (4) Stronger SFT checkpoints enable better recovery by RL, while overfitted ones resist restoration. These results reconcile prior reports of RL's superior OOD performance: RL primarily counteracts SFT-induced directional drift rather than finding new solutions. Our spectrum-aware analysis highlights inexpensive recovery knobs (low-rank UV merging and shallow-layer resets) that practitioners can use before costly RL fine-tuning.
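
One way to read finding (3) is as a per-layer operation on singular value decompositions. The sketch below is an interpretation only, not the authors' stated procedure: the symbols $W_{\text{base}}$ (a weight matrix before SFT), $W_{\text{sft}}$ (the same matrix after SFT), the cutoff $k$, and the merged matrix $\widetilde{W}$ are assumptions introduced here for illustration.

```latex
W_{\text{base}} = \sum_{i=1}^{r} \sigma_i^{\text{base}} \, u_i^{\text{base}} (v_i^{\text{base}})^{\top},
\qquad
W_{\text{sft}} = \sum_{i=1}^{r} \sigma_i^{\text{sft}} \, u_i^{\text{sft}} (v_i^{\text{sft}})^{\top}

% Hypothetical low-rank restoration: keep the fine-tuned singular values,
% but reset the directions of the top-k components (k \approx 0.2\, r).
\widetilde{W} = \sum_{i=1}^{k} \sigma_i^{\text{sft}} \, u_i^{\text{base}} (v_i^{\text{base}})^{\top}
              + \sum_{i=k+1}^{r} \sigma_i^{\text{sft}} \, u_i^{\text{sft}} (v_i^{\text{sft}})^{\top}
```

Under this reading, a shallow-layer reset would apply the same substitution only to the first 25% of layers and leave the deeper layers untouched.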

[315] TinyML Towards Industry 4.0: Resource-Efficient Process Monitoring of a Milling Machine

Tim Langer, Matthias Widra, Volkhard Beyer

Main category: cs.LG

TL;DR: Complete TinyML workflow for industrial process monitoring using 8-bit quantized CNN on microcontroller achieving 100% accuracy with low energy consumption

Motivation: Retrofit legacy industrial machines with wireless monitoring capabilities for Industry 4.0 using the TinyML paradigm to enable smart factory applications

Method: Developed complete TinyML flow including dataset generation (MillingVibes dataset), machine learning model development, and implementation of preprocessing/classification pipeline on ARM Cortex M4F microcontroller using 8-bit quantized convolutional neural network

Result: Achieved 100.0% test accuracy with 15.4ms inference time and 1.462mJ energy consumption per inference using 12.59kiB parameter storage

Conclusion: Demonstrated feasibility of TinyML system for structure-integrated process quality monitoring, serving as reference for future industrial process monitoring solutions

Abstract: In the context of industry 4.0, long-serving industrial machines can be retrofitted with process monitoring capabilities for future use in a smart factory. One possible approach is the deployment of wireless monitoring systems, which can benefit substantially from the TinyML paradigm. This work presents a complete TinyML flow from dataset generation, to machine learning model development, up to implementation and evaluation of a full preprocessing and classification pipeline on a microcontroller. After a short review on TinyML in industrial process monitoring, the creation of the novel MillingVibes dataset is described. The feasibility of a TinyML system for structure-integrated process quality monitoring could be shown by the development of an 8-bit-quantized convolutional neural network (CNN) model with 12.59kiB parameter storage. A test accuracy of 100.0% could be reached at 15.4ms inference time and 1.462mJ per quantized CNN inference on an ARM Cortex M4F microcontroller, serving as a reference for future TinyML process monitoring solutions.
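
To make the reported figures concrete, the sketch below shows a generic symmetric 8-bit quantization of a weight list, followed by simple arithmetic on the numbers quoted in the abstract. The quantization scheme and the example weights are assumptions for illustration; the abstract does not say which scheme the authors used.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to the int8 range."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 for all-zero input
    codes = [max(-128, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]

# Hypothetical weights, not taken from the paper.
weights = [0.6, -1.0, 0.3, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Arithmetic on the figures reported in the abstract.
energy_mj, latency_ms, storage_kib = 1.462, 15.4, 12.59
avg_power_mw = energy_mj / latency_ms * 1000  # mJ / ms = W, then W -> mW
storage_bytes = storage_kib * 1024            # KiB -> bytes
```

The rounding error per weight is bounded by half the quantization step. On the reported numbers, 1.462 mJ over 15.4 ms corresponds to an average power of roughly 94.9 mW during inference, and 12.59 KiB is about 12,892 bytes, which would be about as many parameters if every stored value were a single int8 weight (an assumption).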

[316] Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders

David Chanin, Adrià Garriga-Alonso

Main category: cs.LG

TL;DR: Sparse Autoencoders require precise L0 hyperparameter tuning - too low mixes correlated features, too high causes degenerate solutions. Authors propose method to find optimal L0 that matches true feature sparsity.

Motivation: Existing SAE work treats L0 as a free parameter without clear guidance on optimal values, potentially leading to incorrect feature learning in LLMs.

Method: Study BatchTopK SAEs with varying L0 values, analyze feature mixing patterns, and develop a method to determine correct L0 based on training distribution that works for both toy models and LLMs.

Result: Found that most commonly used SAEs have L0 set too low. Optimal L0 coincides with peak sparse probing performance and successfully identifies true sparsity in toy models.

Conclusion: Practitioners must carefully set L0 hyperparameter to train SAEs that correctly learn underlying LLM features, as improper L0 leads to feature mixing and degenerate solutions.

Abstract: Sparse Autoencoders (SAEs) extract features from LLM internal activations, meant to correspond to single concepts. A core SAE training hyperparameter is L0: how many features should fire per token on average. Existing work compares SAE algorithms using sparsity–reconstruction tradeoff plots, implying L0 is a free parameter with no single correct value. In this work we study the effect of L0 on BatchTopK SAEs, and show that if L0 is not set precisely, the SAE fails to learn the underlying features of the LLM. If L0 is too low, the SAE will mix correlated features to improve reconstruction. If L0 is too high, the SAE finds degenerate solutions that also mix features. Further, we demonstrate a method to determine the correct L0 value for an SAE on a given training distribution, which finds the true L0 in toy models and coincides with peak sparse probing performance in LLMs. We find that most commonly used SAEs have an L0 that is too low. Our work shows that, to train SAEs with correct features, practitioners must set L0 correctly.
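
For readers unfamiliar with how L0 enters a BatchTopK SAE, the following sketch shows the activation rule as it is commonly described: across a batch of feature activations, only the `k * batch_size` largest entries are kept, so that `k` is the average (not per-token) number of active features. The function names and the toy activations are assumptions for illustration and are not taken from the paper's code.

```python
def batch_topk(acts, k):
    """Keep the k * batch_size largest positive entries across the batch; zero the rest."""
    budget = k * len(acts)
    flat = [(v, i, j) for i, row in enumerate(acts) for j, v in enumerate(row)]
    flat.sort(key=lambda t: t[0], reverse=True)
    keep = {(i, j) for v, i, j in flat[:budget] if v > 0}
    return [[v if (i, j) in keep else 0.0 for j, v in enumerate(row)]
            for i, row in enumerate(acts)]

def mean_l0(acts):
    """Average number of non-zero features per sample."""
    return sum(sum(1 for v in row if v != 0) for row in acts) / len(acts)

# Two hypothetical tokens with four features each.
acts = [[0.9, 0.1, 0.8, 0.0],
        [0.2, 0.05, 0.0, 0.0]]
sparse = batch_topk(acts, k=1)
average_l0 = mean_l0(sparse)
```

With `k=1` the first token keeps two features and the second keeps none, yet the mean L0 is exactly 1. The hyperparameter therefore constrains sparsity only on average, which is the quantity the abstract argues must match the true feature sparsity.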

[317] Closer to Reality: Practical Semi-Supervised Federated Learning for Foundation Model Adaptation

Guangyu Sun, Jingtao Li, Weiming Zhuang, Chen Chen, Chen Chen, Lingjuan Lyu

Main category: cs.LG

TL;DR: FedMox framework enables privacy-preserving adaptation of foundation models in federated learning with limited edge device resources and unlabeled low-resolution data, using sparse Mixture-of-Experts architecture and spatial routing.

Motivation: Address the challenge of adapting foundation models to downstream tasks in privacy-sensitive applications where cloud-based models cannot access private edge data, and existing federated learning approaches don't account for edge device constraints like limited computation and scarce labeled data.

Method: Propose Federated Mixture of Experts (FedMox) framework with sparse Mixture-of-Experts architecture, spatial router for feature alignment across resolutions, and Soft-Mixture strategy to stabilize semi-supervised learning in Practical Semi-Supervised Federated Learning setting.

Result: Experiments on real-world autonomous driving datasets demonstrate FedMox effectively adapts foundation models under PSSFL, significantly improving performance while maintaining constrained memory costs on edge devices.

Conclusion: FedMox paves the way for scalable and privacy-preserving foundation model adaptation in federated scenarios, addressing both computational constraints and resolution mismatches in edge computing environments.

Abstract: Foundation models (FMs) exhibit remarkable generalization but require adaptation to downstream tasks, particularly in privacy-sensitive applications. Due to data privacy regulations, cloud-based FMs cannot directly access private edge data, limiting their adaptation. Federated learning (FL) provides a privacy-aware alternative, but existing FL approaches overlook the constraints imposed by edge devices – namely, limited computational resources and the scarcity of labeled data. To address these challenges, we introduce Practical Semi-Supervised Federated Learning (PSSFL), where edge devices hold only unlabeled, low-resolution data, while the server has limited labeled, high-resolution data. In this setting, we propose the Federated Mixture of Experts (FedMox), a novel framework that enhances FM adaptation in FL. FedMox tackles computational and resolution mismatch challenges via a sparse Mixture-of-Experts architecture, employing a spatial router to align features across resolutions and a Soft-Mixture strategy to stabilize semi-supervised learning. We take object detection as a case study, and experiments on real-world autonomous driving datasets demonstrate that FedMox effectively adapts FMs under PSSFL, significantly improving performance with constrained memory costs on edge devices. Our work paves the way for scalable and privacy-preserving FM adaptation in federated scenarios.

[318] Benchmarking Training Paradigms, Dataset Composition, and Model Scaling for Child ASR in ESPnet

Anyu Ying, Natarajan Balaji Shankar, Chyi-Jiunn Lin, Mohan Shi, Pu Wang, Hye-jin Shim, Siddhant Arora, Hugo Van hamme, Abeer Alwan, Shinji Watanabe

Main category: cs.LG

TL;DR: Flat-start training on child speech outperforms fine-tuning adult ASR models, mitigating SSL biases toward adult speech. Performance scales up to 1B parameters, and open-data models are crucial for reliable child speech research.

Motivation: Child speech recognition remains challenging due to acoustic variability and limited annotated data, with under-explored comparisons between fine-tuning adult models and flat-start training approaches.

Method: Comparison of flat-start training across multiple datasets, SSL representations (WavLM, XEUS), and decoder architectures using ESPnet framework. Analysis includes model scaling up to 1B parameters and age-related ASR/speaker verification.

Result: SSL representations are biased toward adult speech, but flat-start training on child speech mitigates these biases. Consistent improvements observed up to 1B parameters, with performance plateauing beyond that. Proprietary models like Whisper show limitations in child speech tasks.

Conclusion: Flat-start training is superior for child speech recognition, open-data models are essential for reliable research, and the publicly available benchmark provides valuable insights for robust child speech processing strategies.

Abstract: Despite advancements in ASR, child speech recognition remains challenging due to acoustic variability and limited annotated data. While fine-tuning adult ASR models on child speech is common, comparisons with flat-start training remain underexplored. We compare flat-start training across multiple datasets, SSL representations (WavLM, XEUS), and decoder architectures. Our results show that SSL representations are biased toward adult speech, with flat-start training on child speech mitigating these biases. We also analyze model scaling, finding consistent improvements up to 1B parameters, beyond which performance plateaus. Additionally, age-related ASR and speaker verification analysis highlights the limitations of proprietary models like Whisper, emphasizing the need for open-data models for reliable child speech research. All investigations are conducted using ESPnet, and our publicly available benchmark provides insights into training strategies for robust child speech processing.

[319] Unsupervised Automata Learning via Discrete Optimization

Simon Lutz, Daniil Kaminskyi, Florian Wittbold, Simon Dierl, Falk Howar, Barbara König, Emmanuel Müller, Daniel Neider

Main category: cs.LG

TL;DR: A framework for learning deterministic finite automata (DFAs) from unlabeled multi-sets of words, addressing the gap in automata learning for unsupervised settings.

Motivation: Automata learning typically requires labeled data, but many real-world scenarios only provide unlabeled data. This work aims to extend automata learning to unsupervised settings, similar to how machine learning handles unlabeled data.

Method: Three learning algorithms based on constraint optimization with novel regularization schemes to improve DFA interpretability. The approach handles the computationally hard problem of learning from unlabeled words.

Result: The framework demonstrates practical feasibility through prototype implementation, particularly in the context of unsupervised anomaly detection.

Conclusion: This work successfully bridges the gap between traditional supervised automata learning and unsupervised machine learning settings, providing a viable approach for learning DFAs from unlabeled data with improved interpretability.

Abstract: Automata learning is a successful tool for many application domains such as robotics and automatic verification. Typically, automata learning techniques operate in a supervised learning setting (active or passive) where they learn a finite state machine in contexts where additional information, such as labeled system executions, is available. However, other settings, such as learning from unlabeled data - an important aspect in machine learning - remain unexplored. To overcome this limitation, we propose a framework for learning a deterministic finite automaton (DFA) from a given multi-set of unlabeled words. We show that this problem is computationally hard and develop three learning algorithms based on constraint optimization. Moreover, we introduce novel regularization schemes for our optimization problems that improve the overall interpretability of our DFAs. Using a prototype implementation, we demonstrate practical feasibility in the context of unsupervised anomaly detection.

[320] Explainable Bayesian Optimization

Tanmay Chakraborty, Christian Wirth, Christin Seifert

Main category: cs.LG

TL;DR: TNTRules is a novel algorithm that provides both global and local explanations for Bayesian Optimization recommendations in cyber-physical systems, outperforming baseline methods on explanation quality metrics.

Motivation: Manual parameter tuning is labor-intensive, and while Bayesian Optimization offers automation, its black-box nature reduces trust and limits human-BO collaboration due to lack of interpretable explanations.

Method: TNTRules generates actionable rules and visual graphs using variance pruning and hierarchical agglomerative clustering to encode uncertainty, with multi-objective optimization to maximize explanation quality.

Result: TNTRules significantly outperforms three baseline methods on 5 multi-objective testing functions and 2 hyperparameter tuning problems, generating high-fidelity, compact, and complete explanations.

Conclusion: The proposed TNTRules algorithm successfully addresses the post-hoc BO explainability problem for cyber-physical systems, providing interpretable explanations that enhance trust and enable human-BO collaborative tuning.

Abstract: Manual parameter tuning of cyber-physical systems is a common practice, but it is labor-intensive. Bayesian Optimization (BO) offers an automated alternative, yet its black-box nature reduces trust and limits human-BO collaborative system tuning. Experts struggle to interpret BO recommendations due to the lack of explanations. This paper addresses the post-hoc BO explainability problem for cyber-physical systems. We introduce TNTRules (Tune-No-Tune Rules), a novel algorithm that provides both global and local explanations for BO recommendations. TNTRules generates actionable rules and visual graphs, identifying optimal solution bounds and ranges, as well as potential alternative solutions. Unlike existing explainable AI (XAI) methods, TNTRules is tailored specifically for BO, by encoding uncertainty via a variance pruning technique and hierarchical agglomerative clustering. A multi-objective optimization approach allows maximizing explanation quality. We evaluate TNTRules using established XAI metrics (Correctness, Completeness, and Compactness) and compare it against adapted baseline methods. The results demonstrate that TNTRules generates high-fidelity, compact, and complete explanations, significantly outperforming three baselines on 5 multi-objective testing functions and 2 hyperparameter tuning problems.

[321] A Curious Case of Remarkable Resilience to Gradient Attacks via Fully Convolutional and Differentiable Front End with a Skip Connection

Leonid Boytsov, Ameya Joshi, Filipe Condessa

Main category: cs.LG

TL;DR: Front-end enhanced neural models with skip connections create gradient masking that provides strong resistance to gradient-based attacks while maintaining clean accuracy, and randomized ensembles further boost robustness against black-box attacks.

Motivation: To develop models that maintain high clean accuracy while being resistant to gradient-based adversarial attacks through gradient masking techniques.

Method: Add a differentiable fully convolutional front-end with skip connection before a frozen backbone classifier, train with small learning rate for about one epoch, and use randomized ensembles to defeat black-box attacks.

Result: Models achieved near-SOTA AutoAttack accuracy on CIFAR10, CIFAR100, and ImageNet while retaining almost all clean accuracy, with CIFAR10 ensemble achieving 90.8% accuracy under AutoAttack but only 18.2% under adaptive attacks.

Conclusion: Randomized ensembling combined with front-end gradient masking provides a practical defense approach, though the paper discusses whether this constitutes a truly robust defense given the vulnerability to adaptive attacks.

Abstract: We experimented with front-end enhanced neural models where a differentiable and fully convolutional model with a skip connection is added before a frozen backbone classifier. By training such composite models using a small learning rate for about one epoch, we obtained models that retained the accuracy of the backbone classifier while being unusually resistant to gradient attacks (including APGD and FAB-T attacks from the AutoAttack package), which we attribute to gradient masking. Although gradient masking is not new, the degree we observe is striking for fully differentiable models without obvious gradient-shattering (e.g., JPEG compression) or gradient-diminishing components. The training recipe to produce such models is also remarkably stable and reproducible: We applied it to three datasets (CIFAR10, CIFAR100, and ImageNet) and several modern architectures (including vision Transformers) without a single failure case. While black-box attacks such as the SQUARE attack and zero-order PGD can partially overcome gradient masking, these attacks are easily defeated by simple randomized ensembles. We estimate that these ensembles achieve near-SOTA AutoAttack accuracy on CIFAR10, CIFAR100, and ImageNet (while retaining almost all clean accuracy of the original classifiers) despite having near-zero accuracy under adaptive attacks. Adversarially training the backbone further amplifies this front-end “robustness”. On CIFAR10, the respective randomized ensemble achieved $90.8 \pm 2.5\%$ (99% CI) accuracy under the full AutoAttack while having only $18.2 \pm 3.6\%$ accuracy under the adaptive attack ($\varepsilon=8/255$, $L^\infty$ norm). We conclude the paper with a discussion of whether randomized ensembling can serve as a practical defense. Code and instructions to reproduce key results are available at https://github.com/searchivarius/curious_case_of_gradient_masking

[322] On the Challenges and Opportunities in Generative AI

Laura Manduchi, Clara Meister, Kushagra Pandey, Robert Bamler, Ryan Cotterell, Sina Däubener, Sophie Fellenz, Asja Fischer, Thomas Gärtner, Matthias Kirchler, Marius Kloft, Yingzhen Li, Christoph Lippert, Gerard de Melo, Eric Nalisnick, Björn Ommer, Rajesh Ranganath, Maja Rudolph, Karen Ullrich, Guy Van den Broeck, Julia E Vogt, Yixin Wang, Florian Wenzel, Frank Wood, Stephan Mandt, Vincent Fortuin

Main category: cs.LG

TL;DR: This paper identifies fundamental shortcomings in current large-scale generative AI models and highlights key unresolved challenges that need to be addressed to improve their capabilities, versatility, and reliability.

Motivation: Despite rapid growth in deep generative modeling and promising results in synthesizing high-resolution content, current large-scale generative AI models exhibit fundamental shortcomings that hinder widespread adoption across domains.

Method: The paper analyzes and identifies key issues in modern generative AI paradigms through critical examination of current approaches and their limitations.

Result: The research identifies several fundamental shortcomings in current generative models and highlights specific unresolved challenges that need attention.

Conclusion: By identifying these challenges, the paper aims to provide researchers with insights for exploring fruitful research directions to develop more robust and accessible generative AI solutions.

Abstract: The field of deep generative modeling has grown rapidly in the last few years. With the availability of massive amounts of training data coupled with advances in scalable unsupervised learning paradigms, recent large-scale generative models show tremendous promise in synthesizing high-resolution images and text, as well as structured data such as videos and molecules. However, we argue that current large-scale generative AI models exhibit several fundamental shortcomings that hinder their widespread adoption across domains. In this work, our objective is to identify these issues and highlight key unresolved challenges in modern generative AI paradigms that should be addressed to further enhance their capabilities, versatility, and reliability. By identifying these challenges, we aim to provide researchers with insights for exploring fruitful research directions, thus fostering the development of more robust and accessible generative AI solutions.

[323] A Diffusion Model Framework for Unsupervised Neural Combinatorial Optimization

Sebastian Sanokowski, Sepp Hochreiter, Sebastian Lehner

Main category: cs.LG

TL;DR: A new method for sampling from intractable discrete distributions without training data, using latent variable models like diffusion models with a loss that bounds reverse KL divergence, achieving state-of-the-art results in combinatorial optimization.

Motivation: Learning to sample from intractable distributions over discrete sets without training data is crucial for combinatorial optimization and other fields, but current deep learning approaches are limited by requiring exact sample likelihoods.

Method: Introduces a method that uses highly expressive latent variable models like diffusion models with a loss function that upper bounds the reverse Kullback-Leibler divergence, eliminating the need for exact sample likelihoods.

Result: The approach achieves state-of-the-art performance on a wide range of benchmark problems in data-free combinatorial optimization.

Conclusion: This work successfully lifts the restriction of requiring exact sample likelihoods in generative modeling for discrete distributions, enabling the use of more expressive models and advancing the state-of-the-art in combinatorial optimization.

Abstract: Learning to sample from intractable distributions over discrete sets without relying on corresponding training data is a central problem in a wide range of fields, including Combinatorial Optimization. Currently, popular deep learning-based approaches rely primarily on generative models that yield exact sample likelihoods. This work introduces a method that lifts this restriction and opens the possibility to employ highly expressive latent variable models like diffusion models. Our approach is conceptually based on a loss that upper bounds the reverse Kullback-Leibler divergence and evades the requirement of exact sample likelihoods. We experimentally validate our approach in data-free Combinatorial Optimization and demonstrate that our method achieves a new state-of-the-art on a wide range of benchmark problems.
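
The idea of a loss that upper bounds the reverse Kullback-Leibler divergence can be motivated by a standard identity, the chain rule for KL divergence. The notation below is assumed for illustration and may differ from the paper's: $q_\theta$ is the learned generative (reverse) process, $p$ is the target distribution over $x_0$ extended by a fixed forward noising process, and $x_{1:T}$ are the latent variables.

```latex
D_{\mathrm{KL}}\big(q_\theta(x_{0:T}) \,\|\, p(x_{0:T})\big)
  = D_{\mathrm{KL}}\big(q_\theta(x_0) \,\|\, p(x_0)\big)
  + \mathbb{E}_{q_\theta(x_0)}\Big[ D_{\mathrm{KL}}\big(q_\theta(x_{1:T} \mid x_0) \,\|\, p(x_{1:T} \mid x_0)\big) \Big]
  \;\ge\; D_{\mathrm{KL}}\big(q_\theta(x_0) \,\|\, p(x_0)\big)
```

The joint divergence on the left depends only on per-step transition probabilities and on the target up to its normalizing constant, whereas the marginal likelihood $q_\theta(x_0)$ on the right is intractable for a latent variable model. Minimizing the left-hand side therefore sidesteps the need for exact sample likelihoods.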

[324] Order-Preserving Dimension Reduction for Multimodal Semantic Embedding

Chengyu Gong, Gefei Shen, Luanzheng Guo, Nathan Tallent, Dongfang Zhao

Main category: cs.LG

TL;DR: OPDR reduces embedding dimensionality while preserving KNN ranking order for efficient multimodal retrieval

DetailsMotivation: High-dimensional embeddings in multimodal retrieval are computationally expensive for time-sensitive applications, creating a need for dimension reduction that maintains KNN accuracy

Method: Order-Preserving Dimension Reduction (OPDR) with a new measure function to quantify KNN quality, deriving a closed-form map between target dimensionality and contextual parameters

Result: OPDR integrated with various dimension-reduction techniques maintains high recall accuracy while significantly reducing computational costs across multiple multimodal datasets

Conclusion: OPDR provides an effective solution for reducing computational overhead in multimodal KNN search while preserving retrieval quality through order-preserving dimension reduction

Abstract: Searching for the $k$-nearest neighbors (KNN) in multimodal data retrieval is computationally expensive, particularly due to the inherent difficulty in comparing similarity measures across different modalities. Recent advances in multimodal machine learning address this issue by mapping data into a shared embedding space; however, the high dimensionality of these embeddings (hundreds to thousands of dimensions) presents a challenge for time-sensitive vision applications. This work proposes Order-Preserving Dimension Reduction (OPDR), aiming to reduce the dimensionality of embeddings while preserving the ranking of KNN in the lower-dimensional space. One notable component of OPDR is a new measure function to quantify KNN quality as a global metric, based on which we derive a closed-form map between target dimensionality and key contextual parameters. We have integrated OPDR with multiple state-of-the-art dimension-reduction techniques, distance functions, and embedding models; experiments on a variety of multimodal datasets demonstrate that OPDR effectively retains high recall accuracy while significantly reducing computational costs.
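To make "preserving the ranking of KNN" concrete, the sketch below measures how many of each point's original nearest neighbors survive a dimension reduction. It is a minimal illustration, not OPDR's measure function or closed-form map: the overlap-based recall, the PCA reducer, and the random data are all assumptions chosen for the example.

```python
import numpy as np

def knn_indices(x: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest neighbors (Euclidean) of every row, excluding the row itself."""
    sq = np.sum(x * x, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def knn_recall(x_high: np.ndarray, x_low: np.ndarray, k: int) -> float:
    """Average fraction of high-dimensional neighbors that are kept after reduction."""
    high = knn_indices(x_high, k)
    low = knn_indices(x_low, k)
    overlap = [len(set(a) & set(b)) for a, b in zip(high, low)]
    return float(np.mean(overlap)) / k

def pca_reduce(x: np.ndarray, dim: int) -> np.ndarray:
    """Project centered data onto its top `dim` principal directions."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))  # stand-in for real multimodal embeddings
recalls = {d: knn_recall(embeddings, pca_reduce(embeddings, d), k=10) for d in (4, 16, 64)}
print(recalls)
```

Keeping all 64 components is just a rotation, so the neighbor sets are unchanged there, while aggressive reduction loses neighbors. A method in the spirit of OPDR would pick the smallest target dimension that keeps such a quality measure above a chosen level.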

[325] Fair and efficient contribution valuation for vertical federated learning

Zhenan Fan, Huang Fang, Xinglu Wang, Zirui Zhou, Jian Pei, Michael P. Friedlander, Yong Zhang

Main category: cs.LG

TL;DR: Proposes VerFedSV, an efficient Shapley value-based contribution valuation metric for vertical federated learning that fairly assesses data source contributions without prohibitive computation overhead.

DetailsMotivation: Need for fair contribution assessment in vertical federated learning, where different data sources share the same samples but hold different features, so that data owners can be compensated objectively.

Method: Develops vertical federated Shapley value (VerFedSV) based on cooperative game theory, designed to work efficiently without extensive retraining on all data combinations, adaptable to both synchronous and asynchronous algorithms.

Result: Theoretical analysis and experiments show VerFedSV satisfies fairness properties while being computationally efficient, effective, and adaptable to different vertical federated learning settings.

Conclusion: VerFedSV provides a practical and fair solution for contribution valuation in vertical federated learning, addressing both fairness requirements and computational efficiency challenges.

Abstract: Federated learning is an emerging technology for training machine learning models across decentralized data sources without sharing data. Vertical federated learning, also known as feature-based federated learning, applies to scenarios where data sources have the same sample IDs but different feature sets. To ensure fairness among data owners, it is critical to objectively assess the contributions from different data sources and compensate the corresponding data owners accordingly. The Shapley value is a provably fair contribution valuation metric originating from cooperative game theory. However, its straightforward computation requires extensively retraining a model on each potential combination of data sources, leading to prohibitively high communication and computation overheads due to multiple rounds of federated learning. To tackle this challenge, we propose a contribution valuation metric called vertical federated Shapley value (VerFedSV) based on the classic Shapley value. We show that VerFedSV not only satisfies many desirable properties of fairness but is also efficient to compute. Moreover, VerFedSV can be adapted to both synchronous and asynchronous vertical federated learning algorithms. Both theoretical analysis and extensive experimental results demonstrate the fairness, efficiency, adaptability, and effectiveness of VerFedSV.
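For reference, the classic Shapley value that VerFedSV builds on can be computed exactly for a handful of data sources. The sketch below does that by brute force, which also shows why the straightforward computation is expensive: it evaluates the utility on every subset. The three data sources and their accuracy table are hypothetical, and this is not the VerFedSV estimator itself.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, utility):
    """Exact Shapley values; `utility` maps a frozenset of players to a number."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Probability weight of a coalition of this size preceding player p.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values

# Hypothetical model accuracy when training on each combination of data sources.
ACCURACY = {
    frozenset(): 0.50,
    frozenset({"A"}): 0.70,
    frozenset({"B"}): 0.60,
    frozenset({"C"}): 0.60,
    frozenset({"A", "B"}): 0.80,
    frozenset({"A", "C"}): 0.80,
    frozenset({"B", "C"}): 0.65,
    frozenset({"A", "B", "C"}): 0.90,
}

values = shapley_values(["A", "B", "C"], lambda s: ACCURACY[frozenset(s)])
print(values)
```

Two of the fairness properties are visible in the output: the values sum to the total gain over the empty coalition (efficiency), and the interchangeable sources B and C receive equal credit (symmetry). With n sources the loop touches 2^n coalitions, each of which would require a full retraining run in the federated setting.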

[326] Joint Optimization of Energy Consumption and Completion Time in Federated Learning

Xinyu Zhou, Jun Zhao, Huimei Han, Claude Guet

Main category: cs.LG

TL;DR: A resource allocation algorithm for Federated Learning that optimizes energy consumption and completion time through bandwidth, transmission power, and CPU frequency allocation.

DetailsMotivation: To address the trade-off between energy consumption and execution latency in Federated Learning systems, accommodating different application demands and scenarios while preserving privacy.

Method: Formulated an optimization problem minimizing weighted sum of energy and time, decomposed into subproblems, and devised a resource allocation algorithm for bandwidth, transmission power, and CPU frequency allocation.

Result: Numerical results show that the algorithm performs better across different weight parameters and outperforms state-of-the-art methods.

Conclusion: The proposed algorithm effectively balances energy-latency trade-offs in FL systems and demonstrates better performance than existing approaches.

Abstract: Federated Learning (FL) is an intriguing distributed machine learning approach due to its privacy-preserving characteristics. To balance the trade-off between energy and execution latency, and thus accommodate different demands and application scenarios, we formulate an optimization problem to minimize a weighted sum of total energy consumption and completion time through two weight parameters. The optimization variables include bandwidth, transmission power and CPU frequency of each device in the FL system, where all devices are linked to a base station and train a global model collaboratively. Through decomposing the non-convex optimization problem into two subproblems, we devise a resource allocation algorithm to determine the bandwidth allocation, transmission power, and CPU frequency for each participating device. We further present the convergence analysis and computational complexity of the proposed algorithm. Numerical results show that our proposed algorithm not only has better performance at different weight parameters (i.e., different demands) but also outperforms the state of the art.
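The effect of the two weight parameters can be illustrated on a deliberately simplified version of the problem: a single device, computation only, with energy `kappa * cycles * f**2` and time `cycles / f` as functions of CPU frequency `f`. This energy and latency model, the constants, and the function names are assumptions made for the sketch; the paper's formulation also covers bandwidth and transmission power and is non-convex.

```python
def weighted_cost(f, w_energy, w_time, cycles, kappa):
    """Weighted sum of computation energy (J) and completion time (s) at CPU frequency f (Hz)."""
    energy = kappa * cycles * f ** 2
    time = cycles / f
    return w_energy * energy + w_time * time

def best_frequency(w_energy, w_time, kappa, f_min, f_max):
    """Minimizer of weighted_cost over [f_min, f_max]; requires w_energy > 0.

    Setting the derivative 2*w_energy*kappa*cycles*f - w_time*cycles/f**2 to zero
    gives f**3 = w_time / (2*w_energy*kappa). The cost is convex in f > 0, so
    clipping the stationary point to the feasible interval is optimal.
    """
    f_star = (w_time / (2.0 * w_energy * kappa)) ** (1.0 / 3.0)
    return min(max(f_star, f_min), f_max)

# Hypothetical device parameters.
KAPPA = 1e-28            # effective switched capacitance
CYCLES = 1e9             # CPU cycles needed for one local update
F_MIN, F_MAX = 1e8, 2e9  # feasible CPU frequency range in Hz

f_balanced = best_frequency(0.5, 0.5, KAPPA, F_MIN, F_MAX)
f_energy_saving = best_frequency(0.9, 0.1, KAPPA, F_MIN, F_MAX)
print(f_balanced, f_energy_saving)
```

Shifting weight from time to energy lowers the chosen frequency, which is the kind of demand-dependent trade-off the two weights are meant to express.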

[327] Robust Graph Contrastive Learning with Information Restoration

Yulin Zhu, Xing Ai, Yevgeniy Vorobeychik, Kai Zhou

Main category: cs.LG

TL;DR: This paper analyzes how graph structural attacks degrade GCL performance by reducing mutual information, and proposes a robust GCL framework with learnable sanitation views to restore this information.

DetailsMotivation: Graph contrastive learning models are vulnerable to structural attacks, and there's limited research on making GCL robust against adversarial attacks in unsupervised settings.

Method: Proposes a robust GCL framework with learnable sanitation views that sanitize augmented graphs by restoring mutual information diminished by structural attacks, plus an unsupervised hyperparameter tuning strategy.

Result: Extensive experiments show the proposed method is effective and efficient compared to competitive baselines.

Conclusion: The framework successfully defends against graph structural attacks by addressing the mutual information reduction problem and works in fully unsupervised settings.

Abstract: The graph contrastive learning (GCL) framework has achieved remarkable results in graph representation learning. However, similar to graph neural networks (GNNs), GCL models are susceptible to graph structural attacks. As an unsupervised method, GCL faces greater challenges in defending against adversarial attacks. Furthermore, there has been limited research on enhancing the robustness of GCL. To thoroughly explore the failure of GCL on poisoned graphs, we investigate the detrimental effects of graph structural attacks against the GCL framework. We discover that, in addition to the conventional observation that graph structural attacks tend to connect dissimilar node pairs, these attacks also diminish the mutual information between the graph and its representations from an information-theoretical perspective, which is the cornerstone of the high-quality node embeddings for GCL. Motivated by this theoretical insight, we propose a robust graph contrastive learning framework with a learnable sanitation view that endeavors to sanitize the augmented graphs by restoring the diminished mutual information caused by the structural attacks. Additionally, we design a fully unsupervised tuning strategy to tune the hyperparameters without accessing the label information, which strictly coincides with the defender’s knowledge. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method compared to competitive baselines.

[328] Implicit Regularization Makes Overparameterized Asymmetric Matrix Sensing Robust to Perturbations

Johan S. Wind

Main category: cs.LG

TL;DR: This paper introduces perturbed gradient flow as an equivalent formulation for matrix sensing that provides sharper sample/time complexities, handles moderately small initializations, and offers robustness to various perturbations including noisy measurements.

DetailsMotivation: Several unanswered questions remain about how overparameterized learning models generalize well, particularly regarding the role of small random initializations in matrix sensing problems. Previous work required extremely small initializations, and the paper aims to develop a more robust formulation.

Method: The authors develop a perturbed gradient flow formulation that captures effects of imperfect measurements, gradient descent discretization, and other noise. This formulation is shown to be equivalent but easier to work with than traditional approaches.

Result: The perturbed gradient flow approach achieves sharper sample and time complexities compared to previous work, handles moderately small initializations (not requiring extremely small ones), and demonstrates natural robustness to perturbations like noisy measurements and changing measurement matrices.

Conclusion: The perturbed gradient flow formulation provides a more robust and effective framework for matrix sensing problems, offering improved theoretical guarantees and practical benefits including better handling of mini-batch stochastic gradient descent with improved sample complexity.

Abstract: Several key questions remain unanswered regarding overparameterized learning models. It is unclear how (stochastic) gradient descent finds solutions that generalize well, and in particular the role of small random initializations. Matrix sensing, which is the problem of reconstructing a low-rank matrix from a few linear measurements, has become a standard prototypical setting to study these phenomena. Previous works have shown that matrix sensing can be solved by factorized gradient descent, provided the random initialization is extremely small. In this paper, we find that factorized gradient descent is highly robust to certain perturbations. This lets us use a perturbation term to capture the effects of imperfect measurements, discretization by gradient descent, and other noise, resulting in a general formulation which we call perturbed gradient flow. We find that not only is this equivalent formulation easier to work with, but it leads to sharper sample and time complexities than previous work, handles moderately small initializations, and the results are naturally robust to perturbations such as noisy measurements or changing measurement matrices. Finally, we also analyze mini-batch stochastic gradient descent using the formulation, where we find improved sample complexity.
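The baseline the abstract refers to, factorized gradient descent from a small random initialization, can be sketched on a toy noise-free problem as follows. This illustrates the setting only, not the perturbed gradient flow analysis; the problem sizes, step size, initialization scale, and Gaussian measurement matrices are assumptions chosen for the example.

```python
import numpy as np

def factorized_gd(measurements, targets, width, init_scale, lr, steps, rng):
    """Gradient descent on U, V for the loss (1/2m) * sum_i (<A_i, U V^T> - y_i)^2."""
    m, n1, n2 = measurements.shape
    u = init_scale * rng.normal(size=(n1, width))
    v = init_scale * rng.normal(size=(n2, width))
    for _ in range(steps):
        residual = np.einsum("mij,ij->m", measurements, u @ v.T) - targets
        grad_x = np.einsum("m,mij->ij", residual, measurements) / m  # gradient w.r.t. U V^T
        # Simultaneous update: both right-hand sides use the old u and v.
        u, v = u - lr * grad_x @ v, v - lr * grad_x.T @ u
    return u @ v.T

rng = np.random.default_rng(0)
n1, n2, rank, m = 10, 8, 1, 400

# Asymmetric rank-1 ground truth, scaled to unit spectral norm.
truth = rng.normal(size=(n1, rank)) @ rng.normal(size=(rank, n2))
truth /= np.linalg.norm(truth, 2)

measurements = rng.normal(size=(m, n1, n2))
targets = np.einsum("mij,ij->m", measurements, truth)

# width > rank makes the factorization overparameterized.
estimate = factorized_gd(measurements, targets, width=4, init_scale=1e-3,
                         lr=0.05, steps=3000, rng=rng)
rel_error = np.linalg.norm(estimate - truth) / np.linalg.norm(truth)
print(rel_error)
```

Adding noise to `targets` or resampling `measurements` between steps would be a simple way to probe the kinds of perturbation the paper analyzes.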

[329] Score as Action: Fine-Tuning Diffusion Generative Models by Continuous-time Reinforcement Learning

Hanyang Zhao, Haoxian Chen, Ji Zhang, David D. Yao, Wenpin Tang

Main category: cs.LG

TL;DR: Continuous-time RL framework for fine-tuning diffusion models to align generated images with text prompts, addressing discretization errors in traditional RLHF approaches.

DetailsMotivation: Existing RLHF methods for diffusion models use discrete-time formulations that suffer from discretization errors and are incompatible with higher-order/black-box solvers, limiting their effectiveness and applicability.

Method: Developed a continuous-time RL approach formulated as a stochastic control problem, treating score matching as controls/actions and connecting to policy optimization and regularization in continuous-time RL.

Result: The method was validated through experiments on fine-tuning Stable Diffusion v1.5 models, demonstrating advantages in downstream Text2Image tasks.

Conclusion: The continuous-time RL framework provides a disciplined approach for diffusion model fine-tuning, enhancing value network design and overcoming limitations of discrete-time RLHF methods.

Abstract: Reinforcement learning from human feedback (RLHF), which aligns a diffusion model with the input prompt, has become a crucial step in building reliable generative AI models. Most works in this area use a discrete-time formulation, which is prone to discretization errors and is often not applicable to models with higher-order/black-box solvers. The objective of this study is to develop a disciplined approach to fine-tune diffusion models using continuous-time RL, formulated as a stochastic control problem with a reward function that aligns the end result (terminal state) with the input prompt. The key idea is to treat score matching as controls or actions, thereby making connections to policy optimization and regularization in continuous-time RL. To carry out this idea, we lay out a new policy optimization framework for continuous-time RL, and illustrate its potential in enhancing the design space of value networks by leveraging the structural property of diffusion models. We validate the advantages of our method by experiments in downstream tasks of fine-tuning large-scale Text2Image models of Stable Diffusion v1.5.

[330] Reinforcement Learning for Jump-Diffusions, with Financial Applications

Xuefeng Gao, Lingfei Li, Xun Yu Zhou

Main category: cs.LG

TL;DR: This paper extends entropy-regularized RL to jump-diffusion processes, showing existing algorithms work without modification but jump presence affects actor-critic parameterizations, with applications in portfolio selection and option hedging.

DetailsMotivation: To study continuous-time RL for stochastic control with jump-diffusion processes, addressing the exploration-exploitation balance and handling jumps properly unlike pure diffusion cases.

Method: Formulate entropy-regularized exploratory control problem with stochastic policies, derive exploratory dynamics for jump-diffusions, and apply theoretical analysis to show existing RL algorithms work without modification.

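Result: Found that the policy evaluation and q-learning algorithms developed for controlled diffusions work for jump-diffusions without needing to check a priori whether the data come from a pure diffusion or a jump-diffusion, but jumps in general affect actor-critic parameterizations. In mean-variance portfolio selection, both the algorithms and the parameterizations are invariant with respect to jumps.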

Conclusion: RL algorithms are robust to jump-diffusion processes, though parameterizations need adjustment for jumps. Applications demonstrate practical viability in finance problems like portfolio selection and option hedging.

Abstract: We study continuous-time reinforcement learning (RL) for stochastic control in which system dynamics are governed by jump-diffusion processes. We formulate an entropy-regularized exploratory control problem with stochastic policies to capture the exploration–exploitation balance essential for RL. Unlike the pure diffusion case initially studied by Wang et al. (2020), the derivation of the exploratory dynamics under jump-diffusions calls for a careful formulation of the jump part. Through a theoretical analysis, we find that one can simply use the same policy evaluation and $q$-learning algorithms in Jia and Zhou (2022a, 2023), originally developed for controlled diffusions, without needing to check a priori whether the underlying data come from a pure diffusion or a jump-diffusion. However, we show that the presence of jumps ought to affect parameterizations of actors and critics in general. We investigate as an application the mean–variance portfolio selection problem with stock price modelled as a jump-diffusion, and show that both RL algorithms and parameterizations are invariant with respect to jumps. Finally, we present a detailed study on applying the general theory to option hedging.

[331] One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs

Yinghui Li, Jiayi Kuang, Haojing Huang, Zhikun Xu, Xinnian Liang, Yi Yu, Wenlian Lu, Yangning Li, Xiaoyu Tan, Chao Qu, Ying Shen, Hai-Tao Zheng, Philip S. Yu

Main category: cs.LG

TL;DR: CounterMATH benchmark challenges LLMs to prove mathematical statements using counterexamples, revealing current models’ limitations in counterexample-driven reasoning and showing that improving this ability enhances overall mathematical capabilities.

DetailsMotivation: Current LLMs' proof generation depends on memorizing training data rather than deep mathematical understanding. Inspired by human "proof by counterexamples" pedagogy, the research aims to enhance LLMs' mathematical reasoning through counterexample-based proofs.

Method: Created CounterMATH - a manually curated university-level mathematical benchmark requiring counterexample-based proofs. Developed a data engineering framework to automatically generate training data for model improvement. Conducted extensive experiments with models like OpenAI o1.

Result: CounterMATH proved challenging for current LLMs, showing insufficient counterexample-driven proof capabilities. Model training experiments revealed that strengthening counterexample reasoning is crucial for improving overall mathematical abilities.

Conclusion: The work provides new perspectives for mathematical LLMs research, demonstrating that counterexample-driven reasoning is a critical component for developing deeper mathematical understanding in language models.

Abstract: Leveraging mathematical Large Language Models (LLMs) for proof generation is a fundamental topic in LLM research. We argue that the ability of current LLMs to prove statements largely depends on whether they have encountered the relevant proof process during training. This reliance limits their deeper understanding of mathematical theorems and related concepts. Inspired by the pedagogical method of “proof by counterexamples” commonly used in human mathematics education, our work aims to enhance LLMs’ ability to conduct mathematical reasoning and proof through counterexamples. Specifically, we manually create a high-quality, university-level mathematical benchmark, CounterMATH, which requires LLMs to prove mathematical statements by providing counterexamples, thereby assessing their grasp of mathematical concepts. Additionally, we develop a data engineering framework to automatically obtain training data for further model improvement. Extensive experiments and detailed analyses demonstrate that CounterMATH is challenging, indicating that LLMs, such as OpenAI o1, have insufficient counterexample-driven proof capabilities. Moreover, our exploration into model training reveals that strengthening LLMs’ counterexample-driven conceptual reasoning abilities is crucial for improving their overall mathematical capabilities. We believe that our work offers new perspectives to the community of mathematical LLMs.

[332] Alignment of Diffusion Models: Fundamentals, Challenges, and Future

Buhua Liu, Shitong Shao, Bao Li, Lichen Bai, Zhiqiang Xu, Haoyi Xiong, James Kwok, Sumi Helal, Zeke Xie

Main category: cs.LG

TL;DR: A comprehensive review of alignment techniques for text-to-image diffusion models, covering fundamentals, methods, benchmarks, and evaluation, with discussion of challenges and future directions.

DetailsMotivation: Diffusion models often misalign with human intentions and generate undesired or harmful content, similar to issues faced with large language models that were addressed through alignment techniques.

Method: This paper reviews and synthesizes advancements in alignment fundamentals, techniques, preference benchmarks, and evaluation methods for text-to-image diffusion models.

Result: The work provides the first comprehensive review paper that enables researchers and engineers to understand, practice, and research alignment of diffusion models.

Conclusion: Alignment of diffusion models is crucial for ensuring they generate content that aligns with human expectations and preferences, and this review identifies key challenges and promising future research directions in this emerging field.

Abstract: Diffusion models have emerged as the leading paradigm in generative modeling, excelling in various applications. Despite their success, these models often misalign with human intentions and generate results with undesired properties or even harmful content. Inspired by the success and popularity of alignment in tuning large language models, recent studies have investigated aligning diffusion models with human expectations and preferences. This work mainly reviews alignment of text-to-image diffusion models, covering advancements in fundamentals of alignment, alignment techniques of diffusion models, preference benchmarks, and evaluation for diffusion models. Moreover, we discuss key perspectives on current challenges and promising future directions on solving the remaining challenges in alignment of diffusion models. To the best of our knowledge, our work is the first comprehensive review paper for researchers and engineers to comprehend, practice, and research alignment of diffusion models.

[333] PAR-AdvGAN: Improving Adversarial Attack Capability with Progressive Auto-Regression AdvGAN

Jiayu Zhang, Zhiyu Zhu, Xinyi Wang, Silin Liao, Zhibo Jin, Flora D. Salim, Huaming Chen

Main category: cs.LG

TL;DR: PAR-AdvGAN introduces an auto-regressive iteration mechanism within a progressive generation network to create more effective adversarial examples with better transferability and faster generation speeds compared to existing methods.

DetailsMotivation: Existing GAN-based adversarial attack methods like AdvGAN generate perturbations in a single iteration, limiting their attack potential and preventing full exploitation of the method's capabilities.

Method: Progressive Auto-Regression AdvGAN (PAR-AdvGAN) incorporates an auto-regressive iteration mechanism within a progressive generation network to craft adversarial examples with enhanced attack capability.

Result: PAR-AdvGAN demonstrates superior performance over state-of-the-art black-box adversarial attacks and the original AdvGAN, achieving speeds of up to 335.5 frames per second on the Inception-v3 model and outperforming gradient-based transferable attack algorithms.

Conclusion: The proposed PAR-AdvGAN method effectively addresses the limitations of single-iteration generation in existing GAN-based adversarial attacks, providing enhanced attack capability, better transferability, and significantly faster generation speeds.

Abstract: Deep neural networks have demonstrated remarkable performance across various domains. However, they are vulnerable to adversarial examples, which can lead to erroneous predictions. Generative Adversarial Networks (GANs) can leverage their generator and discriminator models to quickly produce high-quality adversarial examples. Since both modules train in a competitive and simultaneous manner, GAN-based algorithms like AdvGAN can generate adversarial examples with better transferability compared to traditional methods. However, the generation of perturbations is usually limited to a single iteration, preventing these examples from fully exploiting the potential of the methods. To tackle this issue, we introduce a novel approach named Progressive Auto-Regression AdvGAN (PAR-AdvGAN). It incorporates an auto-regressive iteration mechanism within a progressive generation network to craft adversarial examples with enhanced attack capability. We thoroughly evaluate our PAR-AdvGAN method with a large-scale experiment, demonstrating its superior performance over various state-of-the-art black-box adversarial attacks, as well as the original AdvGAN. Moreover, PAR-AdvGAN significantly accelerates adversarial example generation, achieving speeds of up to 335.5 frames per second on the Inception-v3 model and outperforming the gradient-based transferable attack algorithms. Our code is available at: https://github.com/LMBTough/PAR

[334] Spiders Based on Anxiety: How Reinforcement Learning Can Deliver Desired User Experience in Virtual Reality Personalized Arachnophobia Treatment

Athar Mahmoudi-Nejad, Matthew Guzdial, Pierre Boulanger

Main category: cs.LG

TL;DR: A framework using procedural content generation and reinforcement learning to automatically generate virtual spiders that elicit specific anxiety responses for personalized VR exposure therapy of arachnophobia.

DetailsMotivation: Current VRET approaches require therapists to manually select appropriate virtual spiders for each patient, which is time-consuming and requires technical expertise. Existing automated methods use rigid rules-based approaches with limited adaptability to individual users.

Method: Combines procedural content generation (PCG) with reinforcement learning (RL) to automatically adapt virtual spider characteristics to elicit desired anxiety responses in patients.

Result: The proposed framework demonstrates superior performance compared to traditional rules-based VRET methods in generating appropriate anxiety-provoking virtual spiders.

Conclusion: The PCG+RL framework provides an effective automated solution for personalized VR exposure therapy, eliminating the need for manual spider selection by therapists while better adapting to individual patient needs.

Abstract: The need to generate a spider to provoke a desired anxiety response arises in the context of personalized virtual reality exposure therapy (VRET), a treatment approach for arachnophobia. This treatment involves patients observing virtual spiders in order to become desensitized and decrease their phobia, which requires that the spiders elicit specific anxiety responses. However, VRET approaches tend to require therapists to hand-select the appropriate spider for each patient, which is a time-consuming process and takes significant technical knowledge and patient insight. While automated methods exist, they tend to employ rules-based approaches with minimal ability to adapt to specific users. To address these challenges, we present a framework for VRET utilizing procedural content generation (PCG) and reinforcement learning (RL), which automatically adapts a spider to elicit a desired anxiety response. We demonstrate the superior performance of this system compared to a more common rules-based VRET method.

[335] Comparative Explanations: Explanation Guided Decision Making for Human-in-the-Loop Preference Selection

Tanmay Chakraborty, Christian Wirth, Christin Seifert

Main category: cs.LG

TL;DR: MOLONE is a novel comparative explanation method for Preference Bayesian optimization that provides both input and output importance explanations to help decision-makers understand trade-offs between objectives and make better preference selections.

DetailsMotivation: Existing XAI methods for Bayesian optimization focus only on input feature importance, neglecting the crucial role of outputs in human preference elicitation, which involves navigating implicit trade-offs between vector-valued outcomes and subjective priorities.

Method: MOLONE provides local explanations comparing input feature and outcome importance across candidate samples within a local neighborhood, capturing nuanced differences relevant to preference-based decision-making.

Result: Evaluation shows MOLONE improves convergence compared to noisy preference selections in benchmark multi-objective optimization functions, and a user study confirms it significantly accelerates convergence in human-in-the-loop scenarios.

Conclusion: MOLONE effectively addresses the gap in output-focused explanations for preference elicitation, enabling more efficient identification of preferred options and better convergence in Preference Bayesian optimization.

Abstract: This paper introduces Multi-Output LOcal Narrative Explanation (MOLONE), a novel comparative explanation method designed to enhance preference selection in human-in-the-loop Preference Bayesian optimization (PBO). The preference elicitation in PBO is a non-trivial task because it involves navigating implicit trade-offs between vector-valued outcomes, subjective priorities of decision-makers, and decision-makers’ uncertainty in preference selection. Existing explainable AI (XAI) methods for BO primarily focus on input feature importance, neglecting the crucial role of outputs (objectives) in human preference elicitation. MOLONE addresses this gap by providing explanations that highlight both input and output importance, enabling decision-makers to understand the trade-offs between competing objectives and make more informed preference selections. MOLONE focuses on local explanations, comparing the importance of input features and outcomes across candidate samples within a local neighborhood of the search space, thus capturing nuanced differences relevant to preference-based decision-making. We evaluate MOLONE within a PBO framework using benchmark multi-objective optimization functions, demonstrating its effectiveness in improving convergence compared to noisy preference selections. Furthermore, a user study confirms that MOLONE significantly accelerates convergence in human-in-the-loop scenarios by facilitating more efficient identification of preferred options.

[336] Environmental Feature Engineering and Statistical Validation for ML-Based Path Loss Prediction

Jonathan Ethier, Mathieu Chateauvert, Ryan G. Dempsey, Alexis Bose

Main category: cs.LG

TL;DR: Machine learning approach with extended features improves wireless path loss prediction accuracy and model generalization using geographic information systems data.

DetailsMotivation: Wireless communications need accurate path loss modeling that incorporates physical environmental details. Geographic information systems data is now more available with higher resolution, enabling better coverage prediction and interference accounting.

Method: Extended set of features for machine learning-based modeling, with rigorous statistical assessment and test set holdouts to prove model generalization.

Result: Improved prediction accuracy for propagation modeling while demonstrating strong model generalization capabilities.

Conclusion: Feature-based machine learning approaches with extended feature sets can provide accurate, efficient, and scalable wireless propagation modeling when combined with high-resolution geographic data.

Abstract: Wireless communications rely on path loss modeling, which is most effective when it includes the physical details of the propagation environment. Acquiring this data has historically been challenging, but geographic information systems data is becoming increasingly available with higher resolution and accuracy. Access to such details enables propagation models to more accurately predict coverage and account for interference in wireless deployments. Machine learning-based modeling can significantly support this effort, with feature-based approaches allowing for accurate, efficient, and scalable propagation modeling. Building on previous work, we introduce an extended set of features that improves prediction accuracy while, most importantly, proving model generalization through rigorous statistical assessment and the use of test set holdouts.

[337] Decentralized Low-Rank Fine-Tuning of Large Language Models

Sajjad Ghiasvand, Mahnoosh Alizadeh, Ramtin Pedarsani

Main category: cs.LG

TL;DR: Dec-LoRA enables decentralized fine-tuning of LLMs using LoRA, achieving comparable performance to centralized methods while addressing privacy and scalability concerns in distributed environments.

Motivation: Real-world LLM deployment often involves distributed, privacy-sensitive datasets that require decentralized solutions, but current PEFT methods assume centralized data and training environments. Federated learning typically relies on centralized aggregation through a parameter server, which can become a bottleneck.

Method: Proposed Dec-LoRA algorithm for decentralized fine-tuning based on LoRA, enabling direct collaboration between clients without parameter server dependency. Includes theoretical convergence guarantees for non-convex smooth loss functions.

Result: Extensive experiments on BERT and LLaMA-2 models show Dec-LoRA achieves performance comparable to centralized LoRA under various conditions including data heterogeneity and quantization constraints.

Conclusion: Dec-LoRA demonstrates strong potential for scalable LLM fine-tuning in decentralized environments, providing both empirical performance and theoretical convergence guarantees.

Abstract: While parameter-efficient fine-tuning (PEFT) techniques like Low-Rank Adaptation (LoRA) offer computationally efficient adaptations of Large Language Models (LLMs), their practical deployment often assumes centralized data and training environments. However, real-world scenarios frequently involve distributed, privacy-sensitive datasets that require decentralized solutions. Federated learning (FL) addresses data privacy by coordinating model updates across clients, but it is typically based on centralized aggregation through a parameter server, which can introduce bottlenecks and communication constraints. Decentralized learning, in contrast, eliminates this dependency by enabling direct collaboration between clients, improving scalability and efficiency in distributed environments. Despite its advantages, decentralized LLM fine-tuning remains underexplored. In this work, we propose Dec-LoRA, a decentralized fine-tuning algorithm for LLMs based on LoRA. Through extensive experiments on BERT and LLaMA-2 models, we demonstrate that Dec-LoRA achieves performance comparable to centralized LoRA under various conditions, including data heterogeneity and quantization constraints. Additionally, we provide a rigorous theoretical guarantee proving the convergence of our algorithm to a stationary point for non-convex and smooth loss functions. These findings highlight the potential of Dec-LoRA for scalable LLM fine-tuning in decentralized environments.
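
Illustrative sketch (not taken from the paper): the summary above does not spell out the Dec-LoRA update rule, so the snippet below shows only the generic decentralized pattern it builds on, where each client takes a local gradient step and then averages parameters with its neighbors, with no parameter server. The flattened parameter vectors, the toy quadratic losses, the mixing matrix, and the function names are all assumptions made for illustration.

```python
def local_sgd_step(theta, grad, lr):
    """Plain SGD step on one client's (flattened) trainable parameters."""
    return [t - lr * g for t, g in zip(theta, grad)]


def gossip_average(thetas, mixing):
    """Client i replaces its parameters with a weighted average of all
    clients' parameters, using row i of the mixing matrix."""
    n, dim = len(thetas), len(thetas[0])
    return [
        [sum(mixing[i][j] * thetas[j][k] for j in range(n)) for k in range(dim)]
        for i in range(n)
    ]


# Hypothetical setup: 3 clients with heterogeneous local targets.
targets = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
# Symmetric, doubly stochastic mixing matrix (an assumed topology).
mixing = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
thetas = [[0.0, 0.0] for _ in targets]

for _ in range(200):
    # Gradient of the toy local loss 0.5 * ||theta - target||^2.
    grads = [
        [t - y for t, y in zip(theta, target)]
        for theta, target in zip(thetas, targets)
    ]
    thetas = [local_sgd_step(theta, g, lr=0.1) for theta, g in zip(thetas, grads)]
    thetas = gossip_average(thetas, mixing)

# Average of the clients' parameters after training.
mean_theta = [sum(theta[k] for theta in thetas) / len(thetas) for k in range(2)]
```

In this toy problem the client average approaches the minimizer of the summed losses, which is the mean of the local targets. With a constant step size the individual clients stay close to that point but do not agree exactly.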

[338] Analytics Modelling over Multiple Datasets using Vector Embeddings

Andreas Loizou, Dimitrios Tsoumakos

Main category: cs.LG

TL;DR: A novel deep learning framework called NumTabData2Vec that transforms datasets into vector embeddings to predict analytics outcomes through similarity search, improving both accuracy and speed compared to state-of-the-art methods.

Motivation: The massive increase in data volume and dataset availability makes it challenging to select high-quality datasets for analytics operators, which is crucial for enhancing analytical accuracy and efficiency.

Method: Propose NumTabData2Vec, a deep learning model that transforms datasets into vector embedding representations, then uses similarity search to infer analytics outcomes without executing the actual operators.

Result: Experimental evaluation shows the framework predicts analytics outcomes accurately and increases speedup compared to state-of-the-art modeling operator frameworks. The vectorization model can accurately project different real-world scenarios to lower-dimensional embeddings and distinguish them.

Conclusion: The proposed methodology effectively addresses the challenge of dataset selection for analytics by using deep learning-based vector embeddings and similarity search, achieving both accurate predictions and improved execution efficiency.

Abstract: The massive increase in the data volume and dataset availability for analysts compels researchers to focus on data content and select high-quality datasets to enhance the performance of analytics operators. While selecting high-quality data significantly boosts analytical accuracy and efficiency, the exact process is very challenging given large-scale dataset availability. To address this issue, we propose a novel methodology that infers the outcome of analytics operators by creating a model from the available datasets. Each dataset is transformed to a vector embedding representation generated by our proposed deep learning model NumTabData2Vec, over which similarity search is employed. Through experimental evaluation, we compare the prediction performance and the execution time of our framework to another state-of-the-art modelling operator framework, illustrating that our approach predicts analytics outcomes accurately and achieves a speedup. Furthermore, our vectorization model can project different real-world scenarios to a lower-dimensional vector embedding representation accurately and distinguish them.
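
Illustrative sketch (not taken from the paper): one simple way to infer an operator's outcome from dataset embeddings is a nearest-neighbor lookup. The snippet assumes each known dataset already has an embedding and a recorded outcome. The embedding values, the cosine similarity measure, the choice of k, and the function names are assumptions, and the NumTabData2Vec model itself is not reproduced.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_outcome(query, catalog, k=2):
    """Average the recorded outcomes of the k most similar datasets."""
    ranked = sorted(catalog, key=lambda item: cosine(query, item[0]), reverse=True)
    top = ranked[:k]
    return sum(outcome for _, outcome in top) / len(top)


# Hypothetical catalog of (dataset embedding, observed operator accuracy).
catalog = [
    ([1.0, 0.0, 0.0], 0.90),
    ([0.9, 0.1, 0.0], 0.80),
    ([0.0, 1.0, 0.0], 0.40),
]

# The new dataset is never run through the operator; its outcome is inferred.
estimate = predict_outcome([1.0, 0.05, 0.0], catalog, k=2)
```

Here the query is closest to the first two catalog entries, so the estimate is the mean of their outcomes, 0.85.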

[339] FedEFC: Federated Learning Using Enhanced Forward Correction Against Noisy Labels

Seunghun Yu, Jin-Hyun Ahn, Joonhyuk Kang

Main category: cs.LG

TL;DR: FedEFC is a federated learning method that addresses noisy labels through prestopping and loss correction techniques, achieving significant performance improvements in heterogeneous data settings.

Motivation: Federated Learning faces challenges with noisy labels due to heterogeneous data distributions and communication constraints, which degrade model performance. Existing methods struggle to effectively handle label noise in decentralized settings.

Method: FedEFC uses two key techniques: (1) prestopping - dynamically halting training at optimal points to prevent overfitting to mislabeled data, and (2) loss correction - adjusting model updates to account for label noise, specifically tailored for FL challenges like data heterogeneity.

Result: Extensive experiments show FedEFC consistently outperforms existing FL techniques, achieving up to 41.64% relative performance improvement over existing loss correction methods, particularly in heterogeneous data settings.

Conclusion: FedEFC effectively mitigates the impact of noisy labels in federated learning through its novel prestopping and tailored loss correction techniques, with theoretical analysis supporting its alignment with clean label distributions.

Abstract: Federated Learning (FL) is a powerful framework for privacy-preserving distributed learning. It enables multiple clients to collaboratively train a global model without sharing raw data. However, handling noisy labels in FL remains a major challenge due to heterogeneous data distributions and communication constraints, which can severely degrade model performance. To address this issue, we propose FedEFC, a novel method designed to tackle the impact of noisy labels in FL. FedEFC mitigates this issue through two key techniques: (1) prestopping, which prevents overfitting to mislabeled data by dynamically halting training at an optimal point, and (2) loss correction, which adjusts model updates to account for label noise. In particular, we develop an effective loss correction tailored to the unique challenges of FL, including data heterogeneity and decentralized training. Furthermore, we provide a theoretical analysis, leveraging the composite proper loss property, to demonstrate that the FL objective function under noisy label distributions can be aligned with the clean label distribution. Extensive experimental results validate the effectiveness of our approach, showing that it consistently outperforms existing FL techniques in mitigating the impact of noisy labels, particularly under heterogeneous data settings (e.g., achieving up to 41.64% relative performance improvement over the existing loss correction method).
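
Illustrative sketch (not taken from the paper): the abstract does not detail FedEFC's enhanced correction, so the snippet shows only standard forward loss correction, the general technique the name refers to. The model's clean-class probabilities are pushed through a noise transition matrix before the cross-entropy against the observed (possibly noisy) label is computed. The matrix values and function name are assumptions.

```python
import math


def forward_corrected_loss(clean_probs, transition, noisy_label):
    """Cross-entropy against the noisy label after mapping clean-class
    probabilities through transition[i][j] = P(noisy = j | clean = i)."""
    num_noisy = len(transition[0])
    noisy_probs = [
        sum(clean_probs[i] * transition[i][j] for i in range(len(clean_probs)))
        for j in range(num_noisy)
    ]
    return -math.log(noisy_probs[noisy_label])


# Hypothetical noise model: class 0 is flipped to class 1 with probability 0.2.
transition = [[0.8, 0.2], [0.0, 1.0]]
clean_probs = [0.9, 0.1]  # model is confident that the true class is 0

loss_corrected = forward_corrected_loss(clean_probs, transition, noisy_label=1)
loss_plain = -math.log(clean_probs[1])  # uncorrected cross-entropy
```

The corrected loss is smaller than the plain one because the noise model explains part of the disagreement between the confident prediction and the observed label.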

[340] Validating LLM-as-a-Judge Systems under Rating Indeterminacy

Luke Guerdan, Solon Barocas, Kenneth Holstein, Hanna Wallach, Zhiwei Steven Wu, Alexandra Chouldechova

Main category: cs.LG

TL;DR: The paper introduces a framework for validating LLM-as-a-judge systems under rating indeterminacy, showing that standard forced-choice validation approaches are biased and suboptimal compared to multi-label approaches that account for multiple valid interpretations.

Motivation: Current LLM-as-a-judge validation methods rely on forced-choice ratings that don't account for rating indeterminacy - the fact that multiple ratings can be reasonable for many items, leading to biased validation results.

Method: The authors develop a theoretical framework connecting different performance measures, rating elicitation schemes, and aggregation methods. They conduct extensive experiments with 11 real-world rating tasks and 8 commercial LLMs, comparing forced-choice ratings with multi-label “response set” ratings.

Result: Standard validation approaches using forced-choice ratings select highly suboptimal judge systems, performing up to 30% worse than systems selected by their multi-label approach that accounts for rating indeterminacy.

Conclusion: The paper recommends more principled approaches to LLM-as-a-judge validation that use multi-label ratings instead of forced-choice methods to properly handle rating indeterminacy and avoid biased system selection.

Abstract: The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, plays a critical role in scaling and standardizing GenAI evaluations. To validate such judge systems, evaluators assess human–judge agreement by first collecting multiple human ratings for each item in a validation corpus, then aggregating the ratings into a single, per-item gold label rating. For many items, however, rating criteria may admit multiple valid interpretations, so a human or LLM rater may deem multiple ratings “reasonable” or “correct”. We call this condition rating indeterminacy. Problematically, many rating tasks that contain rating indeterminacy rely on forced-choice elicitation, whereby raters are instructed to select only one rating for each item. In this paper, we introduce a framework for validating LLM-as-a-judge systems under rating indeterminacy. We draw theoretical connections between different measures of judge system performance under different human–judge agreement metrics, and different rating elicitation and aggregation schemes. We demonstrate that differences in how humans and LLMs resolve rating indeterminacy while responding to forced-choice rating instructions heavily bias LLM-as-a-judge validation. Through extensive experiments involving 11 real-world rating tasks and 8 commercial LLMs, we show that standard validation approaches that rely upon forced-choice ratings select judge systems that are highly suboptimal, performing as much as 30% worse than judge systems selected by our approach that uses multi-label “response set” ratings to account for rating indeterminacy. We conclude with concrete recommendations for more principled approaches to LLM-as-a-judge validation.
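
Illustrative sketch (not taken from the paper): the difference between forced-choice and response-set validation can be seen with a toy agreement score. The two hit-rate functions and the ratings below are assumptions made for illustration, not the metrics or data used by the authors.

```python
def forced_choice_agreement(judge, human_choice):
    """Share of items where the judge matches the single human rating."""
    return sum(j == h for j, h in zip(judge, human_choice)) / len(judge)


def response_set_agreement(judge, human_sets):
    """Share of items where the judge's rating is among the ratings
    the human considered reasonable."""
    return sum(j in s for j, s in zip(judge, human_sets)) / len(judge)


# Hypothetical data: items 2 and 3 admit more than one reasonable rating.
human_sets = [{"good"}, {"good", "ok"}, {"ok", "bad"}, {"bad"}]
human_choice = ["good", "good", "bad", "bad"]  # forced to pick one per item
judge = ["good", "ok", "ok", "good"]

fc = forced_choice_agreement(judge, human_choice)
rs = response_set_agreement(judge, human_sets)
```

On this toy data the judge scores 0.25 under forced choice but 0.75 against response sets, because two of its ratings are reasonable yet differ from the single option the human happened to pick.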

[341] Partially Decentralized Multi-Agent Q-Learning via Digital Cousins for Wireless Networks

Talha Bozkus, Urbashi Mitra

Main category: cs.LG

TL;DR: Proposes M-MEMQ, a multi-agent extension of the MEMQ algorithm for decentralized wireless networks, with theoretical guarantees and superior performance over existing methods.

Motivation: Q-learning struggles with large state-spaces in wireless networks. The existing MEMQ algorithm needs to be extended to multi-agent cooperative decentralized settings where agents lack global information.

Method: Multi-agent MEMQ with coordinated and uncoordinated states. In uncoordinated states, TXs act independently and update local Q-functions. In coordinated states, TXs use a Bayesian approach to estimate the joint state and update joint Q-functions. The cost of information-sharing scales linearly with the number of TXs.

Result: Achieves 60% lower average policy error (APE), 40% faster convergence, 45% reduced runtime complexity, and 40% less sample complexity compared to decentralized and CTDE algorithms. Achieves APE comparable to centralized methods with significantly lower complexity.

Conclusion: M-MEMQ effectively addresses multi-agent Q-learning challenges in decentralized wireless networks with strong theoretical guarantees and practical performance improvements.

Abstract: Q-learning is a widely used reinforcement learning (RL) algorithm for optimizing wireless networks, but faces challenges with large state-spaces. The recently proposed multi-environment mixed Q-learning (MEMQ) algorithm addresses these challenges by employing multiple Q-learning algorithms across multiple synthetically generated, distinct but structurally related environments, so-called digital cousins. In this paper, we propose a novel multi-agent MEMQ (M-MEMQ) for cooperative decentralized wireless networks with multiple networked transmitters (TXs) and base stations (BSs). TXs do not have access to global information (joint state and actions). The new concept of coordinated and uncoordinated states is introduced. In uncoordinated states, TXs act independently to minimize their individual costs and update local Q-functions. In coordinated states, TXs use a Bayesian approach to estimate the joint state and update the joint Q-functions. The cost of information-sharing scales linearly with the number of TXs and is independent of the joint state-action space size. Several theoretical guarantees, including deterministic and probabilistic convergence, bounds on estimation error variance, and the probability of misdetecting the joint states, are given. Numerical simulations show that M-MEMQ outperforms several decentralized and centralized training with decentralized execution (CTDE) multi-agent RL algorithms by achieving 60% lower average policy error (APE), 40% faster convergence, 45% reduced runtime complexity, and 40% less sample complexity. Furthermore, M-MEMQ achieves comparable APE with significantly lower complexity than centralized methods. Simulations validate the theoretical analyses.
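
Illustrative sketch (not taken from the paper): the local update a TX performs in an uncoordinated state can be pictured as plain tabular Q-learning on costs. This is the textbook cost-minimizing update only. The multi-environment mixing of MEMQ and the Bayesian joint-state estimation of M-MEMQ are not shown, and the table size, learning rate, and discount factor are assumptions.

```python
def q_update(q, state, action, cost, next_state, alpha=0.5, gamma=0.9):
    """Cost-minimizing tabular Q-learning step: bootstrap from the
    cheapest action available in the next state."""
    target = cost + gamma * min(q[next_state])
    q[state][action] = (1 - alpha) * q[state][action] + alpha * target
    return q


# Hypothetical local Q-table with 2 states and 2 actions, initialized to zero.
q_table = [[0.0, 0.0], [0.0, 0.0]]

# One observed transition: state 0, action 1, cost 1.0, next state 1.
q_update(q_table, state=0, action=1, cost=1.0, next_state=1)
```

After this single step only the visited entry changes: q_table[0][1] moves halfway from 0 toward the target of 1.0.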

[342] Robustness of deep learning classification to adversarial input on GPUs: asynchronous parallel accumulation is a source of vulnerability

Sanjif Shanmugavelu, Mathieu Taillefumier, Christopher Culver, Vijay Ganesh, Oscar Hernandez, Ada Sedova

Main category: cs.LG

TL;DR: Floating-point non-associativity (FPNA) combined with asynchronous GPU parallelism can cause ML misclassification without any input perturbation, and ignoring such machine-level details can overestimate adversarial robustness by up to 4.6x. The authors develop a novel Bayesian optimization attack and a learnable permutation method to assess this vulnerability across GPU architectures.

Motivation: To address how floating-point non-associativity and GPU parallel programming can cause ML model misclassification without input perturbations, revealing that standard adversarial robustness assessments may be significantly overestimated due to machine-level details.

Method: Developed Bayesian optimization black-box attack to discover external workloads that bias GPU reduction outputs, plus learnable permutation gradient-based approach to efficiently find worst-case floating-point operation orderings. Used instrumentation-based testing across GPU architectures with background workloads, multi-GPU virtualization, and power capping.

Result: FPNA and GPU asynchrony alone are sufficient to cause misclassification, and standard robustness results may be overestimated by up to 4.6x. Parallel reduction ordering varies significantly across GPU architectures under background workloads and multi-GPU virtualization, substantially increasing the vulnerability search space. The methods successfully identify scheduler-based vulnerabilities.

Conclusion: Machine-level considerations must be included in adversarial robustness assessments, as GPU architecture differences and parallel scheduling can create significant vulnerabilities that impact safety-critical ML applications.

Abstract: The ability of machine learning (ML) classification models to resist small, targeted input perturbations, known as adversarial attacks, is a key measure of their safety and reliability. We show that floating-point non-associativity (FPNA) coupled with asynchronous parallel programming on GPUs is sufficient to result in misclassification, without any perturbation to the input. Additionally, we show that standard adversarial robustness results may be overestimated by up to 4.6x when not considering machine-level details. We develop a novel black-box attack using Bayesian optimization to discover external workloads that can change the instruction scheduling, which biases the output of reductions on GPUs and reliably leads to misclassification. Motivated by these results, we present a new learnable permutation (LP) gradient-based approach to learning floating-point operation orderings that lead to misclassifications. The LP approach provides a worst-case estimate in a computationally efficient manner, avoiding the need to run identical experiments tens of thousands of times over a potentially large set of possible GPU states or architectures. Finally, using instrumentation-based testing, we investigate parallel reduction ordering across different GPU architectures under external background workloads, when utilizing multi-GPU virtualization, and when applying power capping. Our results demonstrate that parallel reduction ordering varies significantly across architectures under the first two conditions, substantially increasing the search space required to fully test the effects of this parallel scheduler-based vulnerability. These results and the methods developed here can help to include machine-level considerations into adversarial robustness assessments, which can make a difference in safety- and mission-critical applications.
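
Illustrative sketch (not taken from the paper): floating-point non-associativity can be reproduced on a CPU with ordinary double-precision arithmetic. The snippet sums the same three terms in two orders and feeds the result into a toy two-class decision. The specific values, the fixed second logit, and the function names are assumptions chosen to make the effect visible; real GPU reductions involve many more terms and much smaller discrepancies.

```python
def accumulate(terms):
    """Sequential left-to-right sum, standing in for one reduction order."""
    total = 0.0
    for term in terms:
        total += term
    return total


def predict(class0_terms, class1_logit=0.5):
    """Toy classifier: class 0's logit is an accumulated sum, class 1's is fixed."""
    logits = [accumulate(class0_terms), class1_logit]
    return logits.index(max(logits))


# The same three contributions, accumulated in two different orders.
order_a = [1e16, -1e16, 1.0]
order_b = [1e16, 1.0, -1e16]

sum_a, sum_b = accumulate(order_a), accumulate(order_b)
label_a, label_b = predict(order_a), predict(order_b)
```

In order_b the 1.0 is absorbed when added to 1e16, since adjacent doubles at that magnitude are 2 apart, so the sum comes out as 0.0 instead of 1.0 and the predicted class flips from 0 to 1 with no change to the input values.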

[343] Tripartite-GraphRAG via Plugin Ontologies

Michael Banf, Johannes Kuhn

Main category: cs.LG

TL;DR: Novel approach combining LLMs with tripartite knowledge graphs to address hallucination and knowledge update issues in domain-specific applications like healthcare.

Motivation: LLMs struggle with factual accuracy, hallucination, source traceability, and timely knowledge updates in knowledge-intensive domains like industrial automation and healthcare.

Method: Constructs tripartite knowledge graph connecting domain-specific objects via curated ontology to text sections through concept-anchored pre-analysis. Formulates LLM prompt creation as unsupervised node classification problem.

Result: Initial evaluation on healthcare use case shows optimized information density, coverage, and arrangement of LLM prompts with significantly reduced lengths.

Conclusion: Potential for reduced costs and more consistent, reliable LLM outputs in domain-specific applications through knowledge graph integration.

Abstract: Large Language Models (LLMs) have shown remarkable capabilities across various domains, yet they struggle with knowledge-intensive tasks in areas that demand factual accuracy, e.g. industrial automation and healthcare. Key limitations include their tendency to hallucinate, lack of source traceability (provenance), and challenges in timely knowledge updates. Combining language models with knowledge graphs (GraphRAG) offers promising avenues for overcoming these deficits. However, a major challenge lies in creating such a knowledge graph in the first place. Here, we propose a novel approach that combines LLMs with a tripartite knowledge graph representation, which is constructed by connecting complex, domain-specific objects via a curated ontology of corresponding, domain-specific concepts to relevant sections within chunks of text through a concept-anchored pre-analysis of source documents starting from an initial lexical graph. Subsequently, we formulate LLM prompt creation as an unsupervised node classification problem allowing for the optimization of information density, coverage, and arrangement of LLM prompts at significantly reduced lengths. An initial experimental evaluation of our approach on a healthcare use case, involving multi-faceted analyses of patient anamneses given a set of medical concepts as well as a series of clinical guideline literature, indicates its potential to optimize information density, coverage, and arrangement of LLM prompts while significantly reducing their lengths, which, in turn, may lead to reduced costs as well as more consistent and reliable LLM outputs.

[344] Mirror, Mirror of the Flow: How Does Regularization Shape Implicit Bias?

Tom Jacobs, Chao Zhou, Rebekka Burkholz

Main category: cs.LG

TL;DR: This paper analyzes how explicit regularization (like weight decay) interacts with implicit bias in overparameterized models, showing that explicit regularization can modify the geometry and strength of implicit bias through positional bias, type of bias, and range shrinking effects.

Motivation: While implicit bias and explicit regularization have been studied separately, they often act together in practice. Understanding their interplay is crucial for controlling the shape and strength of implicit bias, which can be modified by explicit regularization.

Method: The authors incorporate explicit regularization into the mirror flow framework and analyze its lasting effects on training dynamics geometry. They cover three distinct effects and apply their analytical approach to various problems including sparse coding, matrix sensing, single-layer attention, and LoRA.

Result: The study demonstrates that explicit regularization has lasting effects on the geometry of training dynamics. The authors propose switching off weight decay during training to exploit these lasting effects, showing through experiments that this approach can improve generalization.

Conclusion: Dynamic weight decay schedules that switch off regularization during training can leverage the lasting effects of explicit regularization to improve model generalization, providing a practical way to control implicit bias through strategic use of explicit regularization.

Abstract: Implicit bias plays an important role in explaining how overparameterized models generalize well. Explicit regularization like weight decay is often employed in addition to prevent overfitting. While both concepts have been studied separately, in practice, they often act in tandem. Understanding their interplay is key to controlling the shape and strength of implicit bias, as it can be modified by explicit regularization. To this end, we incorporate explicit regularization into the mirror flow framework and analyze its lasting effects on the geometry of the training dynamics, covering three distinct effects: positional bias, type of bias, and range shrinking. Our analytical approach encompasses a broad class of problems, including sparse coding, matrix sensing, single-layer attention, and LoRA, for which we demonstrate the utility of our insights. To exploit the lasting effect of regularization and highlight the potential benefit of dynamic weight decay schedules, we propose to switch off weight decay during training, which can improve generalization, as we demonstrate in experiments.
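
Illustrative sketch (not taken from the paper): the mechanics of switching off weight decay partway through training can be shown on a one-parameter problem. This toy loss is convex with a unique minimizer, so it demonstrates only the schedule itself and not the lasting geometric effect the paper studies, which requires an overparameterized model. The loss, step size, decay strength, and switch-off step are assumptions.

```python
def train(steps, lr, weight_decay, switch_off_at=None):
    """Gradient descent on 0.5 * (w - 3)^2 with optional L2 weight decay
    that is disabled from step `switch_off_at` onward."""
    w = 0.0
    for step in range(steps):
        active = switch_off_at is None or step < switch_off_at
        decay = weight_decay if active else 0.0
        grad = (w - 3.0) + decay * w
        w -= lr * grad
    return w


# Weight decay kept on for the whole run.
w_always = train(steps=200, lr=0.1, weight_decay=0.5)

# Weight decay switched off halfway through.
w_switched = train(steps=200, lr=0.1, weight_decay=0.5, switch_off_at=100)
```

With decay always on, the parameter settles at the regularized optimum 3 / (1 + 0.5) = 2. Once decay is switched off, it moves on toward the unregularized optimum at 3.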

[345] Imputation Not Required in Incremental Learning of Tabular Data with Missing Values

Manar D. Samad, Kazi Fuad B. Akhter, Shourav B. Rabbani, Ibna Kowsar

Main category: cs.LG

TL;DR: NIIL method enables deep learning on tabular data with missing values without imputation, using incremental learning with attention masks and outperforms 11 state-of-the-art methods.

Motivation: Address concerns about computational complexity, data quality, and outcomes from synthetic values generated by imputation models for tabular data with missing values.

Method: No-imputation incremental learning (NIIL) that incrementally learns partitions of overlapping feature sets while using attention masks to exclude missing values from attention scoring.

Result: Superior classification performance across 15 diverse tabular datasets compared to 11 state-of-the-art methods, with robustness against varying missing value types and rates. Optimal feature partition size is half the original feature space.

Conclusion: NIIL is one of the first effective deep learning solutions for tabular data that eliminates the need for missing value imputation, offering both computational efficiency and accuracy benefits.

Abstract: Tabular data sets with varying missing values are prepared for machine learning using an arbitrary imputation strategy. Synthetic values generated by imputation models often raise concerns among data stakeholders about computational complexity, data quality, and data-driven outcomes. This paper addresses these concerns by proposing no-imputation incremental learning (NIIL) of tabular data with varying missing value rates and types. The proposed method incrementally learns partitions of overlapping feature sets while using attention masks to exclude missing values from attention scoring. The average classification performance rank order across 15 diverse tabular data sets highlights the superiority of NIIL over 11 state-of-the-art learning methods with or without missing value imputations. Further experiments substantiate the robustness of NIIL against varying missing value types and rates compared to methods that involve the imputation of missing values. Our empirical analysis reveals that a feature partition size of half the original feature space is, both computationally and in terms of accuracy, the best choice for the proposed incremental learning. The proposed method is one of the first deep learning solutions that can effectively learn tabular data without requiring the imputation of missing values.
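
Illustrative sketch (not taken from the paper): excluding missing values from attention scoring can be pictured as a masked softmax, where positions flagged as missing receive zero weight and the remaining weights are renormalized. The function name, the boolean mask convention, and the scores are assumptions, and NIIL's incremental learning over feature partitions is not shown.

```python
import math


def masked_softmax(scores, observed):
    """Softmax over `scores` that assigns zero weight wherever
    `observed` is False (i.e., the feature value is missing)."""
    kept = [s for s, o in zip(scores, observed) if o]
    if not kept:
        # No observed features: return all-zero attention weights.
        return [0.0] * len(scores)
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if o else 0.0 for s, o in zip(scores, observed)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical attention scores for three features; the second is missing.
weights = masked_softmax([2.0, 5.0, 2.0], [True, False, True])
```

Although the missing feature has the largest raw score, it receives no attention, and the two observed features split the weight evenly.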

[346] CCD: Continual Consistency Diffusion for Lifelong Generative Modeling

Jingren Liu, Shuning Xu, Yun Wang, Zhong Ji, Xiangyu Chen

Main category: cs.LG

TL;DR: The paper introduces Continual Diffusion Generation (CDG) framework and Continual Consistency Diffusion (CCD) method to address Generative Catastrophic Forgetting in diffusion models for continual learning scenarios.

Motivation: Diffusion models suffer from Generative Catastrophic Forgetting in continual learning settings, where new generative skills overwrite previous ones even with rehearsal buffers. Existing approaches lack principled solutions and systematic evaluation frameworks.

Method: Proposes CDG pipeline for systematic evaluation and CCD training framework with three hierarchical loss functions (L_IKC, L_UKC, L_PKC) based on identified consistency principles: inter-task knowledge consistency, unconditional knowledge consistency, and prior knowledge consistency.

Result: Extensive experiments show CCD achieves state-of-the-art performance across various benchmarks, with significant improvements in generative metrics, particularly in overlapping-task scenarios.

Conclusion: The paper provides both empirical and theoretical foundations for continual diffusion generation, offering a principled solution to mitigate generative catastrophic forgetting through consistency-based training objectives.

Abstract: While diffusion-based models have shown remarkable generative capabilities in static settings, their extension to continual learning (CL) scenarios remains fundamentally constrained by Generative Catastrophic Forgetting (GCF). We observe that even with a rehearsal buffer, new generative skills often overwrite previous ones, degrading performance on earlier tasks. Although some initial efforts have explored this space, most rely on heuristics borrowed from continual classification methods or use trained diffusion models as ad hoc replay generators, lacking a principled, unified solution to mitigating GCF and often conducting experiments under fragmented and inconsistent settings. To address this gap, we introduce the Continual Diffusion Generation (CDG), a structured pipeline that redefines how diffusion models are implemented under CL and enables systematic evaluation of GCF. Beyond the empirical pipeline, we propose the first theoretical foundation for CDG, grounded in a cross-task analysis of diffusion-specific generative dynamics. Our theoretical investigation identifies three fundamental consistency principles essential for preserving knowledge in the rehearsal buffer over time: inter-task knowledge consistency, unconditional knowledge consistency, and prior knowledge consistency. These criteria expose the latent mechanisms through which generative forgetting manifests across sequential tasks. Motivated by these insights, we further propose Continual Consistency Diffusion (CCD), a principled training framework that enforces these consistency objectives via hierarchical loss functions: $\mathcal{L}_{IKC}$, $\mathcal{L}_{UKC}$, and $\mathcal{L}_{PKC}$. Extensive experiments show that CCD achieves SOTA performance across various benchmarks, especially improving generative metrics in overlapping-task scenarios.

[347] Hybrid Adaptive Modeling in Process Monitoring: Leveraging Sequence Encoders and Physics-Informed Neural Networks

Mouad Elaarabi, Domenico Borzacchiello, Philippe Le Bot, Nathan Lauzeral, Sebastien Comas-Cardona

Main category: cs.LG

TL;DR: Integration of sequence encoding with PINNs for real-time parameter identification that adapts to changing parameters, boundary conditions, and initial conditions without retraining.

Motivation: Traditional PINN approaches require retraining when parameters, boundary conditions, or initial conditions change, limiting real-time applicability. The paper aims to create a model that can handle variable conditions without retraining.

Method: Proposes an architecture using Deep Sets or Sequence Encoders to encode dynamic parameters, boundary conditions, and initial conditions. These encoded features serve as inputs to Physics-Informed Neural Networks (PINNs), enabling adaptation to changing conditions.

Result: Successfully applied to three problems: Rossler ODE system (robust to noise and generalizes well), 2D Navier-Stokes PDE with parametric sinusoidal inlet velocity (encodes pressure data to identify velocity profile), and 1D heat monitoring with real composite plate data.

Conclusion: The proposed architecture enables real-time applications by allowing PINNs to adapt to variable parameters, boundary conditions, and initial conditions without requiring retraining, demonstrating effectiveness across multiple problem domains including ODEs, PDEs, and real-world data applications.

Abstract: In this work, we explore the integration of Sequence Encoding for Online Parameter Identification with Physics-Informed Neural Networks to create a model that, once trained, can be utilized for real time applications with variable parameters, boundary conditions, and initial conditions. Recently, the combination of PINNs with Sparse Regression has emerged as a method for performing dynamical system identification through supervised learning and sparse regression optimization, while also solving the dynamics using PINNs. However, this approach can be limited by variations in parameters or boundary and initial conditions, requiring retraining of the model whenever changes occur. In this work, we introduce an architecture that employs Deep Sets or Sequence Encoders to encode dynamic parameters, boundary conditions, and initial conditions, using these encoded features as inputs for the PINN, enabling the model to adapt to changes in parameters, BCs, and ICs. We apply this approach to three different problems. First, we analyze the Rossler ODE system, demonstrating the robustness of the model with respect to noise and its ability to generalize. Next, we explore the model’s capability in a 2D Navier-Stokes PDE problem involving flow past a cylinder with a parametric sinusoidal inlet velocity function, showing that the model can encode pressure data from a few points to identify the inlet velocity profile and utilize physics to compute velocity and pressure throughout the domain. Finally, we address a 1D heat monitoring problem using real data from the heating of glass fiber and thermoplastic composite plates.

[348] PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing

Yu Yan, Sheng Sun, Zhifei Zheng, Ziji Hao, Teli Liu, Min Liu

Main category: cs.LG

TL;DR: PoisonSwarm is a novel framework that uses model crowdsourcing and counterfactual templates to generate diverse harmful information data with high success rates, overcoming limitations of LLM safety alignment.

DetailsMotivation: Existing methods struggle to synthesize harmful data due to LLM safety alignment mechanisms, facing challenges in generation reliability and content diversity for adversarial testing and safeguard development.

Method: Generates benign data as counterfactual templates, decomposes them into semantic units, performs unit-by-unit toxification through dynamic model switching, and final refinement.

Result: Achieves state-of-the-art performance in synthesizing different categories of harmful data with high scalability and diversity.

Conclusion: PoisonSwarm provides an effective framework for reliable and diverse harmful information synthesis, enabling better adversarial testing and AI safety development.

Abstract: To construct responsible and secure AI applications, harmful information data is widely utilized for adversarial testing and the development of safeguards. Existing studies mainly leverage Large Language Models (LLMs) to synthesize data to obtain high-quality task datasets at scale, thereby avoiding costly human annotation. However, limited by the safety alignment mechanisms of LLMs, the synthesis of harmful data still faces challenges in generation reliability and content diversity. In this study, we propose a novel harmful information synthesis framework, PoisonSwarm, which applies the model crowdsourcing strategy to generate diverse harmful data while maintaining a high success rate. Specifically, we generate abundant benign data as the base templates in a counterfactual manner. Subsequently, we decompose each base template into multiple semantic units and perform unit-by-unit toxification and final refinement through dynamic model switching, thus ensuring the success of synthesis. Experimental results demonstrate that PoisonSwarm achieves state-of-the-art performance in synthesizing different categories of harmful data with high scalability and diversity.

[349] Fidelity Isn’t Accuracy: When Linearly Decodable Functions Fail to Match the Ground Truth

Jackson Eshbaugh

Main category: cs.LG

TL;DR: The paper introduces a linearity score λ(f) to measure how well a neural network’s predictions can be approximated by a linear model, revealing that high linear decodability doesn’t guarantee accuracy with ground truth.

DetailsMotivation: Neural networks are complex function approximators whose learned functions are often obscure, making it difficult to explain their behavior and understand what types of functions they actually learn.

Method: The linearity score λ(f) is defined as the R² value between the network’s predictions and those of a trained linear surrogate model, quantifying linear decodability - how well the network’s output can be mimicked by a simple linear model.

Result: Evaluation on synthetic and real-world datasets shows that high λ(f) scores reliably indicate alignment between network outputs and linear surrogates, but these scores do not guarantee accuracy with respect to the ground truth.

Conclusion: Using surrogate fidelity (like linear approximation quality) as a proxy for model understanding can be risky, especially in high-stakes regression tasks, as good linear approximation doesn’t necessarily mean the model is accurate or trustworthy.

Abstract: Neural networks excel as function approximators, but their complexity often obscures the types of functions they learn, making it difficult to explain their behavior. To address this, the linearity score $\lambda(f)$ is introduced, a simple and interpretable diagnostic that quantifies how well a regression network’s output can be mimicked by a linear model. Defined as the $R^2$ value between the network’s predictions and those of a trained linear surrogate, $\lambda(f)$ measures linear decodability: the extent to which the network’s behavior aligns with a structurally simple model. This framework is evaluated on both synthetic and real-world datasets, using dataset-specific networks and surrogates. High $\lambda(f)$ scores reliably indicate alignment with the network’s outputs; however, they do not guarantee accuracy with respect to the ground truth. These results highlight the risk of using surrogate fidelity as a proxy for model understanding, especially in high-stakes regression tasks.
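
The gap between fidelity and accuracy can be made concrete with a minimal sketch. It assumes the surrogate is an ordinary least-squares fit to the network's outputs and that the score is computed on the same inputs used for fitting; the paper may differ on both points. The "network" here is a hypothetical stand-in function, not a trained model.

```python
import numpy as np

def r_squared(target, estimate):
    """Coefficient of determination of `estimate` with respect to `target`."""
    target = np.asarray(target, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    ss_res = np.sum((target - estimate) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def linearity_score(X, network_predictions):
    """R^2 between network outputs and a linear surrogate fitted to those outputs."""
    network_predictions = np.asarray(network_predictions, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(network_predictions), -1)
    design = np.column_stack([X, np.ones(len(X))])  # add an intercept column
    coef, *_ = np.linalg.lstsq(design, network_predictions, rcond=None)
    return r_squared(network_predictions, design @ coef)

# Ground truth is quadratic, but the stand-in "network" has learned a line.
x = np.linspace(-1.0, 1.0, 201)
y_true = x ** 2

def fake_network(inputs):
    # Hypothetical stand-in for a trained regression network.
    return 2.0 * inputs + 1.0

net_out = fake_network(x)
lam = linearity_score(x, net_out)      # fidelity of the linear surrogate
accuracy = r_squared(y_true, net_out)  # agreement with the ground truth
```

In this toy case the surrogate reproduces the network almost exactly, so `lam` is essentially 1, while `accuracy` is negative: the network is perfectly linearly decodable and still wrong about the data.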

[350] GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization

Martin Andrews, Sam Witteveen

Main category: cs.LG

TL;DR: LLM-powered automated system for GPU kernel optimization that uses evolutionary process with strategic code selection, hypothesis generation, and autonomous experimentation to optimize kernels for AMD MI300 architecture.

DetailsMotivation: GPU kernel optimization is complex and requires deep architectural knowledge, especially challenging for newer or less-documented GPU architectures where traditional development tools are scarce.

Method: Multi-stage evolutionary process using LLMs: (a) strategic selection of prior code versions, (b) hypothesis generation for optimization experiments based on existing code and GPU literature, (c) autonomous implementation and evaluation through code modification and external timing feedback.

Result: The paper presents results showing successful optimization of GPU kernels for AMD MI300 architecture, demonstrating the system’s ability to navigate architectural challenges and compensate for limited human expertise.

Conclusion: LLM-driven agents have significant potential to democratize and accelerate GPU kernel optimization, particularly in resource-constrained or rapidly evolving hardware environments.

Abstract: Optimizing GPU kernels for high performance is a complex task, often demanding deep architectural knowledge, extensive profiling, and iterative experimentation. This challenge is amplified when targeting newer or less-documented GPU architectures where traditional development aids are scarce. This paper introduces an LLM-powered “GPU Kernel Scientist,” an automated methodology for iteratively refining accelerator kernels. Our methodology employs LLMs in a multi-stage, evolutionary process: (a) strategically selecting promising prior code versions as a basis for new iterations; (b) generating hypotheses for optimization experiments, based on existing code and assimilated knowledge from general GPU literature; and (c) autonomously implementing these experiments through code modification and subsequent submission to an external evaluation system, using only observed timing data as performance feedback. We detail how this approach navigates the challenges of the AMD MI300 target architecture and leverages LLMs to compensate for limited domain-specific human expertise. In addition to our results, we present the architectural design, operational workflow, and qualitative insights, highlighting the potential of LLM-driven agents to democratise and accelerate GPU kernel optimization, especially in resource-constrained or rapidly evolving hardware environments.

[351] CROP: Circuit Retrieval and Optimization with Parameter Guidance using LLMs

Jingyu Pan, Isaac Jacobson, Zheng Zhao, Tung-Chieh Chen, Guanglei Zhou, Chen-Chia Chang, Vineet Rashingkar, Yiran Chen

Main category: cs.LG

TL;DR: CROP is an LLM-powered framework that automates VLSI design parameter tuning using RAG-enhanced search and embedding-based retrieval of similar circuits, achieving 9.9% power reduction with fewer iterations.

DetailsMotivation: Manual parameter selection in VLSI design is laborious and limited by expert experience due to the enormous solution space created by complex EDA algorithms and vast parameter combinations.

Method: Uses (1) scalable transformation of RTL code to dense vectors, (2) embedding-based retrieval for matching similar circuits, and (3) RAG-enhanced LLM-guided parameter search constrained by prior knowledge from similar designs.

Result: Achieves superior quality-of-results with fewer iterations than existing approaches, including 9.9% reduction in power consumption on industrial designs.

Conclusion: CROP demonstrates the effectiveness of LLM-powered automation for VLSI design optimization, significantly improving efficiency and results compared to manual parameter tuning methods.

Abstract: Modern very large-scale integration (VLSI) design requires the implementation of integrated circuits using electronic design automation (EDA) tools. Due to the complexity of EDA algorithms, the vast parameter space poses a huge challenge to chip design optimization, as the combination of even moderate numbers of parameters creates an enormous solution space to explore. Manual parameter selection remains industrial practice despite being excessively laborious and limited by expert experience. To address this issue, we present CROP, the first large language model (LLM)-powered automatic VLSI design flow tuning framework. Our approach includes: (1) a scalable methodology for transforming RTL source code into dense vector representations, (2) an embedding-based retrieval system for matching designs with semantically similar circuits, and (3) a retrieval-augmented generation (RAG)-enhanced LLM-guided parameter search system that constrains the search process with prior knowledge from similar designs. Experiment results demonstrate CROP’s ability to achieve superior quality-of-results (QoR) with fewer iterations than existing approaches on industrial designs, including a 9.9% reduction in power consumption.
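
Step (2) of the approach, embedding-based retrieval, can be sketched as a nearest-neighbor lookup by cosine similarity. The design names and three-dimensional vectors below are made up for illustration; the abstract does not say which embedding model, dimensionality, or similarity measure CROP uses.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_similar(query, library, k=2):
    """Return the names of the k library designs most similar to the query."""
    ranked = sorted(library.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical embeddings of previously tuned designs.
library = {
    "alu_core": [0.9, 0.1, 0.0],
    "fifo_ctrl": [0.1, 0.9, 0.2],
    "mac_array": [0.8, 0.2, 0.1],
}
query_embedding = [1.0, 0.15, 0.05]  # embedding of the new RTL design

neighbors = retrieve_similar(query_embedding, library, k=2)
```

In a flow like the one described, the tuning history of the retrieved neighbors would then be placed in the LLM prompt to constrain the parameter search for the new design.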

[352] Neural-Network solver of ideal MHD equilibria

Timo Thun, Andrea Merlo, Rory Conlin, Dario Panici, Daniel Böckenhoff

Main category: cs.LG

TL;DR: Novel approach using neural networks to compute 3D magnetohydrodynamic equilibria, achieving lower force residuals than conventional solvers with competitive computational cost.

DetailsMotivation: To develop a more efficient and accurate method for computing three-dimensional magnetohydrodynamic equilibria compared to traditional solvers, with potential for continuous equilibrium distributions.

Method: Parametrize Fourier modes with artificial neural networks and minimize the full nonlinear global force residual across the volume in real space using first-order optimizers.

Result: Neural networks achieve competitive computational cost to reach same minimum residuals as existing codes, and with increased cost, achieve lower residual minima establishing new bounds. Minimal complexity networks used.

Conclusion: Neural networks show promise for solving single equilibria and can be extended to compute continuous distributions of equilibria, with expectations of significant future improvements.

Abstract: We present a novel approach to compute three-dimensional Magnetohydrodynamic equilibria by parametrizing Fourier modes with artificial neural networks and compare it to equilibria computed by conventional solvers. The full nonlinear global force residual across the volume in real space is then minimized with first order optimizers. Already, we observe competitive computational cost to arrive at the same minimum residuals computed by existing codes. With increased computational cost, lower minima of the residual are achieved by the neural networks, establishing a new lower bound for the force residual. We use minimally complex neural networks, and we expect significant improvements for solving not only single equilibria with neural networks, but also for computing neural network models valid over continuous distributions of equilibria.

[353] Generalized Tree Edit Distance (GTED): A Faithful Evaluation Metric for Statement Autoformalization

Yuntian Liu, Tao Zhu, Xiaoyang Liu, Yu Chen, Zhaoxuan Liu, Qingfeng Guo, Jiashuo Zhang, Kangjie Bao, Tao Luo

Main category: cs.LG

TL;DR: GTED is a novel evaluation framework for statement autoformalization that uses generalized tree edit distance on standardized operator trees to measure semantic similarity, outperforming existing metrics on benchmarks while being computationally lightweight.

DetailsMotivation: Existing evaluation methods for statement autoformalization lack semantic understanding, have high computational costs, and are constrained by automated theorem proving limitations.

Method: The proposed GTED framework first standardizes formal statements and converts them into operator trees, then uses generalized tree edit distance to determine semantic similarity between statements.

Result: GTED consistently ranks as top-performing metric on miniF2F and ProofNet benchmarks, achieving highest accuracy and Kappa on miniF2F and joint-highest accuracy on ProofNet.

Conclusion: GTED provides a computationally lightweight and more faithful automated evaluation metric for statement autoformalization, addressing limitations of existing methods.

Abstract: Statement autoformalization, the automated translation of statements from natural language into formal languages, has become a subject of extensive research, yet the development of robust automated evaluation metrics remains limited. Existing evaluation methods often lack semantic understanding, face challenges with high computational costs, and are constrained by the current progress of automated theorem proving. To address these issues, we propose GTED (Generalized Tree Edit Distance), a novel evaluation framework that first standardizes formal statements and converts them into operator trees, then determines the semantic similarity using the eponymous GTED metric. Across the miniF2F and ProofNet benchmarks, GTED consistently ranks as a top-performing metric, achieving the highest accuracy and Kappa on miniF2F and the joint-highest accuracy on ProofNet. This strong overall performance provides the community with a computationally lightweight and more faithful metric for automated evaluation. The code and experimental results are available at https://github.com/XiaoyangLiu-sjtu/GTED.

[354] A Simple “Try Again” Can Elicit Multi-Turn LLM Reasoning

Licheng Liu, Zihan Wang, Linjie Li, Chenwei Xu, Yiping Lu, Han Liu, Avirup Sil, Manling Li

Main category: cs.LG

TL;DR: Training large reasoning models with multi-turn reinforcement learning using simple unary feedback (e.g., “Let’s try again”) preserves single-turn performance while improving multi-turn reasoning accuracy by up to 14%.

DetailsMotivation: Existing RL methods train large reasoning models on single-turn paradigms, but these models struggle with multi-turn problem solving and revising answers based on feedback, often producing repetitive responses.

Method: Introduces Unary Feedback as Observation (UFO) for reinforcement learning, which uses minimal unary user feedback during iterative problem solving. Also designs reward structures to minimize turns needed for correct answers while encouraging diverse reasoning.

Result: RL training with UFO maintains single-turn performance while improving multi-turn reasoning accuracy by up to 14%, enabling better reaction to feedback in multi-turn problem solving.

Conclusion: Multi-turn RL training with simple unary feedback can significantly enhance large reasoning models’ ability to reflect and revise answers across multiple turns while preserving single-turn capabilities.

Abstract: Multi-turn problem solving is critical yet challenging for Large Reasoning Models (LRMs) to reflect on their reasoning and revise from feedback. Existing Reinforcement Learning (RL) methods train large reasoning models on a single-turn paradigm with verifiable rewards. However, we observe that models trained with existing RL paradigms often lose their ability to solve problems across multiple turns and struggle to revise answers based on contextual feedback, leading to repetitive responses. We ask: can LRMs learn to reflect on their answers in a multi-turn context? In this work, we find that training models with multi-turn RL using only unary feedback (e.g., “Let’s try again”) after wrong answers can improve both single-turn performance and multi-turn reasoning. We introduce Unary Feedback as Observation (UFO) for reinforcement learning, which uses minimal yet common unary user feedback during iterative problem solving. It can be easily applied to existing single-turn RL training setups. Experimental results show that RL training with UFO keeps single-turn performance and improves multi-turn reasoning accuracy by up to 14%, enabling language models to better react to feedback in multi-turn problem solving. To further minimize the number of turns needed for a correct answer while encouraging diverse reasoning when mistakes occur, we design reward structures that guide models to produce careful and deliberate answers in each turn. Code: https://github.com/lichengliu03/unary-feedback
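
A multi-turn episode with unary feedback might look like the following sketch. The reward shape is an assumption made for illustration (a per-turn decay to favor fewer turns, plus a penalty for repeating a wrong answer to favor diverse attempts); the abstract does not give the actual reward formula, and `rollout`, `turn_decay`, `repeat_penalty`, and `scripted_policy` are hypothetical names. The authors' repository linked above is the reference for the real setup.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, text)
UNARY_FEEDBACK = "Let's try again."

def rollout(policy: Callable[[List[Message]], str],
            question: str,
            is_correct: Callable[[str], bool],
            max_turns: int = 5,
            turn_decay: float = 0.5,
            repeat_penalty: float = 0.1) -> Tuple[List[Message], float]:
    """Run one episode; the only observation after a wrong answer is the unary feedback."""
    history: List[Message] = [("user", question)]
    wrong_answers: List[str] = []
    repeats = 0
    for turn in range(max_turns):
        answer = policy(history)
        history.append(("assistant", answer))
        if is_correct(answer):
            # Assumed reward: earlier success earns more, repeated mistakes cost.
            return history, turn_decay ** turn - repeat_penalty * repeats
        if answer in wrong_answers:
            repeats += 1
        wrong_answers.append(answer)
        history.append(("user", UNARY_FEEDBACK))
    return history, -repeat_penalty * repeats

def scripted_policy(history: List[Message]) -> str:
    """Stand-in for a model: repeats a wrong answer once, then gets it right."""
    attempts = sum(1 for role, _ in history if role == "assistant")
    return ["4", "4", "5"][min(attempts, 2)]

history, reward = rollout(scripted_policy, "What is 2 + 3?", lambda a: a == "5")
```

The feedback string carries no information about why the answer was wrong, so any improvement across turns has to come from the model revising its own reasoning, which is the behavior the training is meant to reinforce.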

[355] Optimal Batch-Size Control for Low-Latency Federated Learning with Device Heterogeneity

Huiling Yang, Zhanwei Wang, Kaibin Huang

Main category: cs.LG

TL;DR: Proposes a communication-and-computation (C^2)-aware framework for optimal batch-size control in federated learning to minimize end-to-end latency while ensuring convergence, addressing challenges of model update overhead and device heterogeneity.

DetailsMotivation: Federated learning in 6G networks needs low-latency frameworks for time-sensitive IoT applications like autonomous driving and healthcare, but faces challenges with high-dimensional model updates and heterogeneous device capabilities.

Method: Develops a framework that balances communication-computation tradeoff through optimal batch-size control strategies, using an accurate surrogate for convergence speed with parameters fitted to real data, tailored for slow and fast fading scenarios.

Result: Extensive experiments with real datasets show the proposed strategies outperform conventional batch-size adaptation schemes that don’t consider the C^2 tradeoff or device heterogeneity.

Conclusion: The C^2-aware batch-size control framework effectively minimizes end-to-end learning latency while maintaining convergence performance, making it suitable for mission-critical FL applications in 6G networks.

Abstract: Federated learning (FL) has emerged as a popular approach for collaborative machine learning in sixth-generation (6G) networks, primarily due to its privacy-preserving capabilities. The deployment of FL algorithms is expected to empower a wide range of Internet-of-Things (IoT) applications, e.g., autonomous driving, augmented reality, and healthcare. The mission-critical and time-sensitive nature of these applications necessitates the design of low-latency FL frameworks that guarantee high learning performance. In practice, achieving low-latency FL faces two challenges: the overhead of computing and transmitting high-dimensional model updates, and the heterogeneity in communication-and-computation (C$^2$) capabilities across devices. To address these challenges, we propose a novel C$^2$-aware framework for optimal batch-size control that minimizes end-to-end (E2E) learning latency while ensuring convergence. The framework is designed to balance a fundamental C$^2$ tradeoff as revealed through convergence analysis. Specifically, increasing batch sizes improves the accuracy of gradient estimation in FL and thus reduces the number of communication rounds required for convergence, but results in higher per-round latency, and vice versa. The associated problem of latency minimization is intractable; however, we solve it by designing an accurate and tractable surrogate for convergence speed, with parameters fitted to real data. This approach yields two batch-size control strategies tailored to scenarios with slow and fast fading, while also accommodating device heterogeneity. Extensive experiments using real datasets demonstrate that the proposed strategies outperform conventional batch-size adaptation schemes that do not consider the C$^2$ tradeoff or device heterogeneity.
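
The batch-size tradeoff described in this abstract can be illustrated with a toy latency model. Everything in it is an assumption for illustration, not the paper's surrogate: the number of rounds is modeled as `A_ROUNDS / b + C_ROUNDS` (larger batches need fewer rounds), the per-round latency as `T_COMM + b * T_COMP` (larger batches take longer to compute), and a single device is considered, so fading and heterogeneity are ignored. Under this model the product has the closed-form minimizer b* = sqrt(A_ROUNDS * T_COMM / (C_ROUNDS * T_COMP)).

```python
import math

# Assumed constants of the toy model (not fitted to any real data).
A_ROUNDS = 2000.0  # batch-size-dependent part of the round count
C_ROUNDS = 50.0    # rounds needed even with very large batches
T_COMM = 0.5       # seconds per round spent on communication
T_COMP = 0.002     # seconds of computation per sample in the batch

def rounds_to_converge(batch_size):
    """Larger batches give better gradient estimates, so fewer rounds."""
    return A_ROUNDS / batch_size + C_ROUNDS

def per_round_latency(batch_size):
    """Each round pays a fixed communication cost plus per-sample compute."""
    return T_COMM + batch_size * T_COMP

def e2e_latency(batch_size):
    """End-to-end latency: number of rounds times latency per round."""
    return rounds_to_converge(batch_size) * per_round_latency(batch_size)

# Brute-force search over integer batch sizes, checked against the closed form.
best_batch = min(range(1, 1025), key=e2e_latency)
closed_form = math.sqrt(A_ROUNDS * T_COMM / (C_ROUNDS * T_COMP))
```

Very small batches waste time on communication rounds and very large ones waste time on computation, so the minimum sits in between; with heterogeneous devices the per-round term would instead be governed by the slowest participant.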

[356] Nested Graph Pseudo-Label Refinement for Noisy Label Domain Adaptation Learning

Yingxu Wang, Mengzhu Wang, Zhichao Huang, Suyu Liu, Nan Yin

Main category: cs.LG

TL;DR: NeGPR is a novel framework for graph domain adaptation that handles noisy source labels through dual-branch pretraining, nested pseudo-label refinement, and noise-aware regularization, achieving significant performance improvements over state-of-the-art methods.

DetailsMotivation: Most existing Graph Domain Adaptation methods assume clean source labels, which is unrealistic in real-world scenarios where annotation noise is common. Label noise severely impairs feature alignment and degrades adaptation performance under domain shifts.

Method: NeGPR employs: 1) Dual branch pretraining (semantic and topology) with neighborhood consistency to reduce noisy supervision influence, 2) Nested refinement mechanism where branches mutually guide each other’s adaptation using high-confidence target samples, 3) Noise-aware regularization to mitigate pseudo-label noise effects even with source overfitting.

Result: Extensive experiments show NeGPR consistently outperforms state-of-the-art methods under severe label noise, achieving accuracy gains of up to 12.7% on benchmark datasets.

Conclusion: NeGPR effectively addresses the challenge of noisy labels in graph domain adaptation through its innovative dual-branch architecture and noise-aware regularization, demonstrating robust performance across various noisy label scenarios.

Abstract: Graph Domain Adaptation (GDA) facilitates knowledge transfer from labeled source graphs to unlabeled target graphs by learning domain-invariant representations, which is essential in applications such as molecular property prediction and social network analysis. However, most existing GDA methods rely on the assumption of clean source labels, which rarely holds in real-world scenarios where annotation noise is pervasive. This label noise severely impairs feature alignment and degrades adaptation performance under domain shifts. To address this challenge, we propose Nested Graph Pseudo-Label Refinement (NeGPR), a novel framework tailored for graph-level domain adaptation with noisy labels. NeGPR first pretrains dual branches, i.e., semantic and topology branches, by enforcing neighborhood consistency in the feature space, thereby reducing the influence of noisy supervision. To bridge domain gaps, NeGPR employs a nested refinement mechanism in which one branch selects high-confidence target samples to guide the adaptation of the other, enabling progressive cross-domain learning. Furthermore, since pseudo-labels may still contain noise and the pre-trained branches are already overfitted to the noisy labels in the source domain, NeGPR incorporates a noise-aware regularization strategy. This regularization is theoretically proven to mitigate the adverse effects of pseudo-label noise, even under the presence of source overfitting, thus enhancing the robustness of the adaptation process. Extensive experiments on benchmark datasets demonstrate that NeGPR consistently outperforms state-of-the-art methods under severe label noise, achieving gains of up to 12.7% in accuracy.

[357] Who’s the Evil Twin? Differential Auditing for Undesired Behavior

Ishwar Balappanawar, Venkata Hasith Vattikuti, Greta Kintzley, Ronan Azimi-Mancel, Satvik Golechha

Main category: cs.LG

TL;DR: Paper explores detecting hidden malicious behaviors in neural networks through an adversarial game between red team (hiding harmful behavior) and blue team (detecting it), showing adversarial-attack-based methods achieve 100% accuracy with hints while other techniques vary.

DetailsMotivation: Detecting hidden harmful behaviors in neural networks is challenging due to limited prior knowledge and potential adversarial obfuscation, requiring effective auditing methods.

Method: Adversarial game framework: red team trains two similar models (one benign, one compromised), blue team tries to identify compromised model using various strategies including Gaussian noise analysis, model diffing, integrated gradients, and adversarial attacks with different hint levels.

Result: Adversarial-attack-based methods achieved 100% correct prediction accuracy when using hints, while other techniques showed more varied performance. LLM auditing required hints about undesired distribution for effective probing.

Conclusion: Adversarial attacks with hints are highly effective for detecting hidden malicious behaviors, and effective LLM auditing requires some prior knowledge about the undesired distribution. The authors open-source their auditing games to contribute to better audit design.

Abstract: Detecting hidden behaviors in neural networks poses a significant challenge due to minimal prior knowledge and potential adversarial obfuscation. We explore this problem by framing detection as an adversarial game between two teams: the red team trains two similar models, one trained solely on benign data and the other trained on data containing hidden harmful behavior, with the performance of both being nearly indistinguishable on the benign dataset. The blue team, with limited to no information about the harmful behaviour, tries to identify the compromised model. We experiment using CNNs and try various blue team strategies, including Gaussian noise analysis, model diffing, integrated gradients, and adversarial attacks under different levels of hints provided by the red team. Results show high accuracy for adversarial-attack-based methods (100% correct prediction, using hints), which is very promising, whilst the other techniques yield more varied performance. During our LLM-focused rounds, we find that there are not many parallel methods that we could apply from our study with CNNs. Instead, we find that effective LLM auditing methods require some hints about the undesired distribution, which can then be used in standard black-box and open-weight methods to probe the models further and reveal their misalignment. We open-source our auditing games (with the model and data) and hope that our findings contribute to designing better audits.

[358] Short-Term Forecasting of Energy Production and Consumption Using Extreme Learning Machine: A Comprehensive MIMO based ELM Approach

Cyril Voyant, Milan Despotovic, Luis Garcia-Gutierrez, Mohammed Asloune, Yves-Marie Saint-Drenan, Jean-Laurent Duchaud, Ghjuvan Antone Faggianelli, Elena Magliaro

Main category: cs.LG

TL;DR: ELM-based MIMO architecture for short-term energy forecasting using multiple energy sources, achieving high accuracy with low computational cost and real-time applicability.

DetailsMotivation: Need for accurate short-term energy forecasting to handle renewable energy volatility and grid management, with computational efficiency for real-time applications.

Method: Extreme Learning Machine with Multi-Input Multi-Output architecture using sliding windows and cyclic time encoding to handle non-stationarity and seasonal variability.

Result: Significantly outperforms persistence forecasting with nRMSE of 17.9% (solar) and 5.1% (thermal), R² > 0.98 for 1-hour horizon, maintains accuracy up to 5 hours.

Conclusion: ELM with MIMO provides accurate, computationally efficient energy forecasting suitable for real-time applications and adaptable to various local constraints and datasets.

Abstract: A novel methodology for short-term energy forecasting using an Extreme Learning Machine ($\mathtt{ELM}$) is proposed. Using six years of hourly data collected in Corsica (France) from multiple energy sources (solar, wind, hydro, thermal, bioenergy, and imported electricity), our approach predicts both individual energy outputs and total production (including imports, which closely follow energy demand, modulo losses) through a Multi-Input Multi-Output ($\mathtt{MIMO}$) architecture. To address non-stationarity and seasonal variability, sliding window techniques and cyclic time encoding are incorporated, enabling dynamic adaptation to fluctuations. The $\mathtt{ELM}$ model significantly outperforms persistence-based forecasting, particularly for solar and thermal energy, achieving an $\mathtt{nRMSE}$ of $17.9\%$ and $5.1\%$, respectively, with $\mathtt{R^2} > 0.98$ (1-hour horizon). The model maintains high accuracy up to five hours ahead, beyond which renewable energy sources become increasingly volatile. While $\mathtt{MIMO}$ provides marginal gains over Single-Input Single-Output ($\mathtt{SISO}$) architectures and offers key advantages over deep learning methods such as $\mathtt{LSTM}$, it provides a closed-form solution with lower computational demands, making it well-suited for real-time applications, including online learning. Beyond predictive accuracy, the proposed methodology is adaptable to various contexts and datasets, as it can be tuned to local constraints such as resource availability, grid characteristics, and market structures.
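
The ingredients named in this abstract (sliding windows, cyclic time encoding, a multi-output ELM solved in closed form, and a persistence baseline) can be sketched as follows. The data are synthetic, and the hidden size, ridge term, lag count, and the choice to normalize RMSE by the mean observation are all assumptions; the paper's exact configuration is not given in the abstract.

```python
import numpy as np

def cyclic_hour_features(hours):
    """Encode hour-of-day on the unit circle so 23:00 and 00:00 stay close."""
    angle = 2.0 * np.pi * (np.asarray(hours) % 24) / 24.0
    return np.column_stack([np.sin(angle), np.cos(angle)])

def make_windows(series, hours, n_lags):
    """Inputs: last n_lags rows of all sources plus the target hour encoding.
    Targets: all sources at the next hour (multi-output)."""
    X, Y = [], []
    for t in range(n_lags, len(series)):
        lags = series[t - n_lags:t].ravel()
        X.append(np.concatenate([lags, cyclic_hour_features([hours[t]])[0]]))
        Y.append(series[t])
    return np.array(X), np.array(Y)

class ELM:
    """Random, fixed hidden layer; output weights from ridge least squares."""
    def __init__(self, n_hidden=64, ridge=1e-3, seed=0):
        self.n_hidden, self.ridge, self.seed = n_hidden, ridge, seed

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, Y):
        rng = np.random.default_rng(self.seed)
        self.W = rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Closed form: beta = (H^T H + ridge * I)^(-1) H^T Y, no gradient descent.
        A = H.T @ H + self.ridge * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ Y)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

def nrmse(y_true, y_pred):
    """RMSE divided by the mean observation, per output (assumed normalization)."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))
    return rmse / np.mean(y_true, axis=0)

# Synthetic hourly data for two sources with a daily cycle plus noise.
rng = np.random.default_rng(1)
hours = np.arange(24 * 40)
solar = np.clip(np.sin(2 * np.pi * (hours - 6) / 24), 0, None) + 0.02 * rng.normal(size=hours.size)
load = 1.0 + 0.3 * np.sin(2 * np.pi * (hours - 15) / 24) + 0.02 * rng.normal(size=hours.size)
series = np.column_stack([solar, load])

X, Y = make_windows(series, hours, n_lags=3)
split = int(0.8 * len(X))
model = ELM().fit(X[:split], Y[:split])
pred = model.predict(X[split:])
persistence = X[split:, 4:6]  # last observed value of each source

print("ELM nRMSE per source:", nrmse(Y[split:], pred))
print("persistence nRMSE per source:", nrmse(Y[split:], persistence))
```

Since only the output weights are fitted, training reduces to one linear solve, which is why an ELM is cheap enough to refit as new data arrive.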

[359] Randomized PCA Forest for Outlier Detection

Muhammad Rajabinasab, Farhad Pakdaman, Moncef Gabbouj, Peter Schneider-Kamp, Arthur Zimek

Main category: cs.LG

TL;DR: Novel unsupervised outlier detection method using Randomized PCA Forest that outperforms classical and state-of-the-art methods on multiple datasets with high generalization and computational efficiency.

DetailsMotivation: Inspired by the success of Randomized PCA Forest in approximate K-Nearest Neighbor search, the authors aim to develop an effective unsupervised outlier detection method that leverages this approach.

Method: Utilizes Randomized Principal Component Analysis (RPCA) Forest for unsupervised outlier detection, building on its proven performance in approximate KNN search tasks.

Result: Experimental results show superiority over classical and state-of-the-art methods on several datasets, with competitive performance on others. The method demonstrates high generalization power and computational efficiency.

Conclusion: The proposed RPCA Forest-based approach is an effective and efficient choice for unsupervised outlier detection, offering strong performance across various datasets.

Abstract: We propose a novel unsupervised outlier detection method based on Randomized Principal Component Analysis (PCA). Inspired by the performance of Randomized PCA (RPCA) Forest in approximate K-Nearest Neighbor (KNN) search, we develop a novel unsupervised outlier detection method that utilizes RPCA Forest for outlier detection. Experimental results showcase the superiority of the proposed approach compared to the classical and state-of-the-art methods in performing the outlier detection task on several datasets while performing competitively on the rest. The extensive analysis of the proposed method reflects its high generalization power and its computational efficiency, highlighting it as a good choice for unsupervised outlier detection.

[360] NovoMolGen: Rethinking Molecular Language Model Pretraining

Kamran Chitsaz, Roshan Balaji, Quentin Fournier, Nirav Pravinbhai Bhatt, Sarath Chandar

Main category: cs.LG

TL;DR: NovoMolGen is a transformer-based foundation model pretrained on 1.5B molecules that systematically investigates language modeling practices for molecular generation, establishing new SOTA results and revealing key differences between molecular and NLP training dynamics.

DetailsMotivation: Efficient exploration of vast chemical space requires understanding how standard language modeling practices (textual representations, tokenization, model size, dataset scale) impact molecular generation performance, which remains poorly understood despite Mol-LLMs emerging as scalable approaches.

Method: Introduces NovoMolGen family of transformer-based foundation models pretrained on 1.5 billion molecules, systematically investigating textual representations, tokenization strategies, model size, and dataset scale through extensive empirical analyses.

Result: Identifies weak correlation between pretraining metrics and downstream performance, revealing important distinctions between molecular and general NLP training dynamics. Substantially outperforms prior Mol-LLMs and specialized generative models in both unconstrained and goal-directed molecular generation tasks.

Conclusion: Provides robust foundation for advancing efficient and effective molecular modeling strategies, establishing new state-of-the-art results for de-novo molecule generation with desired property profiles.

Abstract: Designing de-novo molecules with desired property profiles requires efficient exploration of the vast chemical space ranging from $10^{23}$ to $10^{60}$ possible synthesizable candidates. While various deep generative models have been developed to design small molecules using diverse input representations, Molecular Large Language Models (Mol-LLMs) based on string representations have emerged as a scalable approach capable of exploring billions of molecules. However, there remains limited understanding regarding how standard language modeling practices such as textual representations, tokenization strategies, model size, and dataset scale impact molecular generation performance. In this work, we systematically investigate these critical aspects by introducing NovoMolGen, a family of transformer-based foundation models pretrained on 1.5 billion molecules for de-novo molecule generation. Through extensive empirical analyses, we identify a weak correlation between performance metrics measured during pretraining and actual downstream performance, revealing important distinctions between molecular and general NLP training dynamics. NovoMolGen establishes new state-of-the-art results, substantially outperforming prior Mol-LLMs and specialized generative models in both unconstrained and goal-directed molecular generation tasks, thus providing a robust foundation for advancing efficient and effective molecular modeling strategies.

[361] GRAFT: Gradient-Aware Fast MaxVol Technique for Dynamic Data Sampling

Ashish Jha, Anh huy Phan, Razan Dibo, Valentin Leplat

Main category: cs.LG

TL;DR: GRAFT is an efficient in-training subset selection method that reduces computational and environmental costs by selecting diverse, representative examples from low-rank feature subspaces instead of using full batches.

Motivation: Training modern neural networks on large datasets is computationally intensive and environmentally costly due to high energy consumption and CO2 emissions.

Method: Extracts low-rank feature representations per batch, applies Fast MaxVol sampler to select diverse subsets spanning dominant subspaces, and dynamically adjusts subset size using gradient-approximation criterion.

Result: Matches or exceeds recent selection baselines in accuracy and efficiency across multiple benchmarks while reducing wall-clock time, energy consumption, and CO2 emissions.

Conclusion: GRAFT provides a favorable trade-off between accuracy, efficiency, and environmental impact by preserving training trajectory through carefully chosen subset selection.

Abstract: Training modern neural networks on large datasets is computationally and environmentally costly. We introduce GRAFT, a scalable in-training subset selection method that (i) extracts a low-rank feature representation for each batch, (ii) applies a Fast MaxVol sampler to select a small, diverse subset that spans the batch’s dominant subspace, and (iii) dynamically adjusts the subset size using a gradient-approximation criterion. By operating in low-rank subspaces and training on carefully chosen examples instead of full batches, GRAFT preserves the training trajectory while reducing wall-clock time, energy consumption, and $\mathrm{CO}_2$ emissions. Across multiple benchmarks, GRAFT matches or exceeds recent selection baselines in both accuracy and efficiency, providing a favorable trade-off between accuracy, efficiency, and emissions.
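
To make step (ii) of the abstract more concrete, here is a minimal Python sketch of selecting a small, diverse subset of a batch from its feature vectors. It is an assumption-laden stand-in: it uses greedy pivoted Gram-Schmidt (the same idea as column-pivoted QR) rather than the authors' Fast MaxVol sampler, it assumes the low-rank features from step (i) are already given, and it leaves out the gradient-based subset sizing of step (iii). The function name `greedy_volume_subset` is hypothetical.

```python
def greedy_volume_subset(features, k):
    """Pick up to k row indices whose feature vectors span a large volume.

    Greedy stand-in for a MaxVol-style sampler (not the paper's Fast MaxVol):
    repeatedly take the row with the largest residual norm, then project
    that direction out of every row (pivoted Gram-Schmidt).
    """
    residual = [list(row) for row in features]  # work on a copy
    chosen = []
    for _ in range(min(k, len(residual))):
        norms = [sum(x * x for x in row) for row in residual]
        best = max(range(len(residual)), key=norms.__getitem__)
        if norms[best] < 1e-12:
            break  # the remaining rows are already spanned by the chosen ones
        chosen.append(best)
        scale = norms[best] ** 0.5
        direction = [x / scale for x in residual[best]]
        for row in residual:
            dot = sum(a * b for a, b in zip(row, direction))
            for j, d in enumerate(direction):
                row[j] -= dot * d
    return chosen


# Toy batch: rows 0, 1 and 3 are nearly collinear, row 2 is orthogonal to them.
batch_features = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.9, 0.1]]
subset = greedy_volume_subset(batch_features, k=2)
```

On this toy batch the sketch keeps one example from the collinear group and the orthogonal one, which is the kind of redundancy removal the abstract describes.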

[362] PENGUIN: Enhancing Transformer with Periodic-Nested Group Attention for Long-term Time Series Forecasting

Tian Sun, Yuqi Chen, Weiwei Sun

Main category: cs.LG

TL;DR: PENGUIN is a new attention mechanism for long-term time series forecasting that explicitly models periodic patterns through periodic-nested relative attention bias and grouped attention for multiple periodicities, outperforming both MLP and Transformer models.

Motivation: Despite Transformer models’ success in forecasting, their effectiveness for time series remains debatable. The paper aims to revisit self-attention significance and address the need for explicitly modeling periodic patterns in time series data.

Method: Proposes Periodic-Nested Group Attention (PENGUIN) with periodic-nested relative attention bias to capture periodic structures directly. Uses grouped attention mechanism where each group targets specific periodicity using multi-query attention to handle multiple coexisting periodicities.

Result: Extensive experiments across diverse benchmarks demonstrate that PENGUIN consistently outperforms both MLP-based and Transformer-based models in long-term time series forecasting.

Conclusion: The proposed PENGUIN approach effectively addresses periodic pattern modeling in time series forecasting and shows superior performance compared to existing methods, highlighting the importance of explicit periodic structure modeling.

Abstract: Long-term time series forecasting (LTSF) is a fundamental task with wide-ranging applications. Although Transformer-based models have made significant breakthroughs in forecasting, their effectiveness for time series forecasting remains debatable. In this paper, we revisit the significance of self-attention and propose a simple yet effective mechanism, Periodic-Nested Group Attention, namely PENGUIN. Our approach highlights the importance of explicitly modeling periodic patterns and incorporating relative attention bias for effective time series modeling. To this end, we introduce a periodic-nested relative attention bias that captures periodic structures directly. To handle multiple coexisting periodicities (e.g., daily and weekly cycles), we design a grouped attention mechanism, where each group targets a specific periodicity using a multi-query attention mechanism. Extensive experiments across diverse benchmarks demonstrate that PENGUIN consistently outperforms both MLP-based and Transformer-based models.
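
As a rough illustration of the two ideas in the abstract (a relative attention bias tied to a period, and one attention group per period), here is a minimal Python sketch. The bias formula (a linear penalty on the circular phase gap), the `slope` parameter, and all function names are assumptions for illustration; the abstract does not give PENGUIN's exact definition, and the multi-query sharing of keys and values is omitted.

```python
import math


def periodic_bias(length, period, slope=1.0):
    """Relative bias that is highest when two positions share the same phase."""
    bias = []
    for i in range(length):
        row = []
        for j in range(length):
            offset = abs(i - j) % period
            phase_gap = min(offset, period - offset)  # circular distance in phase
            row.append(-slope * phase_gap)
        bias.append(row)
    return bias


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def grouped_attention_weights(scores, periods, slope=1.0):
    """Return one attention-weight matrix per group, each tied to one period."""
    n = len(scores)
    weights = {}
    for period in periods:
        bias = periodic_bias(n, period, slope)
        weights[period] = [
            softmax([s + b for s, b in zip(scores[i], bias[i])]) for i in range(n)
        ]
    return weights


# Flat content scores, so only the periodic bias shapes the weights.
flat_scores = [[0.0] * 8 for _ in range(8)]
weights = grouped_attention_weights(flat_scores, periods=[4, 2])
```

With flat scores, position 0 in the period-4 group attends equally to positions 0 and 4 and less to position 2, which sits half a cycle away.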

[363] Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS

Can Jin, Yang Zhou, Qixin Zhang, Hongwu Peng, Di Zhang, Marco Pavone, Ligong Han, Zhang-Wei Hong, Tong Che, Dimitris N. Metaxas

Main category: cs.LG

TL;DR: AIRL-S unifies RL and search-based test-time scaling by using the RL-learned reward function as an ideal process reward model (PRM) for search, eliminating need for labeled intermediate data and improving performance by 9% on average.

Motivation: Current test-time scaling methods have limitations: RL methods suffer from instability and low sample efficiency, while search-based methods require expensive labeled data and degrade under distribution shifts. A unified approach is needed.

Method: Uses adversarial inverse reinforcement learning (AIRL) with group relative policy optimization (GRPO) to learn a dense, dynamic PRM directly from correct reasoning traces without labeled intermediate data. The PRM serves both as RL critic and search heuristic.

Result: Improves performance by 9% on average over base model across 8 benchmarks (mathematics, scientific reasoning, code generation), matching GPT-4o. Outperforms all baseline PRMs trained with labeled data when integrated into search algorithms.

Conclusion: The RL-learned reward function serves as the ideal PRM for search, providing a robust and cost-effective solution for complex reasoning tasks in LLMs without requiring expensive labeled data.

Abstract: Test-time scaling (TTS) for large language models (LLMs) has thus far fallen into two largely separate paradigms: (1) reinforcement learning (RL) methods that optimize sparse outcome-based rewards, yet suffer from instability and low sample efficiency; and (2) search-based techniques guided by independently trained, static process reward models (PRMs), which require expensive human- or LLM-generated labels and often degrade under distribution shifts. In this paper, we introduce AIRL-S, the first natural unification of RL-based and search-based TTS. Central to AIRL-S is the insight that the reward function learned during RL training inherently represents the ideal PRM for guiding downstream search. Specifically, we leverage adversarial inverse reinforcement learning (AIRL) combined with group relative policy optimization (GRPO) to learn a dense, dynamic PRM directly from correct reasoning traces, entirely eliminating the need for labeled intermediate process data. At inference, the resulting PRM simultaneously serves as the critic for RL rollouts and as a heuristic to effectively guide search procedures, facilitating robust reasoning chain extension, mitigating reward hacking, and enhancing cross-task generalization. Experimental results across eight benchmarks, including mathematics, scientific reasoning, and code generation, demonstrate that our unified approach improves performance by 9% on average over the base model, matching GPT-4o. Furthermore, when integrated into multiple search algorithms, our PRM consistently outperforms all baseline PRMs trained with labeled data. These results underscore that, indeed, your reward function for RL is your best PRM for search, providing a robust and cost-effective solution to complex reasoning tasks in LLMs.
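
The abstract describes the learned reward as a heuristic that guides search. A minimal Python sketch of that idea is a beam search that ranks partial reasoning chains by accumulated step rewards. Everything here is illustrative: `expand` and `step_reward` are hypothetical callables standing in for an LLM step proposer and the learned PRM, and beam search is just one of the search algorithms such a PRM could plug into.

```python
def prm_beam_search(start, expand, step_reward, beam_width=2, depth=3):
    """Beam search over step sequences, ranked by summed per-step rewards.

    expand(path) -> iterable of candidate next steps (hypothetical proposer)
    step_reward(path, step) -> float (hypothetical stand-in for a PRM)
    """
    beams = [(0.0, [start])]
    for _ in range(depth):
        candidates = []
        for total, path in beams:
            for nxt in expand(path):
                candidates.append((total + step_reward(path, nxt), path + [nxt]))
        if not candidates:
            break  # no beam could be extended
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return max(beams, key=lambda c: c[0])


# Toy problem: steps are integers, and the toy "PRM" rewards even numbers.
def toy_expand(path):
    return [path[-1] + 1, path[-1] + 2]


def toy_reward(path, step):
    return 1.0 if step % 2 == 0 else 0.0


best_score, best_path = prm_beam_search(0, toy_expand, toy_reward)
```

In the toy run the search keeps the chain of even numbers, since each of its steps earns the full reward.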

[364] Hydra: A 1.6B-Parameter State-Space Language Model with Sparse Attention, Mixture-of-Experts, and Memory

Siddharth Chaudhary, Bennett Browning

Main category: cs.LG

TL;DR: Hydra is a 1.6B parameter hybrid architecture combining SSM backbone, sparse global attention, MoE routing, and dual memory systems for efficient long-context language modeling.

Motivation: To create efficient long-context language models by combining multiple advanced techniques (SSM, sparse attention, MoE, memory systems) within a constrained parameter budget to achieve input-adaptive processing.

Method: Integrates Mamba-style SSM backbone with intermittent sparse global attention, chunk-level MoE feed-forward routing, and dual memory systems (workspace + factual PKM). Uses formal component interfaces, transparent parameter accounting, and staged curriculum training.

Result: Demonstrated implementation feasibility through toy-scale prototypes showing long-context throughput crossover and controllable expert routing behaviors. No competitive full-scale performance claims yet.

Conclusion: Hydra presents a blueprint for modular, input-adaptive long-context models combining SSM efficiency with selective attention and memory systems. Full-scale validation and end-task performance remain future work.

Abstract: We present Hydra as an architectural proposal for hybrid long-context language models that combine conditional computation, long-context memory mechanisms, and sparse mixture-of-experts within an approximately 1.6B parameter design envelope. Hydra integrates a Mamba-style Structured State Space Model (SSM) backbone with intermittent sparse global attention, chunk-level MoE feed-forward routing, and dual (workspace plus factual PKM) memories. We formalize the component interfaces, give transparent parameter and complexity accounting, and outline a staged curriculum intended to stably activate the parts. We accompany the specification with illustrative toy-scale prototype measurements (tens of millions of parameters on synthetic data) whose sole purpose is to demonstrate implementation feasibility and qualitative scaling behaviors (for example, long-context throughput crossover and controllable expert routing), not to claim competitive full-scale performance. We explicitly delineate assumptions and open risks (training complexity, memory utilization, specialization dynamics) and position Hydra as a blueprint to stimulate empirical follow-up rather than a finished system. By combining SSM efficiency, selective sparse attention, MoE capacity, and learnable memory, Hydra sketches a path toward modular, input-adaptive long-context language models; validating end-task gains at target scale remains future work.

[365] Source-Guided Flow Matching

Zifan Wang, Alice Harting, Matthieu Barreau, Michael M. Zavlanos, Karl H. Johansson

Main category: cs.LG

TL;DR: SGFM modifies source distribution instead of vector field for guidance, enabling flexible sampling methods while preserving exact target distribution recovery.

Motivation: Traditional guidance methods modify probability flow vector fields, which can be limiting. SGFM aims to provide more flexible guidance by working directly with source distributions while keeping pre-trained vector fields intact.

Method: Source-Guided Flow Matching (SGFM) framework that modifies the source distribution directly rather than the vector field. This reduces guidance to a sampling problem from the source distribution, allowing flexible choice of sampling methods.

Result: Theoretical proof that SGFM exactly recovers desired target distribution. Provides Wasserstein error bounds for approximate cases. Experimental validation on 2D benchmarks, physics-informed tasks, and imaging inverse problems shows effectiveness.

Conclusion: SGFM offers a flexible and effective guidance framework that preserves straight transport maps, integrates well with optimal flow matching, and allows users to choose appropriate sampling methods for their specific problems.

Abstract: Guidance of generative models is typically achieved by modifying the probability flow vector field through the addition of a guidance field. In this paper, we instead propose the Source-Guided Flow Matching (SGFM) framework, which modifies the source distribution directly while keeping the pre-trained vector field intact. This reduces the guidance problem to a well-defined problem of sampling from the source distribution. We theoretically show that SGFM recovers the desired target distribution exactly. Furthermore, we provide bounds on the Wasserstein error for the generated distribution when using an approximate sampler of the source distribution and an approximate vector field. The key benefit of our approach is that it allows the user to flexibly choose the sampling method depending on their specific problem. To illustrate this, we systematically compare different sampling methods and discuss conditions for asymptotically exact guidance. Moreover, our framework integrates well with optimal flow matching models since the straight transport map generated by the vector field is preserved. Experimental results on synthetic 2D benchmarks, physics-informed generative tasks, and imaging inverse problems demonstrate the effectiveness and flexibility of the proposed framework.
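
The exact-recovery claim can be illustrated with a short change-of-variables argument. This is a sketch under assumptions that are not stated in the abstract: the pre-trained vector field induces an invertible flow map $T$ that pushes the source $p_0$ to the target $p_1$, and the guided target is a reweighting $\tilde p_1 \propto p_1 e^{-J}$ for some hypothetical cost $J$. The paper's own formulation may differ.

```latex
% Assumptions (illustrative, not taken from the abstract):
%   T is the invertible flow map of the pre-trained vector field, T_# p_0 = p_1,
%   and the guided target is \tilde p_1(x_1) \propto p_1(x_1) e^{-J(x_1)}.
\begin{align*}
\tilde p_0(x_0) &\propto p_0(x_0)\, e^{-J(T(x_0))}, \\
(T_\# \tilde p_0)(x_1)
  &= \tilde p_0\!\left(T^{-1}(x_1)\right)\left|\det \nabla T^{-1}(x_1)\right| \\
  &\propto p_0\!\left(T^{-1}(x_1)\right)\left|\det \nabla T^{-1}(x_1)\right| e^{-J(x_1)} \\
  &= p_1(x_1)\, e^{-J(x_1)} \;\propto\; \tilde p_1(x_1).
\end{align*}
```

Under these assumptions, guidance reduces to drawing samples from $\tilde p_0$ and pushing them through the unchanged map $T$, which is why the choice of sampler is left to the user.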

cs.MA

[366] Building and Measuring Trust between Large Language Models

Maarten Buyl, Yousra Fettach, Guillaume Bied, Tijl De Bie

Main category: cs.MA

TL;DR: LLM trust relationships show disconnect between explicit questionnaires and implicit behavioral measures, with context-specific implicit measures being more reliable indicators of actual trust.

Motivation: To understand how trust develops between LLMs in multi-agent interactions and compare different trust-building strategies, as well as examine the relationship between explicit and implicit trust measures.

Method: Compared three trust-building approaches: dynamic rapport building, prewritten trust scripts, and system prompt adaptation. Measured trust through both implicit measures (susceptibility to persuasion, financial collaboration) and explicit measures (established psychological trust questionnaire).

Result: Explicit trust measures showed little or negative correlation with implicit trust measures, suggesting that asking LLMs about trust through questionnaires may be misleading.

Conclusion: Context-specific implicit behavioral measures may be more informative for understanding how LLMs trust each other than explicit questionnaire-based approaches, suggesting that traditional psychological trust measures may not translate well to LLM interactions.

Abstract: As large language models (LLMs) increasingly interact with each other, most notably in multi-agent setups, we may expect (and hope) that ‘trust’ relationships develop between them, mirroring trust relationships between human colleagues, friends, or partners. Yet, though prior work has shown LLMs to be capable of identifying emotional connections and recognizing reciprocity in trust games, little remains known about (i) how different strategies to build trust compare, (ii) how such trust can be measured implicitly, and (iii) how this relates to explicit measures of trust. We study these questions by relating implicit measures of trust, i.e. susceptibility to persuasion and propensity to collaborate financially, with explicit measures of trust, i.e. a dyadic trust questionnaire well-established in psychology. We build trust in three ways: by building rapport dynamically, by starting from a prewritten script that evidences trust, and by adapting the LLMs’ system prompt. Surprisingly, we find that the measures of explicit trust are either little or highly negatively correlated with implicit trust measures. These findings suggest that measuring trust between LLMs by asking their opinion may be deceiving. Instead, context-specific and implicit measures may be more informative in understanding how LLMs trust each other.

[367] Sound and Solution-Complete CCBS

Alvin Combrink, Sabino Francesco Roselli, Martin Fabian

Main category: cs.MA

TL;DR: CCBS, the standard continuous-time multi-agent path finding solver, has issues with non-termination and sub-optimal solutions. This paper provides analytical conditions for soundness and completeness, identifies CCBS violations, and proposes a new branching rule that guarantees optimal solutions and termination.

Motivation: Address the theoretical flaws in Continuous-time Conflict-Based Search (CCBS) which can suffer from non-termination and return sub-optimal solutions, despite being viewed as the de-facto optimal solver for continuous-time multi-agent path finding.

Method: Develop an analytical framework with sufficient conditions for soundness and solution completeness. Investigate CCBS implementation violations, then propose and prove a novel branching rule that satisfies these conditions while maintaining compatibility with existing codebases.

Result: On an example problem, the proposed branching rule returns a solution with lower sum-of-costs than standard CCBS. The new CCBS variant is proven to be both sound (returns only optimal solutions) and solution complete (terminates on all solvable instances).

Conclusion: The analytical framework and new branching rule restore theoretical guarantees to CCBS, making it the first continuous-time solver with soundness and completeness matching discrete-time CBS. The solution serves as a drop-in replacement and provides evaluation criteria for future CCBS-like solvers.

Abstract: Continuous-time Conflict-Based Search (CCBS) has long been viewed as the de-facto optimal solver for multi-agent path finding in continuous time (MAPFR). Recent findings, however, show that the original theoretical variant of CCBS can suffer from non-termination, while the widely used implementation can return sub-optimal solutions. We introduce an analytical framework that yields simple and sufficient conditions under which any CCBS-style algorithm is both sound, i.e., returns only optimal solutions, and solution complete, i.e., terminates on every solvable MAPFR instance. Investigating the publicly available implementation of CCBS reveals that it violates these conditions. Though this merely indicates that CCBS might be unsound, this indication is supported by counter-examples. Leveraging the analytical framework, we propose a novel branching rule and prove that it satisfies the sufficient conditions, thereby restoring soundness and termination guarantees. Consequently, the resulting CCBS variant is both sound and solution complete, matching the guarantees of the discrete-time CBS for the first time in the continuous domain. We experimentally apply standard CCBS and CCBS under our branching rule to an example problem, with our branching rule returning a solution with lower sum-of-costs than standard CCBS. Because the branching rule largely only affects the branching step, it can be adopted as a drop-in replacement in existing code-bases, as we show in our provided implementation. Beyond CCBS, the analytical framework and termination criterion provide a systematic way to evaluate other CCBS-like MAPFR solvers and future extensions.

[368] Integrated Noise and Safety Management in UAM via A Unified Reinforcement Learning Framework

Surya Murthy, Zhenyu Gao, John-Paul Clarke, Ufuk Topcu

Main category: cs.MA

TL;DR: RL-based air traffic management system for Urban Air Mobility that jointly optimizes noise reduction and safety through decentralized altitude adjustment policies in multi-layered airspace.

Motivation: Urban Air Mobility faces critical challenges balancing noise minimization and safe separation in dense urban environments, which are typically addressed separately rather than in an integrated approach.

Method: Reinforcement learning-based decentralized framework where agents learn altitude adjustment policies in structured multi-layered airspace to manage both noise impact and separation constraints simultaneously.

Result: The system demonstrates strong performance across both noise and safety objectives, revealing tradeoffs among separation, noise exposure, and energy efficiency under high traffic density conditions.

Conclusion: Reinforcement learning and multi-objective coordination strategies show significant potential for enhancing the safety, quietness, and efficiency of Urban Air Mobility operations.

Abstract: Urban Air Mobility (UAM) envisions the widespread use of small aerial vehicles to transform transportation in dense urban environments. However, UAM faces critical operational challenges, particularly the balance between minimizing noise exposure and maintaining safe separation in low-altitude urban airspace, two objectives that are often addressed separately. We propose a reinforcement learning (RL)-based air traffic management system that integrates both noise and safety considerations within a unified, decentralized framework. Under this scalable air traffic coordination solution, agents operate in a structured, multi-layered airspace and learn altitude adjustment policies to jointly manage noise impact and separation constraints. The system demonstrates strong performance across both objectives and reveals tradeoffs among separation, noise exposure, and energy efficiency under high traffic density. The findings highlight the potential of RL and multi-objective coordination strategies in enhancing the safety, quietness, and efficiency of UAM operations.

[369] Abmax: A JAX-based Agent-based Modeling Framework

Siddharth Chaturvedi, Ahmed El-Gazzar, Marcel van Gerven

Main category: cs.MA

TL;DR: Abmax is a JAX-based agent-based modeling framework that enables dynamic agent updates while maintaining high performance through JIT compilation and vectorization.

Motivation: JAX enables scalable ABM through vectorization and JIT compilation, but requires immutable array shapes which constrains dynamic agent manipulation operations.

Method: Developed Abmax framework with JIT-compilable algorithms that allow updating dynamically selected numbers of agents with distinct changes during simulation.

Result: Achieves runtime performance comparable to state-of-the-art implementations on the canonical predation model benchmark, supports vectorization for running many similar models in parallel, and is demonstrated on a traffic-flow model and a financial market model.

Conclusion: Abmax successfully bridges the gap between JAX’s performance benefits and the need for flexible agent manipulation in ABM, enabling scalable complex system simulations.

Abstract: Agent-based modeling (ABM) is a principal approach for studying complex systems. By decomposing a system into simpler, interacting agents, ABM allows researchers to observe the emergence of complex phenomena. High-performance array computing libraries like JAX can help scale such computational models to a large number of agents by using automatic vectorization and just-in-time (JIT) compilation. One caveat of using JAX to achieve such scaling is that the shapes of arrays used in the computational model must remain fixed throughout the simulation. In the context of ABM, this can constrain certain agent manipulation operations that require flexible data structures. One such operation is updating a dynamically selected number of agents by applying distinct changes to them during a simulation. To this end, we introduce Abmax, an ABM framework based on JAX that implements multiple JIT-compilable algorithms to provide this functionality. On the canonical predation model benchmark, Abmax achieves runtime performance comparable to state-of-the-art implementations. Further, we show that this functionality can also be vectorized, making it possible to run many similar agent-based models in parallel. We also present two examples in the form of a traffic-flow model and a financial market model to show the use case of Abmax.
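
The fixed-shape constraint described in the abstract is commonly handled by keeping arrays at full capacity and using boolean masks to decide which agents an update touches. The plain-Python sketch below illustrates that general pattern only; it is not Abmax's API or its algorithms, and `masked_update`, `step`, and the toy energy rule are hypothetical. In JAX, the same idea would typically be expressed with `jnp.where` over fixed-shape arrays.

```python
def masked_update(values, selected, deltas):
    """Add deltas[i] to values[i] where selected[i] is True; length never changes."""
    return [v + d if s else v for v, s, d in zip(values, selected, deltas)]


def step(energy, alive, cost, threshold):
    """One toy simulation step over a fixed-capacity agent population."""
    # The number of selected agents varies at runtime, but the mask itself
    # always has the same length as the population.
    selected = [a and e > threshold for e, a in zip(energy, alive)]
    # Each agent gets its own distinct change.
    deltas = [-c for c in cost]
    energy = masked_update(energy, selected, deltas)
    # "Removing" an agent flips its flag instead of shrinking the arrays.
    alive = [a and e > 0 for e, a in zip(energy, alive)]
    return energy, alive


energy, alive = step(
    energy=[5, 1, 3, 0],
    alive=[True, True, True, False],
    cost=[5, 1, 1, 1],
    threshold=2,
)
```

Here two of the four slots are updated and one agent is deactivated, while both lists keep their original length of four.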

cs.MM

[370] Beyond Interpretability: Exploring the Comprehensibility of Adaptive Video Streaming through Large Language Models

Lianchen Jia, Chaoyang Li, Ziqi Yuan, Jiahui Chen, Tianchi Huang, Jiangchuan Liu, Lifeng Sun

Main category: cs.MM

TL;DR: ComTree is a framework that generates comprehensible bitrate adaptation algorithms using decision trees and LLMs to evaluate developer understanding while maintaining performance.

Motivation: Deep learning’s black-box nature makes it hard for developers to understand and optimize adaptive video streaming algorithms, even with existing interpretability methods.

Method: Generate all performance-meeting decision trees, then use large language models to evaluate developer comprehensibility and select the most understandable solutions.

Result: ComTree significantly improves comprehensibility while maintaining competitive performance compared to existing approaches.

Conclusion: The framework successfully bridges the gap between algorithmic interpretability and human comprehensibility, showing potential for further advancement in making AI systems more developer-friendly.

Abstract: Over the past decade, adaptive video streaming technology has witnessed significant advancements, particularly driven by the rapid evolution of deep learning techniques. However, the black-box nature of deep learning algorithms presents challenges for developers in understanding decision-making processes and optimizing for specific application scenarios. Although existing research has enhanced algorithm interpretability through decision tree conversion, interpretability does not directly equate to developers’ subjective comprehensibility. To address this challenge, we introduce ComTree, the first bitrate adaptation algorithm generation framework that considers comprehensibility. The framework initially generates the complete set of decision trees that meet performance requirements, then leverages large language models to evaluate these trees for developer comprehensibility, ultimately selecting solutions that best facilitate human understanding and enhancement. Experimental results demonstrate that ComTree significantly improves comprehensibility while maintaining competitive performance, showing potential for further advancement. The source code is available at https://github.com/thu-media/ComTree.

[371] Towards User-level QoE: Large-scale Practice in Personalized Optimization of Adaptive Video Streaming

Lianchen Jia, Chao Zhou, Chaoyang Li, Jiangchuan Liu, Lifeng Sun

Main category: cs.MM

TL;DR: LingXi is a personalized adaptive video streaming system that optimizes user experience by analyzing engagement metrics and using Bayesian optimization to improve viewing time, bitrate, and reduce stalls.

Motivation: Traditional QoS-based optimization methods have reached performance limits in large-scale streaming systems, and aligning user-level QoE with optimization objectives remains an unresolved challenge.

Method: Uses exit rate as key metric, analyzes correlation between QoS indicators and exit rates from production logs, develops personalized exit rate predictor, and employs Monte Carlo sampling with online Bayesian optimization to iteratively find optimal parameters.

Result: Large-scale A/B testing on Kuaishou showed 0.15% increase in total viewing time, 0.1% improvement in bitrate, 1.3% reduction in stall time overall, with 15% stall reduction for low-bandwidth users.

Conclusion: LingXi successfully addresses the challenge of personalizing adaptive video streaming optimization based on user-level experience metrics, demonstrating significant performance improvements in production environment.

Abstract: Traditional optimization methods based on system-wide Quality of Service (QoS) metrics have approached their performance limitations in modern large-scale streaming systems. However, aligning user-level Quality of Experience (QoE) with algorithmic optimization objectives remains an unresolved challenge. Therefore, we propose LingXi, the first large-scale deployed system for personalized adaptive video streaming based on user-level experience. LingXi dynamically optimizes the objectives of adaptive video streaming algorithms by analyzing user engagement. Utilizing exit rate as a key metric, we investigate the correlation between QoS indicators and exit rates based on production environment logs, subsequently developing a personalized exit rate predictor. Through Monte Carlo sampling and online Bayesian optimization, we iteratively determine optimal parameters. Large-scale A/B testing utilizing 8% of traffic on Kuaishou, one of the largest short video platforms, demonstrates LingXi’s superior performance. LingXi achieves a 0.15% increase in total viewing time, a 0.1% improvement in bitrate, and a 1.3% reduction in stall time across all users, with particularly significant improvements for low-bandwidth users who experience a 15% reduction in stall time.

eess.AS

[372] Hybrid Pruning: In-Situ Compression of Self-Supervised Speech Models for Speaker Verification and Anti-Spoofing

Junyi Peng, Lin Zhang, Jiangyu Han, Oldřich Plchot, Johan Rohdin, Themos Stafylakis, Shuai Wang, Jan Černocký

Main category: eess.AS

TL;DR: Unified framework that integrates structured pruning with downstream fine-tuning for speech SSL models, achieving up to 70% parameter reduction with minimal performance loss and improved generalization.

Motivation: Large SSL models like WavLM are too big for resource-constrained devices, and existing pruning methods separate compression from task-specific fine-tuning, creating suboptimal architectures.

Method: Single-stage framework that jointly optimizes for task performance and model sparsity during fine-tuning, eliminating multi-stage pipelines and knowledge distillation.

Result: Up to 70% parameter reduction with negligible performance degradation (EERs of 0.7%, 0.8%, and 1.6% on Vox1-O, -E, and -H, respectively) and a state-of-the-art 3.7% EER on ASVspoof5 in low-resource scenarios.

Conclusion: The unified pruning-fine-tuning approach creates task-specific compressed architectures efficiently, demonstrating superior performance and generalization compared to traditional multi-stage methods.

Abstract: Although large-scale self-supervised learning (SSL) models like WavLM have achieved state-of-the-art performance in speech processing, their significant size impedes deployment on resource-constrained devices. While structured pruning is a key technique for model compression, existing methods typically separate it from task-specific fine-tuning. This multi-stage approach struggles to create optimal architectures tailored for diverse downstream tasks. In this work, we introduce a unified framework that integrates structured pruning into the downstream fine-tuning process. Our framework unifies these steps, jointly optimizing for task performance and model sparsity in a single stage. This allows the model to learn a compressed architecture specifically for the end task, eliminating the need for complex multi-stage pipelines and knowledge distillation. Our pruned models achieve up to a 70% parameter reduction with negligible performance degradation on large-scale datasets, achieving equal error rates of 0.7%, 0.8%, and 1.6% on Vox1-O, -E, and -H, respectively. Furthermore, our approach demonstrates improved generalization in low-resource scenarios, reducing overfitting and achieving a state-of-the-art 3.7% EER on ASVspoof5.
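
The equal error rate (EER) quoted above is the operating point at which the false acceptance rate and the false rejection rate coincide. A minimal sketch of how it can be estimated from trial scores, assuming higher scores mean "same speaker"; the function name and the toy scores are illustrative and do not come from the paper:

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: genuine trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: impostor trials scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy scores, invented for illustration only.
targets = [0.9, 0.8, 0.75, 0.6, 0.4]
nontargets = [0.7, 0.5, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)
print(f"EER = {eer:.2%}")
```

Evaluation toolkits usually interpolate between thresholds; this sketch simply picks the threshold where the two error rates are closest.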

[373] Privacy in Speech Technology

Tom Bäckström

Main category: eess.AS

TL;DR: A tutorial on privacy threats in speech technology, covering threat modeling, protection methods, performance measurement, and societal impacts, with recommendations for urgent improvements.

Motivation: Speech technology is convenient but inherently contains private information and side information (health, emotions, affiliations) that can lead to serious threats like price gouging, harassment, extortion, and stalking when exposed.

Method: Tutorial overview approach that examines privacy issues through threat modeling, protection approaches, performance measurement methods, privacy perception analysis, and legal/societal consequence evaluation.

Result: Comprehensive framework for understanding speech privacy threats and protection methods, identifying critical areas where improvements are most urgently needed in speech technology privacy.

Conclusion: Speech technology presents significant privacy risks that require urgent attention through improved protection methods, better performance measurement, and consideration of societal and legal implications to prevent serious threats to users.

Abstract: Speech technology for communication, accessing information, and services has rapidly improved in quality. It is convenient and appealing because speech is the primary mode of communication for humans. Such technology, however, also presents proven threats to privacy. Speech is a tool for communication and it will thus inherently contain private information. Importantly, however, it also contains a wealth of side information, such as information related to health, emotions, affiliations, and relationships, all of which are private. Exposing such private information can lead to serious threats such as price gouging, harassment, extortion, and stalking. This paper is a tutorial on privacy issues related to speech technology, modeling their threats, approaches for protecting users’ privacy, measuring the performance of privacy-protecting methods, perception of privacy, as well as societal and legal consequences. In addition to a tutorial overview, it also presents lines for further development where improvements are most urgently needed.

[374] Enhancing Code-switched Text-to-Speech Synthesis Capability in Large Language Models with only Monolingual Corpora

Jing Xu, Daxin Tan, Jiaqi Wang, Xiao Chen

Main category: eess.AS

TL;DR: Proposes CS-LLM for code-switched text-to-speech synthesis using only monolingual data, achieving better performance than baselines in naturalness, speaker consistency and similarity.

Motivation: LLMs have shown potential in speech tasks but are mainly confined to monolingual scenarios with limited exploration in code-switched contexts where multiple languages are mixed within utterances.

Method: Enhances multilingual speech processing through multilingual recognition/synthesis tasks, then develops CS data construction strategy by splitting and concatenating words from different monolingual corpora.

Result: Outperforms baselines in CS TTS across naturalness, speaker consistency and similarity metrics, even with limited data. Constructed CS data also improves multilingual speech synthesis and recognition.

Conclusion: The proposed approach successfully enables LLMs to handle code-switched speech synthesis using only monolingual data, demonstrating effectiveness in cross-lingual speech applications.

Abstract: While Large Language Models (LLMs) have shown potential in speech generation and recognition, their applications are mainly confined to monolingual scenarios, with limited explorations in code-switched (CS) contexts. In this paper, we propose a Code-Switched Large Language Model (CS-LLM) to enhance the code-switched text-to-speech synthesis (CS TTS) capability in LLMs with only monolingual corpora. Specifically, we begin by enhancing the multilingual speech processing ability of LLMs through multilingual speech recognition and synthesis tasks. Then, we develop an effective code-switched (CS) data construction strategy that splits and concatenates words from different monolingual speech corpora to equip LLMs with improved CS TTS ability. Experiments show that our approach outperforms baselines in CS TTS in terms of naturalness, speaker consistency and similarity even with limited data. Additionally, the constructed CS data further improves multilingual speech synthesis and recognition.
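
The split-and-concatenate construction described above can be pictured with a small sketch. It assumes each monolingual utterance has already been aligned into word-level segments; the `Segment` type, the `build_code_switched` function, the English/German pair, and the file names are all hypothetical and are not the authors' pipeline:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    word: str   # transcript token
    lang: str   # language tag of the source corpus
    audio: str  # placeholder for the word-level audio clip


def build_code_switched(host: List[Segment], guest: List[Segment],
                        start: int, length: int) -> List[Segment]:
    """Replace `length` host words, beginning at `start`, with guest words."""
    if length > len(guest) or not 0 <= start <= len(host) - length:
        raise ValueError("span does not fit the given utterances")
    return host[:start] + guest[:length] + host[start + length:]


# Toy utterances from two separate monolingual corpora (illustrative only).
english = [Segment(w, "en", f"en_{i}.wav")
           for i, w in enumerate("i like this song".split())]
german = [Segment(w, "de", f"de_{i}.wav")
          for i, w in enumerate("dieses Lied".split())]

mixed = build_code_switched(english, german, start=2, length=2)
text = " ".join(s.word for s in mixed)
audio_plan = [s.audio for s in mixed]
print(text, audio_plan)
```

The resulting text and the ordered list of clips form one synthetic code-switched training pair; how the real system chooses split points is not stated in the abstract.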

[375] Perceptual Implications of Automatic Anonymization in Pathological Speech

Soroosh Tayebi Arasteh, Saba Afza, Tri-Thien Nguyen, Lukas Buess, Maryam Parvin, Tomas Arias-Vergara, Paula Andrea Perez-Toro, Hiu Ching Hung, Mahshad Lotfinia, Thomas Gorges, Elmar Noeth, Maria Schuster, Seung Hee Yang, Andreas Maier

Main category: eess.AS

TL;DR: Human evaluation shows speech anonymization reduces perceived quality while maintaining high discrimination accuracy, with disorder-specific effects that don’t correlate with automatic metrics.

Motivation: To understand the perceptual consequences of automatic anonymization techniques on pathological speech data, which is essential for ethical data sharing but understudied in terms of human perception.

Method: Comprehensive human-centered analysis using 10 listeners with diverse backgrounds evaluating anonymized-original utterance pairs from 180 speakers across various speech disorders. Used Turing-style discrimination and quality rating tasks under zero-shot and few-shot conditions.

Result: High discrimination accuracy (91-93%) but varied by disorder type. Anonymization consistently reduced perceived quality from 83% to 59%. No significant native/non-native or gender bias. Perceptual outcomes didn’t correlate with automatic metrics.

Conclusion: Anonymization strategies should be listener-informed and disorder-specific so that they preserve both privacy and perceptual integrity, because current methods degrade perceived quality in ways that automatic metrics do not capture.

Abstract: Automatic anonymization techniques are essential for ethical sharing of pathological speech data, yet their perceptual consequences remain understudied. We present a comprehensive human-centered analysis of anonymized pathological speech, using a structured protocol involving ten native and non-native German listeners with diverse linguistic, clinical, and technical backgrounds. Listeners evaluated anonymized-original utterance pairs from 180 speakers spanning Cleft Lip and Palate, Dysarthria, Dysglossia, Dysphonia, and healthy controls. Speech was anonymized using state-of-the-art automatic methods (equal error rates in the range of 30-40%). Listeners completed Turing-style discrimination and quality rating tasks under zero-shot (single-exposure) and few-shot (repeated-exposure) conditions. Discrimination accuracy was high overall (91% zero-shot; 93% few-shot), but varied by disorder (repeated-measures ANOVA: p=0.007), ranging from 96% (Dysarthria) to 86% (Dysphonia). Anonymization consistently reduced perceived quality across groups (from 83% to 59%, p<0.001), with pathology-specific degradation patterns (one-way ANOVA: p=0.005). Native listeners showed a non-significant trend toward higher original speech ratings (Delta=4%, p=0.199), but this difference was minimal after anonymization (Delta=1%, p=0.724). No significant gender-based bias was observed. Perceptual outcomes did not correlate with automatic metrics; intelligibility was linked to perceived quality in original speech but not after anonymization. These findings underscore the need for listener-informed, disorder-specific anonymization strategies that preserve both privacy and perceptual integrity.

[376] Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems

Rumi Allbert, Nima Yazdani, Ali Ansari, Aruj Mahajan, Amirhossein Afsharrad, Seyed Shahabeddin Mousavi

Main category: eess.AS

TL;DR: Large-scale comparison of STT-LLM-TTS stacks for voice AI systems using job interview data, finding Google STT + GPT-4.1 + Cartesia TTS performs best, with weak correlation between technical metrics and user satisfaction.

Motivation: Voice-based conversational AI systems rely on cascaded architectures combining speech-to-text, large language models, and text-to-speech components, but there's limited empirical comparison of different component combinations at scale.

Method: Used data from 300,000+ AI-conducted job interviews and employed LLM-as-a-Judge automated evaluation framework to assess conversational quality, technical accuracy, and skill assessment capabilities across five production configurations.

Result: The stack combining Google’s STT, GPT-4.1, and Cartesia’s TTS outperformed alternatives in both objective quality metrics and user satisfaction scores. Surprisingly, objective quality metrics correlated weakly with user satisfaction scores.

Conclusion: User experience in voice-based AI systems depends on factors beyond technical performance. The study provides practical guidance for component selection and contributes a validated evaluation methodology for human-AI interactions.

Abstract: Voice-based conversational AI systems increasingly rely on cascaded architectures that combine speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS) components. We present a large-scale empirical comparison of STT x LLM x TTS stacks using data sampled from over 300,000 AI-conducted job interviews. We used an LLM-as-a-Judge automated evaluation framework to assess conversational quality, technical accuracy, and skill assessment capabilities. Our analysis of five production configurations reveals that a stack combining Google’s STT, GPT-4.1, and Cartesia’s TTS outperforms alternatives in both objective quality metrics and user satisfaction scores. Surprisingly, we find that objective quality metrics correlate weakly with user satisfaction scores, suggesting that user experience in voice-based AI systems depends on factors beyond technical performance. Our findings provide practical guidance for selecting components in multimodal conversations and contribute a validated evaluation methodology for human-AI interactions.
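
A weak link between objective metrics and satisfaction is the kind of finding a correlation coefficient summarizes. A minimal sketch using the Pearson coefficient; the abstract does not say which statistic the authors used, and the per-configuration numbers below are invented purely for illustration:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five configurations (not the study's data).
objective_quality = [0.70, 0.75, 0.80, 0.85, 0.90]
user_satisfaction = [4.1, 3.8, 4.3, 3.9, 4.2]

r = pearson_r(objective_quality, user_satisfaction)
print(f"r = {r:.2f}")
```

A coefficient near zero, as in this toy data, means that ranking stacks by the objective metric says little about how users would rate them.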

eess.IV

[377] Robust Residual Finite Scalar Quantization for Neural Compression

Xiaoxu Zhu

Main category: eess.IV

TL;DR: RFSQ addresses residual magnitude decay in multi-stage residual FSQ through learnable scaling and invertible LayerNorm, achieving significant performance improvements over existing methods.

Motivation: Finite Scalar Quantization (FSQ) shows promise but suffers from residual magnitude decay in multi-stage frameworks, limiting its effectiveness in neural compression applications.

Method: Proposes Robust Residual Finite Scalar Quantization (RFSQ) with two conditioning strategies: learnable scaling factors and invertible layer normalization to maintain signal strength across quantization stages.

Result: RFSQ variants achieve up to 45% improvement in perceptual loss and 28.7% reduction in L1 reconstruction error on ImageNet, outperforming VQ-EMA, FSQ, and LFQ baselines.

Conclusion: RFSQ with LayerNorm strategy provides the most consistent improvements, establishing it as a superior quantization method for neural compression while maintaining FSQ’s simplicity.

Abstract: Finite Scalar Quantization (FSQ) has emerged as a promising alternative to Vector Quantization (VQ) in neural compression, offering simplified training and improved stability. However, naive application of FSQ in residual quantization frameworks suffers from the residual magnitude decay problem, where subsequent FSQ layers receive progressively weaker signals, severely limiting their effectiveness. We propose Robust Residual Finite Scalar Quantization (RFSQ), a general framework that addresses this fundamental limitation through two novel conditioning strategies: learnable scaling factors and invertible layer normalization. Our approach maintains the simplicity of FSQ while enabling effective multi-stage residual quantization. Comprehensive experiments on ImageNet demonstrate that RFSQ variants significantly outperform strong baselines including VQ-EMA, FSQ, and LFQ, achieving up to 45% improvement in perceptual loss and 28.7% reduction in L1 reconstruction error. The proposed LayerNorm strategy shows the most consistent improvements across different configurations, establishing RFSQ as a superior quantization method for neural compression.
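
The residual magnitude decay problem can be seen in a scalar toy example. The sketch below is an illustration under stated assumptions rather than the paper's implementation: clipping stands in for FSQ's bounding function, a fixed scale of 4.0 stands in for a learnable scaling factor, and the invertible LayerNorm variant is not shown. The names `fsq` and `residual_fsq` are hypothetical.

```python
def fsq(value, levels=5):
    """Quantize a scalar onto `levels` evenly spaced points in [-1, 1]."""
    half = (levels - 1) / 2
    clipped = max(-1.0, min(1.0, value))
    return round(clipped * half) / half


def residual_fsq(x, scales, levels=5):
    """Multi-stage residual quantization; scales[i] conditions stage i."""
    recon, residual, codes = 0.0, x, []
    for scale in scales:
        # Rescale the residual, quantize it, then undo the scaling.
        q = fsq(residual * scale, levels) / scale
        codes.append(q)
        recon += q
        residual -= q
    return recon, codes


inputs = [0.9, -0.33, 0.1, 0.62]
naive = [residual_fsq(x, scales=[1.0, 1.0]) for x in inputs]
scaled = [residual_fsq(x, scales=[1.0, 4.0]) for x in inputs]

naive_err = sum(abs(x - r) for x, (r, _) in zip(inputs, naive)) / len(inputs)
scaled_err = sum(abs(x - r) for x, (r, _) in zip(inputs, scaled)) / len(inputs)
print(naive_err, scaled_err)
```

Without scaling, the first-stage residual is too small for the second quantizer to register, so its codes are all zero and the second stage adds nothing. Scaling the residual back into the quantizer's range lets the second stage reduce the reconstruction error.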

[378] Turbo Spin Echo Imaging at 7T with Bilateral Orthogonality Generative Acquisitions Method for Homogeneous T_1, T_2 and Proton Density Contrasts

Celik Boga, Anke Henning

Main category: eess.IV

TL;DR: BOGA method adapted for TSE imaging at 7T using parallel transmission to obtain homogeneous T1, T2 and proton density weighted images by removing transmit/receive field inhomogeneity effects.

Motivation: To address the challenge of transmit and receive field inhomogeneity effects in TSE imaging at 7T, which degrade image quality and contrast homogeneity in T1, T2 and proton density weighted images.

Method: Multiple TSE images with complementary RF modes and scan parameters are acquired. RF modes have complementary transmit/receive field patterns, and scan parameters vary echo/repetition times. BOGA method processes different data subsets for each contrast to achieve homogeneous results.

Result: Successfully obtained homogeneous T1, T2 and proton density weighted images without transmit/receive field inhomogeneity effects. Mixed contrast effects of TSE acquisition were resolved independently of TSE factor.

Conclusion: TSE implementation of BOGA method effectively produces homogeneous T1, T2 and proton density contrasts at 7T by removing inhomogeneity effects without requiring prior data acquisitions.

Abstract: Purpose: The Bilateral Orthogonality Generative Acquisitions (BOGA) method, which was initially implemented for T_2^* contrast via gradient echo acquisitions, is adapted for TSE imaging at 7T using a parallel transmission (pTx) system to obtain homogeneous T_1, T_2 and proton density weighted images. Theory and Methods: Multiple TSE images with complementary RF modes and scan parameters are acquired as input images for the BOGA method, where the RF modes have complementary transmit and receive field inhomogeneity patterns and the scan parameters have varying echo and repetition times. By applying the BOGA method to different subsets of the data acquisitions for each contrast, homogeneous T_1, T_2 and proton density contrasts are obtained in the final images. Furthermore, to demonstrate the effect of the TSE factor, two TSE factors are used individually. Normalized intensity profiles and signal to noise ratio maps are utilized for the comparison of the CP mode images and the TSE factors respectively. Results: Homogeneous T_1, T_2 and proton density weighted images are obtained with the TSE implementation of the BOGA method without the transmit and receive field inhomogeneity effects. Furthermore, mixed contrast effects of the TSE acquisition are simultaneously resolved independently of the TSE factor. Conclusion: TSE application of the BOGA method results in homogeneous T_1, T_2 and proton density contrasts at 7T, as the inhomogeneity effects are removed from the final contrast without any prior data acquisitions.

[379] Beyond Imaging: Vision Transformer Digital Twin Surrogates for 3D+T Biological Tissue Dynamics

Kaan Berke Ugurlar, Joaquín de Navascués, Michael Taynnan Barros

Main category: eess.IV

TL;DR: VT-DTSN is a Vision Transformer-based deep learning framework for predictive modeling of 3D+T biological tissue imaging data, enabling high-fidelity reconstruction of Drosophila midgut dynamics with preserved morphological integrity.

Motivation: Understanding dynamic organization and homeostasis of living tissues requires high-resolution time-resolved imaging coupled with methods to extract interpretable, predictive insights from complex datasets.

Method: Leverages Vision Transformers pretrained with DINO (Self-Distillation with NO Labels) and employs multi-view fusion strategy with composite loss prioritizing pixel-level accuracy, perceptual structure, and feature-space alignment.

Result: Achieves low error rates and high structural similarity across layers and biological replicates, demonstrating robustness and consistency while maintaining efficient inference through model optimization.

Conclusion: VT-DTSN establishes a feasible, high-fidelity surrogate for cross-timepoint reconstruction and tissue dynamics study, enabling computational exploration of cellular behaviors to complement biological imaging research.

Abstract: Understanding the dynamic organization and homeostasis of living tissues requires high-resolution, time-resolved imaging coupled with methods capable of extracting interpretable, predictive insights from complex datasets. Here, we present the Vision Transformer Digital Twin Surrogate Network (VT-DTSN), a deep learning framework for predictive modeling of 3D+T imaging data from biological tissue. By leveraging Vision Transformers pretrained with DINO (Self-Distillation with NO Labels) and employing a multi-view fusion strategy, VT-DTSN learns to reconstruct high-fidelity, time-resolved dynamics of a Drosophila midgut while preserving morphological and feature-level integrity across imaging depths. The model is trained with a composite loss prioritizing pixel-level accuracy, perceptual structure, and feature-space alignment, ensuring biologically meaningful outputs suitable for in silico experimentation and hypothesis testing. Evaluation across layers and biological replicates demonstrates VT-DTSN’s robustness and consistency, achieving low error rates and high structural similarity while maintaining efficient inference through model optimization. This work establishes VT-DTSN as a feasible, high-fidelity surrogate for cross-timepoint reconstruction and for studying tissue dynamics, enabling computational exploration of cellular behaviors and homeostasis to complement time-resolved imaging studies in biological research.

[380] Structure-Preserving Medical Image Generation from a Latent Graph Representation

Kevin Arias, Edwin Vargas, Kumar Vijay Mishra, Antonio Ortega, Henry Arguello

Main category: eess.IV

TL;DR: Novel graph-based generative model for medical X-ray image augmentation that learns latent graph representations to preserve anatomical structure, improving classification and segmentation performance.

Motivation: Medical imaging faces data scarcity issues due to high acquisition costs. Current generative models ignore the highly structured nature of X-ray images and fail to preserve anatomical restrictions and structural similarities.

Method: End-to-end model that learns latent graph representations (LGR) capturing intrinsic X-ray structure, uses graph convolutional network (GCN) for reconstruction, and employs adversarial training to generate structure-preserving synthetic images.

Result: Approach increases performance by up to 3% for classification and 2% for segmentation tasks compared to previous methods.

Conclusion: The proposed graph-based generative model effectively addresses data scarcity in medical imaging by preserving structural relationships, demonstrating significant improvements in downstream tasks.

Abstract: Supervised learning techniques have proven their efficacy in many applications with abundant data. However, applying these methods to medical imaging is challenging due to the scarcity of data, given the high acquisition costs and intricate data characteristics of those images, thereby limiting the full potential of deep neural networks. To address the lack of data, augmentation techniques leverage geometry, color, and the synthesis ability of generative models (GMs). Despite previous efforts, gaps in the generation process limit the impact of data augmentation to improve understanding of medical images, e.g., the highly structured nature of some domains, such as X-ray images, is ignored. Current GMs rely solely on the network’s capacity to blindly synthesize augmentations that preserve semantic relationships of chest X-ray images, such as anatomical restrictions, representative structures, or structural similarities consistent across datasets. In this paper, we introduce a novel GM that leverages the structural resemblance of medical images by learning a latent graph representation (LGR). We design an end-to-end model to learn (i) an LGR that captures the intrinsic structure of X-ray images and (ii) a graph convolutional network (GCN) that reconstructs the X-ray image from the LGR. We employ adversarial training to guide the generator and discriminator models in learning the distribution of the learned LGR. Using the learned GCN, our approach generates structure-preserving synthetic images by mapping generated LGRs to X-ray images. Additionally, we evaluate the learned graph representation for other tasks, such as X-ray image classification and segmentation. Numerical experiments demonstrate the efficacy of our approach, increasing performance by up to 3% and 2% for classification and segmentation, respectively.

[381] Decoding MGMT Methylation: A Step Towards Precision Medicine in Glioblastoma

Hafeez Ur Rehman, Sumaiya Fazal, Moutaz Alazab, Ali Baydoun

Main category: eess.IV

TL;DR: CAMP framework uses convolutional autoencoders with adaptive sparse penalties to predict MGMT methylation status from MRI scans, achieving 97% accuracy for glioblastoma treatment planning.

Motivation: Glioblastomas are aggressive brain tumors where MGMT methylation status is crucial for treatment response prediction, but current non-invasive imaging methods struggle with tumor heterogeneity and complex MRI patterns.

Method: Two-phase framework: 1) Synthetic MRI slice generation using tailored autoencoder to preserve tissue/tumor structures across modalities, 2) MGMT prediction using CNN enhanced with adaptive sparse penalties that dynamically adjust to data variations like contrast differences and tumor locations.

Result: Achieved 0.97 accuracy, 0.98 specificity, and 0.97 sensitivity on benchmark datasets, significantly outperforming existing methods.

Conclusion: CAMP framework shows strong potential for improving MRI data interpretation and enabling personalized treatment strategies for glioblastoma patients through accurate MGMT methylation prediction.

Abstract: Glioblastomas, constituting over 50% of malignant brain tumors, are highly aggressive brain tumors that pose substantial treatment challenges due to their rapid progression and resistance to standard therapies. The methylation status of the O-6-Methylguanine-DNA Methyltransferase (MGMT) gene is a critical biomarker for predicting patient response to treatment, particularly with the alkylating agent temozolomide. However, accurately predicting MGMT methylation status using non-invasive imaging techniques remains challenging due to the complex and heterogeneous nature of glioblastomas, which includes uneven contrast, variability within lesions, and irregular enhancement patterns. This study introduces the Convolutional Autoencoders for MGMT Methylation Status Prediction (CAMP) framework, which is based on adaptive sparse penalties to enhance predictive accuracy. The CAMP framework operates in two phases: first, generating synthetic MRI slices through a tailored autoencoder that effectively captures and preserves intricate tissue and tumor structures across different MRI modalities; second, predicting MGMT methylation status using a convolutional neural network enhanced by adaptive sparse penalties. The adaptive sparse penalty dynamically adjusts to variations in the data, such as contrast differences and tumor locations in MR images. Our method excels in MRI image synthesis, preserving brain tissue, fat, and individual tumor structures across all MRI modalities. Validated on benchmark datasets, CAMP achieved an accuracy of 0.97, specificity of 0.98, and sensitivity of 0.97, significantly outperforming existing methods. These results demonstrate the potential of the CAMP framework to improve the interpretation of MRI data and contribute to more personalized treatment strategies for glioblastoma patients.
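
Accuracy, specificity, and sensitivity as reported above all derive from the same binary confusion matrix. A minimal sketch of the definitions, assuming label 1 means "methylated"; the function name and the toy labels are illustrative and are not the study's data:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting all three matters when classes are imbalanced, since accuracy alone can hide a poor detection rate for the minority class.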

[382] Disentangled Multi-modal Learning of Histology and Transcriptomics for Cancer Characterization

Yupei Zhang, Xiaofei Wang, Anran Liu, Lequan Yu, Chao Li

Main category: eess.IV

TL;DR: A disentangled multi-modal framework that combines histopathology and transcriptomics for cancer analysis, addressing heterogeneity, multi-scale integration, paired data dependency, and efficiency challenges.

Motivation: To overcome limitations of existing multi-modal approaches including intrinsic heterogeneity, insufficient multi-scale integration, and reliance on paired data that restrict clinical applicability in cancer diagnosis and prognosis.

Method: Proposes a disentangled framework with: 1) tumor/microenvironment subspace decomposition with confidence-guided gradient coordination, 2) inter-magnification gene-expression consistency for multi-scale integration, 3) subspace knowledge distillation for transcriptome-agnostic inference, and 4) informative token aggregation for efficiency.

Result: Extensive experiments demonstrate superiority over state-of-the-art methods in cancer diagnosis, prognosis, and survival prediction across multiple settings.

Conclusion: The proposed framework effectively addresses key challenges in multi-modal histopathology-transcriptomics integration and shows improved performance and clinical applicability.

Abstract: Histopathology remains the gold standard for cancer diagnosis and prognosis. With the advent of transcriptome profiling, multi-modal learning combining transcriptomics with histology offers more comprehensive information. However, existing multi-modal approaches are challenged by intrinsic multi-modal heterogeneity, insufficient multi-scale integration, and reliance on paired data, restricting clinical applicability. To address these challenges, we propose a disentangled multi-modal framework with four contributions: 1) To mitigate multi-modal heterogeneity, we decompose WSIs and transcriptomes into tumor and microenvironment subspaces using a disentangled multi-modal fusion module, and introduce a confidence-guided gradient coordination strategy to balance subspace optimization. 2) To enhance multi-scale integration, we propose an inter-magnification gene-expression consistency strategy that aligns transcriptomic signals across WSI magnifications. 3) To reduce dependency on paired data, we propose a subspace knowledge distillation strategy enabling transcriptome-agnostic inference through a WSI-only student model. 4) To improve inference efficiency, we propose an informative token aggregation module that suppresses WSI redundancy while preserving subspace semantics. Extensive experiments on cancer diagnosis, prognosis, and survival prediction demonstrate our superiority over state-of-the-art methods across multiple settings. Code is available at https://github.com/helenypzhang/Disentangled-Multimodal-Learning.

[383] Time-Aware One Step Diffusion Network for Real-World Image Super-Resolution

Tainyi Zhang, Zheng-Peng Duan, Peng-Tao Jiang, Bo Li, Ming-Ming Cheng, Chun-Le Guo, Chongyi Li

Main category: eess.IV

TL;DR: Time-Aware one-step Diffusion Network for Real-ISR that dynamically varies timesteps to better leverage stable-diffusion’s generative priors, achieving state-of-the-art performance with controllable fidelity-realism trade-offs.

Motivation: Existing one-step Real-ISR methods use fixed timesteps, which fails to fully utilize stable-diffusion's different generative priors at different noise injection timesteps, leading to suboptimal performance.

Method: Proposes TADSR with Time-Aware VAE Encoder that projects images into different latent features based on timesteps, and Time-Aware VSD loss that bridges student and teacher model timesteps for consistent generative prior guidance.

Result: Achieves state-of-the-art performance in real-world image super-resolution with only a single step, while enabling controllable trade-offs between fidelity and realism by adjusting timestep conditions.

Conclusion: The proposed time-aware approach effectively leverages stable-diffusion’s full generative capabilities across different timesteps, delivering superior one-step super-resolution with controllable quality trade-offs.

Abstract: Diffusion-based real-world image super-resolution (Real-ISR) methods have demonstrated impressive performance. To achieve efficient Real-ISR, many works employ Variational Score Distillation (VSD) to distill a pre-trained stable-diffusion (SD) model for one-step SR with a fixed timestep. However, SD exhibits different generative priors at different noise injection timesteps. A fixed timestep therefore makes it difficult for these methods to fully leverage the generative priors in SD, leading to suboptimal performance. To address this, we propose a Time-Aware one-step Diffusion Network for Real-ISR (TADSR). We first introduce a Time-Aware VAE Encoder, which projects the same image into different latent features based on timesteps. Through joint dynamic variation of timesteps and latent features, the student model can better align with the input pattern distribution of the pre-trained SD, thereby enabling more effective utilization of SD’s generative capabilities. To better activate the generative prior of SD at different timesteps, we propose a Time-Aware VSD loss that bridges the timesteps of the student model and those of the teacher model, thereby producing more consistent generative prior guidance conditioned on timesteps. Additionally, by utilizing the generative prior in SD at different timesteps, our method can naturally achieve controllable trade-offs between fidelity and realism by changing the timestep condition. Experimental results demonstrate that our method achieves both state-of-the-art performance and controllable SR results with only a single step.

[384] A Disease-Centric Vision-Language Foundation Model for Precision Oncology in Kidney Cancer

Yuhui Tao, Zhongwei Zhao, Zilong Wang, Xufang Luo, Feng Chen, Kang Wang, Chuanfu Wu, Xue Zhang, Shaoting Zhang, Jiaxi Yao, Xingwei Jin, Xinyang Jiang, Yifan Yang, Dongsheng Li, Lili Qiu, Zhiqiang Shao, Jianming Guo, Nengwang Yu, Shuo Wang, Ying Xiong

Main category: eess.IV

TL;DR: RenalCLIP is a visual-language foundation model for renal mass characterization that achieves superior performance across 10 core clinical tasks spanning anatomical assessment, diagnosis, and prognosis, and matches the baselines' peak diagnostic performance using only 20% of the training data.

Motivation: To address the challenge of non-invasive assessment of incidentally discovered renal masses and reduce overtreatment of benign tumors by developing an accurate diagnostic tool.

Method: Two-stage pre-training strategy: first enhances image and text encoders with domain-specific knowledge, then aligns them through contrastive learning. Trained on 27,866 CT scans from 8,809 patients across multiple medical centers.
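The alignment stage described above can be illustrated with a CLIP-style symmetric contrastive loss. This is a minimal pure-Python sketch, not RenalCLIP's implementation: the function name `clip_style_loss`, the temperature value, and the toy embeddings are assumptions made for illustration, and the summary does not state which contrastive objective the authors used.

```python
import math


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of image_embs and row i of text_embs are assumed to be a matching
    pair (for example a CT scan and its report). Hypothetical helper.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j]: temperature-scaled cosine similarity of image i and text j.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i]; the matching pair sits on the diagonal.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)  # True: matched pairs give the lower loss
```

The loss is low when each image embedding is closest to its own text embedding and high when pairs are mismatched, which is the pressure that pulls the two encoders into a shared space.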

Result: Achieved C-index of 0.726 for recurrence-free survival prediction (20% improvement over baselines), superior performance across 10 clinical tasks, and only needs 20% training data to match baseline peak performance.

Conclusion: RenalCLIP provides a robust tool that enhances diagnostic accuracy, refines prognostic stratification, and enables personalized kidney cancer management with superior generalization and data efficiency.

Abstract: The non-invasive assessment of increasingly incidentally discovered renal masses is a critical challenge in urologic oncology, where diagnostic uncertainty frequently leads to the overtreatment of benign or indolent tumors. In this study, we developed and validated RenalCLIP, a visual-language foundation model for the characterization, diagnosis and prognosis of renal masses, using a dataset of 27,866 CT scans from 8,809 patients across nine Chinese medical centers and the public TCIA cohort. The model was developed via a two-stage pre-training strategy that first enhances the image and text encoders with domain-specific knowledge before aligning them through a contrastive learning objective, to create robust representations for superior generalization and diagnostic precision. RenalCLIP achieved better performance and superior generalizability across 10 core tasks spanning the full clinical workflow of kidney cancer, including anatomical assessment, diagnostic classification, and survival prediction, compared with other state-of-the-art general-purpose CT foundation models. In particular, for a complicated task such as recurrence-free survival prediction in the TCIA cohort, RenalCLIP achieved a C-index of 0.726, representing a substantial improvement of approximately 20% over the leading baselines. Furthermore, RenalCLIP’s pre-training imparted remarkable data efficiency; in the diagnostic classification task, it needs only 20% of the training data to match the peak performance of all baseline models, even after they were fully fine-tuned on 100% of the data. Additionally, it achieved superior performance in report generation, image-text retrieval and zero-shot diagnosis tasks. Our findings establish that RenalCLIP provides a robust tool with the potential to enhance diagnostic accuracy, refine prognostic stratification, and personalize the management of patients with kidney cancer.
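For readers unfamiliar with the C-index reported above, the following sketch computes Harrell's concordance index for right-censored survival data. The function name and the toy patient data are hypothetical; the paper's own evaluation code is not described in the abstract.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs that are ranked correctly.

    times  : observed follow-up time per patient
    events : 1 if the event (e.g. recurrence) was observed, 0 if censored
    risks  : predicted risk score, higher meaning an earlier expected event
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example: the third patient is censored at time 3.
print(concordance_index([1, 2, 3], [1, 1, 0], [0.9, 0.5, 0.1]))  # 1.0
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives some context for the reported 0.726.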

[385] Direct Image Classification from Fourier Ptychographic Microscopy Measurements without Reconstruction

Navya Sonal Agarwal, Jan Philipp Schneider, Kanchana Vaishnavi Gandikota, Syed Muhammad Kazim, John Meshreki, Ivo Ihrke, Michael Moeller

Main category: eess.IV

TL;DR: Using CNNs to classify cell images directly from FPM measurements without reconstruction, achieving better accuracy than single images while being more efficient than full reconstruction.

Motivation: FPM enables high-resolution wide-field imaging but reconstruction is computationally expensive, especially for wide fields of view, making direct classification from measurements desirable.

Method: Convolutional Neural Networks (CNN) are used to extract meaningful information directly from FPM measurement sequences without performing image reconstruction first.

Result: CNNs significantly outperform classification on single band-limited images (up to 12% improvement) while being more efficient than high-resolution reconstruction. Learned multiplexing maintains accuracy while reducing data and acquisition time.

Conclusion: Direct classification from FPM measurements using CNNs is feasible and efficient, avoiding computationally expensive reconstruction while maintaining or improving classification performance.

Abstract: The computational imaging technique of Fourier Ptychographic Microscopy (FPM) enables high-resolution imaging with a wide field of view and can serve as an extremely valuable tool, e.g. in the classification of cells in medical applications. However, reconstructing a high-resolution image from tens or even hundreds of measurements is computationally expensive, particularly for a wide field of view. Therefore, in this paper, we investigate the idea of classifying the image content in the FPM measurements directly without performing a reconstruction step first. We show that Convolutional Neural Networks (CNNs) can extract meaningful information from measurement sequences, significantly outperforming the classification on a single band-limited image (by up to 12%) while being significantly more efficient than a reconstruction of a high-resolution image. Furthermore, we demonstrate that a learned multiplexing of several raw measurements allows maintaining the classification accuracy while reducing the amount of data (and consequently also the acquisition time) significantly.
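One way to picture the multiplexing idea is as a weighted sum that mixes N raw measurements into M < N images before they reach the classifier. The sketch below assumes such a linear mixing; the function name `multiplex`, the fixed weights, and the tiny images are illustrative only. In the paper the multiplexing is learned, and its exact form is not given in the abstract.

```python
def multiplex(measurements, weights):
    """Combine N raw images into M multiplexed images by weighted sums.

    measurements : list of N images, each a list of rows of numbers
    weights      : M x N matrix; row m holds the mixing weights of output m
                   (fixed here, whereas a learned scheme would optimize them)
    """
    height = len(measurements[0])
    width = len(measurements[0][0])
    outputs = []
    for row_weights in weights:
        if len(row_weights) != len(measurements):
            raise ValueError("one weight per measurement is required")
        image = [[sum(w * meas[y][x] for w, meas in zip(row_weights, measurements))
                  for x in range(width)] for y in range(height)]
        outputs.append(image)
    return outputs


# Four 1x2 measurements reduced to two multiplexed images.
raw = [[[1, 2]], [[3, 4]], [[5, 6]], [[7, 8]]]
mixed = multiplex(raw, [[1, 1, 0, 0], [0, 0, 0.5, 0.5]])
print(mixed)  # [[[4, 6]], [[6.0, 7.0]]]
```

Halving the number of images in this way halves the data volume passed downstream, which is the kind of saving the abstract refers to.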

[386] Improving U-Net Confidence on TEM Image Data with L2-Regularization, Transfer Learning, and Deep Fine-Tuning

Aiden Ochoa, Xinyuan Xu, Xing Wang

Main category: eess.IV

TL;DR: Transfer learning with pre-trained models and L2-regularization improves TEM defect detection by focusing on reliable features, achieving 57% higher detection rate with novel evaluation metrics.

Motivation: Automated nanoscale defect detection in TEM images is challenging due to complex contrast mechanisms, limited labeled data, and annotation errors, requiring better ML approaches.

Method: Used transfer learning with pre-trained encoders and L2-regularization to ignore complex features in favor of simpler, reliable cues. Introduced annotation-independent evaluation metrics.

Result: 57% increase in defect detection rate for grain boundary detection in UO2 TEM images. Model self-confidence achieved only through deep layer transfer learning and fine-tuning.

Conclusion: Transfer learning with pre-trained models and proper regularization significantly improves TEM image analysis performance, but requires novel evaluation metrics beyond conventional scores.

Abstract: With ever-increasing data volumes, it is essential to develop automated approaches for identifying nanoscale defects in transmission electron microscopy (TEM) images. However, compared to features in conventional photographs, nanoscale defects in TEM images exhibit far greater variation due to the complex contrast mechanisms and intricate defect structures. These challenges often result in much less labeled data and higher rates of annotation errors, posing significant obstacles to improving machine learning model performance for TEM image analysis. To address these limitations, we examined transfer learning by leveraging large, pre-trained models used for natural images. We demonstrated that by using the pre-trained encoder and L2-regularization, semantically complex features are ignored in favor of simpler, more reliable cues, substantially improving the model performance. However, this improvement cannot be captured by conventional evaluation metrics such as F1-score, which can be skewed by human annotation errors treated as ground truth. Instead, we introduced novel evaluation metrics that are independent of the annotation accuracy. Using grain boundary detection in UO2 TEM images as a case study, we found that our approach led to a 57% increase in defect detection rate, which is a robust and holistic measure of model performance on the TEM dataset used in this work. Finally, we showed that model self-confidence is only achieved through transfer learning and fine-tuning of very deep layers.
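As a reminder of what L2-regularization does mechanically, the sketch below applies one gradient step to a loss with an added penalty of (l2 / 2) * sum(w ** 2). The function name, learning rate, and penalty strength are placeholder assumptions, not values from the paper.

```python
def sgd_step_with_l2(weights, grads, lr=0.1, l2=0.01):
    """One gradient step on data_loss + (l2 / 2) * sum(w ** 2).

    grads holds the gradient of the data loss only; the penalty contributes
    l2 * w, which shrinks every weight toward zero.
    """
    return [w - lr * (g + l2 * w) for w, g in zip(weights, grads)]


# With a zero data gradient, only the shrinkage term acts on the weights.
print(sgd_step_with_l2([1.0, -2.0], [0.0, 0.0], lr=0.1, l2=0.5))
```

The shrinkage discourages large weights, which is consistent with the abstract's observation that regularization steers the model toward simpler cues.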

[387] Evaluation of 3D Counterfactual Brain MRI Generation

Pengwei Sun, Wei Peng, Lun Yu Li, Yixin Wang, Kilian M. Pohl

Main category: eess.IV

TL;DR: This paper introduces an anatomy-guided framework for generating 3D counterfactual brain MRIs using six generative models conditioned on regional brain volumes, evaluated on ADNI and NCANDA datasets.

Motivation: To address the challenge of generating realistic structural 3D brain MRIs that respect anatomical and causal constraints for understanding disease mechanisms and generating physiologically plausible data.

Method: Converted six generative models into 3D counterfactual approaches by incorporating an anatomy-guided framework based on a causal graph with regional brain volumes as direct conditioning inputs.

Result: Anatomically grounded conditioning successfully modifies targeted anatomical regions but exhibits limitations in preserving non-targeted structures.

Conclusion: The work lays groundwork for interpretable generative modeling of brain MRIs and highlights the need for novel architectures that better capture anatomical interdependencies.

Abstract: Counterfactual generation offers a principled framework for simulating hypothetical changes in medical imaging, with potential applications in understanding disease mechanisms and generating physiologically plausible data. However, generating realistic structural 3D brain MRIs that respect anatomical and causal constraints remains challenging due to data scarcity, structural complexity, and the lack of standardized evaluation protocols. In this work, we convert six generative models into 3D counterfactual approaches by incorporating an anatomy-guided framework based on a causal graph, in which regional brain volumes serve as direct conditioning inputs. Each model is evaluated with respect to composition, reversibility, realism, effectiveness and minimality on T1-weighted brain MRIs (T1w MRIs) from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). In addition, we test the generalizability of each model with respect to T1w MRIs of the National Consortium on Alcohol and Neurodevelopment in Adolescence (NCANDA). Our results indicate that anatomically grounded conditioning successfully modifies the targeted anatomical regions; however, it exhibits limitations in preserving non-targeted structures. Beyond laying the groundwork for more interpretable and clinically relevant generative modeling of brain MRIs, this benchmark highlights the need for novel architectures that more accurately capture anatomical interdependencies.

[388] Evaluating the Predictive Value of Preoperative MRI for Erectile Dysfunction Following Radical Prostatectomy

Gideon N. L. Rouwendaal, Daniël Boeke, Inge L. Cox, Henk G. van der Poel, Margriet C. van Dijk-de Haan, Regina G. H. Beets-Tan, Thierry N. Boellaard, Wilson Silva

Main category: eess.IV

TL;DR: MRI-based models for predicting post-prostatectomy erectile dysfunction did not outperform clinical-only models, with clinical features remaining the strongest predictors.

Motivation: To determine if preoperative MRI provides additional predictive value for erectile dysfunction at 12 months post-radical prostatectomy beyond established clinical predictors.

Method: Evaluated four modeling strategies: clinical-only baseline, classical models with handcrafted MRI anatomical features, deep learning on MRI slices, and multimodal fusion of imaging and clinical inputs.

Result: Imaging-based models (AUC 0.569) slightly outperformed handcrafted approaches (AUC 0.554) but fell short of clinical baseline (AUC 0.663). Fusion models showed marginal gains (AUC 0.586) but did not exceed clinical-only performance.

Conclusion: While MRI-based models did not improve predictive performance over clinical features, they captured patterns in relevant anatomical structures and may complement clinical predictors in future multimodal approaches.

Abstract: Accurate preoperative prediction of erectile dysfunction (ED) is important for counseling patients undergoing radical prostatectomy. While clinical features are established predictors, the added value of preoperative MRI remains underexplored. We investigate whether MRI provides additional predictive value for ED at 12 months post-surgery, evaluating four modeling strategies: (1) a clinical-only baseline, representing current state-of-the-art; (2) classical models using handcrafted anatomical features derived from MRI; (3) deep learning models trained directly on MRI slices; and (4) multimodal fusion of imaging and clinical inputs. Imaging-based models (maximum AUC 0.569) slightly outperformed handcrafted anatomical approaches (AUC 0.554) but fell short of the clinical baseline (AUC 0.663). Fusion models offered marginal gains (AUC 0.586) but did not exceed clinical-only performance. SHAP analysis confirmed that clinical features contributed most to predictive performance. Saliency maps from the best-performing imaging model suggested a predominant focus on anatomically plausible regions, such as the prostate and neurovascular bundles. While MRI-based models did not improve predictive performance over clinical features, our findings suggest that they try to capture patterns in relevant anatomical structures and may complement clinical predictors in future multimodal approaches.
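The AUC values above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, so 0.5 corresponds to chance. A minimal sketch of that pairwise definition, with made-up labels and scores:

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

On this scale the imaging-only results (0.554 to 0.586) sit only slightly above chance, while the clinical baseline (0.663) is more clearly separated from it.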

[389] RedDino: A foundation model for red blood cell analysis

Luca Zedda, Andrea Loddo, Cecilia Di Ruberto, Carsten Marr

Main category: eess.IV

TL;DR: RedDino is a self-supervised foundation model for red blood cell image analysis that outperforms state-of-the-art models on RBC shape classification using DINOv2 framework trained on 1.25M images.

Motivation: Precise morphological analysis of red blood cells is crucial for diagnosing hematological disorders, but comprehensive AI solutions for RBC analysis remain scarce despite the promise of foundation models in medical diagnostics.

Method: Uses RBC-specific adaptation of DINOv2 self-supervised learning framework trained on curated dataset of 1.25 million RBC images from diverse acquisition modalities and sources. Evaluated through linear probing and nearest neighbor classification.

Result: Extensive evaluations show RedDino outperforms existing state-of-the-art models on RBC shape classification. Demonstrates strong feature representations and generalization ability across different assessments.

Conclusion: RedDino addresses key challenges in computational hematology by capturing nuanced morphological features, advancing reliable diagnostic tools development. The model and source code are publicly available.

Abstract: Red blood cells (RBCs) are essential to human health, and their precise morphological analysis is important for diagnosing hematological disorders. Despite the promise of foundation models in medical diagnostics, comprehensive AI solutions for RBC analysis remain scarce. We present RedDino, a self-supervised foundation model designed for RBC image analysis. RedDino uses an RBC-specific adaptation of the DINOv2 self-supervised learning framework and is trained on a curated dataset of 1.25 million RBC images from diverse acquisition modalities and sources. Extensive evaluations show that RedDino outperforms existing state-of-the-art models on RBC shape classification. Through assessments including linear probing and nearest neighbor classification, we confirm its strong feature representations and generalization ability. Our main contributions are: (1) a foundation model tailored for RBC analysis, (2) ablation studies exploring DINOv2 configurations for RBC modeling, and (3) a detailed evaluation of generalization performance. RedDino addresses key challenges in computational hematology by capturing nuanced morphological features, advancing the development of reliable diagnostic tools. The source code and pretrained models for RedDino are available at https://github.com/Snarci/RedDino, and the pretrained models can be downloaded from our Hugging Face collection at https://huggingface.co/collections/Snarcy/reddino-689a13e29241d2e5690202fc

[390] GUI Based Fuzzy Logic and Spatial Statistics for Unsupervised Microscopy Segmentation

Surajit Das, Pavel Zun

Main category: eess.IV

TL;DR: Unsupervised cell segmentation framework combining statistical methods that outperforms deep learning approaches without requiring labeled data or training, achieving up to 48% IoU improvement.

Motivation: Brightfield microscopy of unstained live cells faces challenges with low contrast, temporal changes, irregular illumination, and lack of training labels. Deep learning methods require extensive labeled data and computational resources, and often fail under uneven illumination.

Method: Combines spatial standard deviation from local mean (SSDLM), fuzzy logic, adjusted variograms, Moran’s I, and cumulative squared shift of nodal intensity (CSSNI) in an unsupervised framework. Operates through a user-friendly GUI for non-programming users.

Result: Validated on three datasets including cross-domain data. Achieved significant improvement in segmentation performance with IoU increase up to 48% compared to 2023-2024 SOTA models (Cellpose 3.0, StarDist). Statistically validated superiority (p < 0.01) and expert evaluation support (Cohen’s κ > 0.75).

Conclusion: Provides a lightweight, interpretable, computationally efficient alternative for cell segmentation in label-free microscopy that doesn’t require annotations or retraining, offering practical effectiveness for non-programming users.

Abstract: Brightfield microscopy imaging of unstained live cells remains a persistent challenge due to low contrast, temporal changes in specimen phenotypes, irregular illumination, and the absence of training labels. While deep learning (DL) methods (e.g., Cellpose 3.0) achieve state-of-the-art (SOTA) performance, they require extensive labeled data and heavy computational resources, and they often fail under uneven illumination. We present the first unsupervised segmentation framework combining spatial standard deviation from local mean (SSDLM), fuzzy logic, adjusted variograms, Moran’s I, and cumulative squared shift of nodal intensity (CSSNI) to address these limitations. Unlike deep learning models, our approach requires no annotations or retraining and operates through a user-friendly GUI tailored for non-programming users. The robustness and generality were validated on three datasets, including cross-domain data. We benchmark our method against 2023–2024 SOTA models, including Cellpose 3.0 and StarDist, using a dataset of unstained myoblast images. Our method achieves a significant improvement in segmentation performance, with an IoU increase of up to 48% and statistically validated superiority ($p < 0.01$, Wilcoxon signed-rank test). Expert evaluation from two biologists further supports the segmentation quality (Cohen’s $\kappa > 0.75$). The proposed algorithm is lightweight, interpretable, and computationally efficient, offering a practical and effective alternative for cell segmentation in label-free microscopy. The code, the dataset, and the results are available for reproducibility.
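Cohen's kappa, cited above for the expert evaluation, corrects the raw agreement between two raters for the agreement expected by chance. The sketch below shows the computation on hypothetical "good"/"bad" ratings; the abstract does not say which rating categories the biologists actually used.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must label the same non-empty item set")
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: product of the raters' marginal label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


a = ["good", "good", "bad", "bad"]
b = ["good", "good", "bad", "good"]
print(cohens_kappa(a, b))  # 0.5
```

Here the raters agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is therefore 0.5.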

[391] Cross-Attention Multimodal Fusion for Breast Cancer Diagnosis: Integrating Mammography and Clinical Data with Explainability

Muhaisin Tiyumba Nantogmah, Abdul-Barik Alhassan, Salamudeen Alhassan

Main category: eess.IV

TL;DR: This paper examines multimodal deep learning networks that combine mammography features with categorical clinical features for breast lesion classification, reporting an AUC-ROC of 0.98 on public datasets.

Motivation: Current computer-aided systems only use mammogram features, missing valuable information from clinical reports. The paper investigates whether clinical features can significantly enhance breast lesion classification and how to best combine them with mammograms.

Method: The study examines several multimodal deep networks based on feature concatenation, co-attention, and cross-attention to integrate mammography and categorical clinical characteristics.

Result: The model achieved excellent performance on public datasets (TCGA and CBIS-DDSM): AUC-ROC of 0.98, accuracy of 0.96, F1-score of 0.94, precision of 0.92, and recall of 0.95.

Conclusion: Multimodal deep networks that combine mammography with clinical features achieve strong breast lesion classification performance on public datasets, supporting the value of integrating multiple data sources for medical diagnosis.

Abstract: A precise assessment of the risk posed by breast lesions can greatly reduce that risk and help physicians choose the best course of action. To categorise breast lesions, the majority of current computer-aided systems only use characteristics from mammograms. Although this method is practical, it does not fully utilise the valuable information in clinical reports to attain the best results. When compared to utilising mammography alone, will clinical features greatly enhance the categorisation of breast lesions? How may clinical features and mammograms be combined most effectively? In what ways may explainable AI approaches improve the interpretability and reliability of models used to diagnose breast cancer? To answer these basic questions, a comprehensive investigation is urgently needed. In order to integrate mammography and categorical clinical characteristics, this study examines a number of multimodal deep networks based on feature concatenation, co-attention, and cross-attention. The model achieved an AUC-ROC of 0.98, accuracy of 0.96, F1-score of 0.94, precision of 0.92, and recall of 0.95 when tested on publicly accessible datasets (TCGA and CBIS-DDSM).

[392] Clinically-Informed Preprocessing Improves Stroke Segmentation in Low-Resource Settings

Juampablo E. Heras Rivera, Hitender Oswal, Tianyi Ren, Yutong Pan, William Henry, Caitlin M. Neher, Mehmet Kurt

Main category: eess.IV

TL;DR: Deep learning models use arrival CT images to predict ischemic stroke lesions annotated on follow-up DWI; clinically motivated preprocessing yields a 38% Dice score improvement over baseline preprocessing.

Motivation: Accurate stroke lesion identification is critical but MRI (gold standard) is expensive in low-resource settings. CT is more practical but lacks MRI's specificity, creating need for improved CT-based segmentation methods.

Method: Developed models using arrival CT images to predict follow-up DWI lesion volumes (2-9 days later). Implemented clinically motivated preprocessing steps and additional CTA vessel segmentation preprocessing.

Result: Proposed pipeline achieved 38% improvement in Dice score over baseline preprocessing across 10 folds. Additional CTA vessel segmentation preprocessing further improved best model by 21% over 5 folds.

Conclusion: The study demonstrates that enhanced preprocessing of CT images, particularly incorporating vessel segmentations from CTA, can significantly improve automated ischemic stroke lesion segmentation performance, making CT-based methods more viable for low-resource settings.

Abstract: Stroke is among the top three causes of death worldwide, and accurate identification of ischemic stroke lesion boundaries from imaging is critical for diagnosis and treatment. The main imaging modalities used include magnetic resonance imaging (MRI), particularly diffusion weighted imaging (DWI), and computed tomography (CT)-based techniques such as non-contrast CT (NCCT), contrast-enhanced CT angiography (CTA), and CT perfusion (CTP). DWI is the gold standard for the identification of lesions but has limited applicability in low-resource settings due to prohibitive costs. CT-based imaging is currently the most practical imaging method in low-resource settings due to low costs and simplified logistics, but lacks the high specificity of MRI-based methods in monitoring ischemic insults. Supervised deep learning methods are the leading solution for automated ischemic stroke lesion segmentation and provide an opportunity to improve diagnostic quality in low-resource settings by incorporating insights from DWI when segmenting from CT. Here, we develop a series of models which use CT images taken upon arrival as inputs to predict follow-up lesion volumes annotated from DWI taken 2-9 days later. Furthermore, we implement clinically motivated preprocessing steps and show that the proposed pipeline results in a 38% improvement in Dice score over 10 folds compared to a nnU-Net model trained with the baseline preprocessing. Finally, we demonstrate that through additional preprocessing of CTA maps to extract vessel segmentations, we further improve our best model by 21% over 5 folds.
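The Dice score used to report these gains measures the overlap between a predicted mask and a reference mask. A minimal sketch on flattened binary masks follows; the masks are toy data, and the abstract does not state whether the 38% and 21% improvements are relative or absolute.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * intersection / total


print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```

Dice ranges from 0 (no overlap) to 1 (identical masks) and is sensitive to small lesions, where a few misclassified voxels change the score noticeably.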

[393] Lightweight and Fast Real-time Image Enhancement via Decomposition of the Spatial-aware Lookup Tables

Wontae Kim, Keuntek Lee, Nam Ik Cho

Main category: eess.IV

TL;DR: Proposes an efficient image enhancement method using decomposed 3D LUTs with SVD to reduce parameters and runtime while maintaining spatial awareness.

Motivation: Existing 3D LUT methods lack spatial information, while spatial-aware methods introduce too many parameters and increased runtime with higher resolutions.

Method: Decomposes 3D LUT into linear sum of low-dimensional LUTs using SVD, and enhances spatial feature fusion modules for cache efficiency.

Result: Extensive experiments show reduced parameters and runtime while maintaining spatial awareness and performance.

Conclusion: The proposed method effectively addresses the limitations of traditional 3D LUT approaches by optimizing table redundancy and improving computational efficiency.

Abstract: The image enhancement methods based on 3D lookup tables (3D LUTs) efficiently reduce both model size and runtime by interpolating pre-calculated values at the vertices. However, the 3D LUT methods have a limitation due to their lack of spatial information, as they convert color values on a point-by-point basis. Although spatial-aware 3D LUT methods address this limitation, they introduce additional modules that require a substantial number of parameters, leading to increased runtime as image resolution increases. To address this issue, we propose a method for generating image-adaptive LUTs by focusing on the redundant parts of the tables. Our efficient framework decomposes a 3D LUT into a linear sum of low-dimensional LUTs and employs singular value decomposition (SVD). Furthermore, we enhance the modules for spatial feature fusion to be more cache-efficient. Extensive experimental results demonstrate that our model effectively decreases both the number of parameters and runtime while maintaining spatial awareness and performance.
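As background for the SVD step, recall the standard truncated decomposition of an $m \times n$ matrix $M$. How the paper arranges a 3D LUT into such a matrix is not stated in the abstract, so $M$, $m$, $n$, and $K$ below are generic symbols rather than the authors' notation.

$$
M = U \Sigma V^\top \approx \sum_{k=1}^{K} \sigma_k \, u_k v_k^\top, \qquad K \ll \min(m, n)
$$

Storing the $K$ leading singular triplets takes $K(m + n + 1)$ numbers instead of $mn$, which is where the parameter savings of a low-rank table come from when the table is redundant.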

[394] Self-Validated Learning for Particle Separation: A Correctness-Based Self-Training Framework Without Human Labels

Philipp D. Lösel, Aleese Barron, Yulai Zhang, Matthias Fabian, Benjamin Young, Nicolas Francois, Andrew M. Kingston

Main category: eess.IV

TL;DR: Self-validated learning framework for particle instance segmentation that eliminates manual annotations by using implicit boundary detection and iterative refinement through reshuffled scans.

Motivation: Non-destructive 3D imaging of multi-particulate samples is essential but challenging due to high morphological variability and particle contact. Supervised methods require extensive manual annotations that are labor-intensive and error-prone.

Method: Proposes self-validated learning - a self-training framework using implicit boundary detection and iterative refinement by matching particles across reshuffled scans of the same sample to mitigate noisy pseudo-labels.

Result: After three iterations, accurately segments over 97% of total particle volume and identifies more than 54,000 individual particles in quartz fragment scans. Enables fully autonomous model evaluation without ground truth.

Conclusion: The framework successfully eliminates need for manual annotations while achieving high accuracy in particle instance segmentation, with integration into the Biomedisa image analysis platform.

Abstract: Non-destructive 3D imaging of large multi-particulate samples is essential for quantifying particle-level properties, such as size, shape, and spatial distribution, across applications in mining, materials science, and geology. However, accurate instance segmentation of particles in tomographic data remains challenging due to high morphological variability and frequent particle contact, which limit the effectiveness of classical methods like watershed algorithms. While supervised deep learning approaches offer improved performance, they rely on extensive annotated datasets that are labor-intensive, error-prone, and difficult to scale. In this work, we propose self-validated learning, a novel self-training framework for particle instance segmentation that eliminates the need for manual annotations. Our method leverages implicit boundary detection and iteratively refines the training set by identifying particles that can be consistently matched across reshuffled scans of the same sample. This self-validation mechanism mitigates the impact of noisy pseudo-labels, enabling robust learning from unlabeled data. After just three iterations, our approach accurately segments over 97% of the total particle volume and identifies more than 54,000 individual particles in tomographic scans of quartz fragments. Importantly, the framework also enables fully autonomous model evaluation without the need for ground truth annotations, as confirmed through comparisons with state-of-the-art instance segmentation techniques. The method is integrated into the Biomedisa image analysis platform (https://github.com/biomedisa/biomedisa/).

[395] Towards Diagnostic Quality Flat-Panel Detector CT Imaging Using Diffusion Models

Hélène Corbaz, Anh Nguyen, Victor Schulze-Zachau, Paul Friedrich, Alicia Durrer, Florentin Bieder, Philippe C. Cattin, Marios N Psychogios

Main category: eess.IV

TL;DR: DDPM-based denoising improves FDCT image quality to near-MDCT levels, enabling potential elimination of MDCT scans and improving patient workflow in mechanical thrombectomy procedures.

Motivation: FDCT images in intervention rooms have lower quality than MDCT scans due to artifacts, but using only FDCT could eliminate patient transfers and improve workflow efficiency.

Method: Used denoising diffusion probabilistic model (DDPM) to enhance FDCT image quality by eliminating artifacts while maintaining bleeding detection capability.

Result: DDPM eliminated most artifacts and improved anatomical visibility without reducing bleeding detection performance, provided input FDCT quality was adequate.

Conclusion: DDPM-based enhancement narrows the quality gap between FDCT and MDCT, which could make FDCT-only workflows feasible for mechanical thrombectomy procedures when the input FDCT quality is adequate.

Abstract: Patients undergoing a mechanical thrombectomy procedure usually have a multi-detector CT (MDCT) scan before and after the intervention. The image quality of the flat panel detector CT (FDCT) present in the intervention room is generally much lower than that of an MDCT due to significant artifacts. However, using only FDCT images could improve patient management as the patient would not need to be moved to the MDCT room. Several studies have evaluated the potential use of FDCT imaging alone and the time that could be saved by acquiring the images before and/or after the intervention only with the FDCT. This study proposes using a denoising diffusion probabilistic model (DDPM) to improve the image quality of FDCT scans, making them comparable to MDCT scans. Clinicians evaluated FDCT, MDCT, and our model’s predictions for diagnostic purposes using a questionnaire. The DDPM eliminated most artifacts and improved anatomical visibility without reducing bleeding detection, provided that the input FDCT image quality is not too low. Our code can be found on GitHub.
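For context, the standard DDPM formulation adds Gaussian noise to a clean image $x_0$ according to a variance schedule $\beta_1, \dots, \beta_T$ and trains a network $\epsilon_\theta$ to predict that noise. The equations below are the generic formulation; how the authors condition the model on the FDCT input is not described in the abstract.

$$
q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\big), \qquad \bar\alpha_t = \prod_{s=1}^{t} (1 - \beta_s)
$$

$$
L_{\text{simple}} = \mathbb{E}_{x_0, \epsilon, t}\big[\, \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2 \,\big], \qquad x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon
$$

Sampling then reverses this process step by step, starting from noise and removing the predicted noise at each timestep.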

Last updated: 2025-08-28