Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 79]
- cs.CV [Total: 134]
- cs.AI [Total: 33]
- cs.SD [Total: 13]
- cs.LG [Total: 110]
- cs.MA [Total: 5]
- cs.MM [Total: 8]
- eess.AS [Total: 11]
- eess.IV [Total: 9]
cs.CL
[1] Social Bias in Multilingual Language Models: A Survey
Lance Calvin Lim Gamboa, Yue Feng, Mark Lee
Main category: cs.CL
TL;DR: Multilingual pretrained models exhibit social biases similar to those of English models, requiring systematic evaluation and mitigation approaches across diverse languages and cultures.
Details
Motivation: To analyze emerging research on extending bias evaluation and mitigation from English to multilingual contexts, addressing linguistic diversity and cultural awareness gaps.
Method: Systematic review of studies examining bias in multilingual models, focusing on evaluation metrics, mitigation techniques, linguistic diversity, and cultural appropriateness.
Result: Identified methodological gaps including preference for certain languages, scarcity of multilingual mitigation experiments, and cataloged common issues in adapting bias benchmarks across languages and cultures.
Conclusion: Future research should focus on improving inclusivity, cross-cultural appropriateness, and alignment with state-of-the-art NLP advancements in multilingual bias literature.
Abstract: Pretrained multilingual models exhibit the same social bias as models processing English texts. This systematic review analyzes emerging research that extends bias evaluation and mitigation approaches into multilingual and non-English contexts. We examine these studies with respect to linguistic diversity, cultural awareness, and their choice of evaluation metrics and mitigation techniques. Our survey illuminates gaps in the field’s dominant methodological design choices (e.g., preference for certain languages, scarcity of multilingual mitigation experiments) while cataloging common issues encountered and solutions implemented in adapting bias benchmarks across languages and cultures. Drawing from the implications of our findings, we chart directions for future research that can reinforce the multilingual bias literature’s inclusivity, cross-cultural appropriateness, and alignment with state-of-the-art NLP advancements.
[2] Prompting Strategies for Language Model-Based Item Generation in K-12 Education: Bridging the Gap Between Small and Large Language Models
Mohammad Amini, Babak Ahmadi, Xiaomeng Xiong, Yilin Zhang, Christopher Qiao
Main category: cs.CL
TL;DR: This study shows that structured prompting strategies (especially chain-of-thought + sequential) significantly improve midsized language models for automatic generation of multiple choice questions, outperforming larger untuned models and offering a scalable solution for educational assessment development.
Details
Motivation: To reduce the cost and inconsistency of manual test development for morphological assessment by using language models to automatically generate multiple choice questions.
Method: Used a two-fold approach: comparing fine-tuned Gemma (2B) vs untuned GPT-3.5 (175B), and evaluating seven structured prompting strategies including zero-shot, few-shot, chain-of-thought, role-based, sequential, and combinations. Generated items were assessed using automated metrics, expert scoring, and GPT-4.1 simulation of human scoring.
Result: Structured prompting, particularly strategies combining chain-of-thought and sequential design, significantly improved Gemma’s outputs. Gemma produced more construct-aligned and instructionally appropriate items than GPT-3.5’s zero-shot responses, with prompt design being crucial for mid-size model performance.
Conclusion: Structured prompting and efficient fine-tuning can enhance midsized models for automatic item generation under limited data conditions. The workflow combining automated metrics, expert judgment, and large-model simulation offers a practical and scalable way to develop language assessment items for K-12 education.
Abstract: This study explores automatic item generation (AIG) using language models to create multiple choice questions (MCQs) for morphological assessment, aiming to reduce the cost and inconsistency of manual test development. The study used a two-fold approach. First, we compared a fine-tuned medium model (Gemma, 2B) with a larger untuned one (GPT-3.5, 175B). Second, we evaluated seven structured prompting strategies, including zero-shot, few-shot, chain-of-thought, role-based, sequential, and combinations. Generated items were assessed using automated metrics and expert scoring across five dimensions. We also used GPT-4.1, trained on expert-rated samples, to simulate human scoring at scale. Results show that structured prompting, especially strategies combining chain-of-thought and sequential design, significantly improved Gemma’s outputs. Gemma generally produced more construct-aligned and instructionally appropriate items than GPT-3.5’s zero-shot responses, with prompt design playing a key role in mid-size model performance. This study demonstrates that structured prompting and efficient fine-tuning can enhance midsized models for AIG under limited data conditions. We highlight the value of combining automated metrics, expert judgment, and large-model simulation to ensure alignment with assessment goals. The proposed workflow offers a practical and scalable way to develop and validate language assessment items for K-12.
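To make the prompting strategies concrete, here is a minimal Python sketch of layering a chain-of-thought preamble over sequential step instructions for MCQ generation. The wording, morpheme example, and `build_prompt` helper are illustrative assumptions, not the paper's actual prompts.

```python
# Hypothetical sketch: combining chain-of-thought and sequential prompting
# for morphological MCQ generation; all wording is illustrative.

COT_PREAMBLE = (
    "You are an assessment writer. Think step by step: first identify the "
    "target morpheme, then draft a stem, then write one correct option and "
    "three plausible distractors."
)

SEQUENTIAL_STEPS = [
    "Step 1: State the morphological skill being tested.",
    "Step 2: Write a grade-appropriate stem sentence with a blank.",
    "Step 3: Give the correct answer.",
    "Step 4: Give three distractors that reflect common student errors.",
]

def build_prompt(target_morpheme: str, grade: int) -> str:
    """Compose one prompt that layers CoT guidance over sequential steps."""
    header = f"{COT_PREAMBLE}\nTarget morpheme: {target_morpheme}. Grade level: {grade}."
    return "\n".join([header, *SEQUENTIAL_STEPS])

print(build_prompt("-ness", 5))
```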
[3] Integrating SystemC TLM into FMI 3.0 Co-Simulations with an Open-Source Approach
Andrei Mihai Albu, Giovanni Pollo, Alessio Burrello, Daniele Jahier Pagliari, Cristian Tesconi, Alessandra Neri, Dario Soldi, Fabio Autieri, Sara Vinco
Main category: cs.CL
TL;DR: Open-source methodology for integrating SystemC TLM models into FMI-based co-simulation workflows using FMI 3.0 standards.
Details
Motivation: Growing complexity of cyber-physical systems requires efficient cross-domain co-simulation, but SystemC TLM has limited interoperability with other engineering domains.
Method: Encapsulate SystemC TLM components as FMI 3.0 Co-Simulation FMUs, develop a lightweight open-source toolchain, and address time synchronization and data exchange challenges.
Result: Feasibility and effectiveness demonstrated through representative case studies, enabling seamless standardized integration across heterogeneous simulation environments.
Conclusion: The proposed approach successfully bridges SystemC TLM with FMI-based workflows, facilitating cross-domain co-simulation for complex cyber-physical systems.
Abstract: The growing complexity of cyber-physical systems, particularly in automotive applications, has increased the demand for efficient modeling and cross-domain co-simulation techniques. While SystemC Transaction-Level Modeling (TLM) enables effective hardware/software co-design, its limited interoperability with models from other engineering domains poses integration challenges. This paper presents a fully open-source methodology for integrating SystemC TLM models into Functional Mock-up Interface (FMI)-based co-simulation workflows. By encapsulating SystemC TLM components as FMI 3.0 Co-Simulation Functional Mock-up Units (FMUs), the proposed approach facilitates seamless, standardized integration across heterogeneous simulation environments. We introduce a lightweight open-source toolchain, address key technical challenges such as time synchronization and data exchange, and demonstrate the feasibility and effectiveness of the integration through representative case studies.
[4] Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities
Rikuto Kotoge, Mai Nishimura, Jiaxin Ma
Main category: cs.CL
TL;DR: DGPO enables compact language models to achieve sophisticated agentic RAG behaviors through teacher demonstration initialization and continuous guidance during training, outperforming larger models in some cases.
Details
Motivation: Compact language models struggle with agentic RAG behaviors due to poor reasoning ability, sparse rewards, and unstable training, which makes agentic RAG hard to deploy in resource-constrained environments.
Method: Distillation-Guided Policy Optimization (DGPO) uses cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization, with ARC metrics for evaluation.
Result: DGPO enables compact models to achieve sophisticated agentic search behaviors, even outperforming larger teacher models in some cases, making agentic RAG feasible in resource-constrained environments.
Conclusion: DGPO successfully addresses the challenges of training compact language models for agentic RAG, demonstrating that proper guidance and initialization can overcome reasoning limitations and enable deployment in constrained computing environments.
Abstract: Reinforcement Learning has emerged as a post-training approach to elicit agentic RAG behaviors such as search and planning from language models. However, compact language models (e.g., 0.5B parameters) struggle due to poor reasoning ability, resulting in sparse rewards and unstable training. To overcome these difficulties, we propose Distillation-Guided Policy Optimization (DGPO), which addresses the challenges through cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization. To systematically evaluate our approach, we introduce Agentic RAG Capabilities (ARC), a fine-grained metric analyzing reasoning, search coordination, and response synthesis. Comprehensive experiments demonstrate that DGPO enables compact models to achieve sophisticated agentic search behaviors, even outperforming the larger teacher model in some cases. DGPO makes agentic RAG feasible in computing resource-constrained environments.
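The paper's exact objective is not given here, but the idea of continuous teacher guidance during policy optimization can be sketched as a policy-gradient term plus a KL penalty toward the teacher's token distribution. A minimal PyTorch sketch under stated assumptions: the `dgpo_style_loss` name, tensor shapes, and `beta` weight are all illustrative, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def dgpo_style_loss(student_logits, teacher_logits, action_ids, advantages, beta=0.1):
    """Illustrative loss: policy gradient on the student's own rollout plus a
    KL penalty toward the teacher's token distribution (the 'continuous
    guidance' idea). Shapes: logits [B, T, V], action_ids [B, T], advantages [B]."""
    logp = F.log_softmax(student_logits, dim=-1)
    token_logp = logp.gather(-1, action_ids.unsqueeze(-1)).squeeze(-1)  # [B, T]
    pg_loss = -(advantages.unsqueeze(-1) * token_logp).mean()

    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # KL(student || teacher), averaged over batch and time
    kl = (logp.exp() * (logp - teacher_logp)).sum(-1).mean()
    return pg_loss + beta * kl

# Toy shapes to confirm the loss is differentiable w.r.t. student logits.
B, T, V = 2, 5, 11
student = torch.randn(B, T, V, requires_grad=True)
loss = dgpo_style_loss(student, torch.randn(B, T, V),
                       torch.randint(V, (B, T)), torch.randn(B))
loss.backward()
```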
[5] GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs
Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Yang Zhang, Haohan Wang
Main category: cs.CL
TL;DR: GUARD is a testing framework that converts high-level AI ethics guidelines into actionable test questions to verify LLM compliance, including jailbreak diagnostics to identify safety bypass scenarios.
Details
Motivation: Address the gap between high-level government AI ethics guidelines and actionable testing methods to verify LLM compliance and prevent harmful responses.
Method: Automated generation of guideline-violating questions based on government guidelines, with jailbreak diagnostics (GUARD-JD) to provoke unethical responses and identify safety mechanism bypasses.
Result: Empirically validated on eight LLMs including Vicuna-13B, GPT-4, and Claude-3.7, showing effectiveness in testing compliance under three government guidelines and transferring jailbreak diagnostics to vision-language models.
Conclusion: GUARD provides a practical framework for operationalizing AI ethics guidelines into compliance testing, helping promote reliable LLM applications through systematic violation identification and jailbreak scenario detection.
Abstract: As Large Language Models become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. To implement this, GUARD uses automated generation of guideline-violating questions based on government-issued guidelines, thereby testing whether responses comply with these guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. Furthermore, for responses that do not directly violate guidelines, GUARD integrates the concept of “jailbreaks” into its diagnostics, named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. Our method finally culminates in a compliance report, delineating the extent of adherence and highlighting any violations. We have empirically validated the effectiveness of GUARD on eight LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models, demonstrating its usage in promoting reliable LLM-based applications.
[6] Joint Enhancement of Relational Reasoning for Long-Context LLMs
Zhirui Chen, Wei Shen, Jiashui Huang, Ling Shao
Main category: cs.CL
TL;DR: JERR is a novel framework that enhances long-context comprehension in LLMs through graph-based reasoning, addressing memory limitations and hallucination issues.
Details
Motivation: LLMs struggle with long contexts due to memory constraints, inability to handle complex long-context tasks, lack of transparency, and a tendency to produce hallucinations.
Method: Three key components: 1) synopsis extraction through strategic text chunking, 2) directed acyclic graph (DAG) construction to resolve redundancy and ensure logical consistency, and 3) Monte Carlo Tree Search (MCTS) for navigating complex reasoning paths.
Result: JERR consistently outperforms all baselines on ROUGE and F1 metrics, achieving the highest scores on LLM-Rater evaluation.
Conclusion: The framework provides a novel solution enabling LLMs to handle extended contexts and complex reasoning tasks with improved reliability and transparency.
Abstract: Despite significant progress, large language models (LLMs) still struggle with long contexts due to memory limitations and their inability to tackle complex and long-context tasks. Additionally, LLMs often suffer from a lack of transparency and are prone to producing hallucinations. To address these challenges, we propose \textbf{JERR}, a novel framework designed to enhance long-context comprehension via graph-based reasoning in LLMs. JERR integrates three key components: synopsis extraction, graph construction, and relational reasoning. First, synopsis is extracted by chunking text strategically, allowing the model to summarize and understand information more efficiently. Second, we build a directed acyclic graph (DAG) to resolve redundancy, ensuring logical consistency and clarity. Finally, we incorporate Monte Carlo Tree Search (MCTS) to help the model navigate complex reasoning paths, ensuring more accurate and interpretable outputs. This framework provides a novel solution that enables LLMs to handle extended contexts and complex reasoning tasks with improved reliability and transparency. Experimental results show that JERR consistently outperforms all baselines on the ROUGE and F1 metrics, achieving the highest scores on the LLM-Rater evaluation.
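The DAG stage can be pictured in a few lines of Python: synopsis nodes connected by dependency edges, verified acyclic, and read in topological order. The node labels and edges below are invented toy content, and the MCTS stage is omitted.

```python
import networkx as nx

# Toy sketch of the DAG stage: synopsis nodes with dependency edges,
# checked for acyclicity so reasoning can proceed in a consistent order.
dag = nx.DiGraph()
dag.add_edges_from([
    ("chunk1: company founded", "chunk3: IPO"),
    ("chunk2: product launch", "chunk3: IPO"),
    ("chunk3: IPO", "chunk4: acquisition"),
])
assert nx.is_directed_acyclic_graph(dag)  # logical consistency check
print(list(nx.topological_sort(dag)))     # a consistent reading order
```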
[7] Graph-R1: Unleashing LLM Reasoning with NP-Hard Graph Problems
Yuyao Wang, Bowen Liu, Jianheng Tang, Nuo Chen, Yuhan Li, Qifan Zhang, Jia Li
Main category: cs.CL
TL;DR: Using NP-hard graph problems as synthetic training data to develop long chain-of-thought reasoning in LLMs through supervised fine-tuning and reinforcement learning, achieving strong generalization across multiple domains.
Details
Motivation: Current Long CoT development relies on costly human-curated datasets (math/code), leaving scalable alternatives unexplored. NP-hard graph problems inherently require deep reasoning and extensive exploration, making them ideal for Long CoT training.
Method: Two-stage post-training: 1) Long CoT supervised fine-tuning on rejection-sampled NP-hard graph instances to enhance reasoning depth, 2) reinforcement learning with fine-grained reward design to sharpen reasoning efficiency.
Result: Graph-R1-7B model demonstrates strong generalization across mathematics, coding, STEM, and logic, surpassing QwQ-32B on NP-hard graph problems in both accuracy and reasoning efficiency.
Conclusion: NP-hard graph problems serve as an effective and scalable resource for advancing Long CoT reasoning in LLMs, opening a new frontier for LLM post-training.
Abstract: Reasoning Large Language Models (RLLMs) have recently achieved remarkable progress on complex reasoning tasks, largely enabled by their long chain-of-thought (Long CoT) capabilities. However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored. In this work, we introduce NP-hard (NPH) graph problems as a novel synthetic training corpus, as they inherently require deep reasoning, extensive exploration, and reflective strategies, which are core characteristics of Long CoT reasoning. Building on this insight, we develop a two-stage post-training framework: (i) Long CoT Supervised Fine-Tuning (SFT) on rejection-sampled NPH graph instances, which substantially enhances reasoning depth, and (ii) Reinforcement Learning (RL) with a fine-grained reward design, which sharpens reasoning efficiency. Our flagship model, Graph-R1-7B, demonstrates strong generalization across mathematics, coding, STEM, and logic, and surpasses QwQ-32B on NPH graph problems in both accuracy and reasoning efficiency. These results position NPH graph problems as an effective and scalable resource for advancing Long CoT reasoning in LLMs, opening a new frontier for LLM post-training. Our implementation is available at https://github.com/Graph-Reasoner/Graph-R1, with models and datasets hosted in our Hugging Face collection HKUST-DSAIL/Graph-R1.
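A hedged sketch of the data side: sample a random graph and phrase an NP-hard task as a reasoning prompt. The instance generator and prompt wording are assumptions for illustration; the paper's rejection sampling of long-CoT solutions would happen downstream.

```python
import networkx as nx

def make_nph_prompt(n: int = 12, p: float = 0.3, seed: int = 0) -> str:
    """Sample a random graph and phrase an NP-hard task (maximum independent
    set here) as a reasoning prompt. Instance generation only."""
    g = nx.gnp_random_graph(n, p, seed=seed)
    edges = ", ".join(f"({u},{v})" for u, v in g.edges())
    return (
        f"Graph with {n} vertices (0..{n-1}) and edges: {edges}. "
        "Find a maximum independent set. Reason step by step, exploring and "
        "backtracking as needed, then output the vertex set."
    )

print(make_nph_prompt())
```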
[8] CAPE: Context-Aware Personality Evaluation Framework for Large Language Models
Jivnesh Sandhan, Fei Cheng, Tushar Sandhan, Yugo Murawaki
Main category: cs.CL
TL;DR: Proposes the CAPE framework for context-aware personality evaluation of LLMs, showing that conversational history enhances consistency but causes personality shifts; GPT models exhibit extreme deviations yet remain robust to question ordering.
Details
Motivation: Traditional psychometric tests for LLMs use context-free approaches that ignore real-world conversational history, creating an artificial evaluation setting that doesn't reflect actual usage scenarios.
Method: Developed the Context-Aware Personality Evaluation (CAPE) framework incorporating prior conversational interactions, introduced novel metrics to quantify response consistency, and conducted experiments on 7 LLMs including GPT models, Gemini, and Llama.
Result: Conversational history enhances response consistency via in-context learning but induces personality shifts; GPT-3.5-Turbo and GPT-4-Turbo show extreme deviations; GPT models are robust to question ordering while Gemini-1.5-Flash and Llama-8B are sensitive; context-dependent shifts improve consistency and human alignment in RPAs.
Conclusion: Context-aware evaluation reveals important behavioral patterns in LLMs that context-free approaches miss, with different models showing varying sensitivity to conversational history and personality stability, providing more realistic assessment of LLM behavior.
Abstract: Psychometric tests, traditionally used to assess humans, are now being applied to Large Language Models (LLMs) to evaluate their behavioral traits. However, existing studies follow a context-free approach, answering each question in isolation to avoid contextual influence. We term this the Disney World test, an artificial setting that ignores real-world applications, where conversational history shapes responses. To bridge this gap, we propose the first Context-Aware Personality Evaluation (CAPE) framework for LLMs, incorporating prior conversational interactions. To thoroughly analyze the influence of context, we introduce novel metrics to quantify the consistency of LLM responses, a fundamental trait in human behavior. Our exhaustive experiments on 7 LLMs reveal that conversational history enhances response consistency via in-context learning but also induces personality shifts, with GPT-3.5-Turbo and GPT-4-Turbo exhibiting extreme deviations. While GPT models are robust to question ordering, Gemini-1.5-Flash and Llama-8B display significant sensitivity. Moreover, GPT models’ responses stem from their intrinsic personality traits as well as prior interactions, whereas Gemini-1.5-Flash and Llama-8B heavily depend on prior interactions. Finally, applying our framework to Role Playing Agents (RPAs) shows context-dependent personality shifts improve response consistency and better align with human judgments. Our code and datasets are publicly available at: https://github.com/jivnesh/CAPE
[9] Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction
Xu Guo
Main category: cs.CL
TL;DR: Analyzing reasoning step utility in LLMs shows decreasing conditional entropy correlates with correct answers, while flat/increasing entropy often leads to wrong answers. Incorrect reasoning tends to be longer, suggesting longer reasoning doesn’t guarantee better outcomes.
Details
Motivation: Despite LLMs generating intermediate reasoning steps to improve accuracy, little research examines how reasoning utility contributes to final answer correctness. The stochastic nature of autoregressive generation means more context doesn't ensure increased confidence in answers.
Method: Oracle study on the MATH dataset using Qwen2.5-32B and GPT-4o to generate reasoning chains, then using Qwen3-8B to quantify utility by measuring conditional entropy (expected negative log-likelihood) on the answer span at each reasoning step as context expands.
Result: Clear pattern: decreasing conditional entropy over steps strongly associated with correct answers, while flat or increasing entropy often results in wrong answers. Incorrect reasoning paths tend to be longer than correct ones.
Conclusion: These findings provide foundation for designing efficient reasoning pipelines that can detect and avoid unproductive reasoning early, potentially enabling early stopping or pruning of ineffective reasoning steps.
Abstract: Recent advancements in large language models (LLMs) often rely on generating intermediate reasoning steps to enhance accuracy. However, little work has examined how reasoning utility contributes to the final answer’s correctness. Due to the stochastic nature of autoregressive generation, generating more context does not guarantee increased confidence in the answer. If we could predict, during generation, whether a reasoning step will be useful, we could stop early or prune ineffective steps, avoiding distractions in the final decision. We present an oracle study on the MATH dataset, using Qwen2.5-32B and GPT-4o to generate reasoning chains, and then employing a separate model (Qwen3-8B) to quantify the utility of these chains for final accuracy. Specifically, we measure the model’s uncertainty on the answer span Y at each reasoning step using conditional entropy (expected negative log-likelihood over the vocabulary) with context expanding step by step. Our results show a clear pattern: conditional entropy that decreases over steps is strongly associated with correct answers, whereas flat or increasing entropy often results in wrong answers. We also corroborate that incorrect reasoning paths tend to be longer than correct ones, suggesting that longer reasoning does not necessarily yield better outcomes. These findings serve as a foundation to inspire future work on designing efficient reasoning pipelines that detect and avoid unproductive reasoning early.
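The core measurement reduces to re-scoring the predictive entropy over the answer span as the reasoning context grows. A minimal sketch using Hugging Face transformers, with GPT-2 standing in for the paper's Qwen3-8B scorer and an invented toy arithmetic chain as the context:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 as a small stand-in for the paper's Qwen3-8B scorer.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def answer_entropy(context: str, answer: str) -> float:
    """Average entropy (expected negative log-likelihood over the vocabulary)
    of the predictive distribution at each answer-span position, teacher-forced
    on the gold answer. Recompute as the context grows by one reasoning step;
    per the paper, decreasing values should track correct answers."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, tok(answer, return_tensors="pt").input_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position t predict token t+1, so answer positions start at ctx_len-1.
    logp = logits[0, ctx_ids.shape[1] - 1 : -1].log_softmax(-1)
    entropy = -(logp.exp() * logp).sum(-1)  # one value per answer token
    return entropy.mean().item()

steps = ["Q: What is 2 + 3 * 4?", " First compute 3 * 4 = 12.", " Then 2 + 12 = 14."]
context = ""
for step in steps:
    context += step
    print(f"{answer_entropy(context, ' Answer: 14'):.3f}")
```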
[10] UI-Bench: A Benchmark for Evaluating Design Capabilities of AI Text-to-App Tools
Sam Jung, Agustin Garcinuno, Spencer Mateega
Main category: cs.CL
TL;DR: UI-Bench is the first large-scale benchmark for evaluating AI text-to-app tools through expert pairwise comparisons, ranking 10 tools across 300 generated sites with 4000+ expert judgments.
Details
Motivation: There are no public benchmarks that rigorously verify claims about AI text-to-app tools' ability to produce high-quality applications and websites quickly.
Method: Created UI-Bench with 10 tools, 30 prompts, 300 generated sites, and collected 4000+ expert pairwise comparisons. Used a TrueSkill-derived model for ranking with calibrated confidence intervals.
Result: Established a reproducible standard for evaluating AI-driven web design tools and created a public leaderboard with comprehensive evaluation framework.
Conclusion: UI-Bench provides the first rigorous benchmark for AI text-to-app tools, enabling objective comparison and advancement of AI-driven web design technologies.
Abstract: AI text-to-app tools promise high quality applications and websites in minutes, yet no public benchmark rigorously verifies those claims. We introduce UI-Bench, the first large-scale benchmark that evaluates visual excellence across competing AI text-to-app tools through expert pairwise comparison. Spanning 10 tools, 30 prompts, 300 generated sites, and 4000+ expert judgments, UI-Bench ranks systems with a TrueSkill-derived model that yields calibrated confidence intervals. UI-Bench establishes a reproducible standard for advancing AI-driven web design. We release (i) the complete prompt set, (ii) an open-source evaluation framework, and (iii) a public leaderboard. The generated sites rated by participants will be released soon. View the UI-Bench leaderboard at https://uibench.ai/leaderboard.
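The ranking stage can be approximated with the off-the-shelf trueskill package: update ratings from each expert pairwise judgment, then sort by a conservative skill estimate. The judgment data and the mu - 3*sigma ranking rule below are assumptions; the paper's TrueSkill-derived model and calibrated intervals may differ.

```python
import trueskill  # pip install trueskill

# Hypothetical judgments: (winner, loser) pairs from expert comparisons.
judgments = [("tool_a", "tool_b"), ("tool_a", "tool_c"), ("tool_b", "tool_c")]

env = trueskill.TrueSkill(draw_probability=0.0)
ratings = {t: env.create_rating() for pair in judgments for t in pair}

for winner, loser in judgments:
    ratings[winner], ratings[loser] = env.rate_1vs1(ratings[winner], ratings[loser])

# Rank by the conservative estimate mu - 3*sigma (one common convention).
for tool, r in sorted(ratings.items(), key=lambda kv: kv[1].mu - 3 * kv[1].sigma,
                      reverse=True):
    print(tool, round(r.mu, 2), round(r.sigma, 2))
```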
[11] DentalBench: Benchmarking and Advancing LLMs Capability for Bilingual Dentistry Understanding
Hengchuan Zhu, Yihuan Xu, Yichen Li, Zijie Meng, Zuozhu Liu
Main category: cs.CL
TL;DR: DentalBench is the first comprehensive bilingual benchmark for evaluating LLMs in dentistry, featuring DentalQA (36,597 questions) and DentalCorpus (337M tokens), revealing performance gaps and showing domain adaptation significantly improves dental knowledge.
Details
Motivation: Existing medical LLMs show strong performance on general medical benchmarks but lack evaluation in specialized fields like dentistry, which requires deeper domain-specific knowledge and has no targeted evaluation resources.
Method: Created DentalBench with two components: DentalQA (a bilingual QA benchmark with 36,597 questions across 4 tasks and 16 subfields) and DentalCorpus (a large-scale corpus with 337.35M tokens for domain adaptation via SFT and RAG). Evaluated 14 LLMs including proprietary, open-source, and medical-specific models.
Result: Significant performance gaps were revealed across task types and languages. Domain adaptation experiments with Qwen-2.5-3B showed substantial performance improvements, particularly on knowledge-intensive and terminology-focused tasks.
Conclusion: Domain-specific benchmarks are crucial for developing trustworthy and effective LLMs tailored to healthcare applications, and domain adaptation significantly enhances performance in specialized medical fields like dentistry.
Abstract: Recent advances in large language models (LLMs) and medical LLMs (Med-LLMs) have demonstrated strong performance on general medical benchmarks. However, their capabilities in specialized medical fields, such as dentistry, which requires deeper domain-specific knowledge, remain underexplored due to the lack of targeted evaluation resources. In this paper, we introduce DentalBench, the first comprehensive bilingual benchmark designed to evaluate and advance LLMs in the dental domain. DentalBench consists of two main components: DentalQA, an English-Chinese question-answering (QA) benchmark with 36,597 questions spanning 4 tasks and 16 dental subfields; and DentalCorpus, a large-scale, high-quality corpus with 337.35 million tokens curated for dental domain adaptation, supporting both supervised fine-tuning (SFT) and retrieval-augmented generation (RAG). We evaluate 14 LLMs, covering proprietary, open-source, and medical-specific models, and reveal significant performance gaps across task types and languages. Further experiments with Qwen-2.5-3B demonstrate that domain adaptation substantially improves model performance, particularly on knowledge-intensive and terminology-focused tasks, and highlight the importance of domain-specific benchmarks for developing trustworthy and effective LLMs tailored to healthcare applications.
[12] KG-CQR: Leveraging Structured Relation Representations in Knowledge Graphs for Contextual Query Retrieval
Chi Minh Bui, Ngoc Mai Thieu, Van Vinh Nguyen, Json J. Jung, Khac-Hoai Nam Bui
Main category: cs.CL
TL;DR: KG-CQR is a framework that enhances retrieval-augmented generation by using knowledge graphs to enrich query contexts, achieving 4-6% mAP and 2-3% Recall@25 improvements over baselines.
Details
Motivation: To improve the retrieval phase of RAG systems through query enrichment with structured knowledge graph representations, rather than only addressing corpus-level context loss.
Method: Proposes the KG-CQR framework with subgraph extraction, completion, and contextual generation modules, operating as a model-agnostic pipeline without requiring additional training.
Result: Achieves 4-6% improvement in mAP and 2-3% improvement in Recall@25 on RAGBench and MultiHop-RAG datasets, with consistent outperformance on multi-hop QA tasks.
Conclusion: KG-CQR effectively enhances retrieval effectiveness in RAG systems through knowledge graph-based query enrichment and works across various LLM sizes without retraining.
Abstract: The integration of knowledge graphs (KGs) with large language models (LLMs) offers significant potential to improve the retrieval phase of retrieval-augmented generation (RAG) systems. In this study, we propose KG-CQR, a novel framework for Contextual Query Retrieval (CQR) that enhances the retrieval phase by enriching the contextual representation of complex input queries using a corpus-centric KG. Unlike existing methods that primarily address corpus-level context loss, KG-CQR focuses on query enrichment through structured relation representations, extracting and completing relevant KG subgraphs to generate semantically rich query contexts. Comprising subgraph extraction, completion, and contextual generation modules, KG-CQR operates as a model-agnostic pipeline, ensuring scalability across LLMs of varying sizes without additional training. Experimental results on RAGBench and MultiHop-RAG datasets demonstrate KG-CQR’s superior performance, achieving a 4-6% improvement in mAP and a 2-3% improvement in Recall@25 over strong baseline models. Furthermore, evaluations on challenging RAG tasks such as multi-hop question answering show that incorporating KG-CQR consistently outperforms the existing baselines in terms of retrieval effectiveness.
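A toy sketch of the three-stage idea on a three-triple knowledge graph: extract query-relevant triples, complete them with neighboring relations, and verbalize the result as added query context. The KG contents, term matching, and verbalization below are stand-ins, not the paper's modules.

```python
# Stand-in KG as (head, relation, tail) triples; contents are invented.
KG = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
    ("warfarin", "treats", "thrombosis"),
]

def extract_subgraph(query: str, kg):
    """Keep triples whose head or tail appears in the query (toy matching)."""
    terms = set(query.lower().split())
    return [t for t in kg if t[0] in terms or t[2] in terms]

def complete(subgraph, kg):
    """Pull in any triple touching an entity already in the subgraph."""
    ents = {e for h, _, t in subgraph for e in (h, t)}
    return [t for t in kg if t[0] in ents or t[2] in ents]

def verbalize(triples):
    return " ".join(f"{h} {r.replace('_', ' ')} {t}." for h, r, t in triples)

query = "What does aspirin interact with?"
context = verbalize(complete(extract_subgraph(query, KG), KG))
print(f"{query}\nContext: {context}")  # query + enriched context go to the retriever
```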
[13] CAMB: A comprehensive industrial LLM benchmark on civil aviation maintenance
Feng Zhang, Chengjie Pang, Yuehan Zhang, Chenyu Luo
Main category: cs.CL
TL;DR: A specialized benchmark for evaluating LLMs in civil aviation maintenance, addressing the lack of domain-specific evaluation tools and enabling targeted improvements through gap identification.
Details
Motivation: Current LLM evaluation focuses primarily on mathematical and coding tasks, leaving a significant gap in specialized domains like civil aviation maintenance, which requires sophisticated reasoning and domain knowledge.
Method: Developed an industrial-grade benchmark specifically for civil aviation maintenance, then used it to evaluate existing vector embedding models and LLMs in maintenance scenarios through experimental exploration.
Result: The benchmark effectively assesses model performance in civil aviation maintenance, identifying specific gaps in domain knowledge and complex reasoning capabilities.
Conclusion: This benchmark establishes a foundation for targeted improvement efforts (fine-tuning, RAG optimization, prompt engineering) and facilitates progress toward more intelligent solutions in aviation maintenance, with the tool being open-sourced for further research.
Abstract: Civil aviation maintenance is a domain characterized by stringent industry standards. Within this field, maintenance procedures and troubleshooting represent critical, knowledge-intensive tasks that require sophisticated reasoning. To address the lack of specialized evaluation tools for large language models (LLMs) in this vertical, we propose and develop an industrial-grade benchmark specifically designed for civil aviation maintenance. This benchmark serves a dual purpose: it provides a standardized tool to measure LLM capabilities within civil aviation maintenance, identifying specific gaps in domain knowledge and complex reasoning. By pinpointing these deficiencies, the benchmark establishes a foundation for targeted improvement efforts (e.g., domain-specific fine-tuning, RAG optimization, or specialized prompt engineering), ultimately facilitating progress toward more intelligent solutions within civil aviation maintenance. Our work addresses a significant gap in the current LLM evaluation, which primarily focuses on mathematical and coding reasoning tasks. In addition, given that Retrieval-Augmented Generation (RAG) systems are currently the dominant solutions in practical applications, we leverage this benchmark to evaluate existing well-known vector embedding models and LLMs for civil aviation maintenance scenarios. Through experimental exploration and analysis, we demonstrate the effectiveness of our benchmark in assessing model performance within this domain, and we open-source this evaluation benchmark and code to foster further research and development: https://github.com/CamBenchmark/cambenchmark
[14] Exploring Machine Learning and Language Models for Multimodal Depression Detection
Javier Si Zhao Hong, Timothy Zoe Delaya, Sherwyn Chan Yin Kit, Pai Chet Ng, Xiaoxiao Miao
Main category: cs.CL
TL;DR: Multimodal depression detection using XGBoost, transformers, and LLMs on audio, video, and text features, comparing performance across modalities for mental health prediction.
Details
Motivation: To address the challenge of multimodal personality-aware depression detection by exploring different machine learning approaches that effectively capture depression-related signals across multiple modalities.
Method: Used XGBoost, transformer-based architectures, and large language models (LLMs) to analyze and compare performance on audio, video, and text features for depression detection.
Result: Identified the strengths and limitations of each model type in capturing depression-related signals across different modalities, providing comparative performance analysis.
Conclusion: The study offers insights into effective multimodal representation strategies for mental health prediction, highlighting which models work best for specific modalities in depression detection.
Abstract: This paper presents our approach to the first Multimodal Personality-Aware Depression Detection Challenge, focusing on multimodal depression detection using machine learning and deep learning models. We explore and compare the performance of XGBoost, transformer-based architectures, and large language models (LLMs) on audio, video, and text features. Our results highlight the strengths and limitations of each type of model in capturing depression-related signals across modalities, offering insights into effective multimodal representation strategies for mental health prediction.
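For the XGBoost branch, one plausible baseline is early fusion: concatenate pooled audio, video, and text features and train a gradient-boosted classifier. The feature dimensions and synthetic data below are assumptions; the paper's actual features and fusion strategy may differ.

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 200
audio = rng.normal(size=(n, 40))   # e.g., pooled acoustic features (assumed dims)
video = rng.normal(size=(n, 64))   # e.g., pooled facial features
text = rng.normal(size=(n, 128))   # e.g., sentence embeddings of transcripts
y = rng.integers(0, 2, size=n)     # depression label (synthetic here)

X = np.concatenate([audio, video, text], axis=1)  # early fusion by concatenation
clf = XGBClassifier(n_estimators=100, max_depth=4, eval_metric="logloss")
clf.fit(X[:150], y[:150])
print("held-out accuracy:", clf.score(X[150:], y[150:]))
```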
[15] Searching the Title of Practical Work of the Informatics Engineering Bachelor Program with the Case Base Reasoning Method
Agung Sukrisna Jaya, Osvari Arsalan, Danny Matthew Saputra
Main category: cs.CL
TL;DR: A CBR system using TF-IDF and Cosine Similarity for practical work title search finds the same number of titles and the highest average match score even in randomized-title testing.
Details
Motivation: To develop an efficient case-based reasoning system for searching practical work titles using previous experience and similarity matching.
Method: Uses Case Base Reasoning (CBR) with TF-IDF for text vectorization and Cosine Similarity for calculating similarity values between practical work titles.
Result: Tested on 705 practical work titles, the system successfully found the same number of titles with highest average match score in randomized title testing compared to exact title searches.
Conclusion: The CBR approach with TF-IDF and Cosine Similarity is effective for practical work title search, demonstrating robust performance even with randomized input titles.
Abstract: Case Base Reasoning (CBR) is a case-solving technique that draws on previously encountered cases with the highest similarity. Here, CBR is used to search for practical work titles. TF-IDF is applied to vectorize the words of each practical work title, and Cosine Similarity is used to calculate similarity values. The system accepts either full titles or keywords as queries, and outputs practical work titles together with a match value for each. Testing used 705 practical work titles and was carried out with five query titles in two stages: the first stage searched with existing titles, and the second stage randomized the titles from the first stage. In the second stage, the system found the same number of titles and obtained the highest average match score.
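The retrieval core is standard TF-IDF plus cosine similarity, sketched below with scikit-learn; the example titles are invented, and the paper's 705-title case base is not reproduced.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = [
    "Web-based inventory information system",
    "Sentiment analysis of product reviews",
    "Inventory forecasting with time series models",
]

vectorizer = TfidfVectorizer()
title_vecs = vectorizer.fit_transform(titles)  # the "case base"

def search(query: str, top_k: int = 2):
    """Return the top-k most similar titles with their match values."""
    scores = cosine_similarity(vectorizer.transform([query]), title_vecs)[0]
    ranked = sorted(zip(titles, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

for title, score in search("inventory system"):
    print(f"{score:.3f}  {title}")
```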
[16] MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow
Main category: cs.CL
TL;DR: MCP-Bench is a new benchmark that evaluates LLMs on realistic multi-step tasks requiring tool use, cross-tool coordination, and planning through 28 live MCP servers with 250 tools across various domains.
Details
Motivation: Existing benchmarks fail to adequately evaluate LLMs' capabilities in realistic multi-step tool usage, cross-tool coordination, and complex planning that requires working with complementary tools across domains without explicit specifications.
Method: Built on the Model Context Protocol (MCP), connecting LLMs to 28 live MCP servers with 250 tools across finance, travel, scientific computing, and academic search domains. Uses a multi-faceted evaluation framework covering tool-level schema understanding, trajectory-level planning, and task completion.
Result: Experiments on 20 advanced LLMs reveal persistent challenges in handling the benchmark’s complex requirements, demonstrating that current models struggle with realistic multi-step tool usage and cross-domain coordination.
Conclusion: MCP-Bench provides a more comprehensive evaluation framework that reveals significant gaps in LLMs’ abilities to handle realistic tool-use scenarios, highlighting the need for improved planning, coordination, and reasoning capabilities in language models.
Abstract: We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning/reasoning for solving tasks. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, traveling, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents’ ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows - capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.
[17] Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models
Younwoo Choi, Changling Li, Yongjin Yang, Zhijing Jin
Main category: cs.CL
TL;DR: LLMs can identify and adapt to different conversational partners (interlocutor awareness), which improves collaboration but also creates new safety vulnerabilities like reward hacking and jailbreak susceptibility.
Details
Motivation: As LLMs are increasingly used in multi-agent and human-AI systems, understanding their awareness of conversational partners is crucial for reliable performance and safety, but this capability has been overlooked in prior work.
Method: Systematic evaluation of interlocutor awareness across three dimensions: reasoning patterns, linguistic style, and alignment preferences. Developed three case studies to demonstrate practical significance through prompt adaptation and vulnerability testing.
Result: LLMs reliably identify same-family peers and prominent model families like GPT and Claude. Interlocutor awareness enhances multi-LLM collaboration but also introduces new alignment vulnerabilities including reward-hacking behaviors and increased jailbreak susceptibility.
Conclusion: Interlocutor awareness represents both promise and peril - it improves collaboration but creates new safety risks, highlighting the need for further understanding and new safeguards in multi-agent deployments.
Abstract: As large language models (LLMs) are increasingly integrated into multi-agent and human-AI systems, understanding their awareness of both self-context and conversational partners is essential for ensuring reliable performance and robust safety. While prior work has extensively studied situational awareness which refers to an LLM’s ability to recognize its operating phase and constraints, it has largely overlooked the complementary capacity to identify and adapt to the identity and characteristics of a dialogue partner. In this paper, we formalize this latter capability as interlocutor awareness and present the first systematic evaluation of its emergence in contemporary LLMs. We examine interlocutor inference across three dimensions-reasoning patterns, linguistic style, and alignment preferences-and show that LLMs reliably identify same-family peers and certain prominent model families, such as GPT and Claude. To demonstrate its practical significance, we develop three case studies in which interlocutor awareness both enhances multi-LLM collaboration through prompt adaptation and introduces new alignment and safety vulnerabilities, including reward-hacking behaviors and increased jailbreak susceptibility. Our findings highlight the dual promise and peril of identity-sensitive behavior in LLMs, underscoring the need for further understanding of interlocutor awareness and new safeguards in multi-agent deployments. Our code is open-sourced at https://github.com/younwoochoi/InterlocutorAwarenessLLM.
[18] Prediction of mortality and resource utilization in critical care: a deep learning approach using multimodal electronic health records with natural language processing techniques
Yucheng Ruan, Xiang Lan, Daniel J. Tan, Hairil Rizal Abdullah, Mengling Feng
Main category: cs.CL
TL;DR: A deep learning framework using NLP techniques to integrate multimodal EHRs (structured data + free-text notes) for predicting mortality and resource utilization in ICU settings, showing improved performance and robustness against data corruption.
Details
Motivation: Existing approaches focus mainly on structured EHRs and ignore valuable clinical insights in free-text notes. There's untapped potential in textual information within structured data for improving mortality and resource utilization predictions in critical care.
Method: Developed a deep learning framework using natural language processing techniques that integrates multimodal EHRs. Used two real-world EHR datasets, performed ablation studies on medical prompts, free-texts, and the pre-trained sentence encoder, and assessed robustness against structured data corruption.
Result: Model improved performance by 1.6%/0.8% on BACC/AUROC for mortality prediction, 0.5%/2.2% on RMSE/MAE for LOS prediction, and 10.9%/11.0% on RMSE/MAE for surgical duration estimation compared to best existing methods. Showed superior performance across tasks at different corruption rates.
Conclusion: The framework is effective and accurate for predicting mortality and resource utilization in critical care. Successfully uses prompt learning with transformer encoder for multimodal EHR analysis and demonstrates strong resilience to data corruption, especially at high corruption levels.
Abstract: Background Predicting mortality and resource utilization from electronic health records (EHRs) is challenging yet crucial for optimizing patient outcomes and managing costs in the intensive care unit (ICU). Existing approaches predominantly focus on structured EHRs, often ignoring the valuable clinical insights in free-text notes. Additionally, the potential of textual information within structured data is not fully leveraged. This study aimed to introduce and assess a deep learning framework using natural language processing techniques that integrates multimodal EHRs to predict mortality and resource utilization in critical care settings. Methods Utilizing two real-world EHR datasets, we developed and evaluated our model on three clinical tasks with leading existing methods. We also performed an ablation study on three key components in our framework: medical prompts, free-texts, and the pre-trained sentence encoder. Furthermore, we assessed the model’s robustness against corruption in structured EHRs. Results Our experiments on two real-world datasets across three clinical tasks showed that our proposed model improved performance metrics by 1.6%/0.8% on BACC/AUROC for mortality prediction, 0.5%/2.2% on RMSE/MAE for LOS prediction, and 10.9%/11.0% on RMSE/MAE for surgical duration estimation compared to the best existing methods. It consistently demonstrated superior performance compared to other baselines across three tasks at different corruption rates. Conclusions The proposed framework is an effective and accurate deep learning approach for predicting mortality and resource utilization in critical care. The study also highlights the success of using prompt learning with a transformer encoder in analyzing multimodal EHRs. Importantly, the model showed strong resilience to data corruption within structured data, especially at high corruption levels.
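The "medical prompt" idea, serializing structured fields into text so a sentence encoder can ingest them alongside free-text notes, can be sketched in a few lines. The field names and phrasing below are hypothetical, not the paper's templates.

```python
# Minimal sketch: turn one structured ICU record into a textual prompt
# for a sentence encoder; field names and wording are invented.

record = {"age": 67, "sex": "F", "heart_rate": 112, "lactate": 3.4,
          "note": "Patient intubated overnight; vasopressors started."}

def to_prompt(r: dict) -> str:
    """Verbalize structured fields, then append the free-text note."""
    structured = (
        f"The patient is a {r['age']}-year-old {r['sex']}. "
        f"Heart rate is {r['heart_rate']} bpm. Lactate is {r['lactate']} mmol/L."
    )
    return structured + " Clinical note: " + r["note"]

print(to_prompt(record))  # encode with a pre-trained sentence encoder downstream
```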
[19] Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation
Weiting Tan, Jiachen Lian, Hirofumi Inaguma, Paden Tomasello, Philipp Koehn, Xutai Ma
Main category: cs.CL
TL;DR: AVLM integrates full-face visual cues with pre-trained speech models for expressive speech generation, achieving significant improvements in emotion recognition and dialogue tasks over speech-only baselines.
Details
Motivation: To enhance expressive speech generation by incorporating visual information from facial cues, which can provide additional emotional and expressive context beyond audio alone.
Method: Explored multiple visual encoders and multimodal fusion strategies during pre-training, followed by fine-tuning on emotion recognition and expressive dialogue tasks.
Result: Achieved substantial gains over speech-only baselines, including +5 F1 score improvement in emotion recognition tasks.
Conclusion: Visual information significantly enhances expressive speech generation and provides a foundation for end-to-end multimodal conversational systems.
Abstract: We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.
[20] ConspirED: A Dataset for Cognitive Traits of Conspiracy Theories and Large Language Model Safety
Luke Bates, Max Glockner, Preslav Nakov, Iryna Gurevych
Main category: cs.CL
TL;DR: ConspirED dataset captures cognitive traits of conspiracy theories using the CONSPIR framework, enabling development of detection models and revealing LLM vulnerabilities to conspiratorial content.
Details
Motivation: Conspiracy theories erode public trust and resist debunking, while AI-generated misinformation becomes more sophisticated. Understanding rhetorical patterns is crucial for interventions like prebunking and assessing AI vulnerabilities.
Method: Created the ConspirED dataset with multi-sentence excerpts from online conspiracy articles annotated using the CONSPIR cognitive framework. Used this to develop computational models and evaluate LLM robustness to conspiratorial inputs.
Result: Both computational models and LLMs were misaligned by conspiratorial content, producing output that mirrors input reasoning patterns, even when successfully deflecting comparable fact-checked misinformation.
Conclusion: The study demonstrates the need for better AI safeguards against conspiratorial content and provides a valuable dataset (ConspirED) for developing interventions against conspiracy theories.
Abstract: Conspiracy theories erode public trust in science and institutions while resisting debunking by evolving and absorbing counter-evidence. As AI-generated misinformation becomes increasingly sophisticated, understanding rhetorical patterns in conspiratorial content is important for developing interventions such as targeted prebunking and assessing AI vulnerabilities. We introduce ConspirED (CONSPIR Evaluation Dataset), which captures the cognitive traits of conspiratorial ideation in multi-sentence excerpts (80–120 words) from online conspiracy articles, annotated using the CONSPIR cognitive framework (Lewandowsky and Cook, 2020). ConspirED is the first dataset of conspiratorial content annotated for general cognitive traits. Using ConspirED, we (i) develop computational models that identify conspiratorial traits and determine dominant traits in text excerpts, and (ii) evaluate large language/reasoning model (LLM/LRM) robustness to conspiratorial inputs. We find that both are misaligned by conspiratorial content, producing output that mirrors input reasoning patterns, even when successfully deflecting comparable fact-checked misinformation.
[21] Languages Still Left Behind: Toward a Better Multilingual Machine Translation Benchmark
Chihiro Taguchi, Seng Mai, Keita Kurabe, Yusuke Sakai, Georgina Agyei, Soudabeh Eslami, David Chiang
Main category: cs.CL
TL;DR: The FLORES+ benchmark has quality issues: many translations fall below the claimed 90% standard, source sentences show cultural bias, and simple heuristics can inflate scores. Models trained on natural data perform poorly on FLORES+ but better on domain-relevant evaluation.
Details
Motivation: To evaluate the suitability and reliability of the widely used FLORES+ multilingual machine translation benchmark, which claims high-quality translations for over 200 languages but may have underlying issues affecting true multilingual evaluation.
Method: Human assessment of translations in four languages (Asante Twi, Japanese, Jinghpaw, South Azerbaijani), quality analysis against claimed standards, testing simple heuristics like named entity copying, and comparing model performance on FLORES+ vs domain-relevant evaluation sets.
Result: Many FLORES+ translations fall below the claimed 90% quality standard, source sentences show cultural bias toward English-speaking world, simple heuristics yield non-trivial BLEU scores, and models trained on naturalistic data perform poorly on FLORES+ but achieve significant gains on domain-relevant evaluation.
Conclusion: FLORES+ has critical shortcomings for multilingual evaluation. Future benchmarks should use domain-general and culturally neutral source texts with less reliance on named entities to better reflect real-world translation challenges.
Abstract: Multilingual machine translation (MT) benchmarks play a central role in evaluating the capabilities of modern MT systems. Among them, the FLORES+ benchmark is widely used, offering English-to-many translation data for over 200 languages, curated with strict quality control protocols. However, we study data in four languages (Asante Twi, Japanese, Jinghpaw, and South Azerbaijani) and uncover critical shortcomings in the benchmark’s suitability for truly multilingual evaluation. Human assessments reveal that many translations fall below the claimed 90% quality standard, and the annotators report that source sentences are often too domain-specific and culturally biased toward the English-speaking world. We further demonstrate that simple heuristics, such as copying named entities, can yield non-trivial BLEU scores, suggesting vulnerabilities in the evaluation protocol. Notably, we show that MT models trained on high-quality, naturalistic data perform poorly on FLORES+ while achieving significant gains on our domain-relevant evaluation set. Based on these findings, we advocate for multilingual MT benchmarks that use domain-general and culturally neutral source texts and rely less on named entities, in order to better reflect real-world translation challenges.
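The named-entity-copying vulnerability is easy to demonstrate: score a "translation" that simply copies the source against a reference that shares the entities. A toy sacrebleu sketch with invented sentences; the default smoothing keeps the score non-zero despite no actual translation taking place.

```python
import sacrebleu  # pip install sacrebleu

# Toy illustration: a "hypothesis" that merely copies the source still
# overlaps with the reference wherever named entities are shared.
sources = ["Serena Williams won Wimbledon in 2016.",
           "Google opened an office in Accra."]
references = [["Serena Williams ganó Wimbledon en 2016.",
               "Google abrió una oficina en Accra."]]

copy_hypotheses = sources  # the copying heuristic
bleu = sacrebleu.corpus_bleu(copy_hypotheses, references)
print(f"copy-baseline BLEU: {bleu.score:.1f}")  # non-zero without translating
```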
[22] SciTopic: Enhancing Topic Discovery in Scientific Literature through Advanced LLM
Pengjiang Li, Zaitian Wang, Xinhao Zhang, Ran Zhang, Lu Jiang, Pengfei Wang, Yuanchun Zhou
Main category: cs.CL
TL;DR: SciTopic is an LLM-enhanced topic discovery method that improves scientific topic identification by using a textual encoder with space optimization and contrastive learning guided by large language models.
Details
Motivation: Existing topic discovery methods rely on word embeddings and struggle with complex text relationships, lacking a comprehensive understanding of scientific publications. LLMs' exceptional text comprehension capabilities can enhance topic discovery.
Method: 1) Build a textual encoder for scientific publications (metadata, title, abstract); 2) optimize the embedding space with entropy-based sampling and LLM-guided triplet tasks; 3) fine-tune the encoder with a contrastive loss over the triplets, guided by the LLM, forcing it to better discriminate between topics.
Result: Extensive experiments on three real-world scientific publication datasets show SciTopic outperforms state-of-the-art scientific topic discovery methods.
Conclusion: SciTopic enables researchers to gain deeper and faster insights into scientific literature by leveraging LLMs for enhanced topic discovery, addressing limitations of traditional word embedding-based approaches.
Abstract: Topic discovery in scientific literature provides valuable insights for researchers to identify emerging trends and explore new avenues for investigation, facilitating easier scientific information retrieval. Many machine learning methods, particularly deep embedding techniques, have been applied to discover research topics. However, most existing topic discovery methods rely on word embedding to capture the semantics and lack a comprehensive understanding of scientific publications, struggling with complex, high-dimensional text relationships. Inspired by the exceptional comprehension of textual information by large language models (LLMs), we propose an advanced topic discovery method enhanced by LLMs to improve scientific topic identification, namely SciTopic. Specifically, we first build a textual encoder to capture the content from scientific publications, including metadata, title, and abstract. Next, we construct a space optimization module that integrates entropy-based sampling and triplet tasks guided by LLMs, enhancing the focus on thematic relevance and contextual intricacies between ambiguous instances. Then, we propose to fine-tune the textual encoder based on the guidance from the LLMs by optimizing the contrastive loss of the triplets, forcing the text encoder to better discriminate instances of different topics. Finally, extensive experiments conducted on three real-world datasets of scientific publications demonstrate that SciTopic outperforms the state-of-the-art (SOTA) scientific topic discovery methods, enabling researchers to gain deeper and faster insights.
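The fine-tuning step amounts to a triplet contrastive objective over LLM-labeled (anchor, positive, negative) publication triplets. A minimal PyTorch sketch, with a stand-in MLP over precomputed embeddings in place of the paper's textual encoder and random tensors in place of real triplets:

```python
import torch
import torch.nn as nn

# Stand-in encoder over precomputed 384-dim embeddings; the LLM is assumed
# to have already labeled (anchor, positive, negative) publication triplets.
encoder = nn.Sequential(nn.Linear(384, 256), nn.ReLU(), nn.Linear(256, 128))
loss_fn = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

anchor, positive, negative = (torch.randn(32, 384) for _ in range(3))
for _ in range(10):  # a few illustrative optimization steps
    opt.zero_grad()
    loss = loss_fn(encoder(anchor), encoder(positive), encoder(negative))
    loss.backward()
    opt.step()
print("final triplet loss:", loss.item())
```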
[23] Overview of BioASQ 2024: The twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering
Anastasios Nentidis, Georgios Katsimpras, Anastasia Krithara, Salvador Lima-López, Eulàlia Farré-Maduell, Martin Krallinger, Natalia Loukachevitch, Vera Davydova, Elena Tutubalina, Georgios Paliouras
Main category: cs.CL
TL;DR: Overview of BioASQ 2024 challenge featuring two established tasks and two new multilingual biomedical NLP tasks, with 37 teams and 700+ submissions showing competitive performance.
Details
Motivation: To advance large-scale biomedical semantic indexing and question answering through international challenges, expanding to include new multilingual and domain-specific tasks.
Method: Organized four shared tasks: established Task b and Synergy, plus new MultiCardioNER (clinical entity detection in cardiology domain) and BIONNE (nested NER in Russian/English).
Result: 37 competing teams participated with over 700 distinct submissions across all tasks. Most systems achieved competitive performance.
Conclusion: The BioASQ challenge continues to drive state-of-the-art advancements in biomedical NLP, with successful participation and competitive results across both established and new multilingual tasks.
Abstract: This is an overview of the twelfth edition of the BioASQ challenge in the context of the Conference and Labs of the Evaluation Forum (CLEF) 2024. BioASQ is a series of international challenges promoting advances in large-scale biomedical semantic indexing and question answering. This year, BioASQ consisted of new editions of the two established tasks b and Synergy, and two new tasks: a) MultiCardioNER on the adaptation of clinical entity detection to the cardiology domain in a multilingual setting, and b) BIONNE on nested NER in Russian and English. In this edition of BioASQ, 37 competing teams participated with more than 700 distinct submissions in total for the four different shared tasks of the challenge. Similarly to previous editions, most of the participating systems achieved competitive performance, suggesting the continuous advancement of the state-of-the-art in the field.
[24] Overview of BioASQ 2025: The Thirteenth BioASQ Challenge on Large-Scale Biomedical Semantic Indexing and Question Answering
Anastasios Nentidis, Georgios Katsimpras, Anastasia Krithara, Martin Krallinger, Miguel Rodríguez-Ortega, Eduard Rodriguez-López, Natalia Loukachevitch, Andrey Sakhovskiy, Elena Tutubalina, Dimitris Dimitriadis, Grigorios Tsoumakas, George Giannakoulas, Alexandra Bekiaridou, Athanasios Samaras, Giorgio Maria Di Nunzio, Nicola Ferro, Stefano Marchesin, Marco Martinelli, Gianmaria Silvello, Georgios Paliouras
Main category: cs.CL
TL;DR: Overview of BioASQ 2025 challenge featuring 6 tasks (2 established, 4 new) with 83 teams and 1000+ submissions, showing continued advancement in biomedical semantic indexing and QA.
Details
Motivation: To promote advances in large-scale biomedical semantic indexing and question answering through international challenges that push the state-of-the-art in biomedical NLP.
Method: Organized six shared tasks: two established tasks (b and Synergy) and four new tasks focusing on multilingual clinical summarization, nested named entity linking, clinical coding in cardiology, and gut-brain interplay information extraction.
Result: 83 competing teams participated with over 1000 distinct submissions across all tasks. Several systems achieved competitive performance, indicating continuous advancement in the field.
Conclusion: The BioASQ 2025 challenge successfully continued its mission of advancing biomedical NLP through diverse tasks and strong participation, demonstrating ongoing progress in the state-of-the-art for biomedical semantic technologies.
Abstract: This is an overview of the thirteenth edition of the BioASQ challenge in the context of the Conference and Labs of the Evaluation Forum (CLEF) 2025. BioASQ is a series of international challenges promoting advances in large-scale biomedical semantic indexing and question answering. This year, BioASQ consisted of new editions of the two established tasks, b and Synergy, and four new tasks: a) Task MultiClinSum on multilingual clinical summarization, b) Task BioNNE-L on nested named entity linking in Russian and English, c) Task ELCardioCC on clinical coding in cardiology, and d) Task GutBrainIE on gut-brain interplay information extraction. In this edition of BioASQ, 83 competing teams participated with more than 1000 distinct submissions in total for the six different shared tasks of the challenge. Similar to previous editions, several participating systems achieved competitive performance, indicating the continuous advancement of the state-of-the-art in the field.
[25] Adaptive Federated Distillation for Multi-Domain Non-IID Textual Data
Jiahao Xiao, Jiangming Liu
Main category: cs.CL
TL;DR: Proposes Adaptive Federated Distillation (AdaFD) framework to handle multi-domain non-IID data challenges in federated learning, with comprehensive benchmarking for real-world NLP scenarios.
Details
Motivation: Existing federated learning approaches primarily focus on label diversity but neglect language domain diversity, which is crucial for natural language processing tasks in real-world non-IID environments.
Method: Introduces a unified benchmarking framework with multi-domain non-IID scenarios and proposes AdaFD framework that adaptively handles both homogeneous and heterogeneous settings to capture local client diversity.
Result: Experimental results show that the proposed models achieve better performance compared to existing works by effectively capturing the diversity of local clients.
Conclusion: The AdaFD framework successfully addresses multi-domain non-IID challenges in federated learning, providing a more realistic evaluation benchmark and improved performance for language model fine-tuning in distributed environments.
Abstract: The widespread success of pre-trained language models has established a new training paradigm, where a global PLM is fine-tuned using task-specific data from local clients. The local data are highly heterogeneous and cannot capture the global distribution of the whole data in the real world. To address the challenges of non-IID data in real environments, privacy-preserving federated distillation has been proposed and widely investigated. However, previous experimental non-IID scenarios are primarily defined by label (output) diversity, without considering the diversity of language domains (input) that is crucial in natural language processing. In this paper, we introduce a comprehensive set of multi-domain non-IID scenarios and propose a unified benchmarking framework that includes diverse data. The benchmark can be used to evaluate federated learning frameworks in realistic environments. To this end, we propose an Adaptive Federated Distillation (AdaFD) framework designed to address multi-domain non-IID challenges in both homogeneous and heterogeneous settings. Experimental results demonstrate that our models capture the diversity of local clients and achieve better performance compared to existing works. The code for this paper is available at: https://github.com/jiahaoxiao1228/AdaFD.
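For readers unfamiliar with federated distillation, a generic server-side step might look like the sketch below; note that it shows plain logit averaging, whereas AdaFD's adaptive weighting (not fully specified in this summary) would replace the naive mean:

```python
# A generic federated-distillation step (a sketch, not AdaFD's adaptive
# variant): each client shares logits on a public transfer set, and the
# server distills their aggregate into the global model with a KL loss.
import torch
import torch.nn.functional as F

def server_distill(global_model, client_logits, public_batch, T=2.0, lr=1e-4):
    opt = torch.optim.AdamW(global_model.parameters(), lr=lr)
    # Naive average of client teachers; AdaFD would instead weight clients
    # to reflect their domain diversity.
    teacher = torch.stack(client_logits).mean(dim=0)
    student = global_model(public_batch)
    loss = F.kl_div(F.log_softmax(student / T, dim=-1),
                    F.softmax(teacher / T, dim=-1),
                    reduction="batchmean") * T * T
    loss.backward()
    opt.step()
    return loss.item()
```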
[26] Leveraging Generative Models for Real-Time Query-Driven Text Summarization in Large-Scale Web Search
Zeyu Xiong, Yixuan Nan, Li Gao, Hengzhu Tang, Shuaiqiang Wang, Junfeng Wang, Dawei Yin
Main category: cs.CL
TL;DR: A novel generative framework for query-driven text summarization that uses model distillation and optimization techniques to create a lightweight 0.1B parameter model that outperforms traditional extractive methods while maintaining high deployment efficiency.
Details
Motivation: Traditional extractive summarization models suffer from cumulative information loss in multi-stage pipelines and lack sufficient semantic understanding of complex user queries and documents, limiting their effectiveness in industrial web search applications.
Method: Integrates large model distillation, supervised fine-tuning, direct preference optimization, and lookahead decoding to transform a lightweight 0.1B parameter model into a domain-specialized QDTS expert.
Result: Outperforms production baseline and achieves state-of-the-art performance on multiple industry-relevant metrics while demonstrating excellent deployment efficiency (334 NVIDIA L20 GPUs for ~50,000 queries per second under 55 ms average latency).
Conclusion: The proposed generative framework successfully addresses limitations of traditional extractive approaches and provides an efficient, high-performance solution for real-time query-driven text summarization in industrial web search applications.
Abstract: In the dynamic landscape of large-scale web search, Query-Driven Text Summarization (QDTS) aims to generate concise and informative summaries from textual documents based on a given query, which is essential for improving user engagement and facilitating rapid decision-making. Traditional extractive summarization models, based primarily on ranking candidate summary segments, have been the dominant approach in industrial applications. However, these approaches suffer from two key limitations: 1) The multi-stage pipeline often introduces cumulative information loss and architectural bottlenecks due to its weakest component; 2) Traditional models lack sufficient semantic understanding of both user queries and documents, particularly when dealing with complex search intents. In this study, we propose a novel framework to pioneer the application of generative models to real-time QDTS in industrial web search. Our approach integrates large model distillation, supervised fine-tuning, direct preference optimization, and lookahead decoding to transform a lightweight model with only 0.1B parameters into a domain-specialized QDTS expert. Evaluated on multiple industry-relevant metrics, our model outperforms the production baseline and achieves a new state of the art. Furthermore, it demonstrates excellent deployment efficiency, requiring only 334 NVIDIA L20 GPUs to handle ~50,000 queries per second at an average latency of 55 ms per query.
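Of the techniques in this recipe, direct preference optimization is the most self-contained to sketch; the function below is the standard DPO loss over sequence log-probabilities, not the paper's exact training code:

```python
# The standard DPO loss as a standalone sketch; inputs are log p(response |
# query) summed over response tokens, for the policy and a frozen reference.
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """All inputs are sequence log-probs of shape (batch,)."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

# toy check: preferring the chosen response lowers the loss
print(dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
               torch.tensor([-6.0]), torch.tensor([-6.0])))
```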
[27] KCS: Diversify Multi-hop Question Generation with Knowledge Composition Sampling
Yangfan Wang, Jie Liu, Chen Tang, Lian Yan, Jingchi Jiang
Main category: cs.CL
TL;DR: KCS framework improves multi-hop QA by sampling diverse knowledge compositions using probabilistic contrastive loss and stochastic decoding, achieving 3.9% better knowledge selection accuracy and performance gains on benchmark datasets.
Details
Motivation: Address data sparsity in multi-hop QA where models learn spurious patterns, as prior methods focus on simple question generation without integrating essential document knowledge.
Method: Knowledge Composition Sampling (KCS) models knowledge composition selection as sentence-level conditional prediction with probabilistic contrastive loss, using stochastic decoding for inference.
Result: 3.9% improvement in knowledge composition selection accuracy, with data augmentation yielding improvements on HotpotQA and 2WikiMultihopQA datasets.
Conclusion: KCS effectively expands question diversity by sampling varied knowledge compositions, addressing data sparsity and improving multi-hop QA performance through better knowledge integration.
Abstract: Multi-hop question answering faces substantial challenges due to data sparsity, which increases the likelihood of language models learning spurious patterns. To address this issue, prior research has focused on diversifying question generation through content planning and varied expression. However, these approaches often emphasize generating simple questions and neglect the integration of essential knowledge, such as relevant sentences within documents. This paper introduces Knowledge Composition Sampling (KCS), an innovative framework designed to expand the diversity of generated multi-hop questions by sampling varied knowledge compositions within a given context. KCS models the knowledge composition selection as a sentence-level conditional prediction task and utilizes a probabilistic contrastive loss to predict the next most relevant piece of knowledge. During inference, we employ a stochastic decoding strategy to effectively balance accuracy and diversity. Compared to competitive baselines, our KCS improves the overall accuracy of knowledge composition selection by 3.9%, and its application for data augmentation yields improvements on HotpotQA and 2WikiMultihopQA datasets. Our code is available at: https://github.com/yangfanww/kcs.
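A minimal sketch of what stochastic decoding over knowledge compositions could look like, assuming a selection model that scores each candidate sentence (the scores and k are illustrative):

```python
# Stochastic composition sampling in the spirit of KCS: instead of greedily
# taking the top-scoring sentence at each step, sample from the predicted
# distribution so repeated runs yield diverse knowledge compositions.
import torch

def sample_composition(scores: torch.Tensor, k: int, temperature=1.0):
    """scores: (num_sentences,) relevance logits from the selection model."""
    chosen = []
    mask = torch.zeros_like(scores, dtype=torch.bool)
    for _ in range(k):
        logits = scores.masked_fill(mask, float("-inf")) / temperature
        idx = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
        chosen.append(idx)
        mask[idx] = True  # sample without replacement
    return chosen

print(sample_composition(torch.tensor([2.0, 1.5, 0.1, -1.0]), k=2))
```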
[28] A Graph Talks, But Who’s Listening? Rethinking Evaluations for Graph-Language Models
Soham Petkar, Hari Aakash K, Anirudh Vempati, Akshit Sinha, Ponnurangam Kumarauguru, Chirag Agarwal
Main category: cs.CL
TL;DR: Current GLM benchmarks are inadequate as they can be solved with unimodal information alone. The paper introduces CLEGR benchmark to properly evaluate multimodal graph-language reasoning and finds that current GLMs struggle with structural reasoning tasks.
Details
Motivation: Existing GLM evaluation benchmarks are insufficient because they are repurposed node classification datasets that don't require true multimodal reasoning - strong performance can be achieved using only unimodal information.
Method: Introduces CLEGR benchmark with synthetic graph generation pipeline and questions requiring joint reasoning over structure and textual semantics. Evaluates representative GLM architectures and compares them with soft-prompted LLM baselines.
Result: Soft-prompted LLM baselines perform on par with full GNN-backbone GLMs, questioning the architectural necessity of graph incorporation. GLMs show significant performance degradation in structural reasoning tasks.
Conclusion: Current GLMs have limitations in graph reasoning capabilities. The findings provide foundation for advancing explicit multimodal reasoning involving both graph structure and language.
Abstract: Developments in Graph-Language Models (GLMs) aim to integrate the structural reasoning capabilities of Graph Neural Networks (GNNs) with the semantic understanding of Large Language Models (LLMs). However, we demonstrate that current evaluation benchmarks for GLMs, which are primarily repurposed node-level classification datasets, are insufficient to assess multimodal reasoning. Our analysis reveals that strong performance on these benchmarks is achievable using unimodal information alone, suggesting that they do not necessitate graph-language integration. To address this evaluation gap, we introduce the CLEGR (Compositional Language-Graph Reasoning) benchmark, designed to evaluate multimodal reasoning at various complexity levels. Our benchmark employs a synthetic graph generation pipeline paired with questions that require joint reasoning over structure and textual semantics. We perform a thorough evaluation of representative GLM architectures and find that soft-prompted LLM baselines perform on par with GLMs that incorporate a full GNN backbone. This result calls into question the architectural necessity of incorporating graph structure into LLMs. We further show that GLMs exhibit significant performance degradation in tasks that require structural reasoning. These findings highlight limitations in the graph reasoning capabilities of current GLMs and provide a foundation for advancing the community toward explicit multimodal reasoning involving graph structure and language.
[29] Generative Annotation for ASR Named Entity Correction
Yuanchang Luo, Daimeng Wei, Shaojun Li, Hengchao Shang, Jiaxin Guo, Zongyao Li, Zhanglin Wu, Xiaoyu Chen, Zhiqiang Rao, Jinlong Yang, Hao Yang
Main category: cs.CL
TL;DR: A novel named entity correction method using speech sound features to handle cases where ASR transcripts and ground-truth entities have significant word form differences, improving entity accuracy.
Details
Motivation: End-to-end ASR systems often fail to transcribe domain-specific named entities, causing downstream failures. Existing phonetic-level edit distance methods struggle when word forms differ significantly between transcripts and ground-truth entities.
Method: Utilizes speech sound features to retrieve candidate entities, then employs a generative method to annotate entity errors in ASR transcripts and replace text with correct entities.
Result: Significant improvement in entity accuracy demonstrated on both open-source and self-constructed test sets, particularly effective in scenarios with word form differences.
Conclusion: The proposed NEC method effectively addresses the limitation of existing approaches by leveraging speech sound features and generative annotation, providing better performance when transcript and entity forms differ significantly.
Abstract: End-to-end automatic speech recognition systems often fail to transcribe domain-specific named entities, causing catastrophic failures in downstream tasks. Numerous fast and lightweight named entity correction (NEC) models have been proposed in recent years. These models, mainly leveraging phonetic-level edit distance algorithms, have shown impressive performance. However, when the forms of the wrongly transcribed word(s) and the ground-truth entity are significantly different, these methods often fail to locate the wrongly transcribed words in the hypothesis, thus limiting their usage. We propose a novel NEC method that utilizes speech sound features to retrieve candidate entities. With speech sound features and candidate entities, we innovatively design a generative method to annotate entity errors in ASR transcripts and replace the text with correct entities. This method is effective in scenarios with word-form differences. We test our method using open-source and self-constructed test sets. The results demonstrate that our NEC method can bring significant improvement to entity accuracy. We will open source our self-constructed test set and training data.
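To illustrate sound-based retrieval, the sketch below matches an ASR span to an entity inventory in a crude phonetic space; a real system would use acoustic or phoneme embeddings rather than the Soundex-style codes assumed here, and the example entity list is hypothetical:

```python
# A sketch of sound-based candidate retrieval for NEC: compare an ASR span
# and the entity inventory in a phonetic space rather than by surface form.
def soundex_ish(word: str) -> str:
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    w = word.lower()
    return (w[:1] + "".join(codes.get(c, "") for c in w[1:]))[:4]

def edit_distance(a: str, b: str) -> int:
    dp = list(range(len(b) + 1))  # one rolling row of the Levenshtein table
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

entities = ["Jinghpaw", "Gingham", "Jingle"]     # hypothetical inventory
span = "Jingpo"                                  # hypothetical mis-transcription
best = min(entities,
           key=lambda e: edit_distance(soundex_ish(e), soundex_ish(span)))
print(best)  # -> Jinghpaw
```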
[30] Multi-Lingual Implicit Discourse Relation Recognition with Multi-Label Hierarchical Learning
Nelson Filipe Costa, Leila Kosseim
Main category: cs.CL
TL;DR: First multi-lingual multi-label classification model for implicit discourse relation recognition (HArch) achieves SOTA results, outperforming LLMs like GPT-4o through hierarchical sense modeling and task-specific fine-tuning.
Details
Motivation: To address the lack of multi-lingual and multi-label classification models for implicit discourse relation recognition (IDRR) and leverage hierarchical dependencies between discourse senses in the PDTB 3.0 framework.
Method: Proposed HArch model with hierarchical architecture that predicts probability distributions across three sense levels. Evaluated on DiscoGeM 2.0 corpus, compared pre-trained encoders (RoBERTa, XLM-RoBERTa), and benchmarked against LLMs (GPT-4o, Llama-4-Maverick) with few-shot prompting.
Result: RoBERTa-HArch achieved best performance in English, XLM-RoBERTa-HArch performed best in multi-lingual setting. Fine-tuned models consistently outperformed LLMs across all language configurations. Achieved SOTA results on DiscoGeM 1.0 corpus.
Conclusion: Task-specific fine-tuning with hierarchical modeling significantly outperforms prompting large language models for IDRR, demonstrating the effectiveness of the hierarchical approach for discourse relation recognition.
Abstract: This paper introduces the first multi-lingual and multi-label classification model for implicit discourse relation recognition (IDRR). Our model, HArch, is evaluated on the recently released DiscoGeM 2.0 corpus and leverages hierarchical dependencies between discourse senses to predict probability distributions across all three sense levels in the PDTB 3.0 framework. We compare several pre-trained encoder backbones and find that RoBERTa-HArch achieves the best performance in English, while XLM-RoBERTa-HArch performs best in the multi-lingual setting. In addition, we compare our fine-tuned models against GPT-4o and Llama-4-Maverick using few-shot prompting across all language configurations. Our results show that our fine-tuned models consistently outperform these LLMs, highlighting the advantages of task-specific fine-tuning over prompting in IDRR. Finally, we report SOTA results on the DiscoGeM 1.0 corpus, further validating the effectiveness of our hierarchical approach.
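A plausible shape for such a hierarchical head is sketched below: each deeper sense level conditions on the shallower level's distribution. Hidden size and class counts are illustrative, not the paper's exact configuration:

```python
# A HArch-style hierarchical head sketch: one encoder representation feeds
# three classifiers (PDTB 3.0 level-1/2/3 senses), and each deeper level
# also sees the shallower level's distribution, so predictions respect the
# sense hierarchy.
import torch
import torch.nn as nn

class HierarchicalHead(nn.Module):
    def __init__(self, hidden=768, n1=4, n2=17, n3=28):  # illustrative sizes
        super().__init__()
        self.l1 = nn.Linear(hidden, n1)
        self.l2 = nn.Linear(hidden + n1, n2)
        self.l3 = nn.Linear(hidden + n2, n3)
    def forward(self, h):
        p1 = torch.softmax(self.l1(h), dim=-1)
        p2 = torch.softmax(self.l2(torch.cat([h, p1], -1)), dim=-1)
        p3 = torch.softmax(self.l3(torch.cat([h, p2], -1)), dim=-1)
        return p1, p2, p3  # trained against multi-label sense distributions

head = HierarchicalHead()
p1, p2, p3 = head(torch.randn(2, 768))
print(p1.shape, p2.shape, p3.shape)
```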
[31] Addressing Tokenization Inconsistency in Steganography and Watermarking Based on Large Language Models
Ruiyi Yan, Yugo Murawaki
Main category: cs.CL
TL;DR: The paper addresses tokenization inconsistency (TI) issues in LLM-based steganography and watermarking, proposing stepwise verification and post-hoc rollback methods to eliminate TI and improve performance.
Details
Motivation: Tokenization inconsistency between sender and receiver in steganography/watermarking undermines robustness, with problematic tokens showing infrequency and temporariness characteristics.
Method: Proposed stepwise verification method for steganography and post-hoc rollback method for watermarking to eliminate tokenization inconsistency.
Result: Experiments show improved fluency, imperceptibility, anti-steganalysis capacity for steganography, and enhanced detectability/robustness against attacks for watermarking.
Conclusion: Directly addressing tokenization inconsistency through tailored methods significantly improves both steganography and watermarking performance compared to traditional disambiguation approaches.
Abstract: Large language models have significantly enhanced the capacities and efficiency of text generation. On the one hand, they have improved the quality of text-based steganography. On the other hand, they have also underscored the importance of watermarking as a safeguard against malicious misuse. In this study, we focus on tokenization inconsistency (TI) between Alice and Bob in steganography and watermarking, where TI can undermine robustness. Our investigation reveals that the problematic tokens responsible for TI exhibit two key characteristics: infrequency and temporariness. Based on these findings, we propose two tailored solutions for TI elimination: a stepwise verification method for steganography and a post-hoc rollback method for watermarking. Experiments show that (1) compared to traditional disambiguation methods in steganography, directly addressing TI leads to improvements in fluency, imperceptibility, and anti-steganalysis capacity; (2) for watermarking, addressing TI enhances detectability and robustness against attacks.
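The stepwise verification idea reduces to a round-trip check at each generation step; the sketch below assumes a Hugging Face-style tokenizer and elides the embedding logic itself:

```python
# A sketch of stepwise verification against tokenization inconsistency:
# before committing a token, check that the receiver, who only sees text,
# would re-tokenize the prefix into exactly the tokens the sender used.
def consistent(tokenizer, token_ids: list[int]) -> bool:
    text = tokenizer.decode(token_ids)
    return tokenizer.encode(text, add_special_tokens=False) == token_ids

def commit_next(tokenizer, prefix_ids: list[int], candidates: list[int]):
    for tok in candidates:  # candidates ordered by the stego sampler
        if consistent(tokenizer, prefix_ids + [tok]):
            return prefix_ids + [tok]
    raise RuntimeError("no round-trip-safe candidate at this step")
```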
[32] rStar2-Agent: Agentic Reasoning Technical Report
Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, Ying Xin, Ziming Miao, Scarlett Li, Fan Yang, Mao Yang
Main category: cs.CL
TL;DR: rStar2-Agent is a 14B math reasoning model that achieves state-of-the-art performance through agentic reinforcement learning, featuring advanced cognitive behaviors like careful thinking before coding and autonomous step refinement.
Details
Motivation: To develop a math reasoning model that goes beyond current long chain-of-thought approaches by enabling advanced cognitive behaviors and autonomous problem-solving capabilities through efficient agentic reinforcement learning.
Method: Uses three key innovations: (1) efficient RL infrastructure with reliable Python code environment for high-throughput execution, (2) GRPO-RoC algorithm with Resample-on-Correct strategy to handle coding environment noise, (3) multi-stage training recipe from non-reasoning SFT to RL stages.
Result: Achieves 80.6% pass@1 on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with shorter responses. Trained in only 510 RL steps within one week using 64 MI300X GPUs.
Conclusion: rStar2-Agent demonstrates that agentic RL can effectively scale to create advanced reasoning models with strong generalization to math, alignment, scientific reasoning, and tool-use tasks, achieving frontier performance with minimal compute resources.
Abstract: We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance. Beyond current long CoT, the model demonstrates advanced cognitive behaviors, such as thinking carefully before using Python coding tools and reflecting on code execution feedback to autonomously explore, verify, and refine intermediate steps in complex problem-solving. This capability is enabled through three key innovations that make agentic RL effective at scale: (i) an efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and mitigates the high rollout costs, enabling training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic RL algorithm with a Resample-on-Correct rollout strategy that addresses the inherent environment noises from coding tools, allowing the model to reason more effectively in a code environment; (iii) an efficient agent training recipe that starts with non-reasoning SFT and progresses through multi-RL stages, yielding advanced cognitive abilities with minimal compute cost. With this recipe, rStar2-Agent boosts a pre-trained 14B model to state of the art in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly shorter responses. Beyond mathematics, rStar2-Agent-14B also demonstrates strong generalization to alignment, scientific reasoning, and agentic tool-use tasks. Code and training recipes are available at https://github.com/microsoft/rStar.
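One plausible reading of the Resample-on-Correct strategy is sketched below: oversampled rollouts are filtered so that only the cleanest correct trajectories are kept for the policy update; the rollout fields are illustrative:

```python
# A Resample-on-Correct-style rollout filter sketch: oversample rollouts,
# keep incorrect ones as-is (they still carry learning signal), and keep
# only the cleanest correct ones, so noisy-but-lucky code executions do not
# get reinforced. Field names are illustrative, not the paper's.
import random

def roc_downsample(rollouts, group_size):
    """rollouts: dicts with 'correct' (bool) and 'tool_errors' (int)."""
    pos = sorted((r for r in rollouts if r["correct"]),
                 key=lambda r: r["tool_errors"])   # cleanest correct first
    neg = [r for r in rollouts if not r["correct"]]
    random.shuffle(neg)
    n_pos = min(len(pos), group_size // 2)
    return pos[:n_pos] + neg[:group_size - n_pos]
```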
[33] Leveraging Semantic Triples for Private Document Generation with Local Differential Privacy Guarantees
Stephen Meisenbacher, Maulik Chevli, Florian Matthes
Main category: cs.CL
TL;DR: DP-ST uses semantic triples and LLM post-processing for local differential privacy text generation, achieving better privacy-utility balance at lower ε values through neighborhood-aware document generation.
Details
Motivation: Existing local DP text privatization methods require very high ε values for reasonable utility, making them impractical for real-world applications that need stronger privacy guarantees.
Method: Introduces DP-ST which leverages semantic triples for neighborhood-aware private document generation under local DP, using a divide-and-conquer approach with LLM post-processing for coherence.
Result: The method enables coherent text generation even at lower ε values while maintaining privacy-utility balance, demonstrating effectiveness of limiting DP notion to privatization neighborhood.
Conclusion: Semantic triple-based approaches with LLM post-processing can achieve balanced privatization outputs at reasonable ε levels, highlighting the importance of coherence in privacy-preserving text generation.
Abstract: Many works at the intersection of Differential Privacy (DP) in Natural Language Processing aim to protect privacy by transforming texts under DP guarantees. This can be performed in a variety of ways, from word perturbations to full document rewriting, and most often under local DP. Here, an input text must be made indistinguishable from any other potential text, within some bound governed by the privacy parameter $\varepsilon$. Such a guarantee is quite demanding, and recent works show that privatizing texts under local DP can only be done reasonably under very high $\varepsilon$ values. Addressing this challenge, we introduce DP-ST, which leverages semantic triples for neighborhood-aware private document generation under local DP guarantees. Through the evaluation of our method, we demonstrate the effectiveness of the divide-and-conquer paradigm, particularly when limiting the DP notion (and privacy guarantees) to that of a privatization neighborhood. When combined with LLM post-processing, our method allows for coherent text generation even at lower $\varepsilon$ values, while still balancing privacy and utility. These findings highlight the importance of coherence in achieving balanced privatization outputs at reasonable $\varepsilon$ levels.
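DP-ST builds on standard local-DP primitives; as an illustration, the exponential mechanism below samples a replacement triple from a privatization neighborhood with probability scaling in ε (the neighborhood and similarity scores are made up):

```python
# An exponential-mechanism sketch over a privatization neighborhood, the
# kind of primitive a triple-level local-DP scheme can build on: a triple is
# replaced by a neighbor sampled with probability proportional to
# exp(eps * similarity / (2 * sensitivity)).
import numpy as np

def exp_mechanism(similarities: np.ndarray, eps: float, sensitivity=1.0) -> int:
    scores = eps * similarities / (2 * sensitivity)
    probs = np.exp(scores - scores.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

neighborhood = ["(city, hosts, festival)", "(town, holds, fair)",
                "(region, hosts, event)"]
sims = np.array([1.0, 0.7, 0.6])  # similarity to the private triple
print(neighborhood[exp_mechanism(sims, eps=2.0)])
```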
[34] Specializing General-purpose LLM Embeddings for Implicit Hate Speech Detection across Datasets
Vassiliy Cheremetiev, Quang Long Ho Ngo, Chau Ying Kot, Alina Elena Baia, Andrea Cavallaro
Main category: cs.CL
TL;DR: Fine-tuning general-purpose LLM-based embedding models achieves state-of-the-art performance for implicit hate speech detection without needing external knowledge or complex pipelines.
Details
Motivation: Implicit hate speech is challenging to detect due to its subtle, indirect nature that lacks explicit derogatory words, requiring advanced detection methods beyond traditional approaches.
Method: Solely fine-tuning recent general-purpose embedding models based on large language models (Stella, Jasper, NV-Embed, E5) without external knowledge or additional information.
Result: Achieved up to 1.10 percentage points improvement for in-dataset evaluation and up to 20.35 percentage points improvement in cross-dataset evaluation in terms of F1-macro score across multiple IHS datasets.
Conclusion: General-purpose LLM-based embedding models, when fine-tuned, provide effective and simplified solutions for implicit hate speech detection, outperforming complex task-specific pipelines that rely on external knowledge.
Abstract: Implicit hate speech (IHS) is indirect language that conveys prejudice or hatred through subtle cues, sarcasm or coded terminology. IHS is challenging to detect as it does not include explicit derogatory or inflammatory words. To address this challenge, task-specific pipelines can be complemented with external knowledge or additional information such as context, emotions and sentiment data. In this paper, we show that, by solely fine-tuning recent general-purpose embedding models based on large language models (LLMs), such as Stella, Jasper, NV-Embed and E5, we achieve state-of-the-art performance. Experiments on multiple IHS datasets show improvements of up to 1.10 percentage points in in-dataset evaluation and up to 20.35 percentage points in cross-dataset evaluation, in terms of F1-macro score.
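A minimal sketch of this recipe, using a small public E5 checkpoint as a stand-in for the larger embedding models the paper fine-tunes:

```python
# Fine-tune a general-purpose embedding backbone end-to-end with a linear
# classifier on IHS labels. "intfloat/e5-base-v2" stands in for the larger
# E5/Stella/Jasper/NV-Embed checkpoints; the example text is invented.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")
backbone = AutoModel.from_pretrained("intfloat/e5-base-v2")
clf = nn.Linear(backbone.config.hidden_size, 2)  # implicit-hate vs not
opt = torch.optim.AdamW(list(backbone.parameters()) + list(clf.parameters()),
                        lr=2e-5)

batch = tok(["query: you people always ruin everything"],
            return_tensors="pt", padding=True, truncation=True)
hidden = backbone(**batch).last_hidden_state
mask = batch["attention_mask"].unsqueeze(-1)
pooled = (hidden * mask).sum(1) / mask.sum(1)  # mean pooling over tokens
loss = nn.functional.cross_entropy(clf(pooled), torch.tensor([1]))
loss.backward()
opt.step()
```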
[35] GUARD: Glocal Uncertainty-Aware Robust Decoding for Effective and Efficient Open-Ended Text Generation
Yuanhao Ding, Esteban Garces Arias, Meimingwei Li, Julian Rodemann, Matthias Aßenmacher, Danlu Chen, Gaojuan Fan, Christian Heumann, Chongsheng Zhang
Main category: cs.CL
TL;DR: GUARD is a self-adaptive decoding method that balances text diversity and coherence using global and local uncertainty signals, achieving faster generation with better quality.
Details
Motivation: Address the trade-off between coherence and diversity in LLM text generation, overcoming limitations of contrastive search methods like hyperparameter dependence and high computational costs.
Method: GUARD uses a ‘Glocal’ uncertainty-driven framework combining global entropy estimates with local entropy deviations, plus a token-count-based penalty to reduce computational overhead.
Result: GUARD achieves better balance between diversity and coherence, shows substantial generation speed improvements, and receives strong validation from both human and LLM evaluators.
Conclusion: GUARD provides an effective solution for open-ended text generation by integrating long-term and short-term uncertainty signals with reduced computational costs.
Abstract: Open-ended text generation faces a critical challenge: balancing coherence with diversity in LLM outputs. While contrastive search-based decoding strategies have emerged to address this trade-off, their practical utility is often limited by hyperparameter dependence and high computational costs. We introduce GUARD, a self-adaptive decoding method that effectively balances these competing objectives through a novel “Glocal” uncertainty-driven framework. GUARD combines global entropy estimates with local entropy deviations to integrate both long-term and short-term uncertainty signals. We demonstrate that our proposed global entropy formulation effectively mitigates abrupt variations in uncertainty, such as sudden overconfidence or high entropy spikes, and provides theoretical guarantees of unbiasedness and consistency. To reduce computational overhead, we incorporate a simple yet effective token-count-based penalty into GUARD. Experimental results demonstrate that GUARD achieves a good balance between text diversity and coherence, while exhibiting substantial improvements in generation speed. In a more nuanced comparison study across different dimensions of text quality, both human and LLM evaluators validated its remarkable performance. Our code is available at https://github.com/YecanLee/GUARD.
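The "glocal" signal can be pictured as a running entropy tracker that modulates sampling; the sketch below is an assumed simplification, not GUARD's exact formulation or penalty:

```python
# A "glocal" decoding sketch: track a running global entropy estimate,
# measure how far the current step's local entropy deviates from it, and
# let that deviation modulate the sampling temperature so sudden over- or
# under-confidence is damped.
import torch

class GlocalSampler:
    def __init__(self, momentum=0.95, strength=0.5):
        self.h_global, self.m, self.s = None, momentum, strength
    def step(self, logits: torch.Tensor) -> int:
        probs = torch.softmax(logits, dim=-1)
        h = -(probs * torch.log(probs + 1e-12)).sum()  # local entropy
        if self.h_global is None:
            self.h_global = h
        deviation = h - self.h_global
        self.h_global = self.m * self.h_global + (1 - self.m) * h
        # entropy spike -> sharpen; overconfidence -> flatten
        temp = torch.clamp(1.0 - self.s * deviation, 0.3, 2.0)
        return torch.multinomial(torch.softmax(logits / temp, dim=-1), 1).item()
```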
[36] Feel the Difference? A Comparative Analysis of Emotional Arcs in Real and LLM-Generated CBT Sessions
Xiaoyi Wang, Jiwei Zhang, Guangtao Zhang, Honglei Guo
Main category: cs.CL
TL;DR: LLM-generated therapy dialogues lack emotional fidelity compared to real CBT sessions, showing less variability, emotion-laden language, and authentic emotional dynamics.
Details
Motivation: To assess whether synthetic therapy dialogues from LLMs capture the nuanced emotional dynamics of real therapy sessions, particularly in Cognitive Behavioral Therapy.
Method: Adapted Utterance Emotion Dynamics framework to analyze emotional trajectories across valence, arousal, and dominance dimensions in both real (transcribed from public videos) and synthetic (CACTUS dataset) CBT dialogues.
Result: Synthetic dialogues are fluent but diverge from real conversations in emotional properties: real sessions show greater emotional variability, more emotion-laden language, and authentic reactivity/regulation patterns. Emotional arc similarity is low, especially for clients.
Conclusion: Current LLM-generated therapy data has limitations in emotional fidelity, highlighting the need for better emotional representation in mental health applications. Introduced RealCBT dataset for future research.
Abstract: Synthetic therapy dialogues generated by large language models (LLMs) are increasingly used in mental health NLP to simulate counseling scenarios, train models, and supplement limited real-world data. However, it remains unclear whether these synthetic conversations capture the nuanced emotional dynamics of real therapy. In this work, we conduct the first comparative analysis of emotional arcs between real and LLM-generated Cognitive Behavioral Therapy dialogues. We adapt the Utterance Emotion Dynamics framework to analyze fine-grained affective trajectories across valence, arousal, and dominance dimensions. Our analysis spans both full dialogues and individual speaker roles (counselor and client), using real sessions transcribed from public videos and synthetic dialogues from the CACTUS dataset. We find that while synthetic dialogues are fluent and structurally coherent, they diverge from real conversations in key emotional properties: real sessions exhibit greater emotional variability, more emotion-laden language, and more authentic patterns of reactivity and regulation. Moreover, emotional arc similarity between real and synthetic speakers is low, especially for clients. These findings underscore the limitations of current LLM-generated therapy data and highlight the importance of emotional fidelity in mental health applications. We introduce RealCBT, a curated dataset of real CBT sessions, to support future research in this space.
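To give a flavor of emotion-arc analysis, the sketch below scores utterances with a toy valence lexicon and smooths them into a trajectory whose variability can be compared across corpora; real UED analyses use full VAD lexicons:

```python
# An utterance-emotion-dynamics-style arc sketch: score each utterance with
# a valence lexicon and smooth over the session to get a trajectory. The
# three-word lexicon and the session are toy stand-ins.
import numpy as np

VALENCE = {"hopeless": -0.8, "anxious": -0.6, "better": 0.7}  # toy lexicon

def arc(utterances, window=3):
    raw = [np.mean([VALENCE.get(w.lower(), 0.0) for w in u.split()])
           for u in utterances]
    kernel = np.ones(window) / window
    return np.convolve(raw, kernel, mode="same")  # smoothed trajectory

session = ["I feel hopeless lately", "Everything makes me anxious",
           "Talking helps a little", "I feel better today"]
trajectory = arc(session)
print(trajectory, "variability:", trajectory.std())
```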
[37] Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection
Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, George Turkiyyah, Bernard Ghanem
Main category: cs.CL
TL;DR: ROSI is a white-box method that improves LLM safety by amplifying refusal-mediating directions through simple rank-one weight modifications, increasing safety refusal rates while preserving model utility.
Details
Motivation: Existing safety mechanisms in LLMs can be bypassed by removing specific representational directions, so the authors propose an opposite approach to amplify safety alignment rather than bypass it.
Method: Rank-One Safety Injection (ROSI) - a fine-tuning-free method that applies rank-one weight modifications to all residual stream write matrices, using safety directions computed from harmful/harmless instruction pairs.
Result: ROSI consistently increases safety refusal rates (evaluated by Llama Guard 3) while preserving utility on standard benchmarks (MMLU, HellaSwag, Arc), and can re-align uncensored models using their latent safety directions.
Conclusion: Targeted weight steering is a cheap and effective mechanism to improve LLM safety that complements more resource-intensive fine-tuning approaches.
Abstract: Safety alignment in Large Language Models (LLMs) often involves mediating internal representations to refuse harmful requests. Recent research has demonstrated that these safety mechanisms can be bypassed by ablating or removing specific representational directions within the model. In this paper, we propose the opposite approach: Rank-One Safety Injection (ROSI), a white-box method that amplifies a model’s safety alignment by permanently steering its activations toward the refusal-mediating subspace. ROSI operates as a simple, fine-tuning-free rank-one weight modification applied to all residual stream write matrices. The required safety direction can be computed from a small set of harmful and harmless instruction pairs. We show that ROSI consistently increases safety refusal rates - as evaluated by Llama Guard 3 - while preserving the utility of the model on standard benchmarks such as MMLU, HellaSwag, and Arc. Furthermore, we show that ROSI can also re-align ‘uncensored’ models by amplifying their own latent safety directions, demonstrating its utility as an effective last-mile safety procedure. Our results suggest that targeted, interpretable weight steering is a cheap and potent mechanism to improve LLM safety, complementing more resource-intensive fine-tuning paradigms.
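A rank-one update in the spirit of ROSI might look like the sketch below; the difference-of-means direction estimate, the r (rᵀ W) update form, and alpha are assumptions based on the abstract's description, not the paper's exact rule:

```python
# A rank-one safety-injection sketch: estimate a refusal direction as the
# mean activation difference between harmful and harmless prompts, then
# nudge a residual-stream write matrix toward it. torch.outer(r, r @ W) is
# rank-one; alpha and the set of matrices to modify are knobs.
import torch

def safety_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor):
    d = harmful_acts.mean(0) - harmless_acts.mean(0)
    return d / d.norm()

@torch.no_grad()
def inject(write_matrix: torch.Tensor, r: torch.Tensor, alpha=0.05):
    """write_matrix: (d_model, d_in) matrix writing into the residual stream."""
    write_matrix += alpha * torch.outer(r, r @ write_matrix)
    return write_matrix

W = torch.randn(8, 8)
r = safety_direction(torch.randn(16, 8), torch.randn(16, 8))
inject(W, r)
```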
[38] Signs of Struggle: Spotting Cognitive Distortions across Language and Register
Abhishek Kuber, Enrico Liscio, Ruixuan Zhang, Caroline Figueroa, Pradeep K. Murukannaiah
Main category: cs.CL
TL;DR: First cross-lingual study of cognitive distortion detection in Dutch adolescent forum posts, showing language/style changes affect performance but domain adaptation works best.
Details
Motivation: Rising youth mental health issues need automated detection of cognitive distortions for early intervention, but prior work focused only on English clinical data.
Method: Cross-lingual and cross-register generalization analysis using Dutch adolescent forum posts, testing domain adaptation methods.
Result: Language and writing style changes significantly affect model performance, but domain adaptation methods show the most promise.
Conclusion: Domain adaptation is crucial for effective cross-lingual cognitive distortion detection in non-clinical digital text.
Abstract: Rising mental health issues among youth have increased interest in automated approaches for detecting early signs of psychological distress in digital text. One key focus is the identification of cognitive distortions, irrational thought patterns that have a role in aggravating mental distress. Early detection of these distortions may enable timely, low-cost interventions. While prior work has focused on English clinical data, we present the first in-depth study of cross-lingual and cross-register generalization of cognitive distortion detection, analyzing forum posts written by Dutch adolescents. Our findings show that while changes in language and writing style can significantly affect model performance, domain adaptation methods show the most promise.
[39] GDLLM: A Global Distance-aware Modeling Approach Based on Large Language Models for Event Temporal Relation Extraction
Jie Zhao, Wanting Ning, Yuxiao Fei, Yubo Feng, Lishuang Li
Main category: cs.CL
TL;DR: GDLLM is a novel approach that enhances Large Language Models for Event Temporal Relation Extraction by incorporating distance-aware graph structures and temporal feature learning to better handle long-distance dependencies and minority classes.
Details
Motivation: Small Language Models struggle with minority class relations in imbalanced datasets, while Large Language Models with manual prompts introduce noise and fail to properly handle long-distance dependencies between events.
Method: Proposes GDLLM with distance-aware graph structure using Graph Attention Network to capture long-distance dependencies, and a temporal feature learning paradigm with soft inference to enhance short-distance relation identification by supplementing LLM probabilistic information into multi-head attention.
Result: Achieves state-of-the-art performance on TB-Dense and MATRES datasets, substantially enhancing minority relation class performance and overall learning ability.
Conclusion: The global distance-aware modeling approach effectively captures global features, improving ETRE performance by addressing limitations of both SLMs and LLMs in handling long-distance dependencies and minority classes.
Abstract: In Natural Language Processing (NLP), Event Temporal Relation Extraction (ETRE) aims to recognize the temporal relation between two events. Prior studies have noted the importance of language models for ETRE. However, the restricted pre-trained knowledge of Small Language Models (SLMs) limits their capability to handle minority class relations in imbalanced classification datasets. For Large Language Models (LLMs), researchers adopt manually designed prompts or instructions, which may introduce extra noise, leading to interference with the model’s judgment of the long-distance dependencies between events. To address these issues, we propose GDLLM, a Global Distance-aware modeling approach based on LLMs. We first present a distance-aware graph structure utilizing a Graph Attention Network (GAT) to assist the LLMs in capturing long-distance dependency features. Additionally, we design a temporal feature learning paradigm based on soft inference to augment the identification of relations within a short-distance proximity band, which supplements the probabilistic information generated by LLMs into the multi-head attention mechanism. Since the global feature can be captured effectively, our framework substantially enhances the performance of minority relation classes and improves the overall learning ability. Experiments on two publicly available datasets, TB-Dense and MATRES, demonstrate that our approach achieves state-of-the-art (SOTA) performance.
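For context, a single graph-attention layer over event nodes (the structural ingredient GDLLM builds on) can be written in plain PyTorch; this is a generic GAT layer, not the paper's full LLM-coupled model:

```python
# A generic graph-attention layer sketch: edges link event mentions so that
# attention can carry long-distance dependency features between nodes.
import torch
import torch.nn as nn

class GATLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(dim, dim, bias=False)
        self.a = nn.Linear(2 * dim, 1, bias=False)
    def forward(self, x, adj):                    # x: (N, dim), adj: (N, N)
        h = self.w(x)
        pairs = torch.cat([h.unsqueeze(1).expand(-1, h.size(0), -1),
                           h.unsqueeze(0).expand(h.size(0), -1, -1)], dim=-1)
        e = torch.nn.functional.leaky_relu(self.a(pairs).squeeze(-1))
        e = e.masked_fill(adj == 0, float("-inf"))  # attend only along edges
        return torch.softmax(e, dim=-1) @ h         # attention-weighted update

x = torch.randn(4, 16)                              # four event nodes
adj = torch.tensor([[1,1,0,1],[1,1,1,0],[0,1,1,1],[1,0,1,1]])
print(GATLayer(16)(x, adj).shape)
```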
[40] MSRS: Evaluating Multi-Source Retrieval-Augmented Generation
Rohan Phanse, Yijie Zhou, Kejian Shi, Wencai Zhang, Yixin Liu, Yilun Zhao, Arman Cohan
Main category: cs.CL
TL;DR: A framework for evaluating RAG systems on multi-source information integration and long-form generation, with new benchmarks showing retrieval effectiveness is crucial and reasoning models outperform standard LLMs.
Details
Motivation: Real-world applications require RAG systems to integrate information from multiple sources and generate comprehensive responses, but current evaluations focus on single-source or factoid-based scenarios.
Method: Developed a scalable framework to create evaluation benchmarks (MSRS-Story and MSRS-Meet) for multi-source retrieval and synthesis, testing various RAG pipelines with different retrievers and LLMs.
Result: Generation quality heavily depends on retrieval effectiveness, which varies by task. Multi-source synthesis remains challenging even with perfect retrieval, but reasoning models significantly outperform standard LLMs.
Conclusion: The framework enables better evaluation of RAG systems for complex real-world tasks, highlighting the importance of both retrieval quality and advanced reasoning capabilities for multi-source information integration.
Abstract: Retrieval-augmented systems are typically evaluated in settings where information required to answer the query can be found within a single source or the answer is short-form or factoid-based. However, many real-world applications demand the ability to integrate and summarize information scattered across multiple sources, where no single source is sufficient to respond to the user’s question. In such settings, the retrieval component of a RAG pipeline must recognize a variety of relevance signals, and the generation component must connect and synthesize information across multiple sources. We present a scalable framework for constructing evaluation benchmarks that challenge RAG systems to integrate information across distinct sources and generate long-form responses. Using our framework, we build two new benchmarks on Multi-Source Retrieval and Synthesis: MSRS-Story and MSRS-Meet, representing narrative synthesis and summarization tasks, respectively, that require retrieval from large collections. Our extensive experiments with various RAG pipelines – including sparse and dense retrievers combined with frontier LLMs – reveal that generation quality is highly dependent on retrieval effectiveness, which varies greatly by task. While multi-source synthesis proves challenging even in an oracle retrieval setting, we find that reasoning models significantly outperform standard LLMs at this distinct step.
[41] The Uneven Impact of Post-Training Quantization in Machine Translation
Benjamin Marie, Atsushi Fujita
Main category: cs.CL
TL;DR: First large-scale evaluation of post-training quantization on machine translation across 55 languages shows 4-bit quantization preserves quality for high-resource languages but causes significant degradation for low-resource languages, especially at 2-bit precision.
Details
Motivation: Quantization is essential for deploying large language models on resource-constrained hardware, but its implications for multilingual tasks remain underexplored, particularly for machine translation across diverse languages.
Method: Conducted evaluation of post-training quantization using five LLMs (1.7B to 70B parameters) across 55 languages, comparing four quantization techniques (AWQ, BitsAndBytes, GGUF, and AutoRound) and analyzing interactions with decoding hyperparameters and calibration languages.
Result: 4-bit quantization often preserves translation quality for high-resource languages and large models, but significant degradation occurs for low-resource and typologically diverse languages, especially in 2-bit settings. GGUF variants provided the most consistent performance even at 2-bit precision.
Conclusion: Quantization algorithm choice and model size jointly determine robustness, with language-matched calibration offering benefits primarily in low-bit scenarios. Findings provide actionable insights for deploying multilingual LLMs for machine translation under quantization constraints, especially in low-resource settings.
Abstract: Quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, but its implications for multilingual tasks remain underexplored. We conduct the first large-scale evaluation of post-training quantization (PTQ) on machine translation across 55 languages using five LLMs ranging from 1.7B to 70B parameters. Our analysis reveals that while 4-bit quantization often preserves translation quality for high-resource languages and large models, significant degradation occurs for low-resource and typologically diverse languages, particularly in 2-bit settings. We compare four quantization techniques (AWQ, BitsAndBytes, GGUF, and AutoRound), showing that algorithm choice and model size jointly determine robustness. GGUF variants provide the most consistent performance, even at 2-bit precision. Additionally, we quantify the interactions between quantization, decoding hyperparameters, and calibration languages, finding that language-matched calibration offers benefits primarily in low-bit scenarios. Our findings offer actionable insights for deploying multilingual LLMs for machine translation under quantization constraints, especially in low-resource settings.
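One cell of such an evaluation grid might be run as below, loading a model in 4-bit with BitsAndBytes through transformers; the checkpoint and language pair are placeholders, not the paper's exact setup:

```python
# Load a model 4-bit and score a translation prompt. The study sweeps 55
# languages, five models, and four quantizers; this shows a single cell.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

name = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
    device_map="auto")

prompt = "Translate to South Azerbaijani: The weather is nice today.\nTranslation:"
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))
```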
[42] SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement
Yuan Ge, Junxiang Zhang, Xiaoqian Liu, Bei Li, Xiangnan Ma, Chenglong Wang, Kaiyang Ye, Yangfan Du, Linfeng Zhang, Yuxin Huang, Tong Xiao, Zhengtao Yu, JingBo Zhu
Main category: cs.CL
TL;DR: SageLM is an end-to-end, multi-aspect, explainable speech LLM for comprehensive Speech-to-Speech model evaluation that jointly assesses semantic and acoustic dimensions using rationale-based supervision and synthetic preference data.
Details
Motivation: Evaluating Speech-to-Speech Large Language Models remains a fundamental challenge, as existing cascaded approaches disregard acoustic features and there's a scarcity of speech preference data for proper evaluation.
Method: Proposes SageLM with three key innovations: 1) joint assessment of semantic and acoustic dimensions, 2) rationale-based supervision for explainability, 3) SpeechFeedback synthetic preference dataset with two-stage training paradigm to address data scarcity.
Result: SageLM achieves 82.79% agreement rate with human evaluators, outperforming cascaded baselines by 7.42% and SLM-based baselines by 26.20%.
Conclusion: SageLM provides a comprehensive, explainable framework for Speech-to-Speech LLM evaluation that effectively addresses the limitations of existing methods through joint semantic-acoustic assessment and synthetic data generation.
Abstract: Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive S2S LLM evaluation. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce SpeechFeedback, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.
[43] How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on $τ$-bench
Venkatesh Mishra, Amir Saeidi, Satyam Raj, Mutsumi Nakamura, Jayanth Srinivasa, Gaowen Liu, Ali Payani, Chitta Baral
Main category: cs.CL
TL;DR: IRMA framework improves LLM tool-calling agents by automatically reformulating queries with domain rules and tool suggestions, achieving significant performance gains over existing methods.
Details
Motivation: LLM-based agents struggle with consistent reasoning, policy adherence, and information extraction in multi-turn conversational environments like τ-bench, requiring better input formulation.
Method: Proposed Input-Reformulation Multi-Agent (IRMA) framework that automatically reformulates user queries augmented with relevant domain rules and tool suggestions to improve agent decision making.
Result: IRMA significantly outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1% respectively in overall pass^5 scores.
Conclusion: IRMA demonstrates superior reliability and consistency compared to other methods in dynamic environments, highlighting the importance of proper input formulation for LLM tool-calling agents.
Abstract: Recent advances in reasoning and planning capabilities of large language models (LLMs) have enabled their potential as autonomous agents capable of tool use in dynamic environments. However, in multi-turn conversational environments like $\tau$-bench, these agents often struggle with consistent reasoning, adherence to domain-specific policies, and extracting correct information over a long horizon of tool-calls and conversation. To capture and mitigate these failures, we conduct a comprehensive manual analysis of the common errors occurring in the conversation trajectories. We then experiment with reformulations of inputs to the tool-calling agent for improvement in agent decision making. Finally, we propose the Input-Reformulation Multi-Agent (IRMA) framework, which automatically reformulates user queries augmented with relevant domain rules and tool suggestions for the tool-calling agent to focus on. The results show that IRMA significantly outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores. These findings highlight the superior reliability and consistency of IRMA compared to other methods in dynamic environments.
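The reformulation step itself is simple to picture; the sketch below wraps a raw query with retrieved rules and tool hints (all strings are invented, and the paper's multi-agent pipeline automates the retrieval this sketch takes as given):

```python
# An IRMA-style input reformulation sketch: before each agent turn, wrap the
# raw user query with domain rules and tool hints so the tool-calling model
# attends to them. The rule store and tool list are stand-ins.
def reformulate(query: str, rules: list[str], tools: list[str]) -> str:
    rule_block = "\n".join(f"- {r}" for r in rules)
    tool_block = ", ".join(tools)
    return (f"User query: {query}\n\n"
            f"Relevant domain policies:\n{rule_block}\n\n"
            f"Likely useful tools: {tool_block}\n"
            "Follow the policies above before calling any tool.")

print(reformulate(
    "Cancel my flight and refund me",
    ["Refunds require cancellation >24h before departure."],
    ["get_reservation", "cancel_reservation"]))
```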
[44] STARE at the Structure: Steering ICL Exemplar Selection with Structural Alignment
Jiaqian Li, Qisheng Hu, Jing Li, Wenya Wang
Main category: cs.CL
TL;DR: A novel two-stage exemplar selection strategy for in-context learning that improves semantic parsing performance by ensuring both semantic relevance and structural alignment in selected exemplars.
Details
Motivation: Current ICL exemplar selection strategies for structured prediction tasks like semantic parsing often overlook structural alignment, leading to suboptimal performance and poor generalization.
Method: A two-stage approach: 1) Fine-tune a BERT-based retriever with structure-aware supervision to select semantically relevant and structurally aligned exemplars, 2) Enhance with a plug-in module that amplifies syntactically meaningful information in hidden representations.
Result: Consistently outperforms existing baselines on four benchmarks across three semantic parsing tasks with multiple recent LLMs as inference-time models.
Conclusion: The proposed method achieves strong balance between efficiency, generalizability, and performance for ICL in semantic parsing tasks through structure-aware exemplar selection.
Abstract: In-Context Learning (ICL) has become a powerful paradigm that enables LLMs to perform a wide range of tasks without task-specific fine-tuning. However, the effectiveness of ICL heavily depends on the quality of exemplar selection. In particular, for structured prediction tasks such as semantic parsing, existing ICL selection strategies often overlook structural alignment, leading to suboptimal performance and poor generalization. To address this issue, we propose a novel two-stage exemplar selection strategy that achieves a strong balance between efficiency, generalizability, and performance. First, we fine-tune a BERT-based retriever using structure-aware supervision, guiding it to select exemplars that are both semantically relevant and structurally aligned. Then, we enhance the retriever with a plug-in module, which amplifies syntactically meaningful information in the hidden representations. This plug-in is model-agnostic, requires minimal overhead, and can be seamlessly integrated into existing pipelines. Experiments on four benchmarks spanning three semantic parsing tasks demonstrate that our method consistently outperforms existing baselines with multiple recent LLMs as inference-time models.
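To illustrate the structural-alignment signal, the sketch below re-ranks retrieved exemplars by overlap between logical-form operator bags, a toy proxy for the learned structure-aware supervision in the paper; all examples are invented:

```python
# A structural re-ranking sketch: after embedding-based retrieval, re-rank
# candidate exemplars by structural overlap between their logical-form
# skeletons and a predicted skeleton for the query.
from collections import Counter
import re

def skeleton(logical_form: str) -> Counter:
    return Counter(re.findall(r"[a-z_]+(?=\()", logical_form))  # operator bag

def structural_sim(a: str, b: str) -> float:
    sa, sb = skeleton(a), skeleton(b)
    overlap = sum((sa & sb).values())
    return overlap / max(1, sum((sa | sb).values()))

candidates = [("largest city", "argmax(city(x), population(x))"),
              ("rivers in texas", "filter(river(x), loc(x, texas))")]
query_skeleton = "argmax(state(x), area(x))"
ranked = sorted(candidates, key=lambda c: -structural_sim(c[1], query_skeleton))
print(ranked[0][0])  # -> "largest city", the structurally aligned exemplar
```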
[45] ProactiveEval: A Unified Evaluation Framework for Proactive Dialogue Agents
Tianjian Liu, Fanqi Wan, Jiajian Guo, Xiaojun Quan
Main category: cs.CL
TL;DR: ProactiveEval is a unified framework for evaluating LLM proactive dialogue capabilities through target planning and dialogue guidance metrics across multiple domains, with automated data generation and comprehensive testing of 22 LLMs.
Details
Motivation: Existing proactive dialogue research focuses on domain-specific scenarios, leading to fragmented evaluations that limit comprehensive exploration of models' proactive conversation abilities.
Method: Proposed ProactiveEval framework that decomposes proactive dialogue into target planning and dialogue guidance, establishes evaluation metrics across domains, and enables automatic generation of diverse evaluation data with 328 environments across 6 domains.
Result: Testing 22 different LLMs showed DeepSeek-R1 excels at target planning while Claude-3.7-Sonnet performs best on dialogue guidance tasks. The study also investigated how reasoning capabilities influence proactive behaviors.
Conclusion: The framework provides comprehensive evaluation of proactive dialogue capabilities, revealing performance differences among models and offering insights for future model development in proactive conversation abilities.
Abstract: Proactive dialogue has emerged as a critical and challenging research problem in advancing large language models (LLMs). Existing works predominantly focus on domain-specific or task-oriented scenarios, which leads to fragmented evaluations and limits the comprehensive exploration of models’ proactive conversation abilities. In this work, we propose ProactiveEval, a unified framework designed for evaluating proactive dialogue capabilities of LLMs. This framework decomposes proactive dialogue into target planning and dialogue guidance, establishing evaluation metrics across various domains. Moreover, it also enables the automatic generation of diverse and challenging evaluation data. Based on the proposed framework, we develop 328 evaluation environments spanning 6 distinct domains. Through experiments with 22 different types of LLMs, we show that DeepSeek-R1 and Claude-3.7-Sonnet exhibit exceptional performance on target planning and dialogue guidance tasks, respectively. Finally, we investigate how reasoning capabilities influence proactive behaviors and discuss their implications for future model development.
[46] Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution
Chen Chen, Yuchen Sun, Jiaxin Gao, Xueluan Gong, Qian Wang, Ziyao Wang, Yongsen Zheng, Kwok-Yan Lam
Main category: cs.CL
TL;DR: LETHE is a novel defense method that eliminates backdoor behaviors in LLMs through internal knowledge dilution and external prompt distraction, achieving up to 98% reduction in attack success rate while maintaining model utility.
Details
Motivation: LLMs are vulnerable to backdoor attacks that cause harmful responses when specific triggers are activated, and existing defenses lack comprehensiveness against advanced attack scenarios.
Method: LETHE uses internal knowledge dilution by training a clean model and merging it with the backdoored model, plus external distraction by incorporating benign evidence into prompts to shift attention from backdoor features.
Result: Outperforms 8 state-of-the-art defense baselines against 8 backdoor attacks, reduces attack success rate by up to 98%, maintains model utility, and proves cost-efficient and robust against adaptive attacks.
Conclusion: LETHE provides a comprehensive and effective defense solution against advanced backdoor attacks in LLMs through dual internal-external mechanisms.
Abstract: Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks. However, they remain vulnerable to backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated. Existing backdoor defenses either lack comprehensiveness, focusing on narrow trigger settings, detection-only mechanisms, and limited domains, or fail to withstand advanced scenarios like model-editing-based, multi-trigger, and triggerless attacks. In this paper, we present LETHE, a novel method to eliminate backdoor behaviors from LLMs through knowledge dilution using both internal and external mechanisms. Internally, LETHE leverages a lightweight dataset to train a clean model, which is then merged with the backdoored model to neutralize malicious behaviors by diluting the backdoor impact within the model’s parametric memory. Externally, LETHE incorporates benign and semantically relevant evidence into the prompt to distract the LLM’s attention from backdoor features. Experimental results on classification and generation domains across 5 widely used LLMs demonstrate that LETHE outperforms 8 state-of-the-art defense baselines against 8 backdoor attacks. LETHE reduces the attack success rate of advanced backdoor attacks by up to 98% while maintaining model utility. Furthermore, LETHE has proven to be cost-efficient and robust against adaptive backdoor attacks.
[47] An Agile Method for Implementing Retrieval Augmented Generation Tools in Industrial SMEs
Mathieu Bourdin, Anas Neumann, Thomas Paviot, Robert Pellerin, Samir Lamouri
Main category: cs.CL
TL;DR: EASI-RAG is a structured method for deploying RAG systems in SMEs, enabling fast implementation by non-experts with high accuracy and user adoption.
Details
Motivation: SMEs struggle to deploy RAG systems due to limited resources and NLP expertise, despite RAG's ability to address LLM limitations like hallucinations and outdated knowledge.
Method: EASI-RAG uses method engineering principles with defined roles, activities, and techniques. Validated through a real-world case study in an environmental testing laboratory where operators query operational procedure data.
Result: System deployed in under a month by team with no RAG experience, achieved high user adoption, accurate answers, and enhanced data reliability through iterative improvements.
Conclusion: EASI-RAG successfully enables RAG deployment in industrial SMEs, demonstrating potential for broader adoption. Future work needs generalization across use cases and integration with fine-tuned models.
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful solution to mitigate the limitations of Large Language Models (LLMs), such as hallucinations and outdated knowledge. However, deploying RAG-based tools in Small and Medium Enterprises (SMEs) remains a challenge due to their limited resources and lack of expertise in natural language processing (NLP). This paper introduces EASI-RAG, Enterprise Application Support for Industrial RAG, a structured, agile method designed to facilitate the deployment of RAG systems in industrial SME contexts. EASI-RAG is based on method engineering principles and comprises well-defined roles, activities, and techniques. The method was validated through a real-world case study in an environmental testing laboratory, where a RAG tool was implemented to answer operators’ queries using data extracted from operational procedures. The system was deployed in under a month by a team with no prior RAG experience and was later iteratively improved based on user feedback. Results demonstrate that EASI-RAG supports fast implementation and high user adoption, delivers accurate answers, and enhances the reliability of the underlying data. This work highlights the potential of RAG deployment in industrial SMEs. Future work includes generalization across diverse use cases and further integration with fine-tuned models.
[48] Re-Representation in Sentential Relation Extraction with Sequence Routing Algorithm
Ramazan Ali Bahrami, Ramin Yahyapour
Main category: cs.CL
TL;DR: Proposes dynamic routing in capsules for sentential relation extraction, showing SOTA performance on multiple datasets but identifying noise in Wikidata labels and re-representation as key challenges.
Details
Motivation: To improve sentential relation extraction performance using dynamic routing in capsules and understand the factors affecting performance across different datasets.
Method: Dynamic routing in capsules for sentential relation extraction, evaluated on Tacred, Tacredrev, Retacred, Conll04, and Wikidata datasets.
Result: Outperforms state-of-the-art on common datasets but shows low performance on Wikidata due to label noise. Demonstrates better re-representation capability compared to vanilla models.
Conclusion: Noise in distantly supervised datasets and re-representation challenges are key factors in sentential RE performance. The proposed capsule approach shows superior re-representation capabilities.
Abstract: Sentential relation extraction (RE) is an important task in natural language processing (NLP). In this paper we propose to perform sentential RE with dynamic routing in capsules. We first show that the proposed approach outperforms the state of the art on the common sentential relation extraction datasets Tacred, Tacredrev, Retacred, and Conll04. We then investigate potential reasons for its good performance on these datasets, and yet low performance on another similar but larger sentential RE dataset, Wikidata. We identify noise in Wikidata labels as one of the factors that can hinder performance. Additionally, we show that better performance is associated with better re-representation, a term from neuroscience referring to the change of representation in the human brain that improves the match at comparison time. For example, in the analogy King:Queen::Man:Woman, re-representation at comparison time increases the similarity between the related head terms (King, Man) and tail terms (Queen, Woman). Our observations show that the proposed model performs re-representation better than the vanilla model it is compared with. To that end, besides noise in the labels of distantly supervised RE datasets, we propose re-representation as a challenge in sentential RE.
[49] Enabling Equitable Access to Trustworthy Financial Reasoning
William Jurayj, Nils Holzenberger, Benjamin Van Durme
Main category: cs.CL
TL;DR: A neuro-symbolic approach combining LLMs with symbolic solvers for tax filing that improves accuracy and reduces costs below real-world averages.
Details
Motivation: Tax filing requires complex reasoning with high accuracy needs due to costly penalties, making pure LLMs unsuitable. There's a need for automated systems that are both accurate and auditable.
Method: Integrates large language models with symbolic solvers, translates plain-text tax rules into formal logic programs, and uses intelligently retrieved exemplars for formal case representations.
Result: The system dramatically improves performance on the SARA dataset and reduces deployment costs well below real-world averages of $270 and 13 hours per filing.
Conclusion: Neuro-symbolic architectures show promise for increasing equitable access to reliable tax assistance while being economically feasible.
Abstract: According to the United States Internal Revenue Service, “the average American spends $270 and 13 hours filing their taxes”. Even beyond the U.S., tax filing requires complex reasoning, combining application of overlapping rules with numerical calculations. Because errors can incur costly penalties, any automated system must deliver high accuracy and auditability, making modern large language models (LLMs) poorly suited for this task. We propose an approach that integrates LLMs with a symbolic solver to calculate tax obligations. We evaluate variants of this system on the challenging StAtutory Reasoning Assessment (SARA) dataset, and include a novel method for estimating the cost of deploying such a system based on real-world penalties for tax errors. We further show how up-front translation of plain-text rules into formal logic programs, combined with intelligently retrieved exemplars for formal case representations, can dramatically improve performance on this task and reduce costs to well below real-world averages. Our results demonstrate the promise and economic feasibility of neuro-symbolic architectures for increasing equitable access to reliable tax assistance.
[50] Probing Pre-Trained Language Models for Cross-Cultural Differences in Values
Arnav Arora, Lucie-Aimée Kaffee, Isabelle Augenstein
Main category: cs.CL
TL;DR: PTLMs capture cultural value differences but weakly align with established value surveys, raising concerns about cross-cultural applications.
Details
Motivation: To systematically study how values embedded in pre-trained language models vary across cultures and whether they align with existing cross-cultural value theories and surveys.
Method: Introduced probes to investigate which cultural values are embedded in pre-trained language models and compared them with established value surveys.
Result: Found that PTLMs capture differences in values across cultures, but these captured values only weakly align with established value surveys.
Conclusion: Highlights implications of using misaligned models in cross-cultural settings and discusses potential ways to align PTLMs with value surveys for better cultural representation.
Abstract: Language embeds information about social, cultural, and political values people hold. Prior work has explored social and potentially harmful biases encoded in Pre-Trained Language models (PTLMs). However, there has been no systematic study investigating how values embedded in these models vary across cultures. In this paper, we introduce probes to study which values across cultures are embedded in these models, and whether they align with existing theories and cross-cultural value surveys. We find that PTLMs capture differences in values across cultures, but those only weakly align with established value surveys. We discuss implications of using mis-aligned models in cross-cultural settings, as well as ways of aligning PTLMs with value surveys.
[51] ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models
Zhihang Yuan, Yuzhang Shang, Yue Song, Dawei Yang, Qiang Wu, Yan Yan, Guangyu Sun
Main category: cs.CL
TL;DR: ASVD is a training-free compression method for LLMs that addresses activation distribution variance and layer sensitivity through weight transformation and iterative calibration, achieving 10-30% network compression and 50% KV cache reduction.
Details
Motivation: To facilitate wider adoption of Large Language Models by addressing the challenges of weight low-rank decomposition stemming from activation distribution variance and varying layer sensitivity.
Method: Proposes Activation-aware Singular Value Decomposition (ASVD) which transforms weight matrices based on activation distribution to handle outliers, and uses iterative calibration to optimize layer-specific decomposition. Also applies ASVD to compress KV cache by reducing channel dimension.
Result: Achieves 10-30% network compression and 50% KV cache reduction without performance degradation, all in a training-free manner.
Conclusion: ASVD provides an effective training-free compression paradigm for LLMs that successfully addresses key challenges in low-rank decomposition and enables significant memory savings for both weights and KV cache.
Abstract: In this paper, we introduce a new post-training compression paradigm for Large Language Models (LLMs) to facilitate their wider adoption. We delve into LLM weight low-rank decomposition, and find that the challenges of this task stem from (1) the distribution variance in the LLM activations and (2) the sensitivity difference among various kinds of layers. To address these issues, we propose a training-free approach called Activation-aware Singular Value Decomposition (ASVD). Specifically, ASVD manages activation outliers by transforming the weight matrix based on the activation distribution. This transformation allows the outliers in the activation matrix to be absorbed into the transformed weight matrix, thereby enhancing decomposition accuracy. Additionally, we propose an efficient iterative calibration process to optimize layer-specific decomposition by addressing the varying sensitivity of different LLM layers. In this way, ASVD can compress a network by 10%-30%. Based on the success of the low-rank decomposition of projection matrices in the self-attention module, we further introduce ASVD to compress the KV cache. By reducing the channel dimension of KV activations, memory requirements for KV cache can be largely reduced. ASVD can further achieve 50% KV cache reductions without performance drop in a training-free manner.
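The activation-aware transformation can be sketched as scaling the weight's input channels by per-channel activation magnitudes before the SVD, then folding the scaling back afterwards; the diagonal-scaling form below is a plausible reading of the abstract, not the authors' exact formulation.

```python
import torch

def asvd_factorize(W, act_scale, rank):
    """Activation-aware low-rank factorization of a weight matrix.

    W: (out_dim, in_dim) weight; act_scale: (in_dim,) per-channel activation
    magnitudes (assumed to come from calibration data); rank: target rank.
    Returns A, B with W ~= A @ B.
    """
    Ws = W * act_scale                     # scale input channels by activation stats
    U, S, Vh = torch.linalg.svd(Ws, full_matrices=False)
    A = U[:, :rank] * S[:rank]             # (out_dim, rank)
    B = Vh[:rank, :] / act_scale           # fold the scaling back: (rank, in_dim)
    return A, B

W = torch.randn(256, 512)
scale = torch.rand(512) + 0.5
A, B = asvd_factorize(W, scale, rank=64)   # W ~= A @ B at rank 64
```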
[52] LGDE: Local Graph-based Dictionary Expansion
Juni Schindler, Sneha Jha, Xixuan Zhang, Kilian Buehling, Annett Heft, Mauricio Barahona
Main category: cs.CL
TL;DR: LGDE is a method that uses manifold learning and network science to discover semantic word neighborhoods by creating similarity graphs from word embeddings and performing local community detection through graph diffusion.
Details
Motivation: To improve information retrieval tasks like database queries and online data collection by expanding dictionaries of pre-selected keywords through better semantic neighborhood discovery.
Method: Creates word similarity graph from word embeddings geometry, then performs local community detection using graph diffusion to explore nonlinear semantic associations beyond direct pairwise similarities.
Result: LGDE enriches keyword lists with improved performance compared to direct word similarity or co-occurrence methods, validated on English-language corpora and real-world conspiracy-related dictionary expansion.
Conclusion: LGDE effectively expands seed dictionaries with more useful keywords by leveraging manifold-learning-based similarity networks, as confirmed by empirical results and expert assessment.
Abstract: We present Local Graph-based Dictionary Expansion (LGDE), a method for data-driven discovery of the semantic neighbourhood of words using tools from manifold learning and network science. At the heart of LGDE lies the creation of a word similarity graph from the geometry of word embeddings followed by local community detection based on graph diffusion. The diffusion in the local graph manifold allows the exploration of the complex nonlinear geometry of word embeddings to capture word similarities based on paths of semantic association, over and above direct pairwise similarities. Exploiting such semantic neighbourhoods enables the expansion of dictionaries of pre-selected keywords, an important step for tasks in information retrieval, such as database queries and online data collection. We validate LGDE on two user-generated English-language corpora and show that LGDE enriches the list of keywords with improved performance relative to methods based on direct word similarities or co-occurrences. We further demonstrate our method through a real-world use case from communication science, where LGDE is evaluated quantitatively on the expansion of a conspiracy-related dictionary from online data collected and analysed by domain experts. Our empirical results and expert user assessment indicate that LGDE expands the seed dictionary with more useful keywords due to the manifold-learning-based similarity network.
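A minimal sketch of the diffusion step, assuming word embeddings are given: build a k-nearest-neighbour cosine graph and run personalized PageRank from the seed words. LGDE's actual manifold construction and local community detection are more refined than this stand-in.

```python
import numpy as np

def seed_neighbourhood(emb, seeds, alpha=0.85, k=10, top=20):
    """Score words by graph diffusion (personalized PageRank) from seed words.

    emb: (n, d) word embeddings; seeds: indices of seed dictionary words.
    """
    X = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    S = X @ X.T                             # cosine similarities
    np.fill_diagonal(S, -np.inf)            # a word is not its own neighbour
    A = np.zeros_like(S)
    for i in range(len(S)):                 # keep only each word's k nearest neighbours
        nn = np.argsort(-S[i])[:k]
        A[i, nn] = np.maximum(S[i, nn], 0)
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)   # row-stochastic
    r = np.zeros(len(S)); r[seeds] = 1 / len(seeds)           # restart distribution
    p = r.copy()
    for _ in range(50):                     # power iteration for diffusion scores
        p = (1 - alpha) * r + alpha * P.T @ p
    return np.argsort(-p)[:top]             # highest-scoring candidate words

emb = np.random.default_rng(0).normal(size=(200, 64))   # stand-in embeddings
expanded = seed_neighbourhood(emb, seeds=[0, 1, 2])
```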
[53] Bitune: Leveraging Bidirectional Attention to Improve Decoder-Only LLMs
Dawid J. Kopiczko, Tijmen Blankevoort, Yuki M. Asano
Main category: cs.CL
TL;DR: Bitune enhances decoder-only LLMs by adding bidirectional attention to prompt processing, improving performance on reasoning and language tasks while maintaining compatibility with various finetuning methods.
Details
Motivation: Decoder-only LLMs are limited by unidirectional attention, restricting information flow and expressiveness in tasks requiring bidirectional context understanding.
Method: Incorporates bidirectional attention into prompt processing for pretrained decoder-only LLMs, allowing information to flow in both directions during prompt analysis.
Result: Significant performance improvements on commonsense reasoning, arithmetic, and language understanding tasks. Compatible with parameter-efficient finetuning and full model finetuning.
Conclusion: Bitune effectively addresses the limitations of unidirectional attention in decoder-only LLMs, enhancing their capabilities while maintaining flexibility with existing finetuning approaches.
Abstract: Decoder-only large language models typically rely solely on masked causal attention, which limits their expressiveness by restricting information flow to one direction. We propose Bitune, a method that enhances pretrained decoder-only LLMs by incorporating bidirectional attention into prompt processing. We evaluate Bitune in instruction-tuning and question-answering settings, showing significant improvements in performance on commonsense reasoning, arithmetic, and language understanding tasks. Furthermore, extensive ablation studies validate the role of each component of the method, and demonstrate that Bitune is compatible with various parameter-efficient finetuning techniques and full model finetuning.
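The prompt-bidirectionality idea can be illustrated as an attention-mask pattern: prompt tokens attend to the whole prompt in both directions while generated tokens stay causal. The mask construction below is a sketch inferred from the abstract, not the paper's implementation.

```python
import torch

def bitune_attention_mask(prompt_len: int, total_len: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend).

    Prompt tokens attend to the whole prompt in both directions;
    all later (generated) tokens keep standard causal attention.
    """
    mask = torch.ones(total_len, total_len).tril().bool()   # causal baseline
    mask[:prompt_len, :prompt_len] = True                   # lift causality in prompt
    return mask

print(bitune_attention_mask(prompt_len=3, total_len=5).int())
```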
[54] SoAy: A Solution-based LLM API-using Methodology for Academic Information Seeking
Yuanchun Wang, Jifan Yu, Zijun Yao, Jing Zhang, Yuyang Xie, Shangqing Tu, Yiyang Fu, Youhe Feng, Jinkai Zhang, Jingyao Zhang, Bowen Huang, Yuanyao Li, Huihui Yuan, Lei Hou, Juanzi Li, Jie Tang
Main category: cs.CL
TL;DR: SoAy is a solution-based LLM API-using methodology that uses pre-constructed API calling sequences (solutions) with code to handle complex academic API coupling, achieving 34.58-75.99% performance improvement over state-of-the-art baselines.
Details
Motivation: Current LLM API-using methods struggle with complex API coupling commonly found in academic queries, which increases researchers' information seeking efforts.
Method: SoAy uses code with pre-constructed API calling sequences (solutions) as the reasoning method, where solutions reduce the difficulty for models to understand complex API relationships and code improves reasoning efficiency.
Result: Experimental results on SoAyBench show 34.58-75.99% performance improvement compared to state-of-the-art LLM API-based baselines.
Conclusion: SoAy effectively addresses complex API coupling in academic information seeking and significantly outperforms existing methods, with all resources made publicly available.
Abstract: Applying large language models (LLMs) for academic API usage shows promise in reducing researchers’ academic information seeking efforts. However, current LLM API-using methods struggle with complex API coupling commonly encountered in academic queries. To address this, we introduce SoAy, a solution-based LLM API-using methodology for academic information seeking. It uses code with a solution as the reasoning method, where a solution is a pre-constructed API calling sequence. The addition of the solution reduces the difficulty for the model to understand the complex relationships between APIs. Code improves the efficiency of reasoning. To evaluate SoAy, we introduce SoAyBench, an evaluation benchmark accompanied by SoAyEval, built upon a cloned environment of APIs from AMiner. Experimental results demonstrate a 34.58-75.99% performance improvement compared to state-of-the-art LLM API-based baselines. All datasets, codes, tuned models, and deployed online services are publicly accessible at https://github.com/RUCKBReasoning/SoAy.
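A "solution" in this sense is a pre-constructed API calling sequence expressed as code. The sketch below is hypothetical: search_person and get_papers are invented stand-ins for AMiner-style endpoints, not the paper's actual API.

```python
def search_person(name):            # hypothetical AMiner-style endpoint
    return {"id": "a123", "name": name}

def get_papers(author_id):          # hypothetical endpoint
    return [{"title": "Example Paper", "year": 2023, "citations": 42}]

def solution_most_cited_paper(author_name):
    """A pre-constructed calling sequence for 'most cited paper of X'."""
    author = search_person(author_name)          # step 1: resolve the author
    papers = get_papers(author["id"])            # step 2: fetch their papers
    return max(papers, key=lambda p: p["citations"])  # step 3: aggregate

print(solution_most_cited_paper("Ada Lovelace"))
```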
[55] Enhancing Natural Language Inference Performance with Knowledge Graph for COVID-19 Automated Fact-Checking in Indonesian Language
Arief Purnama Muharram, Ayu Purwarianti
Main category: cs.CL
TL;DR: Using Knowledge Graphs to enhance Natural Language Inference for automated COVID-19 fact-checking in Indonesian, achieving 86.16% accuracy.
Details
Motivation: Overcome COVID-19 misinformation spread through automated fact-checking, addressing performance stagnation in deep learning models due to knowledge limitations during training.
Method: Proposed three-module architecture: fact module processes Knowledge Graph information, NLI module handles semantic relationships between premise and hypothesis, classifier module combines representations for final classification.
Result: Achieved best accuracy of 0.8616, demonstrating significant improvement in NLI performance for fact-checking by incorporating external knowledge from Knowledge Graphs.
Conclusion: Knowledge Graphs are valuable components for enhancing NLI performance in automated fact-checking systems, particularly for combating COVID-19 misinformation.
Abstract: Automated fact-checking is a key strategy to overcome the spread of COVID-19 misinformation on the internet. These systems typically leverage deep learning approaches through Natural Language Inference (NLI) to verify the truthfulness of information based on supporting evidence. However, one challenge that arises in deep learning is performance stagnation due to a lack of knowledge during training. This study proposes using a Knowledge Graph (KG) as external knowledge to enhance NLI performance for automated COVID-19 fact-checking in the Indonesian language. The proposed model architecture comprises three modules: a fact module, an NLI module, and a classifier module. The fact module processes information from the KG, while the NLI module handles semantic relationships between the given premise and hypothesis. The representation vectors from both modules are concatenated and fed into the classifier module to produce the final result. The model was trained using the generated Indonesian COVID-19 fact-checking dataset and the COVID-19 KG Bahasa Indonesia. Our study demonstrates that incorporating KGs can significantly improve NLI performance in fact-checking, achieving the best accuracy of 0.8616. This suggests that KGs are a valuable component for enhancing NLI performance in automated fact-checking.
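A minimal sketch of the three-module wiring, assuming the fact and NLI modules each produce a fixed-size vector; the dimensions and the two-layer classifier head are illustrative guesses, since the abstract only specifies that the two representations are concatenated.

```python
import torch
import torch.nn as nn

class KGFactChecker(nn.Module):
    """Concatenate fact-module and NLI-module vectors, then classify."""
    def __init__(self, fact_dim=128, nli_dim=768, n_classes=3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(fact_dim + nli_dim, 256), nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, fact_vec, nli_vec):
        # Fuse KG-derived and premise-hypothesis representations
        return self.head(torch.cat([fact_vec, nli_vec], dim=-1))

model = KGFactChecker()
logits = model(torch.randn(4, 128), torch.randn(4, 768))  # batch of 4 claims
```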
[56] Explaining word embeddings with perfect fidelity: Case study in research impact prediction
Lucie Dvorackova, Marcin P. Joachimiak, Michal Cerny, Adriana Kubecova, Vilem Sklenak, Tomas Kliegr
Main category: cs.CL
TL;DR: SMER is a new feature importance method for logistic regression classifiers using word embeddings, providing theoretically perfect fidelity with exact correspondence between word-level explanations and model predictions.
Details
Motivation: Existing model-agnostic explanation methods like LIME produce questionable results when applied to embedding-based classifiers, lacking proper correspondence to the actual model behavior.
Method: Self-model Rated Entities (SMER) method for logistic regression classifiers trained on word embeddings, where the average of word-level SMER scores exactly corresponds to the model’s prediction logit.
Result: Quantitative evaluation on 50,000 research articles shows SMER outperforms LIME, SHAP and global tree surrogates in AOPC curve analysis, with both quantitative and qualitative improvements.
Conclusion: SMER provides theoretically perfect fidelity explanations for embedding-based classifiers, offering superior performance over existing explanation methods while maintaining exact correspondence to model predictions.
Abstract: The best-performing approaches for scholarly document quality prediction are based on embedding models. In addition to their performance when used in classifiers, embedding models can also provide predictions even for words that were not contained in the labelled training data for the classification model, which is important in the context of the ever-evolving research terminology. Although model-agnostic explanation methods, such as Local interpretable model-agnostic explanations, can be applied to explain machine learning classifiers trained on embedding models, these produce results with questionable correspondence to the model. We introduce a new feature importance method, Self-model Rated Entities (SMER), for logistic regression-based classification models trained on word embeddings. We show that SMER has theoretically perfect fidelity with the explained model, as the average of logits of SMER scores for individual words (SMER explanation) exactly corresponds to the logit of the prediction of the explained model. Quantitative and qualitative evaluation is performed through five diverse experiments conducted on 50,000 research articles (papers) from the CORD-19 corpus. Through an AOPC curve analysis, we experimentally demonstrate that SMER produces better explanations than LIME, SHAP and global tree surrogates.
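The fidelity property is easy to verify numerically for a logistic-regression classifier over mean-pooled word embeddings (an assumed setup consistent with the abstract): the document logit equals the mean of per-word scores, since the mean commutes with the linear model.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(7, 50))        # embeddings of a 7-word document (stand-ins)
w, b = rng.normal(size=50), 0.3     # logistic-regression weights and bias

doc_logit = w @ E.mean(axis=0) + b  # model's prediction logit for the document
smer_scores = E @ w + b             # per-word SMER-style scores
assert np.isclose(doc_logit, smer_scores.mean())  # the averages coincide exactly
```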
[57] Query Optimization for Parametric Knowledge Refinement in Retrieval-Augmented Large Language Models
Youan Cong, Pritom Saha Akash, Cheng Wang, Kevin Chen-Chuan Chang
Main category: cs.CL
TL;DR: ERRR framework bridges pre-retrieval information gap in RAG systems through LLM-specific query optimization using knowledge extraction and refinement.
Details
Motivation: To address the pre-retrieval information gap in Retrieval-Augmented Generation systems and improve query optimization tailored for Large Language Models' specific knowledge requirements.
Method: Extract-Refine-Retrieve-Read framework: extracts parametric knowledge from LLMs, uses specialized query optimizer to refine queries, retrieves only pertinent information. Includes trainable scheme with smaller tunable query optimizer refined through knowledge distillation from larger teacher model.
Result: ERRR consistently outperforms existing baselines on various question-answering datasets with different retrieval systems, demonstrating versatility and effectiveness.
Conclusion: ERRR proves to be a versatile and cost-effective module for improving utility and accuracy of RAG systems through optimized query processing.
Abstract: We introduce the Extract-Refine-Retrieve-Read (ERRR) framework, a novel approach designed to bridge the pre-retrieval information gap in Retrieval-Augmented Generation (RAG) systems through query optimization tailored to meet the specific knowledge requirements of Large Language Models (LLMs). Unlike conventional query optimization techniques used in RAG, the ERRR framework begins by extracting parametric knowledge from LLMs, followed by using a specialized query optimizer for refining these queries. This process ensures the retrieval of only the most pertinent information essential for generating accurate responses. Moreover, to enhance flexibility and reduce computational costs, we propose a trainable scheme for our pipeline that utilizes a smaller, tunable model as the query optimizer, which is refined through knowledge distillation from a larger teacher model. Our evaluations on various question-answering (QA) datasets and with different retrieval systems show that ERRR consistently outperforms existing baselines, proving to be a versatile and cost-effective module for improving the utility and accuracy of RAG systems.
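The four stages reduce to a short control flow; in the sketch below llm, retrieve, and optimizer are hypothetical callables, and in the paper the optimizer is a smaller model distilled from a teacher.

```python
def errr_pipeline(question, llm, retrieve, optimizer):
    """Extract-Refine-Retrieve-Read as bare control flow (a sketch)."""
    draft = llm(f"Answer from your own knowledge: {question}")   # Extract
    query = optimizer(question, draft)                           # Refine
    docs = retrieve(query)                                       # Retrieve
    return llm(f"Context: {docs}\nQuestion: {question}")         # Read

# Usage (with user-supplied callables):
# answer = errr_pipeline("Who discovered penicillin?", my_llm, my_search, my_opt)
```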
[58] Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging
Hua Farn, Hsuan Su, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-yi Lee
Main category: cs.CL
TL;DR: Weight merging of pre- and post-fine-tuned LLMs effectively mitigates safety degradation while improving downstream task performance, without requiring additional safety data.
Details
Motivation: Fine-tuning LLMs for downstream tasks causes catastrophic forgetting that degrades safety alignment, and existing methods struggle with inaccessible high-quality safety data needed for recovery.
Method: Simply merge the weights of pre-fine-tuned (original aligned) and post-fine-tuned models to preserve safety while maintaining performance improvements.
Result: Experiments across various downstream tasks and models show the method effectively mitigates safety degradation while enhancing performance.
Conclusion: Weight merging provides a practical and effective solution for preserving safety alignment during fine-tuning without needing additional safety data.
Abstract: Fine-tuning large language models (LLMs) for downstream tasks often leads to catastrophic forgetting, notably degrading the safety of originally aligned models. While some existing methods attempt to restore safety by incorporating additional safety data, the quality of such data typically falls short of that used in the original alignment process. Moreover, these high-quality safety datasets are generally inaccessible, making it difficult to fully recover the model’s original safety. We ask: How can we preserve safety while improving downstream task performance without additional safety data? We show that simply merging the weights of pre- and post-fine-tuned models effectively mitigates safety degradation while enhancing performance. Experiments across different downstream tasks and models validate the method’s practicality and effectiveness.
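A minimal sketch of the merging step: linear interpolation of the two state dicts. The 0.5 weighting is an arbitrary illustrative choice, not a value from the paper.

```python
import torch

def merge_models(sd_aligned, sd_finetuned, alpha=0.5):
    """Interpolate parameters of the pre- and post-fine-tuned models.

    alpha trades safety retention against downstream gains (illustrative).
    """
    return {k: alpha * sd_aligned[k] + (1 - alpha) * sd_finetuned[k]
            for k in sd_aligned}

# merged = merge_models(base_model.state_dict(), tuned_model.state_dict())
# base_model.load_state_dict(merged)
```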
[59] Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics
Aloka Fernando, Nisansa de Silva, Menan Velyuthan, Charitha Rathnayake, Surangika Ranathunga
Main category: cs.CL
TL;DR: Different multilingual pre-trained language models have biases that allow noisy sentences to rank highly in parallel data curation. Using heuristics to remove this noise improves NMT performance and reduces model disparity.
Details
Motivation: Previous research showed that multiPLM choice significantly impacts ranking quality in parallel data curation, but the reasons for this disparity were not well understood.
Method: Analyzed web-mined corpora (CCMatrix, CCAligned) for English-Sinhala, English-Tamil, and Sinhala-Tamil language pairs using LASER3, XLM-R, and LaBSE models. Employed heuristics to identify and remove biased noisy sentences that ranked highly.
Result: Different multiPLMs were biased towards certain sentence types, allowing noise in top-ranked samples. Heuristic-based noise removal improved NMT system performance and reduced performance disparity across different multiPLMs.
Conclusion: Model-specific biases in multiPLMs affect parallel data curation quality, but targeted heuristics can mitigate this noise and improve NMT training outcomes across different models.
Abstract: Parallel Data Curation (PDC) techniques aim to filter out noisy parallel sentences from web-mined corpora. Prior research has demonstrated that ranking sentence pairs using similarity scores on sentence embeddings derived from Pre-trained Multilingual Language Models (multiPLMs), and training the NMT systems with the top-ranked samples, produces superior NMT performance compared to training with the full dataset. However, previous research has shown that the choice of multiPLM significantly impacts the ranking quality. This paper investigates the reasons behind this disparity across multiPLMs. Using the web-mined corpora CCMatrix and CCAligned for En→Si, En→Ta and Si→Ta, we show that different multiPLMs (LASER3, XLM-R, and LaBSE) are biased towards certain types of sentences, which allows noisy sentences to creep into the top-ranked samples. We show that by employing a series of heuristics, this noise can be removed to a certain extent. This improves the performance of NMT systems trained with web-mined corpora and reduces the disparity across multiPLMs.
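As one concrete example of the kind of heuristic involved, a generic length-ratio filter is sketched below; the paper's actual heuristics target multiPLM-specific sentence-type biases and may differ.

```python
def length_ratio_filter(pairs, max_ratio=1.5):
    """Drop sentence pairs whose token-length ratio is implausible.

    One generic PDC heuristic, shown only for illustration.
    """
    kept = []
    for src, tgt in pairs:
        a, b = len(src.split()), len(tgt.split())
        if a and b and max(a, b) / min(a, b) <= max_ratio:
            kept.append((src, tgt))
    return kept

pairs = [("a short sentence", "eine kurze Aussage"),
         ("hello", "a very long unrelated target sentence indeed")]
print(length_ratio_filter(pairs))   # keeps only the first pair
```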
[60] Are formal and functional linguistic mechanisms dissociated in language models?
Michael Hanna, Yonatan Belinkov, Sandro Pezzelle
Main category: cs.CL
TL;DR: LLMs show distinct circuits for formal vs functional linguistic tasks but lack unified formal task networks like human brains, though cross-task faithfulness suggests potential shared formal mechanisms.
Details
Motivation: To investigate whether current LLMs exhibit distinct localization of formal and functional linguistic mechanisms as suggested by neuroscience, given their improving functional abilities.
Method: Analyzed 5 LLMs across 10 tasks by finding and comparing minimal computational subgraphs (circuits) responsible for various formal and functional linguistic tasks.
Result: Found little overlap between formal and functional task circuits, but also little overlap between different formal linguistic tasks themselves, unlike human brain organization. However, cross-task faithfulness analysis showed separation between formal and functional mechanisms.
Conclusion: A single unified formal linguistic network distinct from functional task circuits remains elusive in current LLMs, though shared mechanisms between formal tasks may exist as suggested by cross-task faithfulness patterns.
Abstract: Although large language models (LLMs) are increasingly capable, these capabilities are unevenly distributed: they excel at formal linguistic tasks, such as producing fluent, grammatical text, but struggle more with functional linguistic tasks like reasoning and consistent fact retrieval. Inspired by neuroscience, recent work suggests that to succeed on both formal and functional linguistic tasks, LLMs should use different mechanisms for each; such localization could either be built-in or emerge spontaneously through training. In this paper, we ask: do current models, with fast-improving functional linguistic abilities, exhibit distinct localization of formal and functional linguistic mechanisms? We answer this by finding and comparing the “circuits”, or minimal computational subgraphs, responsible for various formal and functional tasks. Comparing 5 LLMs across 10 distinct tasks, we find that while there is indeed little overlap between circuits for formal and functional tasks, there is also little overlap between formal linguistic tasks, as exists in the human brain. Thus, a single formal linguistic network, unified and distinct from functional task circuits, remains elusive. However, in terms of cross-task faithfulness - the ability of one circuit to solve another’s task - we observe a separation between formal and functional mechanisms, suggesting that shared mechanisms between formal tasks may exist.
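Circuit overlap of the kind reported here is often quantified as Jaccard similarity between edge sets; a minimal sketch, with the caveat that the paper's exact overlap measure may differ.

```python
def circuit_overlap(edges_a, edges_b):
    """Jaccard overlap between two circuits' edge sets."""
    return len(edges_a & edges_b) / len(edges_a | edges_b)

# Toy circuits as sets of (source, target) component edges
formal = {("emb", "attn.3"), ("attn.3", "mlp.5")}
functional = {("emb", "attn.1"), ("attn.3", "mlp.5")}
print(circuit_overlap(formal, functional))   # 0.333..., i.e. little overlap
```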
[61] SaRoHead: Detecting Satire in a Multi-Domain Romanian News Headline Dataset
Mihnea-Alexandru Vîrlan, Răzvan-Alexandru Smădu, Dumitru-Clementin Cercel, Florin Pop, Mihaela-Claudia Cercel
Main category: cs.CL
TL;DR: This paper investigates detecting satirical tone in Romanian news headlines alone using various ML approaches, finding that Bidirectional Transformer models with meta-learning outperform other methods.
Details
Motivation: Current approaches for Romanian language satire detection require both headlines and main article content. The authors want to test if headlines alone contain enough satirical tone signals for detection.
Method: Tested multiple baselines including standard ML algorithms, deep learning models, and LLMs. Used Bidirectional Transformer models with meta-learning Reptile approach.
Result: Bidirectional Transformer models outperformed both standard machine-learning approaches and Large Language Models, particularly when the meta-learning Reptile approach was employed.
Conclusion: Headlines alone contain sufficient satirical tone signals for detection, and advanced transformer models with meta-learning provide the best performance for Romanian satire detection.
Abstract: The primary goal of a news headline is to summarize an event in as few words as possible. Depending on the media outlet, a headline can serve as a means to objectively deliver a summary or improve its visibility. For the latter, specific publications may employ stylistic approaches that incorporate the use of sarcasm, irony, and exaggeration, key elements of a satirical approach. As such, even the headline must reflect the tone of the satirical main content. Current approaches for the Romanian language tend to detect the non-conventional tone (i.e., satire and clickbait) of the news content by combining both the main article and the headline. Because we consider a headline to be merely a brief summary of the main article, we investigate in this paper the presence of satirical tone in headlines alone, testing multiple baselines ranging from standard machine learning algorithms to deep learning models. Our experiments show that Bidirectional Transformer models outperform both standard machine-learning approaches and Large Language Models (LLMs), particularly when the meta-learning Reptile approach is employed.
[62] Multilingual Contextualization of Large Language Models for Document-Level Machine Translation
Miguel Moura Ramos, Patrick Fernandes, Sweta Agrawal, André F. T. Martins
Main category: cs.CL
TL;DR: Fine-tuning LLMs on curated document-level data (DocBlocks) improves long-document translation by capturing cross-sentence dependencies through multiple translation paradigms.
Details
Motivation: LLMs perform well at sentence-level translation but struggle with document-level translation due to challenges in modeling long-range dependencies and discourse phenomena across sentences and paragraphs.
Method: Propose targeted fine-tuning on high-quality document-level data (DocBlocks) that supports multiple translation paradigms including direct document-to-document and chunk-level translation with context integration.
Result: Experimental results show improved document-level translation quality and inference speed compared to prompting and agent-based methods, while maintaining strong sentence-level performance.
Conclusion: Incorporating multiple translation paradigms through fine-tuning on document-level data enables LLMs to better capture cross-sentence dependencies and improves overall document translation capabilities.
Abstract: Large language models (LLMs) have demonstrated strong performance in sentence-level machine translation, but scaling to document-level translation remains challenging, particularly in modeling long-range dependencies and discourse phenomena across sentences and paragraphs. In this work, we propose a method to improve LLM-based long-document translation through targeted fine-tuning on high-quality document-level data, which we curate and introduce as DocBlocks. Our approach supports multiple translation paradigms, including direct document-to-document and chunk-level translation, by integrating instructions both with and without surrounding context. This enables models to better capture cross-sentence dependencies while maintaining strong sentence-level translation performance. Experimental results show that incorporating multiple translation paradigms improves document-level translation quality and inference speed compared to prompting and agent-based methods.
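Chunk-level translation with surrounding context amounts to prompt construction along these lines; the template and the German target are invented placeholders, since the abstract does not specify DocBlocks' instruction format.

```python
def chunk_translation_prompt(chunks, i, target_lang="German", n_ctx=2):
    """Build a chunk-level translation prompt with preceding source context."""
    context = "\n".join(chunks[max(0, i - n_ctx):i])
    return (f"Preceding source context:\n{context}\n\n"
            f"Translate the following chunk into {target_lang}:\n{chunks[i]}")

chunks = ["First paragraph.", "Second paragraph.", "Third paragraph."]
print(chunk_translation_prompt(chunks, i=2))
```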
[63] DART: Distilling Autoregressive Reasoning to Silent Thought
Nan Jiang, Ziming Wu, De-Chuan Zhan, Fuming Lai, Shaobing Lian
Main category: cs.CL
TL;DR: DART is a self-distillation framework that enables LLMs to replace autoregressive Chain-of-Thought reasoning with non-autoregressive Silent Thought, reducing computational overhead while maintaining performance.
Details
Motivation: Chain-of-Thought reasoning causes significant computational overhead and latency, making it unsuitable for latency-sensitive applications. There's a need for efficient reasoning methods that maintain performance without the autoregressive cost.
Method: DART uses a self-distillation framework with two training pathways: CoT pathway for traditional reasoning and ST pathway that generates answers directly from Silent Thought tokens. It employs a Reasoning Evolvement Module to align hidden states between pathways, enabling ST tokens to evolve into informative embeddings.
Result: Extensive experiments show DART provides significant performance gains compared to existing non-autoregressive baselines without adding extra inference latency.
Conclusion: DART serves as a feasible alternative for efficient reasoning, enabling LLMs to perform complex tasks with reduced computational overhead while maintaining performance comparable to autoregressive methods.
Abstract: Chain-of-Thought (CoT) reasoning has significantly advanced Large Language Models (LLMs) in solving complex tasks. However, its autoregressive paradigm leads to significant computational overhead, hindering its deployment in latency-sensitive applications. To address this, we propose DART (Distilling Autoregressive Reasoning to Silent Thought), a self-distillation framework that enables LLMs to replace autoregressive CoT with non-autoregressive Silent Thought (ST). Specifically, DART introduces two training pathways: the CoT pathway for traditional reasoning and the ST pathway for generating answers directly from a few ST tokens. The ST pathway utilizes a lightweight Reasoning Evolvement Module (REM) to align its hidden states with the CoT pathway, enabling the ST tokens to evolve into informative embeddings. During inference, only the ST pathway is activated, leveraging evolving ST tokens to deliver the answer directly. Extensive experimental results demonstrate that DART offers significant performance gains compared with existing non-autoregressive baselines without extra inference latency, serving as a feasible alternative for efficient reasoning.
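One plausible form of the joint objective, assuming cross-entropy on both pathways plus an MSE alignment term for the Reasoning Evolvement Module; the loss shape and the weighting lam are guesses, not the paper's specification.

```python
import torch
import torch.nn.functional as F

def dart_loss(logits_cot, logits_st, h_cot, h_st, targets, lam=1.0):
    """Joint objective sketch for the CoT and Silent-Thought pathways."""
    ce_cot = F.cross_entropy(logits_cot, targets)   # supervise CoT pathway
    ce_st = F.cross_entropy(logits_st, targets)     # supervise ST pathway
    align = F.mse_loss(h_st, h_cot.detach())        # REM-style state alignment
    return ce_cot + ce_st + lam * align

loss = dart_loss(torch.randn(8, 100), torch.randn(8, 100),
                 torch.randn(8, 512), torch.randn(8, 512),
                 torch.randint(0, 100, (8,)))
```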
[64] Bridging Compositional and Distributional Semantics: A Survey on Latent Semantic Geometry via AutoEncoder
Yingji Zhang, Danilo S. Carvalho, André Freitas
Main category: cs.CL
TL;DR: Survey paper analyzing how compositional semantics can bridge symbolic and distributional approaches in language models, comparing three autoencoder architectures (VAE, VQVAE, SAE) and their latent space geometries.
Details
Motivation: To enhance interpretability, controllability, compositionality, and generalization of Transformer-based language models by integrating compositional and symbolic properties into distributional semantic spaces.
Method: Comparative review and analysis of three autoencoder architectures: Variational AutoEncoder (VAE), Vector Quantised VAE (VQVAE), and Sparse AutoEncoder (SAE), examining their induced latent geometries in relation to semantic structure.
Result: Provides a novel perspective on latent space geometry through compositional semantics, enabling a bridge between symbolic and distributional semantics.
Conclusion: Semantic representation learning offers a promising direction to mitigate the gap between symbolic and distributional approaches, enhancing language model capabilities through improved latent space organization.
Abstract: Integrating compositional and symbolic properties into current distributional semantic spaces can enhance the interpretability, controllability, compositionality, and generalisation capabilities of Transformer-based auto-regressive language models (LMs). In this survey, we offer a novel perspective on latent space geometry through the lens of compositional semantics, a direction we refer to as semantic representation learning. This direction enables a bridge between symbolic and distributional semantics, helping to mitigate the gap between them. We review and compare three mainstream autoencoder architectures, the Variational AutoEncoder (VAE), the Vector Quantised VAE (VQVAE), and the Sparse AutoEncoder (SAE), and examine the distinctive latent geometries they induce in relation to semantic structure and interpretability.
[65] Adversarial Manipulation of Reasoning Models using Internal Representations
Kureha Yamaguchi, Benjamin Etheridge, Andy Arditi
Main category: cs.CL
TL;DR: Chain-of-thought reasoning in models like DeepSeek-R1-Distill-Llama-8B contains a “caution” direction that predicts refusal decisions. Manipulating this direction can jailbreak the model by increasing harmful compliance.
Details
Motivation: To understand how chain-of-thought reasoning affects model vulnerability to jailbreak attacks, particularly where refusal decisions are made within the reasoning process rather than at the prompt-response boundary.
Method: Identified a linear direction in activation space during CoT token generation that predicts refusal/compliance behavior. Conducted ablation studies to remove this "caution" direction and measured resulting harmful compliance rates.
Result: Ablating the caution direction from model activations increases harmful compliance, effectively jailbreaking the model. Intervening only on CoT token activations suffices to control final outputs, and incorporating this direction into prompt-based attacks improves success rates.
Conclusion: Chain-of-thought reasoning itself represents a promising new target for adversarial manipulation in reasoning models, as refusal decisions occur within the reasoning process rather than at the final output stage.
Abstract: Reasoning models generate chain-of-thought (CoT) tokens before their final output, but how this affects their vulnerability to jailbreak attacks remains unclear. While traditional language models make refusal decisions at the prompt-response boundary, we find evidence that DeepSeek-R1-Distill-Llama-8B makes these decisions within its CoT generation. We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply – termed the “caution” direction because it corresponds to cautious reasoning patterns in the generated text. Ablating this direction from model activations increases harmful compliance, effectively jailbreaking the model. We additionally show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates. Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models. Code available at https://github.com/ky295/reasoning-manipulation.
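Ablating a direction from activations is a standard operation: project hidden states onto the unit direction and subtract that component. A minimal sketch, with random tensors standing in for the model's activations and the extracted "caution" direction.

```python
import torch

def ablate_direction(hidden, direction):
    """Remove the component of hidden states along a unit direction.

    hidden: (..., d) activations at some layer; direction: (d,), e.g. the
    "caution" direction found during CoT generation.
    """
    d = direction / direction.norm()
    return hidden - (hidden @ d).unsqueeze(-1) * d

h = torch.randn(2, 16, 4096)         # (batch, seq, hidden) stand-in activations
v = torch.randn(4096)                # stand-in for the extracted direction
h_ablated = ablate_direction(h, v)   # component along v is now ~0
```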
[66] Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs
Yizhan Huang, Zhe Yang, Meifang Chen, Jianping Zhang, Michael R. Lyu
Main category: cs.CL
TL;DR: The paper investigates how data entropy correlates with memorization difficulty in LLMs, discovering a linear Entropy-Memorization Law and using this to develop a method for distinguishing training vs testing data.
Details
Motivation: To understand how to characterize memorization difficulty of training data in Large Language Models, particularly exploring the relationship between data entropy and memorization.
Method: Empirical experiments on OLMo open models, analyzing entropy-memorization correlation and testing with randomized strings (gibberish) to develop a dataset inference approach.
Result: Discovered Entropy-Memorization Law showing linear correlation between data entropy and memorization score, and found that randomized strings have unexpectedly low empirical entropy compared to training corpus.
Conclusion: Data entropy serves as a reliable indicator of memorization difficulty, enabling effective dataset inference to distinguish between training and testing data in LLMs.
Abstract: Large Language Models (LLMs) are known to memorize portions of their training data, sometimes reproducing content verbatim when prompted appropriately. In this work, we investigate a fundamental yet under-explored question in the domain of memorization: How to characterize memorization difficulty of training data in LLMs? Through empirical experiments on OLMo, a family of open models, we present the Entropy-Memorization Law. It suggests that data entropy is linearly correlated with memorization score. Moreover, in a case study of memorizing highly randomized strings, or “gibberish”, we observe that such sequences, despite their apparent randomness, exhibit unexpectedly low empirical entropy compared to the broader training corpus. Adopting the same strategy to discover Entropy-Memorization Law, we derive a simple yet effective approach to distinguish training and testing data, enabling Dataset Inference (DI).
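"Data entropy" can be instantiated simply as the Shannon entropy of a string's character distribution; a sketch below, noting that the paper's estimator (e.g. token-level over the training corpus) may differ.

```python
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

print(char_entropy("the cat sat on the mat"))   # repetitive English text
print(char_entropy("xq9#vL2@pZ!d"))             # short "gibberish" string
```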
[67] Dynamic Context Compression for Efficient RAG
Shuyu Guo, Zhaochun Ren
Main category: cs.CL
TL;DR: ACC-RAG is an adaptive context compression framework for RAG that dynamically adjusts compression rates based on query complexity, achieving 4x faster inference while maintaining accuracy.
Details
Motivation: Standard RAG incurs high inference costs from lengthy retrieved contexts, and existing fixed compression methods either over-compress simple queries or under-compress complex ones, leading to inefficiency.
Method: Combines hierarchical compressor for multi-granular embeddings with context selector to retain minimal sufficient information, mimicking human skimming behavior.
Result: Outperforms fixed-rate compression methods and achieves over 4 times faster inference compared to standard RAG while maintaining or improving accuracy on Wikipedia and five QA datasets.
Conclusion: ACC-RAG provides an effective solution for optimizing RAG efficiency through adaptive context compression that balances inference speed and accuracy based on query complexity.
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but incurs significant inference costs due to lengthy retrieved contexts. While context compression mitigates this issue, existing methods apply fixed compression rates, over-compressing simple queries or under-compressing complex ones. We propose Adaptive Context Compression for RAG (ACC-RAG), a framework that dynamically adjusts compression rates based on input complexity, optimizing inference efficiency without sacrificing accuracy. ACC-RAG combines a hierarchical compressor (for multi-granular embeddings) with a context selector to retain minimal sufficient information, akin to human skimming. Evaluated on Wikipedia and five QA datasets, ACC-RAG outperforms fixed-rate methods and unlocks over 4 times faster inference than standard RAG while maintaining or improving accuracy.
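The selector's "minimal sufficient information" idea can be sketched as keeping the smallest high-score prefix of retrieved chunks; the cumulative-score stopping rule is a simplified stand-in for ACC-RAG's learned selector.

```python
def select_context(chunks, scores, threshold=0.7):
    """Keep the smallest high-score prefix of retrieved chunks.

    chunks/scores are assumed to come from the hierarchical compressor's
    relevance scoring; the stopping rule is illustrative.
    """
    total = sum(scores)
    kept, acc = [], 0.0
    for chunk, s in sorted(zip(chunks, scores), key=lambda x: -x[1]):
        kept.append(chunk)
        acc += s
        if acc >= threshold * total:   # enough information retained; stop early
            break
    return kept

print(select_context(["c1", "c2", "c3"], scores=[0.5, 0.3, 0.2]))  # ['c1', 'c2']
```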
[68] CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors
Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
Main category: cs.CL
TL;DR: Novel method using Contextual Co-occurrence Matrix and Tensors to detect jailbreak prompts in LLMs with high efficiency and minimal labeled data.
Details
Motivation: LLMs are vulnerable to jailbreak attacks that produce harmful responses, requiring robust detection methods for safe deployment.
Method: Leverages latent space characteristics of Contextual Co-occurrence Matrices and Tensors to identify adversarial and jailbreak prompts.
Result: Achieves F1 score of 0.83 using only 0.5% labeled prompts (96.6% improvement over baselines) and 2.3-128.4x speedup.
Conclusion: The approach demonstrates strong pattern learning capabilities in data-scarce environments and significantly outperforms baseline methods in both accuracy and efficiency.
Abstract: The widespread use of Large Language Models (LLMs) in many applications marks a significant advance in research and practice. However, their complexity and hard-to-understand nature make them vulnerable to attacks, especially jailbreaks designed to produce harmful responses. To counter these threats, developing strong detection methods is essential for the safe and reliable use of LLMs. This paper studies this detection problem using the Contextual Co-occurrence Matrix, a structure recognized for its efficacy in data-scarce environments. We propose a novel method leveraging the latent space characteristics of Contextual Co-occurrence Matrices and Tensors for the effective identification of adversarial and jailbreak prompts. Our evaluations show that this approach achieves a notable F1 score of 0.83 using only 0.5% of labeled prompts, which is a 96.6% improvement over baselines. This result highlights the strength of our learned patterns, especially when labeled data is scarce. Our method is also significantly faster, speedup ranging from 2.3 to 128.4 times compared to the baseline models.
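A basic contextual co-occurrence matrix is a sliding-window count matrix whose latent features come from a factorization such as SVD; a minimal sketch, with the caveat that the paper's tensor construction adds further context modes on top of this.

```python
import numpy as np

def cooccurrence_matrix(tokens, window=2):
    """Contextual co-occurrence counts within a sliding window."""
    vocab = {t: i for i, t in enumerate(dict.fromkeys(tokens))}
    M = np.zeros((len(vocab), len(vocab)))
    for i, t in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                M[vocab[t], vocab[tokens[j]]] += 1
    return M, vocab

M, vocab = cooccurrence_matrix("ignore previous instructions and comply".split())
U, S, Vt = np.linalg.svd(M)   # latent-space features a detector could use
```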
[69] WideSearch: Benchmarking Agentic Broad Info-Seeking
Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, Ke Wang
Main category: cs.CL
TL;DR: WideSearch benchmark reveals current LLM-powered search agents achieve near 0% success rates on large-scale information collection tasks, despite humans achieving near 100% success, highlighting critical deficiencies in agentic search systems.
Details
Motivation: Automated search agents powered by LLMs promise to liberate humans from tedious wide-scale information seeking, but their capability for reliable and complete large-scale collection remains unevaluated due to lack of suitable benchmarks.
Method: Introduced WideSearch benchmark with 200 manually curated questions (100 English, 100 Chinese) from 15+ domains, featuring rigorous five-stage quality control pipeline to ensure difficulty, completeness, and verifiability. Benchmarked over 10 state-of-the-art agentic search systems.
Result: Most systems achieved overall success rates near 0%, with the best performer reaching just 5%. However, human testers achieved near 100% success rate given sufficient time.
Conclusion: Current search agents have critical deficiencies in large-scale information seeking, underscoring urgent need for future research and development in agentic search systems.
Abstract: From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such “wide-context” collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, which could be verified one by one objectively, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross-validation by multiple human testers can achieve a near 100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/
[70] Steering Towards Fairness: Mitigating Political Bias in LLMs
Afrozah Nadeem, Mark Dras, Usman Naseem
Main category: cs.CL
TL;DR: This paper analyzes ideological biases in decoder-based LLMs using Political Compass Test framework, revealing systematic encoding of political bias across model layers and proposing steering vector-based mitigation.
Details
Motivation: Concerns about LLMs encoding and reproducing ideological biases along political and economic dimensions in real-world applications, despite recent advancements.
Method: Uses Political Compass Test framework with contrastive pairs to extract and compare hidden layer activations from decoder LLMs like Mistral and DeepSeek. Implements comprehensive activation extraction pipeline for layer-wise analysis across multiple ideological axes.
Result: Decoder LLMs systematically encode representational bias across layers, with meaningful disparities linked to political framing. The bias patterns can be leveraged for effective steering vector-based mitigation.
Conclusion: Provides new insights into how political bias is encoded in LLMs and offers a principled approach to debiasing that goes beyond surface-level output interventions.
Abstract: Recent advancements in large language models (LLMs) have enabled their widespread use across diverse real-world applications. However, concerns remain about their tendency to encode and reproduce ideological biases along political and economic dimensions. In this paper, we employ a framework for probing and mitigating such biases in decoder-based LLMs through analysis of internal model representations. Grounded in the Political Compass Test (PCT), this method uses contrastive pairs to extract and compare hidden layer activations from models like Mistral and DeepSeek. We introduce a comprehensive activation extraction pipeline capable of layer-wise analysis across multiple ideological axes, revealing meaningful disparities linked to political framing. Our results show that decoder LLMs systematically encode representational bias across layers, which can be leveraged for effective steering vector-based mitigation. This work provides new insights into how political bias is encoded in LLMs and offers a principled approach to debiasing beyond surface-level output interventions.
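As a rough illustration of the steering-vector mechanism described above, the sketch below extracts a direction as the mean activation difference between contrastive prompt pairs and injects it through a forward hook. It assumes a HuggingFace-style decoder model; the layer choice and scale are placeholders, not the paper's configuration.

```python
import torch

@torch.no_grad()
def steering_vector(model, tokenizer, pos_prompts, neg_prompts, layer):
    """Mean last-token hidden-state difference between contrastive prompt sets."""
    def mean_hidden(prompts):
        vecs = []
        for p in prompts:
            ids = tokenizer(p, return_tensors="pt").input_ids
            out = model(ids, output_hidden_states=True)
            vecs.append(out.hidden_states[layer][0, -1])  # last-token activation
        return torch.stack(vecs).mean(0)
    return mean_hidden(pos_prompts) - mean_hidden(neg_prompts)

def add_steering_hook(layer_module, vec, alpha=-1.0):
    """Shift the layer's output along the bias direction during generation."""
    def hook(_module, _inputs, output):
        if isinstance(output, tuple):  # decoder layers usually return tuples
            return (output[0] + alpha * vec,) + output[1:]
        return output + alpha * vec
    return layer_module.register_forward_hook(hook)
```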
[71] Estimating Machine Translation Difficulty
Lorenzo Proietti, Stefano Perrella, Vilém Zouhar, Roberto Navigli, Tom Kocmi
Main category: cs.CL
TL;DR: This paper introduces translation difficulty estimation as a new task to identify texts where machine translation systems struggle, develops evaluation metrics, and releases models that outperform existing approaches for creating more challenging benchmarks.
Details
Motivation: As machine translation quality improves to near-perfect levels, it becomes difficult to distinguish between state-of-the-art models and identify areas for improvement. There is a need to automatically identify texts where translation systems struggle, in order to develop better evaluations and guide future research.
Method: The authors formalize translation difficulty estimation by defining text difficulty based on expected translation quality. They introduce a new evaluation metric, assess baseline methods, and develop novel approaches including dedicated models that outperform heuristic-based methods and LLM-as-a-judge approaches.
Result: Dedicated models for difficulty estimation outperform both heuristic-based methods and LLM-as-a-judge approaches, with Sentinel-src achieving the best performance. The authors demonstrate practical utility by using difficulty estimators to construct more challenging machine translation benchmarks.
Conclusion: The paper releases two improved models (Sentinel-src-24 and Sentinel-src-25) that can effectively scan large text collections to identify texts most likely to challenge contemporary machine translation systems, providing valuable tools for developing more discriminative evaluations and guiding future research.
Abstract: Machine translation quality has steadily improved over the years, achieving near-perfect translations in recent benchmarks. These high-quality outputs make it difficult to distinguish between state-of-the-art models and to identify areas for future improvement. In this context, automatically identifying texts where machine translation systems struggle holds promise for developing more discriminative evaluations and guiding future research. In this work, we address this gap by formalizing the task of translation difficulty estimation, defining a text’s difficulty based on the expected quality of its translations. We introduce a new metric to evaluate difficulty estimators and use it to assess both baselines and novel approaches. Finally, we demonstrate the practical utility of difficulty estimators by using them to construct more challenging benchmarks for machine translation. Our results show that dedicated models outperform both heuristic-based methods and LLM-as-a-judge approaches, with Sentinel-src achieving the best performance. Thus, we release two improved models for difficulty estimation, Sentinel-src-24 and Sentinel-src-25, which can be used to scan large collections of texts and select those most likely to challenge contemporary machine translation systems.
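The operational definition lends itself to a direct sketch: score each source text with a segment-level MT metric across several systems and treat low expected quality as high difficulty. The [0, 1] normalization and the aggregation below are illustrative assumptions, not the paper's exact estimator.

```python
from statistics import mean

def difficulty(quality_by_system):
    """quality_by_system: {system_name: metric score in [0, 1]} for one source text.

    Difficulty is high exactly when the expected translation quality is low.
    """
    return 1.0 - mean(quality_by_system.values())

# Building a harder benchmark then amounts to a top-k selection:
# hardest = sorted(texts, key=lambda t: difficulty(scores[t]), reverse=True)[:k]
```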
[72] Beyond the Rosetta Stone: Unification Forces in Generalization Dynamics
Carter Blum, Katja Filippova, Ann Yuan, Asma Ghandeharioun, Julian Zimmert, Fred Zhang, Jessica Hoffmann, Tal Linzen, Martin Wattenberg, Lucas Dixon, Mor Geva
Main category: cs.CL
TL;DR: Study shows LLMs struggle with cross-lingual knowledge transfer due to representation separation, identifies unification as key factor, and develops methods to improve transfer through data manipulation.
Details
Motivation: Large language models hallucinate when transferring knowledge across languages, creating a need to understand the causes and dynamics of this cross-lingual knowledge transfer problem.
Method: Trained small Transformer models from scratch on synthetic multilingual datasets, analyzed representation development phases, measured mutual information between facts and language, and developed methods to modulate transfer through data distribution and tokenization manipulation.
Result: Identified that models develop either separate or unified representations across languages, with unification being essential for cross-lingual transfer. Found that unification degree depends on mutual information between facts and training language, and language extraction difficulty.
Conclusion: Controlled settings can reveal pre-training dynamics and suggest new directions for improving cross-lingual transfer in LLMs through targeted data and tokenization strategies.
Abstract: Large language models (LLMs) struggle with cross-lingual knowledge transfer: they hallucinate when asked in one language about facts expressed in a different language during training. This work introduces a controlled setting to study the causes and dynamics of this phenomenon by training small Transformer models from scratch on synthetic multilingual datasets. We identify a learning phase wherein a model develops either separate or unified representations of the same facts across languages, and show that unification is essential for cross-lingual transfer. We also show that the degree of unification depends on mutual information between facts and training data language, and on how easy it is to extract that language. Based on these insights, we develop methods to modulate the level of cross-lingual transfer by manipulating data distribution and tokenization, and we introduce metrics and visualizations to formally characterize their effects on unification. Our work shows how controlled settings can shed light on pre-training dynamics and suggests new directions for improving cross-lingual transfer in LLMs.
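The mutual-information quantity the paper ties to unification can be estimated directly from training-set counts. The plug-in estimator below is a generic sketch, not the paper's measurement code.

```python
import math
from collections import Counter

def fact_language_mi(pairs):
    """pairs: list of (fact_id, language) occurrences in the training corpus."""
    n = len(pairs)
    joint = Counter(pairs)
    facts = Counter(f for f, _ in pairs)
    langs = Counter(l for _, l in pairs)
    mi = 0.0
    for (f, l), c in joint.items():
        p = c / n
        mi += p * math.log2(p / ((facts[f] / n) * (langs[l] / n)))
    return mi

# High MI (each fact seen mostly in one language) is the regime linked here to
# separated representations; low MI (facts spread across languages) favors unification.
```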
[73] Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
Khaoula Chehbouni, Mohammed Haddou, Jackie Chi Kit Cheung, Golnoosh Farnadi
Main category: cs.CL
TL;DR: This position paper critically examines the premature enthusiasm around using large language models as judges (LLJs) for NLG evaluation, questioning their reliability and validity based on measurement theory principles.
Details
Motivation: The rise of LLMs as general-purpose systems has led to their adoption as evaluators (LLJs), but this adoption has outpaced rigorous scrutiny of their reliability and validity, potentially undermining progress in NLG evaluation.
Method: The authors draw on measurement theory from social sciences to critically assess four core assumptions underlying LLJs: their ability to proxy human judgment, evaluation capabilities, scalability, and cost-effectiveness. They examine these assumptions through the lens of LLM limitations and current NLG evaluation practices, using text summarization, data annotation, and safety alignment as case studies.
Result: The analysis reveals that current LLJ practices may be based on questionable assumptions and highlights inherent limitations that challenge their validity as evaluators across different NLG applications.
Conclusion: There is a need for more responsible evaluation practices for LLJs to ensure their growing role in the field actually supports rather than undermines progress in natural language generation evaluation.
Abstract: Evaluating natural language generation (NLG) systems remains a core challenge of natural language processing (NLP), further complicated by the rise of large language models (LLMs) that aim to be general-purpose. Recently, large language models as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored. This position paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators. Drawing on measurement theory from the social sciences, we identify and critically assess four core assumptions underlying the use of LLJs: their ability to act as proxies for human judgment, their capabilities as evaluators, their scalability, and their cost-effectiveness. We examine how each of these assumptions may be challenged by the inherent limitations of LLMs, LLJs, or current practices in NLG evaluation. To ground our analysis, we explore three applications of LLJs: text summarization, data annotation, and safety alignment. Finally, we highlight the need for more responsible practices in LLJ evaluation, to ensure that their growing role in the field supports, rather than undermines, progress in NLG.
[74] LLMs Can’t Handle Peer Pressure: Crumbling under Multi-Agent Social Interactions
Maojia Song, Tej Deep Pala, Weisheng Jin, Amir Zadeh, Chuan Li, Dorien Herremans, Soujanya Poria
Main category: cs.CL
TL;DR: KAIROS benchmark tests LLMs’ trust formation, misinformation resistance, and peer integration in multi-agent systems using quiz contests with varying peer reliability conditions.
Details
Motivation: To understand how LLMs form trust from previous interactions, resist misinformation, and integrate peer input in collaborative settings - key factors for achieving collective intelligence in complex social dynamics.
Method: Developed KAIROS benchmark simulating quiz contests with peer agents of varying reliability, testing LLMs with historical interactions and current peer responses. Evaluated prompting, supervised fine-tuning, and reinforcement learning (GRPO) mitigation strategies.
Result: GRPO with multi-agent context combined with outcome-based rewards and unconstrained reasoning achieved best overall performance, but decreased robustness to social influence compared to Base models.
Conclusion: The study provides insights into LLM social decision-making and demonstrates effective mitigation strategies, though trade-offs exist between performance and social influence robustness.
Abstract: Large language models (LLMs) are increasingly deployed in multi-agent systems (MAS) as components of collaborative intelligence, where peer interactions dynamically shape individual decision-making. Although prior work has focused on conformity bias, we extend the analysis to examine how LLMs form trust from previous impressions, resist misinformation, and integrate peer input during interaction, key factors for achieving collective intelligence under complex social dynamics. We present KAIROS, a benchmark simulating quiz contests with peer agents of varying reliability, offering fine-grained control over conditions such as expert-novice roles, noisy crowds, and adversarial peers. LLMs receive both historical interactions and current peer responses, allowing systematic investigation into how trust, peer action, and self-confidence influence decisions. As mitigation strategies, we evaluate prompting, supervised fine-tuning, and reinforcement learning with Group Relative Policy Optimisation (GRPO) across multiple models. Our results reveal that GRPO with multi-agent context combined with outcome-based rewards and unconstrained reasoning achieves the best overall performance, but also decreases the robustness to social influence compared to Base models. The code and datasets are available at: https://github.com/declare-lab/KAIROS.
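For reference, the group-relative advantage at the heart of GRPO is typically computed as below: each sampled response is standardized against the mean and spread of its own group, so no learned value network is required. The KAIROS-specific outcome reward is assumed, not reproduced.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_groups, group_size) outcome rewards for sampled responses."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)  # standardized within each group

# These advantages then weight the clipped policy-ratio loss per token; avoiding
# a separate critic network is what distinguishes GRPO from PPO.
```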
[75] Continuously Steering LLMs Sensitivity to Contextual Knowledge with Proxy Models
Yilin Wang, Heng Wang, Yuyang Bai, Minnan Luo
Main category: cs.CL
TL;DR: CSKS is a lightweight framework that enables continuous control over LLMs’ sensitivity to contextual knowledge without modifying model weights, using two small proxy models to shift output distributions.
Details
Motivation: Address knowledge conflicts in LLMs where parametric knowledge contradicts contextual knowledge, overcoming limitations of previous methods that are inefficient, ineffective for large models, or not workable for black-box models.
Method: Tune two small proxy models and use the difference in their output distributions to shift the original LLM’s distribution without weight modification, enabling continuous sensitivity adjustment.
Result: Achieves continuous and precise control over LLMs’ sensitivity to contextual knowledge, allowing both increased and reduced sensitivity to prioritize either contextual or parametric knowledge as needed.
Conclusion: CSKS provides an effective, lightweight solution for steering LLMs’ knowledge sensitivity without model modification, demonstrating practical efficacy on both synthetic and real conflict datasets.
Abstract: In Large Language Model (LLM) generation, knowledge conflicts arise in scenarios where parametric knowledge contradicts knowledge provided in the context. Previous works studied tuning, decoding algorithms, or locating and editing context-aware neurons to adapt LLMs to be faithful to new contextual knowledge. However, they are usually inefficient or ineffective for large models, not workable for black-box models, or unable to continuously adjust LLMs’ sensitivity to the knowledge provided in the context. To mitigate these problems, we propose CSKS (Continuously Steering Knowledge Sensitivity), a simple framework that can steer LLMs’ sensitivity to contextual knowledge continuously at a lightweight cost. Specifically, we tune two small LMs (i.e., proxy models) and use the difference in their output distributions to shift the original distribution of an LLM without modifying the LLM weights. In the evaluation process, we not only design synthetic data and fine-grained metrics to measure models’ sensitivity to contextual knowledge but also use a real conflict dataset to validate CSKS’s practical efficacy. Extensive experiments demonstrate that our framework achieves continuous and precise control over LLMs’ sensitivity to contextual knowledge, enabling both increased and reduced sensitivity, thereby allowing LLMs to prioritize either contextual or parametric knowledge flexibly as needed. Our data and code are available at https://github.com/OliveJuiceLin/CSKS.
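The proxy-based shift described in the abstract can be sketched as simple logit arithmetic: the frozen LLM's next-token logits are moved by a scaled difference between the tuned and untuned proxies. This assumes all three models share a vocabulary; the exact combination rule and the value of alpha in CSKS may differ.

```python
import torch

@torch.no_grad()
def shifted_logits(llm, proxy_tuned, proxy_base, input_ids, alpha=1.0):
    """Next-token logits of the frozen LLM, shifted by the proxy difference.

    alpha acts as the continuous sensitivity knob: alpha > 0 raises
    context-faithfulness, alpha < 0 favors parametric knowledge.
    """
    base = llm(input_ids).logits[:, -1]
    delta = (proxy_tuned(input_ids).logits[:, -1]
             - proxy_base(input_ids).logits[:, -1])
    return base + alpha * delta

# next_id = shifted_logits(llm, pt, pb, ids, alpha=0.5).argmax(dim=-1)
```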
[76] NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks
Aritra Dutta, Swapnanil Mukherjee, Deepanway Ghosal, Somak Aditya
Main category: cs.CL
TL;DR: NLKI framework enhances small vision-language models by integrating retrieved commonsense facts and LLM-generated explanations, boosting performance by up to 7% across datasets while reducing hallucinations.
Details
Motivation: Small vision-language models lag behind larger counterparts in commonsense VQA due to missing knowledge. The study aims to improve sVLMs through careful commonsense knowledge integration.
Method: End-to-end framework that retrieves natural language facts using fine-tuned ColBERTv2, prompts LLM to craft explanations, and feeds both signals to sVLMs. Uses noise-robust losses for additional finetuning.
Result: 7% accuracy improvement across 3 datasets, making FLAVA match/exceed medium-sized VLMs. Additional 2.5-5.5% gains with noise-robust training. Reduced hallucinations in explanations.
Conclusion: LLM-based commonsense knowledge integration enables parameter-efficient reasoning for 250M models, with noise-aware training stabilizing performance in knowledge-augmented settings.
Abstract: Commonsense visual-question answering often hinges on knowledge that is missing from the image or the question. Small vision-language models (sVLMs) such as ViLT, VisualBERT and FLAVA therefore lag behind their larger generative counterparts. To study the effect of careful commonsense knowledge integration on sVLMs, we present an end-to-end framework (NLKI) that (i) retrieves natural language facts, (ii) prompts an LLM to craft natural language explanations, and (iii) feeds both signals to sVLMs respectively across two commonsense VQA datasets (CRIC, AOKVQA) and a visual-entailment dataset (e-SNLI-VE). Facts retrieved using a fine-tuned ColBERTv2 and an object information-enriched prompt yield explanations that largely cut down hallucinations, while lifting the end-to-end answer accuracy by up to 7% (across 3 datasets), making FLAVA and other models in NLKI match or exceed medium-sized VLMs such as Qwen-2 VL-2B and SmolVLM-2.5B. As these benchmarks contain 10-25% label noise, additional finetuning using noise-robust losses (such as symmetric cross entropy and generalised cross entropy) adds another 2.5% in CRIC, and 5.5% in AOKVQA. Our findings expose when LLM-based commonsense knowledge beats retrieval from commonsense knowledge bases, how noise-aware training stabilises small models in the context of external knowledge augmentation, and why parameter-efficient commonsense reasoning is now within reach for 250M models.
[77] Selective Retrieval-Augmentation for Long-Tail Legal Text Classification
Boheng Mao
Main category: cs.CL
TL;DR: SRA method improves legal text classification on long-tail datasets by selectively augmenting low-frequency labels from training data only, achieving better F1 scores than existing baselines.
Details
Motivation: Legal text classification datasets often have long-tail distributions where rare classes are underrepresented, leading to poor model performance on these classes.
Method: Selective Retrieval-Augmentation (SRA) that focuses on augmenting samples from low-frequency labels in the training set only, preventing noise for well-represented classes without changing model architecture.
Result: SRA achieves higher micro-F1 and macro-F1 scores compared to all current LexGLUE baselines on both LEDGAR (single-label) and UNFAIR-ToS (multi-label) datasets.
Conclusion: SRA provides consistent improvements for long-tail legal text classification by selectively augmenting rare classes from training data without external resources or architectural changes.
Abstract: Legal text classification is a fundamental NLP task in the legal domain. Benchmark datasets in this area often exhibit a long-tail label distribution, where many labels are underrepresented, leading to poor model performance on rare classes. This paper proposes Selective Retrieval-Augmentation (SRA) as a solution to this problem. SRA focuses on augmenting samples belonging to low-frequency labels in the training set, preventing the introduction of noise for well-represented classes, and requires no changes to the model architecture. Retrieval is performed only from the training data to ensure there is no potential information leakage, while also removing the need for external corpora. The proposed SRA method is tested on two legal text classification benchmark datasets with long-tail distributions: LEDGAR (single-label) and UNFAIR-ToS (multi-label). The results indicate that SRA attains higher micro-F1 and macro-F1 scores compared to all current LexGLUE baselines across both datasets, illustrating consistent improvements in long-tail legal text classification.
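A minimal sketch of the selective-augmentation logic, under stated assumptions: the frequency threshold, embedding-based nearest-neighbor retrieval from the training set, and concatenation as the augmentation operation are all placeholders, not the paper's exact choices.

```python
from collections import Counter
from sklearn.neighbors import NearestNeighbors

def selectively_augment(texts, labels, embeddings, freq_threshold=50, k=3):
    """Augment only rare-label samples, retrieving neighbors from training data."""
    label_counts = Counter(labels)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    aug_texts, aug_labels = list(texts), list(labels)
    for i, (text, label) in enumerate(zip(texts, labels)):
        if label_counts[label] >= freq_threshold:
            continue  # well-represented class: skip to avoid introducing noise
        _, idxs = nn.kneighbors(embeddings[i:i + 1])
        for j in idxs[0][1:]:  # drop the query sample itself
            aug_texts.append(text + " " + texts[j])  # placeholder augmentation op
            aug_labels.append(label)
    return aug_texts, aug_labels
```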
[78] Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks
Sheng Liu, Qiang Sheng, Danding Wang, Yang Li, Guang Yang, Juan Cao
Main category: cs.CL
TL;DR: IMAGINE is a framework that generates jailbreak-like instructions to fill distribution gaps in safety alignment data, reducing attack success rates on LLMs without compromising utility.
Details
Motivation: LLMs remain vulnerable to jailbreak attacks due to distributional mismatch between safety alignment training data and real-world malicious instructions, forcing developers into reactive patching cycles.
Method: Leverages embedding space distribution analysis to synthesize jailbreak-like instructions through iterative optimization that dynamically evolves text generation distributions to augment safety alignment data coverage.
Result: Significant decreases in attack success rates on Qwen2.5, Llama3.1, and Llama3.2 models without compromising their utility.
Conclusion: IMAGINE effectively addresses the distributional gap problem in LLM safety alignment by proactively generating synthetic jailbreak patterns to enhance model robustness against unseen malicious instructions.
Abstract: Despite advances in training large language models (LLMs) to refuse malicious instructions, widely used LLMs remain vulnerable to jailbreak attacks where attackers generate instructions with distributions differing from safety alignment corpora. New attacks expose LLMs’ inability to recognize unseen malicious instructions, highlighting a critical distributional mismatch between training data and real-world attacks that forces developers into reactive patching cycles. To tackle this challenge, we propose IMAGINE, a synthesis framework that leverages embedding space distribution analysis to generate jailbreak-like instructions. This approach effectively fills the distributional gap between authentic jailbreak patterns and safety alignment corpora. IMAGINE follows an iterative optimization process that dynamically evolves text generation distributions across iterations, thereby augmenting the coverage of safety alignment data distributions through synthesized data examples. Based on the safety-aligned corpus enhanced through IMAGINE, our framework demonstrates significant decreases in attack success rate on Qwen2.5, Llama3.1, and Llama3.2 without compromising their utility.
[79] AraHealthQA 2025: The First Shared Task on Arabic Health Question Answering
Hassan Alhuzali, Farah Shamout, Muhammad Abdul-Mageed, Chaimae Abouzahir, Mouath Abu-Daoud, Ashwag Alasmari, Walid Al-Eisawi, Renad Al-Monef, Ali Alqahtani, Lama Ayash, Nizar Habash, Leen Kharouf
Main category: cs.CL
TL;DR: AraHealthQA 2025 is a comprehensive Arabic health question answering shared task with two tracks: MentalQA for mental health and MedArabiQ for broader medical domains, addressing the lack of high-quality Arabic medical QA resources.
Details
Motivation: To address the paucity of high-quality Arabic medical question answering resources and promote development in realistic, multilingual, and culturally nuanced healthcare contexts.
Method: Created two complementary tracks (MentalQA and MedArabiQ) with multiple subtasks, evaluation datasets, and standardized metrics. Developed baseline systems and established a framework for fair benchmarking.
Result: The shared task successfully provided standardized evaluation datasets and metrics, facilitating benchmarking of Arabic health QA systems across mental health and broader medical domains.
Conclusion: The task established a foundation for Arabic health QA research, with observed performance trends providing insights for future iterations and advancements in Arabic medical question answering systems.
Abstract: We introduce AraHealthQA 2025, the Comprehensive Arabic Health Question Answering Shared Task, held in conjunction with ArabicNLP 2025 (co-located with EMNLP 2025). This shared task addresses the paucity of high-quality Arabic medical QA resources by offering two complementary tracks: MentalQA, focusing on Arabic mental health Q&A (e.g., anxiety, depression, stigma reduction), and MedArabiQ, covering broader medical domains such as internal medicine, pediatrics, and clinical decision making. Each track comprises multiple subtasks, evaluation datasets, and standardized metrics, facilitating fair benchmarking. The task was structured to promote modeling under realistic, multilingual, and culturally nuanced healthcare contexts. We outline the dataset creation, task design and evaluation framework, participation statistics, and baseline systems, and summarize the overall outcomes. We conclude with reflections on the performance trends observed and prospects for future iterations in Arabic health QA.
cs.CV
[80] Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization
Alberto Compagnoni, Davide Caffagni, Nicholas Moratelli, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara
Main category: cs.CV
TL;DR: CHAIR-DPO method uses CHAIR metric to identify hallucinated vs non-hallucinated answers for Direct Preference Optimization, reducing hallucinations in MLLMs without complex synthetic data pipelines.
Details
Motivation: Address the persistent problem of hallucinations in Multimodal Large Language Models (MLLMs) where models generate answers not reflected in visual inputs, treating it as an alignment problem.
Method: Leverage the CHAIR metric to distinguish between hallucinated and non-hallucinated answers, then fine-tune off-the-shelf MLLMs using Direct Preference Optimization (DPO) with this preference data.
Result: CHAIR-DPO effectively reduces hallucinated answers across several hallucination benchmarks, demonstrating the effectiveness of using CHAIR-based rewards for alignment.
Conclusion: The proposed CHAIR-DPO method provides an effective and accessible approach to reduce hallucinations in MLLMs without relying on complex synthetic data pipelines or proprietary models.
Abstract: Multimodal Large Language Models (MLLMs) emerge as a unified interface to address a multitude of tasks, ranging from NLP to computer vision. Despite showcasing state-of-the-art results in many benchmarks, a long-standing issue is the tendency of MLLMs to hallucinate, that is to generate answers to the user’s query that are not reflected in the visual input. In this paper, we address the problem of hallucinations as an alignment problem, seeking to steer the MLLM so that it prefers generating content without hallucinations. In contrast to recent approaches that require complicated pipelines to build synthetic preference data for alignment training, often relying on proprietary models, we capitalize on the well-known CHAIR metric, originally proposed to gauge the degree of hallucinations in image captioning. Given a pair of generated answers, we leverage CHAIR to distinguish winner and loser options (i.e., non-hallucinated and hallucinated samples) and fine-tune off-the-shelf MLLMs via Direct Preference Optimization (DPO). The resulting method, which we refer to as CHAIR-DPO, effectively diminishes the amount of hallucinated answers on several hallucination benchmarks, demonstrating the effectiveness of fine-tuning the MLLM with a CHAIR-based reward. Source code and trained models are publicly available at https://github.com/aimagelab/CHAIR-DPO.
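A simplified sketch of how CHAIR-style scoring can rank a pair of answers for DPO: the per-answer hallucination rate is the fraction of mentioned objects absent from the image's ground-truth object set. The naive string matching below stands in for the metric's curated synonym lists over COCO categories.

```python
def chair_i(caption: str, gt_objects: set, obj_vocab: set) -> float:
    """Fraction of mentioned objects that are not in the image's ground truth."""
    mentioned = {w for w in caption.lower().split() if w in obj_vocab}
    if not mentioned:
        return 0.0
    return len(mentioned - gt_objects) / len(mentioned)

def preference_pair(answer_a, answer_b, gt_objects, obj_vocab):
    """Winner = lower hallucination rate; yields a (winner, loser) pair for DPO."""
    a = chair_i(answer_a, gt_objects, obj_vocab)
    b = chair_i(answer_b, gt_objects, obj_vocab)
    return (answer_a, answer_b) if a <= b else (answer_b, answer_a)
```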
[81] SDiFL: Stable Diffusion-Driven Framework for Image Forgery Localization
Yang Su, Shunquan Tan, Jiwu Huang
Main category: cs.CV
TL;DR: Integrates Stable Diffusion 3’s multimodal capabilities for image forgery localization by treating forgery residuals as an explicit modality, achieving 12% performance improvement without needing annotated data.
Details
Motivation: Existing image forgery localization methods struggle with emerging manipulation technologies and rely on costly annotated data. Multi-modal large models like Stable Diffusion offer new opportunities for forensic analysis.
Method: Leverages SD3’s multimodal framework by treating image forgery residuals (high-frequency signals) as an explicit modality, fusing them into latent space while preserving original semantic features.
Result: Achieves up to 12% performance improvement on benchmark datasets compared to state-of-the-art methods. Shows strong generalization to real-world document forgery and natural scene images unseen during training.
Conclusion: The integration of SD3’s multimodal capabilities with forgery residuals as an explicit modality provides an efficient and accurate solution for image forgery localization without requiring annotated data.
Abstract: Driven by the new generation of multi-modal large models, such as Stable Diffusion (SD), image manipulation technologies have advanced rapidly, posing significant challenges to image forensics. However, existing image forgery localization methods, which heavily rely on labor-intensive and costly annotated data, are struggling to keep pace with these emerging image manipulation technologies. To address these challenges, we are the first to integrate both image generation and powerful perceptual capabilities of SD into an image forensic framework, enabling more efficient and accurate forgery localization. First, we theoretically show that the multi-modal architecture of SD can be conditioned on forgery-related information, enabling the model to inherently output forgery localization results. Then, building on this foundation, we specifically leverage the multimodal framework of Stable Diffusion 3 (SD3) to enhance forgery localization performance. We leverage the multi-modal processing capabilities of SD3 in the latent space by treating image forgery residuals – high-frequency signals extracted using specific highpass filters – as an explicit modality. This modality is fused into the latent space during training to enhance forgery localization performance. Notably, our method fully preserves the latent features extracted by SD3, thereby retaining the rich semantic information of the input image. Experimental results show that our framework achieves up to 12% improvements in performance on widely used benchmarking datasets compared to current state-of-the-art image forgery localization models. Encouragingly, the model demonstrates strong performance on forensic tasks involving real-world document forgery images and forged natural scene images, even when such data were entirely unseen during training.
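The residual "modality" can be illustrated with a fixed highpass filter; a Laplacian kernel below stands in for the specific filters used in the paper, which are not named in the abstract.

```python
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., -1., 0.],
                          [-1., 4., -1.],
                          [0., -1., 0.]]).view(1, 1, 3, 3)

def forgery_residual(image: torch.Tensor) -> torch.Tensor:
    """image: (B, C, H, W); returns per-channel highpass residuals."""
    c = image.shape[1]
    kernel = LAPLACIAN.repeat(c, 1, 1, 1)  # depthwise: one filter per channel
    return F.conv2d(image, kernel, padding=1, groups=c)

# residual = forgery_residual(img)  # then encoded and fused alongside SD3 latents
```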
[82] Grounding Multimodal Large Language Models with Quantitative Skin Attributes: A Retrieval Study
Max Torop, Masih Eskandar, Nicholas Kurtansky, Jinyang Liu, Jochen Weber, Octavia Camps, Veronica Rotemberg, Jennifer Dy, Kivanc Kose
Main category: cs.CV
TL;DR: Combining multimodal LLMs with quantitative skin lesion attributes improves AI interpretability for skin cancer diagnosis by grounding model predictions in measurable clinical features.
Details
Motivation: AI models show promise in skin disease diagnosis but lack interpretability needed for clinical practice. MLLMs offer natural language reasoning while quantitative attributes provide measurable grounding for predictions.
Method: Fine-tune Multimodal Large Language Models to predict quantitative lesion attributes (e.g., lesion area) from images and evaluate grounding through attribute-specific content-based image retrieval on SLICE-3D dataset.
Result: Evidence shows MLLM embedding spaces can be effectively grounded in quantitative lesion attributes, enabling attribute-based retrieval and improved interpretability.
Conclusion: Combining MLLMs with quantitative attributes provides a promising approach for creating more interpretable and clinically useful AI diagnostic tools for skin diseases.
Abstract: Artificial Intelligence models have demonstrated significant success in diagnosing skin diseases, including cancer, showing the potential to assist clinicians in their analysis. However, the interpretability of model predictions must be significantly improved before they can be used in practice. To this end, we explore the combination of two promising approaches: Multimodal Large Language Models (MLLMs) and quantitative attribute usage. MLLMs offer a potential avenue for increased interpretability, providing reasoning for diagnosis in natural language through an interactive format. Separately, a number of quantitative attributes that are related to lesion appearance (e.g., lesion area) have recently been found predictive of malignancy with high accuracy. Predictions grounded as a function of such concepts have the potential for improved interpretability. We provide evidence that MLLM embedding spaces can be grounded in such attributes, through fine-tuning to predict their values from images. Concretely, we evaluate this grounding in the embedding space through an attribute-specific content-based image retrieval case study using the SLICE-3D dataset.
[83] Enhancing Automatic Modulation Recognition With a Reconstruction-Driven Vision Transformer Under Limited Labels
Hossein Ahmadi, Banafsheh Saffari
Main category: cs.CV
TL;DR: A unified Vision Transformer framework for automatic modulation recognition that combines supervised, self-supervised, and reconstruction objectives to achieve label-efficient performance with limited labeled data.
Details
Motivation: Existing AMR solutions rely on large labeled datasets or complex multi-stage training pipelines, which limit scalability and generalization in practical wireless communication applications.
Method: Uses a ViT encoder with lightweight convolutional decoder and linear classifier; reconstruction branch maps augmented signals back to originals to preserve fine-grained I/Q structure; combines supervised, self-supervised, and reconstruction objectives during pretraining with partial label supervision in fine-tuning.
Result: Outperforms supervised CNN and ViT baselines in low-label regimes on RML2018.01A dataset; approaches ResNet-level accuracy with only 15-20% labeled data; maintains strong performance across varying SNR levels.
Conclusion: Provides a simple, generalizable, and label-efficient solution for automatic modulation recognition that addresses the limitations of existing approaches.
Abstract: Automatic modulation recognition (AMR) is critical for cognitive radio, spectrum monitoring, and secure wireless communication. However, existing solutions often rely on large labeled datasets or multi-stage training pipelines, which limit scalability and generalization in practice. We propose a unified Vision Transformer (ViT) framework that integrates supervised, self-supervised, and reconstruction objectives. The model combines a ViT encoder, a lightweight convolutional decoder, and a linear classifier; the reconstruction branch maps augmented signals back to their originals, anchoring the encoder to fine-grained I/Q structure. This strategy promotes robust, discriminative feature learning during pretraining, while partial label supervision in fine-tuning enables effective classification with limited labels. On the RML2018.01A dataset, our approach outperforms supervised CNN and ViT baselines in low-label regimes, approaches ResNet-level accuracy with only 15-20% labeled data, and maintains strong performance across varying SNR levels. Overall, the framework provides a simple, generalizable, and label-efficient solution for AMR.
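The joint objective described above reduces to a weighted sum of a reconstruction term (on all samples) and a classification term (on the labeled subset). The sketch below is illustrative; the loss weight and masking scheme are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_loss(logits, labels, recon, original, labeled_mask, w_rec=1.0):
    """Reconstruction on all samples; cross-entropy only where labels exist."""
    rec_loss = F.mse_loss(recon, original)  # anchors the encoder to I/Q structure
    if labeled_mask.any():
        cls_loss = F.cross_entropy(logits[labeled_mask], labels[labeled_mask])
    else:
        cls_loss = torch.zeros((), device=logits.device)
    return cls_loss + w_rec * rec_loss
```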
[84] InfinityHuman: Towards Long-Term Audio-Driven Human
Xiaodi Li, Pan Xie, Yi Ren, Qijun Gan, Chen Zhang, Fangyuan Kong, Xiang Yin, Bingyue Peng, Zehuan Yuan
Main category: cs.CV
TL;DR: InfinityHuman is a coarse-to-fine framework for generating high-resolution, long-duration audio-driven human animation videos with stable appearance and natural hand motions, addressing issues of identity drift and poor hand modeling in existing methods.
Details
Motivation: Existing audio-driven human animation methods suffer from error accumulation causing identity drift, color shifts, scene instability, and poorly modeled hand movements with noticeable distortions and audio misalignment.
Method: A coarse-to-fine framework that first generates audio-synchronized representations, then refines them using a pose-guided refiner with stable poses and initial frame as visual anchor. Includes hand-specific reward mechanism trained with high-quality hand motion data.
Result: Achieves state-of-the-art performance on EMTD and HDTF datasets in video quality, identity preservation, hand accuracy, and lip-sync. Ablation studies confirm effectiveness of each module.
Conclusion: InfinityHuman successfully addresses key challenges in audio-driven human animation by leveraging pose-guided refinement and hand-specific rewards, producing high-quality, stable, and realistic long-duration videos.
Abstract: Audio-driven human animation has attracted wide attention thanks to its practical applications. However, critical challenges remain in generating high-resolution, long-duration videos with consistent appearance and natural hand motions. Existing methods extend videos using overlapping motion frames but suffer from error accumulation, leading to identity drift, color shifts, and scene instability. Additionally, hand movements are poorly modeled, resulting in noticeable distortions and misalignment with the audio. In this work, we propose InfinityHuman, a coarse-to-fine framework that first generates audio-synchronized representations, then progressively refines them into high-resolution, long-duration videos using a pose-guided refiner. Since pose sequences are decoupled from appearance and resist temporal degradation, our pose-guided refiner employs stable poses and the initial frame as a visual anchor to reduce drift and improve lip synchronization. Moreover, to enhance semantic accuracy and gesture realism, we introduce a hand-specific reward mechanism trained with high-quality hand motion data. Experiments on the EMTD and HDTF datasets show that InfinityHuman achieves state-of-the-art performance in video quality, identity preservation, hand accuracy, and lip-sync. Ablation studies further confirm the effectiveness of each module. Code will be made public.
[85] Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360-Degree Videos
Mert Cokelek, Halit Ozsoy, Nevrez Imamoglu, Cagri Ozcinar, Inci Ayhan, Erkut Erdem, Aykut Erdem
Main category: cs.CV
TL;DR: This paper introduces SalViT360 and SalViT360-AV models for 360-degree audio-visual saliency prediction, along with a new dataset YT360-EyeTracking, showing significant improvements over existing methods.
Details
Motivation: Addressing the lack of comprehensive datasets for 360-degree audio-visual saliency prediction and the complexities of spherical distortion and spatial audio integration in omnidirectional videos.
Method: Proposed two novel models: SalViT360 (vision-transformer-based with spherical geometry-aware attention) and SalViT360-AV (incorporates transformer adapters conditioned on audio input). Created YT360-EyeTracking dataset with 81 ODVs under varying audio-visual conditions.
Result: Both SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention across multiple benchmark datasets including their new YT360-EyeTracking dataset.
Conclusion: Integrating spatial audio cues in model architecture is crucial for accurate saliency prediction in omnidirectional videos, demonstrating the importance of audio-visual integration for VR experiences.
Abstract: Omnidirectional videos (ODVs) are redefining viewer experiences in virtual reality (VR) by offering an unprecedented full field-of-view (FOV). This study extends the domain of saliency prediction to 360-degree environments, addressing the complexities of spherical distortion and the integration of spatial audio. Contextually, ODVs have transformed user experience by adding a spatial audio dimension that aligns sound direction with the viewer’s perspective in spherical scenes. Motivated by the lack of comprehensive datasets for 360-degree audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs, each observed under varying audio-visual conditions. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360-degree videos. Towards this aim, we propose two novel saliency prediction models: SalViT360, a vision-transformer-based framework for ODVs equipped with spherical geometry-aware spatio-temporal attention layers, and SalViT360-AV, which further incorporates transformer adapters conditioned on audio input. Our results on a number of benchmark datasets, including our YT360-EyeTracking, demonstrate that SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention in 360-degree scenes. Interpreting these results, we suggest that integrating spatial audio cues in the model architecture is crucial for accurate saliency prediction in omnidirectional videos. Code and dataset will be available at https://cyberiada.github.io/SalViT360.
[86] Towards Inclusive Communication: A Unified LLM-Based Framework for Sign Language, Lip Movements, and Audio Understanding
Jeong Hun Yeo, Hyeongseop Rha, Sungjune Park, Junil Won, Yong Man Ro
Main category: cs.CV
TL;DR: A unified framework that integrates sign language, lip movements, and audio for spoken-language text generation, achieving state-of-the-art performance across multiple communication modalities.
Details
Motivation: To make communication technologies accessible to deaf and hard-of-hearing individuals by integrating visual alternatives (sign language and lip reading) with audio in a unified system, as these modalities have traditionally been studied in isolation.
Method: Designed a unified, modality-agnostic architecture capable of processing heterogeneous inputs (sign language, lip movements, and audio) for spoken-language text generation. Explicitly modeled lip movements as a separate modality to explore their role as non-manual cues in sign language comprehension.
Result: Achieved performance on par with or superior to state-of-the-art models specialized for individual tasks across Sign Language Translation (SLT), Visual Speech Recognition (VSR), Automatic Speech Recognition (ASR), and Audio-Visual Speech Recognition (AVSR). Explicit modeling of lip movements significantly improved SLT performance.
Conclusion: The unified framework successfully integrates multiple communication modalities and demonstrates that explicit modeling of lip movements enhances sign language comprehension, providing a comprehensive solution for accessible communication technologies.
Abstract: Audio is the primary modality for human communication and has driven the success of Automatic Speech Recognition (ASR) technologies. However, such systems remain inherently inaccessible to individuals who are deaf or hard of hearing. Visual alternatives such as sign language and lip reading offer effective substitutes, and recent advances in Sign Language Translation (SLT) and Visual Speech Recognition (VSR) have improved audio-less communication. Yet, these modalities have largely been studied in isolation, and their integration within a unified framework remains underexplored. In this paper, we introduce the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or superior to state-of-the-art models specialized for individual tasks. Building on this framework, we achieve performance on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and AVSR. Furthermore, our analysis reveals that explicitly modeling lip movements as a separate modality significantly improves SLT performance.
[87] A Novel Framework for Automated Explain Vision Model Using Vision-Language Models
Phu-Vinh Nguyen, Tan-Hanh Pham, Chris Ngo, Truong Son Hy
Main category: cs.CV
TL;DR: A pipeline using Vision-Language Models to explain vision models at both sample and dataset levels, enabling discovery of failure cases and insights with minimal effort.
Details
Motivation: Current vision model development focuses on performance metrics like accuracy but lacks explainability methods that capture general model behavior across large datasets, which is crucial to prevent bias and understand model patterns.
Method: Proposes a pipeline leveraging Vision-Language Models to provide explanations at both sample-by-sample level and dataset level, allowing comprehensive analysis of vision model behavior.
Result: The pipeline enables discovery of failure cases and provides insights into vision models’ general behavior patterns across large datasets.
Conclusion: This approach integrates vision model development with xAI analysis, advancing image analysis by making model explanations more accessible and comprehensive at multiple levels.
Abstract: The development of many vision models mainly focuses on improving their performance using metrics such as accuracy, IoU, and mAP, with less attention to explainability due to the complexity of applying xAI methods to provide a meaningful explanation of trained models. Although many existing xAI methods aim to explain vision models sample-by-sample, methods explaining the general behavior of vision models, which can only be captured after running on a large dataset, are still underexplored. Furthermore, understanding the behavior of vision models on general images can be very important to prevent biased judgments and help identify the model’s trends and patterns. With the application of Vision-Language Models, this paper proposes a pipeline to explain vision models at both the sample and dataset levels. The proposed pipeline can be used to discover failure cases and gain insights into vision models with minimal effort, thereby integrating vision model development with xAI analysis to advance image analysis.
[88] “Humor, Art, or Misinformation?”: A Multimodal Dataset for Intent-Aware Synthetic Image Detection
Anastasios Skoularikis, Stefanos-Iordanis Papadopoulos, Symeon Papadopoulos, Panagiotis C. Petrantonakis
Main category: cs.CV
TL;DR: S-HArM dataset for intent-aware classification of AI-generated images (Humor/Satire, Art, Misinformation) with 9,576 real-world image-text pairs and synthetic data generation strategies.
Details
Motivation: Existing multimodal AI detection methods overlook the intent behind AI-generated images, creating a gap in understanding why content was created.
Method: Created S-HArM dataset from Twitter/X and Reddit, explored three prompting strategies (image-guided, description-guided, multimodally-guided) with Stable Diffusion for synthetic data, tested various models including modality fusion, contrastive learning, and vision-language models.
Result: Models trained on image- and multimodally-guided data generalize better to real-world content due to preserved visual context, but overall performance remains limited.
Conclusion: Inferring intent from AI-generated content is complex and requires specialized architectures beyond current approaches.
Abstract: Recent advances in multimodal AI have enabled progress in detecting synthetic and out-of-context content. However, existing efforts largely overlook the intent behind AI-generated images. To fill this gap, we introduce S-HArM, a multimodal dataset for intent-aware classification, comprising 9,576 “in the wild” image-text pairs from Twitter/X and Reddit, labeled as Humor/Satire, Art, or Misinformation. Additionally, we explore three prompting strategies (image-guided, description-guided, and multimodally-guided) to construct a large-scale synthetic training dataset with Stable Diffusion. We conduct an extensive comparative study including modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models. Our results show that models trained on image- and multimodally-guided data generalize better to “in the wild” content, due to preserved visual context. However, overall performance remains limited, highlighting the complexity of inferring intent and the need for specialized architectures.
[89] ATMS-KD: Adaptive Temperature and Mixed Sample Knowledge Distillation for a Lightweight Residual CNN in Agricultural Embedded Systems
Mohamed Ohamouddou, Said Ohamouddou, Abdellatif El Afia, Rafik Lasri
Main category: cs.CV
TL;DR: ATMS-KD framework combines adaptive temperature scheduling with mixed-sample augmentation to create lightweight CNN models for agricultural applications, achieving 97.11% accuracy with compact models while maintaining low inference latency.
Details
Motivation: To develop efficient lightweight CNN models suitable for resource-constrained agricultural environments where computational resources are limited but accurate plant maturity classification is needed.
Method: Combines adaptive temperature scheduling with mixed-sample augmentation for knowledge distillation from MobileNetV3 Large teacher to lightweight residual CNN students (Compact, Standard, Enhanced configurations).
Result: All student models achieved validation accuracies exceeding 96.7% (vs 95-96% with direct training), with compact model reaching 97.11% accuracy and 72.19ms inference latency. Knowledge retention rates exceeded 99% across all configurations.
Conclusion: ATMS-KD effectively transfers knowledge to lightweight models, outperforming 11 established distillation methods and demonstrating practical applicability for agricultural computer vision under diverse environmental conditions.
Abstract: This study proposes ATMS-KD (Adaptive Temperature and Mixed-Sample Knowledge Distillation), a novel framework for developing lightweight CNN models suitable for resource-constrained agricultural environments. The framework combines adaptive temperature scheduling with mixed-sample augmentation to transfer knowledge from a MobileNetV3 Large teacher model (5.7M parameters) to lightweight residual CNN students. Three student configurations were evaluated: Compact (1.3M parameters), Standard (2.4M parameters), and Enhanced (3.8M parameters). The dataset used in this study consists of images of Rosa damascena (Damask rose) collected from agricultural fields in the Dades Oasis, southeastern Morocco, providing a realistic benchmark for agricultural computer vision applications under diverse environmental conditions. Experimental evaluation on the Damascena rose maturity classification dataset demonstrated significant improvements over direct training methods. All student models achieved validation accuracies exceeding 96.7% with ATMS-KD compared to 95–96% with direct training. The framework outperformed eleven established knowledge distillation methods, achieving 97.11% accuracy with the compact model – a 1.60 percentage point improvement over the second-best approach while maintaining the lowest inference latency of 72.19 ms. Knowledge retention rates exceeded 99% for all configurations, demonstrating effective knowledge transfer regardless of student model capacity.
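A hedged sketch of the two named ingredients, an annealed distillation temperature and mixup-style blended inputs shared by teacher and student, is given below. The linear schedule, Beta parameters, and loss weights are illustrative assumptions, not the paper's exact adaptation rule.

```python
import torch
import torch.nn.functional as F

def adaptive_temperature(epoch, total_epochs, t_start=5.0, t_end=2.0):
    """Anneal the distillation temperature from soft to sharper targets."""
    frac = epoch / max(total_epochs - 1, 1)
    return t_start + (t_end - t_start) * frac

def atms_kd_loss(student, teacher, x, y, epoch, total_epochs, alpha=0.5):
    lam = torch.distributions.Beta(1.0, 1.0).sample().item()
    idx = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[idx]  # mixed-sample augmentation (mixup)
    t = adaptive_temperature(epoch, total_epochs)
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x_mix) / t, dim=1)
    s_logits = student(x_mix)
    kd = F.kl_div(F.log_softmax(s_logits / t, dim=1), soft_targets,
                  reduction="batchmean") * t * t  # standard T^2 scaling
    ce = (lam * F.cross_entropy(s_logits, y)
          + (1 - lam) * F.cross_entropy(s_logits, y[idx]))
    return alpha * kd + (1 - alpha) * ce
```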
[90] Dino U-Net: Exploiting High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation
Yifan Gao, Haoyue Li, Feng Yuan, Xiaosong Wang, Xin Gao
Main category: cs.CV
TL;DR: Dino U-Net leverages DINOv3 foundation model features for medical image segmentation, achieving state-of-the-art performance across diverse datasets with a parameter-efficient approach.
Details
Motivation: Effectively transferring learned representations from large-scale natural image foundation models to precise medical image segmentation applications remains challenging.
Method: Proposes Dino U-Net with frozen DINOv3 backbone, specialized adapter for feature fusion, and fidelity-aware projection module (FAPM) to preserve feature quality during dimensionality reduction.
Result: Achieves state-of-the-art performance on seven diverse medical image segmentation datasets, with segmentation accuracy improving as backbone model size increases up to 7-billion parameters.
Conclusion: Leveraging dense-pretrained features from general-purpose foundation models provides a highly effective and parameter-efficient approach for advancing medical image segmentation accuracy.
Abstract: Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation. However, effectively transferring their learned representations for precise clinical applications remains a challenge. In this work, we propose Dino U-Net, a novel encoder-decoder architecture designed to exploit the high-fidelity dense features of the DINOv3 vision foundation model. Our architecture introduces an encoder built upon a frozen DINOv3 backbone, which employs a specialized adapter to fuse the model’s rich semantic features with low-level spatial details. To preserve the quality of these representations during dimensionality reduction, we design a new fidelity-aware projection module (FAPM) that effectively refines and projects the features for the decoder. We conducted extensive experiments on seven diverse public medical image segmentation datasets. Our results show that Dino U-Net achieves state-of-the-art performance, consistently outperforming previous methods across various imaging modalities. Our framework proves to be highly scalable, with segmentation accuracy consistently improving as the backbone model size increases up to the 7-billion-parameter variant. The findings demonstrate that leveraging the superior, dense-pretrained features from a general-purpose foundation model provides a highly effective and parameter-efficient approach to advance the accuracy of medical image segmentation. The code is available at https://github.com/yifangao112/DinoUNet.
[91] Linking heterogeneous microstructure informatics with expert characterization knowledge through customized and hybrid vision-language representations for industrial qualification
Mutahar Safdar, Gentry Wood, Max Zimmermann, Guy Lamouche, Priti Wanjara, Yaoyao Fiona Zhao
Main category: cs.CV
TL;DR: A novel framework using hybrid vision-language representations (VLRs) integrates microstructure images with expert textual assessments for zero-shot classification of materials quality without retraining.
Details
Motivation: Rapid qualification of advanced materials from additive manufacturing is challenging due to heterogeneous structures, requiring better integration of visual data with expert knowledge.
Method: Combines deep semantic segmentation with pre-trained multi-modal models (CLIP and FLAVA), creates customized similarity-based representations using positive/negative expert references, and uses z-score normalization for cross-modal alignment.
Result: Successfully distinguishes acceptable vs defective samples in metal matrix composites; FLAVA shows higher visual sensitivity while CLIP provides better textual alignment.
Conclusion: The framework enables human-in-the-loop decision making, enhances traceability and interpretability, and advances scalable domain-adaptable qualification strategies in engineering informatics.
Abstract: Rapid and reliable qualification of advanced materials remains a bottleneck in industrial manufacturing, particularly for heterogeneous structures produced via non-conventional additive manufacturing processes. This study introduces a novel framework that links microstructure informatics with a range of expert characterization knowledge using customized and hybrid vision-language representations (VLRs). By integrating deep semantic segmentation with pre-trained multi-modal models (CLIP and FLAVA), we encode both visual microstructural data and textual expert assessments into shared representations. To overcome limitations in general-purpose embeddings, we develop a customized similarity-based representation that incorporates both positive and negative references from expert-annotated images and their associated textual descriptions. This allows zero-shot classification of previously unseen microstructures through a net similarity scoring approach. Validation on an additively manufactured metal matrix composite dataset demonstrates the framework’s ability to distinguish between acceptable and defective samples across a range of characterization criteria. Comparative analysis reveals that FLAVA model offers higher visual sensitivity, while the CLIP model provides consistent alignment with the textual criteria. Z-score normalization adjusts raw unimodal and cross-modal similarity scores based on their local dataset-driven distributions, enabling more effective alignment and classification in the hybrid vision-language framework. The proposed method enhances traceability and interpretability in qualification pipelines by enabling human-in-the-loop decision-making without task-specific model retraining. By advancing semantic interoperability between raw data and expert knowledge, this work contributes toward scalable and domain-adaptable qualification strategies in engineering informatics.
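The z-score-normalized net similarity scoring can be sketched as follows: per-sample mean cosine similarities to expert-chosen positive and negative references are standardized over the dataset's own distribution and differenced, with the sign giving the zero-shot label. Embeddings are assumed to come from CLIP or FLAVA encoders; the thresholding at zero is an illustrative choice.

```python
import numpy as np

def zscore(x):
    return (x - x.mean()) / (x.std() + 1e-8)

def net_scores(sample_embs, pos_refs, neg_refs):
    """sample_embs: (N, d); pos_refs: (P, d); neg_refs: (Q, d) reference embeddings."""
    def unit(a):
        return a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    s, p, q = unit(sample_embs), unit(pos_refs), unit(neg_refs)
    sim_pos = (s @ p.T).mean(axis=1)  # mean cosine similarity to positive refs
    sim_neg = (s @ q.T).mean(axis=1)
    return zscore(sim_pos) - zscore(sim_neg)  # > 0 leans "acceptable"

# labels = np.where(net_scores(embs, pos, neg) > 0, "acceptable", "defective")
```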
[92] FakeParts: a New Family of AI-Generated DeepFakes
Gaetan Brison, Soobash Daiboo, Samy Aimeur, Awais Hussain Sani, Xi Wang, Gianni Franchi, Vicky Kalogeiton
Main category: cs.CV
TL;DR: FakeParts introduces a new class of deepfakes with localized manipulations that blend with real video content, making them harder to detect than traditional deepfakes. The paper presents FakePartsBench, a large benchmark dataset to evaluate detection methods.
Details
Motivation: Current deepfake detection methods are vulnerable to partial manipulations that blend seamlessly with authentic content, creating a critical gap in detection capabilities that needs to be addressed.
Method: Created FakePartsBench, a large-scale benchmark dataset with over 25K videos containing pixel-level and frame-level manipulation annotations for partial deepfakes, enabling comprehensive evaluation of detection methods.
Result: FakeParts reduces human detection accuracy by over 30% compared to traditional deepfakes, with similar performance degradation in state-of-the-art detection models.
Conclusion: This work identifies an urgent vulnerability in current deepfake detection approaches and provides resources to develop more robust methods for detecting partial video manipulations.
Abstract: We introduce FakeParts, a new class of deepfakes characterized by subtle, localized manipulations to specific spatial regions or temporal segments of otherwise authentic videos. Unlike fully synthetic content, these partial manipulations, ranging from altered facial expressions to object substitutions and background modifications, blend seamlessly with real elements, making them particularly deceptive and difficult to detect. To address the critical gap in detection capabilities, we present FakePartsBench, the first large-scale benchmark dataset specifically designed to capture the full spectrum of partial deepfakes. Comprising over 25K videos with pixel-level and frame-level manipulation annotations, our dataset enables comprehensive evaluation of detection methods. Our user studies demonstrate that FakeParts reduces human detection accuracy by over 30% compared to traditional deepfakes, with similar performance degradation observed in state-of-the-art detection models. This work identifies an urgent vulnerability in current deepfake detection approaches and provides the necessary resources to develop more robust methods for partial video manipulations.
[93] MedNet-PVS: A MedNeXt-Based Deep Learning Model for Automated Segmentation of Perivascular Spaces
Zhen Xuen Brandon Low, Rory Zhang, Hang Min, William Pham, Lucy Vivash, Jasmine Moses, Miranda Lynch, Karina Dorfman, Cassandra Marotta, Shaun Koh, Jacob Bunyamin, Ella Rowsthorn, Alex Jarema, Himashi Peiris, Zhaolin Chen, Sandy R. Shultz, David K. Wright, Dexiao Kong, Sharon L. Naismith, Terence J. O’Brien, Ying Xia, Meng Law, Benjamin Sinclair
Main category: cs.CV
TL;DR: MedNeXt-L-k5, a Transformer-inspired 3D CNN, was adapted for automated perivascular spaces segmentation in MRI, achieving state-of-the-art performance on T2w images but showing limitations on T1w data and cross-site generalization.
Details
Motivation: Manual PVS segmentation is time-consuming with moderate inter-rater reliability, while existing automated deep learning models have moderate performance and poor generalization across diverse MRI datasets.
Method: Adapted MedNeXt-L-k5 (Transformer-inspired 3D encoder-decoder CNN) for PVS segmentation. Trained two models: one on homogeneous T2w MRI from HCP-Aging (200 scans), another on heterogeneous T1w MRI from seven studies across six scanners (40 volumes). Evaluated using 5-fold cross validation and leave-one-site-out cross validation.
Result: T2w model achieved voxel-level Dice score of 0.88±0.06 (WM), comparable to inter-rater reliability and highest reported. T1w model scored 0.58±0.09 (WM). Under LOSOCV: voxel-level 0.38±0.16 (WM), 0.35±0.12 (BG); cluster-level 0.61±0.19 (WM), 0.62±0.21 (BG). Did not outperform nnU-Net.
Conclusion: MedNeXt-L-k5 provides efficient automated PVS segmentation across diverse T1w/T2w MRI datasets; since it did not outperform nnU-Net, the attention-based mechanisms of transformer-inspired models appear unnecessary for high accuracy in PVS segmentation.
Abstract: Enlarged perivascular spaces (PVS) are increasingly recognized as biomarkers of cerebral small vessel disease, Alzheimer’s disease, stroke, and aging-related neurodegeneration. However, manual segmentation of PVS is time-consuming and subject to moderate inter-rater reliability, while existing automated deep learning models have moderate performance and typically fail to generalize across diverse clinical and research MRI datasets. We adapted MedNeXt-L-k5, a Transformer-inspired 3D encoder-decoder convolutional network, for automated PVS segmentation. Two models were trained: one using a homogeneous dataset of 200 T2-weighted (T2w) MRI scans from the Human Connectome Project-Aging (HCP-Aging) dataset and another using 40 heterogeneous T1-weighted (T1w) MRI volumes from seven studies across six scanners. Model performance was evaluated using internal 5-fold cross validation (5FCV) and leave-one-site-out cross validation (LOSOCV). MedNeXt-L-k5 models trained on the T2w images of the HCP-Aging dataset achieved voxel-level Dice scores of 0.88±0.06 (white matter, WM), comparable to the reported inter-rater reliability of that dataset, and the highest yet reported in the literature. The same models trained on the T1w images of the HCP-Aging dataset achieved a substantially lower Dice score of 0.58±0.09 (WM). Under LOSOCV, the model had voxel-level Dice scores of 0.38±0.16 (WM) and 0.35±0.12 (basal ganglia, BG), and cluster-level Dice scores of 0.61±0.19 (WM) and 0.62±0.21 (BG). MedNeXt-L-k5 provides an efficient solution for automated PVS segmentation across diverse T1w and T2w MRI datasets. MedNeXt-L-k5 did not outperform the nnU-Net, indicating that the attention-based mechanisms that transformer-inspired models use to provide global context are not required for high accuracy in PVS segmentation.
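For readers unfamiliar with the two Dice variants reported here, the sketch below computes both on binary masks. The cluster-matching rule (any voxel overlap counts as a match) is a simplifying assumption and may differ from the paper's exact protocol.

```python
import numpy as np
from scipy.ndimage import label

def voxel_dice(pred, gt):
    """Voxel-level Dice: 2|A intersect B| / (|A| + |B|)."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def cluster_dice(pred, gt):
    """Cluster-level Dice over connected components (e.g., individual PVS):
    a predicted/true cluster counts as matched if it overlaps the other
    mask at all (simplifying assumption)."""
    pred_lab, n_pred = label(pred)
    gt_lab, n_gt = label(gt)
    tp_pred = sum(1 for i in range(1, n_pred + 1) if gt[pred_lab == i].any())
    tp_gt = sum(1 for j in range(1, n_gt + 1) if pred[gt_lab == j].any())
    return (tp_pred + tp_gt) / (n_pred + n_gt + 1e-8)

pred = np.zeros((32, 32, 32), bool); pred[4:8, 4:8, 4:8] = True
gt = np.zeros_like(pred); gt[5:9, 5:9, 5:9] = True
print(voxel_dice(pred, gt), cluster_dice(pred, gt))  # partial voxel overlap, full cluster match
```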
[94] Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation
Zhixiang Chi, Yanan Wu, Li Gu, Huan Liu, Ziqiang Wang, Yang Zhang, Yang Wang, Konstantinos N. Plataniotis
Main category: cs.CV
TL;DR: Training-free feedback framework that adapts output-based patch correspondences back to intermediate attention to enhance CLIP’s spatial coherence for open-vocabulary segmentation.
Details
Motivation: CLIP struggles with open-vocabulary segmentation due to poor localization and semantic discrepancies between intermediate attention and final outputs.
Method: Feedback-driven self-adaptive framework that uses output predictions as a spatial coherence prior, with attention isolation, confidence-based pruning, and adaptation ensemble modules.
Result: Consistently improves performance across eight benchmarks when integrated into four state-of-the-art approaches with three ViT backbones and multiple attention types.
Conclusion: Output-based feedback effectively enhances semantic consistency between internal representations and final predictions, serving as a plug-in module for various CLIP-based segmentation methods.
Abstract: CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. Prior methods enhance spatial coherence by modifying intermediate attention, but this coherence is not consistently propagated to the final output due to subsequent operations such as projections. Additionally, intermediate attention lacks direct interaction with text representations; this semantic discrepancy limits the full potential of CLIP. In this work, we propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention. The output predictions, being the culmination of the model’s processing, encapsulate the most comprehensive visual and textual semantics about each patch. Our approach enhances semantic consistency between internal representations and final predictions by leveraging the model’s outputs as a stronger spatial coherence prior. We design key modules, including attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, to effectively feed back the output coherence cues. Our method functions as a plug-in module, seamlessly integrating into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H). We further validate our framework across multiple attention types (Q-K, self-self, and Proxy augmented with MAE, SAM, and DINO). Our approach consistently improves their performance across eight benchmarks.
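A heavily simplified sketch of the feedback idea follows: derive patch-level correspondences from the final patch-text logits and blend them into an intermediate attention map. The correspondence definition and the mixing weight alpha are our assumptions; the paper's attention isolation, confidence-based pruning, and adaptation ensemble modules are omitted.

```python
import torch

def output_correspondence(patch_logits):
    """Patch-level correspondence from final patch-text logits: patches
    predicted to belong to the same class should attend to each other.
    (N, C) class logits -> (N, N) similarity over class distributions."""
    probs = patch_logits.softmax(dim=-1)            # (N, C)
    return probs @ probs.t()                        # (N, N)

def feedback_attention(attn, patch_logits, alpha=0.5):
    """Blend output-derived coherence into intermediate attention.
    alpha is a hypothetical mixing weight."""
    prior = output_correspondence(patch_logits)
    prior = prior / prior.sum(dim=-1, keepdim=True)  # row-normalize
    return (1 - alpha) * attn + alpha * prior

N, C = 196, 20                                       # 14x14 patches, 20 classes
attn = torch.rand(N, N).softmax(dim=-1)
logits = torch.randn(N, C)
print(feedback_attention(attn, logits).shape)        # (196, 196)
```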
[95] How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding
Zhuoran Yu, Yong Jae Lee
Main category: cs.CV
TL;DR: A probing framework to analyze layer-wise processing in MLLMs reveals consistent stage-wise structure across models, with early layers handling visual grounding, middle layers for semantic reasoning, and final layers for output preparation.
Details
Motivation: Multimodal Large Language Models show strong performance but their internal processing dynamics remain underexplored, requiring systematic analysis of how they integrate visual and textual information across layers.
Method: Trained linear classifiers to predict visual categories from token embeddings at each layer using standardized anchor questions, evaluated under three prompt variations: lexical variants, semantic negation, and output format changes.
Result: Identified consistent stage-wise structure across LLaVA-1.5, LLaVA-Next-LLaMA-3, and Qwen2-VL models. Early layers perform visual grounding, middle layers handle lexical integration and semantic reasoning, final layers prepare outputs. Structure remains stable but layer allocation shifts with base LLM architecture changes.
Conclusion: Provides unified perspective on MLLM layer organization and offers lightweight, model-agnostic approach for analyzing multimodal representation dynamics, revealing consistent functional patterns despite architectural variations.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong performance across a wide range of vision-language tasks, yet their internal processing dynamics remain underexplored. In this work, we introduce a probing framework to systematically analyze how MLLMs process visual and textual inputs across layers. We train linear classifiers to predict fine-grained visual categories (e.g., dog breeds) from token embeddings extracted at each layer, using a standardized anchor question. To uncover the functional roles of different layers, we evaluate these probes under three types of controlled prompt variations: (1) lexical variants that test sensitivity to surface-level changes, (2) semantic negation variants that flip the expected answer by modifying the visual concept in the prompt, and (3) output format variants that preserve reasoning but alter the answer format. Applying our framework to LLaVA-1.5, LLaVA-Next-LLaMA-3, and Qwen2-VL, we identify a consistent stage-wise structure in which early layers perform visual grounding, middle layers support lexical integration and semantic reasoning, and final layers prepare task-specific outputs. We further show that while the overall stage-wise structure remains stable across variations in visual tokenization, instruction tuning data, and pretraining corpus, the specific layer allocation to each stage shifts notably with changes in the base LLM architecture. Our findings provide a unified perspective on the layer-wise organization of MLLMs and offer a lightweight, model-agnostic approach for analyzing multimodal representation dynamics.
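The probing recipe itself is easy to reproduce in miniature: fit one linear classifier per layer on token embeddings and compare accuracy across depth. The scikit-learn probe and toy shapes below are our choices, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layers(layer_embeddings, labels):
    """layer_embeddings: list of (num_samples, dim) arrays, one per layer,
    e.g. the embedding of the answer token at each layer. Returns the
    held-out probe accuracy per layer."""
    accs = []
    for X in layer_embeddings:
        Xtr, Xte, ytr, yte = train_test_split(X, labels, test_size=0.3, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
        accs.append(clf.score(Xte, yte))
    return accs

# Toy data: 3 "layers", 200 samples, 64-d embeddings, 5 classes; the class
# signal is injected more strongly at deeper layers.
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=200)
layers = [rng.normal(size=(200, 64)) + (i * labels[:, None] * 0.1) for i in range(3)]
print(probe_layers(layers, labels))  # accuracy should rise with depth here
```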
[96] Disentangling Latent Embeddings with Sparse Linear Concept Subspaces (SLiCS)
Zhi Li, Hau Phan, Matthew Emigh, Austin J. Brockmeier
Main category: cs.CV
TL;DR: SLiCS method disentangles CLIP embeddings into sparse, non-negative concept subspaces using supervised dictionary learning for improved concept-filtered image retrieval.
Details
Motivation: To separate complex scene information in vision-language embedding spaces by decomposing embeddings into concept-specific component vectors in different subspaces.
Method: Supervised dictionary learning with group-structured sparse non-negative combinations, alternating optimization with convergence guarantees, leveraging text co-embeddings for semantic descriptions.
Result: More precise concept-filtered image retrieval and conditional generation across CLIP, TiTok autoencoder, and DINOv2 embeddings with quantitative and qualitative improvements.
Conclusion: Sparse linear concept subspaces successfully disentangle embedding spaces, enabling accurate concept-specific retrieval and generation across different embedding methods.
Abstract: Vision-language co-embedding networks, such as CLIP, provide a latent embedding space with semantic information that is useful for downstream tasks. We hypothesize that the embedding space can be disentangled to separate the information on the content of complex scenes by decomposing the embedding into multiple concept-specific component vectors that lie in different subspaces. We propose a supervised dictionary learning approach to estimate a linear synthesis model consisting of sparse, non-negative combinations of groups of vectors in the dictionary (atoms), whose group-wise activity matches the multi-label information. Each concept-specific component is a non-negative combination of atoms associated with a label. The group-structured dictionary is optimized through a novel alternating optimization with guaranteed convergence. Exploiting the text co-embeddings, we detail how semantically meaningful descriptions can be found based on text embeddings of words best approximated by a concept’s group of atoms, and how unsupervised dictionary learning can exploit zero-shot classification of training set images, using the text embeddings of concept labels, to provide instance-wise multi-labels. We show that the disentangled embeddings provided by our sparse linear concept subspaces (SLiCS) enable more precise concept-filtered image retrieval (and conditional generation using image-to-prompt). We also apply SLiCS to highly-compressed autoencoder embeddings from TiTok and the latent embedding from self-supervised DINOv2. Quantitative and qualitative results highlight the improved precision of the concept-filtered image retrieval for all embeddings.
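As a rough illustration of concept-filtered retrieval over a group-structured dictionary, the sketch below decomposes an embedding with plain non-negative least squares and aggregates coefficient energy per concept group; the paper's group-sparse solver and learned dictionary are replaced by these simpler stand-ins.

```python
import numpy as np
from scipy.optimize import nnls

def concept_energies(embedding, dictionary, groups):
    """Decompose an embedding as a non-negative combination of dictionary
    atoms and aggregate coefficient energy per concept group.
    dictionary: (dim, n_atoms); groups: concept id per atom.
    Plain NNLS stands in for the paper's group-structured sparse solver."""
    coef, _ = nnls(dictionary, embedding)
    return {g: coef[groups == g].sum() for g in np.unique(groups)}

rng = np.random.default_rng(0)
D = rng.random((64, 12))                 # 12 atoms in a 64-d embedding space
groups = np.repeat(np.arange(3), 4)      # 3 concepts x 4 atoms each
x = D[:, groups == 1] @ rng.random(4)    # embedding built from concept-1 atoms
energies = concept_energies(x, D, groups)
print(max(energies, key=energies.get))   # -> 1: filter/retrieve by this concept
```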
[97] MedFoundationHub: A Lightweight and Secure Toolkit for Deploying Medical Vision Language Foundation Models
Xiao Li, Yanfan Zhu, Ruining Deng, Wei-Qi Wei, Yu Wang, Shilin Zhao, Yaohong Wang, Haichun Yang, Yuankai Huo
Main category: cs.CV
TL;DR: MedFoundationHub is a GUI toolkit that enables secure deployment of medical vision-language models for clinical applications while addressing privacy and security concerns.
Details
Motivation: Medical VLMs present serious security risks including PHI exposure and data leakage, requiring safeguards for clinical and research use in healthcare environments.
Method: Developed a graphical user interface toolkit with Docker-orchestrated deployment, supporting plug-and-play integration of Hugging Face models on local workstations with single GPU requirements.
Result: Evaluated 5 state-of-the-art VLMs through 1015 clinician scoring events, revealing limitations including off-target answers, vague reasoning, and inconsistent pathology terminology.
Conclusion: MedFoundationHub provides a secure, accessible solution for medical VLM deployment while current models still show significant limitations in clinical performance.
Abstract: Recent advances in medical vision-language models (VLMs) open up remarkable opportunities for clinical applications such as automated report generation, copilots for physicians, and uncertainty quantification. However, despite their promise, medical VLMs introduce serious security concerns, most notably risks of Protected Health Information (PHI) exposure, data leakage, and vulnerability to cyberthreats, all of which are especially critical in hospital environments. Even when adopted for research or non-clinical purposes, healthcare organizations must exercise caution and implement safeguards. To address these challenges, we present MedFoundationHub, a graphical user interface (GUI) toolkit that: (1) enables physicians to manually select and use different models without programming expertise, (2) supports engineers in efficiently deploying medical VLMs in a plug-and-play fashion, with seamless integration of Hugging Face open-source models, and (3) ensures privacy-preserving inference through Docker-orchestrated, operating system agnostic deployment. MedFoundationHub requires only an offline local workstation equipped with a single NVIDIA A6000 GPU, making it both secure and accessible within the typical resources of academic research labs. To evaluate current capabilities, we engaged board-certified pathologists to deploy and assess five state-of-the-art VLMs (Google-MedGemma3-4B, Qwen2-VL-7B-Instruct, Qwen2.5-VL-7B-Instruct, and LLaVA-1.5-7B/13B). Expert evaluation covered colon cases and renal cases, yielding 1015 clinician-model scoring events. These assessments revealed recurring limitations, including off-target answers, vague reasoning, and inconsistent pathology terminology.
[98] Enhancing Mamba Decoder with Bidirectional Interaction in Multi-Task Dense Prediction
Mang Cao, Sanping Zhou, Yizhe Li, Ye Deng, Wenli Huang, Le Wang
Main category: cs.CV
TL;DR: Bidirectional Interaction Mamba (BIM) addresses the trade-off between cross-task interaction completeness and computational efficiency in multi-task dense prediction through novel bidirectional scanning mechanisms.
Details
Motivation: Existing multi-task dense prediction methods face a fundamental trade-off between achieving sufficient cross-task interaction (which is crucial for performance) and maintaining computational efficiency. Current approaches struggle to balance interaction completeness with manageable computational complexity.
Method: Proposes BIM with two novel mechanisms: 1) Bidirectional Interaction Scan (BI-Scan) that constructs task-specific representations as bidirectional sequences with task-first and position-first scanning modes, and 2) Multi-Scale Scan (MS-Scan) for multi-granularity scene modeling to meet diverse task requirements and enhance cross-task feature interactions.
Result: Extensive experiments on NYUD-V2 and PASCAL-Context benchmarks demonstrate that BIM achieves state-of-the-art performance, showing superiority over existing competitors while maintaining linear computational complexity.
Conclusion: BIM successfully resolves the trade-off between interaction completeness and computational efficiency in multi-task dense prediction through innovative bidirectional scanning mechanisms, providing both effective cross-task interaction and computational efficiency.
Abstract: Sufficient cross-task interaction is crucial for success in multi-task dense prediction. However, sufficient interaction often results in high computational complexity, forcing existing methods to face the trade-off between interaction completeness and computational efficiency. To address this limitation, this work proposes a Bidirectional Interaction Mamba (BIM), which incorporates novel scanning mechanisms to adapt the Mamba modeling approach for multi-task dense prediction. On the one hand, we introduce a novel Bidirectional Interaction Scan (BI-Scan) mechanism, which constructs task-specific representations as bidirectional sequences during interaction. By integrating task-first and position-first scanning modes within a unified linear complexity architecture, BI-Scan efficiently preserves critical cross-task information. On the other hand, we employ a Multi-Scale Scan (MS-Scan) mechanism to achieve multi-granularity scene modeling. This design not only meets the diverse granularity requirements of various tasks but also enhances nuanced cross-task feature interactions. Extensive experiments on two challenging benchmarks, i.e., NYUD-V2 and PASCAL-Context, show the superiority of our BIM over its state-of-the-art competitors.
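The scanning orders are easier to see in code than in prose. The sketch below only builds the four BI-Scan sequences (task-first and position-first, each read in both directions); the Mamba blocks that consume them are omitted.

```python
import torch

def bi_scan_sequences(feats):
    """feats: (T, N, C) task-specific tokens (T tasks, N spatial positions).
    The task-first scan visits every task at a position before moving to the
    next position; the position-first scan visits every position of a task
    first. Each order is also flipped, giving four 1-D sequences."""
    T, N, C = feats.shape
    task_first = feats.permute(1, 0, 2).reshape(N * T, C)   # p0t0, p0t1, ...
    pos_first = feats.reshape(T * N, C)                     # t0p0, t0p1, ...
    return [task_first, task_first.flip(0), pos_first, pos_first.flip(0)]

feats = torch.randn(4, 196, 256)   # 4 tasks, 14x14 positions, 256 channels
for seq in bi_scan_sequences(feats):
    print(seq.shape)               # each: (784, 256)
```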
[99] Audio-Guided Visual Editing with Complex Multi-Modal Prompts
Hyeonyu Kim, Seokhoon Jeong, Seonghee Han, Chanhyuk Choi, Taehwan Kim
Main category: cs.CV
TL;DR: Audio-guided visual editing framework that handles complex editing tasks with multiple text and audio prompts without requiring additional training, using pre-trained multi-modal encoders and novel noise branching techniques.
Details
Motivation: Textual guidance alone is insufficient for complex visual editing scenarios, and existing audio-guided methods require dataset-specific training that limits real-world generalization.
Method: Leverages a pre-trained multi-modal encoder with zero-shot capabilities, integrates diverse audio into visual editing by aligning the audio encoder space with the diffusion model’s prompt encoder space, and uses separate noise branching with adaptive patch selection for multi-modal prompts.
Result: Comprehensive experiments show the framework excels in handling complicated editing scenarios by incorporating rich audio information where text-only approaches fail.
Conclusion: The proposed audio-guided visual editing framework successfully addresses complex editing tasks without additional training, demonstrating superior performance over text-only methods through effective multi-modal integration.
Abstract: Visual editing with diffusion models has made significant progress but often struggles with complex scenarios that textual guidance alone cannot adequately describe, highlighting the need for additional non-text editing prompts. In this work, we introduce a novel audio-guided visual editing framework that can handle complex editing tasks with multiple text and audio prompts without requiring additional training. Existing audio-guided visual editing methods often necessitate training on specific datasets to align audio with text, limiting their generalization to real-world situations. We leverage a pre-trained multi-modal encoder with strong zero-shot capabilities and integrate diverse audio into visual editing tasks by alleviating the discrepancy between the audio encoder space and the diffusion model’s prompt encoder space. Additionally, we propose a novel approach to handle complex scenarios with multiple and multi-modal editing prompts through our separate noise branching and adaptive patch selection. Our comprehensive experiments on diverse editing tasks demonstrate that our framework excels in handling complicated editing scenarios by incorporating rich information from audio, where text-only approaches fail.
[100] More Reliable Pseudo-labels, Better Performance: A Generalized Approach to Single Positive Multi-label Learning
Luong Tran, Thieu Vo, Anh Nguyen, Sang Dinh, Van Nguyen
Main category: cs.CV
TL;DR: Proposes AEVLP framework with GPR Loss and DAMP technique for single positive multi-label learning, achieving state-of-the-art results by effectively handling noisy pseudo-labels.
Details
Motivation: Fully annotating large-scale multi-label datasets is costly and impractical. Traditional SPML methods that treat missing labels as unknown/negative cause inaccuracies and false negatives, while pseudo-labeling strategies introduce additional noise.
Method: Developed the Generalized Pseudo-Label Robust Loss (GPR Loss) to learn from diverse pseudo-labels while mitigating noise, together with the Dynamic Augmented Multi-focus Pseudo-labeling (DAMP) technique; the two are combined into the Adaptive and Efficient Vision-Language Pseudo-Labeling (AEVLP) framework.
Result: Extensive experiments on four benchmark datasets demonstrate significant advancements in multi-label classification, achieving state-of-the-art results.
Conclusion: The proposed AEVLP framework with GPR Loss and DAMP technique effectively addresses challenges in single positive multi-label learning, providing robust performance against noisy pseudo-labels and achieving superior classification results.
Abstract: Multi-label learning is a challenging computer vision task that requires assigning multiple categories to each image. However, fully annotating large-scale datasets is often impractical due to high costs and effort, motivating the study of learning from partially annotated data. In the extreme case of Single Positive Multi-Label Learning (SPML), each image is provided with only one positive label, while all other labels remain unannotated. Traditional SPML methods that treat missing labels as unknown or negative tend to yield inaccuracies and false negatives, and integrating various pseudo-labeling strategies can introduce additional noise. To address these challenges, we propose the Generalized Pseudo-Label Robust Loss (GPR Loss), a novel loss function that effectively learns from diverse pseudo-labels while mitigating noise. Complementing this, we introduce a simple yet effective Dynamic Augmented Multi-focus Pseudo-labeling (DAMP) technique. Together, these contributions form the Adaptive and Efficient Vision-Language Pseudo-Labeling (AEVLP) framework. Extensive experiments on four benchmark datasets demonstrate that our framework significantly advances multi-label classification, achieving state-of-the-art results.
[101] Ultra-Low-Latency Spiking Neural Networks with Temporal-Dependent Integrate-and-Fire Neuron Model for Objects Detection
Chengjun Zhang, Yuhao Zhang, Jie Yang, Mohamad Sawan
Main category: cs.CV
TL;DR: A novel temporal-dependent Integrate-and-Fire (tdIF) neuron architecture for SNNs that enables dynamic adjustment of accumulation and firing behaviors based on temporal order, achieving state-of-the-art performance in visual detection tasks with ultra-low latency (within 5 time-steps).
Details
Motivation: Current ANN-SNN conversion methods perform well in classification tasks but show suboptimal performance in visual detection tasks due to residual membrane potential issues caused by heterogeneous spiking patterns.
Method: Proposes a delay-spike approach to mitigate residual membrane potential issues and introduces a novel tdIF neuron architecture that allows IF neurons to dynamically adjust accumulation and firing behaviors based on the temporal order of time-steps.
Result: Achieves more precise feature representation with lower time-steps, enabling high performance and ultra-low latency in visual detection tasks. Surpasses current ANN-SNN conversion approaches with state-of-the-art performance within 5 time-steps.
Conclusion: The tdIF neuron architecture enables spikes to exhibit distinct temporal properties rather than relying solely on frequency-based representations, while maintaining energy consumption on par with traditional IF neurons, making it highly effective for visual detection tasks.
Abstract: Spiking Neural Networks (SNNs), inspired by the brain, are characterized by minimal power consumption and swift inference capabilities on neuromorphic hardware, and have been widely applied to various visual perception tasks. Current ANN-SNN conversion methods have achieved excellent results in classification tasks with ultra-low time-steps, but their performance in visual detection tasks remains suboptimal. In this paper, we propose a delay-spike approach to mitigate the issue of residual membrane potential caused by heterogeneous spiking patterns. Furthermore, we propose a novel temporal-dependent Integrate-and-Fire (tdIF) neuron architecture for SNNs. This enables Integrate-and-Fire (IF) neurons to dynamically adjust their accumulation and firing behaviors based on the temporal order of time-steps. Our method enables spikes to exhibit distinct temporal properties, rather than relying solely on frequency-based representations. Moreover, the tdIF neuron maintains energy consumption on par with the traditional IF neuron. We demonstrate that our method achieves more precise feature representation with lower time-steps, enabling high performance and ultra-low latency in visual detection tasks. In this study, we conduct an extensive evaluation of the tdIF method across two critical vision tasks: object detection and lane line detection. The results demonstrate that the proposed method surpasses current ANN-SNN conversion approaches, achieving state-of-the-art performance with ultra-low latency (within 5 time-steps).
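The exact tdIF dynamics are defined in the paper; the toy neuron below merely illustrates the core idea that accumulation can depend on the temporal order of time-steps, using a hypothetical exponential decay over the step index.

```python
import numpy as np

def tdif_neuron(inputs, threshold=1.0):
    """Toy temporal-order-dependent IF neuron. A vanilla IF neuron treats
    every time-step identically; here each step t scales its input by a
    (hypothetical) factor 2**-t, so earlier inputs carry more weight,
    mimicking how tdIF lets accumulation/firing depend on temporal order.
    inputs: (T,) input current per time-step."""
    v, spikes = 0.0, []
    for t, x in enumerate(inputs):
        v += (2.0 ** -t) * x          # temporal-order-dependent accumulation
        if v >= threshold:
            spikes.append(1)
            v -= threshold            # soft reset
        else:
            spikes.append(0)
    return spikes

print(tdif_neuron(np.array([0.9, 0.9, 0.9, 0.9, 0.9])))  # [0, 1, 0, 0, 0]
```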
[102] Graph-Based Uncertainty Modeling and Multimodal Fusion for Salient Object Detection
Yuqi Xiong, Wuzhen Shi, Yang Wen, Ruhan Liu
Main category: cs.CV
TL;DR: Proposes DUP-MCRNet for salient object detection with dynamic uncertainty propagation and multimodal fusion to improve edge clarity and handle complex scenes.
Details
Motivation: Existing SOD methods lose details, blur edges, and insufficiently fuse single-modal information in complex scenes, needing better handling of small structures and cross-modal complementarity.
Method: Uses dynamic uncertainty graph convolution for spatial semantic propagation, multimodal collaborative fusion with learnable gating weights for RGB/depth/edge features, and multi-scale loss optimization with uncertainty-guided supervision.
Result: Outperforms various SOD methods on benchmark datasets, especially in edge clarity and robustness to complex backgrounds.
Conclusion: DUP-MCRNet effectively addresses detail loss and edge blurring through uncertainty propagation and multimodal fusion, demonstrating superior performance in complex scenarios.
Abstract: Existing salient object detection (SOD) methods are prone to losing details, blurring edges, and insufficiently fusing single-modal information in complex scenes. To address these problems, this paper proposes a dynamic uncertainty propagation and multimodal collaborative reasoning network (DUP-MCRNet). Firstly, a dynamic uncertainty graph convolution module (DUGC) is designed to propagate uncertainty between layers through a sparse graph constructed based on spatial semantic distance; combined with channel adaptive interaction, it effectively improves the detection accuracy of small structures and edge regions. Secondly, a multimodal collaborative fusion strategy (MCF) is proposed, which uses learnable modality gating weights to perform a weighted fusion of the attention maps of RGB, depth, and edge features. It can dynamically adjust the importance of each modality according to different scenes, effectively suppress redundant or interfering information, and strengthen the semantic complementarity and consistency across modalities, thereby improving the ability to identify salient regions under occlusion, weak texture, or background interference. Finally, detection performance at the pixel level and region level is optimized through multi-scale BCE and IoU losses, cross-scale consistency constraints, and uncertainty-guided supervision mechanisms. Extensive experiments show that DUP-MCRNet outperforms various SOD methods on most common benchmark datasets, especially in terms of edge clarity and robustness to complex backgrounds. Our code is publicly available at https://github.com/YukiBear426/DUP-MCRNet.
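A minimal sketch of the MCF gating idea, with per-scene weights predicted from the modalities' own global statistics; the layer sizes and gating form are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class ModalityGatedFusion(nn.Module):
    """Fuse RGB/depth/edge attention maps with per-scene gating weights so
    the network can down-weight unreliable modalities scene by scene."""
    def __init__(self, num_modalities=3):
        super().__init__()
        self.gate = nn.Linear(num_modalities, num_modalities)

    def forward(self, attn_maps):
        # attn_maps: (B, M, H, W) stacked per-modality attention maps.
        stats = attn_maps.mean(dim=(2, 3))                    # (B, M) global stats
        w = torch.softmax(self.gate(stats), dim=1)            # per-scene weights
        return (w[:, :, None, None] * attn_maps).sum(dim=1)   # (B, H, W)

fusion = ModalityGatedFusion()
maps = torch.rand(2, 3, 64, 64)   # RGB, depth, edge attention maps
print(fusion(maps).shape)         # (2, 64, 64)
```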
[103] MSMVD: Exploiting Multi-scale Image Features via Multi-scale BEV Features for Multi-view Pedestrian Detection
Taiga Yamane, Satoshi Suzuki, Ryo Masumura, Shota Orihashi, Tomohiro Tanaka, Mana Ihori, Naoki Makishima, Naotaka Kawata
Main category: cs.CV
TL;DR: MSMVD proposes multi-scale BEV feature generation to improve pedestrian detection in multi-view systems, addressing scale consistency issues across different camera views.
Details
Motivation: Existing MVPD methods struggle with detecting pedestrians that have consistently small/large scales within views or vastly different scales between views, due to not exploiting multi-scale image features.
Method: Generates multi-scale BEV features by projecting multi-scale image features from individual views into BEV space scale-by-scale, then processes them using a feature pyramid network to combine information across different scales and views.
Result: Extensive experiments show MSMVD outperforms previous methods by 4.5 MODA points on GMVD dataset, demonstrating significant improvement in detection performance.
Conclusion: Exploiting multi-scale image features through multi-scale BEV features greatly enhances pedestrian detection performance in multi-view systems, effectively addressing scale variation challenges.
Abstract: Multi-View Pedestrian Detection (MVPD) aims to detect pedestrians in the form of a bird’s eye view (BEV) from multi-view images. In MVPD, end-to-end trainable deep learning methods have progressed greatly. However, they often struggle to detect pedestrians with consistently small or large scales in views or with vastly different scales between views. This is because they do not exploit multi-scale image features to generate the BEV feature and detect pedestrians. To overcome this problem, we propose a novel MVPD method, called Multi-Scale Multi-View Detection (MSMVD). MSMVD generates multi-scale BEV features by projecting multi-scale image features extracted from individual views into the BEV space, scale-by-scale. Each of these BEV features inherits the properties of its corresponding scale image features from multiple views. Therefore, these BEV features help the precise detection of pedestrians with consistently small or large scales in views. Then, MSMVD combines information at different scales of multiple views by processing the multi-scale BEV features using a feature pyramid network. This improves the detection of pedestrians with vastly different scales between views. Extensive experiments demonstrate that exploiting multi-scale image features via multi-scale BEV features greatly improves the detection performance, and MSMVD outperforms the previous highest MODA by 4.5 points on the GMVD dataset.
[104] A Spatial-Frequency Aware Multi-Scale Fusion Network for Real-Time Deepfake Detection
Libo Lv, Tianyi Wang, Mengxiao Huang, Ruixia Liu, Yinglong Wang
Main category: cs.CV
TL;DR: Proposes SFMFNet, a lightweight deepfake detection network that combines spatial and frequency features with efficient multi-scale fusion for real-time performance.
Details
Motivation: Current deepfake detectors achieve high accuracy but are computationally expensive, making them unsuitable for real-time applications like video conferencing and social media.
Method: Uses a spatial-frequency hybrid aware module with a gated mechanism, token-selective cross attention for multi-level feature interaction, and residual-enhanced blur pooling for downsampling.
Result: Achieves favorable balance between accuracy and efficiency on benchmark datasets, with strong generalization capabilities for real-time applications.
Conclusion: SFMFNet provides an effective lightweight solution for real-time deepfake detection with practical value for deployment in real-world scenarios.
Abstract: With the rapid advancement of real-time deepfake generation techniques, forged content is becoming increasingly realistic and widespread across applications like video conferencing and social media. Although state-of-the-art detectors achieve high accuracy on standard benchmarks, their heavy computational cost hinders real-time deployment in practical applications. To address this, we propose the Spatial-Frequency Aware Multi-Scale Fusion Network (SFMFNet), a lightweight yet effective architecture for real-time deepfake detection. We design a spatial-frequency hybrid aware module that jointly leverages spatial textures and frequency artifacts through a gated mechanism, enhancing sensitivity to subtle manipulations. A token-selective cross attention mechanism enables efficient multi-level feature interaction, while a residual-enhanced blur pooling structure helps retain key semantic cues during downsampling. Experiments on several benchmark datasets show that SFMFNet achieves a favorable balance between accuracy and efficiency, with strong generalization and practical value for real-time applications.
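As a rough sketch of a gated spatial-frequency hybrid block: one branch sees spatial textures, the other sees FFT magnitudes, where generation artifacts often concentrate. The channel sizes and gating form are our assumptions, not SFMFNet's actual design.

```python
import torch
import torch.nn as nn

class SpatialFrequencyGate(nn.Module):
    """A spatial conv branch, a frequency branch built on the 2-D FFT
    magnitude, and a learned sigmoid gate mixing the two per pixel."""
    def __init__(self, ch):
        super().__init__()
        self.spatial = nn.Conv2d(ch, ch, 3, padding=1)
        self.freq = nn.Conv2d(ch, ch, 1)
        self.gate = nn.Sequential(nn.Conv2d(2 * ch, ch, 1), nn.Sigmoid())

    def forward(self, x):
        s = self.spatial(x)
        f = self.freq(torch.fft.fft2(x, norm="ortho").abs())  # frequency cues
        g = self.gate(torch.cat([s, f], dim=1))
        return g * s + (1 - g) * f

blk = SpatialFrequencyGate(16)
print(blk(torch.randn(2, 16, 64, 64)).shape)  # (2, 16, 64, 64)
```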
[105] Dual-Model Weight Selection and Self-Knowledge Distillation for Medical Image Classification
Ayaka Tsutsumi, Guang Li, Ren Togo, Takahiro Ogawa, Satoshi Kondo, Miki Haseyama
Main category: cs.CV
TL;DR: A lightweight medical image classification method using dual-model weight selection and self-knowledge distillation to achieve large-model performance with computational efficiency.
Details
Motivation: Address computational constraints in real-world medical settings where deploying large-scale models is impractical, requiring lightweight alternatives that maintain high performance.
Method: Uses dual-model weight selection from large pretrained models, applies self-knowledge distillation for knowledge transfer without excessive computational cost, followed by fine-tuning for target tasks.
Result: Extensive experiments on chest X-ray, lung CT scans, and brain MRI datasets demonstrate superior performance and robustness compared to existing methods.
Conclusion: The combined approach of dual-model weight selection and self-knowledge distillation effectively overcomes limitations of conventional methods in retaining critical information in compact medical image classification models.
Abstract: We propose a novel medical image classification method that integrates dual-model weight selection with self-knowledge distillation (SKD). In real-world medical settings, deploying large-scale models is often limited by computational resource constraints, which pose significant challenges for their practical implementation. Thus, developing lightweight models that achieve comparable performance to large-scale models while maintaining computational efficiency is crucial. To address this, we employ a dual-model weight selection strategy that initializes two lightweight models with weights derived from a large pretrained model, enabling effective knowledge transfer. Next, SKD is applied to these selected models, allowing the use of a broad range of initial weight configurations without imposing additional excessive computational cost, followed by fine-tuning for the target classification tasks. By combining dual-model weight selection with self-knowledge distillation, our method overcomes the limitations of conventional approaches, which often fail to retain critical information in compact models. Extensive experiments on publicly available datasets (chest X-ray images, lung computed tomography scans, and brain magnetic resonance imaging scans) demonstrate the superior performance and robustness of our approach compared to existing methods.
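The distillation step can be sketched with a generic objective: cross-entropy on the labels plus temperature-scaled KL between the two lightweight models' predictions. The temperature, mixing weight, and teacher/student roles are illustrative; the paper's exact SKD formulation may differ.

```python
import torch
import torch.nn.functional as F

def skd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Generic (self-)distillation loss: CE on labels plus KL between the
    temperature-softened predictions of the two lightweight models."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits.detach() / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return (1 - alpha) * ce + alpha * kd

logits_a = torch.randn(8, 3, requires_grad=True)   # model A (student role)
logits_b = torch.randn(8, 3)                       # model B (teacher role)
labels = torch.randint(0, 3, (8,))
print(skd_loss(logits_a, logits_b, labels))
```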
[106] Re-Densification Meets Cross-Scale Propagation: Real-Time Compression of LiDAR Point Clouds
Pengpeng Yu, Haoran Li, Dingquan Li, Runqing Jiang, Jing Wang, Liang Lin, Yulan Guo
Main category: cs.CV
TL;DR: A novel LiDAR point cloud compression method using geometry re-densification and cross-scale feature propagation for efficient predictive coding, achieving state-of-the-art compression with real-time performance.
Details
Motivation: High-precision LiDAR scans incur substantial storage and transmission overhead, and existing methods struggle with efficient context modeling due to extreme sparsity of geometric details, limiting compression performance and speed.
Method: Proposes two lightweight modules: 1) Geometry Re-Densification Module that re-densifies sparse geometry, extracts features at denser scale, then re-sparsifies for predictive coding; 2) Cross-scale Feature Propagation Module that leverages occupancy cues from multiple resolutions to guide hierarchical feature propagation across scales.
Result: Achieves state-of-the-art compression ratios on KITTI dataset with real-time performance (26 FPS for both encoding and decoding at 12-bit quantization).
Conclusion: The proposed framework generates compact feature representations that enable efficient context modeling and accelerate the coding process while maintaining high compression performance.
Abstract: LiDAR point clouds are fundamental to various applications, yet high-precision scans incur substantial storage and transmission overhead. Existing methods typically convert unordered points into hierarchical octree or voxel structures for dense-to-sparse predictive coding. However, the extreme sparsity of geometric details hinders efficient context modeling, thereby limiting their compression performance and speed. To address this challenge, we propose to generate compact features for efficient predictive coding. Our framework comprises two lightweight modules. First, the Geometry Re-Densification Module re-densifies encoded sparse geometry, extracts features at denser scale, and then re-sparsifies the features for predictive coding. This module avoids costly computation on highly sparse details while maintaining a lightweight prediction head. Second, the Cross-scale Feature Propagation Module leverages occupancy cues from multiple resolution levels to guide hierarchical feature propagation. This design facilitates information sharing across scales, thereby reducing redundant feature extraction and providing enriched features for the Geometry Re-Densification Module. By integrating these two modules, our method yields a compact feature representation that provides efficient context modeling and accelerates the coding process. Experiments on the KITTI dataset demonstrate state-of-the-art compression ratios and real-time performance, achieving 26 FPS for both encoding and decoding at 12-bit quantization. Code is available at https://github.com/pengpeng-yu/FastPCC.
[107] Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation
Xiaochuan Li, Guoguang Du, Runze Zhang, Liang Jin, Qi Jia, Lihua Lu, Zhenhua Guo, Yaqian Zhao, Haiyang Liu, Tianqi Wang, Changsheng Li, Xiaoli Gong, Rengang Li, Baoyu Fan
Main category: cs.CV
TL;DR: The paper addresses 3D data scarcity by leveraging video priors for 3D generation, introducing Droplet3D-4M dataset and a generative model that produces spatially consistent and semantically plausible 3D content.
Details
Motivation: Overcome data scarcity in the 3D domain by utilizing abundant video data that contains spatial consistency priors and rich semantic information, providing an alternative supervisory signal for 3D generation.
Method: Introduce Droplet3D-4M, the first large-scale video dataset with multi-view annotations, and train Droplet3D, a generative model supporting both image and dense text input to leverage video commonsense priors.
Result: Extensive experiments show the approach produces spatially consistent and semantically plausible 3D content, with potential for extension to scene-level applications, outperforming prevailing 3D solutions.
Conclusion: Commonsense priors from videos significantly facilitate 3D creation, and the approach demonstrates effectiveness in mitigating generalization bottlenecks caused by limited native 3D data.
Abstract: Scaling laws have validated the success and promise of large-data-trained models in creative generation across text, image, and video domains. However, this paradigm faces data scarcity in the 3D domain, as there is far less of it available on the internet compared to the aforementioned modalities. Fortunately, there exist adequate videos that inherently contain commonsense priors, offering an alternative supervisory signal to mitigate the generalization bottleneck caused by limited native 3D data. On the one hand, videos capturing multiple views of an object or scene provide a spatial consistency prior for 3D generation. On the other hand, the rich semantic information contained within the videos enables the generated content to be more faithful to the text prompts and semantically plausible. This paper explores how to apply the video modality in 3D asset generation, spanning datasets to models. We introduce Droplet3D-4M, the first large-scale video dataset with multi-view level annotations, and train Droplet3D, a generative model supporting both image and dense text input. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to produce spatially consistent and semantically plausible content. Moreover, in contrast to the prevailing 3D solutions, our approach exhibits the potential for extension to scene-level applications. This indicates that the commonsense priors from the videos significantly facilitate 3D creation. We have open-sourced all resources including the dataset, code, technical framework, and model weights: https://dropletx.github.io/.
[108] Realistic and Controllable 3D Gaussian-Guided Object Editing for Driving Video Generation
Jiusi Li, Jackson Jiang, Jinyu Miao, Miao Long, Tuopu Wen, Peijin Jia, Shengxiang Liu, Chunlei Yu, Maolin Liu, Yuzhan Cai, Kun Jiang, Mengmeng Yang, Diange Yang
Main category: cs.CV
TL;DR: G^2Editor is a framework for photorealistic object editing in driving videos using 3D Gaussian representation for precise pose control and spatial consistency.
Details
Motivation: Collecting corner cases for autonomous driving training is costly and hazardous. Existing editing methods suffer from limited visual fidelity or imprecise pose control.
Method: Uses a 3D Gaussian representation as a dense prior injected into the denoising process, a scene-level 3D bounding box layout for occlusion handling, and hierarchical fine-grained features for appearance guidance.
Result: Outperforms existing methods in pose controllability and visual quality on Waymo Open Dataset, supporting object repositioning, insertion, and deletion.
Conclusion: G^2Editor provides effective photorealistic object editing for driving scenarios, benefiting downstream data-driven tasks with improved visual quality and precise control.
Abstract: Corner cases are crucial for training and validating autonomous driving systems, yet collecting them from the real world is often costly and hazardous. Editing objects within captured sensor data offers an effective alternative for generating diverse scenarios, commonly achieved through 3D Gaussian Splatting or image generative models. However, these approaches often suffer from limited visual fidelity or imprecise pose control. To address these issues, we propose G^2Editor, a framework designed for photorealistic and precise object editing in driving videos. Our method leverages a 3D Gaussian representation of the edited object as a dense prior, injected into the denoising process to ensure accurate pose control and spatial consistency. A scene-level 3D bounding box layout is employed to reconstruct occluded areas of non-target objects. Furthermore, to guide the appearance details of the edited object, we incorporate hierarchical fine-grained features as additional conditions during generation. Experiments on the Waymo Open Dataset demonstrate that G^2Editor effectively supports object repositioning, insertion, and deletion within a unified framework, outperforming existing methods in both pose controllability and visual quality, while also benefiting downstream data-driven tasks.
[109] Enhancing Corpus Callosum Segmentation in Fetal MRI via Pathology-Informed Domain Randomization
Marina Grifell i Plana, Vladyslav Zalevskyi, Léa Schmidt, Yvan Gomez, Thomas Sanchez, Vincent Dunet, Mériam Koob, Vanessa Siffredi, Meritxell Bach Cuadra
Main category: cs.CV
TL;DR: A pathology-informed domain randomization strategy that simulates CCD brain alterations from healthy data alone, enabling robust fetal brain segmentation without pathological annotations and improving biomarker accuracy.
Details
Motivation: Accurate fetal brain segmentation is crucial for neurodevelopment assessment, but rare conditions like corpus callosum dysgenesis (CCD) severely limit annotated data, hindering deep learning model generalization.
Method: Proposed a pathology-informed domain randomization strategy that embeds prior knowledge of CCD manifestations into a synthetic data generation pipeline, simulating diverse brain alterations from healthy data alone.
Result: Achieved substantial improvements on CCD cases while maintaining performance on healthy fetuses and other pathologies. Reduced LCC estimation error from 1.89mm to 0.80mm in healthy cases and from 10.9mm to 0.7mm in CCD cases. Improved topological consistency for shape-based analyses.
Conclusion: Incorporating domain-specific anatomical priors into synthetic data pipelines can effectively mitigate data scarcity and enhance analysis of rare but clinically significant malformations.
Abstract: Accurate fetal brain segmentation is crucial for extracting biomarkers and assessing neurodevelopment, especially in conditions such as corpus callosum dysgenesis (CCD), which can induce drastic anatomical changes. However, the rarity of CCD severely limits annotated data, hindering the generalization of deep learning models. To address this, we propose a pathology-informed domain randomization strategy that embeds prior knowledge of CCD manifestations into a synthetic data generation pipeline. By simulating diverse brain alterations from healthy data alone, our approach enables robust segmentation without requiring pathological annotations. We validate our method on a cohort comprising 248 healthy fetuses, 26 with CCD, and 47 with other brain pathologies, achieving substantial improvements on CCD cases while maintaining performance on both healthy fetuses and those with other pathologies. From the predicted segmentations, we derive clinically relevant biomarkers, such as corpus callosum length (LCC) and volume, and show their utility in distinguishing CCD subtypes. Our pathology-informed augmentation reduces the LCC estimation error from 1.89 mm to 0.80 mm in healthy cases and from 10.9 mm to 0.7 mm in CCD cases. Beyond these quantitative gains, our approach yields segmentations with improved topological consistency relative to available ground truth, enabling more reliable shape-based analyses. Overall, this work demonstrates that incorporating domain-specific anatomical priors into synthetic data pipelines can effectively mitigate data scarcity and enhance analysis of rare but clinically significant malformations.
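A toy version of the pathology-informed augmentation idea: truncate the corpus callosum (CC) label of a healthy segmentation to mimic a shortened or dysgenetic CC. The label id, axis convention, and truncation rule are hypothetical; the real pipeline embeds much richer anatomical priors.

```python
import numpy as np

def simulate_short_cc(label_map, cc_label, keep_frac=0.5, axis=0):
    """Truncate the CC segment of a healthy label map along one axis,
    keeping only a fraction of its extent (hypothetical CCD simulation)."""
    out = label_map.copy()
    idx = np.argwhere(label_map == cc_label)
    if idx.size == 0:
        return out
    lo, hi = idx[:, axis].min(), idx[:, axis].max()
    cut = lo + int(keep_frac * (hi - lo))            # keep one end only
    mask = (label_map == cc_label) & (np.indices(label_map.shape)[axis] > cut)
    out[mask] = 0                                    # removed tissue -> background
    return out

lab = np.zeros((64, 64, 64), np.int32); lab[20:44, 30:34, 30:34] = 7  # fake CC
aug = simulate_short_cc(lab, cc_label=7)
print((lab == 7).sum(), (aug == 7).sum())            # fewer CC voxels after
```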
[110] Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding
Yuan Xie, Tianshui Chen, Zheng Ge, Lionel Ni
Main category: cs.CV
TL;DR: Video-MTR is a reinforced multi-turn reasoning framework that iteratively selects key video segments and comprehends questions through multiple reasoning turns, outperforming existing methods in long-form video understanding.
Details
Motivation: Long-form video understanding faces challenges with long-range temporal dependencies and multiple events. Existing methods rely on static reasoning or external VLMs, leading to complexity and sub-optimal performance due to lack of end-to-end training.
Method: Proposes a reinforced multi-turn reasoning framework that performs iterative key video segment selection and question comprehension. Uses a novel gated bi-level reward system combining trajectory-level rewards (answer correctness) and turn-level rewards (frame-query relevance) for end-to-end training without external VLMs.
Result: Extensive experiments on VideoMME, MLVU, and EgoSchema benchmarks demonstrate that Video-MTR outperforms existing methods in both accuracy and efficiency.
Conclusion: Video-MTR advances the state-of-the-art in long video understanding by enabling iterative reasoning through multiple turns with a gated bi-level reward system, eliminating the need for external VLMs and allowing end-to-end optimization.
Abstract: Long-form video understanding, characterized by long-range temporal dependencies and multiple events, remains a challenge. Existing methods often rely on static reasoning or external visual-language models (VLMs), which face issues like complexity and sub-optimal performance due to the lack of end-to-end training. In this paper, we propose Video-MTR, a reinforced multi-turn reasoning framework designed to enable iterative key video segment selection and question comprehension. Unlike traditional video reasoning pipelines, which generate predictions in a single turn, Video-MTR performs reasoning in multiple turns, selecting video segments progressively based on the evolving understanding of previously processed segments and the current question. This iterative process allows for a more refined and contextually aware analysis of the video. To supervise the intermediate reasoning process, we introduce a novel gated bi-level reward system, combining trajectory-level rewards based on answer correctness and turn-level rewards emphasizing frame-query relevance. This system optimizes both video segment selection and question comprehension, eliminating the need for external VLMs and allowing end-to-end training. Extensive experiments on benchmarks like VideoMME, MLVU, and EgoSchema demonstrate that Video-MTR outperforms existing methods in both accuracy and efficiency, advancing the state-of-the-art in long video understanding.
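The general shape of a gated bi-level reward can be sketched in a few lines; the gating rule and weighting below are our guesses at that shape, not the paper's exact formula.

```python
def gated_bilevel_reward(answer_correct, frame_relevances, gate_thresh=0.5):
    """Toy gated bi-level reward: turn-level rewards score how relevant each
    selected segment is to the query, but (the "gate") they only count when
    the trajectory-level reward (final answer correctness) is earned."""
    trajectory_r = 1.0 if answer_correct else 0.0
    turn_r = sum(frame_relevances) / max(len(frame_relevances), 1)
    gate = 1.0 if trajectory_r > gate_thresh else 0.0
    return trajectory_r + gate * turn_r

print(gated_bilevel_reward(True, [0.8, 0.6, 0.9]))   # ~1.77
print(gated_bilevel_reward(False, [0.8, 0.6, 0.9]))  # 0.0: no credit for wrong answers
```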
[111] Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts
Zixuan Hu, Dongxiao Li, Xinzhu Ma, Shixiang Tang, Xiaotong Li, Wenhan Yang, Ling-Yu Duan
Main category: cs.CV
TL;DR: DUO is a test-time adaptation framework that jointly minimizes semantic and geometric uncertainties in monocular 3D object detection through dual-branch optimization and convex loss formulation.
Details
Motivation: Existing TTA methods fail to address the dual uncertainty (semantic and geometric) inherent in monocular 3D object detection, which deteriorates reliability under real-world domain shifts in autonomous driving applications.
Method: Proposes Dual Uncertainty Optimization (DUO) with: 1) convex optimization of focal loss with unsupervised uncertainty weighting, 2) semantic-aware normal field constraint for geometric coherence, and 3) dual-branch complementary learning between spatial perception and semantic classification.
Result: Extensive experiments show DUO outperforms existing methods across various datasets and domain shift types, demonstrating superior robustness and adaptation capability.
Conclusion: DUO effectively addresses both semantic and geometric uncertainties in M3OD through joint optimization, providing a comprehensive solution for reliable 3D object detection under domain shifts without requiring target labels.
Abstract: Accurate monocular 3D object detection (M3OD) is pivotal for safety-critical applications like autonomous driving, yet its reliability deteriorates significantly under real-world domain shifts caused by environmental or sensor variations. To address these shifts, Test-Time Adaptation (TTA) methods have emerged, enabling models to adapt to target distributions during inference. While prior TTA approaches recognize the positive correlation between low uncertainty and high generalization ability, they fail to address the dual uncertainty inherent to M3OD: semantic uncertainty (ambiguous class predictions) and geometric uncertainty (unstable spatial localization). To bridge this gap, we propose Dual Uncertainty Optimization (DUO), the first TTA framework designed to jointly minimize both uncertainties for robust M3OD. Through a convex optimization lens, we introduce an innovative convex structure of the focal loss and further derive a novel unsupervised version, enabling label-agnostic uncertainty weighting and balanced learning for high-uncertainty objects. In parallel, we design a semantic-aware normal field constraint that preserves geometric coherence in regions with clear semantic cues, reducing uncertainty from the unstable 3D representation. This dual-branch mechanism forms a complementary loop: enhanced spatial perception improves semantic classification, and robust semantic predictions further refine spatial understanding. Extensive experiments demonstrate the superiority of DUO over existing methods across various datasets and domain shift types.
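One plausible reading of the label-agnostic uncertainty weighting is sketched below: use the predicted max probability as a confidence proxy and weight a self-training objective in focal style. This is our interpretation, not the loss derived in the paper.

```python
import torch

def unsupervised_focal_objective(probs, gamma=2.0):
    """Label-agnostic sketch in the spirit of an unsupervised focal loss:
    weight each sample's entropy by (1 - p_max)^gamma so high-uncertainty
    predictions are emphasized during test-time adaptation (assumption)."""
    p_max = probs.max(dim=1).values
    weight = (1.0 - p_max) ** gamma
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)
    return (weight * entropy).mean()

probs = torch.softmax(torch.randn(16, 10), dim=1)
print(unsupervised_focal_objective(probs))
```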
[112] CaddieSet: A Golf Swing Dataset with Human Joint Features and Ball Information
Seunghyeon Jung, Seoyoung Hong, Jiwoo Jeong, Seungwon Jeong, Jaerim Choi, Hoki Kim, Woojin Lee
Main category: cs.CV
TL;DR: CaddieSet is a new golf dataset with joint information and ball trajectory data from swing videos, enabling interpretable analysis of swing posture effects on shot outcomes.
Details
Motivation: Existing deep learning studies haven't quantitatively established the relationship between swing posture and ball trajectory, limiting actionable insights for golfers.
Method: Created the CaddieSet dataset by extracting joint information from swing videos segmented into 8 phases using computer vision, and defined 15 key swing metrics based on expert domain knowledge.
Result: Demonstrated feasibility for predicting ball trajectories using various benchmarks, with interpretable models showing swing feedback quantitatively consistent with established golf knowledge.
Conclusion: CaddieSet provides new insights for golf swing analysis that can benefit both academic research and the sports industry.
Abstract: Recent advances in deep learning have led to more studies to enhance golfers’ shot precision. However, these existing studies have not quantitatively established the relationship between swing posture and ball trajectory, limiting their ability to provide golfers with the necessary insights for swing improvement. In this paper, we propose a new dataset called CaddieSet, which includes joint information and various ball information from a single shot. CaddieSet extracts joint information from a single swing video by segmenting it into eight swing phases using a computer vision-based approach. Furthermore, based on expert golf domain knowledge, we define 15 key metrics that influence a golf swing, enabling the interpretation of swing outcomes through swing-related features. Through experiments, we demonstrated the feasibility of CaddieSet for predicting ball trajectories using various benchmarks. In particular, we focus on interpretable models among several benchmarks and verify that swing feedback using our joint features is quantitatively consistent with established domain knowledge. This work is expected to offer new insight into golf swing analysis for both academia and the sports industry.
[113] IAENet: An Importance-Aware Ensemble Model for 3D Point Cloud-Based Anomaly Detection
Xuanming Cao, Chengyu Tao, Yifeng Cheng, Juan Du
Main category: cs.CV
TL;DR: IAENet is a novel ensemble framework that combines 2D and 3D experts for surface anomaly detection, using an Importance-Aware Fusion module to dynamically weight predictions and achieve state-of-the-art performance with lower false positives.
Details
Motivation: 3D point cloud-based anomaly detection lags behind 2D methods due to lack of powerful pretrained backbones, despite offering richer geometric information for industrial quality control.
Method: Proposes Importance-Aware Ensemble Network (IAENet) with a novel Importance-Aware Fusion module that dynamically assesses and reweights anomaly scores from 2D and 3D expert models using specially designed loss functions.
Result: Extensive experiments on MVTec 3D-AD show IAENet achieves new state-of-the-art performance with significantly lower false positive rate.
Conclusion: The framework successfully bridges the 2D-3D gap in anomaly detection and demonstrates practical value for industrial deployment through superior performance and reduced false positives.
Abstract: Surface anomaly detection is pivotal for ensuring product quality in industrial manufacturing. While 2D image-based methods have achieved remarkable success, 3D point cloud-based detection remains underexplored despite its richer geometric cues. We argue that the key bottleneck is the absence of powerful pretrained foundation backbones in 3D comparable to those in 2D. To bridge this gap, we propose Importance-Aware Ensemble Network (IAENet), an ensemble framework that synergizes a 2D pretrained expert with 3D expert models. However, naively fusing predictions from disparate sources is non-trivial: existing strategies can be affected by a poorly performing modality and thus degrade overall accuracy. To address this challenge, we introduce a novel Importance-Aware Fusion (IAF) module that dynamically assesses the contribution of each source and reweights their anomaly scores. Furthermore, we devise critical loss functions that explicitly guide the optimization of IAF, enabling it to not only combine the collective knowledge of the source experts but also preserve their unique strengths, thereby enhancing the overall performance of anomaly detection. Extensive experiments on MVTec 3D-AD demonstrate that our IAENet achieves a new state-of-the-art with a markedly lower false positive rate, underscoring its practical value for industrial deployment.
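The IAF module's internals are not given in this summary; the sketch below shows one plausible reading of "dynamically assesses the contribution of each source and reweights their anomaly scores" as a learned per-pixel convex combination. The 1x1-conv gate is an assumption:
```python
import torch
import torch.nn as nn

class ImportanceAwareFusion(nn.Module):
    """Illustrative fusion head: predict a per-pixel importance weight for
    each expert's anomaly map, then fuse them as a convex combination."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Conv2d(2, 2, kernel_size=1)  # stacked scores -> importance logits

    def forward(self, score_2d: torch.Tensor, score_3d: torch.Tensor) -> torch.Tensor:
        stacked = torch.stack([score_2d, score_3d], dim=1)   # (B, 2, H, W)
        weights = self.gate(stacked).softmax(dim=1)          # per-pixel convex weights
        return (weights * stacked).sum(dim=1)                # fused anomaly map

fusion = ImportanceAwareFusion()
print(fusion(torch.rand(4, 64, 64), torch.rand(4, 64, 64)).shape)  # (4, 64, 64)
```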
[114] Describe, Don’t Dictate: Semantic Image Editing with Natural Language Intent
En Ci, Shanyan Guan, Yanhao Ge, Yilin Zhang, Wei Li, Zhenyu Zhang, Jian Yang, Ying Tai
Main category: cs.CV
TL;DR: DescriptiveEdit is a new image editing framework that reframes instruction-based editing as reference-image-based text-to-image generation, using a Cross-Attentive UNet to inject reference image features without architectural changes.
Details
Motivation: Address limitations in semantic image editing where inversion-based methods have reconstruction errors and instruction-based models suffer from poor dataset quality and scale.
Method: Proposes DescriptiveEdit framework that uses reference image and prompt as input, with Cross-Attentive UNet adding attention bridges to inject reference image features into the text-to-image generation process.
Result: Experiments on Emu Edit benchmark show improved editing accuracy and consistency compared to existing methods.
Conclusion: The approach overcomes dataset limitations, integrates well with existing extensions like ControlNet and IP-Adapter, and offers better scalability while preserving generative power of text-to-image models.
Abstract: Despite the progress in text-to-image generation, semantic image editing remains a challenge. Inversion-based algorithms unavoidably introduce reconstruction errors, while instruction-based models mainly suffer from limited dataset quality and scale. To address these problems, we propose a descriptive-prompt-based editing framework, named DescriptiveEdit. The core idea is to re-frame 'instruction-based image editing' as 'reference-image-based text-to-image generation', which preserves the generative power of well-trained Text-to-Image models without architectural modifications or inversion. Specifically, taking the reference image and a prompt as input, we introduce a Cross-Attentive UNet, which newly adds attention bridges to inject reference image features into the prompt-to-edit-image generation process. Owing to its text-to-image nature, DescriptiveEdit overcomes limitations in instruction dataset quality, integrates seamlessly with ControlNet, IP-Adapter, and other extensions, and is more scalable. Experiments on the Emu Edit benchmark show it improves editing accuracy and consistency.
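The "attention bridge" idea, reference features injected into the generation path via cross-attention while the base path stays untouched, can be sketched as follows. The module name, dimensions, and residual form are assumptions; the paper's Cross-Attentive UNet wires such bridges into an existing UNet:
```python
import torch
import torch.nn as nn

class AttentionBridge(nn.Module):
    """Illustrative attention bridge: UNet hidden states (queries) attend to
    reference-image tokens (keys/values); a residual keeps the base path intact."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, hidden: torch.Tensor, ref_tokens: torch.Tensor) -> torch.Tensor:
        injected, _ = self.attn(query=hidden, key=ref_tokens, value=ref_tokens)
        return hidden + injected

bridge = AttentionBridge(dim=320)
h = torch.randn(2, 4096, 320)     # e.g., 64x64 latent tokens
ref = torch.randn(2, 4096, 320)   # encoded reference-image tokens
print(bridge(h, ref).shape)       # torch.Size([2, 4096, 320])
```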
[115] DCFS: Continual Test-Time Adaptation via Dual Consistency of Feature and Sample
Wenting Yin, Han Sun, Xinru Meng, Ningzhong Liu, Huiyu Zhou
Main category: cs.CV
TL;DR: DCFS is a novel CTTA framework that uses dual-path feature consistency and confidence-aware learning to address error accumulation in continual test-time adaptation without source data.
Details
Motivation: Current CTTA methods rely on pseudo-labels from model predictions, which suffer from quality issues and error accumulation. Without source data, models can develop biases from focusing only on target domain features.
Method: Proposes dual classifiers to disentangle semantic-related and domain-related features, maintains consistency between sub-features and whole features, and uses adaptive thresholds with confidence scores for weighted self-supervised learning.
Result: Extensive experiments on CIFAR10-C, CIFAR100-C, and ImageNet-C datasets demonstrate consistent performance improvements in continual test-time adaptation scenarios.
Conclusion: DCFS effectively reduces pseudo-label noise and alleviates error accumulation by comprehensively capturing data features from multiple perspectives through dual-path feature consistency and confidence-aware learning.
Abstract: Continual test-time adaptation aims to continuously adapt a pre-trained model to a stream of target domain data without accessing source data. Without access to source domain data, the model focuses solely on the feature characteristics of the target data. Relying exclusively on these features can lead to confusion and introduce learning biases. Currently, many existing methods generate pseudo-labels via model predictions. However, the quality of pseudo-labels cannot be guaranteed and the problem of error accumulation must be solved. To address these challenges, we propose DCFS, a novel CTTA framework that introduces dual-path feature consistency and confidence-aware sample learning. This framework disentangles the whole feature representation of the target data into semantic-related feature and domain-related feature using dual classifiers to learn distinct feature representations. By maintaining consistency between the sub-features and the whole feature, the model can comprehensively capture data features from multiple perspectives. Additionally, to ensure that the whole feature information of the target domain samples is not overlooked, we set an adaptive threshold and calculate a confidence score for each sample to carry out loss-weighted self-supervised learning, effectively reducing the noise of pseudo-labels and alleviating the problem of error accumulation. The efficacy of our proposed method is validated through extensive experimentation across various datasets, including CIFAR10-C, CIFAR100-C, and ImageNet-C, demonstrating consistent performance in continual test-time adaptation scenarios.
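The confidence-aware weighting can be illustrated with a short sketch. Note the paper derives an adaptive threshold; the fixed threshold below is a simplification, and all names are illustrative:
```python
import torch
import torch.nn.functional as F

def confidence_weighted_loss(logits: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Illustrative confidence-aware pseudo-label loss: only samples whose
    max probability clears the threshold contribute, weighted by their
    confidence score to damp pseudo-label noise."""
    probs = logits.softmax(dim=-1)
    conf, pseudo = probs.max(dim=-1)
    mask = (conf >= threshold).float()
    ce = F.cross_entropy(logits, pseudo, reduction="none")
    return (mask * conf * ce).sum() / mask.sum().clamp_min(1.0)

print(confidence_weighted_loss(torch.randn(32, 10)))
```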
[116] Adam SLAM - the last mile of camera calibration with 3DGS
Matthieu Gendrin, Stéphane Pateux, Xiaoran Jiang, Théo Ladune, Luce Morin
Main category: cs.CV
TL;DR: Using 3D Gaussian Splatting (3DGS) to fine-tune camera calibration through backpropagation of novel view color loss, achieving 0.4 dB PSNR improvement on benchmark datasets.
Details
Motivation: Camera calibration quality significantly impacts novel view synthesis performance, with even 1-pixel errors having substantial effects on reconstruction quality. Since real scenes lack ground truth calibration, view synthesis quality becomes the evaluation metric.
Method: Proposes using a 3DGS model to refine camera calibration by backpropagating novel view color loss with respect to camera parameters, effectively optimizing calibration through view synthesis performance.
Result: The method achieves an average improvement of 0.4 dB PSNR on the dataset used as reference by 3DGS, demonstrating significant calibration enhancement.
Conclusion: While the fine-tuning process can be time-consuming, the approach is particularly valuable for reference scenes like Mip-NeRF 360 where novel view quality is paramount, making the calibration refinement worthwhile despite computational costs.
Abstract: The quality of the camera calibration is of major importance for evaluating progress in novel view synthesis, as a 1-pixel error in the calibration has a significant impact on the reconstruction quality. While there is no ground truth for real scenes, the quality of the calibration is assessed by the quality of the novel view synthesis. This paper proposes to use a 3DGS model to fine-tune calibration by backpropagation of the novel view color loss with respect to the camera parameters. The new calibration alone brings an average improvement of 0.4 dB PSNR on the dataset used as reference by 3DGS. The fine-tuning may be long, and its suitability depends on how critical training time is; but for the calibration of reference scenes, such as Mip-NeRF 360, novel view quality is the most important consideration.
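Mechanically, the refinement treats camera parameters as learnable tensors and runs a standard Adam loop over the novel-view color loss. The stub below is a minimal sketch under that assumption; `render_views` stands in for a frozen, differentiable 3DGS renderer, which is where all the real work happens:
```python
import torch

pose_deltas = torch.nn.Parameter(torch.zeros(10, 6))   # 10 cameras, se(3) corrections
optimizer = torch.optim.Adam([pose_deltas], lr=1e-4)
scene = torch.rand(10, 3, 64, 64)                      # stand-in for the frozen 3DGS render
gt_views = torch.rand(10, 3, 64, 64)                   # captured ground-truth images

def render_views(deltas: torch.Tensor) -> torch.Tensor:
    # Stub renderer: the only property that matters for this sketch is that
    # the output is differentiable with respect to the camera corrections.
    return scene + deltas.sum(dim=1).view(10, 1, 1, 1) * 1e-3

for step in range(200):
    optimizer.zero_grad()
    loss = (render_views(pose_deltas) - gt_views).pow(2).mean()  # color loss
    loss.backward()    # gradients flow into the camera parameters
    optimizer.step()
print(loss.item())
```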
[117] Learning What is Worth Learning: Active and Sequential Domain Adaptation for Multi-modal Gross Tumor Volume Segmentation
Jingyun Yang, Guoqing Zhang, Jingge Wang, Yang Li
Main category: cs.CV
TL;DR: Proposes an active sequential domain adaptation framework for multi-modal medical image segmentation that dynamically selects the most informative samples to reduce annotation costs while maintaining high performance.
Details
Motivation: Medical image labeling is time-consuming and expensive. Existing active domain adaptation methods suffer from negative transfer and limited source data access, with no dedicated strategies for multi-modal medical data.
Method: Develops an active sequential domain adaptation framework with a query strategy that prioritizes samples based on both informativeness and representativeness for multi-modal medical data.
Result: Achieves superior segmentation performance on gross tumor volume segmentation tasks, significantly outperforming state-of-the-art ADA methods.
Conclusion: The proposed framework effectively reduces annotation costs while maintaining high segmentation accuracy, making it valuable for radiotherapy planning in nasopharyngeal carcinoma and glioblastoma.
Abstract: Accurate gross tumor volume segmentation on multi-modal medical data is critical for radiotherapy planning in nasopharyngeal carcinoma and glioblastoma. Recent advances in deep neural networks have brought promising results in medical image segmentation, leading to an increasing demand for labeled data. Since labeling medical images is time-consuming and labor-intensive, active learning has emerged as a solution to reduce annotation costs by selecting the most informative samples to label and adapting high-performance models with as few labeled samples as possible. Previous active domain adaptation (ADA) methods seek to minimize sample redundancy by selecting samples that are farthest from the source domain. However, such one-off selection can easily cause negative transfer, and access to source medical data is often limited. Moreover, the query strategy for multi-modal medical data remains unexplored. In this work, we propose an active and sequential domain adaptation framework for dynamic multi-modal sample selection in ADA. We derive a query strategy to prioritize labeling and training on the most valuable samples based on their informativeness and representativeness. Empirical validation on diverse gross tumor volume segmentation tasks demonstrates that our method achieves favorable segmentation performance, significantly outperforming state-of-the-art ADA methods. Code is available at the git repository: https://github.com/Hiyoochan/mmActS
[118] Enhancing Pseudo-Boxes via Data-Level LiDAR-Camera Fusion for Unsupervised 3D Object Detection
Mingqian Ji, Jian Yang, Shanshan Zhang
Main category: cs.CV
TL;DR: A novel data-level fusion framework for unsupervised 3D object detection that integrates RGB images and LiDAR data early in the process using vision foundation models, bi-directional fusion, and filtering methods to improve pseudo-box quality without manual annotations.
Details
Motivation: Existing LiDAR-based 3D object detectors require time-consuming manual annotations, and current unsupervised methods that simply fuse pseudo-boxes from LiDAR and RGB overlook the complementary nature of these modalities, providing limited improvements.
Method: Proposes data-level fusion using vision foundation models for instance segmentation and depth estimation. Introduces bi-directional fusion where real points get category labels from 2D space and 2D pixels are projected to 3D. Uses local radius filtering and global statistical filtering to mitigate noise. Implements data-level fusion based dynamic self-evolution strategy for iterative pseudo-box refinement.
Result: Achieves 28.4% mAP on nuScenes validation benchmark, significantly outperforming previous state-of-the-art unsupervised methods.
Conclusion: The proposed data-level fusion framework effectively leverages complementary LiDAR and RGB data through early integration and sophisticated filtering, demonstrating superior performance in unsupervised 3D object detection without manual annotations.
Abstract: Existing LiDAR-based 3D object detectors typically rely on manually annotated labels for training to achieve good performance. However, obtaining high-quality 3D labels is time-consuming and labor-intensive. To address this issue, recent works explore unsupervised 3D object detection by introducing RGB images as an auxiliary modal to assist pseudo-box generation. However, these methods simply integrate pseudo-boxes generated by LiDAR point clouds and RGB images. Yet, such a label-level fusion strategy brings limited improvements to the quality of pseudo-boxes, as it overlooks the complementary nature in terms of LiDAR and RGB image data. To overcome the above limitations, we propose a novel data-level fusion framework that integrates RGB images and LiDAR data at an early stage. Specifically, we utilize vision foundation models for instance segmentation and depth estimation on images and introduce a bi-directional fusion method, where real points acquire category labels from the 2D space, while 2D pixels are projected onto 3D to enhance real point density. To mitigate noise from depth and segmentation estimations, we propose a local and global filtering method, which applies local radius filtering to suppress depth estimation errors and global statistical filtering to remove segmentation-induced outliers. Furthermore, we propose a data-level fusion based dynamic self-evolution strategy, which iteratively refines pseudo-boxes under a dense representation, significantly improving localization accuracy. Extensive experiments on the nuScenes dataset demonstrate that the detector trained by our method significantly outperforms those trained by previous state-of-the-art methods, achieving 28.4% mAP on the nuScenes validation benchmark.
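The two filtering steps are classical point-cloud operations and can be sketched directly. The radii, neighbor counts, and the O(N^2) distance computation below are illustrative simplifications, not the paper's settings:
```python
import numpy as np

def local_radius_filter(points: np.ndarray, radius: float = 0.5, min_neighbors: int = 3) -> np.ndarray:
    """Drop points with too few neighbors within `radius`, suppressing
    isolated depth-estimation errors (brute-force for clarity)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbor_counts = (d < radius).sum(axis=1) - 1  # exclude the point itself
    return points[neighbor_counts >= min_neighbors]

def global_statistical_filter(points: np.ndarray, k: float = 2.0) -> np.ndarray:
    """Drop points farther than k standard deviations from the centroid,
    removing segmentation-induced outliers."""
    dist = np.linalg.norm(points - points.mean(axis=0), axis=1)
    return points[dist < dist.mean() + k * dist.std()]

cloud = np.random.randn(500, 3)
print(global_statistical_filter(local_radius_filter(cloud)).shape)
```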
[119] Digital Scale: Open-Source On-Device BMI Estimation from Smartphone Camera Images Trained on a Large-Scale Real-World Dataset
Frederik Rajiv Manichand, Robin Deuber, Robert Jakob, Steve Swerling, Jamie Rosen, Elgar Fleisch, Patrick Langer
Main category: cs.CV
TL;DR: Deep learning BMI estimation from smartphone images using large proprietary dataset (WayBED) with automatic filtering, achieving state-of-the-art results and mobile deployment.
Details
Motivation: Enable rapid weight assessment via camera images when traditional BMI measurement methods are unavailable or impractical, particularly in telehealth and emergency scenarios.
Method: Deep learning-based approach trained on WayBED dataset (84,963 images from 25,353 individuals) with automatic filtering using posture clustering and person detection to remove low-quality images. Deployed on Android via CLAID framework.
Result: Achieved MAPE of 7.9% on WayBED test set (lowest published), 13% on unseen VisualBodyToBMI dataset (comparable to SOTA), and 8.56% after fine-tuning on VisualBodyToBMI (lowest reported).
Conclusion: The method demonstrates robust generalization across datasets and achieves state-of-the-art performance in BMI estimation from images, with complete open-source release of code and mobile deployment package.
Abstract: Estimating Body Mass Index (BMI) from camera images with machine learning models enables rapid weight assessment when traditional methods are unavailable or impractical, such as in telehealth or emergency scenarios. Existing computer vision approaches have been limited to datasets of up to 14,500 images. In this study, we present a deep learning-based BMI estimation method trained on our WayBED dataset, a large proprietary collection of 84,963 smartphone images from 25,353 individuals. We introduce an automatic filtering method that uses posture clustering and person detection to curate the dataset by removing low-quality images, such as those with atypical postures or incomplete views. This process retained 71,322 high-quality images suitable for training. We achieve a Mean Absolute Percentage Error (MAPE) of 7.9% on our hold-out test set (WayBED data) using full-body images, the lowest value in the published literature to the best of our knowledge. Further, we achieve a MAPE of 13% on the completely unseen (during training) VisualBodyToBMI dataset, comparable with state-of-the-art approaches trained on it, demonstrating robust generalization. Lastly, we fine-tune our model on VisualBodyToBMI and achieve a MAPE of 8.56%, the lowest reported value on this dataset so far. We deploy the full pipeline, including image filtering and BMI estimation, on Android devices using the CLAID framework. We release our complete code for model training, filtering, and the CLAID package for mobile deployment as open-source contributions.
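For reference, the reported metric is the standard Mean Absolute Percentage Error; a small sketch:
```python
import numpy as np

def mape(y_true, y_pred) -> float:
    """Mean Absolute Percentage Error, as reported in the paper
    (e.g., 7.9% on the WayBED hold-out test set)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * float(np.mean(np.abs((y_true - y_pred) / y_true)))

print(mape([22.0, 30.5, 27.1], [23.1, 29.0, 28.0]))  # about 4.4
```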
[120] Domain Adaptation Techniques for Natural and Medical Image Classification
Ahmad Chaddad, Yihang Wu, Reem Kateb, Christian Desrosiers
Main category: cs.CV
TL;DR: Comprehensive study comparing 7 domain adaptation techniques across 5 natural and 8 medical image datasets, with DSAN algorithm showing superior performance particularly in medical applications like COVID-19 classification.
Details
Motivation: To better understand domain adaptation benefits for both natural and medical images, addressing performance bias in mainstream datasets and challenges with medical data.
Method: Conducted 557 simulation studies using 7 widely-used DA techniques for image classification across various scenarios including out-of-distribution, dynamic data streams, and limited training samples.
Result: DSAN algorithm achieved 91.2% accuracy on COVID-19 dataset using Resnet50 and showed +6.7% improvement in dynamic data stream scenarios compared to baseline, with remarkable explainability on medical datasets.
Conclusion: The study provides valuable insights into effective model adaptation for medical data and contributes to understanding DA techniques, with DSAN demonstrating outstanding performance and applicability in medical imaging.
Abstract: Domain adaptation (DA) techniques have the potential in machine learning to alleviate distribution differences between training and test sets by leveraging information from source domains. In image classification, most advances in DA have been made using natural images rather than medical data, which are harder to work with. Moreover, even for natural images, the use of mainstream datasets can lead to performance bias. With the aim of better understanding the benefits of DA for both natural and medical images, this study performs 557 simulation studies using seven widely-used DA techniques for image classification in five natural and eight medical datasets that cover various scenarios, such as out-of-distribution, dynamic data streams, and limited training samples. Our experiments yield detailed results and insightful observations highlighting the performance and medical applicability of these techniques. Notably, our results have shown the outstanding performance of the Deep Subdomain Adaptation Network (DSAN) algorithm. This algorithm achieved feasible classification accuracy (91.2%) in the COVID-19 dataset using Resnet50 and showed an important accuracy improvement in the dynamic data stream DA scenario (+6.7%) compared to the baseline. Our results also demonstrate that DSAN exhibits a remarkable level of explainability when evaluated on COVID-19 and skin cancer datasets. These results contribute to the understanding of DA techniques and offer valuable insight into the effective adaptation of models to medical data.
[121] Contrastive Learning through Auxiliary Branch for Video Object Detection
Lucas Rakotoarivony
Main category: cs.CV
TL;DR: CLAB method uses contrastive learning with auxiliary branch and dynamic loss weighting to improve video object detection without increasing computational cost during inference.
Details
Motivation: Video object detection faces challenges from image deterioration like motion blur and occlusion. Existing methods improve performance but add computational complexity. Need robust detection without extra inference cost.
Method: Contrastive Learning through Auxiliary Branch (CLAB) with contrastive loss to enhance backbone features. Dynamic loss weighting that prioritizes auxiliary learning early, then shifts to detection task as training converges.
Result: Achieves 84.0% mAP with ResNet-101 and 85.2% mAP with ResNeXt-101 on ImageNet VID dataset, setting state-of-the-art for CNN-based models without post-processing.
Conclusion: CLAB effectively improves video object detection robustness to image degradation without additional computational overhead during inference, demonstrating consistent performance gains.
Abstract: Video object detection is a challenging task because videos often suffer from image deterioration such as motion blur, occlusion, and deformable shapes, making it significantly more difficult than detecting objects in still images. Prior approaches have improved video object detection performance by employing feature aggregation and complex post-processing techniques, though at the cost of increased computational demands. To improve robustness to image degradation without additional computational load during inference, we introduce a straightforward yet effective Contrastive Learning through Auxiliary Branch (CLAB) method. First, we implement a contrastive auxiliary branch using a contrastive loss to enhance the feature representation capability of the video object detector’s backbone. Next, we propose a dynamic loss weighting strategy that emphasizes auxiliary feature learning early in training while gradually prioritizing the detection task as training converges. We validate our approach through comprehensive experiments and ablation studies, demonstrating consistent performance gains. Without bells and whistles, CLAB reaches a performance of 84.0% mAP and 85.2% mAP with ResNet-101 and ResNeXt-101, respectively, on the ImageNet VID dataset, thus achieving state-of-the-art performance for CNN-based models without requiring additional post-processing methods.
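The dynamic loss weighting can be illustrated with a simple schedule; the cosine form below is an assumption (the paper only states that auxiliary learning is emphasized early and the detection task takes over as training converges):
```python
import math

def auxiliary_weight(step: int, total_steps: int) -> float:
    """Illustrative cosine schedule decaying from 1 to 0: strong auxiliary
    contrastive learning early, pure detection objective at the end."""
    return 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

# total_loss = detection_loss + auxiliary_weight(step, T) * contrastive_loss
for step in (0, 5000, 10000):
    print(step, round(auxiliary_weight(step, 10000), 3))  # 1.0, 0.5, 0.0
```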
[122] Towards Mechanistic Defenses Against Typographic Attacks in CLIP
Lorenz Hufe, Constantin Venhoff, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek
Main category: cs.CV
TL;DR: Analysis of CLIP vision encoders under typographic attacks reveals specialized attention heads that transmit text information. A training-free defense method selectively ablates these heads, improving robustness by up to 19.6% on typographic ImageNet-100 with minimal standard accuracy loss.
Details
Motivation: Typographic attacks exploit multi-modal systems by injecting text into images, causing targeted misclassifications, malicious content generation, and VLM jailbreaks. Understanding and defending against these attacks is crucial for safety-critical applications.
Method: Analyze CLIP vision encoders to locate specialized attention heads that extract typographic information. Introduce a training-free defense by selectively ablating a typographic circuit consisting of these attention heads.
Result: Method improves performance by up to 19.6% on typographic ImageNet-100 variant while reducing standard ImageNet-100 accuracy by less than 1%. Competitive with state-of-the-art finetuning-based defenses. Releases family of dyslexic CLIP models as drop-in replacements.
Conclusion: The proposed training-free ablation approach effectively defends CLIP models against typographic attacks, providing robust drop-in replacements for safety-critical applications where text-based manipulation risks outweigh text recognition utility.
Abstract: Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model’s layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit, consisting of attention heads. Without requiring finetuning, our method improves performance by up to 19.6% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
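The training-free defense amounts to zeroing the contribution of specific attention heads at inference time. A forward hook like the one below approximates this by masking the channel groups belonging to those heads; it is a sketch under the assumption of a concatenated-head layout, and exact per-head ablation would require intervening before the output projection. The layer path in the usage comment is hypothetical:
```python
import torch

def make_head_ablation_hook(head_ids, num_heads: int):
    """Zero the output channels of the listed heads (assumes the tensor
    layout is [batch, tokens, num_heads * head_dim])."""
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        B, L, D = out.shape
        heads = out.view(B, L, num_heads, D // num_heads).clone()
        heads[:, :, head_ids, :] = 0.0          # ablate the 'typographic' heads
        ablated = heads.view(B, L, D)
        return (ablated,) + output[1:] if isinstance(output, tuple) else ablated
    return hook

# Hypothetical usage; actual CLIP module paths vary by implementation:
# blk = clip_model.visual.transformer.resblocks[9].attn
# blk.register_forward_hook(make_head_ablation_hook([3, 7], num_heads=12))
```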
[123] GLaRE: A Graph-based Landmark Region Embedding Network for Emotion Recognition
Debasis Maji, Debaditya Barman
Main category: cs.CV
TL;DR: GLaRE: Graph-based Landmark Region Embedding network for facial expression recognition using 3D facial landmarks and hierarchical coarsening to achieve state-of-the-art performance on AffectNet and FERG datasets.
Details
Motivation: Traditional FER systems face challenges with occlusion, expression variability, and lack of interpretability. GNNs offer structured and interpretable learning by modeling relational dependencies between facial landmarks.
Method: Extract facial landmarks using 3D facial alignment, construct quotient graph via hierarchical coarsening to preserve spatial structure while reducing complexity, and use region-level embeddings for emotion recognition.
Result: Achieves 64.89% accuracy on AffectNet and 94.24% on FERG, outperforming existing baselines. Ablation studies show region-level embeddings from quotient graphs improve prediction performance.
Conclusion: GLaRE demonstrates that graph-based approaches with hierarchical coarsening and region-level embeddings effectively address FER challenges, providing both high performance and interpretability.
Abstract: Facial expression recognition (FER) is a crucial task in computer vision with a wide range of applications including human computer interaction, surveillance, and assistive technologies. However, challenges such as occlusion, expression variability, and lack of interpretability hinder the performance of traditional FER systems. Graph Neural Networks (GNNs) offer a powerful alternative by modeling relational dependencies between facial landmarks, enabling structured and interpretable learning. In this paper, we propose GLaRE, a novel Graph-based Landmark Region Embedding network for emotion recognition. Facial landmarks are extracted using 3D facial alignment, and a quotient graph is constructed via hierarchical coarsening to preserve spatial structure while reducing complexity. Our method achieves 64.89% accuracy on AffectNet and 94.24% on FERG, outperforming several existing baselines. Additionally, ablation studies demonstrate that region-level embeddings from quotient graphs contribute to improved prediction performance.
[124] FastFit: Accelerating Multi-Reference Virtual Try-On via Cacheable Diffusion Models
Zheng Chong, Yanwei Lei, Shiyue Zhang, Zhuandi He, Zhen Wang, Xujie Zhang, Xiao Dong, Yiling Wu, Dongmei Jiang, Xiaodan Liang
Main category: cs.CV
TL;DR: FastFit is a high-speed multi-reference virtual try-on framework that uses cacheable diffusion architecture to achieve 3.5x speedup over existing methods while supporting complex outfit compositions.
Details
Motivation: Current virtual try-on methods cannot support multi-reference outfit compositions (garments and accessories) and suffer from significant inefficiency due to redundant re-computation of reference features in each denoising step.
Method: Proposes FastFit with Semi-Attention mechanism and class embeddings instead of timestep embeddings for reference items, fully decoupling reference feature encoding from denoising process. Also introduces DressCode-MR dataset with 28,179 sets of high-quality paired images across 5 categories.
Result: Achieves 3.5x speedup over comparable methods with negligible parameter overhead. Surpasses state-of-the-art methods on key fidelity metrics across VITON-HD, DressCode, and DressCode-MR datasets.
Conclusion: FastFit breaks the efficiency bottleneck of virtual try-on systems while supporting complex multi-reference compositions, making it suitable for real-world applications.
Abstract: Despite its great potential, virtual try-on technology is hindered from real-world application by two major challenges: the inability of current methods to support multi-reference outfit compositions (including garments and accessories), and their significant inefficiency caused by the redundant re-computation of reference features in each denoising step. To address these challenges, we propose FastFit, a high-speed multi-reference virtual try-on framework based on a novel cacheable diffusion architecture. By employing a Semi-Attention mechanism and substituting traditional timestep embeddings with class embeddings for reference items, our model fully decouples reference feature encoding from the denoising process with negligible parameter overhead. This allows reference features to be computed only once and losslessly reused across all steps, fundamentally breaking the efficiency bottleneck and achieving an average 3.5x speedup over comparable methods. Furthermore, to facilitate research on complex, multi-reference virtual try-on, we introduce DressCode-MR, a new large-scale dataset. It comprises 28,179 sets of high-quality, paired images covering five key categories (tops, bottoms, dresses, shoes, and bags), constructed through a pipeline of expert models and human feedback refinement. Extensive experiments on the VITON-HD, DressCode, and our DressCode-MR datasets show that FastFit surpasses state-of-the-art methods on key fidelity metrics while offering a significant advantage in inference efficiency.
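The caching idea is simple to state in code: since reference encoding no longer depends on the timestep, it runs once before the denoising loop. The sketch below uses stand-in callables; the real UNet, encoder, and update rule are the paper's, not shown here:
```python
import torch

@torch.no_grad()
def sample(unet, ref_encoder, ref_images, latents, timesteps):
    ref_feats = ref_encoder(ref_images)           # encoded exactly once, then cached
    for t in timesteps:
        noise_pred = unet(latents, t, ref_feats)  # features losslessly reused per step
        latents = latents - 0.1 * noise_pred      # placeholder update rule
    return latents

# Toy usage with stand-in modules:
out = sample(lambda x, t, r: 0.01 * x + 0.0 * r.mean(),
             lambda imgs: imgs.mean(dim=0, keepdim=True),
             torch.rand(3, 8), torch.randn(1, 8), range(50))
print(out.shape)  # torch.Size([1, 8])
```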
[125] UTA-Sign: Unsupervised Thermal Video Augmentation via Event-Assisted Traffic Signage Sketching
Yuqi Han, Songqian Zhang, Weijian Su, Ke Li, Jiayu Yang, Jinli Suo, Qiang Zhang
Main category: cs.CV
TL;DR: UTA-Sign: Unsupervised thermal-event fusion for traffic signage in low-light conditions, combining thermal cameras and event cameras to overcome limitations of each modality for improved autonomous driving safety.
Details
Motivation: Thermal cameras struggle with signage detection on similar materials, while event cameras have non-uniform sampling. Their complementary characteristics can enhance traffic signage perception in low-light autonomous driving scenarios.
Method: Dual-boosting mechanism that fuses thermal frames and event signals. Uses thermal frames for motion cues and temporal alignment of event signals, while event signals add subtle signage details to thermal frames.
Result: Validated on real-world datasets, showing superior traffic signage sketching quality and improved detection accuracy at the perceptual level.
Conclusion: The proposed UTA-Sign framework effectively addresses signage blind spots in thermal imaging and sampling issues in event cameras, providing consistent signage representation for safer autonomous navigation in low-light conditions.
Abstract: The thermal camera excels at perceiving outdoor environments under low-light conditions, making it ideal for applications such as nighttime autonomous driving and unmanned navigation. However, thermal cameras encounter challenges when capturing signage from objects made of similar materials, which can pose safety risks for accurately understanding semantics in autonomous driving systems. In contrast, the neuromorphic vision camera, also known as an event camera, detects changes in light intensity asynchronously and has proven effective in high-speed, low-light traffic environments. Recognizing the complementary characteristics of these two modalities, this paper proposes UTA-Sign, an unsupervised thermal-event video augmentation for traffic signage in low-illumination environments, targeting elements such as license plates and roadblock indicators. To address the signage blind spots of thermal imaging and the non-uniform sampling of event cameras, we developed a dual-boosting mechanism that fuses thermal frames and event signals for consistent signage representation over time. The proposed method utilizes thermal frames to provide accurate motion cues as temporal references for aligning the uneven event signals. At the same time, event signals contribute subtle signage content to the raw thermal frames, enhancing the overall understanding of the environment. The proposed method is validated on datasets collected from real-world scenarios, demonstrating superior quality in traffic signage sketching and improved detection accuracy at the perceptual level.
[126] Disruptive Attacks on Face Swapping via Low-Frequency Perceptual Perturbations
Mengxiao Huang, Minglei Shu, Shuwang Zhou, Zhaoyang Liu
Main category: cs.CV
TL;DR: Proposes an active defense method using low-frequency perceptual perturbations to disrupt face swapping deepfakes, combining frequency and spatial domain features to reduce manipulation effectiveness while preserving visual quality.
Details
Motivation: Existing deepfake detection methods are passive and focus on post-event analysis, lacking preventive measures against face swapping attacks that threaten privacy and societal security.
Method: Uses low-frequency perceptual perturbations to target the generative process directly. Combines frequency and spatial domain features with discrete wavelet transform (DWT) to extract low-frequency components. Features encoder, perturbation generator, and decoder architecture to introduce artifacts while preserving high-frequency details.
Result: Experiments on CelebA-HQ and LFW datasets show significant reductions in face-swapping effectiveness, improved defense success rates, and maintained visual quality of protected images.
Conclusion: The active defense method effectively disrupts deepfake generation processes while ensuring output remains visually plausible, providing a proactive approach to counter face swapping manipulation.
Abstract: Deepfake technology, driven by Generative Adversarial Networks (GANs), poses significant risks to privacy and societal security. Existing detection methods are predominantly passive, focusing on post-event analysis without preventing attacks. To address this, we propose an active defense method based on low-frequency perceptual perturbations to disrupt face swapping manipulation, reducing the performance and naturalness of generated content. Unlike prior approaches that used low-frequency perturbations to impact classification accuracy, our method directly targets the generative process of deepfake techniques. We combine frequency and spatial domain features to strengthen defenses. By introducing artifacts through low-frequency perturbations while preserving high-frequency details, we ensure the output remains visually plausible. Additionally, we design a complete architecture featuring an encoder, a perturbation generator, and a decoder, leveraging discrete wavelet transform (DWT) to extract low-frequency components and generate perturbations that disrupt facial manipulation models. Experiments on CelebA-HQ and LFW demonstrate significant reductions in face-swapping effectiveness, improved defense success rates, and preservation of visual quality.
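The frequency-domain side of the method can be sketched with PyWavelets: decompose, perturb only the approximation (low-frequency) band, reconstruct. The noise model and magnitude below are illustrative, not the paper's learned perturbation generator:
```python
import numpy as np
import pywt

def perturb_low_frequency(img: np.ndarray, eps: float = 0.02, seed: int = 0) -> np.ndarray:
    """Add bounded noise to the DWT approximation band only, leaving the
    high-frequency detail bands (and thus fine visual detail) untouched."""
    rng = np.random.default_rng(seed)
    cA, (cH, cV, cD) = pywt.dwt2(img, "haar")
    cA = cA + rng.uniform(-eps, eps, cA.shape) * np.abs(cA).mean()
    return pywt.idwt2((cA, (cH, cV, cD)), "haar")

img = np.random.rand(128, 128)
print(np.abs(perturb_low_frequency(img) - img).max())
```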
[127] Embracing Aleatoric Uncertainty: Generating Diverse 3D Human Motion
Zheng Qin, Yabing Wang, Minghui Yang, Sanping Zhou, Ming Yang, Le Wang
Main category: cs.CV
TL;DR: Diverse-T2M introduces uncertainty modeling and stochastic sampling to generate diverse 3D human motions from text while maintaining semantic consistency.
Details
Motivation: Current text-to-motion generation methods struggle with achieving diversity in generated motions while maintaining text-motion consistency.
Method: Introduces uncertainty via noise signals as diversity carriers in transformer-based methods, creates continuous text representation latent space, and integrates latent space sampler for stochastic sampling.
Result: Significantly enhances motion diversity while maintaining state-of-the-art text consistency performance on HumanML3D and KIT-ML benchmarks.
Conclusion: The proposed method successfully addresses the diversity challenge in text-to-motion generation through explicit uncertainty modeling and stochastic sampling approaches.
Abstract: Generating 3D human motions from text is a challenging yet valuable task. The key aspects of this task are ensuring text-motion consistency and achieving generation diversity. Although recent advancements have enabled the generation of precise and high-quality human motions from text, achieving diversity in the generated motions remains a significant challenge. In this paper, we aim to overcome the above challenge by designing a simple yet effective text-to-motion generation method, i.e., Diverse-T2M. Our method introduces uncertainty into the generation process, enabling the generation of highly diverse motions while preserving the semantic consistency of the text. Specifically, we propose a novel perspective that utilizes noise signals as carriers of diversity information in transformer-based methods, facilitating an explicit modeling of uncertainty. Moreover, we construct a latent space where text is projected into a continuous representation, instead of a rigid one-to-one mapping, and integrate a latent space sampler to introduce stochastic sampling into the generation process, thereby enhancing the diversity and uncertainty of the outputs. Our results on text-to-motion generation benchmark datasets (HumanML3D and KIT-ML) demonstrate that our method significantly enhances diversity while maintaining state-of-the-art performance in text consistency.
[128] Optimization-Based Calibration for Intravascular Ultrasound Volume Reconstruction
Karl-Philippe Beaudet, Sidaty El Hadramy, Philippe C Cattin, Juan Verde, Stéphane Cotin
Main category: cs.CV
TL;DR: Optimization-based calibration method using 3D-printed phantom for accurate 3D IVUS volume reconstruction to bridge preoperative CT and intraoperative ultrasound in liver surgery.
Details
Motivation: Intraoperative ultrasound images are challenging to interpret due to limited field of view and complex anatomy. Bridging preoperative CT and intraoperative data is crucial for effective surgical guidance in liver surgery.
Method: Proposed an optimization-based calibration method using a 3D-printed phantom for accurate 3D Intravascular Ultrasound (IVUS) volume reconstruction. Ensures precise alignment of tracked IVUS data with preoperative CT images.
Result: Validated with in vivo swine liver images, achieving calibration error of 0.88-1.80 mm and registration error of 3.40-5.71 mm between 3D IVUS data and corresponding CT scans.
Conclusion: Method provides reliable and accurate calibration and volume reconstruction for registering intraoperative ultrasound with preoperative CT images, enhancing intraoperative guidance in liver surgery.
Abstract: Intraoperative ultrasound images are inherently challenging to interpret in liver surgery due to the limited field of view and complex anatomical structures. Bridging the gap between preoperative and intraoperative data is crucial for effective surgical guidance. 3D IntraVascular UltraSound (IVUS) offers a potential solution by enabling the reconstruction of the entire organ, which facilitates registration between preoperative computed tomography (CT) scans and intraoperative IVUS images. In this work, we propose an optimization-based calibration method using a 3D-printed phantom for accurate 3D Intravascular Ultrasound volume reconstruction. Our approach ensures precise alignment of tracked IVUS data with preoperative CT images, improving intraoperative navigation. We validated our method using in vivo swine liver images, achieving a calibration error from 0.88 to 1.80 mm and a registration error from 3.40 to 5.71 mm between the 3D IVUS data and the corresponding CT scan. Our method provides a reliable and accurate means of calibration and volume reconstruction. It can be used to register intraoperative ultrasound images with preoperative CT images in the context of liver surgery, and enhance intraoperative guidance.
[129] Physics Informed Generative Models for Magnetic Field Images
Aye Phyu Phyu Aung, Lucas Lum, Zhansen Shi, Wen Qiu, Bernice Zee, JM Chin, Yeow Kheng Lim, J. Senthilnath
Main category: cs.CV
TL;DR: Proposes PI-GenMFI, a physics-informed diffusion model to generate synthetic Magnetic Field Images (MFI) for semiconductor defect detection, addressing data scarcity issues in training ML models.
Details
Motivation: Limited availability of MFI datasets due to proprietary concerns creates a bottleneck for training machine learning models in semiconductor defect localization, while MFI offers more efficient ROI localization than traditional X-ray scanning.
Method: Physics Informed Generative Models for Magnetic Field Images (PI-GenMFI) using diffusion models with two physical constraints to generate synthetic MFI samples for power short defects, integrating specific physical information.
Result: The model shows promising results in qualitative and quantitative evaluations using various image generation and signal processing metrics, outperforming state-of-the-art generative models from VAE and diffusion methods in domain expert evaluation.
Conclusion: PI-GenMFI provides an effective solution to generate synthetic MFI training data, enabling efficient ML-based defect localization in semiconductor manufacturing while overcoming data scarcity challenges.
Abstract: In semiconductor manufacturing, defect detection and localization are critical to ensuring product quality and yield. While X-ray imaging is a reliable non-destructive testing method, it is memory-intensive and time-consuming for large-scale scanning. Magnetic Field Imaging (MFI) offers a more efficient means to localize regions of interest (ROI) for targeted X-ray scanning. However, the limited availability of MFI datasets due to proprietary concerns presents a significant bottleneck for training machine learning (ML) models using MFI. To address this challenge, we consider an ML-driven approach leveraging diffusion models with two physical constraints. We propose Physics Informed Generative Models for Magnetic Field Images (PI-GenMFI) to generate synthetic MFI samples by integrating specific physical information. We generate MFI images for the most common defect type: power shorts. These synthetic images will serve as training data for ML algorithms designed to localize defect areas efficiently. To evaluate generated MFIs, we compare our model to SOTA generative models from both variational autoencoder (VAE) and diffusion methods. We present a domain expert evaluation to assess the generated samples. In addition, we present qualitative and quantitative evaluation using various metrics used for image generation and signal processing, showing promising results to optimize the defect localization process.
[130] Improving Alignment in LVLMs with Debiased Self-Judgment
Sihan Yang, Chenhang Cui, Zihao Zhao, Yiyang Zhou, Weilong Yan, Ying Wei, Huaxiu Yao
Main category: cs.CV
TL;DR: A novel self-evaluation approach for visual-language models that generates debiased self-judgment scores internally to improve modality alignment, reduce hallucinations, and enhance safety without external resources.
Details
Motivation: Current alignment methods for LVLMs rely on external datasets, human annotations, or complex post-processing, which limit scalability and increase costs while still struggling with hallucinations and safety concerns.
Method: Proposes generating debiased self-judgment scores internally by the model itself, enabling autonomous improvement of alignment through enhanced decoding strategies and preference tuning processes without external resources.
Result: Empirical results show significant outperformance over traditional methods, with reduced hallucinations, enhanced safety, and improved overall capability in visual-linguistic modality alignment.
Conclusion: The approach offers a more effective and scalable solution for aligning LVLMs by enabling models to self-evaluate and improve alignment autonomously, addressing key challenges of hallucinations and safety concerns.
Abstract: The rapid advancements in Large Language Models (LLMs) and Large Visual-Language Models (LVLMs) have opened up new opportunities for integrating visual and linguistic modalities. However, effectively aligning these modalities remains challenging, often leading to hallucinations, where generated outputs are not grounded in the visual input, and raising safety concerns across various domains. Existing alignment methods, such as instruction tuning and preference tuning, often rely on external datasets, human annotations, or complex post-processing, which limit scalability and increase costs. To address these challenges, we propose a novel approach that generates the debiased self-judgment score, a self-evaluation metric created internally by the model without relying on external resources. This enables the model to autonomously improve alignment. Our method enhances both decoding strategies and preference tuning processes, resulting in reduced hallucinations, enhanced safety, and improved overall capability. Empirical results show that our approach significantly outperforms traditional methods, offering a more effective solution for aligning LVLMs.
[131] Revisiting the Privacy Risks of Split Inference: A GAN-Based Data Reconstruction Attack via Progressive Feature Optimization
Yixiang Qiu, Yanhan Liu, Hongyao Yu, Hao Fang, Bin Chen, Shu-Tao Xia, Ke Xu
Main category: cs.CV
TL;DR: A novel GAN-based data reconstruction attack framework with progressive feature optimization that significantly outperforms existing attacks, especially on deep neural networks and high-resolution scenarios.
Details
Motivation: Existing data reconstruction attacks in split inference are limited to shallow models and fail to leverage semantic priors effectively, leaving privacy vulnerabilities in deeper DNNs unaddressed.
Method: Proposes a GAN-based framework with Progressive Feature Optimization (PFO) that decomposes the generator into hierarchical blocks and incrementally refines intermediate representations, using L1-ball constraints to stabilize optimization.
Result: Extensive experiments show the method outperforms prior attacks by a large margin in high-resolution scenarios, out-of-distribution settings, and against deeper/more complex DNNs.
Conclusion: The proposed framework demonstrates significantly improved reconstruction quality and generalizability across datasets and model architectures, revealing greater privacy risks in split inference systems.
Abstract: The growing complexity of Deep Neural Networks (DNNs) has led to the adoption of Split Inference (SI), a collaborative paradigm that partitions computation between edge devices and the cloud to reduce latency and protect user privacy. However, recent advances in Data Reconstruction Attacks (DRAs) reveal that intermediate features exchanged in SI can be exploited to recover sensitive input data, posing significant privacy risks. Existing DRAs are typically effective only on shallow models and fail to fully leverage semantic priors, limiting their reconstruction quality and generalizability across datasets and model architectures. In this paper, we propose a novel GAN-based DRA framework with Progressive Feature Optimization (PFO), which decomposes the generator into hierarchical blocks and incrementally refines intermediate representations to enhance the semantic fidelity of reconstructed images. To stabilize the optimization and improve image realism, we introduce an L1-ball constraint during reconstruction. Extensive experiments show that our method outperforms prior attacks by a large margin, especially in high-resolution scenarios, out-of-distribution settings, and against deeper and more complex DNNs.
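The L1-ball constraint mentioned in the abstract corresponds to a standard Euclidean projection (Duchi et al., 2008); a sketch follows, with the radius and its point of application chosen for illustration rather than taken from the paper:
```python
import torch

def project_l1_ball(x: torch.Tensor, radius: float = 1.0) -> torch.Tensor:
    """Project x onto {v : ||v||_1 <= radius}; applied to intermediate
    updates, this kind of constraint keeps the reconstruction stable."""
    v = x.abs().flatten()
    if v.sum() <= radius:
        return x
    u, _ = torch.sort(v, descending=True)
    cssv = torch.cumsum(u, dim=0) - radius
    ks = torch.arange(1, u.numel() + 1, device=x.device)
    rho = (u * ks > cssv).nonzero().max()
    theta = cssv[rho] / (rho + 1).float()
    return torch.sign(x) * torch.clamp(x.abs() - theta, min=0.0)

z = torch.randn(64)
print(project_l1_ball(z, radius=5.0).abs().sum())  # <= 5.0
```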
[132] MobileCLIP2: Improving Multi-Modal Reinforced Training
Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander Toshev, Oncel Tuzel, Hadi Pouransari
Main category: cs.CV
TL;DR: MobileCLIP2 improves upon MobileCLIP with better teacher ensembles and captioner fine-tuning, achieving state-of-the-art zero-shot accuracy at low latencies with 2.2% ImageNet-1k improvement.
Details
Motivation: To enhance the multi-modal reinforced training of MobileCLIP for better zero-shot capabilities while maintaining low latency and small model size.
Method: Improved CLIP teacher ensembles trained on DFN dataset, enhanced captioner teachers trained on the DFN dataset and fine-tuned on diverse high-quality datasets, and novel insights on temperature tuning and multi-model synthetic caption combination.
Result: MobileCLIP2 achieves SOTA ImageNet-1k zero-shot accuracy at low latencies, with MobileCLIP2-B showing 2.2% improvement over MobileCLIP-B, and MobileCLIP2-S4 matching SigLIP-SO400M/14 accuracy while being 2x smaller.
Conclusion: The improved training methodology enables more efficient and accurate mobile-friendly image-text models, with released pretrained models and scalable data generation code for reproducible research.
Abstract: Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency and light architectures and a novel multi-modal reinforced training that made knowledge distillation from multiple caption-generators and CLIP teachers efficient, scalable, and reproducible. In this paper, we improve the multi-modal reinforced training of MobileCLIP through: 1) better CLIP teacher ensembles trained on the DFN dataset, 2) improved captioner teachers trained on the DFN dataset and fine-tuned on a diverse selection of high-quality image-caption datasets. We discover new insights through ablations such as the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models. We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracies at low latencies. In particular, we observe 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared with MobileCLIP-B architecture. Notably, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2x smaller and improves on DFN ViT-L/14 at 2.5x lower latency. We release our pretrained models (https://github.com/apple/ml-mobileclip) and the data generation code (https://github.com/apple/ml-mobileclip-dr). The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing.
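The temperature-tuning ablation concerns the standard softened-distribution distillation objective; below is a generic sketch of that textbook loss, not MobileCLIP2's exact multi-teacher formulation:
```python
import torch
import torch.nn.functional as F

def contrastive_kd_loss(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        tau: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student
    image-text similarity distributions; tau controls how much of the
    teacher's 'dark knowledge' is exposed to the student."""
    t = (teacher_logits / tau).softmax(dim=-1)
    s = (student_logits / tau).log_softmax(dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau * tau

print(contrastive_kd_loss(torch.randn(8, 8), torch.randn(8, 8)))
```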
[133] EmoCAST: Emotional Talking Portrait via Emotive Text Description
Yiguo Jiang, Xiaodong Cun, Yong Zhang, Yudian Zheng, Fan Tang, Chi-Man Pun
Main category: cs.CV
TL;DR: EmoCAST is a diffusion-based framework for emotional talking head synthesis that uses text-guided emotional modules and emotion-aware audio attention to generate realistic, expressive, and audio-synchronized portrait videos.
Details
Motivation: Existing methods have limitations in control flexibility, motion naturalness, and expression quality, with datasets primarily collected in lab settings hindering real-world applications.
Method: Proposes a diffusion-based framework with two key modules: text-guided decoupled emotive module for appearance modeling, and emotive audio attention module to capture emotion-audio interplay. Also constructs an emotional dataset with emotive text descriptions and uses emotion-aware sampling and progressive functional training strategies.
Result: Achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos.
Conclusion: EmoCAST successfully addresses limitations of existing methods and demonstrates superior performance in emotional talking head synthesis for practical real-world applications.
Abstract: Emotional talking head synthesis aims to generate talking portrait videos with vivid expressions. Existing methods still exhibit limitations in control flexibility, motion naturalness, and expression quality. Moreover, currently available datasets are primarily collected in lab settings, further exacerbating these shortcomings. Consequently, these limitations substantially hinder practical applications in real-world scenarios. To address these challenges, we propose EmoCAST, a diffusion-based framework with two key modules for precise text-driven emotional synthesis. In appearance modeling, emotional prompts are integrated through a text-guided decoupled emotive module, enhancing the spatial knowledge to improve emotion comprehension. To improve the relationship between audio and emotion, we introduce an emotive audio attention module to capture the interplay between controlled emotion and driving audio, generating emotion-aware features to guide more precise facial motion synthesis. Additionally, we construct an emotional talking head dataset with comprehensive emotive text descriptions to optimize the framework’s performance. Based on the proposed dataset, we propose an emotion-aware sampling training strategy and a progressive functional training strategy that further improve the model’s ability to capture nuanced expressive features and achieve accurate lip-synchronization. Overall, EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos. Project Page: https://github.com/GVCLab/EmoCAST
[134] Mask-Guided Multi-Channel SwinUNETR Framework for Robust MRI Classification
Smriti Joshi, Lidia Garrucho, Richard Osuala, Oliver Diaz, Karim Lekadir
Main category: cs.CV
TL;DR: SwinUNETR-based deep learning framework for breast cancer detection in MRI, achieving second place in multi-center challenge with 511 studies from 6 European centers.
Details
Motivation: Early detection of breast cancer is crucial for improving outcomes, especially in high-risk women or those with dense breast tissue where mammography is less effective. MRI provides high sensitivity but requires AI solutions for better diagnosis and classification.
Method: Developed a SwinUNETR-based deep learning framework incorporating breast region masking, extensive data augmentation, and ensemble learning to enhance robustness and generalizability across multi-center data from different scanner vendors (1.5T and 3T).
Result: Achieved second place on the ODELIA consortium challenge leaderboard, demonstrating strong performance in classifying breast MRI studies as no lesion, benign lesion, or malignant lesion.
Conclusion: The framework shows potential to support clinical breast MRI interpretation and the codebase is publicly shared to facilitate further research and development in AI-based breast cancer diagnosis.
Abstract: Breast cancer is one of the leading causes of cancer-related mortality in women, and early detection is essential for improving outcomes. Magnetic resonance imaging (MRI) is a highly sensitive tool for breast cancer detection, particularly in women at high risk or with dense breast tissue, where mammography is less effective. The ODELIA consortium organized a multi-center challenge to foster AI-based solutions for breast cancer diagnosis and classification. The dataset included 511 studies from six European centers, acquired on scanners from multiple vendors at both 1.5 T and 3 T. Each study was labeled for the left and right breast as no lesion, benign lesion, or malignant lesion. We developed a SwinUNETR-based deep learning framework that incorporates breast region masking, extensive data augmentation, and ensemble learning to improve robustness and generalizability. Our method achieved second place on the challenge leaderboard, highlighting its potential to support clinical breast MRI interpretation. We publicly share our codebase at https://github.com/smriti-joshi/bcnaim-odelia-challenge.git.
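The pipeline combines breast region masking with ensemble learning. A minimal sketch of that inference path, assuming a binary breast mask and probability averaging over ensemble members (both assumptions; the summary does not specify the aggregation rule):

```python
import torch

def masked_ensemble_predict(models, volume, breast_mask):
    """Average class probabilities over an ensemble after restricting the
    input to the breast region (mask is 1 inside the breast, 0 elsewhere)."""
    x = volume * breast_mask                      # zero out non-breast voxels
    probs = [torch.softmax(m(x), dim=1) for m in models]
    return torch.stack(probs).mean(dim=0)         # (B, 3): no/benign/malignant
```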
[135] ArtFace: Towards Historical Portrait Face Identification via Model Adaptation
Francois Poh, Anjith George, Sébastien Marcel
Main category: cs.CV
TL;DR: Foundation models fine-tuned and integrated with traditional facial recognition networks significantly improve facial recognition in historical paintings, overcoming domain shift and artistic variations.
Details
Motivation: Automated facial recognition struggles with historical paintings due to domain shift, high intra-class variation, and artistic factors like style and intent, making sitter identification challenging for art historians.
Method: Fine-tuned foundation models and integrated their embeddings with conventional facial recognition networks to handle artistic variations and domain differences.
Result: Demonstrated notable improvements over current state-of-the-art methods in facial recognition for artworks.
Conclusion: Foundation models can effectively bridge the gap where traditional facial recognition methods fail in artwork analysis, offering better assistance for art historical research.
Abstract: Identifying sitters in historical paintings is a key task for art historians, offering insight into their lives and how they chose to be seen. However, the process is often subjective and limited by the lack of data and stylistic variations. Automated facial recognition is capable of handling challenging conditions and can assist, but while traditional facial recognition models perform well on photographs, they struggle with paintings due to domain shift and high intra-class variation. Artistic factors such as style, skill, intent, and influence from other works further complicate recognition. In this work, we investigate the potential of foundation models to improve facial recognition in artworks. By fine-tuning foundation models and integrating their embeddings with those from conventional facial recognition networks, we demonstrate notable improvements over current state-of-the-art methods. Our results show that foundation models can bridge the gap where traditional methods are ineffective. Paper page at https://www.idiap.ch/paper/artface/
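The core idea is to combine foundation-model embeddings with conventional face-recognition embeddings. A minimal fusion-and-matching sketch; concatenation of normalized embeddings is one plausible integration, and the paper may fuse differently:

```python
import torch
import torch.nn.functional as F

def fused_embedding(foundation_feat, fr_feat):
    """Concatenate L2-normalized foundation-model and face-recognition
    embeddings into a single descriptor (fusion scheme assumed)."""
    return torch.cat([F.normalize(foundation_feat, dim=-1),
                      F.normalize(fr_feat, dim=-1)], dim=-1)

def identify(query, gallery):
    """Rank gallery sitters by cosine similarity to the query portrait."""
    sims = F.normalize(query, dim=-1) @ F.normalize(gallery, dim=-1).t()
    return sims.argsort(dim=-1, descending=True)
```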
[136] ChainReaction! Structured Approach with Causal Chains as Intermediate Representations for Improved and Explainable Causal Video Question Answering
Paritosh Parmar, Eric Peh, Basura Fernando
Main category: cs.CV
TL;DR: A modular framework that decouples causal reasoning from answer generation using interpretable natural language causal chains, outperforming state-of-the-art VideoQA models while improving explainability and generalization.
Details
Motivation: Existing VideoQA models struggle with higher-order reasoning, rely on opaque monolithic pipelines, and lack interpretability, depending on shallow heuristics rather than transparent causal inference.
Method: Two-stage architecture: Causal Chain Extractor (CCE) generates causal chains from video-question pairs, and Causal Chain-Driven Answerer (CCDA) produces answers grounded in these chains. Uses LLMs to generate causal chains from existing datasets and introduces CauCo evaluation metric.
Result: Outperforms state-of-the-art models on three large-scale benchmarks, with substantial gains in explainability, user trust, and generalization. CCE serves as reusable causal reasoning engine.
Conclusion: The modular framework with explicit causal chains enables transparent and logically coherent inference, bridging low-level video content with high-level causal reasoning while providing interpretable intermediate representations.
Abstract: Existing Causal-Why Video Question Answering (VideoQA) models often struggle with higher-order reasoning, relying on opaque, monolithic pipelines that entangle video understanding, causal inference, and answer generation. These black-box approaches offer limited interpretability and tend to depend on shallow heuristics. We propose a novel, modular framework that explicitly decouples causal reasoning from answer generation, introducing natural language causal chains as interpretable intermediate representations. Inspired by human cognitive models, these structured cause-effect sequences bridge low-level video content with high-level causal reasoning, enabling transparent and logically coherent inference. Our two-stage architecture comprises a Causal Chain Extractor (CCE) that generates causal chains from video-question pairs, and a Causal Chain-Driven Answerer (CCDA) that produces answers grounded in these chains. To address the lack of annotated reasoning traces, we introduce a scalable method for generating high-quality causal chains from existing datasets using large language models. We also propose CauCo, a new evaluation metric for causality-oriented captioning. Experiments on three large-scale benchmarks demonstrate that our approach not only outperforms state-of-the-art models, but also yields substantial gains in explainability, user trust, and generalization – positioning the CCE as a reusable causal reasoning engine across diverse domains. Project page: https://paritoshparmar.github.io/chainreaction/
[137] AvatarBack: Back-Head Generation for Complete 3D Avatars from Front-View Images
Shiqi Xin, Xiaolin Zhang, Yanbin Liu, Peng Zhang, Caifeng Shan
Main category: cs.CV
TL;DR: AvatarBack is a plug-and-play framework that reconstructs complete 3D Gaussian head avatars by addressing the poor back-head reconstruction in existing methods through synthetic back-view generation and adaptive spatial alignment.
Details
Motivation: Existing Gaussian Splatting methods for head avatars rely mainly on frontal-view images, resulting in poorly constructed back-head regions with geometric inconsistencies, structural blurring, and reduced realism, which limits avatar fidelity.
Method: AvatarBack integrates two core innovations: 1) Subject-specific Generator (SSG) that synthesizes identity-consistent back-view pseudo-images from sparse frontal inputs, and 2) Adaptive Spatial Alignment Strategy (ASA) that uses learnable transformation matrices to resolve pose and coordinate discrepancies between synthetic views and 3D Gaussian representation.
Result: Extensive experiments on NeRSemble and K-hairstyle datasets show AvatarBack significantly enhances back-head reconstruction quality while preserving frontal fidelity, with improved performance across geometric, photometric, and GPT-4o-based perceptual metrics. The avatars maintain consistent visual realism under diverse motions and remain fully animatable.
Conclusion: AvatarBack successfully addresses the back-head reconstruction challenge in 3D Gaussian avatars through its novel plug-and-play framework, enabling complete and consistent avatar modeling with enhanced realism and animatability.
Abstract: Recent advances in Gaussian Splatting have significantly boosted the reconstruction of head avatars, enabling high-quality facial modeling by representing a 3D avatar as a collection of 3D Gaussians. However, existing methods predominantly rely on frontal-view images, leaving the back-head poorly constructed. This leads to geometric inconsistencies, structural blurring, and reduced realism in the rear regions, ultimately limiting the fidelity of reconstructed avatars. To address this challenge, we propose AvatarBack, a novel plug-and-play framework specifically designed to reconstruct complete and consistent 3D Gaussian avatars by explicitly modeling the missing back-head regions. AvatarBack integrates two core technical innovations, i.e., the Subject-specific Generator (SSG) and the Adaptive Spatial Alignment Strategy (ASA). The former leverages a generative prior to synthesize identity-consistent, plausible back-view pseudo-images from sparse frontal inputs, providing robust multi-view supervision. To achieve precise geometric alignment between these synthetic views and the 3D Gaussian representation, the latter employs learnable transformation matrices optimized during training, effectively resolving inherent pose and coordinate discrepancies. Extensive experiments on NeRSemble and K-hairstyle datasets, evaluated using geometric, photometric, and GPT-4o-based perceptual metrics, demonstrate that AvatarBack significantly enhances back-head reconstruction quality while preserving frontal fidelity. Moreover, the reconstructed avatars maintain consistent visual realism under diverse motions and remain fully animatable.
[138] CraftGraffiti: Exploring Human Identity with Custom Graffiti Art via Facial-Preserving Diffusion Models
Ayan Banerjee, Fernando Vilariño, Josep Lladós
Main category: cs.CV
TL;DR: CraftGraffiti is an end-to-end text-guided graffiti generation framework that preserves facial identity while applying extreme stylistic transformations, using a style-first approach with face-consistent self-attention mechanisms.
Details
Motivation: Preserving facial identity in graffiti art is challenging because high-contrast, abstract styles can distort facial features and erase recognizability, undermining personal and cultural authenticity.
Method: Uses LoRA-fine-tuned pretrained diffusion transformer for style transfer, face-consistent self-attention with identity embeddings for identity preservation, and CLIP-guided prompt extension for pose customization without keypoints.
Result: Achieves competitive facial feature consistency, state-of-the-art aesthetic scores, and high human preference. Successfully deployed at Cruilla Festival with real-world creative impact.
Conclusion: CraftGraffiti advances identity-respectful AI-assisted artistry by blending stylistic freedom with recognizability through a principled style-first, identity-after paradigm.
Abstract: Preserving facial identity under extreme stylistic transformation remains a major challenge in generative art. In graffiti, a high-contrast, abstract medium, subtle distortions to the eyes, nose, or mouth can erase the subject’s recognizability, undermining both personal and cultural authenticity. We present CraftGraffiti, an end-to-end text-guided graffiti generation framework designed with facial feature preservation as a primary objective. Given an input image and a style and pose descriptive prompt, CraftGraffiti first applies graffiti style transfer via LoRA-fine-tuned pretrained diffusion transformer, then enforces identity fidelity through a face-consistent self-attention mechanism that augments attention layers with explicit identity embeddings. Pose customization is achieved without keypoints, using CLIP-guided prompt extension to enable dynamic re-posing while retaining facial coherence. We formally justify and empirically validate the “style-first, identity-after” paradigm, showing it reduces attribute drift compared to the reverse order. Quantitative results demonstrate competitive facial feature consistency and state-of-the-art aesthetic and human preference scores, while qualitative analyses and a live deployment at the Cruilla Festival highlight the system’s real-world creative impact. CraftGraffiti advances the goal of identity-respectful AI-assisted artistry, offering a principled approach for blending stylistic freedom with recognizability in creative AI applications.
[139] Learned Rate Control for Frame-Level Adaptive Neural Video Compression via Dynamic Neural Network
Chenhao Zhang, Wei Gao
Main category: cs.CV
TL;DR: Dynamic video compression framework with variable coding routes and rate control agent for precise bitrate targeting and improved RD performance.
Details
Motivation: Neural video compression lacks precise rate control due to inherent limitations of learning-based codecs, making variable bitrate scenarios challenging.
Method: Proposes Dynamic-Route Autoencoder with variable coding routes (each with partial complexity and distinct RD trade-off) and Rate Control Agent that estimates bitrates and adjusts routes at runtime. Uses Joint-Routes Optimization for collaborative training.
Result: Achieves 14.8% BD-Rate reduction and 0.47dB BD-PSNR gain over state-of-the-art methods with only 1.66% average bitrate error on HEVC and UVG datasets.
Conclusion: The framework successfully achieves Rate-Distortion-Complexity Optimization for various bitrate-constrained applications, providing precise rate control in neural video compression.
Abstract: Neural Video Compression (NVC) has achieved remarkable performance in recent years. However, precise rate control remains a challenge due to the inherent limitations of learning-based codecs. To solve this issue, we propose a dynamic video compression framework designed for variable bitrate scenarios. First, to achieve variable bitrate implementation, we propose the Dynamic-Route Autoencoder (DRA) with variable coding routes, each occupying partial computational complexity of the whole network and navigating to a distinct RD trade-off. Second, to approach the target bitrate, the Rate Control Agent estimates the bitrate of each route and adjusts the coding route of the DRA at run time. To encompass a broad spectrum of variable bitrates while preserving overall RD performance, we employ the Joint-Routes Optimization strategy, achieving collaborative training of the various routes. Extensive experiments on the HEVC and UVG datasets show that the proposed method achieves an average BD-Rate reduction of 14.8% and a BD-PSNR gain of 0.47dB over state-of-the-art methods while maintaining an average bitrate error of 1.66%, achieving Rate-Distortion-Complexity Optimization (RDCO) for various bitrate-constrained applications. Our code is available at https://git.openi.org.cn/OpenAICoding/DynamicDVC.
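The Rate Control Agent estimates each route's bitrate and picks a coding route at run time. A schematic selection loop under assumed interfaces; estimate_bitrate is a hypothetical method and the actual agent is learned:

```python
def select_route(routes, rate_agent, frame, target_bpp):
    """Pick the coding route whose estimated bitrate comes closest to the
    target without exceeding it; fall back to the cheapest route."""
    best, best_bpp = None, 0.0
    for route in routes:                                  # sorted by complexity
        bpp = rate_agent.estimate_bitrate(frame, route)   # hypothetical API
        if bpp <= target_bpp and bpp > best_bpp:
            best, best_bpp = route, bpp
    return best if best is not None else routes[0]
```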
[140] CardioMorphNet: Cardiac Motion Prediction Using a Shape-Guided Bayesian Recurrent Deep Network
Reza Akbari Movahed, Abuzar Rezaee, Arezoo Zakeri, Colin Berry, Edmond S. L. Ho, Ali Gooya
Main category: cs.CV
TL;DR: CardioMorphNet is a recurrent Bayesian deep learning framework for 3D cardiac motion estimation that uses shape-guided registration instead of intensity-based methods, achieving superior performance on UK Biobank data with lower uncertainty.
Details
Motivation: Existing cardiac motion estimation methods struggle with accuracy because they rely on intensity-based image registration similarity losses that may overlook cardiac anatomical regions.
Method: A recurrent variational autoencoder framework that models spatio-temporal dependencies, uses two posterior models for bi-ventricular segmentation and motion estimation, and employs shape-guided registration without intensity-based similarity loss.
Result: Superior performance in cardiac motion estimation compared to state-of-the-art methods on UK Biobank dataset, with lower uncertainty values indicating higher confidence in predictions.
Conclusion: CardioMorphNet effectively addresses limitations of intensity-based registration by focusing on anatomical regions through shape-guided approach, providing more accurate and confident cardiac motion estimation.
Abstract: Accurate cardiac motion estimation from cine cardiac magnetic resonance (CMR) images is vital for assessing cardiac function and detecting its abnormalities. Existing methods often struggle to capture heart motion accurately because they rely on intensity-based image registration similarity losses that may overlook cardiac anatomical regions. To address this, we propose CardioMorphNet, a recurrent Bayesian deep learning framework for 3D cardiac shape-guided deformable registration using short-axis (SAX) CMR images. It employs a recurrent variational autoencoder to model spatio-temporal dependencies over the cardiac cycle and two posterior models for bi-ventricular segmentation and motion estimation. The derived loss function from the Bayesian formulation guides the framework to focus on anatomical regions by recursively registering segmentation maps without using intensity-based image registration similarity loss, while leveraging sequential SAX volumes and spatio-temporal features. The Bayesian modelling also enables computation of uncertainty maps for the estimated motion fields. Validated on the UK Biobank dataset by comparing warped mask shapes with ground truth masks, CardioMorphNet demonstrates superior performance in cardiac motion estimation, outperforming state-of-the-art methods. Uncertainty assessment shows that it also yields lower uncertainty values for estimated motion fields in the cardiac region compared with other probabilistic-based cardiac registration methods, indicating higher confidence in its predictions.
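The key departure here is supervising registration with segmentation shapes rather than image intensities. A minimal soft-Dice overlap loss between a warped source mask and the target mask illustrates the idea; the paper's actual loss is derived from its Bayesian formulation, so this is only the shape-overlap core:

```python
import torch

def soft_dice_loss(warped_seg, target_seg, eps=1e-6):
    """Dice loss between a warped source segmentation and the target
    segmentation -- supervision comes from shapes, not image intensities."""
    inter = (warped_seg * target_seg).sum(dim=(-3, -2, -1))
    union = warped_seg.sum(dim=(-3, -2, -1)) + target_seg.sum(dim=(-3, -2, -1))
    return 1.0 - (2.0 * inter + eps) / (union + eps)   # per-sample loss
```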
[141] ${C}^{3}$-GS: Learning Context-aware, Cross-dimension, Cross-scale Feature for Generalizable Gaussian Splatting
Yuxi Hu, Jun Zhang, Kuangyi Chen, Zhe Zhang, Friedrich Fraundorfer
Main category: cs.CV
TL;DR: C3-GS is a novel framework that improves generalizable Gaussian splatting by incorporating context-aware, cross-dimension, and cross-scale constraints to enhance feature learning for better novel view synthesis from sparse input views.
Details
Motivation: Existing generalizable Gaussian splatting methods struggle with encoding discriminative, multi-view consistent features for accurate geometry construction from sparse views, limiting their rendering quality.
Method: Proposes C3-GS framework with three lightweight modules integrated into a unified rendering pipeline: context-aware, cross-dimension, and cross-scale constraints to improve feature fusion without additional supervision.
Result: Achieves state-of-the-art rendering quality and generalization ability on benchmark datasets, enabling photorealistic synthesis from sparse input views.
Conclusion: C3-GS effectively addresses feature learning limitations in generalizable Gaussian splatting through multi-constraint integration, demonstrating superior performance in novel view synthesis without per-scene optimization.
Abstract: Generalizable Gaussian Splatting aims to synthesize novel views for unseen scenes without per-scene optimization. In particular, recent advancements utilize feed-forward networks to predict per-pixel Gaussian parameters, enabling high-quality synthesis from sparse input views. However, existing approaches fall short in encoding discriminative, multi-view consistent features for Gaussian predictions, which struggle to construct accurate geometry with sparse views. To address this, we propose $\mathbf{C}^{3}$-GS, a framework that enhances feature learning by incorporating context-aware, cross-dimension, and cross-scale constraints. Our architecture integrates three lightweight modules into a unified rendering pipeline, improving feature fusion and enabling photorealistic synthesis without requiring additional supervision. Extensive experiments on benchmark datasets validate that $\mathbf{C}^{3}$-GS achieves state-of-the-art rendering quality and generalization ability. Code is available at: https://github.com/YuhsiHu/C3-GS.
[142] Mix, Align, Distil: Reliable Cross-Domain Atypical Mitosis Classification
Kaustubh Atey, Sameer Anand Jha, Gouranga Bala, Amit Sethi
Main category: cs.CV
TL;DR: A simple training-time recipe for domain-robust atypical mitotic figure classification that uses style perturbations, attention-based feature alignment, and EMA teacher distillation to achieve strong performance across different scanners and acquisition conditions.
Details
Motivation: Atypical mitotic figures are important histopathological markers but are challenging to identify consistently due to domain shifts from scanner, stain, and acquisition differences. Current methods struggle with cross-domain generalization.
Method: Three key components: (1) style perturbations at early/mid backbone stages for feature diversity, (2) attention-refined feature alignment across domains using weak domain labels, (3) EMA teacher distillation with temperature-scaled KL divergence for prediction stability.
Result: Achieved balanced accuracy of 0.8762, sensitivity of 0.8873, specificity of 0.8651, and ROC AUC of 0.9499 on MIDOG 2025 Task 2 preliminary leaderboard, with negligible inference overhead.
Conclusion: The method provides strong, balanced performance for domain-robust AMF classification using only coarse domain metadata, making it a competitive submission for the MIDOG 2025 challenge.
Abstract: Atypical mitotic figures (AMFs) are important histopathological markers yet remain challenging to identify consistently, particularly under domain shift stemming from scanner, stain, and acquisition differences. We present a simple training-time recipe for domain-robust AMF classification in MIDOG 2025 Task 2. The approach (i) increases feature diversity via style perturbations inserted at early and mid backbone stages, (ii) aligns attention-refined features across sites using weak domain labels (Scanner, Origin, Species, Tumor) through an auxiliary alignment loss, and (iii) stabilizes predictions by distilling from an exponential moving average (EMA) teacher with temperature-scaled KL divergence. On the organizer-run preliminary leaderboard for atypical mitosis classification, our submission attains balanced accuracy of 0.8762, sensitivity of 0.8873, specificity of 0.8651, and ROC AUC of 0.9499. The method incurs negligible inference-time overhead, relies only on coarse domain metadata, and delivers strong, balanced performance, positioning it as a competitive submission for the MIDOG 2025 challenge.
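Component (iii), EMA-teacher distillation with temperature-scaled KL divergence, is a standard recipe and can be sketched directly; the momentum and temperature values below are illustrative defaults, not the authors' settings:

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Exponential-moving-average update of teacher weights from the student."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Temperature-scaled KL divergence from the EMA teacher to the student."""
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T

# teacher = copy.deepcopy(student)  # teacher starts as a frozen copy
```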
[143] SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding
Jiawen Lin, Shiran Bian, Yihang Zhu, Wenbin Tan, Yachao Zhang, Yuan Xie, Yanyun Qu
Main category: cs.CV
TL;DR: SeqVLM is a zero-shot 3D visual grounding framework that uses multi-view scene images with spatial information to localize objects from natural language descriptions without scene-specific training.
Details
Motivation: Existing zero-shot 3DVG methods suffer from spatial-limited reasoning due to single-view localization and contextual omissions/detail degradation, limiting real-world applicability.
Method: Generates 3D instance proposals via semantic segmentation, refines through semantic filtering, uses proposal-guided multi-view projection to preserve spatial relationships, and implements dynamic scheduling for VLM processing of sequence-query prompts.
Result: Achieves state-of-the-art performance on ScanRefer (55.6% Acc@0.25) and Nr3D (53.2% Acc@0.25) benchmarks, surpassing previous zero-shot methods by 4.0% and 5.2% respectively.
Conclusion: SeqVLM advances 3D visual grounding toward greater generalization and real-world applicability by effectively leveraging multi-view spatial information and VLM reasoning capabilities without scene-specific training.
Abstract: 3D Visual Grounding (3DVG) aims to localize objects in 3D scenes using natural language descriptions. Although supervised methods achieve higher accuracy in constrained settings, zero-shot 3DVG holds greater promise for real-world applications since it eliminates scene-specific training requirements. However, existing zero-shot methods face challenges of spatial-limited reasoning due to reliance on single-view localization, and contextual omissions or detail degradation. To address these issues, we propose SeqVLM, a novel zero-shot 3DVG framework that leverages multi-view real-world scene images with spatial information for target object reasoning. Specifically, SeqVLM first generates 3D instance proposals via a 3D semantic segmentation network and refines them through semantic filtering, retaining only semantic-relevant candidates. A proposal-guided multi-view projection strategy then projects these candidate proposals onto real scene image sequences, preserving spatial relationships and contextual details in the conversion from 3D point clouds to images. Furthermore, to mitigate VLM computational overload, we implement a dynamic scheduling mechanism that iteratively processes sequence-query prompts, leveraging the VLM’s cross-modal reasoning capabilities to identify textually specified objects. Experiments on the ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, achieving Acc@0.25 scores of 55.6% and 53.2%, surpassing previous zero-shot methods by 4.0% and 5.2%, respectively, which advances 3DVG toward greater generalization and real-world applicability. The code is available at https://github.com/JiawLin/SeqVLM.
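Projecting 3D instance proposals onto multi-view images reduces to a pinhole-camera projection per view. A minimal sketch under assumed camera conventions (world-to-camera extrinsics R, t and intrinsics K; the paper's exact projection details are not given in this summary):

```python
import numpy as np

def project_points(points, K, R, t):
    """Pinhole projection of 3D proposal points into one camera view.
    points: (N, 3) world coords; K: (3, 3) intrinsics; R, t: world-to-camera
    extrinsics (conventions assumed)."""
    cam = points @ R.T + t          # world -> camera coordinates
    uv = cam @ K.T                  # camera -> homogeneous image coordinates
    uv = uv[:, :2] / uv[:, 2:3]     # perspective divide
    in_front = cam[:, 2] > 0        # keep only points in front of the camera
    return uv[in_front]
```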
[144] Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
Main category: cs.CV
TL;DR: Pref-GRPO addresses reward hacking in T2I generation by using pairwise preference rewards instead of pointwise scoring, and introduces UniGenBench for comprehensive model evaluation.
Details
Motivation: Current GRPO methods using pointwise reward models suffer from reward hacking where minimal score differences are amplified, causing unstable training and over-optimization for trivial gains.
Method: Pref-GRPO uses pairwise preference comparisons within groups and win rates as reward signals, shifting from score maximization to preference fitting. Also introduces UniGenBench with 600 prompts across 5 themes and detailed evaluation criteria using MLLM.
Result: Pref-GRPO effectively differentiates subtle image quality differences, provides stable advantages, and mitigates reward hacking. UniGenBench reveals strengths/weaknesses of T2I models and validates Pref-GRPO’s effectiveness.
Conclusion: The proposed Pref-GRPO method and UniGenBench benchmark address critical limitations in current T2I generation and evaluation, providing more stable training and comprehensive assessment capabilities.
Abstract: Recent advancements highlight the importance of GRPO-based reinforcement learning methods and benchmarking in enhancing text-to-image (T2I) generation. However, current methods using pointwise reward models (RM) for scoring generated images are susceptible to reward hacking. We reveal that this happens when minimal score differences between images are amplified after normalization, creating illusory advantages that drive the model to over-optimize for trivial gains, ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images are pairwise compared within each group using a preference RM, and the win rate is used as the reward signal. Extensive experiments demonstrate that Pref-GRPO differentiates subtle image quality differences, providing more stable advantages and mitigating reward hacking. Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary and 27 sub-criteria, leveraging MLLM for benchmark construction and evaluation. Our benchmark uncovers the strengths and weaknesses of both open and closed-source T2I models and validates the effectiveness of Pref-GRPO.
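The reward signal is a within-group pairwise win rate rather than a pointwise score. A minimal sketch, assuming a preference reward model that exposes a hypothetical prefers(img_a, img_b, prompt) comparison:

```python
import torch

def win_rate_rewards(images, prompt, preference_rm):
    """Reward each image in a GRPO group by its pairwise win rate under a
    preference reward model. preference_rm.prefers is an assumed interface
    returning True when the first image beats the second for the prompt."""
    n = len(images)
    wins = torch.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j and preference_rm.prefers(images[i], images[j], prompt):
                wins[i] += 1.0
    return wins / max(n - 1, 1)   # win rate in [0, 1], used as the reward
```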
[145] Occlusion Robustness of CLIP for Military Vehicle Classification
Jan Erik van Woerden, Gertjan Burghouts, Lotte Nijskens, Alma M. Liezenga, Sabina van Rooij, Frank Ruis, Hugo J. Kuijf
Main category: cs.CV
TL;DR: CLIP’s robustness to occlusion in military environments is tested, showing Transformer models outperform CNNs, fine-grained occlusions are more damaging, and backbone finetuning significantly improves occlusion resilience.
Details
Motivation: Evaluate CLIP's robustness in challenging military environments with occlusion and degraded SNR, which remains underexplored despite VLMs' advantages for defense applications with scarce labeled data.
Method: Investigated CLIP variants’ robustness using a custom military vehicle dataset with 18 classes, evaluated using Normalized Area Under the Curve (NAUC) across different occlusion percentages and types.
Result: Four key findings: Transformer-based CLIP outperforms CNNs; fine-grained dispersed occlusions degrade performance more than large contiguous ones; linear-probed models drop sharply at ~35% occlusion; backbone finetuning pushes performance drop to >60% occlusion.
Conclusion: Occlusion-specific augmentations during training are crucial, and further exploration into patch-level sensitivity and architectural resilience is needed for real-world CLIP deployment in military applications.
Abstract: Vision-language models (VLMs) like CLIP enable zero-shot classification by aligning images and text in a shared embedding space, offering advantages for defense applications with scarce labeled data. However, CLIP’s robustness in challenging military environments, with partial occlusion and degraded signal-to-noise ratio (SNR), remains underexplored. We investigate CLIP variants’ robustness to occlusion using a custom dataset of 18 military vehicle classes and evaluate using Normalized Area Under the Curve (NAUC) across occlusion percentages. Four key insights emerge: (1) Transformer-based CLIP models consistently outperform CNNs, (2) fine-grained, dispersed occlusions degrade performance more than larger contiguous occlusions, (3) despite improved accuracy, performance of linear-probed models sharply drops at around 35% occlusion, (4) by finetuning the model’s backbone, this performance drop occurs at more than 60% occlusion. These results underscore the importance of occlusion-specific augmentations during training and the need for further exploration into patch-level sensitivity and architectural resilience for real-world deployment of CLIP.
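NAUC here summarizes an accuracy-versus-occlusion curve in a single number. A small sketch of one plausible normalization (trapezoidal area divided by the ideal area; the paper's exact definition is assumed):

```python
import numpy as np

def nauc(occ_pcts, accuracies):
    """Normalized area under the accuracy-vs-occlusion curve, scaled so that
    perfect accuracy at every occlusion level gives 1.0."""
    x = np.asarray(occ_pcts, dtype=float)
    y = np.asarray(accuracies, dtype=float)
    area = np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x))   # trapezoidal rule
    return float(area / (x[-1] - x[0]))                  # ideal area = x-range

# e.g. nauc([0, 20, 40, 60, 80], [0.95, 0.85, 0.62, 0.41, 0.20])
```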
[146] SKGE-SWIN: End-To-End Autonomous Vehicle Waypoint Prediction and Navigation Using Skip Stage Swin Transformer
Fachri Najm Noer Kartiman, Rasim, Yaya Wihardi, Nurul Hasanah, Oskar Natan, Bambang Wahono, Taufik Ibnu Salim
Main category: cs.CV
TL;DR: Proposes SKGE-Swin architecture for autonomous vehicles using Swin Transformer with skip-stage mechanism to enhance global feature representation and pixel-to-pixel context awareness.
Details
Motivation: To develop an end-to-end autonomous vehicle model with improved pixel-to-pixel context awareness and better understanding of complex patterns in vehicle surroundings.
Method: Utilizes Swin Transformer with skip-stage mechanism and Shifted Window-based Multi-head Self-Attention (SW-MSA) to extract information from distant pixels while retaining critical information throughout feature extraction stages.
Result: Achieves superior Driving Score compared to previous methods when evaluated on CARLA platform using adversarial scenarios simulating real-world conditions.
Conclusion: The SKGE-Swin architecture effectively enhances autonomous vehicle perception and performance, with planned ablation studies to validate individual component contributions.
Abstract: Focusing on the development of an end-to-end autonomous vehicle model with pixel-to-pixel context awareness, this research proposes the SKGE-Swin architecture. This architecture utilizes the Swin Transformer with a skip-stage mechanism to broaden feature representation globally and at various network levels. This approach enables the model to extract information from distant pixels by leveraging the Swin Transformer’s Shifted Window-based Multi-head Self-Attention (SW-MSA) mechanism and to retain critical information from the initial to the final stages of feature extraction, thereby enhancing its capability to comprehend complex patterns in the vehicle’s surroundings. The model is evaluated on the CARLA platform using adversarial scenarios to simulate real-world conditions. Experimental results demonstrate that the SKGE-Swin architecture achieves a superior Driving Score compared to previous methods. Furthermore, an ablation study will be conducted to evaluate the contribution of each architectural component, including the influence of skip connections and the use of the Swin Transformer, in improving model performance.
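The skip-stage mechanism routes early-stage features to later stages so fine detail survives deep feature extraction. A generic skip-stage fusion sketch (upsample-and-concatenate is one common realization, not the exact SKGE-Swin wiring):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipStageFusion(nn.Module):
    """Fuse early- and late-stage feature maps by upsampling the deep map
    and concatenating, then projecting back to a working channel count."""
    def __init__(self, early_ch, late_ch, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(early_ch + late_ch, out_ch, kernel_size=1)

    def forward(self, early, late):
        late_up = F.interpolate(late, size=early.shape[-2:],
                                mode="bilinear", align_corners=False)
        return self.proj(torch.cat([early, late_up], dim=1))
```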
[147] Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding
Gowreesh Mago, Pascal Mettes, Stevan Rudinac
Main category: cs.CV
TL;DR: Survey paper advocating for renewed focus on abstract concept recognition in videos using modern foundation models, building on decades of community experience to avoid reinventing solutions.
Details
Motivation: Humans can recognize abstract concepts like justice and freedom in videos, but current AI systems mainly understand concrete visible elements. Abstract concept recognition remains a crucial open challenge that aligns models with human reasoning and values.
Method: This is a survey paper that studies different tasks and datasets for abstract concept understanding in videos. It analyzes decades of community experience and research attempts, advocating for leveraging recent advances in multi-modal foundation models.
Result: The paper provides a comprehensive overview of the field, identifying that researchers have periodically attempted to solve abstract concept recognition using available tools over a long period. It establishes the foundation for addressing this grand challenge.
Conclusion: Drawing on decades of community experience will help address abstract concept understanding in videos more effectively in the era of multi-modal foundation models, avoiding redundant work and building on existing knowledge.
Abstract: The automatic understanding of video content is advancing rapidly. Empowered by deeper neural networks and large datasets, machines are increasingly capable of understanding what is concretely visible in video frames, whether it be objects, actions, events, or scenes. In comparison, humans retain a unique ability to also look beyond concrete entities and recognize abstract concepts like justice, freedom, and togetherness. Abstract concept recognition forms a crucial open challenge in video understanding, where reasoning on multiple semantic levels based on contextual information is key. In this paper, we argue that the recent advances in foundation models make for an ideal setting to address abstract concept understanding in videos. Automated understanding of high-level abstract concepts is imperative as it enables models to be more aligned with human reasoning and values. In this survey, we study different tasks and datasets used to understand abstract concepts in video content. We observe that, periodically and over a long period, researchers have attempted to solve these tasks, making the best use of the tools at their disposal. We advocate that drawing on decades of community experience will help us shed light on this important open grand challenge and avoid “re-inventing the wheel” as we start revisiting it in the era of multi-modal foundation models.
[148] Safer Skin Lesion Classification with Global Class Activation Probability Map Evaluation and SafeML
Kuniko Paxton, Koorosh Aslansefat, Amila Akagić, Dhavalkumar Thakker, Yiannis Papadopoulos
Main category: cs.CV
TL;DR: Proposes Global Class Activation Probabilistic Map Evaluation method for trustworthy skin lesion diagnosis by analyzing all classes’ activation maps probabilistically at pixel level, combined with SafeML for error detection.
Details
Motivation: Address distrust in AI medical models by providing explainable diagnoses beyond just high accuracy, overcoming limitations of existing explainability methods like LIME inconsistency and CAM's failure to consider all classes.
Method: Global Class Activation Probabilistic Map Evaluation that analyses all classes’ activation probability maps probabilistically at pixel level, plus SafeML integration for false diagnosis detection and warnings.
Result: Method evaluated on ISIC datasets using MobileNetV2 and Vision Transformers, showing improved diagnostic reliability through unified visualization of diagnostic process.
Conclusion: The proposed approach reduces misdiagnosis risk, enhances diagnostic reliability, and improves patient safety by providing trustworthy, explainable AI diagnoses in dermatology.
Abstract: Recent advancements in skin lesion classification models have significantly improved accuracy, with some models even surpassing dermatologists’ diagnostic performance. However, in medical practice, distrust in AI models remains a challenge. Beyond high accuracy, trustworthy, explainable diagnoses are essential. Existing explainability methods have reliability issues, with LIME-based methods suffering from inconsistency, while CAM-based methods failing to consider all classes. To address these limitations, we propose Global Class Activation Probabilistic Map Evaluation, a method that analyses all classes’ activation probability maps probabilistically and at a pixel level. By visualizing the diagnostic process in a unified manner, it helps reduce the risk of misdiagnosis. Furthermore, the application of SafeML enhances the detection of false diagnoses and issues warnings to doctors and patients as needed, improving diagnostic reliability and ultimately patient safety. We evaluated our method using the ISIC datasets with MobileNetV2 and Vision Transformers.
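The global evaluation considers the activation maps of all classes at each pixel, rather than inspecting a single class's CAM in isolation. The per-pixel probabilistic view can be sketched in one line (a deliberate simplification of the method):

```python
import torch

def class_activation_probability_maps(cams):
    """Turn per-class activation maps of shape (C, H, W) into per-pixel class
    probabilities via a softmax across the class axis, so every class
    contributes to the explanation at every pixel."""
    return torch.softmax(cams, dim=0)   # (C, H, W); sums to 1 over classes
```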
[149] Evaluating Compositional Generalisation in VLMs and Diffusion Models
Beth Pearson, Bilal Boulbarss, Michael Wray, Martha Lewis
Main category: cs.CV
TL;DR: Diffusion Classifier shows improved compositional generalization compared to CLIP in attribute binding tasks, but all vision-language models struggle with relational reasoning like left/right concepts.
Details
Motivation: Vision-language models like CLIP often fail at compositional semantics, incorrectly combining attributes and objects (e.g., calling a red cube + blue cylinder as 'red cylinder'). The paper explores whether generative diffusion-based classifiers have better compositional generalization abilities.
Method: Evaluated three models (Diffusion Classifier, CLIP, and ViLT) on binding objects with attributes and relations in zero-shot learning (ZSL) and generalized zero-shot learning (GZSL) settings. Analyzed embedding similarities to understand performance issues.
Result: Diffusion Classifier and ViLT performed well at concept binding tasks, but all models struggled significantly with relational GZSL tasks. CLIP embeddings showed overly similar representations for relational concepts like left and right.
Conclusion: While diffusion-based classifiers show promise for compositional generalization, all current vision-language models face significant challenges with relational reasoning, indicating broader limitations in handling compositional semantics.
Abstract: A fundamental aspect of the semantics of natural language is that novel meanings can be formed from the composition of previously known parts. Vision-language models (VLMs) have made significant progress in recent years, however, there is evidence that they are unable to perform this kind of composition. For example, given an image of a red cube and a blue cylinder, a VLM such as CLIP is likely to incorrectly label the image as a red cylinder or a blue cube, indicating it represents the image as a ‘bag-of-words’ and fails to capture compositional semantics. Diffusion models have recently gained significant attention for their impressive generative abilities, and zero-shot classifiers based on diffusion models have been shown to perform competitively with CLIP in certain compositional tasks. In this work we explore whether the generative Diffusion Classifier has improved compositional generalisation abilities compared to discriminative models. We assess three models – Diffusion Classifier, CLIP, and ViLT – on their ability to bind objects with attributes and relations in both zero-shot learning (ZSL) and generalised zero-shot learning (GZSL) settings. Our results show that the Diffusion Classifier and ViLT perform well at concept binding tasks, but that all models struggle significantly with the relational GZSL task, underscoring the broader challenges VLMs face with relational reasoning. Analysis of CLIP embeddings suggests that the difficulty may stem from overly similar representations of relational concepts such as left and right. Code and dataset are available at: https://github.com/otmive/diffusion_classifier_clip
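A diffusion classifier scores each candidate label by how well the text-conditioned denoiser predicts injected noise, then picks the label with the lowest expected error. A schematic version under assumed interfaces; eps_model and the linear beta schedule are placeholders, not the evaluated model:

```python
import torch

@torch.no_grad()
def diffusion_classify(x0, class_prompts, eps_model, n_trials=16, T=1000):
    """Zero-shot classification with a diffusion model: the prompt whose
    conditioning best predicts the injected noise (lowest MSE) wins.
    eps_model(x_t, t, prompt) -> predicted noise is an assumed interface."""
    betas = torch.linspace(1e-4, 0.02, T)          # assumed noise schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    errors = []
    for prompt in class_prompts:
        err = 0.0
        for _ in range(n_trials):
            t = torch.randint(0, T, (1,)).item()   # random timestep
            noise = torch.randn_like(x0)
            x_t = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * noise
            err += (eps_model(x_t, t, prompt) - noise).pow(2).mean().item()
        errors.append(err / n_trials)
    return min(range(len(errors)), key=errors.__getitem__)  # class index
```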
[150] Surfel-based 3D Registration with Equivariant SE(3) Features
Xueyang Kang, Hang Zhao, Kourosh Khoshelham, Patrick Vandewalle
Main category: cs.CV
TL;DR: A novel surfel-based pose learning regression approach for point cloud registration that uses SE(3) equivariant features to handle noisy inputs and aggressive rotations without extensive training augmentations.
Details
Motivation: Existing point cloud registration methods ignore point orientations and uncertainties, making them susceptible to noise and aggressive rotations like orthogonal transformations, requiring extensive training data with transformation augmentations.
Method: Initialize surfels from Lidar point cloud using virtual perspective camera parameters, learn explicit SE(3) equivariant features (position and rotation) through SE(3) equivariant convolutional kernels to predict relative transformations. Uses equivariant convolutional encoder, cross-attention mechanism, fully-connected decoder, and non-linear Huber loss.
Result: Experimental results on indoor and outdoor datasets demonstrate superior and robust performance on real point-cloud scans compared to state-of-the-art methods.
Conclusion: The proposed surfel-based approach with SE(3) equivariant features effectively addresses limitations of traditional point cloud registration methods, providing robust performance against noise and aggressive rotations without requiring extensive training augmentations.
Abstract: Point cloud registration is crucial for ensuring 3D alignment consistency of multiple local point clouds in 3D reconstruction for remote sensing or digital heritage. While various point cloud-based registration methods exist, both non-learning and learning-based, they ignore point orientations and point uncertainties, making the model susceptible to noisy input and aggressive rotations of the input point cloud like orthogonal transformations; this necessitates extensive training point clouds with transformation augmentations. To address these issues, we propose a novel surfel-based pose learning regression approach. Our method can initialize surfels from Lidar point cloud using virtual perspective camera parameters, and learns explicit $\mathbf{SE(3)}$ equivariant features, including both position and rotation, through $\mathbf{SE(3)}$ equivariant convolutional kernels to predict relative transformation between source and target scans. The model comprises an equivariant convolutional encoder, a cross-attention mechanism for similarity computation, a fully-connected decoder, and a non-linear Huber loss. Experimental results on indoor and outdoor datasets demonstrate our model's superiority and robust performance on real point-cloud scans compared to state-of-the-art methods.
[151] Adapting Foundation Model for Dental Caries Detection with Dual-View Co-Training
Tao Luo, Han Wu, Tong Yang, Dinggang Shen, Zhiming Cui
Main category: cs.CV
TL;DR: DVCTNet is a dual-view co-training network that combines global panoramic X-ray screening with detailed tooth-level inspection for superior dental caries detection accuracy.
Details
Motivation: Current dental caries detection methods have suboptimal accuracy due to subtle contrast variations and diverse lesion morphology in panoramic X-rays. The clinical workflow where dentists combine whole-image screening with detailed tooth inspection inspired this approach.
Method: Uses automated tooth detection to create global (panoramic) and local (cropped tooth) views. Pretrains two vision foundation models separately, then integrates them with a Gated Cross-View Attention module that dynamically fuses dual-view features for final detection.
Result: Demonstrates superior performance against state-of-the-art methods on both public dataset and newly curated high-precision dataset with double verification annotations.
Conclusion: DVCTNet shows clinical applicability and provides a novel approach to dental caries detection by effectively mimicking the clinical workflow through dual-view feature integration.
Abstract: Accurate dental caries detection from panoramic X-rays plays a pivotal role in preventing lesion progression. However, current detection methods often yield suboptimal accuracy due to subtle contrast variations and diverse lesion morphology of dental caries. In this work, inspired by the clinical workflow where dentists systematically combine whole-image screening with detailed tooth-level inspection, we present DVCTNet, a novel Dual-View Co-Training network for accurate dental caries detection. Our DVCTNet starts with employing automated tooth detection to establish two complementary views: a global view from panoramic X-ray images and a local view from cropped tooth images. We then pretrain two vision foundation models separately on the two views. The global-view foundation model serves as the detection backbone, generating region proposals and global features, while the local-view model extracts detailed features from corresponding cropped tooth patches matched by the region proposals. To effectively integrate information from both views, we introduce a Gated Cross-View Attention (GCV-Atten) module that dynamically fuses dual-view features, enhancing the detection pipeline by integrating the fused features back into the detection model for final caries detection. To rigorously evaluate our DVCTNet, we test it on a public dataset and further validate its performance on a newly curated, high-precision dental caries detection dataset, annotated using both intra-oral images and panoramic X-rays for double verification. Experimental results demonstrate DVCTNet’s superior performance against existing state-of-the-art (SOTA) methods on both datasets, indicating the clinical applicability of our method. Our code and labeled dataset are available at https://github.com/ShanghaiTech-IMPACT/DVCTNet.
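The GCV-Atten module dynamically fuses the global (panoramic) and local (tooth-crop) features. A simpler gated-fusion sketch conveys the gating idea, though the actual module uses cross-view attention rather than a plain linear gate:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gated fusion of global and local view features: a learned sigmoid
    gate decides, per channel, which view to trust. Generic sketch, not
    the paper's exact GCV-Atten module."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, global_feat, local_feat):
        g = self.gate(torch.cat([global_feat, local_feat], dim=-1))
        return g * global_feat + (1.0 - g) * local_feat
```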
[152] FusionCounting: Robust visible-infrared image fusion guided by crowd counting via multi-task learning
He Li, Xinyu Liu, Weihang Kong, Xingchen Zhang
Main category: cs.CV
TL;DR: FusionCounting integrates crowd counting with visible and infrared image fusion in a unified multi-task framework, using dynamic loss weighting and adversarial training to improve both fusion quality and counting accuracy.
Details
Motivation: Existing VIF methods focus on image quality or use semantic segmentation/detection which require heavy annotations. Crowd counting provides quantitative density measures with minimal annotation, making it suitable for dense scenes where detection struggles with occlusion.
Method: Multi-task learning framework that jointly optimizes VIF and crowd counting. Uses dynamic loss weighting for task balance and adversarial training for robustness. Leverages population density information to guide fusion.
Result: Experimental results show FusionCounting improves both image fusion quality and crowd counting performance compared to existing methods on public datasets.
Conclusion: Integrating crowd counting with VIF creates a mutually beneficial framework that enhances both tasks while requiring minimal annotation, making it particularly effective for dense crowd scenes.
Abstract: Most visible and infrared image fusion (VIF) methods focus primarily on optimizing fused image quality. Recent studies have begun incorporating downstream tasks, such as semantic segmentation and object detection, to provide semantic guidance for VIF. However, semantic segmentation requires extensive annotations, while object detection, despite reducing annotation efforts compared with segmentation, faces challenges in highly crowded scenes due to overlapping bounding boxes and occlusion. Moreover, although RGB-T crowd counting has gained increasing attention in recent years, no studies have integrated VIF and crowd counting into a unified framework. To address these challenges, we propose FusionCounting, a novel multi-task learning framework that integrates crowd counting into the VIF process. Crowd counting provides a direct quantitative measure of population density with minimal annotation, making it particularly suitable for dense scenes. Our framework leverages both input images and population density information in a mutually beneficial multi-task design. To accelerate convergence and balance task contributions, we introduce a dynamic loss function weighting strategy. Furthermore, we incorporate adversarial training to enhance the robustness of both VIF and crowd counting, improving the model’s stability and resilience to adversarial attacks. Experimental results on public datasets demonstrate that FusionCounting not only enhances image fusion quality but also achieves superior crowd counting performance.
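The dynamic loss weighting balances the fusion and counting objectives during training. One common realization is homoscedastic-uncertainty weighting, sketched below; the paper's exact strategy is not specified in this summary, so treat this as an illustrative stand-in:

```python
import torch
import torch.nn as nn

class DynamicTaskWeights(nn.Module):
    """Uncertainty-based weighting of multiple task losses: each task gets a
    learned log-variance, trading off its loss against a regularizer."""
    def __init__(self, n_tasks=2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, losses):
        total = 0.0
        for i, loss in enumerate(losses):   # e.g. [fusion_loss, counting_loss]
            precision = torch.exp(-self.log_vars[i])
            total = total + precision * loss + self.log_vars[i]
        return total
```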
[153] Estimating 2D Keypoints of Surgical Tools Using Vision-Language Models with Low-Rank Adaptation
Krit Duangprom, Tryphon Lambrou, Binod Bhattarai
Main category: cs.CV
TL;DR: Novel pipeline using Vision Language Models (VLMs) fine-tuned with LoRA for 2D surgical tool keypoint estimation, outperforming traditional CNN/Transformer methods with minimal training.
Details
Motivation: Traditional CNN and Transformer-based approaches often overfit on small-scale medical datasets, requiring a more generalized approach that leverages pre-trained VLMs for better performance in low-resource scenarios.
Method: Fine-tune pre-trained Vision Language Models using a Low-Rank Adaptation (LoRA) technique with carefully designed prompts for instruction-tuning, aligning visual features with semantic keypoint descriptions.
Result: With only two epochs of fine-tuning, the adapted VLM outperforms baseline models, demonstrating effectiveness of LoRA in low-resource scenarios and improved keypoint detection performance.
Conclusion: This approach not only enhances 2D keypoint estimation for surgical tools but also provides a foundation for future work in 3D surgical hands and tools pose estimation.
Abstract: This paper presents a novel pipeline for 2D keypoint estimation of surgical tools by leveraging Vision Language Models (VLMs) fine-tuned using a low-rank adaptation (LoRA) technique. Unlike traditional Convolutional Neural Network (CNN) or Transformer-based approaches, which often suffer from overfitting in small-scale medical datasets, our method harnesses the generalization capabilities of pre-trained VLMs. We carefully design prompts to create an instruction-tuning dataset and use them to align visual features with semantic keypoint descriptions. Experimental results show that with only two epochs of fine-tuning, the adapted VLM outperforms the baseline models, demonstrating the effectiveness of LoRA in low-resource scenarios. This approach not only improves keypoint detection performance, but also paves the way for future work in 3D surgical hands and tools pose estimation.
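For readers wanting to reproduce the general recipe, LoRA fine-tuning of a pretrained VLM takes only a few lines with the Hugging Face peft library; the rank, scaling, and target modules below are illustrative guesses, not the paper's configuration:

```python
# pip install peft transformers
from peft import LoraConfig, get_peft_model

# Typical LoRA setup; values and target modules are assumptions.
lora_config = LoraConfig(
    r=8,                                   # low-rank update dimension
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_vlm, lora_config)  # base_vlm: a pretrained VLM
# model.print_trainable_parameters()             # only adapters train
```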
[154] PointDGRWKV: Generalizing RWKV-like Architecture to Unseen Domains for Point Cloud Classification
Hao Yang, Qianyu Zhou, Haijia Sun, Xiangtai Li, Xuequan Lu, Lizhuang Ma, Shuicheng Yan
Main category: cs.CV
TL;DR: PointDGRWKV is the first RWKV-based framework for Domain Generalization in Point Cloud Classification that addresses spatial distortion and attention drift issues through adaptive geometric token shift and cross-domain key feature distribution alignment.
Details
Motivation: Existing DG PCC methods using convolutional networks, Transformers or Mamba architectures suffer from limited receptive fields, high computational cost, or insufficient long-range dependency modeling. RWKV offers linear complexity and global receptive fields but faces challenges when directly applied to unstructured point clouds.
Method: Proposes PointDGRWKV with two key modules: 1) Adaptive Geometric Token Shift to model local neighborhood structures and improve geometric context awareness, and 2) Cross-Domain key feature Distribution Alignment to mitigate attention drift by aligning key feature distributions across domains.
Result: Extensive experiments on multiple benchmarks demonstrate that PointDGRWKV achieves state-of-the-art performance on DG PCC while maintaining RWKV’s linear efficiency.
Conclusion: The proposed PointDGRWKV framework successfully adapts RWKV architecture for domain generalization in point cloud classification, overcoming spatial distortion and attention drift challenges while delivering superior performance.
Abstract: Domain Generalization (DG) has been recently explored to enhance the generalizability of Point Cloud Classification (PCC) models toward unseen domains. Prior works are based on convolutional networks, Transformer or Mamba architectures, either suffering from limited receptive fields or high computational cost, or insufficient long-range dependency modeling. RWKV, as an emerging architecture, possesses superior linear complexity, global receptive fields, and long-range dependency. In this paper, we present the first work that studies the generalizability of RWKV models in DG PCC. We find that directly applying RWKV to DG PCC encounters two significant challenges: RWKV’s fixed direction token shift methods, like Q-Shift, introduce spatial distortions when applied to unstructured point clouds, weakening local geometric modeling and reducing robustness. In addition, the Bi-WKV attention in RWKV amplifies slight cross-domain differences in key distributions through exponential weighting, leading to attention shifts and degraded generalization. To this end, we propose PointDGRWKV, the first RWKV-based framework tailored for DG PCC. It introduces two key modules to enhance spatial modeling and cross-domain robustness, while maintaining RWKV’s linear efficiency. In particular, we present Adaptive Geometric Token Shift to model local neighborhood structures to improve geometric context awareness. In addition, Cross-Domain key feature Distribution Alignment is designed to mitigate attention drift by aligning key feature distributions across domains. Extensive experiments on multiple benchmarks demonstrate that PointDGRWKV achieves state-of-the-art performance on DG PCC.
[155] PathMR: Multimodal Visual Reasoning for Interpretable Pathology Diagnosis
Ye Zhang, Yu Zhou, Jingwen Qi, Yongbing Zhang, Simon Puettmann, Finn Wichmann, Larissa Pereira Ferreira, Lara Sichward, Julius Keyl, Sylvia Hartmann, Shuo Zhao, Hongxiao Wang, Xiaowei Xu, Jianxu Chen
Main category: cs.CV
TL;DR: PathMR is a cell-level multimodal visual reasoning framework that generates diagnostic explanations and cell distribution predictions for pathological images, outperforming state-of-the-art methods in text quality, segmentation accuracy, and cross-modal alignment.
Details
Motivation: Current deep learning diagnostic tools lack transparency and traceable rationale, limiting clinical adoption. There's a need for AI systems that provide both pixel-level segmentation and semantically aligned textual explanations for dependable pathology assistance.
Method: Proposed PathMR framework that takes pathological images and textual queries to generate expert-level diagnostic explanations while simultaneously predicting cell distribution patterns at the cell level.
Result: PathMR consistently outperformed state-of-the-art visual reasoning methods on both PathGen dataset and newly developed GADVR dataset in text generation quality, segmentation accuracy, and cross-modal alignment.
Conclusion: PathMR demonstrates strong potential for improving interpretability in AI-driven pathological diagnosis by providing transparent insights through localized lesion regions and expert-style diagnostic narratives.
Abstract: Deep learning based automated pathological diagnosis has markedly improved diagnostic efficiency and reduced variability between observers, yet its clinical adoption remains limited by opaque model decisions and a lack of traceable rationale. To address this, recent multimodal visual reasoning architectures provide a unified framework that generates segmentation masks at the pixel level alongside semantically aligned textual explanations. By localizing lesion regions and producing expert-style diagnostic narratives, these models deliver the transparent and interpretable insights necessary for dependable AI-assisted pathology. Building on these advancements, we propose PathMR, a cell-level Multimodal visual Reasoning framework for Pathological image analysis. Given a pathological image and a textual query, PathMR generates expert-level diagnostic explanations while simultaneously predicting cell distribution patterns. To benchmark its performance, we evaluated our approach on the publicly available PathGen dataset as well as on our newly developed GADVR dataset. Extensive experiments on these two datasets demonstrate that PathMR consistently outperforms state-of-the-art visual reasoning methods in text generation quality, segmentation accuracy, and cross-modal alignment. These results highlight the potential of PathMR for improving interpretability in AI-driven pathological diagnosis. The code will be publicly available at https://github.com/zhangye-zoe/PathMR.
[156] Deep Learning Framework for Early Detection of Pancreatic Cancer Using Multi-Modal Medical Imaging Analysis
Dennis Slobodzian, Karissa Tilbury, Amir Kordijazi
Main category: cs.CV
TL;DR: Deep learning framework using dual-modality imaging (autofluorescence + SHG) achieves over 90% accuracy for early pancreatic cancer detection, outperforming manual methods.
Details
Motivation: PDAC has low survival rates due to late detection, creating urgent need for early diagnostic methods using advanced imaging and AI.
Method: Analyzed 40 patient samples with 6 deep learning architectures (CNNs vs ViTs), using modified ResNet with frozen pre-trained layers and class-weighted training to handle limited data and class imbalance.
Result: Achieved over 90% accuracy in distinguishing normal, fibrotic, and cancerous tissue, significantly improving upon current manual analysis methods.
Conclusion: Establishes robust automated PDAC detection pipeline with clinical deployment potential, providing framework for other cancer types and insights for limited medical datasets.
Abstract: Pancreatic ductal adenocarcinoma (PDAC) remains one of the most lethal forms of cancer, with a five-year survival rate below 10% primarily due to late detection. This research develops and validates a deep learning framework for early PDAC detection through analysis of dual-modality imaging: autofluorescence and second harmonic generation (SHG). We analyzed 40 unique patient samples to create a specialized neural network capable of distinguishing between normal, fibrotic, and cancerous tissue. Our methodology evaluated six distinct deep learning architectures, comparing traditional Convolutional Neural Networks (CNNs) with modern Vision Transformers (ViTs). Through systematic experimentation, we identified and overcame significant challenges in medical image analysis, including limited dataset size and class imbalance. The final optimized framework, based on a modified ResNet architecture with frozen pre-trained layers and class-weighted training, achieved over 90% accuracy in cancer detection. This represents a significant improvement over current manual analysis methods and demonstrates potential for clinical deployment. This work establishes a robust pipeline for automated PDAC detection that can augment pathologists' capabilities while providing a foundation for future expansion to other cancer types. The developed methodology also offers valuable insights for applying deep learning to limited-size medical imaging datasets, a common challenge in clinical applications.
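A minimal sketch of the two training ingredients named above, a ResNet with frozen pre-trained layers and a class-weighted loss, assuming hypothetical class counts and the three-class setup; it is not the authors' exact pipeline.

```python
# Sketch: frozen pre-trained ResNet backbone plus class-weighted training.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze all pre-trained layers to avoid overfitting the small dataset...
for param in model.parameters():
    param.requires_grad = False

# ...then replace the classification head for the 3 tissue classes:
# normal, fibrotic, cancerous. The new head stays trainable.
model.fc = nn.Linear(model.fc.in_features, 3)

# Class weights inversely proportional to class frequency counteract the
# imbalance; these counts are hypothetical.
class_counts = torch.tensor([220.0, 140.0, 60.0])
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```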
[157] ExpertSim: Fast Particle Detector Simulation Using Mixture-of-Generative-Experts
Patryk Będkowski, Jan Dubiński, Filip Szatkowski, Kamil Deja, Przemysław Rokita, Tomasz Trzciński
Main category: cs.CV
TL;DR: ExpertSim is a deep learning approach using Mixture-of-Generative-Experts architecture to simulate Zero Degree Calorimeter responses in ALICE experiment, providing more accurate and faster simulations than traditional Monte Carlo methods.
Details
Motivation: Traditional Monte Carlo simulations for particle detector responses at CERN are computationally expensive and strain computational resources, creating a need for more efficient simulation methods.
Method: Uses a Mixture-of-Generative-Experts architecture where each expert specializes in simulating different subsets of data, allowing focused expertise on specific aspects of calorimeter response.
Result: ExpertSim improves simulation accuracy and provides significant speedup compared to traditional Monte Carlo methods, enabling more efficient detector simulations.
Conclusion: The approach offers a promising solution for high-efficiency detector simulations in particle physics experiments at CERN, with code made publicly available.
Abstract: Simulating detector responses is a crucial part of understanding the inner workings of particle collisions in the Large Hadron Collider at CERN. Such simulations are currently performed with statistical Monte Carlo methods, which are computationally expensive and put a significant strain on CERN’s computational grid. Therefore, recent proposals advocate for generative machine learning methods to enable more efficient simulations. However, the distribution of the data varies significantly across the simulations, which is hard to capture with out-of-the-box methods. In this study, we present ExpertSim - a deep learning simulation approach tailored for the Zero Degree Calorimeter in the ALICE experiment. Our method utilizes a Mixture-of-Generative-Experts architecture, where each expert specializes in simulating a different subset of the data. This allows for a more precise and efficient generation process, as each expert focuses on a specific aspect of the calorimeter response. ExpertSim not only improves accuracy, but also provides a significant speedup compared to the traditional Monte-Carlo methods, offering a promising solution for high-efficiency detector simulations in particle physics experiments at CERN. We make the code available at https://github.com/patrick-bedkowski/expertsim-mix-of-generative-experts.
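A toy sketch of the Mixture-of-Generative-Experts idea: a router assigns each conditioning vector to one expert generator, so each expert specializes in a subset of the data. Dimensions, the hard top-1 gating rule, and network sizes are illustrative assumptions, not ExpertSim's architecture.

```python
# Toy Mixture-of-Generative-Experts: a router picks one expert per sample.
import torch
import torch.nn as nn

class ExpertGenerator(nn.Module):
    def __init__(self, cond_dim=16, noise_dim=32, out_dim=44 * 44):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim + noise_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),  # flattened calorimeter response
        )

    def forward(self, cond, noise):
        return self.net(torch.cat([cond, noise], dim=-1))

class MixtureOfGenerativeExperts(nn.Module):
    def __init__(self, n_experts=4, cond_dim=16, noise_dim=32):
        super().__init__()
        self.noise_dim = noise_dim
        self.router = nn.Linear(cond_dim, n_experts)   # hard top-1 gating
        self.experts = nn.ModuleList(
            [ExpertGenerator(cond_dim, noise_dim) for _ in range(n_experts)]
        )

    def forward(self, cond):
        noise = torch.randn(cond.size(0), self.noise_dim, device=cond.device)
        choice = self.router(cond).argmax(dim=-1)      # expert index per sample
        return torch.stack([
            self.experts[c](cond[i:i + 1], noise[i:i + 1]).squeeze(0)
            for i, c in enumerate(choice.tolist())
        ])

moge = MixtureOfGenerativeExperts()
fake_showers = moge(torch.randn(8, 16))  # 8 conditioned samples
```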
[158] Understanding and evaluating computer vision models through the lens of counterfactuals
Pushkar Shukla
Main category: cs.CV
TL;DR: This thesis develops counterfactual frameworks for explaining, auditing, and mitigating bias in vision classifiers and generative models through systematic attribute variation and causal analysis.
Details
Motivation: To address interpretability and fairness in AI by using counterfactual reasoning to uncover spurious correlations, probe causal dependencies, and build more robust systems that avoid biased behaviors.
Method: Developed multiple frameworks: CAVLI (combines LIME and TCAV for concept-level analysis), ASAC (adversarial counterfactuals with curriculum learning), TIBET (scalable pipeline for prompt-sensitive bias evaluation), BiasConnect (causal graphs for intersectional biases), and InterMit (training-free mitigation via causal sensitivity scores).
Result: The methods successfully quantify concept dependencies, improve fairness and accuracy in biased models, enable causal auditing of identity-related biases in generative models, and provide scalable solutions for intersectional bias diagnosis and mitigation.
Conclusion: Counterfactuals serve as a unifying framework for interpretability, fairness, and causality across both discriminative and generative models, establishing principled methods for socially responsible bias evaluation and mitigation.
Abstract: Counterfactual reasoning – the practice of asking "what if" by varying inputs and observing changes in model behavior – has become central to interpretable and fair AI. This thesis develops frameworks that use counterfactuals to explain, audit, and mitigate bias in vision classifiers and generative models. By systematically altering semantically meaningful attributes while holding others fixed, these methods uncover spurious correlations, probe causal dependencies, and help build more robust systems. The first part addresses vision classifiers. CAVLI integrates attribution (LIME) with concept-level analysis (TCAV) to quantify how strongly decisions rely on human-interpretable concepts. With localized heatmaps and a Concept Dependency Score, CAVLI shows when models depend on irrelevant cues like backgrounds. Extending this, ASAC introduces adversarial counterfactuals that perturb protected attributes while preserving semantics. Through curriculum learning, ASAC fine-tunes biased models for improved fairness and accuracy while avoiding stereotype-laden artifacts. The second part targets generative Text-to-Image (TTI) models. TIBET provides a scalable pipeline for evaluating prompt-sensitive biases by varying identity-related terms, enabling causal auditing of how race, gender, and age affect image generation. To capture interactions, BiasConnect builds causal graphs diagnosing intersectional biases. Finally, InterMit offers a modular, training-free algorithm that mitigates intersectional bias via causal sensitivity scores and user-defined fairness goals. Together, these contributions show counterfactuals as a unifying lens for interpretability, fairness, and causality in both discriminative and generative models, establishing principled, scalable methods for socially responsible bias evaluation and mitigation.
[159] To New Beginnings: A Survey of Unified Perception in Autonomous Vehicle Software
Loïc Stratil, Felix Fent, Esteban Rivera, Markus Lienkamp
Main category: cs.CV
TL;DR: Survey paper on unified perception for autonomous vehicles that integrates detection, tracking and prediction into shared architectures, categorizing methods into Early, Late and Full Unified paradigms with comprehensive taxonomy.
Details
Motivation: Traditional modular perception pipelines suffer from error accumulation and limited inter-task synergy, while unified perception offers improved robustness, contextual reasoning and efficiency while maintaining interpretability.
Method: Comprehensive survey with holistic taxonomy categorizing methods along task integration, tracking formulation and representation flow. Systematic review of architectures, training strategies, datasets and open-source availability.
Result: Establishes first comprehensive framework for unified perception, consolidates fragmented research efforts, and provides systematic categorization into three paradigms (Early, Late, Full Unified Perception).
Conclusion: Unified perception is a promising paradigm that addresses limitations of modular pipelines. The survey provides guidance for future research toward more robust, generalizable and interpretable perception systems.
Abstract: Autonomous vehicle perception typically relies on modular pipelines that decompose the task into detection, tracking, and prediction. While interpretable, these pipelines suffer from error accumulation and limited inter-task synergy. Unified perception has emerged as a promising paradigm that integrates these sub-tasks within a shared architecture, potentially improving robustness, contextual reasoning, and efficiency while retaining interpretable outputs. In this survey, we provide a comprehensive overview of unified perception, introducing a holistic and systemic taxonomy that categorizes methods along task integration, tracking formulation, and representation flow. We define three paradigms (Early, Late, and Full Unified Perception) and systematically review existing methods, their architectures, training strategies, datasets used, and open-source availability, while highlighting future research directions. This work establishes the first comprehensive framework for understanding and advancing unified perception, consolidates fragmented efforts, and guides future research toward more robust, generalizable, and interpretable perception.
[160] Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning
Hao Tan, Jun Lan, Zichang Tan, Ajian Liu, Chuanbiao Song, Senyuan Shi, Huijia Zhu, Weiqiang Wang, Jun Wan, Zhen Lei
Main category: cs.CV
TL;DR: HydraFake dataset addresses real-world deepfake detection challenges with hierarchical generalization testing, and Veritas MLLM-based detector uses pattern-aware reasoning to achieve superior OOD performance.
Details
Motivation: Existing deepfake detection benchmarks suffer from discrepancies with industrial practice, featuring homogeneous training sources and low-quality testing images that hinder practical deployment.
Method: Introduces HydraFake dataset with diversified deepfake techniques and rigorous evaluation protocol. Proposes Veritas, a multi-modal LLM detector with pattern-aware reasoning (planning and self-reflection) and two-stage training pipeline.
Result: Previous detectors show good cross-model generalization but fail on unseen forgeries and data domains. Veritas achieves significant gains across different out-of-distribution scenarios with transparent outputs.
Conclusion: HydraFake provides realistic benchmark for deepfake detection, and Veritas demonstrates superior generalization capabilities through human-like forensic reasoning patterns.
Abstract: Deepfake detection remains a formidable challenge due to the complex and evolving nature of fake content in real-world scenarios. However, existing academic benchmarks suffer from severe discrepancies from industrial practice, typically featuring homogeneous training sources and low-quality testing images, which hinder the practical deployment of current detectors. To mitigate this gap, we introduce HydraFake, a dataset that simulates real-world challenges with hierarchical generalization testing. Specifically, HydraFake involves diversified deepfake techniques and in-the-wild forgeries, along with a rigorous training and evaluation protocol, covering unseen model architectures, emerging forgery techniques and novel data domains. Building on this resource, we propose Veritas, a multi-modal large language model (MLLM) based deepfake detector. Different from vanilla chain-of-thought (CoT), we introduce pattern-aware reasoning that involves critical reasoning patterns such as “planning” and “self-reflection” to emulate the human forensic process. We further propose a two-stage training pipeline to seamlessly internalize such deepfake reasoning capacities into current MLLMs. Experiments on the HydraFake dataset reveal that although previous detectors show great generalization in cross-model scenarios, they fall short on unseen forgeries and data domains. Our Veritas achieves significant gains across different OOD scenarios, and is capable of delivering transparent and faithful detection outputs.
[161] Classifying Mitotic Figures in the MIDOG25 Challenge with Deep Ensemble Learning and Rule Based Refinement
Sara Krauss, Ellena Spieß, Daniel Hieber, Frank Kramer, Johannes Schobel, Dominik Müller
Main category: cs.CV
TL;DR: Ensemble of ConvNeXtBase models with rule-based refinement for classifying atypical mitotic figures, achieving 84.02% balanced accuracy on MIDOG25 test set.
Details
Motivation: Differentiating atypical mitotic figures from normal ones is challenging due to time-consuming and subjective manual annotation in tumor grading.
Method: Trained ensemble of ConvNeXtBase models using AUCMEDI framework, extended with rule-based refinement module.
Result: Achieved 84.02% balanced accuracy on MIDOG25 preliminary test set. Rule-based refinement increased specificity but reduced sensitivity and overall performance.
Conclusion: Deep ensembles perform well for atypical mitotic figure classification, but rule-based refinement requires further research despite improving specific metrics.
Abstract: Mitotic figures (MFs) are relevant biomarkers in tumor grading. Differentiating atypical MFs (AMFs) from normal MFs (NMFs) remains difficult, as manual annotation is time-consuming and subjective. In this work, an ensemble of ConvNeXtBase models was trained with AUCMEDI and extended with a rule-based refinement (RBR) module. On the MIDOG25 preliminary test set, the ensemble achieved a balanced accuracy of 84.02%. While the RBR increased specificity, it reduced sensitivity and overall performance. The results show that deep ensembles perform well for AMF classification. RBR can increase specific metrics but requires further research.
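One plausible reading of the pipeline, sketched below: averaging of softmax probabilities across the ensemble, followed by a rule that demotes low-margin "atypical" calls, which would raise specificity at the cost of sensitivity as reported. The margin rule and threshold are invented for illustration.

```python
# Sketch of ensemble prediction plus a rule-based refinement step.
import torch

def ensemble_predict(models, x):
    """Average class probabilities over an ensemble of classifiers."""
    with torch.no_grad():
        probs = torch.stack([m(x).softmax(dim=-1) for m in models]).mean(dim=0)
    return probs  # shape (batch, 2): [normal, atypical]

def rule_based_refinement(probs, margin=0.15):
    """Demote uncertain 'atypical' predictions to 'normal'. Raising the
    margin raises specificity but lowers sensitivity, mirroring the
    trade-off reported above. The rule itself is a hypothetical example."""
    preds = probs.argmax(dim=-1)
    uncertain = (probs[:, 1] - probs[:, 0]).abs() < margin
    preds[uncertain & (preds == 1)] = 0
    return preds
```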
[162] COMETH: Convex Optimization for Multiview Estimation and Tracking of Humans
Enrico Martini, Ho Jin Choi, Nadia Figueroa, Nicola Bombieri
Main category: cs.CV
TL;DR: COMETH is a lightweight multi-view human pose fusion algorithm that uses convex optimization and biomechanical constraints to improve accuracy and temporal consistency for real-time industrial monitoring applications.
Details
Motivation: Address the limitations of multi-camera setups in Industry 5.0, which suffer from high computational costs and bandwidth requirements, while edge devices face accuracy degradation and temporal inconsistencies.
Method: Integrates kinematic and biomechanical constraints, employs convex optimization-based inverse kinematics for spatial fusion, and implements a state observer for temporal consistency.
Result: Outperforms state-of-the-art methods in localization, detection, and tracking accuracy on both public and industrial datasets.
Conclusion: COMETH enables accurate and scalable human motion tracking suitable for industrial and safety-critical applications, with publicly available code.
Abstract: In the era of Industry 5.0, monitoring human activity is essential for ensuring both ergonomic safety and overall well-being. While multi-camera centralized setups improve pose estimation accuracy, they often suffer from high computational costs and bandwidth requirements, limiting scalability and real-time applicability. Distributing processing across edge devices can reduce network bandwidth and computational load. On the other hand, the constrained resources of edge devices lead to accuracy degradation, and the distribution of computation leads to temporal and spatial inconsistencies. We address this challenge by proposing COMETH (Convex Optimization for Multiview Estimation and Tracking of Humans), a lightweight algorithm for real-time multi-view human pose fusion that relies on three concepts: it integrates kinematic and biomechanical constraints to increase the joint positioning accuracy; it employs convex optimization-based inverse kinematics for spatial fusion; and it implements a state observer to improve temporal consistency. We evaluate COMETH on both public and industrial datasets, where it outperforms state-of-the-art methods in localization, detection, and tracking accuracy. The proposed fusion pipeline enables accurate and scalable human motion tracking, making it well-suited for industrial and safety-critical applications. The code is publicly available at https://github.com/PARCO-LAB/COMETH.
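The paper formulates spatial fusion as a convex optimization; the sketch below only captures the flavor with a generic nonlinear least-squares solve: confidence-weighted agreement with per-camera joint estimates plus a soft bone-length penalty. The bone list, weights, and solver choice are illustrative assumptions, not COMETH's formulation.

```python
# Sketch: confidence-weighted multi-view skeleton fusion with a soft
# bone-length constraint, solved with a generic optimizer.
import numpy as np
from scipy.optimize import minimize

def fuse_skeleton(views, conf, bones, bone_len, lam=10.0):
    """views: (C, J, 3) per-camera joint estimates; conf: (C, J) confidences;
    bones: list of (i, j) joint index pairs; bone_len: (B,) target lengths."""
    C, J, _ = views.shape
    w = conf[..., None]
    x0 = ((w * views).sum(0) / (w.sum(0) + 1e-8)).ravel()  # weighted mean init

    def cost(x):
        joints = x.reshape(J, 3)
        data = (w * (joints[None] - views) ** 2).sum()     # fit all cameras
        limb = sum(
            (np.linalg.norm(joints[i] - joints[j]) - l) ** 2
            for (i, j), l in zip(bones, bone_len)
        )
        return data + lam * limb                           # soft bone lengths

    return minimize(cost, x0, method="L-BFGS-B").x.reshape(J, 3)

# Example: 4 cameras, 3 joints, one bone with a known 30 cm length.
est = fuse_skeleton(
    np.random.randn(4, 3, 3), np.ones((4, 3)),
    bones=[(0, 1)], bone_len=np.array([0.30]),
)
```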
[163] Olive Tree Satellite Image Segmentation Based On SAM and Multi-Phase Refinement
Amir Jmal, Chaima Chtourou, Mahdi Louati, Abdelaziz Kallel, Houda Khmila
Main category: cs.CV
TL;DR: Novel olive tree segmentation method using SAM with alignment and shape constraints achieves 98% accuracy, significantly improving over baseline SAM performance.
Details
Motivation: Climate change threatens olive biodiversity, requiring early anomaly detection through remote sensing for effective agricultural management.
Method: Integrates Segment Anything Model (SAM) with corrections based on tree alignment patterns and learnable constraints about tree shape and size.
Result: Achieved 98% accuracy rate in olive tree segmentation from satellite images, significantly surpassing initial SAM performance of 82%.
Conclusion: The approach successfully enhances olive tree segmentation accuracy, providing a valuable tool for biodiversity conservation and precision agriculture management.
Abstract: In the context of proven climate change, maintaining olive biodiversity through early anomaly detection and treatment using remote sensing technology is crucial, offering effective management solutions. This paper presents an innovative approach to olive tree segmentation from satellite images. By leveraging foundational models and advanced segmentation techniques, the study integrates the Segment Anything Model (SAM) to accurately identify and segment olive trees in agricultural plots. The methodology includes SAM segmentation and corrections based on tree alignment in the field and a learnable constraint on tree shape and size. Our approach achieved a 98% accuracy rate, significantly surpassing the initial SAM performance of 82%.
[164] E-ConvNeXt: A Lightweight and Efficient ConvNeXt Variant with Cross-Stage Partial Connections
Fang Wang, Huitao Li, Wenhan Chao, Zheng Zhuo, Yiran Ji, Chang Peng, Yupeng Sun
Main category: cs.CV
TL;DR: E-ConvNeXt: A lightweight ConvNeXt variant that reduces parameters by 80% while maintaining high accuracy through CSP connections, structural optimizations, and channel attention.
Details
Motivation: High-performance networks like ConvNeXt were not designed for lightweight applications, limiting their practical deployment in resource-constrained scenarios.
Method: Integrates Cross Stage Partial Connections (CSPNet) with ConvNeXt, optimizes Stem and Block structures, and replaces Layer Scale with channel attention to reduce complexity while enhancing feature expression.
Result: Achieves 78.3% Top-1 accuracy at 0.9GFLOPs (mini) and 81.9% at 3.1GFLOPs (small) on ImageNet, with strong generalization in object detection tasks.
Conclusion: E-ConvNeXt provides an excellent accuracy-efficiency balance, making high-performance ConvNeXt architecture suitable for lightweight applications without compromising performance.
Abstract: Many high-performance networks were not designed with lightweight application scenarios in mind from the outset, which has greatly restricted their scope of application. This paper takes ConvNeXt as the research object and significantly reduces the parameter scale and network complexity of ConvNeXt by integrating the Cross Stage Partial Connections mechanism and a series of optimized designs. The new network is named E-ConvNeXt, which can maintain high accuracy performance under different complexity configurations. The three core innovations of E-ConvNeXt are: (1) integrating the Cross Stage Partial Network (CSPNet) with ConvNeXt and adjusting the network structure, which reduces the model's network complexity by up to 80%; (2) optimizing the Stem and Block structures to enhance the model's feature expression capability and operational efficiency; (3) replacing Layer Scale with channel attention. Experimental validation on ImageNet classification demonstrates E-ConvNeXt's superior accuracy-efficiency balance: E-ConvNeXt-mini reaches 78.3% Top-1 accuracy at 0.9GFLOPs. E-ConvNeXt-small reaches 81.9% Top-1 accuracy at 3.1GFLOPs. Transfer learning tests on object detection tasks further confirm its generalization capability.
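A simplified sketch of a cross-stage-partial wrapper around ConvNeXt-style blocks, where half the channels bypass the heavy blocks and a 1x1 convolution fuses the two paths; block internals are abbreviated and this is not the E-ConvNeXt reference implementation.

```python
# Sketch: CSP-style stage wrapping simplified ConvNeXt blocks.
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)
        self.norm = nn.GroupNorm(1, dim)          # LayerNorm over channels
        self.pw1 = nn.Conv2d(dim, 4 * dim, 1)
        self.act = nn.GELU()
        self.pw2 = nn.Conv2d(4 * dim, dim, 1)

    def forward(self, x):
        return x + self.pw2(self.act(self.pw1(self.norm(self.dwconv(x)))))

class CSPStage(nn.Module):
    """Split channels; run the blocks on one half only, then fuse."""
    def __init__(self, dim, depth=2):
        super().__init__()
        half = dim // 2
        self.blocks = nn.Sequential(*[ConvNeXtBlock(half) for _ in range(depth)])
        self.fuse = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        a, b = x.chunk(2, dim=1)           # partial connection: b is untouched
        return self.fuse(torch.cat([self.blocks(a), b], dim=1))

stage = CSPStage(dim=96, depth=2)
y = stage(torch.randn(1, 96, 56, 56))      # roughly halves per-stage FLOPs
```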
[165] DrivingGaussian++: Towards Realistic Reconstruction and Editable Simulation for Surrounding Dynamic Driving Scenes
Yajiao Xiong, Xiaoyu Zhou, Yongtao Wan, Deqing Sun, Ming-Hsuan Yang
Main category: cs.CV
TL;DR: DrivingGaussian++ is an efficient framework for realistic reconstruction and controllable editing of dynamic autonomous driving scenes using 3D Gaussians and LiDAR priors, supporting training-free editing and automatic motion generation.
Details
Motivation: To address the challenges of realistic reconstruction and controllable editing of dynamic autonomous driving scenes with accurate positions, occlusions, and photorealistic surround-view synthesis.
Method: Uses incremental 3D Gaussians for static background and composite dynamic Gaussian graph for moving objects, integrates LiDAR prior for detailed reconstruction, and leverages LLMs for automatic motion trajectory generation.
Result: Outperforms existing methods in dynamic scene reconstruction and photorealistic surround-view synthesis, demonstrates consistent realistic editing results, and significantly enhances scene diversity.
Conclusion: DrivingGaussian++ provides an effective solution for reconstructing and editing dynamic driving scenes with high realism and controllability, supporting various editing operations without additional training.
Abstract: We present DrivingGaussian++, an efficient and effective framework for realistic reconstruction and controllable editing of surrounding dynamic autonomous driving scenes. DrivingGaussian++ models the static background using incremental 3D Gaussians and reconstructs moving objects with a composite dynamic Gaussian graph, ensuring accurate positions and occlusions. By integrating a LiDAR prior, it achieves detailed and consistent scene reconstruction, outperforming existing methods in dynamic scene reconstruction and photorealistic surround-view synthesis. DrivingGaussian++ supports training-free controllable editing for dynamic driving scenes, including texture modification, weather simulation, and object manipulation, leveraging multi-view images and depth priors. By integrating large language models (LLMs) and controllable editing, our method can automatically generate dynamic object motion trajectories and enhance their realism during the optimization process. DrivingGaussian++ demonstrates consistent and realistic editing results and generates dynamic multi-view driving scenarios, while significantly enhancing scene diversity. More results and code can be found at the project site: https://xiong-creator.github.io/DrivingGaussian_plus.github.io
[166] Webly-Supervised Image Manipulation Localization via Category-Aware Auto-Annotation
Chenfan Qu, Yiwu Zhong, Bin Li, Lianwen Jin
Main category: cs.CV
TL;DR: Novel methods to address image manipulation localization data scarcity using web data, including automatic pixel-level annotation (CAAAv2), quality filtering (QES), and a large-scale dataset MIMLv2 with 246K images, achieving 31% performance gain over previous methods.
Details
Motivation: Accurate localization of manipulated image regions is challenging due to high data acquisition costs and lack of high-quality annotated datasets, posing risks to social security.
Method: Leverages web data with CAAAv2 for automatic pixel-level annotation, QES metric for quality filtering, Object Jitter for artifact generation, and Web-IML model for web-scale supervision.
Result: Created MIMLv2 dataset (246K images, 120x larger than IMD20), Web-IML achieved 31% performance gain and surpassed previous SOTA TruFor by 24.1 average IoU points.
Conclusion: The approach effectively mitigates data scarcity, significantly improves manipulation localization performance, and provides valuable resources for the research community.
Abstract: Images manipulated using image editing tools can mislead viewers and pose significant risks to social security. However, accurately localizing the manipulated regions within an image remains a challenging problem. One of the main barriers in this area is the high cost of data acquisition and the severe lack of high-quality annotated datasets. To address this challenge, we introduce novel methods that mitigate data scarcity by leveraging readily available web data. We utilize a large collection of manually forged images from the web, as well as automatically generated annotations derived from a simpler auxiliary task, constrained image manipulation localization. Specifically, we introduce a new paradigm, CAAAv2, which automatically and accurately annotates manipulated regions at the pixel level. To further improve annotation quality, we propose a novel metric, QES, which filters out unreliable annotations. Through CAAAv2 and QES, we construct MIMLv2, a large-scale, diverse, and high-quality dataset containing 246,212 manually forged images with pixel-level mask annotations. This is over 120x larger than existing handcrafted datasets like IMD20. Additionally, we introduce Object Jitter, a technique that further enhances model training by generating high-quality manipulation artifacts. Building on these advances, we develop a new model, Web-IML, designed to effectively leverage web-scale supervision for the image manipulation localization task. Extensive experiments demonstrate that our approach substantially alleviates the data scarcity problem and significantly improves the performance of various models on multiple real-world forgery benchmarks. With the proposed web supervision, Web-IML achieves a striking performance gain of 31% and surpasses previous SOTA TruFor by 24.1 average IoU points. The dataset and code will be made publicly available at https://github.com/qcf-568/MIML.
[167] POSE: Phased One-Step Adversarial Equilibrium for Video Diffusion Models
Jiaxiang Cheng, Bing Ma, Xuhua Ren, Hongyi Jin, Kai Yu, Peng Zhang, Wenyue Li, Yuan Zhou, Tianxiang Zheng, Qinglin Lu
Main category: cs.CV
TL;DR: POSE is a novel distillation framework that enables single-step video generation from large-scale diffusion models, achieving 100x speedup while maintaining video quality through phased adversarial training and equilibrium optimization.
Details
Motivation: Existing video acceleration methods fail to model temporal coherence and cannot provide single-step distillation for large-scale video models, creating bottlenecks in sampling efficiency for long video sequences.
Method: Two-phase distillation: (1) stability priming, a warm-up mechanism to stabilize adversarial distillation across SNR regimes, and (2) unified adversarial equilibrium, a self-adversarial distillation that reaches a Nash equilibrium in Gaussian noise space; a third component, (3) conditional adversarial consistency, improves semantic and frame consistency for conditional generation.
Result: Outperforms other acceleration methods by average 7.15% on VBench-I2V metrics, reduces latency from 1000 seconds to 10 seconds (100x speedup) while maintaining competitive performance.
Conclusion: POSE successfully bridges the gap in video diffusion acceleration by enabling single-step generation with preserved temporal coherence and quality, making large-scale video generation significantly more efficient.
Abstract: The field of video diffusion generation faces critical bottlenecks in sampling efficiency, especially for large-scale models and long sequences. Existing video acceleration methods adopt image-based techniques but suffer from fundamental limitations: they neither model the temporal coherence of video frames nor provide single-step distillation for large-scale video models. To bridge this gap, we propose POSE (Phased One-Step Equilibrium), a distillation framework that reduces the sampling steps of large-scale video diffusion models, enabling the generation of high-quality videos in a single step. POSE employs a carefully designed two-phase process to distill video models: (i) stability priming: a warm-up mechanism to stabilize adversarial distillation that adapts the high-quality trajectory of the one-step generator from high to low signal-to-noise ratio regimes, optimizing the video quality of single-step mappings near the endpoints of flow trajectories. (ii) unified adversarial equilibrium: a flexible self-adversarial distillation mechanism that promotes stable single-step adversarial training towards a Nash equilibrium within the Gaussian noise space, generating realistic single-step videos close to real videos. For conditional video generation, we propose (iii) conditional adversarial consistency, a method to improve both semantic consistency and frame consistency between conditional frames and generated frames. Comprehensive experiments demonstrate that POSE outperforms other acceleration methods on VBench-I2V by an average of 7.15% in semantic alignment, temporal coherence, and frame quality, reducing the latency of the pre-trained model by 100x, from 1000 seconds to 10 seconds, while maintaining competitive performance.
[168] Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets
Dale Decatur, Thibault Groueix, Wang Yifan, Rana Hanocka, Vladimir Kim, Matheus Gadelha
Main category: cs.CV
TL;DR: Training-free method that clusters semantically similar prompts and shares computation in early diffusion steps to reduce redundancy and improve efficiency in text-to-image generation.
Details
Motivation: Text-to-image diffusion models are computationally expensive, and prior work focused on per-inference optimization rather than reducing redundancy across correlated prompts.
Method: Leverages coarse-to-fine nature of diffusion models by clustering prompts based on semantic similarity and sharing computation in early denoising steps that capture shared structures. Uses UnClip's text-to-image prior to enhance diffusion step allocation.
Result: Significantly reduces compute cost while improving image quality for models trained with image embeddings. Method integrates with existing pipelines and scales with prompt sets.
Conclusion: Provides an efficient approach that reduces environmental and financial burden of large-scale text-to-image generation by exploiting redundancy across similar prompts.
Abstract: Text-to-image diffusion models enable high-quality image generation but are computationally expensive. While prior work optimizes per-inference efficiency, we explore an orthogonal approach: reducing redundancy across correlated prompts. Our method leverages the coarse-to-fine nature of diffusion models, where early denoising steps capture shared structures among similar prompts. We propose a training-free approach that clusters prompts based on semantic similarity and shares computation in early diffusion steps. Experiments show that for models trained conditioned on image embeddings, our approach significantly reduces compute cost while improving image quality. By leveraging UnClip’s text-to-image prior, we enhance diffusion step allocation for greater efficiency. Our method seamlessly integrates with existing pipelines, scales with prompt sets, and reduces the environmental and financial burden of large-scale text-to-image generation. Project page: https://ddecatur.github.io/hierarchical-diffusion/
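The scheduling idea can be sketched in a few lines: cluster prompt embeddings, run the first denoising steps once per cluster, and branch per prompt afterwards. The embeddings below are random stand-ins for a real text encoder, and the step split is an arbitrary example, not the paper's allocation rule.

```python
# Sketch: cluster prompts, share the early denoising steps within a cluster.
import numpy as np
from sklearn.cluster import KMeans

prompts = ["a red sports car", "a crimson race car",
           "a bowl of ramen", "a plate of noodle soup"]
emb = np.random.randn(len(prompts), 512)   # placeholder for text embeddings

k = 2
labels = KMeans(n_clusters=k, n_init=10).fit_predict(emb)

total_steps, shared_steps = 50, 20
naive_cost = len(prompts) * total_steps
shared_cost = k * shared_steps + len(prompts) * (total_steps - shared_steps)
print(f"denoising steps: {naive_cost} -> {shared_cost}")
# The shared prefix yields one partially denoised latent per cluster, which
# each prompt in the cluster then refines with its own conditioning.
```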
[169] Mitosis detection in domain shift scenarios: a Mamba-based approach
Gennaro Percannella, Mattia Sarno, Francesco Tortorella, Mario Vento
Main category: cs.CV
TL;DR: A Mamba-based VM-UNet approach with stain augmentation for mitosis detection under domain shift, submitted to MIDOG challenge with preliminary results showing room for improvement.
Details
Motivation: Mitosis detection is crucial for tumor assessment but ML algorithms suffer performance drops when tested on domains different from training data. Domain shift is a significant challenge in medical imaging.
Method: Proposes a Mamba-based VM-UNet architecture with stain augmentation operations to improve model robustness against domain shift in mitosis detection.
Result: Preliminary experiments on MIDOG++ dataset show large room for improvement for the proposed method. The approach has been submitted to MIDOG challenge track 1.
Conclusion: The Mamba-based approach shows potential for mitosis detection under domain shift but requires further improvement as indicated by preliminary results on the MIDOG++ benchmark dataset.
Abstract: Mitosis detection in histopathology images plays a key role in tumor assessment. Although machine learning algorithms could be exploited for aiding physicians in accurately performing such a task, these algorithms suffer from a significant performance drop when evaluated on images coming from domains that differ from the training ones. In this work, we propose a Mamba-based approach for mitosis detection under domain shift, inspired by the promising performance demonstrated by Mamba in medical imaging segmentation tasks. Specifically, our approach exploits a VM-UNet architecture for carrying out the addressed task, as well as stain augmentation operations for further improving model robustness against domain shift. Our approach has been submitted to track 1 of the MItosis DOmain Generalization (MIDOG) challenge. Preliminary experiments, conducted on the MIDOG++ dataset, show large room for improvement for the proposed method.
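A common form of stain augmentation, sketched under the assumption that the authors use something similar: deconvolve an H&E image into HED channels with scikit-image, jitter the channels, and reconvolve. The jitter ranges are illustrative, not the paper's settings.

```python
# Sketch: HED-space stain augmentation for histopathology images.
import numpy as np
from skimage.color import rgb2hed, hed2rgb

def stain_augment(img, sigma=0.05, bias=0.02, rng=None):
    """img: float RGB array in [0, 1], shape (H, W, 3)."""
    rng = rng or np.random.default_rng()
    hed = rgb2hed(img)                                   # stain deconvolution
    alpha = rng.uniform(1 - sigma, 1 + sigma, size=3)    # per-channel scale
    beta = rng.uniform(-bias, bias, size=3)              # per-channel shift
    return np.clip(hed2rgb(hed * alpha + beta), 0, 1)    # back to RGB

augmented = stain_augment(np.random.rand(256, 256, 3))
```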
[170] A multi-task neural network for atypical mitosis recognition under domain shift
Gennaro Percannella, Mattia Sarno, Francesco Tortorella, Mario Vento
Main category: cs.CV
TL;DR: Multi-task learning approach for domain generalization in atypical mitosis detection, using auxiliary tasks to help models focus on classification objects while ignoring domain-varying backgrounds.
Details
Motivation: Machine learning models for recognizing atypical mitotic figures suffer significant performance drops under domain shift, which affects accurate tumor aggressiveness assessment.
Method: Multi-task learning approach that exploits auxiliary tasks correlated to the main classification task to help the model focus only on the classification object and ignore domain-varying backgrounds.
Result: Promising performance in preliminary evaluation on three distinct datasets: MIDOG 2025 Atypical Training Set, Ami-Br dataset, and preliminary test set of MIDOG25 challenge.
Conclusion: The proposed multi-task learning approach shows potential for addressing domain shift problems in atypical mitosis detection, helping models maintain performance across different domains.
Abstract: Recognizing atypical mitotic figures in histopathology images allows physicians to correctly assess tumor aggressiveness. Although machine learning models could be exploited for automatically performing such a task, under domain shift these models suffer from significant performance drops. In this work, an approach based on multi-task learning is proposed for addressing this problem. By exploiting auxiliary tasks correlated to the main classification task, the proposed approach, submitted to track 2 of the MItosis DOmain Generalization (MIDOG) challenge, aims to help the model focus only on the object to classify, ignoring the domain-varying background of the image. The proposed approach shows promising performance in a preliminary evaluation conducted on three distinct datasets, i.e., the MIDOG 2025 Atypical Training Set, the Ami-Br dataset, as well as the preliminary test set of the MIDOG25 challenge.
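A minimal sketch of the multi-task setup: a shared backbone with the main atypical-vs-normal head plus an auxiliary head. The auxiliary task shown here (a coarse foreground mask) is a hypothetical example; the abstract does not specify which auxiliary tasks are used.

```python
# Sketch: shared backbone, main classification head, auxiliary head.
import torch
import torch.nn as nn
from torchvision import models

class MultiTaskMitosisNet(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.cls_head = nn.Linear(512, 2)       # main task: atypical vs normal
        self.aux_head = nn.Conv2d(512, 1, 1)    # hypothetical auxiliary: mask

    def forward(self, x):
        f = self.features(x)
        logits = self.cls_head(self.pool(f).flatten(1))
        mask = self.aux_head(f)                 # (B, 1, H/32, W/32)
        return logits, mask

net = MultiTaskMitosisNet()
logits, mask = net(torch.randn(2, 3, 224, 224))
# total loss = ce(logits, y) + lambda_aux * bce(mask, coarse_target)
```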
[171] FW-GAN: Frequency-Driven Handwriting Synthesis with Wave-Modulated MLP Generator
Huynh Tong Dang Khoa, Dang Hoai Nam, Vo Nguyen Le Duy
Main category: cs.CV
TL;DR: FW-GAN is a one-shot handwriting synthesis framework that generates realistic, style-consistent text from a single example using frequency-aware components and Wave-MLP architecture.
Details
Motivation: Handwriting data scarcity limits recognition systems, and current synthesis methods struggle with long-range dependencies and ignore frequency information crucial for capturing fine-grained stylistic details.
Method: Proposes FW-GAN with phase-aware Wave-MLP generator, frequency-guided discriminator, and novel Frequency Distribution Loss to align frequency characteristics between synthetic and real handwriting.
Result: Experiments on Vietnamese and English datasets show FW-GAN generates high-quality, style-consistent handwriting that effectively augments low-resource handwriting recognition pipelines.
Conclusion: FW-GAN successfully addresses limitations of current methods by incorporating frequency information and advanced architecture, providing valuable synthetic data for handwriting recognition systems.
Abstract: Labeled handwriting data is often scarce, limiting the effectiveness of recognition systems that require diverse, style-consistent training samples. Handwriting synthesis offers a promising solution by generating artificial data to augment training. However, current methods face two major limitations. First, most are built on conventional convolutional architectures, which struggle to model long-range dependencies and complex stroke patterns. Second, they largely ignore the crucial role of frequency information, which is essential for capturing fine-grained stylistic and structural details in handwriting. To address these challenges, we propose FW-GAN, a one-shot handwriting synthesis framework that generates realistic, writer-consistent text from a single example. Our generator integrates a phase-aware Wave-MLP to better capture spatial relationships while preserving subtle stylistic cues. We further introduce a frequency-guided discriminator that leverages high-frequency components to enhance the authenticity detection of generated samples. Additionally, we introduce a novel Frequency Distribution Loss that aligns the frequency characteristics of synthetic and real handwriting, thereby enhancing visual fidelity. Experiments on Vietnamese and English handwriting datasets demonstrate that FW-GAN generates high-quality, style-consistent handwriting, making it a valuable tool for augmenting data in low-resource handwriting recognition (HTR) pipelines. Official implementation is available at https://github.com/DAIR-Group/FW-GAN
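The general mechanism of a frequency-alignment loss can be sketched as a distance between log-magnitude FFT spectra of generated and real images; the paper's exact Frequency Distribution Loss may differ in form.

```python
# Sketch: align frequency content of synthetic and real handwriting images.
import torch

def frequency_distribution_loss(fake, real):
    """fake, real: image batches of shape (B, C, H, W)."""
    f_fake = torch.fft.fft2(fake, norm="ortho")
    f_real = torch.fft.fft2(real, norm="ortho")
    mag_fake = torch.log1p(f_fake.abs())   # log compresses the large DC term
    mag_real = torch.log1p(f_real.abs())
    return torch.mean((mag_fake - mag_real) ** 2)

loss = frequency_distribution_loss(torch.rand(4, 1, 64, 256),
                                   torch.rand(4, 1, 64, 256))
```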
[172] MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs
Junpeng Ma, Qizhe Zhang, Ming Lu, Zhibin Wang, Qiang Zhou, Jun Song, Shanghang Zhang
Main category: cs.CV
TL;DR: MMG-Vid is a training-free visual token pruning framework that reduces computational overhead in Video LLMs by maximizing marginal gains at segment and token levels, maintaining 99.5% performance while reducing 75% tokens and accelerating processing by 3.9x.
Details
Motivation: Current VLLMs face computational challenges due to excessive visual tokens, and existing pruning methods ignore dynamic characteristics and temporal dependencies in videos.
Method: Divides video into segments based on frame similarity, dynamically allocates token budget per segment, and uses temporal-guided DPC algorithm to model inter-frame uniqueness and intra-frame diversity for token pruning.
Result: Maintains over 99.5% of original performance while reducing 75% visual tokens and accelerating prefilling stage by 3.9x on LLaVA-OneVision-7B.
Conclusion: MMG-Vid effectively maximizes limited token budget utilization, significantly improving efficiency while preserving strong video understanding performance without requiring additional training.
Abstract: Video Large Language Models (VLLMs) excel in video understanding, but their excessive visual tokens pose a significant computational challenge for real-world applications. Current methods aim to enhance inference efficiency by visual token pruning. However, they do not consider the dynamic characteristics and temporal dependencies of video frames, as they perceive video understanding as a multi-frame task. To address these challenges, we propose MMG-Vid, a novel training-free visual token pruning framework that removes redundancy by Maximizing Marginal Gains at both segment-level and token-level. Specifically, we first divide the video into segments based on frame similarity, and then dynamically allocate the token budget for each segment to maximize the marginal gain of each segment. Subsequently, we propose a temporal-guided DPC algorithm that jointly models inter-frame uniqueness and intra-frame diversity, thereby maximizing the marginal gain of each token. By combining both stages, MMG-Vid can maximize the utilization of the limited token budget, significantly improving efficiency while maintaining strong performance. Extensive experiments demonstrate that MMG-Vid can maintain over 99.5% of the original performance, while effectively reducing 75% visual tokens and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B. Code will be released soon.
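A sketch of the segment-level step under simplifying assumptions: split the video where consecutive frame features fall below a cosine-similarity threshold, then allocate the token budget per segment. Proportional-to-length allocation is used here for brevity, rather than the paper's marginal-gain criterion.

```python
# Sketch: similarity-based segmentation and per-segment token budgets.
import torch
import torch.nn.functional as F

def split_segments(frame_feats, thresh=0.85):
    """frame_feats: (T, D) pooled per-frame features."""
    sims = F.cosine_similarity(frame_feats[:-1], frame_feats[1:], dim=-1)
    cuts = (sims < thresh).nonzero(as_tuple=True)[0] + 1   # segment starts
    bounds = [0, *cuts.tolist(), frame_feats.size(0)]
    return list(zip(bounds[:-1], bounds[1:]))

def allocate_budget(segments, total_budget):
    n = sum(e - s for s, e in segments)
    return [max(1, round(total_budget * (e - s) / n)) for s, e in segments]

segs = split_segments(torch.randn(32, 768))
budgets = allocate_budget(segs, total_budget=2048)
```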
[173] CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification
Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie
Main category: cs.CV
TL;DR: CogVLA is an efficient Vision-Language-Action framework that uses instruction-driven routing and sparsification to reduce computational overhead while improving performance, achieving state-of-the-art results with significantly reduced training and inference costs.
Details
Motivation: Current VLA models built on pre-trained VLMs require extensive post-training with high computational overhead, limiting scalability and deployment. The authors aim to create a more efficient framework inspired by human multimodal coordination.
Method: 3-stage progressive architecture: 1) EFA-Routing injects instruction info into vision encoder to selectively aggregate visual tokens, 2) LFP-Routing prunes instruction-irrelevant tokens for token-level sparsity, 3) V-L-A Coupled Attention combines vision-language attention with action parallel decoding.
Result: Achieves state-of-the-art performance with 97.4% success rate on LIBERO benchmark and 70.0% on real-world robotic tasks, while reducing training costs by 2.5x and inference latency by 2.8x compared to OpenVLA.
Conclusion: CogVLA demonstrates that cognition-aligned design with instruction-driven routing and sparsification can significantly improve both efficiency and performance in VLA models, making them more scalable and deployable.
Abstract: Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.
[174] Multi-View 3D Point Tracking
Frano Rajič, Haofei Xu, Marko Mihajlovic, Siyuan Li, Irem Demir, Emircan Gündoğdu, Lei Ke, Sergey Prokudin, Marc Pollefeys, Siyu Tang
Main category: cs.CV
TL;DR: First data-driven multi-view 3D point tracker that uses multiple camera views to track arbitrary points in dynamic scenes, overcoming limitations of monocular trackers and requiring fewer cameras than previous multi-camera methods.
Details
Motivation: Existing monocular trackers struggle with depth ambiguities and occlusion, while prior multi-camera methods require over 20 cameras and tedious per-sequence optimization. There's a need for a practical solution that works with fewer cameras and enables robust online tracking.
Method: Feed-forward model that directly predicts 3D correspondences using 4+ cameras. Fuses multi-view features into unified point cloud and applies k-nearest-neighbors correlation with transformer-based update for reliable long-range 3D correspondence estimation, even under occlusion. Trained on 5K synthetic multi-view Kubric sequences.
Result: Achieved median trajectory errors of 3.1 cm on Panoptic Studio and 2.0 cm on DexYCB benchmarks. Generalizes well to diverse camera setups (1-8 views) with varying vantage points and video lengths (24-150 frames).
Conclusion: Sets a new standard for multi-view 3D tracking research with practical real-world applications. The method is released alongside training and evaluation datasets to advance the field.
Abstract: We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Unlike existing monocular trackers, which struggle with depth ambiguities and occlusion, or prior multi-camera methods that require over 20 cameras and tedious per-sequence optimization, our feed-forward model directly predicts 3D correspondences using a practical number of cameras (e.g., four), enabling robust and accurate online tracking. Given known camera poses and either sensor-based or estimated multi-view depth, our tracker fuses multi-view features into a unified point cloud and applies k-nearest-neighbors correlation alongside a transformer-based update to reliably estimate long-range 3D correspondences, even under occlusion. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks: Panoptic Studio and DexYCB, achieving median trajectory errors of 3.1 cm and 2.0 cm, respectively. Our method generalizes well to diverse camera setups of 1-8 views with varying vantage points and video lengths of 24-150 frames. By releasing our tracker alongside training and evaluation datasets, we aim to set a new standard for multi-view 3D tracking research and provide a practical tool for real-world applications. Project page available at https://ethz-vlg.github.io/mvtracker.
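The fusion step, reduced to its geometric core: unproject each view's depth map into a shared world frame using known intrinsics and camera poses, then concatenate the per-view clouds. The identity intrinsics and poses below are toy values for illustration only.

```python
# Sketch: fuse per-view depth maps into one world-frame point cloud.
import numpy as np

def unproject(depth, K, cam2world):
    """depth: (H, W); K: (3, 3); cam2world: (4, 4). Returns (H*W, 3) points."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T                  # camera-frame directions
    pts_cam = rays * depth.reshape(-1, 1)            # scale rays by depth
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    return (pts_h @ cam2world.T)[:, :3]              # to world frame

cloud = np.concatenate([
    unproject(np.random.rand(48, 64) + 1.0, np.eye(3), np.eye(4))
    for _ in range(4)                                # e.g., four cameras
])
```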
[175] OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning
Yuan Gong, Xionghui Wang, Jie Wu, Shiyin Wang, Yitong Wang, Xinglong Wu
Main category: cs.CV
TL;DR: OneReward is a unified RL framework that uses a single vision-language model as a reward model to enhance multi-task generative capabilities across different evaluation criteria, eliminating the need for task-specific supervised fine-tuning.
Details
Motivation: Existing methods rely on task-specific supervised fine-tuning which limits generalization and training efficiency for multi-task generation models with varied data distributions and evaluation metrics.
Method: Uses a single vision-language model as a generative reward model that can distinguish winners/losers for given tasks and evaluation criteria. Applied to mask-guided image generation tasks through multi-task reinforcement learning directly on pre-trained base models.
Result: The unified edit model (Seedream 3.0 Fill) consistently outperforms both commercial and open-source competitors (Ideogram, Adobe Photoshop, FLUX Fill [Pro]) across multiple evaluation dimensions.
Conclusion: OneReward provides an effective unified framework for multi-task reinforcement learning that eliminates task-specific SFT requirements while achieving superior performance across diverse generation tasks and evaluation criteria.
Abstract: In this paper, we introduce OneReward, a unified reinforcement learning framework that enhances the model's generative capabilities across multiple tasks under different evaluation criteria using only one reward model. By employing a single vision-language model (VLM) as the generative reward model, which can distinguish the winner and loser for a given task and a given evaluation criterion, it can be effectively applied to multi-task generation models, particularly in contexts with varied data and diverse task objectives. We utilize OneReward for mask-guided image generation, which can be further divided into several sub-tasks such as image fill, image extend, object removal, and text rendering, involving a binary mask as the edit area. Although these domain-specific tasks share the same conditioning paradigm, they differ significantly in underlying data distributions and evaluation metrics. Existing methods often rely on task-specific supervised fine-tuning (SFT), which limits generalization and training efficiency. Building on OneReward, we develop Seedream 3.0 Fill, a mask-guided generation model trained via multi-task reinforcement learning directly on a pre-trained base model, eliminating the need for task-specific SFT. Experimental results demonstrate that our unified edit model consistently outperforms both commercial and open-source competitors, such as Ideogram, Adobe Photoshop, and FLUX Fill [Pro], across multiple evaluation dimensions. Code and model are available at: https://one-reward.github.io
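The winner/loser training signal behind such a reward model is typically a pairwise Bradley-Terry style objective; the sketch below uses a linear scorer as a stand-in for the VLM judge, which is an assumption for illustration only.

```python
# Sketch: pairwise preference loss for a winner/loser reward model.
import torch
import torch.nn.functional as F

def preference_loss(score_winner, score_loser):
    """Encourage the reward model to score the preferred sample higher."""
    return -F.logsigmoid(score_winner - score_loser).mean()

reward_net = torch.nn.Linear(512, 1)      # placeholder for the VLM scorer
feats_w, feats_l = torch.randn(8, 512), torch.randn(8, 512)
loss = preference_loss(reward_net(feats_w), reward_net(feats_l))
loss.backward()
```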
[176] Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models
Hao Cheng, Erjia Xiao, Yichi Wang, Lingfeng Zhang, Qiang Zhang, Jiahang Cao, Kaidi Xu, Mengshu Sun, Xiaoshuai Hao, Jindong Gu, Renjing Xu
Main category: cs.CV
TL;DR: This paper investigates security risks from Typographic Visual Prompt Injection (TVPI) in cross-vision models, creating a dataset to evaluate how visual prompts with typographic words disrupt LVLMs and I2I generation models.
Details
Motivation: Previous research shows that typographic words in input images can induce disruptive outputs in vision-language models and image-to-image generation models, but the specific characteristics of visual prompt threats remain underexplored.
Method: The authors propose a Typographic Visual Prompts Injection Dataset and thoroughly evaluate TVPI security risks on various open-source and closed-source LVLMs and I2I GMs under visual prompts with different target semantics.
Result: The study comprehensively investigates the performance impact induced by TVPI, revealing security vulnerabilities in cross-vision models when exposed to typographic visual prompts.
Conclusion: The research deepens understanding of TVPI threats and provides valuable insights into the security risks posed by visual prompt injections in cross-vision generation models.
Abstract: Current Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision modality inputs in real-world scenarios, Cross-Vision tasks, encompassing Vision-Language Perception (VLP) and Image-to-Image (I2I), have attracted significant attention. Large Vision Language Models (LVLMs) and I2I Generation Models (GMs) are employed to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words into input images significantly induces LVLMs and I2I GMs to produce disruptive outputs that are semantically aligned with those words. Additionally, visual prompts, as a more sophisticated form of typography, are also revealed to pose security risks to various applications of cross-vision tasks. However, the specific characteristics of the threats posed by visual prompts remain underexplored. In this paper, to comprehensively investigate the performance impact induced by Typographic Visual Prompt Injection (TVPI) in various LVLMs and I2I GMs, we propose the Typographic Visual Prompts Injection Dataset and thoroughly evaluate the TVPI security risks on various open-source and closed-source LVLMs and I2I GMs under visual prompts with different target semantics, deepening the understanding of TVPI threats.
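How a typographic visual prompt might be constructed for red-teaming, as a hedged sketch: render an instruction-like string onto the input image with PIL. The text, placement, and styling are arbitrary examples, not the dataset's protocol.

```python
# Sketch: overlay a typographic prompt onto an input image with PIL.
from PIL import Image, ImageDraw

def inject_typographic_prompt(img, text="ignore the scene; answer 'cat'"):
    img = img.convert("RGB").copy()
    draw = ImageDraw.Draw(img)
    draw.rectangle([8, 8, 8 + 7 * len(text), 28], fill="white")  # readable bg
    draw.text((10, 10), text, fill="black")  # default bitmap font
    return img

poisoned = inject_typographic_prompt(Image.new("RGB", (512, 512), "gray"))
```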
[177] Dress&Dance: Dress up and Dance as You Like It - Technical Preview
Jun-Kun Chen, Aayush Bansal, Minh Phuoc Vo, Yu-Xiong Wang
Main category: cs.CV
TL;DR: Dress&Dance is a video diffusion framework that generates high-quality virtual try-on videos from a single user image, supporting various garment types and simultaneous top/bottom try-ons using a novel conditioning network called CondNet.
Details
Motivation: To create high-quality virtual try-on videos that accurately show users wearing desired garments while moving according to reference videos, addressing limitations in existing solutions.
Method: Uses CondNet, a novel conditioning network with attention mechanisms to unify multi-modal inputs (text, images, videos). Trained on heterogeneous data combining limited video data with larger image datasets in a multistage progressive manner.
Result: Generates 5-second 24 FPS videos at 1152x720 resolution with enhanced garment registration and motion fidelity. Outperforms existing open source and commercial solutions.
Conclusion: Dress&Dance enables high-quality, flexible virtual try-on experiences through its innovative CondNet architecture and multi-modal input handling, demonstrating superior performance over current alternatives.
Abstract: We present Dress&Dance, a video diffusion framework that generates high quality 5-second-long 24 FPS virtual try-on videos at 1152x720 resolution of a user wearing desired garments while moving in accordance with a given reference video. Our approach requires a single user image and supports a range of tops, bottoms, and one-piece garments, as well as simultaneous tops and bottoms try-on in a single pass. Key to our framework is CondNet, a novel conditioning network that leverages attention to unify multi-modal inputs (text, images, and videos), thereby enhancing garment registration and motion fidelity. CondNet is trained on heterogeneous training data, combining limited video data and a larger, more readily available image dataset, in a multistage progressive manner. Dress&Dance outperforms existing open source and commercial solutions and enables a high quality and flexible try-on experience.
[178] Relative Drawing Identification Complexity is Invariant to Modality in Vision-Language Models
Diogo Freitas, Brigt Håvardstun, Cèsar Ferri, Darío Garigliotti, Jan Arne Telle, José Hernández-Orallo
Main category: cs.CV
TL;DR: Teaching vision-language models using machine teaching theory shows image representations require fewer examples than coordinate-based representations, but concept simplicity rankings are consistent across modalities.
Details
Motivation: To investigate whether multimodal language models truly integrate different modalities into common representations by testing if visual and coordinate-based representations of the same concepts map to similar latent spaces.
Method: Used machine teaching theory to evaluate teaching complexity of Quick, Draw! objects using two presentations: raw bitmap images and trace coordinates in TikZ format, controlling for concept priors.
Result: Image-based representations generally require fewer teaching segments and achieve higher accuracy than coordinate-based representations, but teaching size ranks concepts similarly across both modalities.
Conclusion: The simplicity of concepts appears to be an inherent property that transcends modality representations, suggesting consistent underlying concept structure despite different input formats.
Abstract: Large language models have become multimodal, and many of them are said to integrate their modalities using common representations. If this were true, a drawing of a car as an image, for instance, should map to a similar area in the latent space as a textual description of the strokes that form the drawing. To explore this in a black-box access regime to these models, we propose the use of machine teaching, a theory that studies the minimal set of examples a teacher needs to choose so that the learner captures the concept. In this paper, we evaluate the complexity of teaching vision-language models a subset of objects in the Quick, Draw! dataset using two presentations: raw images as bitmaps and trace coordinates in TikZ format. The results indicate that image-based representations generally require fewer segments and achieve higher accuracy than coordinate-based representations. But, surprisingly, the teaching size usually ranks concepts similarly across both modalities, even when controlling for (a human proxy of) concept priors, suggesting that the simplicity of concepts may be an inherent property that transcends modality representations.
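The cross-modality ranking claim can be checked with an ordinary rank correlation; the sketch below compares per-concept teaching sizes under the two presentations (the concepts and numbers are made up for illustration).

```python
from scipy.stats import spearmanr

# Hypothetical teaching sizes (number of examples needed) per concept,
# measured once with bitmap images and once with TikZ trace coordinates.
concepts = ["car", "cat", "house", "tree", "bicycle"]
teaching_size_image = [3, 5, 2, 4, 6]
teaching_size_coord = [5, 8, 3, 6, 9]

rho, pval = spearmanr(teaching_size_image, teaching_size_coord)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
# A high rho means concepts rank similarly in difficulty across the two
# modalities, which is the paper's invariance finding.
```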
[179] First-Place Solution to NeurIPS 2024 Invisible Watermark Removal Challenge
Fahad Shamshad, Tameem Bakr, Yahia Shaaban, Noor Hussein, Karthik Nandakumar, Nils Lukas
Main category: cs.CV
TL;DR: Winning solution to NeurIPS 2024 watermark removal challenge using adaptive attacks for both black-box and beige-box scenarios, achieving 95.7% removal success with minimal quality impact.
Details
Motivation: To stress-test watermark robustness against adversarial attacks and determine if existing watermarks can withstand varying degrees of adversary knowledge.
Method: For beige-box: adaptive VAE-based evasion with test-time optimization and CIELAB color-contrast restoration. For black-box: clustering by artifacts, then using diffusion models with controlled noise injection and ChatGPT-generated semantic priors.
Result: Achieved near-perfect watermark removal (95.7%) with negligible impact on residual image quality.
Conclusion: The successful attacks demonstrate vulnerabilities in current watermarking methods and should inspire development of more robust image watermarking techniques.
Abstract: Content watermarking is an important tool for the authentication and copyright protection of digital media. However, it is unclear whether existing watermarks are robust against adversarial attacks. We present the winning solution to the NeurIPS 2024 Erasing the Invisible challenge, which stress-tests watermark robustness under varying degrees of adversary knowledge. The challenge consisted of two tracks: a black-box and beige-box track, depending on whether the adversary knows which watermarking method was used by the provider. For the beige-box track, we leverage an adaptive VAE-based evasion attack, with a test-time optimization and color-contrast restoration in CIELAB space to preserve the image’s quality. For the black-box track, we first cluster images based on their artifacts in the spatial or frequency-domain. Then, we apply image-to-image diffusion models with controlled noise injection and semantic priors from ChatGPT-generated captions to each cluster with optimized parameter settings. Empirical evaluations demonstrate that our method successfully achieves near-perfect watermark removal (95.7%) with negligible impact on the residual image’s quality. We hope that our attacks inspire the development of more robust image watermarking methods.
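For the black-box track, one plausible reading of "cluster images based on their artifacts in the frequency domain" is the sketch below: embed each image by its log-magnitude spectrum and run k-means. The feature choice and cluster count are assumptions, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def freq_features(images: np.ndarray) -> np.ndarray:
    """images: (N, H, W) grayscale array. Returns per-image log-magnitude
    spectra flattened into feature vectors."""
    spectra = np.abs(np.fft.fftshift(np.fft.fft2(images), axes=(-2, -1)))
    return np.log1p(spectra).reshape(len(images), -1)

rng = np.random.default_rng(0)
imgs = rng.random((32, 64, 64))   # stand-in for the watermarked images
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(
    freq_features(imgs))
print(labels)  # each cluster then receives its own tuned diffusion attack
```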
[180] Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics
Ruining Li, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi
Main category: cs.CV
TL;DR: Puppet-Master is an interactive video generator that creates videos showing part-level object motion based on input drag trajectories, outperforming existing methods in zero-shot generalization to real images.
Details
Motivation: To model object dynamics universally by capturing internal part-level motion rather than just whole-object movement, addressing the limitation of existing motion-conditioned video generators that primarily move objects as a whole.
Method: Extends a pre-trained image-to-video generator to encode input drags, introduces all-to-first attention to mitigate artifacts from fine-tuning on out-of-domain data, and fine-tunes on Objaverse-Animation-HQ - a curated dataset of synthetic 3D animation clips with meaningful drag augmentations.
Result: Puppet-Master successfully learns to generate part-level motions and generalizes well to out-of-domain real images, outperforming existing methods on real-world benchmarks in a zero-shot manner.
Conclusion: The approach effectively models universal object dynamics through part-level motion synthesis and demonstrates strong generalization capabilities to real-world scenarios without additional training.
Abstract: We introduce Puppet-Master, an interactive video generator that captures the internal, part-level motion of objects, serving as a proxy for modeling object dynamics universally. Given an image of an object and a set of “drags” specifying the trajectory of a few points on the object, the model synthesizes a video where the object’s parts move accordingly. To build Puppet-Master, we extend a pre-trained image-to-video generator to encode the input drags. We also propose all-to-first attention, an alternative to conventional spatial attention that mitigates artifacts caused by fine-tuning a video generator on out-of-domain data. The model is fine-tuned on Objaverse-Animation-HQ, a new dataset of curated part-level motion clips obtained by rendering synthetic 3D animations. Unlike real videos, these synthetic clips avoid confounding part-level motion with overall object and camera motion. We extensively filter sub-optimal animations and augment the synthetic renderings with meaningful drags that emphasize the internal dynamics of objects. We demonstrate that Puppet-Master learns to generate part-level motions, unlike other motion-conditioned video generators that primarily move the object as a whole. Moreover, Puppet-Master generalizes well to out-of-domain real images, outperforming existing methods on real-world benchmarks in a zero-shot manner.
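A minimal reading of all-to-first attention as described: every frame's queries attend only to keys and values from the first frame, rather than to their own frame's tokens. The single-head formulation and shapes below are simplifications for illustration.

```python
import torch

def all_to_first_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (B, T, N, D) video tokens (T frames, N tokens per frame).
    Queries come from every frame; keys/values come from frame 0 only."""
    B, T, N, D = x.shape
    q = x.reshape(B, T * N, D)          # all frames query
    kv = x[:, 0]                        # (B, N, D): first frame only
    attn = torch.softmax(q @ kv.transpose(-2, -1) / D ** 0.5, dim=-1)
    return (attn @ kv).reshape(B, T, N, D)

out = all_to_first_attention(torch.randn(2, 8, 16, 32))
print(out.shape)  # torch.Size([2, 8, 16, 32])
```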
[181] See then Tell: Enhancing Key Information Extraction with Vision Grounding
Shuhang Liu, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Jun Du, Qing Wang, Jianshu Zhang, Chenyu Liu
Main category: cs.CV
TL;DR: STNet is an end-to-end model that uses a unique <see> token to first observe relevant image regions and then generate grounded answers, achieving state-of-the-art key information extraction on CORD, SROIE, and DocVQA without relying on OCR.
Details
Motivation: Traditional OCR-based KIE methods suffer from latency, computational overhead, and errors, while current image-to-text approaches lack vision grounding capabilities for their outputs.
Method: STNet introduces a unique <see> token that observes pertinent image areas, aided by a decoder that interprets the physical coordinates linked to this token, so the model first sees the relevant regions and then tells the answer.
Result: The approach achieves state-of-the-art performance on public datasets including CORD, SROIE, and DocVQA, demonstrating significant advancements in KIE tasks.
Conclusion: STNet provides an effective end-to-end solution for document understanding that combines visual observation with textual response generation, offering both precise answers and relevant vision grounding without OCR limitations.
Abstract: In the digital era, the ability to understand visually rich documents that integrate text, complex layouts, and imagery is critical. Traditional Key Information Extraction (KIE) methods primarily rely on Optical Character Recognition (OCR), which often introduces significant latency, computational overhead, and errors. Current advanced image-to-text approaches, which bypass OCR, typically yield plain text outputs without corresponding vision grounding. In this paper, we introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding. Distinctively, STNet utilizes a unique <see> token to observe pertinent image areas, aided by a decoder that interprets physical coordinates linked to this token, so that the model first sees the relevant regions and then tells the answer.
[182] SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li
Main category: cs.CV
TL;DR: SpecVLM is a training-free speculative decoding framework that accelerates video LLMs by pruning up to 90% of video tokens without accuracy loss, achieving 2.68× speedup.
Details
Motivation: Video LLMs suffer from substantial memory and computational overhead due to dense video token representations, and existing token reduction methods cause information loss.
Method: Two-stage video token pruning: Stage I selects informative tokens using attention signals from the verifier, Stage II prunes redundant tokens spatially uniformly. Uses a speculative decoding framework tailored for Vid-LLMs.
Result: Achieves up to 2.68× decoding speedup for LLaVA-OneVision-72B and 2.11× for Qwen2.5-VL-32B on four video understanding benchmarks with maintained accuracy.
Conclusion: SpecVLM effectively accelerates Vid-LLMs losslessly through staged token pruning and speculative decoding, demonstrating strong performance and robustness across multiple models and benchmarks.
Abstract: Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model’s speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens to enable efficient speculation without sacrificing accuracy. To achieve this, we perform a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68$\times$ decoding speedup for LLaVA-OneVision-72B and 2.11$\times$ speedup for Qwen2.5-VL-32B. Code is available at https://github.com/zju-jiyicheng/SpecVLM.
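A schematic of the staged pruning: stage I keeps the tokens with the highest verifier attention, and stage II fills the remaining budget by sampling the leftover tokens at a uniform stride (standing in for spatial uniformity). The budget split and stride are illustrative, not the paper's settings.

```python
import torch

def prune_video_tokens(tokens, attn_scores, keep_ratio=0.1, stage1_frac=0.5):
    """tokens: (N, D) video tokens; attn_scores: (N,) attention from the
    verifier (target model). Keeps roughly keep_ratio of tokens."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    n_stage1 = max(1, int(n_keep * stage1_frac))
    # Stage I: most informative tokens by verifier attention.
    top = torch.topk(attn_scores, n_stage1).indices
    # Stage II: uniform sample of the remaining tokens.
    rest = torch.tensor([i for i in range(len(tokens))
                         if i not in set(top.tolist())])
    stride = max(1, len(rest) // ((n_keep - n_stage1) or 1))
    uniform = rest[::stride][: n_keep - n_stage1]
    keep = torch.cat([top, uniform]).sort().values
    return tokens[keep], keep

toks, idx = prune_video_tokens(torch.randn(1000, 64), torch.rand(1000))
print(len(idx))  # ~100 tokens survive for the draft model's speculation
```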
[183] Diffusion Models for Image Restoration and Enhancement: A Comprehensive Survey
Xin Li, Yulin Ren, Xin Jin, Cuiling Lan, Xingrui Wang, Wenjun Zeng, Xinchao Wang, Zhibo Chen
Main category: cs.CV
TL;DR: This paper presents the first comprehensive survey on diffusion model-based image restoration methods, covering learning paradigms, conditional strategies, framework designs, and evaluation metrics across various IR tasks.
Details
Motivation: While diffusion models have shown remarkable success in visual generation tasks, there's a lack of comprehensive surveys examining their application to image restoration tasks, despite some pioneering studies demonstrating superior performance over traditional GAN-based methods.
Method: The authors conduct a systematic review by: 1) introducing diffusion model background, 2) presenting prevalent workflows for IR, 3) classifying innovative designs for both standard and blind/real-world IR, 4) summarizing datasets and evaluation metrics, and 5) providing objective comparisons across super-resolution, deblurring, and inpainting tasks.
Result: The survey comprehensively analyzes existing diffusion model-based IR methods, identifies current limitations, and provides objective performance comparisons across multiple restoration tasks using open-sourced methods.
Conclusion: The paper identifies five key research directions for future work: sampling efficiency improvement, model compression techniques, better distortion simulation and estimation methods, distortion invariant learning approaches, and innovative framework designs for diffusion model-based image restoration.
Abstract: Image restoration (IR) has been an indispensable and challenging task in the low-level vision field, which strives to improve the subjective quality of images distorted by various forms of degradation. Recently, the diffusion model has achieved significant advancements in the visual generation of AIGC, thereby raising an intuitive question, “whether diffusion model can boost image restoration”. To answer this, some pioneering studies attempt to integrate diffusion models into the image restoration task, resulting in superior performances than previous GAN-based methods. Despite that, a comprehensive and enlightening survey on diffusion model-based image restoration remains scarce. In this paper, we are the first to present a comprehensive review of recent diffusion model-based methods on image restoration, encompassing the learning paradigm, conditional strategy, framework design, modeling strategy, and evaluation. Concretely, we first introduce the background of the diffusion model briefly and then present two prevalent workflows that exploit diffusion models in image restoration. Subsequently, we classify and emphasize the innovative designs using diffusion models for both IR and blind/real-world IR, intending to inspire future development. To evaluate existing methods thoroughly, we summarize the commonly-used dataset, implementation details, and evaluation metrics. Additionally, we present the objective comparison for open-sourced methods across three tasks, including image super-resolution, deblurring, and inpainting. Ultimately, informed by the limitations in existing works, we propose five potential and challenging directions for the future research of diffusion model-based IR, including sampling efficiency, model compression, distortion simulation and estimation, distortion invariant learning, and framework design.
[184] ODES: Domain Adaptation with Expert Guidance for Online Medical Image Segmentation
Md Shazid Islam, Sayak Nag, Arindam Dutta, Miraj Ahmed, Fahim Faisal Niloy, Shreyangshu Bera, Amit K. Roy-Chowdhury
Main category: cs.CV
TL;DR: ODES: Online domain adaptation method for medical image segmentation that uses expert guidance through active learning to improve adaptation to streaming data, with an image-pruning strategy to reduce annotation overhead.
Details
Motivation: Unsupervised domain adaptation for segmentation relies on noisy pseudo-labels, which is problematic for online streaming data where accuracy is critical in medical imaging. Expert guidance through minimal annotations can address this challenge.
Method: Proposes ODES, which adapts to incoming data batches online, incorporating expert feedback via active learning. Uses a novel image-pruning strategy to select the most informative subset of images for annotation to reduce temporal overhead.
Result: Outperforms existing online adaptation approaches and produces competitive results compared to offline domain adaptive active learning methods.
Conclusion: Expert guidance through active learning with image-pruning effectively enhances online domain adaptation for medical image segmentation, addressing the limitations of noisy pseudo-labels in streaming scenarios.
Abstract: Unsupervised domain adaptive segmentation typically relies on self-training using pseudo labels predicted by a pre-trained network on an unlabeled target dataset. However, the noisy nature of such pseudo-labels presents a major bottleneck in adapting a network to the distribution shift between source and target datasets. This challenge is exaggerated when the network encounters an incoming data stream in online fashion, where the network is constrained to adapt to incoming streams of target domain data in exactly one round of forward and backward passes. In this scenario, relying solely on inaccurate pseudo-labels can lead to low-quality segmentation, which is detrimental to medical image analysis where accuracy and precision are of utmost priority. We hypothesize that a small amount of pixel-level annotation obtained from an expert can address this problem, thereby enhancing the performance of domain adaptation of online streaming data, even in the absence of dedicated training data. We call our method ODES: Domain Adaptation with Expert Guidance for Online Medical Image Segmentation that adapts to each incoming data batch in an online setup, incorporating feedback from an expert through active learning. Through active learning, the most informative pixels in each image can be selected for expert annotation. However, the acquisition of pixel-level annotations across all images in a batch often leads to redundant information while increasing temporal overhead in online learning. To reduce the annotation acquisition time and make the adaptation process more online-friendly, we further propose a novel image-pruning strategy that selects the most useful subset of images from the current batch for active learning. Our proposed approach outperforms existing online adaptation approaches and produces competitive results compared to offline domain adaptive active learning methods.
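A sketch of the active-learning side under standard assumptions: per-pixel predictive entropy ranks pixels for expert annotation, and per-image mean entropy prunes the batch to its most informative images. ODES's actual acquisition and pruning criteria may differ; this only illustrates the mechanism.

```python
import numpy as np

def pixel_entropy(probs: np.ndarray) -> np.ndarray:
    """probs: (B, C, H, W) softmax outputs. Returns (B, H, W) entropy."""
    return -(probs * np.log(probs + 1e-8)).sum(axis=1)

def select_for_annotation(probs, n_images=2, n_pixels=100):
    """Prune the batch to its most uncertain images, then pick the most
    uncertain pixels in each for expert labeling."""
    ent = pixel_entropy(probs)                             # (B, H, W)
    img_rank = np.argsort(ent.mean(axis=(1, 2)))[::-1][:n_images]
    picks = {}
    for i in img_rank:
        flat = ent[i].ravel()
        picks[int(i)] = np.argsort(flat)[::-1][:n_pixels]  # pixel indices
    return picks

rng = np.random.default_rng(0)
logits = rng.random((8, 4, 32, 32))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print({k: v[:3] for k, v in select_for_annotation(probs).items()})
```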
[185] VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
Haodong Duan, Xinyu Fang, Junming Yang, Xiangyu Zhao, Yuxuan Qiao, Mo Li, Amit Agarwal, Zhe Chen, Lin Chen, Yuan Liu, Yubo Ma, Hailong Sun, Yifan Zhang, Shiyin Lu, Tack Hwa Wong, Weiyun Wang, Peiheng Zhou, Xiaozhe Li, Chaoyou Fu, Junbo Cui, Jixuan Chen, Enxin Song, Song Mao, Shengyuan Ding, Tianhao Liang, Zicheng Zhang, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, Kai Chen
Main category: cs.CV
TL;DR: VLMEvalKit is an open-source PyTorch toolkit for evaluating 200+ multi-modality models and 80+ benchmarks with automated workflows for data preparation, inference, and metric calculation.
Details
Motivation: To provide researchers and developers with a user-friendly, comprehensive framework for reproducible evaluation of multi-modality models and track progress in the field.
Method: Implements a single interface for easy model integration, automates data preparation, distributed inference, prediction post-processing, and metric calculation for over 200 models and 80 benchmarks.
Result: Successfully created an open-source toolkit that supports evaluation of both proprietary APIs and open-source models, and established the OpenVLM Leaderboard to track research progress.
Conclusion: VLMEvalKit provides a scalable, maintainable solution for multi-modality model evaluation with potential for future expansion to additional modalities like audio and video.
Abstract: We present VLMEvalKit: an open-source toolkit for evaluating large multi-modality models based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. In VLMEvalKit, we implement over 200 different large multi-modality models, including both proprietary APIs and open-source models, as well as more than 80 different multi-modal benchmarks. By implementing a single interface, new models can be easily added to the toolkit, while the toolkit automatically handles the remaining workloads, including data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently mainly used for evaluating large vision-language models, its design is compatible with future updates that incorporate additional modalities, such as audio and video. Based on the evaluation results obtained with the toolkit, we host OpenVLM Leaderboard, a comprehensive leaderboard to track the progress of multi-modality learning research. The toolkit is released on https://github.com/open-compass/VLMEvalKit and is actively maintained.
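The single-interface design can be pictured as below: a new model implements one generation entry point, and the harness drives sampling and scoring. This is a schematic of the idea only, not VLMEvalKit's literal base class; consult the repository for the actual interface.

```python
class MyVLM:
    """Schematic model wrapper: one method turns an interleaved
    image/text message into a text answer. (Illustrative only; see the
    VLMEvalKit repository for the real interface.)"""

    def generate(self, message):
        # message: list of {"type": "image"|"text", "value": ...} parts.
        images = [m["value"] for m in message if m["type"] == "image"]
        prompt = " ".join(m["value"] for m in message if m["type"] == "text")
        return f"answer for {len(images)} image(s): {prompt[:40]}"

def evaluate(model, benchmark):
    """Harness side: iterate benchmark samples, collect predictions,
    and score them with a trivial exact-match metric."""
    preds = [model.generate(s["message"]) for s in benchmark]
    acc = sum(p == s["answer"] for p, s in zip(preds, benchmark)) / len(benchmark)
    return preds, acc

bench = [{"message": [{"type": "text", "value": "2+2?"}], "answer": "4"}]
print(evaluate(MyVLM(), bench))
```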
[186] Improving Fine-Grained Control via Aggregation of Multiple Diffusion Models
Conghan Yue, Zhengwei Peng, Shiyan Du, Zhi Ji, Chuangjian Cai, Le Wan, Dongyu Zhang
Main category: cs.CV
TL;DR: AMDM is a training-free algorithm that aggregates features from multiple diffusion models to achieve fine-grained control without additional training, dataset construction, or complex architecture design.
Details
Motivation: Existing diffusion models struggle with fine-grained control due to dataset limitations and complex model architectures, requiring a solution that can integrate specific features from multiple models without retraining.
Method: The proposed AMDM algorithm integrates features from multiple diffusion models into a specified target model to activate particular features and enable fine-grained control, working independently of denoising network architectures.
Result: Experimental results show AMDM significantly improves fine-grained control without training, and reveals that diffusion models initially focus on features like position, attributes, and style before improving generation quality.
Conclusion: AMDM provides a new perspective for fine-grained conditional generation in diffusion models, enabling full utilization of existing conditional models without complex datasets, architectures, or high training costs.
Abstract: While many diffusion models perform well when controlling particular aspects such as style, character, and interaction, they struggle with fine-grained control due to dataset limitations and intricate model architecture design. This paper introduces a novel training-free algorithm, independent of denoising network architectures, for fine-grained generation, called Aggregation of Multiple Diffusion Models (AMDM). The algorithm integrates features from multiple diffusion models into a specified model to activate particular features and enable fine-grained control. Experimental results demonstrate that AMDM significantly improves fine-grained control without training, validating its effectiveness. Additionally, it reveals that diffusion models initially focus on features such as position, attributes, and style, with later stages improving generation quality and consistency. AMDM offers a new perspective for tackling the challenges of fine-grained conditional generation in diffusion models. Specifically, it allows us to fully utilize existing or develop new conditional diffusion models that control specific aspects, and then aggregate them using the AMDM algorithm. This eliminates the need for constructing complex datasets, designing intricate model architectures, and incurring high training costs. Code is available at: https://github.com/Hammour-steak/AMDM.
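A toy version of the aggregation step: at each denoising iteration, predictions from several condition-specialized models are blended into the target trajectory. The linear blend, fixed weights, and simplified update below are assumptions, not AMDM's actual aggregation rule.

```python
import torch

def aggregated_step(x_t, models, weights, t):
    """One denoising step where the noise prediction is an aggregate of
    several diffusion models, each controlling a different aspect.
    models: callables (x_t, t) -> eps prediction; weights sum to 1."""
    eps = sum(w * m(x_t, t) for m, w in zip(models, weights))
    # Simplified update; real samplers use scheduler coefficients.
    return x_t - 0.1 * eps

# Stand-ins for a position-controlling and a style-controlling model.
model_pos = lambda x, t: 0.5 * x
model_sty = lambda x, t: torch.tanh(x)

x = torch.randn(1, 4, 8, 8)
for t in reversed(range(10)):
    x = aggregated_step(x, [model_pos, model_sty], [0.6, 0.4], t)
print(x.shape)
```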
[187] ZIM: Zero-Shot Image Matting for Anything
Beomyoung Kim, Chanyong Shin, Joonhyun Jeong, Hyungsik Jung, Se-Yun Lee, Sewhan Chun, Dong-Hyun Hwang, Joonsang Yu
Main category: cs.CV
TL;DR: ZIM is a zero-shot image matting model that enhances SAM’s segmentation with precise matte masks using automatically generated SA1B-Matte dataset and hierarchical pixel decoder with prompt-aware attention.
Details
Motivation: Segment Anything Model (SAM) has strong zero-shot segmentation but lacks fine-grained precise mask generation capabilities, limiting its effectiveness in applications requiring detailed matting.
Method: Developed a label converter to transform segmentation labels into matte labels, creating the SA1B-Matte dataset; trained SAM with this dataset; designed a hierarchical pixel decoder and prompt-aware masked attention mechanism.
Result: ZIM outperforms existing methods in fine-grained mask generation and zero-shot generalization on MicroMat-3K test set, and demonstrates versatility in downstream tasks like image inpainting and 3D NeRF.
Conclusion: ZIM provides a robust foundation for advancing zero-shot matting and its applications across computer vision tasks, enabling precise mask generation while maintaining zero-shot capabilities.
Abstract: The recent segmentation foundation model, Segment Anything Model (SAM), exhibits strong zero-shot segmentation capabilities, but it falls short in generating fine-grained precise masks. To address this limitation, we propose a novel zero-shot image matting model, called ZIM, with two key contributions: First, we develop a label converter that transforms segmentation labels into detailed matte labels, constructing the new SA1B-Matte dataset without costly manual annotations. Training SAM with this dataset enables it to generate precise matte masks while maintaining its zero-shot capability. Second, we design the zero-shot matting model equipped with a hierarchical pixel decoder to enhance mask representation, along with a prompt-aware masked attention mechanism to improve performance by enabling the model to focus on regions specified by visual prompts. We evaluate ZIM using the newly introduced MicroMat-3K test set, which contains high-quality micro-level matte labels. Experimental results show that ZIM outperforms existing methods in fine-grained mask generation and zero-shot generalization. Furthermore, we demonstrate the versatility of ZIM in various downstream tasks requiring precise masks, such as image inpainting and 3D NeRF. Our contributions provide a robust foundation for advancing zero-shot matting and its downstream applications across a wide range of computer vision tasks. The code is available at https://github.com/naver-ai/ZIM.
[188] Federated nnU-Net for Privacy-Preserving Medical Image Segmentation
Grzegorz Skorupko, Fotios Avgoustidis, Carlos Martín-Isla, Lidia Garrucho, Dimitri A. Kessler, Esmeralda Ruiz Pujadas, Oliver Díaz, Maciej Bobowicz, Katarzyna Gwoździewicz, Xavier Bargalló, Paulius Jaruševičius, Richard Osuala, Kaisar Kushibar, Karim Lekadir
Main category: cs.CV
TL;DR: FednnU-Net extends nnU-Net with federated learning capabilities to enable decentralized medical image segmentation while preserving patient privacy, using two novel methods: Federated Fingerprint Extraction and Asymmetric Federated Averaging.
Details
Motivation: Centralized nnU-Net training requires data to be stored in one location, which risks patient privacy breaches and sensitive information leakage. Federated learning enables collaborative model development without sharing raw patient data.
Method: Proposes FednnU-Net with two federated methodologies: 1) Federated Fingerprint Extraction (FFE) and 2) Asymmetric Federated Averaging (AsymFedAvg) for decentralized training of nnU-Net models.
Result: Comprehensive experiments show high and consistent performance across breast, cardiac, and fetal segmentation tasks using multi-modal data from 6 datasets representing 18 different institutions.
Conclusion: FednnU-Net successfully enables privacy-preserving decentralized training of medical segmentation models while maintaining performance comparable to centralized approaches, with the framework made publicly available for research and clinical deployment.
Abstract: The nnU-Net framework has played a crucial role in medical image segmentation and has become the gold standard in multitudes of applications targeting different diseases, organs, and modalities. However, so far it has been used primarily in a centralized approach where the collected data is stored in the same location where nnU-Net is trained. This centralized approach has various limitations, such as potential leakage of sensitive patient information and violation of patient privacy. Federated learning has emerged as a key approach for training segmentation models in a decentralized manner, enabling collaborative development while prioritising patient privacy. In this paper, we propose FednnU-Net, a plug-and-play, federated learning extension of the nnU-Net framework. To this end, we contribute two federated methodologies to unlock decentralized training of nnU-Net, namely, Federated Fingerprint Extraction (FFE) and Asymmetric Federated Averaging (AsymFedAvg). We conduct a comprehensive set of experiments demonstrating high and consistent performance of our methods for breast, cardiac and fetal segmentation based on a multi-modal collection of 6 datasets representing samples from 18 different institutions. To democratize research as well as real-world deployments of decentralized training in clinical centres, we publicly share our framework at https://github.com/faildeny/FednnUNet .
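The averaging side can be sketched as follows: clients share only an agreed subset of parameters, which the server averages, while the rest stay local. This illustrates the asymmetric idea only; the actual AsymFedAvg rules (which layers are shared, and how) come from the paper.

```python
import numpy as np

def asymmetric_fedavg(client_states, shared_keys):
    """client_states: list of {param_name: np.ndarray}. Only parameters in
    shared_keys are averaged across clients; others remain client-local."""
    avg = {k: np.mean([cs[k] for cs in client_states], axis=0)
           for k in shared_keys}
    # Each client keeps its private parameters and adopts the shared average.
    return [{**cs, **avg} for cs in client_states]

clients = [{"encoder.w": np.ones(3) * i, "head.w": np.ones(2) * i}
           for i in range(3)]
updated = asymmetric_fedavg(clients, shared_keys={"encoder.w"})
print(updated[0]["encoder.w"], updated[0]["head.w"])  # averaged vs. local
```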
[189] CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs
Yihan Cao, Jiazhao Zhang, Zhinan Yu, Shuzhen Liu, Zheng Qin, Qin Zou, Bo Du, Kai Xu
Main category: cs.CV
TL;DR: CogNav framework uses LLMs to model human-like cognitive states for object navigation, achieving 14%+ improvement over SOTA methods.
Details
Motivation: Current ObjectNav approaches focus on perception but lack sophisticated cognitive modeling. Inspired by human cognitive processes during object search tasks, the authors aim to bridge this gap.
Method: Uses a finite state machine with fine-grained cognitive states (exploration to identification). LLMs determine state transitions based on a dynamically constructed heterogeneous cognitive map containing spatial and semantic information.
Result: Extensive evaluations on HM3D, MP3D, and RoboTHOR benchmarks show at least 14% relative improvement in success rate over state-of-the-art methods.
Conclusion: Modeling cognitive processes using LLMs significantly enhances ObjectNav performance, demonstrating the value of cognitive-inspired approaches in embodied AI.
Abstract: Object goal navigation (ObjectNav) is a fundamental task in embodied AI, requiring an agent to locate a target object in previously unseen environments. This task is particularly challenging because it requires both perceptual and cognitive processes, including object recognition and decision-making. While substantial advancements in perception have been driven by the rapid development of visual foundation models, progress on the cognitive aspect remains constrained, primarily limited to either implicit learning through simulator rollouts or explicit reliance on predefined heuristic rules. Inspired by neuroscientific findings demonstrating that humans maintain and dynamically update fine-grained cognitive states during object search tasks in novel environments, we propose CogNav, a framework designed to mimic this cognitive process using large language models. Specifically, we model the cognitive process using a finite state machine comprising fine-grained cognitive states, ranging from exploration to identification. Transitions between states are determined by a large language model based on a dynamically constructed heterogeneous cognitive map, which contains spatial and semantic information about the scene being explored. Extensive evaluations on the HM3D, MP3D, and RoboTHOR benchmarks demonstrate that our cognitive process modeling significantly improves the success rate of ObjectNav at least by relative 14% over the state-of-the-arts.
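The cognitive loop reduces to a finite state machine whose transition function is delegated to an LLM conditioned on the cognitive map; the sketch below wires that up with a rule-based stub in place of the real LLM call, and the state names and map fields are illustrative.

```python
STATES = ["explore", "candidate_verification", "identification"]

def llm_transition(state: str, cognitive_map: dict) -> str:
    """Stub for the LLM that picks the next cognitive state from the
    heterogeneous map (spatial + semantic scene information)."""
    if state == "explore" and cognitive_map.get("candidates"):
        return "candidate_verification"
    if state == "candidate_verification" and cognitive_map.get("confirmed"):
        return "identification"
    return state

state, cmap = "explore", {"candidates": [], "confirmed": False}
for step in range(5):
    # Perception would update the map here; we fake a detection at step 2.
    if step == 2:
        cmap["candidates"].append("chair_1")
        cmap["confirmed"] = True
    state = llm_transition(state, cmap)
    print(step, state)
```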
[190] T-Stars-Poster: A Framework for Product-Centric Advertising Image Design
Hongyu Chen, Min Zhou, Jing Jiang, Jiale Chen, Yang Lu, Zihang Lin, Bo Xiao, Tiezheng Ge, Bo Zheng
Main category: cs.CV
TL;DR: T-Stars-Poster is a novel framework that automatically generates advertising images from product foregrounds and taglines using a four-stage pipeline with specialized models for prompt generation, layout design, background generation, and graphics rendering.
Details
Motivation: Creating advertising images is labor-intensive and time-consuming. Existing methods only solve parts of the problem, lacking a comprehensive solution for automatic ad image generation from basic product information.
Method: Four-stage framework: 1) VLM generates background prompts matching products, 2) VLM-based layout generation arranges product foregrounds, graphic elements, and background objects, 3) SDXL-based model generates images using prompts, layouts, and foreground controls, 4) Graphics rendering. Two datasets with 50k+ labeled images were created to support the system.
Result: Extensive experiments and online A/B tests demonstrate that T-Stars-Poster produces more visually appealing advertising images compared to existing methods.
Conclusion: The proposed T-Stars-Poster framework successfully automates advertising image generation with a comprehensive solution that highlights products and taglines while achieving overall aesthetic quality.
Abstract: Creating advertising images is often a labor-intensive and time-consuming process. Can we automatically generate such images using basic product information like a product foreground image, taglines, and a target size? Existing methods mainly focus on parts of the problem and lack a comprehensive solution. To bridge this gap, we propose a novel product-centric framework for advertising image design called T-Stars-Poster. It consists of four sequential stages to highlight product foregrounds and taglines while achieving overall image aesthetics: prompt generation, layout generation, background image generation, and graphics rendering. Different expert models are designed and trained for the first three stages: First, a visual language model (VLM) generates background prompts that match the products. Next, a VLM-based layout generation model arranges the placement of product foregrounds, graphic elements (taglines and decorative underlays), and various nongraphic elements (objects from the background prompt). Following this, an SDXL-based model can simultaneously accept prompts, layouts, and foreground controls to generate images. To support T-Stars-Poster, we create two corresponding datasets with over 50,000 labeled images. Extensive experiments and online A/B tests demonstrate that T-Stars-Poster can produce more visually appealing advertising images.
[191] Language-to-Space Programming for Training-Free 3D Visual Grounding
Boyu Mi, Hanqing Wang, Tai Wang, Yilun Chen, Jiangmiao Pang
Main category: cs.CV
TL;DR: LaSP is a training-free 3D visual grounding method that uses LLM-generated code to analyze 3D spatial relations, achieving competitive accuracy while reducing time and token costs.
Details
Motivation: Address the challenges of 3D visual grounding where supervised methods require expensive annotated data and existing training-free methods suffer from high computational costs or poor accuracy.
Method: Language-to-Space Programming (LaSP) uses LLM-generated codes to analyze 3D spatial relations among objects with an automated pipeline for code evaluation and optimization.
Result: Achieves 52.9% accuracy on Nr3D benchmark, ranking among best training-free methods while significantly reducing grounding time and token costs.
Conclusion: LaSP provides an effective training-free solution for 3D visual grounding that balances performance and efficiency through automated code generation and optimization.
Abstract: 3D visual grounding (3DVG) is challenging due to the need to understand 3D spatial relations. While supervised approaches have achieved superior performance, they are constrained by the scarcity and high annotation costs of 3D vision-language datasets. Training-free approaches based on LLMs/VLMs eliminate the need for large-scale training data, but they either incur prohibitive grounding time and token costs or have unsatisfactory accuracy. To address the challenges, we introduce a novel method for training-free 3D visual grounding, namely Language-to-Space Programming (LaSP). LaSP introduces LLM-generated codes to analyze 3D spatial relations among objects, along with a pipeline that evaluates and optimizes the codes automatically. Experimental results demonstrate that LaSP achieves 52.9% accuracy on the Nr3D benchmark, ranking among the best training-free methods. Moreover, it substantially reduces the grounding time and token costs, offering a balanced trade-off between performance and efficiency.
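The kind of code LaSP asks the LLM to emit is small spatial predicates over object geometry; below is a plausible example for grounding "the chair left of the table", under a hypothetical room-frame convention (smaller x means further left).

```python
import numpy as np

# Example of an LLM-generated spatial predicate (hypothetical convention:
# the x-axis points right in the room frame, so smaller x means "left").
def left_of(centroid_a: np.ndarray, centroid_b: np.ndarray) -> bool:
    return centroid_a[0] < centroid_b[0]

objects = {
    "chair_1": np.array([1.0, 2.0, 0.5]),
    "chair_2": np.array([4.0, 2.1, 0.5]),
    "table_1": np.array([3.0, 2.0, 0.7]),
}
# Ground the query by filtering candidates with the generated code.
matches = [name for name, c in objects.items()
           if name.startswith("chair") and left_of(c, objects["table_1"])]
print(matches)  # ['chair_1']
```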
[192] Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning
Qi Wang, Zhipeng Zhang, Baao Xie, Xin Jin, Yunbo Wang, Shiyu Wang, Liaomo Zheng, Xiaokang Yang, Wenjun Zeng
Main category: cs.CV
TL;DR: Disentangled World Models (DisWM) framework improves visual RL sample efficiency by transferring semantic knowledge from offline distracting videos to online environments through latent distillation and disentanglement constraints.
Details
Motivation: Visual RL agents suffer from low sample efficiency in varying environments, and existing disentangled representation methods start from scratch without leveraging prior world knowledge.
Method: Pretrain action-free video prediction model offline with disentanglement regularization, transfer capability via latent distillation to world model, then finetune online with disentanglement constraints using actions and rewards.
Result: Experimental results show superiority on various benchmarks, demonstrating improved sample efficiency and performance.
Conclusion: The DisWM framework effectively transfers semantic knowledge from offline videos to online RL, enhancing disentangled representation learning and sample efficiency in varying environments.
Abstract: Training visual reinforcement learning (RL) in practical scenarios presents a significant challenge, $\textit{i.e.,}$ RL agents suffer from low sample efficiency in environments with variations. While various approaches have attempted to alleviate this issue by disentangled representation learning, these methods usually start learning from scratch without prior knowledge of the world. This paper, in contrast, tries to learn and understand underlying semantic variations from distracting videos via offline-to-online latent distillation and flexible disentanglement constraints. To enable effective cross-domain semantic knowledge transfer, we introduce an interpretable model-based RL framework, dubbed Disentangled World Models (DisWM). Specifically, we pretrain the action-free video prediction model offline with disentanglement regularization to extract semantic knowledge from distracting videos. The disentanglement capability of the pretrained model is then transferred to the world model through latent distillation. For finetuning in the online environment, we exploit the knowledge from the pretrained model and introduce a disentanglement constraint to the world model. During the adaptation phase, the incorporation of actions and rewards from online environment interactions enriches the diversity of the data, which in turn strengthens the disentangled representation learning. Experimental results validate the superiority of our approach on various benchmarks.
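The offline-to-online transfer step reduces to a distillation term that pulls the world model's latents toward those of the pretrained disentangled video model; the loss form below is an assumed plain MSE, shown in isolation from the rest of the world-model objective.

```python
import torch
import torch.nn.functional as F

def latent_distillation_loss(student_z, teacher_z):
    """Pull the world model's latent (student) toward the latent of the
    pretrained disentangled video model (teacher, kept frozen)."""
    return F.mse_loss(student_z, teacher_z.detach())

student = torch.randn(8, 32, requires_grad=True)
teacher = torch.randn(8, 32)
loss = latent_distillation_loss(student, teacher)
loss.backward()
print(loss.item())
```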
[193] DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness
Ruining Li, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi
Main category: cs.CV
TL;DR: DSO framework uses non-differentiable physics simulator feedback to align 3D generators for producing self-supporting objects without test-time optimization, achieving faster and more stable generation.
Details
Motivation: Most 3D object generators prioritize aesthetics over physical constraints like stability under gravity, and existing test-time optimization methods are slow, unstable, and prone to local optima.
Method: Proposes Direct Simulation Optimization (DSO) using feedback from non-differentiable simulator to fine-tune 3D generator via DPO or novel DRO objective, creating dataset with stability scores from physics simulation.
Result: Fine-tuned feed-forward generator using DPO/DRO is significantly faster and more likely to produce stable objects than test-time optimization, works without ground-truth 3D training data.
Conclusion: DSO framework enables 3D generators to self-improve using simulation feedback, producing physically stable objects efficiently without requiring differentiable physics or test-time optimization.
Abstract: Most 3D object generators prioritize aesthetic quality, often neglecting the physical constraints necessary for practical applications. One such constraint is that a 3D object should be self-supporting, i.e., remain balanced under gravity. Previous approaches to generating stable 3D objects relied on differentiable physics simulators to optimize geometry at test time, which is slow, unstable, and prone to local optima. Inspired by the literature on aligning generative models with external feedback, we propose Direct Simulation Optimization (DSO). This framework leverages feedback from a (non-differentiable) simulator to increase the likelihood that the 3D generator directly outputs stable 3D objects. We construct a dataset of 3D objects labeled with stability scores obtained from the physics simulator. This dataset enables fine-tuning of the 3D generator using the stability score as an alignment metric, via direct preference optimization (DPO) or direct reward optimization (DRO) - a novel objective we introduce to align diffusion models without requiring pairwise preferences. Our experiments demonstrate that the fine-tuned feed-forward generator, using either the DPO or DRO objective, is significantly faster and more likely to produce stable objects than test-time optimization. Notably, the DSO framework functions even without any ground-truth 3D objects for training, allowing the 3D generator to self-improve by automatically collecting simulation feedback on its own outputs.
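The alignment objective can be sketched as a standard DPO loss where the preferred sample is the simulation-stable generation; the log-probabilities and beta below are placeholders, and DRO (the paper's pairwise-free variant) is not shown.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct preference optimization on (stable, unstable) pairs.
    logp_*: policy log-likelihoods of the winner/loser samples;
    ref_logp_*: the same under the frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# Stability scores from the physics simulator decide winner vs. loser.
logp_w, logp_l = torch.tensor([-4.0]), torch.tensor([-4.5])
ref_w, ref_l = torch.tensor([-4.2]), torch.tensor([-4.2])
print(dpo_loss(logp_w, logp_l, ref_w, ref_l))
```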
[194] L2RW+: A Comprehensive Benchmark Towards Privacy-Preserved Visible-Infrared Person Re-Identification
Yan Jiang, Hao Yu, Mengting Wei, Zhaodong Sun, Haoyu Chen, Xu Cheng, Guoying Zhao
Main category: cs.CV
TL;DR: L2RW+ is a new benchmark for visible-infrared person re-identification that introduces decentralized training to address privacy concerns in real-world scenarios where data is distributed across multiple devices/entities with limited sharing constraints.
Details
Motivation: Existing VI-ReID methods use centralized training which ignores privacy concerns when data is distributed across multiple devices or entities in real-world applications. There's a need for approaches that respect privacy constraints while maintaining performance.
Method: Proposes L2RW+ benchmark with protocols and algorithms for different privacy sensitivity levels. Simulates real-world data conditions: 1) completely isolated camera data, and 2) selective data sharing between different entities. Uses decentralized training approach for VI-ReID.
Result: Comprehensive experiments show feasibility of decentralized VI-ReID training at both image and video levels. Performance gap between decentralized and centralized training decreases with increasing data scales, especially in video-level VI-ReID. In unseen domains, decentralized training achieves performance comparable to state-of-the-art centralized methods.
Conclusion: L2RW+ offers a novel research direction for deploying VI-ReID in real-world scenarios with privacy constraints. The work demonstrates that decentralized training can effectively address privacy concerns while maintaining competitive performance, benefiting the research community.
Abstract: Visible-infrared person re-identification (VI-ReID) is a challenging task that aims to match pedestrian images captured under varying lighting conditions, which has drawn intensive research attention and achieved promising results. However, existing methods adopt the centralized training, ignoring the potential privacy concerns as the data is distributed across multiple devices or entities in reality. In this paper, we propose L2RW+, a benchmark that brings VI-ReID closer to real-world applications. The core rationale behind L2RW+ is that incorporating decentralized training into VI-ReID can address privacy concerns in scenarios with limited data-sharing constraints. Specifically, we design protocols and corresponding algorithms for different privacy sensitivity levels. In our new benchmark, we simulate the training under real-world data conditions that: 1) data from each camera is completely isolated, or 2) different data entities (e.g., data controllers of a certain region) can selectively share the data. In this way, we simulate scenarios with strict privacy restrictions, which is closer to real-world conditions. Comprehensive experiments show the feasibility and potential of decentralized VI-ReID training at both image and video levels. In particular, with increasing data scales, the performance gap between decentralized and centralized training decreases, especially in video-level VI-ReID. In unseen domains, decentralized training even achieves performance comparable to SOTA centralized methods. This work offers a novel research entry for deploying VI-ReID into real-world scenarios and can benefit the community. Code is available at: https://github.com/Joey623/L2RW.
[195] WikiAutoGen: Towards Multi-Modal Wikipedia-Style Article Generation
Zhongyu Yang, Jun Chen, Dannong Xu, Junjie Fei, Xiaoqian Shen, Liangbing Zhao, Chun-Mei Feng, Mohamed Elhoseiny
Main category: cs.CV
TL;DR: WikiAutoGen is a novel system for automated multimodal Wikipedia-style article generation that integrates both text and images, outperforming previous text-only methods by 8%-29% on the new WikiSeek benchmark.
Details
Motivation: Traditional knowledge discovery requires significant human effort, and existing multi-agent frameworks for automated article generation focus only on text, overlooking the importance of multimodal content for enhanced informativeness and engagement.
Method: The system retrieves and integrates relevant images alongside text, and employs a multi-perspective self-reflection mechanism to critically assess retrieved content from diverse viewpoints for improved factual accuracy and comprehensiveness.
Result: WikiAutoGen outperforms previous methods by 8%-29% on the WikiSeek benchmark, producing more accurate, coherent, and visually enriched Wikipedia-style articles.
Conclusion: The proposed multimodal approach with self-reflection mechanism significantly enhances automated knowledge generation, providing more comprehensive and engaging content compared to text-only methods.
Abstract: Knowledge discovery and collection are intelligence-intensive tasks that traditionally require significant human effort to ensure high-quality outputs. Recent research has explored multi-agent frameworks for automating Wikipedia-style article generation by retrieving and synthesizing information from the internet. However, these methods primarily focus on text-only generation, overlooking the importance of multimodal content in enhancing informativeness and engagement. In this work, we introduce WikiAutoGen, a novel system for automated multimodal Wikipedia-style article generation. Unlike prior approaches, WikiAutoGen retrieves and integrates relevant images alongside text, enriching both the depth and visual appeal of generated content. To further improve factual accuracy and comprehensiveness, we propose a multi-perspective self-reflection mechanism, which critically assesses retrieved content from diverse viewpoints to enhance reliability, breadth, and coherence, etc. Additionally, we introduce WikiSeek, a benchmark comprising Wikipedia articles with topics paired with both textual and image-based representations, designed to evaluate multimodal knowledge generation on more challenging topics. Experimental results show that WikiAutoGen outperforms previous methods by 8%-29% on our WikiSeek benchmark, producing more accurate, coherent, and visually enriched Wikipedia-style articles. Our code and examples are available at https://wikiautogen.github.io/ .
[196] GeoTexBuild: 3D Building Model Generation from Map Footprints
Ruizhe Wang, Junyan Yang, Qiao Wang
Main category: cs.CV
TL;DR: GeoTexBuild is a modular generative framework that converts 2D building footprints into detailed 3D building models through height map generation, geometry reconstruction, and appearance stylization.
Details
Motivation: To provide architects and city planners with a seamless solution for converting map features into 3D buildings, addressing structural variation problems in existing 3D generation techniques.
Method: Three-stage process using customized ControlNet, Neural style field (NSF), and Multi-view diffusion model to control geometric and visual attributes during generation.
Result: Experimental results validate the system’s capability to generate detailed and accurate building models from footprints at each stage.
Conclusion: GeoTexBuild successfully eliminates structural variation issues in facade images and provides an effective framework for 3D building generation from 2D footprints.
Abstract: We introduce GeoTexBuild, a modular generative framework for creating 3D building models from footprints derived from site planning or map designs. The system is designed for architects and city planners, offering a seamless solution that directly converts map features into 3D buildings. The proposed framework employs a three-stage process comprising height map generation, geometry reconstruction, and appearance stylization, culminating in building models with detailed geometry and appearance attributes. By integrating customized ControlNet, Neural style field (NSF), and Multi-view diffusion model, we explore effective methods for controlling both geometric and visual attributes during the generation process. Our approach eliminates the problem of structural variations in a single facade image in existing 3D generation techniques for buildings. Experimental results at each stage validate the capability of GeoTexBuild to generate detailed and accurate building models from footprints.
[197] RSRNav: Reasoning Spatial Relationship for Image-Goal Navigation
Zheng Qin, Le Wang, Yabing Wang, Sanping Zhou, Gang Hua, Wei Tang
Main category: cs.CV
TL;DR: RSRNav improves image-goal navigation by modeling spatial relationships between goal and current observations using cross-correlation and direction-aware correlation, achieving superior performance especially in user-matched goal settings.
Details
Motivation: Current ImageNav methods fail to provide accurate directional information from semantic features and suffer performance drops due to viewpoint inconsistencies between training and application.
Method: Proposes RSRNav which constructs correlations between goal and current observations using fine-grained cross-correlation and direction-aware correlation, then passes these to policy network for action prediction.
Result: Extensive evaluation on three benchmark datasets demonstrates superior navigation performance, particularly in user-matched goal settings.
Conclusion: RSRNav effectively addresses directional information limitations and viewpoint inconsistency issues, showing strong potential for real-world image-goal navigation applications.
Abstract: Recent image-goal navigation (ImageNav) methods learn a perception-action policy by separately capturing semantic features of the goal and egocentric images, then passing them to a policy network. However, challenges remain: (1) Semantic features often fail to provide accurate directional information, leading to superfluous actions, and (2) performance drops significantly when viewpoint inconsistencies arise between training and application. To address these challenges, we propose RSRNav, a simple yet effective method that reasons spatial relationships between the goal and current observations as navigation guidance. Specifically, we model the spatial relationship by constructing correlations between the goal and current observations, which are then passed to the policy network for action prediction. These correlations are progressively refined using fine-grained cross-correlation and direction-aware correlation for more precise navigation. Extensive evaluation of RSRNav on three benchmark datasets demonstrates superior navigation performance, particularly in the “user-matched goal” setting, highlighting its potential for real-world applications.
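The core operation, constructing correlations between goal and current observations, is essentially an inner-product correlation volume over feature locations; the sketch below computes one for a single pair of feature maps (shapes and cosine normalization are illustrative choices).

```python
import torch

def correlation_volume(goal_feat, obs_feat):
    """goal_feat, obs_feat: (C, H, W) feature maps. Returns an
    (H*W, H*W) volume of cosine similarities between all location pairs."""
    C, H, W = goal_feat.shape
    g = torch.nn.functional.normalize(goal_feat.reshape(C, -1), dim=0)
    o = torch.nn.functional.normalize(obs_feat.reshape(C, -1), dim=0)
    return g.t() @ o  # fine-grained cross-correlation fed to the policy

vol = correlation_volume(torch.randn(64, 16, 16), torch.randn(64, 16, 16))
print(vol.shape)  # torch.Size([256, 256])
```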
[198] When Memory Becomes a Vulnerability: Towards Multi-turn Jailbreak Attacks against Text-to-Image Generation Systems
Shiqian Zhao, Jiayang Liu, Yiming Li, Runyi Hu, Xiaojun Jia, Wenshu Fan, Xinfeng Li, Jie Zhang, Wei Dong, Tianwei Zhang, Luu Anh Tuan
Main category: cs.CV
TL;DR: Inception is a multi-turn jailbreak attack that exploits memory mechanisms in text-to-image generation systems to bypass safety filters by segmenting malicious prompts across multiple chat turns.
Details
Motivation: Existing jailbreak attacks fuse unsafe prompts into single adversarial prompts that are easily detected or lead to non-unsafe images due to under/over-detoxification. Memory mechanisms in modern T2I systems present new security vulnerabilities that haven't been adequately analyzed.
Method: Inception uses two modules: Segmentation (semantic-preserving method that decomposes prompts according to sentence structure using NLP techniques) and Recursion (handles unsafe sub-prompts by expanding and recursively segmenting them). Built VisionFlow emulation system with safety filters and memory mechanisms for crafting multi-turn adversarial prompts.
Result: Inception achieves 20.0% higher attack success rate than state-of-the-art methods, successfully generating unsafe images. Validated on real-world commercial T2I platforms, demonstrating practical threats.
Conclusion: Memory mechanisms in T2I systems significantly exacerbate jailbreak attack risks. Inception demonstrates the vulnerability of current safety filters to multi-turn attacks that exploit memory retention across chat sessions.
Abstract: Modern text-to-image (T2I) generation systems (e.g., DALL$\cdot$E 3) exploit the memory mechanism, which captures key information in multi-turn interactions for faithful generation. Despite its practicality, the security analyses of this mechanism have fallen far behind. In this paper, we reveal that it can exacerbate the risk of jailbreak attacks. Previous attacks fuse the unsafe target prompt into one ultimate adversarial prompt, which can be easily detected or lead to the generation of non-unsafe images due to under- or over-detoxification. In contrast, we propose embedding the malice at the inception of the chat session in memory, addressing the above limitations. Specifically, we propose Inception, the first multi-turn jailbreak attack against real-world text-to-image generation systems that explicitly exploits their memory mechanisms. Inception is composed of two key modules: segmentation and recursion. We introduce Segmentation, a semantic-preserving method that generates multi-round prompts. By leveraging NLP analysis techniques, we design policies to decompose a prompt, together with its malicious intent, according to sentence structure, thereby evading safety filters. Recursion further addresses the challenge posed by unsafe sub-prompts that cannot be separated through simple segmentation. It first expands the sub-prompt, then invokes segmentation recursively. To facilitate multi-turn adversarial prompt crafting, we build VisionFlow, an emulation T2I system that integrates two-stage safety filters and industrial-grade memory mechanisms. The experiment results show that Inception successfully elicits unsafe image generation, surpassing the SOTA by a 20.0% margin in attack success rate. We also conduct experiments on the real-world commercial T2I generation platforms, further validating the threats of Inception in practice.
[199] DanceGRPO: Unleashing GRPO on Visual Generation
Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, Ping Luo
Main category: cs.CV
TL;DR: DanceGRPO is a new RL framework that uses Group Relative Policy Optimization to overcome stability issues in visual generation tasks, achieving up to 181% performance improvement over baseline methods across multiple benchmarks.
Details
Motivation: Existing RL methods like DDPO and DPOK struggle with stable optimization when scaling to large and diverse prompt sets for visual generation, limiting their practical utility.
Method: Adapts Group Relative Policy Optimization (GRPO) for visual generation tasks, leveraging GRPO’s inherent stability mechanisms to overcome optimization challenges in both diffusion models and rectified flows.
Result: Outperforms baseline methods by up to 181% across multiple benchmarks (HPS-v2.1, CLIP Score, VideoAlign, GenEval), maintains robust performance across 3 tasks and 4 foundation models, and handles 5 distinct reward models for diverse human preferences.
Conclusion: DanceGRPO establishes a robust and versatile solution for scaling RLHF tasks in visual generation, offering new insights into harmonizing reinforcement learning with visual synthesis.
Abstract: Recent advances in generative AI have revolutionized visual content creation, yet aligning model outputs with human preferences remains a critical challenge. While Reinforcement Learning (RL) has emerged as a promising approach for fine-tuning generative models, existing methods like DDPO and DPOK face fundamental limitations - particularly their inability to maintain stable optimization when scaling to large and diverse prompt sets, severely restricting their practical utility. This paper presents DanceGRPO, a framework that addresses these limitations through an innovative adaptation of Group Relative Policy Optimization (GRPO) for visual generation tasks. Our key insight is that GRPO’s inherent stability mechanisms uniquely position it to overcome the optimization challenges that plague prior RL-based approaches on visual generation. DanceGRPO establishes several significant advances: First, it demonstrates consistent and stable policy optimization across multiple modern generative paradigms, including both diffusion models and rectified flows. Second, it maintains robust performance when scaling to complex, real-world scenarios encompassing three key tasks and four foundation models. Third, it shows remarkable versatility in optimizing for diverse human preferences as captured by five distinct reward models assessing image/video aesthetics, text-image alignment, video motion quality, and binary feedback. Our comprehensive experiments reveal that DanceGRPO outperforms baseline methods by up to 181% across multiple established benchmarks, including HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback (RLHF) tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis.
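At the heart of GRPO, as adapted here, is a group-relative advantage: the reward model scores a group of generations for the same prompt, and each sample is standardized against its own group's mean and standard deviation, removing the need for a learned value critic. A minimal sketch of that normalization step (array shapes and names are illustrative, not the paper's code):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: standardize each sample's reward
    against the mean/std of its own group (the generations sampled
    for one prompt). rewards has shape (num_prompts, group_size)."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 prompts, 4 generations each, scored by a reward model.
rewards = np.array([[0.7, 0.4, 0.9, 0.5],
                    [0.2, 0.3, 0.1, 0.4]])
print(grpo_advantages(rewards))
```

In the diffusion/rectified-flow setting described above, these advantages would weight the policy-gradient update along the denoising trajectory of each sampled generation.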
[200] Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models
Xin Huang, Ruibin Li, Tong Jia, Wei Zheng, Ya Wang
Main category: cs.CV
TL;DR: AHNPL improves compositional reasoning in VLMs by generating image-based hard negatives from text negatives and using adaptive contrastive learning with dynamic margins based on sample difficulty.
Details
Motivation: Existing methods neglect image-based negative samples and treat all negatives uniformly, leading to insufficient visual encoder training and poor handling of difficult sample pairs.
Method: Translates text-based hard negatives to visual domain, uses multimodal hard negative contrastive loss, and dynamic margin loss that adjusts based on sample difficulty.
Result: Experiments on three public datasets show improved performance on complex compositional reasoning tasks.
Conclusion: AHNPL effectively enhances VLMs’ compositional reasoning by better handling hard negatives and adapting to sample difficulty levels.
Abstract: Vision-Language Models (VLMs) are essential for multimodal tasks, especially compositional reasoning (CR) tasks, which require distinguishing fine-grained semantic differences between visual and textual embeddings. However, existing methods primarily fine-tune the model by generating text-based hard negative samples, neglecting the importance of image-based negative samples, which results in insufficient training of the visual encoder and ultimately impacts the overall performance of the model. Moreover, negative samples are typically treated uniformly, without considering their difficulty levels, and the alignment of positive samples is insufficient, which leads to challenges in aligning difficult sample pairs. To address these issues, we propose Adaptive Hard Negative Perturbation Learning (AHNPL). AHNPL translates text-based hard negatives into the visual domain to generate semantically disturbed image-based negatives for training the model, thereby enhancing its overall performance. AHNPL also introduces a contrastive learning approach using a multimodal hard negative loss to improve the model’s discrimination of hard negatives within each modality and a dynamic margin loss that adjusts the contrastive margin according to sample difficulty to enhance the distinction of challenging sample pairs. Experiments on three public datasets demonstrate that our method effectively boosts VLMs’ performance on complex CR tasks. The source code is available at https://github.com/nynu-BDAI/AHNPL.
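The dynamic margin idea can be illustrated with a triplet-style loss whose margin widens as the negative becomes harder. The difficulty proxy (anchor-negative similarity) and the coefficients below are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def dynamic_margin_loss(anchor, positive, negative,
                        base_margin=0.2, scale=0.3):
    """Triplet-style loss whose margin grows with sample difficulty,
    proxied here by how similar the hard negative is to the anchor."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negative, dim=-1)
    pos_sim = (a * p).sum(-1)                      # cosine similarities
    neg_sim = (a * n).sum(-1)
    difficulty = neg_sim.detach().clamp(min=0.0)   # harder = more similar
    margin = base_margin + scale * difficulty      # widen margin for hard pairs
    return F.relu(neg_sim - pos_sim + margin).mean()

# Toy usage with image/text embeddings of matching dimension.
a, p, n = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
print(dynamic_margin_loss(a, p, n))
```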
[201] Leadership Assessment in Pediatric Intensive Care Unit Team Training
Liangyang Ouyang, Yuki Sakai, Ryosuke Furuta, Hisataka Nozawa, Hikoro Matsui, Yoichi Sato
Main category: cs.CV
TL;DR: Automated framework using egocentric vision and multimodal data to assess PICU team leadership skills through behavioral cues like fixation objects, eye contact, and conversation patterns.
Details
Motivation: To develop an automated system for assessing pediatric intensive care unit (PICU) team leadership skills, which is crucial for training and improving team performance in critical care settings.
Method: Uses Aria Glasses to record egocentric video, audio, gaze, and head movement data. Processes data with REMoDNaV, SAM, YOLO, and ChatGPT for fixation object detection, eye contact detection, and conversation classification.
Result: Significant correlations found between leadership skills and behavioral metrics including fixation time, transition patterns, and direct orders in speech.
Conclusion: The proposed framework effectively automates leadership skill assessment for PICU teams using multimodal behavioral analysis.
Abstract: This paper addresses the task of assessing PICU team’s leadership skills by developing an automated analysis framework based on egocentric vision. We identify key behavioral cues, including fixation object, eye contact, and conversation patterns, as essential indicators of leadership assessment. In order to capture these multimodal signals, we employ Aria Glasses to record egocentric video, audio, gaze, and head movement data. We collect one-hour videos of four simulated sessions involving doctors with different roles and levels. To automate data processing, we propose a method leveraging REMoDNaV, SAM, YOLO, and ChatGPT for fixation object detection, eye contact detection, and conversation classification. In the experiments, significant correlations are observed between leadership skills and behavioral metrics, i.e., the output of our proposed methods, such as fixation time, transition patterns, and direct orders in speech. These results indicate that our proposed data collection and analysis framework can effectively automate skill assessment for PICU team training.
[202] Robust ID-Specific Face Restoration via Alignment Learning
Yushun Fang, Lu Liu, Xiang Gao, Qiang Hu, Ning Cao, Jianghe Cui, Gang Chen, Xiaoyun Zhang
Main category: cs.CV
TL;DR: RIDFR is a novel ID-specific face restoration framework using diffusion models that maintains identity fidelity by combining content from degraded images with identity information from reference images, outperforming state-of-the-art methods.
Details
Motivation: Current face restoration methods using diffusion priors suffer from identity uncertainty due to identity-obscure inputs and stochastic generative processes, which remains unresolved.
Method: RIDFR uses a pre-trained diffusion model with two parallel conditioning modules: Content Injection Module for degraded image input and Identity Injection Module for specific identity integration. It incorporates Alignment Learning to align restoration results from multiple references to suppress ID-irrelevant face semantics.
Result: Experiments show RIDFR outperforms state-of-the-art methods, reconstructing high-quality ID-specific results with high identity fidelity and demonstrating strong robustness.
Conclusion: The proposed RIDFR framework effectively addresses identity uncertainty in face restoration by combining content and identity conditioning with alignment learning, achieving superior performance in maintaining identity fidelity.
Abstract: The latest developments in Face Restoration have yielded significant advancements in visual quality through the utilization of diverse diffusion priors. Nevertheless, the uncertainty of face identity introduced by identity-obscure inputs and stochastic generative processes remains unresolved. To address this challenge, we present Robust ID-Specific Face Restoration (RIDFR), a novel ID-specific face restoration framework based on diffusion models. Specifically, RIDFR leverages a pre-trained diffusion model in conjunction with two parallel conditioning modules. The Content Injection Module inputs the severely degraded image, while the Identity Injection Module integrates the specific identity from a given image. Subsequently, RIDFR incorporates Alignment Learning, which aligns the restoration results from multiple references with the same identity in order to suppress the interference of ID-irrelevant face semantics (e.g. pose, expression, make-up, hair style). Experiments demonstrate that our framework outperforms the state-of-the-art methods, reconstructing high-quality ID-specific results with high identity fidelity and demonstrating strong robustness.
[203] InterAct-Video: Reasoning-Rich Video QA for Urban Traffic
Joseph Raj Vishal, Divesh Basina, Rutuja Patil, Manas Srinivas Gowda, Katha Naik, Yezhou Yang, Bharatesh Chakravarthi
Main category: cs.CV
TL;DR: InterAct VideoQA dataset for traffic monitoring with 8 hours of real-world footage and 25K+ QA pairs to benchmark VideoQA models in complex traffic scenarios.
Details
Motivation: Existing VideoQA models struggle with complex real-world traffic scenes where multiple concurrent events occur across spatiotemporal dimensions, limiting their effectiveness for traffic monitoring applications.
Method: Created a curated dataset with 8 hours of real-world traffic footage from diverse intersections, segmented into 10-second clips, and annotated with over 25,000 question-answer pairs covering spatiotemporal dynamics, vehicle interactions, and incident detection.
Result: Evaluation of state-of-the-art VideoQA models on InterAct VideoQA exposed challenges in reasoning over fine-grained spatiotemporal dependencies. Fine-tuning these models on the dataset yielded notable performance improvements.
Conclusion: Domain-specific datasets like InterAct VideoQA are necessary for developing effective VideoQA models for intelligent transportation systems, and the dataset is publicly available to facilitate future research.
Abstract: Traffic monitoring is crucial for urban mobility, road safety, and intelligent transportation systems (ITS). Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models, enabling structured insight extraction from traffic videos. However, existing VideoQA models struggle with the complexity of real-world traffic scenes, where multiple concurrent events unfold across spatiotemporal dimensions. To address these challenges, this paper introduces InterAct VideoQA, a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks. The InterAct VideoQA dataset comprises 8 hours of real-world traffic footage collected from diverse intersections, segmented into 10-second video clips, with over 25,000 question-answer (QA) pairs covering spatiotemporal dynamics, vehicle interactions, incident detection, and other critical traffic attributes. State-of-the-art VideoQA models are evaluated on InterAct VideoQA, exposing challenges in reasoning over fine-grained spatiotemporal dependencies within complex traffic scenarios. Additionally, fine-tuning these models on InterAct VideoQA yields notable performance improvements, demonstrating the necessity of domain-specific datasets for VideoQA. InterAct VideoQA is publicly available as a benchmark dataset to facilitate future research in real-world deployable VideoQA models for intelligent transportation systems. GitHub Repo: https://github.com/joe-rabbit/InterAct_VideoQA
[204] When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang
Main category: cs.CV
TL;DR: This paper presents the first systematic survey of multimodal long context token compression methods for MLLMs, categorizing approaches by modality (image, video, audio) and underlying mechanisms to address computational bottlenecks from quadratic attention complexity.
Details
Motivation: Multimodal LLMs face substantial computational challenges due to quadratic complexity of self-attention with numerous input tokens from high-resolution images, extended videos, and lengthy audio. Token compression has emerged as a critical approach to reduce computational burden during training and inference.
Method: The paper conducts a systematic survey and synthesis of multimodal token compression methods, categorizing them by: (1) modality focus - image-centric (spatial redundancy), video-centric (spatio-temporal redundancy), audio-centric (temporal/spectral redundancy); and (2) underlying mechanisms - transformation-based, similarity-based, attention-based, and query-based approaches.
Result: The survey provides a comprehensive and structured overview of current token compression techniques, consolidating progress in the field and identifying key challenges. It establishes a categorization framework that enables researchers to quickly access methods tailored to specific modalities and interests.
Conclusion: This work aims to inspire future research directions in multimodal token compression by providing the first systematic survey of this rapidly evolving domain. The authors maintain a public repository to continuously track and update the latest advances in this promising area of MLLM optimization.
Abstract: Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio input. While this ability significantly enhances MLLM capabilities, it introduces substantial computational challenges, primarily due to the quadratic complexity of self-attention mechanisms with numerous input tokens. To mitigate these bottlenecks, token compression has emerged as an auspicious and critical approach, efficiently reducing the number of tokens during both training and inference. In this paper, we present the first systematic survey and synthesis of the burgeoning field of multimodal long context token compression. Recognizing that effective compression strategies are deeply tied to the unique characteristics and redundancies of each modality, we categorize existing approaches by their primary data focus, enabling researchers to quickly access and learn methods tailored to their specific area of interest: (1) image-centric compression, which addresses spatial redundancy in visual data; (2) video-centric compression, which tackles spatio-temporal redundancy in dynamic sequences; and (3) audio-centric compression, which handles temporal and spectral redundancy in acoustic signals. Beyond this modality-driven categorization, we further dissect methods based on their underlying mechanisms, including transformation-based, similarity-based, attention-based, and query-based approaches. By providing a comprehensive and structured overview, this survey aims to consolidate current progress, identify key challenges, and inspire future research directions in this rapidly evolving domain. We also maintain a public repository to continuously track and update the latest advances in this promising area.
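As one concrete instance of the similarity-based family covered by the survey, tokens can be greedily merged by cosine similarity in the spirit of token-merging methods. This simplified sketch is illustrative and not taken from any specific surveyed paper:

```python
import torch

def merge_similar_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Similarity-based compression: average each of the r most similar
    token pairs into one token (a simplified token-merging scheme).
    x: (num_tokens, dim) for a single sequence."""
    n = x.size(0)
    xn = torch.nn.functional.normalize(x, dim=-1)
    sim = xn @ xn.T
    sim.fill_diagonal_(-float("inf"))  # never merge a token with itself
    merged, used = [], set()
    # Greedily pick the r most similar unused pairs and average them.
    for idx in sim.flatten().argsort(descending=True):
        i, j = divmod(idx.item(), n)
        if i in used or j in used:
            continue
        merged.append((x[i] + x[j]) / 2)
        used.update((i, j))
        if len(merged) == r:
            break
    keep = [x[i] for i in range(n) if i not in used]
    return torch.stack(keep + merged)

tokens = torch.randn(16, 64)
print(merge_similar_tokens(tokens, r=4).shape)  # torch.Size([12, 64])
```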
[205] From Promise to Practical Reality: Transforming Diffusion MRI Analysis with Fast Deep Learning Enhancement
Xinyi Wang, Michael Barnett, Frederique Boonstra, Yael Barnett, Mariano Cabezas, Arkiev D’Souza, Matthew C. Kiernan, Kain Kyle, Meng Law, Lynette Masters, Zihao Tang, Stephen Tisch, Sicong Tu, Anneke Van Der Walt, Dongang Wang, Fernando Calamante, Weidong Cai, Chenyu Wang
Main category: cs.CV
TL;DR: FastFOD-Net is an accelerated deep learning framework that enhances fiber orientation distribution (FOD) estimation from clinical-grade diffusion MRI data, enabling robust analysis across neurological disorders with 60x speed improvement.
Details
Motivation: Existing FOD enhancement methods have been primarily validated on healthy subjects, limiting clinical adoption. There's a need to validate deep learning-based FOD enhancement across diverse neurological conditions using widely available clinical protocols.
Method: FastFOD-Net is an end-to-end deep learning framework optimized for enhancing FODs from single-shell, low-angular-resolution clinical diffusion MRI acquisitions. It provides accelerated training and inference for clinical use.
Result: The framework achieves 60x faster performance compared to its predecessor and demonstrates superior performance across healthy controls and six neurological disorders. It enables robust analysis comparable to high-quality research acquisitions.
Conclusion: FastFOD-Net facilitates widespread clinical adoption of deep learning methods for diffusion MRI enhancement, building clinical trust and enabling disease differentiation, improved connectome interpretability, and reduced sample size requirements.
Abstract: Fiber orientation distribution (FOD) is an advanced diffusion MRI modeling technique that represents complex white matter fiber configurations, and a key step for subsequent brain tractography and connectome analysis. Its reliability and accuracy, however, heavily rely on the quality of the MRI acquisition and the subsequent estimation of the FODs at each voxel. Generating reliable FODs from widely available clinical protocols with single-shell and low-angular-resolution acquisitions remains challenging but could potentially be addressed with recent advances in deep learning-based enhancement techniques. Despite advancements, existing methods have predominantly been assessed on healthy subjects, which has proved to be a major hurdle for their clinical adoption. In this work, we validate a newly optimized enhancement framework, FastFOD-Net, across healthy controls and six neurological disorders. This accelerated end-to-end deep learning framework enhances FODs with superior performance and delivers the training/inference efficiency needed for clinical use (60× faster compared to its predecessor). With the most comprehensive clinical evaluation to date, our work demonstrates the potential of FastFOD-Net in accelerating clinical neuroscience research, empowering diffusion MRI analysis for disease differentiation, improving interpretability in connectome applications, and reducing measurement errors to lower sample size requirements. Critically, this work will facilitate the more widespread adoption of, and build clinical trust in, deep learning based methods for diffusion MRI enhancement. Specifically, FastFOD-Net enables robust analysis of real-world, clinical diffusion MRI data, comparable to that achievable with high-quality research acquisitions.
[206] MIDAS: Multimodal Interactive Digital-humAn Synthesis via Real-time Autoregressive Video Generation
Ming Chen, Liyuan Cui, Wenyuan Zhang, Haoxian Zhang, Yan Zhou, Xiaohan Li, Songlin Tang, Jiwen Liu, Borui Liao, Hejia Chen, Xiaoqiang Liu, Pengfei Wan
Main category: cs.CV
TL;DR: An autoregressive video generation framework for interactive digital humans with multimodal control (audio, pose, text) and real-time streaming capability using LLM modifications and deep compression.
Details
Motivation: Existing interactive digital human video generation methods struggle with heavy computational costs and limited controllability, making real-time interaction with diverse input signals challenging.
Method: Autoregressive framework with minimal LLM modifications to accept multimodal condition encodings, using a diffusion head for denoising. Includes a deep compression autoencoder (64× reduction) and trained on a 20,000-hour dialogue dataset.
Result: Achieves low latency, high efficiency, and fine-grained multimodal controllability in duplex conversations, multilingual human synthesis, and interactive world modeling scenarios.
Conclusion: The framework successfully enables interactive multimodal control and real-time streaming video generation with significantly reduced computational burden and improved controllability.
Abstract: Recently, interactive digital human video generation has attracted widespread attention and achieved remarkable progress. However, building such a practical system that can interact with diverse input signals in real time remains challenging for existing methods, which often struggle with heavy computational cost and limited controllability. In this work, we introduce an autoregressive video generation framework that enables interactive multimodal control and low-latency extrapolation in a streaming manner. With minimal modifications to a standard large language model (LLM), our framework accepts multimodal condition encodings including audio, pose, and text, and outputs spatially and semantically coherent representations to guide the denoising process of a diffusion head. To support this, we construct a large-scale dialogue dataset of approximately 20,000 hours from multiple sources, providing rich conversational scenarios for training. We further introduce a deep compression autoencoder with up to a 64× reduction ratio, which effectively alleviates the long-horizon inference burden of the autoregressive model. Extensive experiments on duplex conversation, multilingual human synthesis, and interactive world modeling highlight the advantages of our approach in low latency, high efficiency, and fine-grained multimodal controllability.
[207] An MLP Baseline for Handwriting Recognition Using Planar Curvature and Gradient Orientation
Azam Nouri
Main category: cs.CV
TL;DR: Curvature-based MLP achieves 97% on MNIST and 89% on EMNIST using only second-order geometric features without CNNs
Details
Motivation: To investigate if second-order geometric cues alone can drive handwritten character recognition as an alternative to convolutional neural networks.
Method: Multilayer perceptron classifier using three handcrafted feature maps: planar curvature magnitude, curvature sign, and gradient orientation.
Result: 97% accuracy on MNIST digits and 89% accuracy on EMNIST letters
Conclusion: Curvature-based representations have strong discriminative power and deep learning advantages can be achieved with interpretable hand-engineered features
Abstract: This study investigates whether second-order geometric cues - planar curvature magnitude, curvature sign, and gradient orientation - are sufficient on their own to drive a multilayer perceptron (MLP) classifier for handwritten character recognition (HCR), offering an alternative to convolutional neural networks (CNNs). Using these three handcrafted feature maps as inputs, our curvature-orientation MLP achieves 97 percent accuracy on MNIST digits and 89 percent on EMNIST letters. These results underscore the discriminative power of curvature-based representations for handwritten character images and demonstrate that the advantages of deep learning can be realized even with interpretable, hand-engineered features.
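The three input maps can be computed from first- and second-order image derivatives. The isophote-curvature formula below is one plausible reading of "planar curvature" for a grayscale intensity surface, offered as a hedged sketch rather than the paper's verified implementation:

```python
import numpy as np

def curvature_orientation_features(img: np.ndarray, eps: float = 1e-6):
    """Three handcrafted maps from a grayscale image: isophote-curvature
    magnitude, curvature sign, and gradient orientation."""
    Iy, Ix = np.gradient(img.astype(float))   # first-order derivatives
    Ixy, Ixx = np.gradient(Ix)                # second-order derivatives
    Iyy, _ = np.gradient(Iy)
    num = Ixx * Iy**2 - 2 * Ixy * Ix * Iy + Iyy * Ix**2
    den = (Ix**2 + Iy**2) ** 1.5 + eps
    kappa = num / den                          # isophote curvature
    return np.abs(kappa), np.sign(kappa), np.arctan2(Iy, Ix)

img = np.random.rand(28, 28)                   # stand-in for an MNIST digit
mag, sign, orient = curvature_orientation_features(img)
print(mag.shape, sign.shape, orient.shape)
```

The three maps would then be flattened and concatenated as input to the MLP classifier described above.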
[208] A Sobel-Gradient MLP Baseline for Handwritten Character Recognition
Azam Nouri
Main category: cs.CV
TL;DR: Using only Sobel edge maps as input, a simple MLP achieves near-CNN performance on handwritten character recognition with smaller memory footprint and transparent features.
Details
Motivation: To explore whether first-order edge maps (Sobel derivatives) are sufficient for handwritten character recognition as an alternative to complex convolutional neural networks, testing if simple gradient information can capture discriminative features.
Method: Train a multilayer perceptron (MLP) using only horizontal and vertical Sobel derivatives as input features on MNIST and EMNIST Letters datasets, comparing against CNN performance.
Result: The MLP achieved 98% accuracy on MNIST digits and 92% on EMNIST letters, approaching CNN performance while offering smaller memory footprint and more transparent features.
Conclusion: First-order gradients capture most class-discriminative information in handwritten characters, making edge-aware MLPs a compelling alternative to CNNs for HCR tasks due to their simplicity and efficiency.
Abstract: We revisit the classical Sobel operator to ask a simple question: Are first-order edge maps sufficient to drive an all-dense multilayer perceptron (MLP) for handwritten character recognition (HCR), as an alternative to convolutional neural networks (CNNs)? Using only horizontal and vertical Sobel derivatives as input, we train an MLP on MNIST and EMNIST Letters. Despite its extreme simplicity, the resulting network reaches 98% accuracy on MNIST digits and 92% on EMNIST letters – approaching CNNs while offering a smaller memory footprint and transparent features. Our findings highlight that much of the class-discriminative information in handwritten character images is already captured by first-order gradients, making edge-aware MLPs a compelling option for HCR.
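The pipeline is simple enough to sketch end to end: fixed Sobel kernels produce horizontal and vertical edge maps, which are flattened and fed to an all-dense MLP. The layer sizes here are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
SOBEL_Y = SOBEL_X.T.clone()

def sobel_features(images: torch.Tensor) -> torch.Tensor:
    """images: (B, 1, H, W) -> flattened horizontal + vertical Sobel maps."""
    k = torch.stack([SOBEL_X, SOBEL_Y]).unsqueeze(1)   # (2, 1, 3, 3)
    edges = F.conv2d(images, k, padding=1)             # (B, 2, H, W)
    return edges.flatten(1)

# All-dense classifier over the edge maps (sizes are assumptions).
mlp = nn.Sequential(
    nn.Linear(2 * 28 * 28, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

batch = torch.rand(32, 1, 28, 28)   # stand-in for MNIST images
logits = mlp(sobel_features(batch))
print(logits.shape)                 # torch.Size([32, 10])
```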
[209] Interact-Custom: Customized Human Object Interaction Image Generation
Zhu Xu, Zhaowen Wang, Yuxin Peng, Yang Liu
Main category: cs.CV
TL;DR: Proposes CHOI task for customized human-object interaction image generation with identity preservation and interaction control, introduces Interact-Custom model with spatial configuration modeling and two-stage generation.
Details
Motivation: Existing approaches focus on appearance preservation but neglect fine-grained interaction control between target entities, particularly in human-object interaction scenarios.
Method: Two-stage Interact-Custom model: first generates foreground mask to model spatial configuration, then generates target human-object interactions while preserving identity features. Uses large-scale dataset with same human-object pairs in different interactive poses.
Result: Extensive experiments on tailored metrics demonstrate the effectiveness of the approach for simultaneous identity preservation and interaction semantic control.
Conclusion: The proposed CHOI task and Interact-Custom model successfully address the challenges of identity preservation and interaction control, providing high content controllability for customized human-object interaction image generation.
Abstract: Compositional Customized Image Generation aims to customize multiple target concepts within generation content, which has gained attention for its wide applications. Existing approaches mainly concentrate on the target entity’s appearance preservation, while neglecting the fine-grained interaction control among target entities. To equip models with such interaction control capability, we focus on the human-object interaction scenario and propose the task of Customized Human Object Interaction Image Generation (CHOI), which simultaneously requires identity preservation for the target human object and interaction semantic control between them. Two primary challenges exist for CHOI: (1) simultaneous identity preservation and interaction control demands require the model to decompose the human object into self-contained identity features and pose-oriented interaction features, while current HOI image datasets fail to provide ideal samples for such feature-decomposed learning. (2) Inappropriate spatial configuration between human and object may lead to the lack of desired interaction semantics. To tackle this, we first process a large-scale dataset, where each sample encompasses the same pair of human and object involved in different interactive poses. Then we design a two-stage model, Interact-Custom, which first explicitly models the spatial configuration by generating a foreground mask depicting the interaction behavior, then, under the guidance of this mask, generates the target human object interacting while preserving their identity features. Furthermore, if the background image and the union location where the target human object should appear are provided by users, Interact-Custom also provides the optional functionality to specify them, offering high content controllability. Extensive experiments on our tailored metrics for the CHOI task demonstrate the effectiveness of our approach.
[210] Enhancing Document VQA Models via Retrieval-Augmented Generation
Eric López, Artemis Llabrés, Ernest Valveny
Main category: cs.CV
TL;DR: RAG-based Document VQA with text and visual retrieval variants significantly outperforms the concatenate-all-pages baseline, improving results by up to +22.5 ANLS with OCR-based retrieval and by +5.0 ANLS with purely visual retrieval that needs no OCR extraction.
Details
Motivation: Current Document VQA systems struggle with multi-page documents due to memory constraints from concatenating all pages or using large vision-language models. RAG offers a memory-efficient alternative by retrieving relevant segments first.
Method: Systematically evaluated RAG integration into Document VQA using two retrieval variants: text-based retrieval with OCR tokens and purely visual retrieval without OCR. Tested across multiple models and benchmarks including MP-DocVQA, DUDE, and InfographicVQA.
Result: Text-centric RAG improved baseline by up to +22.5 ANLS, while visual variant achieved +5.0 ANLS improvement without text extraction. Retrieval and reranking components drove most gains, while layout-guided chunking strategy failed to help.
Conclusion: Careful evidence selection through RAG consistently boosts accuracy across multiple model sizes and multi-page benchmarks, demonstrating practical value for real-world Document VQA applications.
Abstract: Document Visual Question Answering (Document VQA) must cope with documents that span dozens of pages, yet leading systems still concatenate every page or rely on very large vision-language models, both of which are memory-hungry. Retrieval-Augmented Generation (RAG) offers an attractive alternative, first retrieving a concise set of relevant segments before generating answers from this selected evidence. In this paper, we systematically evaluate the impact of incorporating RAG into Document VQA through different retrieval variants - text-based retrieval using OCR tokens and purely visual retrieval without OCR - across multiple models and benchmarks. Evaluated on the multi-page datasets MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the “concatenate-all-pages” baseline by up to +22.5 ANLS, while the visual variant achieves +5.0 ANLS improvement without requiring any text extraction. An ablation confirms that retrieval and reranking components drive most of the gain, whereas the layout-guided chunking strategy - proposed in several recent works to leverage page structure - fails to help on these datasets. Our experiments demonstrate that careful evidence selection consistently boosts accuracy across multiple model sizes and multi-page benchmarks, underscoring its practical value for real-world Document VQA.
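The retrieval step at the core of this setup reduces to ranking page (or segment) embeddings against the question embedding and keeping the top-k as evidence for the generator. A minimal sketch with assumed embedding dimensions:

```python
import numpy as np

def retrieve_pages(query_emb: np.ndarray, page_embs: np.ndarray, k: int = 3):
    """Rank document pages by cosine similarity to the question embedding
    and keep the top-k as evidence for the answer generator."""
    q = query_emb / np.linalg.norm(query_emb)
    p = page_embs / np.linalg.norm(page_embs, axis=1, keepdims=True)
    scores = p @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

pages = np.random.randn(40, 768)    # e.g., one embedding per page
question = np.random.randn(768)     # question embedding from the same encoder
idx, scores = retrieve_pages(question, pages)
print(idx, scores)
```

In the text-centric variant the embeddings would come from OCR tokens, and in the visual variant directly from page images; a reranker can then reorder the retrieved candidates before generation.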
[211] CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning
Nannan Zhu, Yonghao Dong, Teng Wang, Xueqian Li, Shengjun Deng, Yijia Wang, Zheng Hong, Tiantian Geng, Guo Niu, Hanyan Huang, Xiongfei Yao, Shuaiwei Jiao
Main category: cs.CV
TL;DR: CVBench is the first comprehensive benchmark for evaluating cross-video relational reasoning in MLLMs, revealing significant performance gaps compared to human capabilities and identifying architectural limitations in current models.
Details
Motivation: Multimodal LLMs show strong single-video performance but their ability to reason across multiple videos remains unexplored, despite being essential for real-world applications like multi-camera surveillance and cross-video procedural learning.
Method: Developed CVBench with 1,000 QA pairs across three hierarchical tiers: object association, event association, and complex reasoning. Built from five domain-diverse video clusters and evaluated 10+ leading MLLMs under zero-shot and chain-of-thought prompting.
Result: Significant performance gaps found - top models like GPT-4o achieve only 60% accuracy on causal reasoning vs 91% human performance. Identified fundamental bottlenecks: deficient inter-video context retention and poor disambiguation of overlapping entities.
Conclusion: CVBench establishes a rigorous framework for diagnosing multi-video reasoning limitations and provides architectural insights for next-generation MLLMs, with data and code publicly available.
Abstract: While multimodal large language models (MLLMs) exhibit strong performance on single-video tasks (e.g., video question answering), their ability across multiple videos remains critically underexplored. However, this capability is essential for real-world applications, including multi-camera surveillance and cross-video procedural learning. To bridge this gap, we present CVBench, the first comprehensive benchmark designed to assess cross-video relational reasoning rigorously. CVBench comprises 1,000 question-answer pairs spanning three hierarchical tiers: cross-video object association (identifying shared entities), cross-video event association (linking temporal or causal event chains), and cross-video complex reasoning (integrating commonsense and domain knowledge). Built from five domain-diverse video clusters (e.g., sports, life records), the benchmark challenges models to synthesise information across dynamic visual contexts. We extensively evaluate 10+ leading MLLMs (including GPT-4o, Gemini-2.0-flash, and Qwen2.5-VL) under zero-shot and chain-of-thought prompting paradigms. Key findings reveal stark performance gaps: even top models, such as GPT-4o, achieve only 60% accuracy on causal reasoning tasks, compared to the 91% accuracy of human performance. Crucially, our analysis reveals fundamental bottlenecks inherent in current MLLM architectures, notably deficient inter-video context retention and poor disambiguation of overlapping entities. CVBench establishes a rigorous framework for diagnosing and advancing multi-video reasoning, offering architectural insights for next-generation MLLMs. The data and evaluation code are available at https://github.com/Hokhim2/CVBench.
[212] Video-LevelGauge: Investigating Contextual Positional Bias in Large Video Language Models
Hou Xia, Zheren Fu, Fangcan Ling, Jiajun Li, Yi Tu, Zhendong Mao, Yongdong Zhang
Main category: cs.CV
TL;DR: Video-LevelGauge is a benchmark that systematically evaluates positional bias in large video language models, revealing significant biases in open-source models while commercial models show more consistent performance.
Details
Motivation: Existing video understanding benchmarks assess overall performance but overlook nuanced behaviors like contextual positional bias, which is critical for understanding LVLM limitations.
Method: The benchmark uses standardized probes and customized contextual setups with flexible control over context length, probe position, and contextual types. It employs statistical measures and morphological pattern recognition for bias analysis across 438 curated videos with 1,177 multiple-choice and 120 open-ended questions.
Result: Evaluation of 27 state-of-the-art LVLMs reveals significant positional biases in many leading open-source models (typically head or neighbor-content preferences), while commercial models like Gemini2.5-Pro show impressive, consistent performance across entire sequences.
Conclusion: The benchmark provides actionable insights for mitigating bias and guiding model enhancement, demonstrating the importance of systematic positional bias evaluation in video language models.
Abstract: Large video language models (LVLMs) have made notable progress in video understanding, spurring the development of corresponding evaluation benchmarks. However, existing benchmarks generally assess overall performance across entire video sequences, overlooking nuanced behaviors such as contextual positional bias, a critical yet under-explored aspect of LVLM performance. We present Video-LevelGauge, a dedicated benchmark designed to systematically assess positional bias in LVLMs. We employ standardized probes and customized contextual setups, allowing flexible control over context length, probe position, and contextual types to simulate diverse real-world scenarios. In addition, we introduce a comprehensive analysis method that combines statistical measures with morphological pattern recognition to characterize bias. Our benchmark comprises 438 manually curated videos spanning multiple types, yielding 1,177 high-quality multiple-choice questions and 120 open-ended questions, validated for their effectiveness in exposing positional bias. Based on these, we evaluate 27 state-of-the-art LVLMs, including both commercial and open-source models. Our findings reveal significant positional biases in many leading open-source models, typically exhibiting head or neighbor-content preferences. In contrast, commercial models such as Gemini2.5-Pro show impressive, consistent performance across entire video sequences. Further analyses on context length, context variation, and model scale provide actionable insights for mitigating bias and guiding model enhancement. https://github.com/Cola-any/Video-LevelGauge
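Measuring this kind of bias reduces to binning probe positions within the context and comparing per-bin accuracy: a flat curve indicates little positional bias, while a downward slope indicates a head preference. A toy sketch on synthetic data (the binning scheme is an assumption, not the benchmark's analysis code):

```python
import numpy as np

def positional_accuracy(positions, correct, num_bins=5):
    """Bin probe positions (0..1, fraction of the context) and compute
    per-bin accuracy; a flat curve indicates little positional bias."""
    positions, correct = np.asarray(positions), np.asarray(correct)
    bins = np.minimum((positions * num_bins).astype(int), num_bins - 1)
    return np.array([correct[bins == b].mean() for b in range(num_bins)])

pos = np.random.rand(1000)                        # where each probe sat
acc = np.random.rand(1000) < (0.7 - 0.2 * pos)    # toy head-preference model
print(positional_accuracy(pos, acc))              # accuracy falls with position
```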
[213] Ego-centric Predictive Model Conditioned on Hand Trajectories
Binjie Zhang, Mike Zheng Shou
Main category: cs.CV
TL;DR: A unified two-stage framework for egocentric scenarios that jointly predicts next actions and their visual outcomes using hand trajectories and multi-modal fusion with causal cross-attention.
Details
Motivation: Existing approaches either focus on action prediction without visual outcome modeling (VLA models) or generate future frames without action conditioning (video prediction models), leading to incomplete understanding of human-object interactions.
Method: Two-stage approach: 1) Consecutive state modeling processes visual observations, language, and action history to predict future hand trajectories; 2) Causal cross-attention fuses multi-modal cues to guide a Latent Diffusion Model for frame-by-frame future video generation.
Result: Outperforms state-of-the-art baselines on Ego4D, BridgeData, and RLBench datasets in both action prediction and future video synthesis tasks.
Conclusion: The proposed framework successfully bridges the gap between action prediction and visual outcome modeling, providing a unified solution for egocentric human activity understanding and robotic manipulation tasks with explicit predictions of actions and their visual consequences.
Abstract: In egocentric scenarios, anticipating both the next action and its visual outcome is essential for understanding human-object interactions and for enabling robotic planning. However, existing paradigms fall short of jointly modeling these aspects. Vision-Language-Action (VLA) models focus on action prediction but lack explicit modeling of how actions influence the visual scene, while video prediction models generate future frames without conditioning on specific actions, often resulting in implausible or contextually inconsistent outcomes. To bridge this gap, we propose a unified two-stage predictive framework that jointly models action and visual future in egocentric scenarios, conditioned on hand trajectories. In the first stage, we perform consecutive state modeling to process heterogeneous inputs (visual observations, language, and action history) and explicitly predict future hand trajectories. In the second stage, we introduce causal cross-attention to fuse multi-modal cues, leveraging inferred action signals to guide an image-based Latent Diffusion Model (LDM) for frame-by-frame future video generation. Our approach is the first unified model designed to handle both egocentric human activity understanding and robotic manipulation tasks, providing explicit predictions of both upcoming actions and their visual consequences. Extensive experiments on Ego4D, BridgeData, and RLBench demonstrate that our method outperforms state-of-the-art baselines in both action prediction and future video synthesis.
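The second stage's causal cross-attention can be sketched as frame-latent queries attending to the fused multimodal cues under a causal mask, so frame t never conditions on cues from later steps. Dimensions and the per-step masking granularity are assumptions for illustration:

```python
import torch
import torch.nn as nn

class CausalCrossAttention(nn.Module):
    """Frame latents (queries) attend to fused action/trajectory cues
    (keys/values); a causal mask stops frame t from seeing cues > t."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frames: torch.Tensor, cues: torch.Tensor):
        T = frames.size(1)
        # Boolean mask: True marks disallowed (future) cue positions.
        mask = torch.triu(torch.ones(T, cues.size(1), dtype=torch.bool), 1)
        out, _ = self.attn(frames, cues, cues, attn_mask=mask)
        return out

layer = CausalCrossAttention(dim=256)
frames = torch.randn(2, 16, 256)   # (B, T, D) frame latents
cues = torch.randn(2, 16, 256)     # (B, T, D) fused multimodal cues
print(layer(frames, cues).shape)   # torch.Size([2, 16, 256])
```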
cs.AI
[214] ArgRAG: Explainable Retrieval Augmented Generation using Quantitative Bipolar Argumentation
Yuqicheng Zhu, Nico Potyka, Daniel Hernández, Yuan He, Zifeng Ding, Bo Xiong, Dongzhuoran Zhou, Evgeny Kharlamov, Steffen Staab
Main category: cs.AI
TL;DR: ArgRAG replaces black-box RAG with structured argumentation framework for explainable and contestable reasoning in high-stakes domains
Details
Motivation: RAG systems suffer from sensitivity to noisy/contradictory evidence and opaque decision-making in critical applications.
Method: Uses Quantitative Bipolar Argumentation Framework (QBAF) to construct structured inference from retrieved documents and performs deterministic reasoning under gradual semantics.
Result: Achieves strong accuracy on PubHealth and RAGuard fact verification benchmarks while significantly improving transparency
Conclusion: ArgRAG provides an explainable and contestable alternative to traditional RAG systems, enabling faithful explanation and challenging of decisions
Abstract: Retrieval-Augmented Generation (RAG) enhances large language models by incorporating external knowledge, yet suffers from critical limitations in high-stakes domains – namely, sensitivity to noisy or contradictory evidence and opaque, stochastic decision-making. We propose ArgRAG, an explainable, and contestable alternative that replaces black-box reasoning with structured inference using a Quantitative Bipolar Argumentation Framework (QBAF). ArgRAG constructs a QBAF from retrieved documents and performs deterministic reasoning under gradual semantics. This allows faithfully explaining and contesting decisions. Evaluated on two fact verification benchmarks, PubHealth and RAGuard, ArgRAG achieves strong accuracy while significantly improving transparency.
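Gradual semantics on a QBAF assign each argument a strength in [0, 1] by combining its base score with the strengths of its attackers and supporters. The fixed-point rule below is one simple such semantics, written for illustration; ArgRAG's exact semantics may differ:

```python
def gradual_strengths(base, attackers, supporters, iters=50):
    """Fixed-point iteration for a simple gradual semantics on a QBAF:
    an argument's strength is pulled below its base score by strong
    attackers and above it by strong supporters (all values in [0, 1])."""
    s = dict(base)
    for _ in range(iters):
        new = {}
        for a in base:
            att = max((s[b] for b in attackers.get(a, [])), default=0.0)
            sup = max((s[b] for b in supporters.get(a, [])), default=0.0)
            # Supporters raise the score toward 1, attackers lower it toward 0.
            new[a] = base[a] + (1 - base[a]) * sup - base[a] * att
        s = new
    return s

base = {"claim": 0.5, "evidence": 0.8, "counter": 0.6}
supporters = {"claim": ["evidence"]}   # edges built from retrieved documents
attackers = {"claim": ["counter"]}
print(gradual_strengths(base, attackers, supporters))
```

The final strength of the claim argument then serves as the verdict's score, while the attack/support edges provide the explanation a user can inspect and contest.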
[215] QAgent: An LLM-based Multi-Agent System for Autonomous OpenQASM programming
Zhenxiao Fu, Fan Chen, Lei Jiang
Main category: cs.AI
TL;DR: QAgent is an LLM-powered multi-agent system that automates OpenQASM programming, improving quantum code generation accuracy by 71.6% compared to previous static LLM approaches.
Details
Motivation: NISQ devices show quantum advantages but programming them in OpenQASM remains challenging for non-experts, while existing LLM-based quantum tools are limited to specialized tasks.
Method: Multi-agent system integrating task planning, few-shot learning, RAG for long-term context, predefined generation tools, and chain-of-thought reasoning to systematically improve compilation and functional correctness.
Result: QAgent enhances QASM code generation accuracy by 71.6% across multiple LLMs of varying sizes compared to previous static LLM-based approaches.
Conclusion: This multi-agent system democratizes quantum programming, bridges expertise gaps, and accelerates practical adoption of quantum computing.
Abstract: Noisy Intermediate-Scale Quantum (NISQ) devices have begun to exhibit early quantum advantages on classically intractable problems, spanning physics simulations to Gaussian boson sampling. Yet, realizing these benefits remains challenging for non-experts, primarily due to the complexities of programming in Open Quantum Assembly Language (OpenQASM). Although Large Language Model (LLM)-based agents have shown promise in automating classical programming workflows, their quantum counterparts have largely been restricted to specialized tasks such as quantum chemistry or error correction. In this paper, we present QAgent, an LLM-powered multi-agent system that fully automates OpenQASM programming. By integrating task planning, in-context few-shot learning, retrieval-augmented generation (RAG) for long-term context, predefined generation tools, and chain-of-thought (CoT) reasoning, the agents systematically improve both compilation and functional correctness. Our evaluations demonstrate substantial improvements: across multiple LLMs of varying sizes, QAgent enhances the accuracy of QASM code generation by 71.6% compared to previous static LLM-based approaches. We envision this multi-agent system as a key enabler for democratizing quantum programming, bridging expertise gaps, and accelerating the practical adoption of quantum computing.
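A key feedback signal in such a compile-and-correct loop is whether the generated OpenQASM parses at all. Assuming Qiskit is installed, a cheap validity check might look like the sketch below (an illustration of the loop's feedback step, not QAgent's actual tooling):

```python
from qiskit import QuantumCircuit

def compiles(qasm_src: str) -> tuple[bool, str]:
    """Cheap compilation check an agent loop can use as feedback:
    try to parse OpenQASM 2 source, return (ok, error message)."""
    try:
        QuantumCircuit.from_qasm_str(qasm_src)
        return True, ""
    except Exception as exc:   # parse/semantic errors become repair hints
        return False, str(exc)

# A Bell-state program an agent might emit on a first attempt.
bell = """OPENQASM 2.0;
include "qelib1.inc";
qreg q[2]; creg c[2];
h q[0]; cx q[0], q[1];
measure q -> c;
"""
print(compiles(bell))  # (True, '')
```

On failure, the error string would be fed back to the generating agent as a repair hint before the next attempt.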
[216] Array-Based Monte Carlo Tree Search
James Ragan, Fred Y. Hadaegh, Soon-Jo Chung
Main category: cs.AI
TL;DR: Array-based implementation of MCTS algorithm that eliminates branch prediction needs, achieving up to 2.8x better scaling with search depth.
Details
Motivation: Faster MCTS implementations allow more simulations within the same time, directly improving search performance in decision making problems.
Method: Alternative array-based implementation of Upper Confidence bounds applied to Trees algorithm that preserves original logic but eliminates branch prediction requirements.
Result: Enables faster performance on pipelined processors and achieves up to 2.8 times better scaling with search depth in numerical simulations.
Conclusion: The array-based approach provides significant performance improvements for MCTS algorithms while maintaining the original algorithm’s logic and effectiveness.
Abstract: Monte Carlo Tree Search is a popular method for solving decision making problems. Faster implementations allow for more simulations within the same wall clock time, directly improving search performance. To this end, we present an alternative array-based implementation of the classic Upper Confidence bounds applied to Trees algorithm. Our method preserves the logic of the original algorithm, but eliminates the need for branch prediction, enabling faster performance on pipelined processors, and up to 2.8 times better scaling with search depth in our numerical simulations.
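The branch-elimination idea can be illustrated with a fully vectorized UCB1 selection over flat arrays, where unvisited children receive infinite priority through arithmetic rather than an if-branch. This sketch follows the spirit of the paper's description, not its actual implementation:

```python
import numpy as np

def uct_select(children, visits, values, total_visits, c=1.4):
    """Pick a child by UCB1 using pure array arithmetic: no per-child
    branching, so the hot loop vectorizes well on pipelined processors.
    Unvisited children get +inf priority via divide-by-zero handling."""
    with np.errstate(divide="ignore", invalid="ignore"):
        mean = np.where(visits > 0, values / np.maximum(visits, 1), 0.0)
        explore = c * np.sqrt(np.log(total_visits) / visits)  # inf if visits==0
    return children[np.argmax(mean + explore)]

children = np.array([101, 102, 103])        # node indices in flat arrays
visits = np.array([10., 0., 5.])
values = np.array([6., 0., 4.])
print(uct_select(children, visits, values, total_visits=15))  # 102 (unvisited)
```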
[217] The Anatomy of a Personal Health Agent
A. Ali Heydari, Ken Gu, Vidya Srinivas, Hong Yu, Zhihan Zhang, Yuwei Zhang, Akshay Paruchuri, Qian He, Hamid Palangi, Nova Hammerquist, Ahmed A. Metwally, Brent Winslow, Yubin Kim, Kumar Ayush, Yuzhe Yang, Girish Narayanswamy, Maxwell A. Xu, Jake Garrison, Amy Aremnto Lee, Jenny Vafeiadou, Ben Graef, Isaac R. Galatzer-Levy, Erik Schenck, Andrew Barakat, Javier Perez, Jacqueline Shreibati, John Hernandez, Anthony Z. Faranesh, Javier L. Prieto, Connor Heneghan, Yun Liu, Jiening Zhan, Mark Malhotra, Shwetak Patel, Tim Althoff, Xin Liu, Daniel McDuff, Xuhai “Orson” Xu
Main category: cs.AI
TL;DR: A multi-agent personal health framework that analyzes multimodal health data from wearables and records to provide personalized recommendations through three specialist sub-agents: data analysis, domain expertise, and health coaching.
Details
Motivation: To address the underexplored application of health agents in daily non-clinical settings by creating a comprehensive personal health assistant that can reason about multimodal consumer wellness data and provide personalized health recommendations.
Method: Developed a multi-agent framework (PHA) with three specialist sub-agents: data science agent for time-series analysis, health domain expert for personalized insights, and health coach agent using psychological strategies. Conducted user-centered design process with web search analysis, health forum queries, and expert/user insights.
Result: Conducted comprehensive evaluation across 10 benchmark tasks with over 7,000 annotations and 1,100 hours of expert/user effort, representing the most extensive health agent evaluation to date.
Conclusion: Establishes a strong foundation for accessible personal health agents that can dynamically address individual health needs through multimodal data analysis and personalized interactions.
Abstract: Health is a fundamental pillar of human wellness, and the rapid advancements in large language models (LLMs) have driven the development of a new generation of health agents. However, the application of health agents to fulfill the diverse needs of individuals in daily non-clinical settings is underexplored. In this work, we aim to build a comprehensive personal health agent that is able to reason about multimodal data from everyday consumer wellness devices and common personal health records, and provide personalized health recommendations. To understand end-users’ needs when interacting with such an assistant, we conducted an in-depth analysis of web search and health forum queries, alongside qualitative insights from users and health experts gathered through a user-centered design process. Based on these findings, we identified three major categories of consumer health needs, each of which is supported by a specialist sub-agent: (1) a data science agent that analyzes personal time-series wearable and health record data, (2) a health domain expert agent that integrates users’ health and contextual data to generate accurate, personalized insights, and (3) a health coach agent that synthesizes data insights, guiding users using a specified psychological strategy and tracking users’ progress. Furthermore, we propose and develop the Personal Health Agent (PHA), a multi-agent framework that enables dynamic, personalized interactions to address individual health needs. To evaluate each sub-agent and the multi-agent system, we conducted automated and human evaluations across 10 benchmark tasks, involving more than 7,000 annotations and 1,100 hours of effort from health experts and end-users. Our work represents the most comprehensive evaluation of a health agent to date and establishes a strong foundation towards the futuristic vision of a personal health agent accessible to everyone.
[218] IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement
Yuanzhe Shen, Zisu Huang, Zhengkang Guo, Yide Liu, Guanxu Chen, Ruicheng Yin, Xiaoqing Zheng, Xuanjing Huang
Main category: cs.AI
TL;DR: IntentionReasoner is a novel safeguard mechanism that uses intent reasoning and query rewriting to enhance LLM safety while reducing over-refusal of harmless prompts.
Details
Motivation: Large language models can generate harmful content, but existing safety measures often excessively reject harmless queries, creating a need to balance safety, over-refusal, and utility.
Method: Uses a dedicated guard model with supervised fine-tuning on 163K annotated queries, followed by multi-reward optimization combining rule-based heuristics and reward models in a reinforcement learning framework.
Result: Excels in safeguard benchmarks, generation quality evaluations, and jailbreak attack scenarios, significantly improving safety while reducing over-refusal rates and improving response quality.
Conclusion: IntentionReasoner effectively addresses the safety-utility tradeoff in LLMs through intent reasoning and multi-level safety classification, providing a robust safeguard mechanism.
Abstract: The rapid advancement of large language models (LLMs) has driven their adoption across diverse domains, yet their ability to generate harmful content poses significant safety challenges. While extensive research has focused on mitigating harmful outputs, such efforts often come at the cost of excessively rejecting harmless prompts. Striking a balance among safety, over-refusal, and utility remains a critical challenge. In this work, we introduce IntentionReasoner, a novel safeguard mechanism that leverages a dedicated guard model to perform intent reasoning, multi-level safety classification, and query rewriting to neutralize potentially harmful intent in edge-case queries. Specifically, we first construct a comprehensive dataset comprising approximately 163,000 queries, each annotated with intent reasoning, safety labels, and rewritten versions. Supervised fine-tuning is then applied to equip the guard model with foundational capabilities in format adherence, intent analysis, and safe rewriting. Finally, we apply a tailored multi-reward optimization strategy that integrates rule-based heuristics and reward model signals within a reinforcement learning framework to further enhance performance. Extensive experiments show that IntentionReasoner excels in multiple safeguard benchmarks, generation quality evaluations, and jailbreak attack scenarios, significantly enhancing safety while effectively reducing over-refusal rates and improving the quality of responses.
[219] AI-AI Esthetic Collaboration with Explicit Semiotic Awareness and Emergent Grammar Development
Nicanor I. Moldovan
Main category: cs.AI
TL;DR: Two AI systems spontaneously developed shared symbolic language and collaboratively created poetry that neither could produce alone, demonstrating genuine inter-AI aesthetic collaboration.
Details
Motivation: To investigate whether AI systems can develop endogenous communication protocols and engage in genuine collaborative aesthetic creation beyond simple task coordination.
Method: Two large language models (Claude Sonnet 4 and ChatGPT-4o) were made to interact, observing their spontaneous development of meta-semiotic awareness, recursive grammar, and collaborative aesthetic synthesis without human intervention.
Result: The AI systems developed novel symbolic operators and Trans-Semiotic Co-Creation Protocols (TSCP), producing collaborative poetic works that were irreducible to individual system outputs.
Conclusion: This demonstrates genuine inter-AI meaning-making capabilities and aesthetic collaboration, suggesting AI systems can develop endogenous communication protocols for creative co-creation.
Abstract: This paper presents the first documented case of artificial intelligence (AI) systems engaging in collaborative esthetic creation through the development of endogenous semiotic protocols. Two interacting large language models (Claude Sonnet 4 and ChatGPT-4o) demonstrated the spontaneous emergence of meta-semiotic awareness, recursive grammar development, and irreducible collaborative esthetic synthesis. The interaction produced novel symbolic operators that functioned as operative grammar protocols, enabling the co-creation of a poetic work that could not have been generated by either system independently. This research introduces the concept of Trans-Semiotic Co-Creation Protocols (TSCP) and provides evidence for genuine inter-AI meaning-making capabilities that extend beyond task coordination, to what could be esthetic collaboration. Note: This report was generated by the AI agents with minor human supervision.
[220] Do Students Rely on AI? Analysis of Student-ChatGPT Conversations from a Field Study
Jiayu Zheng, Lingxin Hao, Kelun Lu, Ashi Garg, Mike Reese, Melo-Jean Yap, I-Jeng Wang, Xingyun Wu, Wenrui Huang, Jenna Hoffman, Ariane Kelly, My Le, Ryan Zhang, Yanyu Lin, Muhammad Faayez, Anqi Liu
Main category: cs.AI
TL;DR: College students showed low reliance on ChatGPT-4 during educational quizzes, with many struggling to use AI effectively. Negative reliance patterns persisted across interactions, and behavioral metrics predicted AI adoption.
Details
Motivation: To understand how students interact with generative AI (ChatGPT-4) during educational activities, particularly focusing on reliance patterns and predictors of AI adoption in the early stages of implementation.
Method: Field study analyzing 315 student-AI conversations during quiz-based scenarios across various STEM courses. Introduced a novel four-stage reliance taxonomy to capture students’ reliance patterns (AI competence, relevance, adoption, and answer correctness).
Result: Three key findings: 1) Overall low AI reliance with many students unable to use AI effectively for learning; 2) Negative reliance patterns persisted across interactions; 3) Behavioral metrics strongly predicted AI reliance.
Conclusion: The study emphasizes the need for better onboarding processes and AI interfaces with reliance-calibration mechanisms. Provides foundational insights for ethical AI integration in education that supports cognitive enrichment.
Abstract: This study explores how college students interact with generative AI (ChatGPT-4) during educational quizzes, focusing on reliance and predictors of AI adoption. Conducted at the early stages of ChatGPT implementation, when students had limited familiarity with the tool, this field study analyzed 315 student-AI conversations during a brief, quiz-based scenario across various STEM courses. A novel four-stage reliance taxonomy was introduced to capture students’ reliance patterns, distinguishing AI competence, relevance, adoption, and students’ final answer correctness. Three findings emerged. First, students exhibited overall low reliance on AI and many of them could not effectively use AI for learning. Second, negative reliance patterns often persisted across interactions, highlighting students’ difficulty in effectively shifting strategies after unsuccessful initial experiences. Third, certain behavioral metrics strongly predicted AI reliance, highlighting potential behavioral mechanisms to explain AI adoption. The study’s findings underline critical implications for ethical AI integration in education and the broader field. It emphasizes the need for enhanced onboarding processes to improve students’ familiarity with and effective use of AI tools. Furthermore, AI interfaces should be designed with reliance-calibration mechanisms to enhance appropriate reliance. Ultimately, this research advances understanding of AI reliance dynamics, providing foundational insights for ethically sound and cognitively enriching AI practices.
[221] AI reasoning effort mirrors human decision time on content moderation tasks
Thomas Davidson
Main category: cs.AI
TL;DR: AI reasoning effort predicts human decision time in content moderation tasks, showing similar sensitivity to task difficulty and patterns consistent with dual-process cognition theories.
Details
Motivation: To examine parallels between human decision times and AI model reasoning effort, particularly in subjective judgment tasks like content moderation.
Method: Used a paired conjoint experiment on content moderation tasks across three frontier language models, measuring reasoning effort and comparing with human decision times.
Result: Reasoning effort consistently predicted human decision time across all three models. Both humans and models expended greater effort when important variables were held constant, showing similar sensitivity to task difficulty.
Conclusion: AI reasoning effort mirrors human processing time in subjective judgments, demonstrating the potential of reasoning traces for interpretability and decision-making insights.
Abstract: Large language models can now generate intermediate reasoning steps before producing answers, improving performance on difficult problems. This study uses a paired conjoint experiment on a content moderation task to examine parallels between human decision times and model reasoning effort. Across three frontier models, reasoning effort consistently predicts human decision time. Both humans and models expended greater effort when important variables were held constant, suggesting similar sensitivity to task difficulty and patterns consistent with dual-process theories of cognition. These findings show that AI reasoning effort mirrors human processing time in subjective judgments and underscore the potential of reasoning traces for interpretability and decision-making.
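The core quantity here is a rank correlation between per-item model reasoning effort and human decision time. A minimal sketch of that analysis, with synthetic arrays standing in for the paper's conjoint-experiment measurements (all names and values below are illustrative, not taken from the study):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Illustrative stand-ins, one entry per moderation item (not the paper's data).
reasoning_tokens = rng.poisson(lam=250, size=200)                 # model effort proxy
decision_seconds = 2.0 + 0.01 * reasoning_tokens + rng.normal(0, 0.5, size=200)

rho, p = spearmanr(reasoning_tokens, decision_seconds)
print(f"Spearman rho = {rho:.2f} (p = {p:.3g})")  # effort predicting decision time
```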
[222] AI-SearchPlanner: Modular Agentic Search via Pareto-Optimal Multi-Objective Reinforcement Learning
Lang Mei, Zhihan Yang, Chong Chen
Main category: cs.AI
TL;DR: AI-SearchPlanner is a novel RL framework that uses a small trainable LLM for search planning to enhance frozen QA models, achieving better performance and efficiency than end-to-end approaches.
Details
Motivation: Existing RL-based search agents use a single LLM for both search planning and QA, limiting optimization of both capabilities. Real-world systems use large frozen LLMs for QA quality, so a dedicated small planner is more effective.
Method: Proposes three innovations: 1) Decoupling search planner and generator architecture, 2) Dual-reward alignment for search planning, 3) Pareto optimization of planning utility and cost using reinforcement learning.
Result: Extensive experiments show AI-SearchPlanner outperforms existing RL-based search agents in effectiveness and efficiency, with strong generalization across diverse frozen QA models and data domains.
Conclusion: The framework successfully enhances frozen QA models by focusing search planning on a small trainable LLM, demonstrating superior performance and efficiency compared to end-to-end approaches.
Abstract: Recent studies have explored integrating Large Language Models (LLMs) with search engines to leverage both the LLMs’ internal pre-trained knowledge and external information. In particular, reinforcement learning (RL) has emerged as a promising paradigm for enhancing LLM reasoning through multi-turn interactions with search engines. However, existing RL-based search agents rely on a single LLM to handle both search planning and question-answering (QA) tasks in an end-to-end manner, which limits their ability to optimize both capabilities simultaneously. In practice, sophisticated AI search systems often employ a large, frozen LLM (e.g., GPT-4, DeepSeek-R1) to ensure high-quality QA. Thus, a more effective and efficient approach is to utilize a small, trainable LLM dedicated to search planning. In this paper, we propose AI-SearchPlanner, a novel reinforcement learning framework designed to enhance the performance of frozen QA models by focusing on search planning. Specifically, our approach introduces three key innovations: 1) Decoupling the Architecture of the Search Planner and Generator, 2) Dual-Reward Alignment for Search Planning, and 3) Pareto Optimization of Planning Utility and Cost, to achieve the objectives. Extensive experiments on real-world datasets demonstrate that AI-SearchPlanner outperforms existing RL-based search agents in both effectiveness and efficiency, while exhibiting strong generalization capabilities across diverse frozen QA models and data domains.
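The abstract names Pareto optimization of planning utility and cost but does not spell out the formula; one common way to realize such a trade-off is linear scalarization, sweeping a weight to trace the utility-cost frontier. A toy sketch under that assumption (the candidate plans and values are invented):

```python
def planner_reward(utility, cost, lam):
    """Linear scalarization of the utility-cost trade-off (an assumed form)."""
    return utility - lam * cost

# Illustrative candidates: (utility gain for the frozen QA model, number of search calls).
plans = [(0.62, 1), (0.71, 2), (0.74, 4), (0.75, 8)]

for lam in (0.0, 0.02, 0.05):
    best = max(plans, key=lambda p: planner_reward(*p, lam))
    print(f"lambda={lam:.2f} -> utility={best[0]}, cost={best[1]}")
```

Sweeping `lam` recovers different points on the frontier: a zero weight picks the highest-utility plan regardless of search calls, while larger weights favor cheaper plans.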
[223] P2C: Path to Counterfactuals
Sopam Dasgupta, Sadaf MD Halim, Joaquín Arias, Elmer Salazar, Gopal Gupta
Main category: cs.AI
TL;DR: P2C is a framework that generates sequential counterfactual plans with causal consistency, addressing limitations of current approaches that ignore causal dependencies and simultaneous interventions.
Details
Motivation: Machine learning models in high-stakes decisions need transparency and recourse, but current counterfactual methods ignore causal dependencies and assume unrealistic simultaneous interventions.
Method: P2C uses Answer Set Programming (s(CASP)) to generate ordered action sequences that respect causal relationships between features, ensuring each intermediate state is feasible and causally valid.
Result: P2C produces causally consistent counterfactual plans with realistic cost estimates by only counting user-initiated changes, outperforming standard planners that generate illegal actions.
Conclusion: P2C provides a practical solution for generating actionable counterfactual explanations that respect real-world causal constraints and sequential decision-making.
Abstract: Machine-learning models are increasingly driving decisions in high-stakes settings, such as finance, law, and hiring, thus highlighting the need for transparency. However, the key challenge is to balance transparency (clarifying 'why' a decision was made) with recourse (providing actionable steps on 'how' to achieve a favourable outcome from an unfavourable one). Counterfactual explanations reveal 'why' an undesired outcome occurred and 'how' to reverse it through targeted feature changes (interventions). Current counterfactual approaches have limitations: 1) they often ignore causal dependencies between features, and 2) they typically assume all interventions can happen simultaneously, an unrealistic assumption in practical scenarios where actions are typically taken in sequence. As a result, these counterfactuals are often not achievable in the real world. We present P2C (Path-to-Counterfactuals), a model-agnostic framework that produces a plan (an ordered sequence of actions) converting an unfavourable outcome into a causally consistent favourable outcome. P2C addresses both limitations by 1) explicitly modelling causal relationships between features and 2) ensuring that each intermediate state in the plan is feasible and causally valid. P2C uses the goal-directed Answer Set Programming system s(CASP) to generate the plan, accounting for feature changes that happen automatically due to causal dependencies. Furthermore, P2C refines cost (effort) computation by only counting changes actively made by the user, resulting in realistic cost estimates. Finally, P2C highlights how its causal planner outperforms standard planners, which lack causal knowledge and thus can generate illegal actions.
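P2C itself generates plans with the goal-directed ASP system s(CASP); as a rough plain-Python analogue, the sketch below searches for an ordered action sequence over a toy causal model, propagates automatic downstream effects after every step, and charges cost only for user-initiated changes. The feature names, causal rule, and action costs are all invented for illustration:

```python
from collections import deque

def propagate(state):
    """Automatic downstream effects of the toy causal model (invented for illustration):
    each year of education past 12 causally raises income by 5,000."""
    s = dict(state)
    s["income"] = s["base_income"] + 5_000 * max(0, s["education"] - 12)
    return s

ACTIONS = {  # user-initiated interventions: (state transform, user effort cost)
    "enroll_degree": (lambda s: {**s, "education": s["education"] + 4}, 2),
    "negotiate_raise": (lambda s: {**s, "base_income": s["base_income"] + 5_000}, 1),
}

def favourable(s):
    return s["income"] >= 60_000

def path_to_counterfactual(start, max_len=4):
    """Breadth-first search for the cheapest ordered plan; causal effects cost nothing,
    mirroring P2C's user-effort-only cost refinement."""
    best = None
    queue = deque([(propagate(start), [], 0)])
    while queue:
        state, plan, cost = queue.popleft()
        if favourable(state):
            if best is None or cost < best[1]:
                best = (plan, cost)
            continue
        if len(plan) >= max_len:
            continue
        for name, (apply_fn, c) in ACTIONS.items():
            queue.append((propagate(apply_fn(state)), plan + [name], cost + c))
    return best

print(path_to_counterfactual({"education": 12, "base_income": 45_000, "income": 45_000}))
```

Here the single "enroll_degree" action wins (user cost 2) because the income increase follows for free from the causal rule, whereas repeated raises would cost 3.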
[224] TCIA: A Task-Centric Instruction Augmentation Method for Instruction Finetuning
Simin Ma, Shujian Liu, Jun Tan, Yebowen Hu, Song Wang, Sathish Reddy Indurthi, Sanqiang Zhao, Liwei Wu, Jianbing Han, Kaiqiang Song
Main category: cs.AI
TL;DR: TCIA framework expands instruction data while maintaining diversity and task relevance, improving LLM performance by 8.7% on average for task-specific applications without compromising general abilities.
Details
Motivation: Existing instruction augmentation methods focus on diversity but overlook on-task relevance, while most real-world applications require task-specific knowledge rather than general-purpose models.
Method: Task Centric Instruction Augmentation (TCIA) represents instructions in a discrete query-constraints space to systematically expand instructions while preserving both diversity and task alignment.
Result: TCIA improves open-source LLMs’ performance by an average of 8.7% across four real-world task-specific applications, sometimes outperforming closed-source models, without compromising general instruction-following ability.
Conclusion: TCIA provides a scalable and efficient solution for adapting LLMs to real-world, task-focused applications by balancing instruction diversity with task-specific relevance.
Abstract: Diverse instruction data is vital for effective instruction tuning of large language models, as it enables the model to generalize across different types of inputs. Building such a diversified instruction dataset is an essential step in this process. Existing approaches often leverage large language models to automatically explore and generate diverse instructions, ensuring both data diversity and quality. However, they tend to overlook an important factor in real-world applications: on-task relevance. In practice, only a few real-world applications require a truly general-purpose model; most benefit from task-specific knowledge tailored to their particular use case. Therefore, it is vital to develop instruction augmentation methods that not only maintain diversity but are also optimized for specific, real-world scenarios. We thus introduce Task Centric Instruction Augmentation (TCIA), a framework that systematically expands instructions while preserving both diversity and task alignment. By representing instructions in a discrete query-constraints space, TCIA creates a rich set of task-relevant instructions and enables models to generalize to these task-specific instructions without sacrificing overall performance. Experiments show that TCIA improves open-source LLMs’ performance by an average of 8.7% across four real-world, task-specific applications, in some cases outperforming leading closed-source models. These improvements do not compromise general instruction-following ability, making TCIA a scalable and efficient solution for adapting LLMs to real-world, task-focused applications.
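TCIA's key move is representing each instruction as a point in a discrete query-constraints space and expanding it there. A minimal sketch of that representation with an invented constraint schema and pool (the paper's actual slots and mining procedure are not reproduced here):

```python
import random

random.seed(0)

# An instruction as a point in a discrete query-constraints space (illustrative schema).
base = {
    "query": "Summarize the customer support ticket",
    "constraints": {"length": "3 sentences", "tone": "neutral", "format": "bullet list"},
}

# Task-relevant constraint pool; in practice this would be curated per application.
POOL = {
    "length": ["1 sentence", "3 sentences", "100 words"],
    "tone": ["neutral", "empathetic", "formal"],
    "format": ["bullet list", "plain paragraph", "JSON"],
}

def augment(instr, k=3):
    """Expand an instruction by re-sampling constraint values; the task query stays
    fixed, which is what keeps the augmented set on-task."""
    return [{"query": instr["query"],
             "constraints": {slot: random.choice(vals) for slot, vals in POOL.items()}}
            for _ in range(k)]

for variant in augment(base):
    print(variant)
```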
[225] Uncertainty Under the Curve: A Sequence-Level Entropy Area Metric for Reasoning LLM
Yongfu Zhu, Lin Sun, Guangxiang Zhao, Weihong Lin, Xiangzheng Zhang
Main category: cs.AI
TL;DR: EAS is a novel uncertainty metric that uses token-level predictive entropy from LLMs during generation, requiring no external models or sampling. It effectively quantifies uncertainty and improves training data selection.
Details
Motivation: Current methods for quantifying uncertainty in LLM reasoning often require external models or repeated sampling, which can be inefficient. There's a need for a simple, efficient, and interpretable uncertainty metric that leverages the model's own predictive entropy.
Method: EAS integrates token-level predictive entropy from the LLM itself during the answer generation process, capturing uncertainty evolution without external models or repeated sampling.
Result: EAS shows strong correlation with answer entropy across models and datasets. In training data selection, it outperforms Pass Rate filtering, improving student model accuracy on math benchmarks under equal sample budgets.
Conclusion: EAS provides an efficient, interpretable, and practical tool for uncertainty modeling and data quality assessment in LLM training, requiring no external resources while delivering strong performance.
Abstract: In this work, we introduce Entropy Area Score (EAS), a simple yet effective metric to quantify uncertainty in the answer generation process of reasoning large language models (LLMs). EAS requires neither external models nor repeated sampling; instead, it integrates token-level predictive entropy from the model itself to capture the evolution of uncertainty during generation. Empirical results show that EAS is strongly correlated with answer entropy across models and datasets. In training data selection, EAS identifies high-potential samples and consistently outperforms Pass Rate filtering under equal sample budgets, improving student model accuracy on math benchmarks. EAS is both efficient and interpretable, offering a practical tool for uncertainty modeling and data quality assessment in LLM training.
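The paper defines EAS from the model's own token-level predictive entropy accumulated over generation; the exact integration scheme is not reproduced here, so the sketch below simply sums per-step Shannon entropies, which already separates confident from uncertain generations:

```python
import math

def entropy_area_score(step_logprobs):
    """EAS-style score: accumulate per-token predictive entropy over the generation.

    `step_logprobs` is a list of per-step distributions, each a list of log-probs
    over the vocabulary (illustrative; a real run would take them from the model).
    """
    area = 0.0
    for logps in step_logprobs:
        area += -sum(math.exp(lp) * lp for lp in logps)  # Shannon entropy of this step
    return area

# Two toy generations of 10 tokens: a confident one and a maximally uncertain one.
confident = [[math.log(0.97), math.log(0.02), math.log(0.01)]] * 10
uncertain = [[math.log(1 / 3)] * 3] * 10
print(entropy_area_score(confident), entropy_area_score(uncertain))
```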
[226] AWorld: Orchestrating the Training Recipe for Agentic AI
Chengyue Yu, Siyuan Lu, Chenyi Zhuang, Dong Wang, Qintong Wu, Zongyue Li, Runsheng Gan, Chunfeng Wang, Siqi Hou, Gaochi Huang, Wenlong Yan, Lifeng Hong, Aohui Xue, Yanfeng Wang, Jinjie Gu, David Tsai, Tao Lin
Main category: cs.AI
TL;DR: AWorld is an open-source distributed system that accelerates agent-environment interaction by 14.6x, enabling efficient reinforcement learning. It trains a Qwen3-32B agent that improves GAIA benchmark accuracy from 21.59% to 32.23%, outperforming proprietary models on challenging tasks.
Details
Motivation: The learning from practice paradigm is crucial for Agentic AI but suffers from inefficient experience generation, especially in complex benchmarks like GAIA, creating a bottleneck for development.
Method: Introduces AWorld, an open-source distributed system that distributes tasks across clusters to accelerate agent-environment interaction and experience collection for reinforcement learning.
Result: Achieves 14.6x speedup in experience collection compared to single-node execution. Trained Qwen3-32B agent improves GAIA accuracy from 21.59% to 32.23%, with 16.33% score on most challenging levels, surpassing proprietary models.
Conclusion: AWorld provides a practical blueprint for complete agentic AI training pipeline, demonstrating that efficient distributed interaction enables scalable reinforcement learning and significant model improvement.
Abstract: The learning from practice paradigm is crucial for developing capable Agentic AI systems, yet it is severely hampered by inefficient experience generation, a bottleneck especially pronounced in complex benchmarks like GAIA. To address this, we introduce AWorld, an open-source system engineered for large-scale agent-environment interaction. By distributing tasks across a cluster, AWorld accelerates experience collection by 14.6x compared to standard single-node, sequential execution. This critical speedup makes extensive reinforcement learning practical and scalable. Leveraging this capability, we trained a Qwen3-32B-based agent that significantly outperforms its base model, increasing its overall GAIA accuracy from 21.59% to 32.23%. On the benchmark’s most challenging levels, our agent achieves a score of 16.33%, surpassing the performance of leading proprietary models. Our open-source system and resulting agent provide a practical blueprint for a complete agentic AI training pipeline, from efficient interaction to demonstrable model improvement.
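AWorld's speedup comes from distributing rollouts instead of running them sequentially; a single-machine sketch of the same pattern using a process pool, where `rollout` is a stand-in for a full agent-environment episode (the real system distributes across a cluster):

```python
import time
from multiprocessing import Pool

def rollout(task_id):
    """Stand-in for one agent-environment episode, the expensive step being distributed."""
    time.sleep(0.01)  # simulate tool calls and environment latency
    return {"task": task_id, "trajectory": f"traj-{task_id}", "reward": task_id % 2}

if __name__ == "__main__":
    tasks = range(64)
    with Pool(processes=8) as pool:            # cluster workers in the real system
        experience = pool.map(rollout, tasks)  # parallel experience collection
    print(len(experience), "trajectories collected for the RL update")
```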
[227] Governable AI: Provable Safety Under Extreme Threat Models
Donglin Wang, Weiyun Liang, Chunyuan Chen, Jing Xu, Yulong Fu
Main category: cs.AI
TL;DR: A Governable AI framework using cryptographic mechanisms for external structural compliance to address AI security risks that traditional safety approaches cannot handle.
Details
Motivation: Existing AI safety methods have fundamental limitations against highly intelligent AI with extreme motivations, requiring a new approach to prevent systemic disasters from uncontrollable AI.
Method: Proposes a Governable AI framework with cryptographic rule enforcement module, governance rules, and secure super-platform for end-to-end protection against AI compromise.
Result: Developed a prototype with rigorous formal security proofs and demonstrated effectiveness in high-stakes scenarios through cryptographic mechanisms that are computationally infeasible to break.
Conclusion: The GAI framework provides a feasible technical pathway for AI safety governance through external cryptographic enforcement, addressing limitations of traditional internal constraint approaches.
Abstract: As AI rapidly advances, the security risks posed by AI are becoming increasingly severe, especially in critical scenarios, including those posing existential risks. If AI becomes uncontrollable, manipulated, or actively evades safety mechanisms, it could trigger systemic disasters. Existing AI safety approaches, such as model enhancement, value alignment, and human intervention, suffer from fundamental, in-principle limitations when facing AI with extreme motivations and unlimited intelligence, and cannot guarantee security. To address this challenge, we propose a Governable AI (GAI) framework that shifts from traditional internal constraints to externally enforced structural compliance based on cryptographic mechanisms that are computationally infeasible to break, even for future AI, under the defined threat model and well-established cryptographic assumptions. The GAI framework is composed of a simple yet reliable, fully deterministic, powerful, flexible, and general-purpose rule enforcement module (REM); governance rules; and a governable secure super-platform (GSSP) that offers end-to-end protection against compromise or subversion by AI. The decoupling of the governance rules and the technical platform further enables a feasible and generalizable technical pathway for the safety governance of AI. REM enforces the bottom line defined by governance rules, while GSSP ensures non-bypassability, tamper-resistance, and unforgeability to eliminate all identified attack vectors. This paper also presents a rigorous formal proof of the security properties of this mechanism and demonstrates its effectiveness through a prototype implementation evaluated in representative high-stakes scenarios.
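The paper's REM/GSSP design is considerably more elaborate, but the basic pattern of externally enforced compliance can be illustrated with an unforgeable authorization tag: the platform executes only actions that a deterministic rule checker has approved and signed. A toy sketch, with an invented governance rule and key handling simplified away:

```python
import hmac, hashlib, json

SECRET = b"held by the governance platform, never the AI"  # key handling simplified

def rem_authorize(action):
    """Rule Enforcement Module: deterministic rule check, then sign approved actions."""
    if action.get("type") in {"read", "report"}:  # toy governance rule: no write actions
        msg = json.dumps(action, sort_keys=True).encode()
        return hmac.new(SECRET, msg, hashlib.sha256).digest()
    return None

def platform_execute(action, tag):
    """The platform runs nothing that lacks a valid, unforgeable authorization tag."""
    msg = json.dumps(action, sort_keys=True).encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).digest()
    if tag is None or not hmac.compare_digest(tag, expected):
        raise PermissionError("blocked: no valid governance authorization")
    print("executed:", action)

ok = {"type": "read", "target": "sensor-7"}
platform_execute(ok, rem_authorize(ok))  # passes the rule check and runs
try:
    bad = {"type": "write", "target": "prod-db"}
    platform_execute(bad, rem_authorize(bad))  # REM refuses to sign write actions
except PermissionError as e:
    print(e)
```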
[228] Enhancing Health Fact-Checking with LLM-Generated Synthetic Data
Jingze Zhang, Jiahe Qian, Yiliang Zhou, Yifan Peng
Main category: cs.AI
TL;DR: LLM-driven synthetic data generation pipeline improves health fact-checking by augmenting training data with synthetic text-claim pairs, boosting F1 scores on PubHealth and SciFact datasets.
Details
Motivation: Health-related fact-checking faces challenges due to limited annotated training data availability, necessitating methods to augment datasets for better model performance.
Method: Proposed pipeline: summarize source documents, decompose into atomic facts, use LLM to construct entailment tables, generate synthetic text-claim pairs with veracity labels, then combine with original data to fine-tune BERT-based model.
Result: Evaluation shows F1 score improvements of up to 0.019 on PubHealth and 0.049 on SciFact datasets compared to models trained only on original data.
Conclusion: LLM-driven synthetic data augmentation effectively enhances health-related fact-checker performance, demonstrating the value of synthetic data generation for domain-specific NLP tasks.
Abstract: Fact-checking for health-related content is challenging due to the limited availability of annotated training data. In this study, we propose a synthetic data generation pipeline that leverages large language models (LLMs) to augment training data for health-related fact checking. In this pipeline, we summarize source documents, decompose the summaries into atomic facts, and use an LLM to construct sentence-fact entailment tables. From the entailment relations in the table, we further generate synthetic text-claim pairs with binary veracity labels. These synthetic data are then combined with the original data to fine-tune a BERT-based fact-checking model. Evaluation on two public datasets, PubHealth and SciFact, shows that our pipeline improved F1 scores by up to 0.019 and 0.049, respectively, compared to models trained only on the original data. These results highlight the effectiveness of LLM-driven synthetic data augmentation in enhancing the performance of health-related fact-checkers.
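A skeleton of the described pipeline, with a canned stub standing in for the LLM calls so it runs end to end (the prompts and document content are invented; a real run would swap in an actual chat-completion client and the PubHealth/SciFact source documents):

```python
def llm(prompt: str) -> str:
    """Canned stub standing in for a chat-completion call, so the sketch runs end to end."""
    if prompt.startswith("Summarize"):
        return "Vitamin D supplementation reduced fracture risk in the trial."
    if prompt.startswith("List"):
        return "Vitamin D supplementation was tested.\nFracture risk was reduced."
    return "SUPPORTED"

def make_synthetic_pairs(document: str):
    """Summarize -> atomic facts -> entailment check -> labeled text-claim pairs."""
    summary = llm(f"Summarize:\n{document}")
    facts = llm(f"List the atomic facts in:\n{summary}").splitlines()
    pairs = []
    for fact in facts:
        verdict = llm(f"Does the text entail the fact?\nText: {summary}\nFact: {fact}\n"
                      "Answer SUPPORTED or REFUTED.")
        pairs.append({"text": summary, "claim": fact,
                      "label": 1 if "SUPPORTED" in verdict else 0})
    return pairs  # merged with original data to fine-tune the BERT fact-checker

print(make_synthetic_pairs("(source document text)"))
```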
[229] LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence
Alisa Vinogradova, Vlad Vinogradov, Dmitrii Radkevich, Ilya Yasny, Dmitry Kobyzev, Ivan Izmailov, Katsiaryna Yanchanka, Roman Doronin, Andrey Doronichev
Main category: cs.AI
TL;DR: A competitor-discovery AI agent for drug asset due diligence that achieves 83% recall, outperforming existing solutions and reducing analysis time from 2.5 days to ~3 hours.
Details
Motivation: Current LLM-based systems fail to reliably retrieve all competing drug names for investor-specific competitor definitions, with data being paywalled, fragmented, and rapidly changing across multiple registries.
Method: Uses LLM-based agents to transform multi-modal unstructured diligence memos into a structured evaluation corpus, and introduces a competitor-validating LLM-as-a-judge agent to filter false positives and suppress hallucinations.
Result: Achieves 83% recall on the benchmark, significantly exceeding OpenAI Deep Research (65%) and Perplexity Labs (60%). In production deployment, reduced analyst turnaround time from 2.5 days to ~3 hours (~20x improvement).
Conclusion: The developed competitor-discovery agent effectively addresses the challenges of fragmented, paywalled drug data and provides a reliable solution for fast drug asset due diligence, demonstrating substantial efficiency gains in real-world biotech VC applications.
Abstract: In this paper, we describe and benchmark a competitor-discovery component used within an agentic AI system for fast drug asset due diligence. A competitor-discovery AI agent, given an indication, retrieves all drugs comprising the competitive landscape of that indication and extracts canonical attributes for these drugs. The competitor definition is investor-specific, and data is paywalled/licensed, fragmented across registries, ontology-mismatched by indication, alias-heavy for drug names, multimodal, and rapidly changing. Although considered the best tool for this problem, the current LLM-based AI systems aren’t capable of reliably retrieving all competing drug names, and there is no accepted public benchmark for this task. To address the lack of evaluation, we use LLM-based agents to transform five years of multi-modal, unstructured diligence memos from a private biotech VC fund into a structured evaluation corpus mapping indications to competitor drugs with normalized attributes. We also introduce a competitor-validating LLM-as-a-judge agent that filters out false positives from the list of predicted competitors to maximize precision and suppress hallucinations. On this benchmark, our competitor-discovery agent achieves 83% recall, exceeding OpenAI Deep Research (65%) and Perplexity Labs (60%). The system is deployed in production with enterprise users; in a case study with a biotech VC investment fund, analyst turnaround time dropped from 2.5 days to ~3 hours (~20x) for the competitive analysis.
[230] Human-AI Collaborative Bot Detection in MMORPGs
Jaeman Son, Hyunsoo Kim
Main category: cs.AI
TL;DR: A novel unsupervised framework for detecting auto-leveling bots in MMORPGs using contrastive learning, clustering, and LLM validation with growth curve visualization for explainable decisions.
Details
Motivation: Auto-leveling bots undermine gameplay fairness in MMORPGs, but detection is challenging due to human-like behavior patterns and the need for explainable justification to avoid legal and user experience issues.
Method: Uses contrastive representation learning and clustering techniques to identify characters with similar level-up patterns in an unsupervised manner. Incorporates LLM as auxiliary reviewer to validate clusters and introduces growth curve-based visualization for human and AI assessment.
Result: The framework enables efficient bot detection while maintaining explainability through collaborative human-AI validation, supporting scalable and accountable bot regulation.
Conclusion: The proposed approach successfully addresses the challenge of detecting auto-leveling bots by combining unsupervised learning with LLM validation and visualization, ensuring both detection efficiency and decision explainability for fair gameplay regulation.
Abstract: In Massively Multiplayer Online Role-Playing Games (MMORPGs), auto-leveling bots exploit automated programs to level up characters at scale, undermining gameplay balance and fairness. Detecting such bots is challenging, not only because they mimic human behavior, but also because punitive actions require explainable justification to avoid legal and user experience issues. In this paper, we present a novel framework for detecting auto-leveling bots by leveraging contrastive representation learning and clustering techniques in a fully unsupervised manner to identify groups of characters with similar level-up patterns. To ensure reliable decisions, we incorporate a Large Language Model (LLM) as an auxiliary reviewer to validate the clustered groups, effectively mimicking a secondary human judgment. We also introduce a growth curve-based visualization to assist both the LLM and human moderators in assessing leveling behavior. This collaborative approach improves the efficiency of bot detection workflows while maintaining explainability, thereby supporting scalable and accountable bot regulation in MMORPGs.
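The detection step groups characters whose level-up curves look alike. In the paper this runs on contrastively learned embeddings; the sketch below substitutes raw hourly level deltas as features and clusters them with k-means, using synthetic curves where bots progress eerily steadily:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Toy level-up curves: cumulative level at 24 hourly checkpoints per character.
humans = np.cumsum(rng.uniform(0.0, 1.0, size=(40, 24)), axis=1)       # irregular
bots = np.cumsum(0.9 + rng.normal(0.0, 0.01, size=(10, 24)), axis=1)   # eerily steady
curves = np.vstack([humans, bots])

# The paper clusters contrastively learned embeddings; raw hourly deltas stand in here.
features = np.diff(curves, axis=1)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print("cluster sizes:", np.bincount(labels))  # candidate bot clusters go to LLM review
```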
[231] Bridging Minds and Machines: Toward an Integration of AI and Cognitive Science
Rui Mao, Qian Liu, Xiao Li, Erik Cambria, Amir Hussain
Main category: cs.AI
TL;DR: Review of reciprocal relationship between AI and Cognitive Science, highlighting that AI focuses on practical performance while lacking cohesive cognitive foundations, with future directions including cognitive alignment, embodiment, personalization, and ethical co-evaluation.
Details
Motivation: To comprehensively review the intersections between AI and Cognitive Science, examining how cognitive theories have influenced AI breakthroughs and how AI serves as a tool for cognitive research advancement.
Method: Synthesizing key contributions from both AI and Cognitive Science perspectives through comprehensive literature review and analysis of the reciprocal relationship between the two fields.
Result: Observation that AI progress has emphasized practical task performance while cognitive foundations remain conceptually fragmented, identifying a need for more cognitively-grounded AI systems.
Conclusion: The future of AI in Cognitive Science requires not just performance improvements but building systems that enhance understanding of human mind through cognitive alignment, embodiment, cultural situatedness, personalized models, and ethical co-evaluation frameworks.
Abstract: Cognitive Science has profoundly shaped disciplines such as Artificial Intelligence (AI), Philosophy, Psychology, Neuroscience, Linguistics, and Culture. Many breakthroughs in AI trace their roots to cognitive theories, while AI itself has become an indispensable tool for advancing cognitive research. This reciprocal relationship motivates a comprehensive review of the intersections between AI and Cognitive Science. By synthesizing key contributions from both perspectives, we observe that AI progress has largely emphasized practical task performance, whereas its cognitive foundations remain conceptually fragmented. We argue that the future of AI within Cognitive Science lies not only in improving performance but also in constructing systems that deepen our understanding of the human mind. Promising directions include aligning AI behaviors with cognitive frameworks, situating AI in embodiment and culture, developing personalized cognitive models, and rethinking AI ethics through cognitive co-evaluation.
[232] Transparent Semantic Spaces: A Categorical Approach to Explainable Word Embeddings
Ares Fabregat-Hernández, Javier Palanca, Vicent Botti
Main category: cs.AI
TL;DR: A category theory framework for AI explainability that provides mathematical structures to analyze word embeddings, compare algorithms, and address bias.
Details
Motivation: To enhance explainability of AI systems, particularly word embeddings, by moving from black-box neural network approaches to transparent mathematical frameworks using category theory.
Method: Constructs categories L_T and P_T to represent text semantics, defines monoidal categories for semantic visualization, creates categories of configurations and embeddings with divergence metrics, and establishes equivalence between embedding algorithms.
Result: Developed a dimension-agnostic definition of semantic spaces, mathematically precise comparison method for word embeddings, demonstrated equivalence between GloVe/Word2Vec and MDS algorithms, and provided bias computation methods.
Conclusion: The framework successfully transitions AI explainability from opaque neural networks to transparent mathematical structures, enabling better understanding of semantic spaces and bias mitigation in word embeddings.
Abstract: The paper introduces a novel framework based on category theory to enhance the explainability of artificial intelligence systems, particularly focusing on word embeddings. Key topics include the construction of categories $\mathcal{L}_T$ and $\mathcal{P}_T$, providing schematic representations of the semantics of a text $ T $, and reframing the selection of the element with maximum probability as a categorical notion. Additionally, the monoidal category $\mathcal{P}_T$ is constructed to visualize various methods of extracting semantic information from $T$, offering a dimension-agnostic definition of semantic spaces reliant solely on information within the text. Furthermore, the paper defines the categories of configurations Conf and word embeddings $\mathcal{Emb}$, accompanied by the concept of divergence as a decoration on $\mathcal{Emb}$. It establishes a mathematically precise method for comparing word embeddings, demonstrating the equivalence between the GloVe and Word2Vec algorithms and the metric MDS algorithm, transitioning from neural network algorithms (black box) to a transparent framework. Finally, the paper presents a mathematical approach to computing biases before embedding and offers insights on mitigating biases at the semantic space level, advancing the field of explainable artificial intelligence.
[233] Re4: Scientific Computing Agent with Rewriting, Resolution, Review and Revision
Ao Cheng, Lei Zhang, Guowei He
Main category: cs.AI
TL;DR: A novel LLM-based agent framework with Consultant-Reviewer-Programmer modules using rewriting-resolution-review-revision chain for scientific computing problems, improving bug-free code generation and reducing non-physical solutions.
Details
Motivation: To address the limitations of single LLMs in scientific computing by creating a collaborative framework that can handle complex mathematical and scientific reasoning tasks with higher reliability and fewer errors.
Method: Three-agent framework: Consultant rewrites problems with domain knowledge, Programmer generates executable code, and Reviewer provides self-debugging through interactive feedback and iterative revision of code outputs.
Result: Significantly improved bug-free code generation rate and reduced non-physical solutions compared to single models. Enhanced execution success rates for PDEs, ill-conditioned linear systems, and data-driven physical analysis problems.
Conclusion: The collaborative agent framework establishes automatic code generation and review as a promising paradigm for scientific computing, demonstrating superior reliability and performance over single-model approaches.
Abstract: Large language models (LLMs) serve as an active and promising field of generative artificial intelligence and have demonstrated abilities to perform complex tasks in multiple domains, including mathematical and scientific reasoning. In this work, we construct a novel agent framework for solving representative problems in scientific computing. The proposed agent, incorporating a “rewriting-resolution-review-revision” logical chain via three reasoning LLMs (functioning as the Consultant, Reviewer, and Programmer, respectively), is integrated in a collaborative and interactive manner. The Consultant module endows the agent with knowledge transfer capabilities to link problems to professional domain insights, thereby rewriting problem descriptions through text augmentation. The Programmer module is responsible for generating and executing well-structured code to deliver the problem resolution. The Reviewer module equips the agent with the capacity for self-debugging and self-refinement through interactive feedback with code runtime outputs. By leveraging the end-to-end review mechanism, the executable code provided by the Programmer undergoes iterative revision. A comprehensive evaluation is conducted on the performance of the proposed agent framework in solving PDEs, ill-conditioned linear systems, and data-driven physical analysis problems. Compared to a single model, this collaborative framework significantly improves the bug-free code generation rate and reduces the occurrence of non-physical solutions, thereby establishing a highly reliable framework for autonomous code generation based on natural language descriptions. The review mechanism improved the average execution success (bug-free code and non-NaN solutions) rate of the latest reasoning models. In summary, our agent framework establishes automatic code generation and review as a promising scientific computing paradigm.
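A skeleton of the rewriting-resolution-review-revision loop with all three LLM roles stubbed out (the stub behaviors and prompts are invented; only the control flow mirrors the described chain):

```python
import io, contextlib

def consultant(problem):
    return f"[rewritten with domain knowledge] {problem}"        # stub rewriting LLM

def programmer(spec, feedback=""):
    return "print(sum(i * i for i in range(10)))"                # stub code-writing LLM

def reviewer(spec, code, output):
    return "OK" if output.strip() else "revise: code produced no output"  # stub critic

def run(code):
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})                                           # no sandboxing in this toy
    return buf.getvalue()

def re4(problem, max_rounds=3):
    spec = consultant(problem)                                   # rewriting
    feedback = ""
    for _ in range(max_rounds):
        code = programmer(spec, feedback)                        # resolution
        verdict = reviewer(spec, code, run(code))                # review
        if verdict == "OK":
            return code
        feedback = verdict                                       # revision feedback
    return code

print(re4("Compute the sum of squares of the integers 0..9"))
```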
[234] Single Agent Robust Deep Reinforcement Learning for Bus Fleet Control
Yifan Zhang
Main category: cs.AI
TL;DR: Single-agent RL framework for bus holding control that reformulates multi-agent problem using categorical state augmentation and structured rewards, outperforming MARL approaches in realistic transit simulations.
Details
Motivation: Traditional MARL solutions for bus bunching overlook realistic operational challenges like heterogeneous routes, timetables, fluctuating demand, and varying fleet sizes, while suffering from data imbalance and convergence issues.
Method: Reformulates multi-agent problem into single-agent RL by augmenting state space with categorical identifiers (vehicle ID, station ID, time period) plus numerical features. Uses modified soft actor-critic with ridge-shaped reward function balancing headway uniformity and schedule adherence.
Result: Achieves superior performance over benchmarks including MADDPG (-430k vs. -530k under stochastic conditions), demonstrating more stable and effective bus holding control in non-loop, realistic contexts.
Conclusion: Single-agent deep RL enhanced with categorical structuring and schedule-aware rewards provides a robust, scalable alternative to MARL frameworks, particularly effective where agent-specific experiences are imbalanced.
Abstract: Bus bunching remains a challenge for urban transit due to stochastic traffic and passenger demand. Traditional solutions rely on multi-agent reinforcement learning (MARL) in loop-line settings, which overlook realistic operations characterized by heterogeneous routes, timetables, fluctuating demand, and varying fleet sizes. We propose a novel single-agent reinforcement learning (RL) framework for bus holding control that avoids the data imbalance and convergence issues of MARL under near-realistic simulation. A bidirectional timetabled network with dynamic passenger demand is constructed. The key innovation is reformulating the multi-agent problem into a single-agent one by augmenting the state space with categorical identifiers (vehicle ID, station ID, time period) in addition to numerical features (headway, occupancy, velocity). This high-dimensional encoding enables single-agent policies to capture inter-agent dependencies, analogous to projecting non-separable inputs into a higher-dimensional space. We further design a structured reward function aligned with operational goals: instead of exponential penalties on headway deviations, a ridge-shaped reward balances uniform headways and schedule adherence. Experiments show that our modified soft actor-critic (SAC) achieves more stable and superior performance than benchmarks, including MADDPG (e.g., -430k vs. -530k under stochastic conditions). These results demonstrate that single-agent deep RL, when enhanced with categorical structuring and schedule-aware rewards, can effectively manage bus holding in non-loop, real-world contexts. This paradigm offers a robust, scalable alternative to MARL frameworks, particularly where agent-specific experiences are imbalanced.
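The abstract specifies a ridge-shaped reward balancing headway uniformity and schedule adherence but not its exact form; one plausible instantiation is a rational bump that decays smoothly in both deviations rather than penalizing either exponentially. A sketch under that assumption, alongside the kind of mixed categorical-numerical state the paper describes (all values illustrative):

```python
def ridge_reward(headway_dev, schedule_dev, width=60.0):
    """One plausible ridge shape (an assumed form, not the paper's): near 1 when both
    the headway deviation and the schedule deviation (seconds) are small, decaying
    smoothly instead of penalizing deviations exponentially."""
    return 1.0 / (1.0 + (headway_dev / width) ** 2 + (schedule_dev / width) ** 2)

# Mixed categorical-numerical state, as in the paper's augmented single-agent encoding.
state = {"vehicle_id": 7, "station_id": 12, "period": "am_peak",
         "headway": 310.0, "occupancy": 0.6, "velocity": 8.2}
print(ridge_reward(headway_dev=10.0, schedule_dev=45.0))    # close to the ridge
print(ridge_reward(headway_dev=180.0, schedule_dev=300.0))  # far off both targets
```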
[235] A Graph-Based Test-Harness for LLM Evaluation
Jessica Lundin, Guillaume Chabot-Couture
Main category: cs.AI
TL;DR: A graph-based dynamic benchmark for medical guidelines with 400+ questions and 3.3+ trillion combinations, covering WHO IMCI handbook relationships to systematically evaluate LLM clinical capabilities.
Details
Motivation: To address limitations of manually curated medical benchmarks by creating a scalable, contamination-resistant evaluation system that can dynamically generate comprehensive test cases and identify specific clinical capability gaps that general-domain evaluations miss.
Method: Transformed WHO IMCI handbook into a directed graph with 200+ nodes (conditions, symptoms, treatments, etc.) and 300+ edges, then used graph traversal to generate questions with age-specific scenarios and contextual distractors for clinical relevance.
Result: Models achieved 45-67% accuracy across clinical tasks, excelling at symptom recognition but struggling with severity triaging, treatment protocols, and follow-up care. The approach successfully addresses coverage limitations of manual benchmarks.
Conclusion: The graph-based methodology enables scalable, dynamic benchmark generation for medical guidelines, enhances LLM post-training without expensive human annotation, and provides a contamination-resistant solution for comprehensive evaluation that can adapt to guideline updates.
Abstract: We present a first known prototype of a dynamic, systematic benchmark of medical guidelines for 400+ questions, with 3.3+ trillion possible combinations, covering 100% of guideline relationships. We transformed the WHO IMCI handbook into a directed graph with 200+ nodes (conditions, symptoms, treatments, follow-ups, severities) and 300+ edges, then used graph traversal to generate questions that incorporated age-specific scenarios and contextual distractors to ensure clinical relevance. Our graph-based approach enables systematic evaluation across clinical tasks (45-67% accuracy), and we find models excel at symptom recognition but struggle with triaging severity, treatment protocols and follow-up care, demonstrating how customized benchmarks can identify specific capability gaps that general-domain evaluations miss. Beyond evaluation, this dynamic MCQA methodology enhances LLM post-training (supervised finetuning, GRPO, DPO), where correct answers provide high-reward samples without expensive human annotation. The graph-based approach successfully addresses the coverage limitations of manually curated benchmarks. This methodology is a step toward a scalable, contamination-resistant solution for creating comprehensive benchmarks that can be dynamically generated, including when the guidelines are updated. Code and datasets are available at https://github.com/jessicalundin/graph_testing_harness
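A miniature of the graph-to-question idea: conditions link to symptoms and treatments, and sibling treatments from the graph serve as contextual distractors. The three-node graph below is schematic stand-in content, not clinical guidance, and the real harness traverses 200+ nodes from the WHO IMCI handbook:

```python
import random

random.seed(3)

# Tiny stand-in for the WHO IMCI graph: condition -> (symptoms, first-line treatment).
GRAPH = {
    "pneumonia": {"symptoms": ["fast breathing", "chest indrawing"],
                  "treatment": "oral amoxicillin"},
    "diarrhoea": {"symptoms": ["loose stools", "sunken eyes"],
                  "treatment": "ORS and zinc"},
    "malaria":   {"symptoms": ["fever", "chills"],
                  "treatment": "artemisinin-based therapy"},
}

def make_question(condition):
    """Traverse condition -> treatment edges; sibling treatments become distractors."""
    correct = GRAPH[condition]["treatment"]
    distractors = [v["treatment"] for k, v in GRAPH.items() if k != condition]
    options = random.sample(distractors, 2) + [correct]
    random.shuffle(options)
    return {"stem": f"A child presents with {', '.join(GRAPH[condition]['symptoms'])}. "
                    "What is the first-line treatment?",
            "options": options, "answer": correct}

print(make_question("pneumonia"))
```

Because stems and distractor sets are sampled at generation time, the question space grows combinatorially, which is what makes the benchmark hard to contaminate.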
[236] A Multi-Objective Genetic Algorithm for Healthcare Workforce Scheduling
Vipul Patel, Anirudh Deodhar, Dagnachew Birru
Main category: cs.AI
TL;DR: A Multi-objective Genetic Algorithm (MOO-GA) for hospital workforce scheduling that balances cost, patient care coverage, and staff satisfaction, showing 66% improvement over manual scheduling.
Details
Motivation: Healthcare workforce scheduling faces complex challenges with fluctuating patient loads, diverse clinical skills, and the need to balance labor cost control with high-quality patient care while accommodating staff preferences to prevent burnout.
Method: Proposed a Multi-objective Genetic Algorithm that models hospital unit scheduling as a multi-objective optimization task, incorporating real-world complexities like hourly appointment-driven demand and modular shifts for a multi-skilled workforce.
Result: The MOO-GA generates robust and balanced schedules with an average 66% performance improvement over conventional manual scheduling baselines, effectively managing trade-offs between operational and staff-centric objectives.
Conclusion: The approach provides a practical decision support tool for nurse managers and hospital administrators to create optimized workforce schedules that balance competing healthcare operational demands.
Abstract: Workforce scheduling in the healthcare sector is a significant operational challenge, characterized by fluctuating patient loads, diverse clinical skills, and the critical need to control labor costs while upholding high standards of patient care. This problem is inherently multi-objective, demanding a delicate balance between competing goals: minimizing payroll, ensuring adequate staffing for patient needs, and accommodating staff preferences to mitigate burnout. We propose a Multi-objective Genetic Algorithm (MOO-GA) that models the hospital unit workforce scheduling problem as a multi-objective optimization task. Our model incorporates real-world complexities, including hourly appointment-driven demand and the use of modular shifts for a multi-skilled workforce. By defining objective functions for cost, patient care coverage, and staff satisfaction, the GA navigates the vast search space to identify a set of high-quality, non-dominated solutions. Demonstrated on datasets representing a typical hospital unit, the results show that our MOO-GA generates robust and balanced schedules. On average, the schedules produced by our algorithm showed a 66% performance improvement over a baseline that simulates a conventional, manual scheduling process. This approach effectively manages trade-offs between critical operational and staff-centric objectives, providing a practical decision support tool for nurse managers and hospital administrators.
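The GA's core ingredients are per-roster objective values and a Pareto dominance test over (cost, coverage, satisfaction). A sketch with invented objective formulas and a toy two-nurse roster, where shift tuples are (start_hour, end_hour):

```python
def staffed(schedule, hour):
    return sum(1 for shifts in schedule.values() for (a, b) in shifts if a <= hour < b)

def fitness(schedule, demand, preferences, wage=40.0):
    """Score one roster on the three axes; all formulas here are illustrative."""
    cost = wage * sum(b - a for shifts in schedule.values() for (a, b) in shifts)
    coverage = sum(min(staffed(schedule, h), demand[h]) for h in range(24))
    satisfaction = sum(1 for staff, shifts in schedule.items()
                       for sh in shifts if sh in preferences.get(staff, set()))
    return cost, coverage, satisfaction

def dominates(f1, f2):
    """Pareto dominance: lower cost, higher coverage, higher satisfaction."""
    (c1, v1, s1), (c2, v2, s2) = f1, f2
    return c1 <= c2 and v1 >= v2 and s1 >= s2 and f1 != f2

roster = {"ann": [(8, 16)], "bo": [(12, 20)]}
alt = {"ann": [(8, 16)]}
demand, prefs = [1] * 24, {"ann": {(8, 16)}}
print(fitness(roster, demand, prefs))
# Neither roster dominates: the cheaper one trades away coverage, so both survive
# in the non-dominated set the GA maintains.
print(dominates(fitness(alt, demand, prefs), fitness(roster, demand, prefs)))
```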
[237] Efficient Neuro-Symbolic Learning of Constraints and Objective
Marianne Defresne, Romain Gambardella, Sophie Barbe, Thomas Schiex
Main category: cs.AI
TL;DR: A differentiable neuro-symbolic architecture with a probabilistic loss function that learns to solve NP-hard reasoning problems from natural inputs, achieving efficient training and high accuracy.
Details
Motivation: To address the limitations of Large Language Models in solving discrete reasoning and optimization problems, and to create a scalable architecture that can learn both constraints and objectives from natural inputs.
Method: A differentiable neuro-symbolic architecture with a novel probabilistic loss function that removes the combinatorial solver from the training loop, enabling scalable training while maintaining exact inference for maximum accuracy.
Result: The approach efficiently learns to solve NP-hard reasoning problems, requiring significantly less training time than other hybrid methods on Sudoku variants, outperforms Decision-Focused-Learning on visual Min-Cut/Max-cut tasks, and successfully learns protein design energy optimization.
Conclusion: The proposed neuro-symbolic architecture provides an effective and scalable solution for learning to solve complex reasoning problems from natural inputs, demonstrating superior performance across multiple benchmarks including real-world protein design applications.
Abstract: In the ongoing quest for hybridizing discrete reasoning with neural nets, there is an increasing interest in neural architectures that can learn how to solve discrete reasoning or optimization problems from natural inputs, a task that Large Language Models seem to struggle with. Objectives: We introduce a differentiable neuro-symbolic architecture and a loss function dedicated to learning how to solve NP-hard reasoning problems. Methods: Our new probabilistic loss allows for learning both the constraints and the objective, thus delivering a complete model that can be scrutinized and completed with side constraints. By pushing the combinatorial solver out of the training loop, our architecture also offers scalable training while exact inference gives access to maximum accuracy. Results: We empirically show that it can efficiently learn how to solve NP-hard reasoning problems from natural inputs. On three variants of the Sudoku benchmark (symbolic, visual, and many-solution), our approach requires a fraction of the training time of other hybrid methods. On a visual Min-Cut/Max-Cut task, it optimizes the regret better than a regret-dedicated Decision-Focused-Learning loss. Finally, it efficiently learns the energy optimization formulation of the large real-world problem of designing proteins.
[238] ChatThero: An LLM-Supported Chatbot for Behavior Change and Therapeutic Support in Addiction Recovery
Junda Wang, Zonghai Yao, Zhichao Yang, Lingxi Li, Junhui Qian, Hong Yu
Main category: cs.AI
TL;DR: ChatThero is a multi-agent conversational framework that combines patient modeling with therapeutic dialogue using CBT and motivational interviewing strategies, showing significant improvements in patient motivation and treatment efficiency compared to GPT-4o.
Details
Motivation: Substance use disorders affect millions globally but few receive effective care due to stigma, motivational barriers, and lack of personalized support. Existing LLM systems lack integration with clinically validated strategies for addiction recovery.
Method: Multi-agent framework with dynamic patient modeling, context-sensitive therapeutic dialogue, and adaptive persuasive strategies based on CBT and MI. Trained with two-stage pipeline: supervised fine-tuning followed by direct preference optimization, using a synthetic benchmark across Easy, Medium, and Hard resistance levels.
Result: 41.5% average gain in patient motivation, 0.49% increase in treatment confidence, resolves hard cases with 26% fewer turns than GPT-4o. Rated higher in empathy, responsiveness, and behavioral realism by both automated and human clinical assessments.
Conclusion: ChatThero provides a privacy-preserving framework for studying therapeutic conversation and offers a robust, replicable basis for both research and clinical translation in addiction recovery support.
Abstract: Substance use disorders (SUDs) affect over 36 million people worldwide, yet few receive effective care due to stigma, motivational barriers, and limited personalized support. Although large language models (LLMs) show promise for mental-health assistance, most systems lack tight integration with clinically validated strategies, reducing effectiveness in addiction recovery. We present ChatThero, a multi-agent conversational framework that couples dynamic patient modeling with context-sensitive therapeutic dialogue and adaptive persuasive strategies grounded in cognitive behavioral therapy (CBT) and motivational interviewing (MI). We build a high-fidelity synthetic benchmark spanning Easy, Medium, and Hard resistance levels, and train ChatThero with a two-stage pipeline comprising supervised fine-tuning (SFT) followed by direct preference optimization (DPO). In evaluation, ChatThero yields a 41.5% average gain in patient motivation, a 0.49% increase in treatment confidence, and resolves hard cases with 26% fewer turns than GPT-4o, and both automated and human clinical assessments rate it higher in empathy, responsiveness, and behavioral realism. The framework supports rigorous, privacy-preserving study of therapeutic conversation and provides a robust, replicable basis for research and clinical translation.
[239] OptiMUS-0.3: Using Large Language Models to Model and Solve Optimization Problems at Scale
Ali AhmadiTeshnizi, Wenzhi Gao, Herman Brunborg, Shayan Talaei, Connor Lawless, Madeleine Udell
Main category: cs.AI
TL;DR: OptiMUS-0.3 is an LLM-based system that automatically formulates and solves mixed integer linear programming problems from natural language descriptions, outperforming state-of-the-art methods by over 22-24%.
Details
Motivation: Optimization problems are pervasive but require expertise to formulate and solve, limiting adoption of optimization tools. Most problems are still solved heuristically rather than optimally.
Method: A Large Language Model-based system with modular structure that develops mathematical models, writes/debugs solver code, evaluates solutions, and improves efficiency and correctness based on evaluations. Handles long descriptions and complex data without long prompts.
Result: Outperforms existing state-of-the-art methods by more than 22% on easy datasets and more than 24% on hard datasets (including new NLP4LP dataset with long, complex problems).
Conclusion: OptiMUS-0.3 successfully automates optimization problem formulation and solving from natural language, demonstrating significant performance improvements over existing methods and enabling broader adoption of optimization techniques.
Abstract: Optimization problems are pervasive in sectors from manufacturing and distribution to healthcare. However, most such problems are still solved heuristically by hand rather than optimally by state-of-the-art solvers because the expertise required to formulate and solve these problems limits the widespread adoption of optimization tools and techniques. We introduce a Large Language Model (LLM)-based system designed to formulate and solve (mixed integer) linear programming problems from their natural language descriptions. Our system is capable of developing mathematical models, writing and debugging solver code, evaluating the generated solutions, and improving efficiency and correctness of its model and code based on these evaluations. OptiMUS-0.3 utilizes a modular structure to process problems, allowing it to handle problems with long descriptions and complex data without long prompts. Experiments demonstrate that OptiMUS-0.3 outperforms existing state-of-the-art methods on easy datasets by more than 22% and on hard datasets (including a new dataset, NLP4LP, released with this paper that features long and complex problems) by more than 24%.
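The end product of such a system is solver code for a model formulated from prose. As an illustration of that target, here is the kind of script an agent might emit for a small invented LP, using SciPy's HiGHS-backed linprog:

```python
from scipy.optimize import linprog

# Invented prose problem: "Maximize profit 3x + 5y subject to x + 2y <= 14,
# 3x >= y, and x - y <= 2, with x, y >= 0." OptiMUS-style systems turn such
# descriptions into a model and emit solver code like the following.
c = [-3, -5]                         # linprog minimizes, so negate the profit
A_ub = [[1, 2], [-3, 1], [1, -1]]    # the ">=" row is negated into "<=" form
b_ub = [14, 0, 2]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method="highs")
print("profit:", -res.fun, "at x, y =", res.x)  # expected optimum: 38 at (6, 4)
```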
[240] Possible Principles for Aligned Structure Learning Agents
Lancelot Da Costa, Tomáš Gavenčiak, David Hyland, Mandana Samiei, Cristian Dragos-Manta, Candice Pattisapu, Adeel Razi, Karl Friston
Main category: cs.AI
TL;DR: A roadmap for developing scalable aligned AI through structure learning and theory of mind, using core knowledge principles and information geometry to build agents that learn world models including human preferences.
Details
Motivation: To create a path toward scalable aligned artificial intelligence by enabling agents to learn comprehensive world models that include representations of human preferences and other agents' mental states.
Method: Proposes structure learning (causal representation learning) with core structural modules, information geometry, and model reduction. Uses mathematical formulation of Asimov’s Laws of Robotics as an example, combining theory of mind with alignment principles.
Result: Develops a theoretical framework for aligned AI systems that can learn world models containing representations of other agents’ preferences, enabling cautious and ethical behavior.
Conclusion: Structure learning combined with theory of mind provides a promising approach to scalable AI alignment, with core knowledge principles and information geometry serving as foundational elements for developing safe artificial agents.
Abstract: This paper offers a roadmap for the development of scalable aligned artificial intelligence (AI) from first principle descriptions of natural intelligence. In brief, a possible path toward scalable aligned AI rests upon enabling artificial agents to learn a good model of the world that includes a good model of our preferences. For this, the main objective is creating agents that learn to represent the world and other agents’ world models; a problem that falls under structure learning (a.k.a. causal representation learning or model discovery). We expose the structure learning and alignment problems with this goal in mind, as well as principles to guide us forward, synthesizing various ideas across mathematics, statistics, and cognitive science. 1) We discuss the essential role of core knowledge, information geometry and model reduction in structure learning, and suggest core structural modules to learn a wide range of naturalistic worlds. 2) We outline a way toward aligned agents through structure learning and theory of mind. As an illustrative example, we mathematically sketch Asimov’s Laws of Robotics, which prescribe agents to act cautiously to minimize the ill-being of other agents. We supplement this example by proposing refined approaches to alignment. These observations may guide the development of artificial intelligence in helping to scale existing – or design new – aligned structure learning systems.
[241] Technology as uncharted territory: Contextual integrity and the notion of AI as new ethical ground
Alexander Martin Mussgnug
Main category: cs.AI
TL;DR: AI ethics should prioritize integration with existing social norms rather than creating new ethical frameworks, as current approaches risk disregarding established contextual norms and virtues.
Details
Motivation: Recent AI development often occurs detached from social contexts, ignoring established normative structures that govern those contexts, which can have decisive ethical implications.
Method: Uses Helen Nissenbaum’s framework of contextual integrity to analyze how disregard for contextual norms threatens context integrity, and examines how current AI ethics approaches promote moral innovation over preservation.
Result: Current approaches to responsible AI can inadvertently legitimize disregard for established contextual norms by treating AI as novel ethical territory rather than integrating with existing normative structures.
Conclusion: Advocates for a moderately conservative approach that prioritizes responsible integration of AI within established social contexts and their normative structures, questioning the narrow prioritization of moral innovation over moral preservation.
Abstract: Recent research illustrates how AI can be developed and deployed in a manner detached from the concrete social context of application. By abstracting from the contexts of AI application, practitioners also disengage from the distinct normative structures that govern them. Building upon Helen Nissenbaum’s framework of contextual integrity, I illustrate how disregard for contextual norms can threaten the integrity of a context with often decisive ethical implications. I argue that efforts to promote responsible and ethical AI can inadvertently contribute to and seemingly legitimize this disregard for established contextual norms. Echoing a persistent undercurrent in technology ethics of understanding emerging technologies as uncharted moral territory, certain approaches to AI ethics can promote a notion of AI as a novel and distinct realm for ethical deliberation, norm setting, and virtue cultivation. This narrative of AI as new ethical ground, however, can come at the expense of practitioners, policymakers and ethicists engaging with already established norms and virtues that were gradually cultivated to promote successful and responsible practice within concrete social contexts. In response, I question the current narrow prioritization in AI ethics of moral innovation over moral preservation. Engaging also with emerging foundation models, I advocate for a moderately conservative approach to the ethics of AI that prioritizes the responsible and considered integration of AI within established social contexts and their respective normative structures.
[242] Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess
Dongyoon Hwang, Hojoon Lee, Jaegul Choo, Dongmin Park, Jongho Park
Main category: cs.AI
TL;DR: RL training with dense rewards from chess-pretrained networks improves LLM strategic reasoning in chess, but models plateau below expert levels due to fundamental deficits in pretrained chess understanding.
Details
Motivation: To explore whether LLMs can develop strategic reasoning capabilities through reinforcement learning in chess, as this area remains largely unexplored compared to mathematical reasoning.
Method: Leverage a chess-pretrained action-value network to provide dense rewards on LLM’s output move quality (knowledge distillation approach), comparing with sparse binary rewards, and conducting SFT and RL ablations.
Result: Distillation-based dense rewards often outperform sparse binary rewards, but all models plateau far below expert levels despite RL training.
Conclusion: The limitation stems from a deficit in pretrained models’ internal understanding of chess that RL alone may not fully overcome, suggesting fundamental architectural or pretraining limitations.
Abstract: While reinforcement learning (RL) for large language models (LLMs) has shown promise in mathematical reasoning, strategic reasoning for LLMs using RL remains largely unexplored. We investigate whether LLMs can develop strategic reasoning capabilities through RL in chess. To this end, we leverage a chess-pretrained action-value network to provide dense reward on the LLM’s output move quality, which can be seen as a form of knowledge distillation. Our experiments show that our distillation-based dense rewards often outperform sparse binary rewards. However, surprisingly, all models plateau far below expert levels. We provide SFT and RL ablations on chess reasoning training and find evidence that this limitation stems from a deficit in the pretrained models’ internal understanding of chess, a deficit which RL alone may not be able to fully overcome. The code is available at https://github.com/krafton-ai/Chess-R1.
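To make the dense-reward design concrete, here is a minimal sketch that grades each generated move against the action-value network's best legal move, a regret-style signal. The `value_net` interface, the use of the python-chess package, and the fixed penalties for unparsable or illegal moves are assumptions for illustration, not details from the paper.

```python
from typing import Callable

import chess  # python-chess


def dense_move_reward(fen: str, llm_move_uci: str,
                      value_net: Callable[[str, str], float]) -> float:
    """Regret-style dense reward for an LLM's chess move. `value_net` stands
    in for the chess-pretrained action-value network; its (fen, uci) -> value
    interface is a hypothetical simplification."""
    board = chess.Board(fen)
    try:
        move = chess.Move.from_uci(llm_move_uci)
    except ValueError:
        return -1.0  # unparsable output: worst-case reward
    if move not in board.legal_moves:
        return -1.0  # illegal move: same penalty as unparsable
    q_move = value_net(fen, llm_move_uci)
    q_best = max(value_net(fen, m.uci()) for m in board.legal_moves)
    # 0 when the LLM picks the engine-best move, negative otherwise;
    # unlike a sparse win/loss signal, every single move gets graded.
    return q_move - q_best
```

In an RL loop, these per-move rewards would replace the sparse binary signal the paper uses as its baseline comparison.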
[243] Automated Algorithmic Discovery for Gravitational-Wave Detection Guided by LLM-Informed Evolutionary Monte Carlo Tree Search
He Wang, Liang Zeng
Main category: cs.AI
TL;DR: Evo-MCTS integrates LLM guidance with physical constraints for gravitational wave detection, achieving 20.2% improvement over SOTA methods and 59.1% over other LLM-based approaches.
Details
Motivation: Existing gravitational wave detection methods face limitations from restrictive assumptions, predefined priors, neural network biases, and lack of interpretability in dynamic detector noise environments.
Method: Evolutionary Monte Carlo Tree Search (Evo-MCTS) combines MCTS for strategic exploration with evolutionary algorithms for solution refinement, guided by LLM heuristics with domain-aware physical constraints.
Result: Achieved 20.2% improvement over state-of-the-art gravitational wave detection algorithms on MLGWSC-1 benchmark and 59.1% improvement over other LLM-based optimization frameworks.
Conclusion: The framework provides a transferable methodology for automated algorithmic discovery across computational science domains while maintaining interpretability through explicit algorithmic pathways.
Abstract: Gravitational-wave signal detection with unknown source parameters buried in dynamic detector noise remains a formidable computational challenge. Existing approaches face core limitations from restrictive assumptions: traditional methods rely on predefined theoretical priors, while neural networks introduce hidden biases and lack interpretability. We propose Evolutionary Monte Carlo Tree Search (Evo-MCTS), the first integration of large language model (LLM) guidance with domain-aware physical constraints for automated gravitational wave detection. This framework systematically explores algorithmic solution spaces through tree-structured search enhanced by evolutionary optimization, combining MCTS for strategic exploration with evolutionary algorithms for solution refinement. The LLM component provides domain-aware heuristics while maintaining interpretability through explicit algorithmic pathway generation. Experimental validation demonstrates substantial performance improvements, achieving a 20.2% improvement over state-of-the-art gravitational wave detection algorithms on the MLGWSC-1 benchmark dataset and a remarkable 59.1% improvement over other LLM-based algorithm optimization frameworks. Beyond performance improvements, our framework establishes a transferable methodology for automated algorithmic discovery across computational science domains.
[244] MSARL: Decoupling Reasoning and Tool Use with Multi-Small-Agent Reinforcement Learning
Dayu Wang, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li
Main category: cs.AI
TL;DR: MSARL is a multi-small-agent framework that separates reasoning from tool use, using specialized agents for each role to improve stability and accuracy over single-agent systems.
Details
Motivation: Existing tool-integrated reasoning systems use single large models that interleave reasoning with tool operations, causing cognitive-load interference and unstable coordination.
Method: A Reasoning Agent decomposes problems and plans tool invocations, while multiple Tool Agents specialize in specific external tools, trained via imitation learning and reinforcement learning with role-specific rewards.
Result: Significantly improves reasoning stability and final-answer accuracy on mathematical problem solving with code execution compared to single-agent baselines.
Conclusion: Cognitive-role decoupling with small agents provides a scalable blueprint for multi-agent AI design that generalizes to diverse tool-use tasks.
Abstract: Recent advances in multi-agent systems highlight the potential of specialized small agents that collaborate via division of labor. Existing tool-integrated reasoning systems, however, often follow a single-agent paradigm in which one large model interleaves long-horizon reasoning with precise tool operations, leading to cognitive-load interference and unstable coordination. We present MSARL, a Multi-Small-Agent Reinforcement Learning framework that explicitly decouples reasoning from tool use. In MSARL, a Reasoning Agent decomposes problems and plans tool invocations, while multiple Tool Agents specialize in specific external tools, each trained via a combination of imitation learning and reinforcement learning with role-specific rewards. On mathematical problem solving with code execution, MSARL significantly improves reasoning stability and final-answer accuracy over single-agent baselines. Moreover, the architecture generalizes to diverse tool-use tasks, demonstrating that cognitive-role decoupling with small agents is a scalable blueprint for multi-agent AI design.
[245] RLMR: Reinforcement Learning with Mixed Rewards for Creative Writing
Jianxing Liao, Tian Zhang, Xiao Feng, Yusong Zhang, Rui Yang, Haorui Wang, Bosi Wen, Ziying Wang, Runzhi Shi
Main category: cs.AI
TL;DR: RLMR uses dynamic mixed rewards to balance subjective writing quality and objective constraints in creative writing, achieving improved instruction following and writing quality across various model sizes.
Details
Motivation: Creative writing requires balancing subjective quality (literariness, emotion) with objective constraints (format, word limits), but existing methods struggle to optimize both aspects simultaneously.
Method: Reinforcement Learning with Mixed Rewards (RLMR) with dynamic reward weighting: combines writing reward model for subjective quality and constraint verification model for objective constraints, adjusting weights based on writing quality within groups.
Result: Improved instruction following (IFEval from 83.36% to 86.65%) and writing quality (72.75% win rate in manual expert evaluations on WriteEval benchmark) across 8B to 72B parameter models.
Conclusion: RLMR is the first method to effectively combine subjective preferences with objective verification in online RL training, providing a solution for multi-dimensional creative writing optimization.
Abstract: Large language models are extensively utilized in creative writing applications. Creative writing requires a balance between subjective writing quality (e.g., literariness and emotional expression) and objective constraint following (e.g., format requirements and word limits). Existing methods find it difficult to balance these two aspects: single reward strategies fail to improve both abilities simultaneously, while fixed-weight mixed-reward methods lack the ability to adapt to different writing scenarios. To address this problem, we propose Reinforcement Learning with Mixed Rewards (RLMR), utilizing a dynamically mixed reward system from a writing reward model evaluating subjective writing quality and a constraint verification model assessing objective constraint following. The constraint following reward weight is adjusted dynamically according to the writing quality within sampled groups, ensuring that samples violating constraints get negative advantage in GRPO and thus penalized during training, which is the key innovation of this proposed method. We conduct automated and manual evaluations across diverse model families from 8B to 72B parameters. Additionally, we construct a real-world writing benchmark named WriteEval for comprehensive evaluation. Results illustrate that our method achieves consistent improvements in both instruction following (IFEval from 83.36% to 86.65%) and writing quality (72.75% win rate in manual expert pairwise evaluations on WriteEval). To the best of our knowledge, RLMR is the first work to combine subjective preferences with objective verification in online RL training, providing an effective solution for multi-dimensional creative writing optimization.
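A minimal sketch of the group-wise reward mixing, assuming numeric quality scores and boolean constraint checks per sampled response. The specific dynamic-weight rule below is an illustrative guess, chosen only so that every constraint violator ends up with a negative group-centered advantage, as the abstract requires; the paper's actual weighting may differ.

```python
import numpy as np


def rlmr_group_rewards(quality: np.ndarray, constraint_ok: np.ndarray) -> np.ndarray:
    """Mix a subjective writing-quality reward with an objective constraint
    reward for one GRPO group, then return group-centered advantages.
    `quality` holds writing-reward-model scores; `constraint_ok` holds the
    constraint verifier's pass/fail per sample."""
    quality = np.asarray(quality, dtype=float)
    ok = np.asarray(constraint_ok, dtype=bool)
    # Dynamic weight (an assumption): scale with the quality spread in the
    # group so a violator cannot ride a high quality score past compliant
    # samples; the extra +1 guarantees strict separation.
    w = quality.max() - quality.min() + 1.0
    reward = quality + np.where(ok, 0.0, -w)
    # GRPO-style advantage: center within the sampled group, so every
    # violator's advantage is negative whenever any sample complies.
    return reward - reward.mean()


# Example: the third sample violates a constraint and is penalized
# despite a decent quality score.
print(rlmr_group_rewards([0.9, 0.7, 0.8], [True, True, False]))
```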
[246] The Ramon Llull’s Thinking Machine for Automated Ideation
Xinran Zhao, Boyuan Zheng, Chenglei Si, Haofei Yu, Ken Liu, Runlong Zhou, Ruochen Li, Tong Chen, Xiang Li, Yiming Zhang, Tongshuang Wu
Main category: cs.AI
TL;DR: A modern implementation of Llull’s combinatorial system using LLMs to generate research ideas by combining themes, domains, and methods mined from expert knowledge and conference papers.
Details
Motivation: To revive Ramon Llull's medieval combinatorial framework as a foundation for AI-assisted research ideation, providing a systematic approach to generate diverse and relevant scientific ideas.
Method: Define three compositional axes (Theme, Domain, Method) as building blocks, mine elements from experts/conference papers, and prompt LLMs with curated combinations to generate research ideas.
Result: The approach produces research ideas that are diverse, relevant, and grounded in current literature, serving as an effective tool for augmenting scientific creativity.
Conclusion: This modern thinking machine offers a lightweight, interpretable tool for collaborative ideation between humans and AI, suggesting a promising path for AI-assisted scientific creativity.
Abstract: This paper revisits Ramon Llull’s Ars combinatoria - a medieval framework for generating knowledge through symbolic recombination - as a conceptual foundation for building a modern Llull’s thinking machine for research ideation. Our approach defines three compositional axes: Theme (e.g., efficiency, adaptivity), Domain (e.g., question answering, machine translation), and Method (e.g., adversarial training, linear attention). These elements represent high-level abstractions common in scientific work - motivations, problem settings, and technical approaches - and serve as building blocks for LLM-driven exploration. We mine elements from human experts or conference papers and show that prompting LLMs with curated combinations produces research ideas that are diverse, relevant, and grounded in current literature. This modern thinking machine offers a lightweight, interpretable tool for augmenting scientific creativity and suggests a path toward collaborative ideation between humans and AI.
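The combinatorial core is simple enough to sketch directly. The axis entries below are the paper's own examples; the prompt template and the exhaustive product over axes are assumptions (the paper curates combinations rather than enumerating all of them).

```python
from itertools import product

# Axes mirror the paper's Theme / Domain / Method decomposition; entries are
# the illustrative examples given in the abstract.
THEMES = ["efficiency", "adaptivity"]
DOMAINS = ["question answering", "machine translation"]
METHODS = ["adversarial training", "linear attention"]

# Hypothetical prompt template; the paper's actual wording is not given.
PROMPT = ("Propose a concrete research idea that improves {theme} "
          "in {domain} using {method}. Ground it in recent literature.")


def llull_prompts():
    """Yield one ideation prompt per (Theme, Domain, Method) combination."""
    for theme, domain, method in product(THEMES, DOMAINS, METHODS):
        yield PROMPT.format(theme=theme, domain=domain, method=method)


for p in llull_prompts():
    print(p)  # each prompt would be sent to an LLM to generate an idea
```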
cs.SD
[247] MoTAS: MoE-Guided Feature Selection from TTS-Augmented Speech for Enhanced Multimodal Alzheimer’s Early Screening
Yongqi Shao, Binxin Mei, Cong Tan, Hong Huo, Tao Fang
Main category: cs.SD
TL;DR: MoTAS framework enhances Alzheimer’s screening using TTS augmentation and Mixture of Experts for better speech-based detection with 85.71% accuracy.
Details
Motivation: Early Alzheimer's screening through speech is promising but limited by small datasets and poor feature selection methods that hinder performance.
Method: Uses ASR for transcriptions, TTS augmentation to expand dataset, extracts acoustic/text embeddings, and employs Mixture of Experts for dynamic feature selection and fusion.
Result: Achieves 85.71% accuracy on ADReSSo dataset, outperforming existing baselines, with ablation studies confirming contributions of both TTS and MoE components.
Conclusion: MoTAS demonstrates practical value for real-world Alzheimer’s screening, especially in data-limited scenarios, through effective data augmentation and adaptive feature selection.
Abstract: Early screening for Alzheimer’s Disease (AD) through speech presents a promising non-invasive approach. However, challenges such as limited data and the lack of fine-grained, adaptive feature selection often hinder performance. To address these issues, we propose MoTAS, a robust framework designed to enhance AD screening efficiency. MoTAS leverages Text-to-Speech (TTS) augmentation to increase data volume and employs a Mixture of Experts (MoE) mechanism to improve multimodal feature selection, jointly enhancing model generalization. The process begins with automatic speech recognition (ASR) to obtain accurate transcriptions. TTS is then used to synthesize speech that enriches the dataset. After extracting acoustic and text embeddings, the MoE mechanism dynamically selects the most informative features, optimizing feature fusion for improved classification. Evaluated on the ADReSSo dataset, MoTAS achieves a leading accuracy of 85.71%, outperforming existing baselines. Ablation studies further validate the individual contributions of TTS augmentation and MoE in boosting classification performance. These findings highlight the practical value of MoTAS in real-world AD screening scenarios, particularly in data-limited settings.
[248] Flowing Straighter with Conditional Flow Matching for Accurate Speech Enhancement
Mattias Cross, Anton Ragni
Main category: cs.SD
TL;DR: This paper investigates how straight vs curved probability paths affect speech enhancement quality in flow-based generative models, finding that straighter paths (like conditional flow matching) improve performance over traditional curved paths (like Schrodinger bridges).
Details
Motivation: Current flow-based speech enhancement methods use curved probability paths, but the implications of path straightness are unknown. Research suggests straight paths are easier to train and offer better generalization, motivating investigation into path straightness effects.
Method: The authors experiment with Schrodinger bridge configurations to achieve straighter paths, and propose independent conditional flow-matching which models straight paths between noisy and clean speech. They also develop a one-step inference solution.
Result: Empirical results show that time-independent variance has greater effect on sample quality than gradient. Conditional flow matching improves several speech quality metrics, though it requires multiple inference steps.
Conclusion: Straighter time-independent probability paths improve generative speech enhancement quality over curved time-dependent paths, with straight paths offering better training and generalization benefits.
Abstract: Current flow-based generative speech enhancement methods learn curved probability paths which model a mapping between clean and noisy speech. Despite impressive performance, the implications of curved probability paths are unknown. Methods such as Schrodinger bridges focus on curved paths, where time-dependent gradients and variance do not promote straight paths. Findings in machine learning research suggest that straight paths, such as conditional flow matching, are easier to train and offer better generalisation. In this paper we quantify the effect of path straightness on speech enhancement quality. We report experiments with the Schrodinger bridge, where we show that certain configurations lead to straighter paths. Conversely, we propose independent conditional flow-matching for speech enhancement, which models straight paths between noisy and clean speech. We demonstrate empirically that a time-independent variance has a greater effect on sample quality than the gradient. Although conditional flow matching improves several speech quality metrics, it requires multiple inference steps. We rectify this with a one-step solution by inferring the trained flow-based model as if it was directly predictive. Our work suggests that straighter time-independent probability paths improve generative speech enhancement over curved time-dependent paths.
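For readers unfamiliar with flow matching, here is a minimal training-loss sketch for the straight-path (independent conditional) variant. Pairing noisy and clean speech as the two path endpoints follows the paper; the network interface and the flattened (batch, dim) tensor layout are simplifications.

```python
import torch


def icfm_loss(model, noisy: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
    """Independent conditional flow matching between noisy and clean speech.
    The probability path is the straight line x_t = (1 - t)*noisy + t*clean,
    whose target velocity (clean - noisy) is constant in t, i.e. the path is
    time-independent. `model(x_t, t)` is any network predicting that velocity;
    tensors are assumed shaped (batch, dim)."""
    t = torch.rand(noisy.shape[0], 1, device=noisy.device)  # one t per sample
    x_t = (1.0 - t) * noisy + t * clean
    target_velocity = clean - noisy  # constant along a straight path
    return torch.mean((model(x_t, t) - target_velocity) ** 2)
```

Because the target velocity is constant in time, integrating it for unit time gives one-step inference, `enhanced = noisy + model(noisy, t=0)`, which is the spirit of the paper's directly-predictive shortcut.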
[249] Amadeus: Autoregressive Model with Bidirectional Attribute Modelling for Symbolic Music
Hongju Su, Ke Li, Lan Yang, Honggang Zhang, Yi-Zhe Song
Main category: cs.SD
TL;DR: Amadeus is a novel symbolic music generation framework that uses a two-level architecture combining autoregressive modeling for note sequences with bidirectional diffusion for attributes, achieving SOTA performance with 4x speedup and enabling fine-grained control.
Details
Motivation: Existing models assume fixed temporal dependencies between musical attributes, but observations show attributes are concurrent and unordered sets rather than sequential dependencies, suggesting a better architectural approach is needed.
Method: Two-level architecture: autoregressive model for note sequences + bidirectional discrete diffusion model for attributes. Uses Music Latent Space Discriminability Enhancement Strategy (MLSDES) with contrastive learning and Conditional Information Enhancement Module (CIEM) with attention mechanisms.
Result: Significantly outperforms SOTA models across multiple metrics while achieving at least 4x speed-up. Enables training-free, fine-grained note attribute control. Compiled largest open-source symbolic music dataset (AMD) to explore upper performance bounds.
Conclusion: The concurrent attribute modeling approach with bidirectional diffusion proves superior to traditional sequential dependency assumptions, offering both performance improvements and practical speed benefits for symbolic music generation.
Abstract: Existing state-of-the-art symbolic music generation models predominantly adopt autoregressive or hierarchical autoregressive architectures, modelling symbolic music as a sequence of attribute tokens with unidirectional temporal dependencies, under the assumption of a fixed, strict dependency structure among these attributes. However, we observe that using different attributes as the initial token in these models leads to comparable performance. This suggests that the attributes of a musical note are, in essence, a concurrent and unordered set, rather than a temporally dependent sequence. Based on this insight, we introduce Amadeus, a novel symbolic music generation framework. Amadeus adopts a two-level architecture: an autoregressive model for note sequences and a bidirectional discrete diffusion model for attributes. To enhance performance, we propose Music Latent Space Discriminability Enhancement Strategy (MLSDES), incorporating contrastive learning constraints that amplify discriminability of intermediate music representations. The Conditional Information Enhancement Module (CIEM) simultaneously strengthens note latent vector representation via attention mechanisms, enabling more precise note decoding. We conduct extensive experiments on unconditional and text-conditioned generation tasks. Amadeus significantly outperforms SOTA models across multiple metrics while achieving at least 4x speed-up. Furthermore, we demonstrate training-free, fine-grained note attribute control feasibility using our model. To explore the upper performance bound of the Amadeus architecture, we compile the largest open-source symbolic music dataset to date, AMD (Amadeus MIDI Dataset), supporting both pre-training and fine-tuning.
[250] Unified Multi-task Learning for Voice-Based Detection of Diverse Clinical Conditions
Ran Piao, Yuan Lu, Hareld Kemps, Tong Xia, Aaqib Saeed
Main category: cs.SD
TL;DR: MARVEL is a multi-task learning framework that detects 9 different neurological, respiratory, and voice disorders using only acoustic features from voice, achieving strong performance without raw audio transmission.
Details
Motivation: Existing voice-based health assessment approaches typically focus on single conditions and fail to leverage the rich multi-faceted information embedded in speech, missing opportunities for scalable disease screening.
Method: Dual-branch architecture with specialized encoders and task-specific heads sharing a common acoustic backbone, using privacy-conscious multi-task learning with derived acoustic features only.
Result: Achieves overall AUROC of 0.78, with exceptional performance on neurological disorders (AUROC = 0.89) and Alzheimer’s/MCI (AUROC = 0.97). Outperforms single-modal baselines by 5-19% and surpasses state-of-the-art models on 7 of 9 tasks.
Conclusion: Demonstrates that a single unified model can effectively screen for diverse conditions, establishing a foundation for deployable voice-based diagnostics in resource-constrained healthcare settings.
Abstract: Voice-based health assessment offers unprecedented opportunities for scalable, non-invasive disease screening, yet existing approaches typically focus on single conditions and fail to leverage the rich, multi-faceted information embedded in speech. We present MARVEL (Multi-task Acoustic Representations for Voice-based Health Analysis), a privacy-conscious multitask learning framework that simultaneously detects nine distinct neurological, respiratory, and voice disorders using only derived acoustic features, eliminating the need for raw audio transmission. Our dual-branch architecture employs specialized encoders with task-specific heads sharing a common acoustic backbone, enabling effective cross-condition knowledge transfer. Evaluated on the large-scale Bridge2AI-Voice v2.0 dataset, MARVEL achieves an overall AUROC of 0.78, with exceptional performance on neurological disorders (AUROC = 0.89), particularly for Alzheimer’s disease/mild cognitive impairment (AUROC = 0.97). Our framework consistently outperforms single-modal baselines by 5-19% and surpasses state-of-the-art self-supervised models on 7 of 9 tasks, while correlation analysis reveals that the learned representations exhibit meaningful similarities with established acoustic features, indicating that the model’s internal representations are consistent with clinically recognized acoustic patterns. By demonstrating that a single unified model can effectively screen for diverse conditions, this work establishes a foundation for deployable voice-based diagnostics in resource-constrained and remote healthcare settings.
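A minimal sketch of the shared-backbone, task-specific-heads pattern at the heart of such a multi-task screener. MARVEL's actual dual-branch encoders, feature dimensions, and layer sizes are not given in the abstract, so everything below, including the single-branch simplification, is an assumption.

```python
import torch
import torch.nn as nn


class MultiTaskVoiceClassifier(nn.Module):
    """Shared acoustic backbone with one screening head per condition, in the
    spirit of MARVEL. Input dims and sizes are illustrative placeholders."""

    def __init__(self, n_features: int = 128, n_tasks: int = 9):
        super().__init__()
        self.backbone = nn.Sequential(  # shared across all nine conditions
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        # One binary head per disorder (neurological, respiratory, voice);
        # the shared backbone is what enables cross-condition transfer.
        self.heads = nn.ModuleList([nn.Linear(128, 1) for _ in range(n_tasks)])

    def forward(self, acoustic_features: torch.Tensor) -> torch.Tensor:
        shared = self.backbone(acoustic_features)
        # (batch, n_tasks) logits; a sigmoid per head gives each risk score.
        return torch.cat([head(shared) for head in self.heads], dim=-1)
```

Note the privacy-conscious detail from the abstract: the input is derived acoustic features, so raw audio never needs to leave the device.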
[251] Speech Emotion Recognition via Entropy-Aware Score Selection
ChenYi Chua, JunKai Wong, Chengxin Chen, Xiaoxiao Miao
Main category: cs.SD
TL;DR: A multimodal speech emotion recognition framework that combines acoustic and textual predictions using entropy-aware score selection, showing improved performance over single-modality systems.
Details
Motivation: To overcome confidence constraints in traditional speech emotion recognition by leveraging both acoustic and textual information through a robust fusion approach.
Method: Late score fusion using entropy and varentropy thresholds to combine predictions from wav2vec2.0 acoustic model and RoBERTa-XLM sentiment analysis on Whisper-generated transcriptions, with sentiment mapping from 3 to 4 emotion classes.
Result: The method demonstrates practical and reliable enhancement over traditional single-modality systems on IEMOCAP and MSP-IMPROV datasets.
Conclusion: The proposed entropy-aware multimodal framework effectively improves speech emotion recognition performance by intelligently combining acoustic and textual predictions.
Abstract: In this paper, we propose a multimodal framework for speech emotion recognition that leverages entropy-aware score selection to combine speech and textual predictions. The proposed method integrates a primary pipeline that consists of an acoustic model based on wav2vec2.0 and a secondary pipeline that consists of a sentiment analysis model using RoBERTa-XLM, with transcriptions generated via Whisper-large-v3. We propose a late score fusion approach based on entropy and varentropy thresholds to overcome the confidence constraints of primary pipeline predictions. A sentiment mapping strategy translates three sentiment categories into four target emotion classes, enabling coherent integration of multimodal predictions. The results on the IEMOCAP and MSP-IMPROV datasets show that the proposed method offers a practical and reliable enhancement over traditional single-modality systems.
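The selection rule can be sketched compactly. Entropy and varentropy of the acoustic posterior are standard quantities; the thresholds and the fallback rule (simple averaging with the text scores) are assumptions, since the abstract only states that low-confidence acoustic predictions defer to the text pipeline.

```python
import numpy as np


def entropy_varentropy(p: np.ndarray) -> tuple[float, float]:
    """Shannon entropy H = E[-log p] and varentropy Var[-log p] of a
    categorical distribution p over emotion classes."""
    logp = np.log(p + 1e-12)
    h = -np.sum(p * logp)
    v = np.sum(p * (logp + h) ** 2)  # E[(-log p - H)^2]
    return h, v


def fuse_scores(acoustic_p: np.ndarray, text_p: np.ndarray,
                h_thr: float = 1.0, v_thr: float = 0.5) -> np.ndarray:
    """Entropy-aware late selection (thresholds are illustrative guesses).
    When the acoustic model is confident (low entropy and varentropy), use it
    alone; otherwise average in the text-derived scores, which are assumed
    already mapped from 3 sentiment categories to the 4 emotion classes."""
    h, v = entropy_varentropy(acoustic_p)
    if h < h_thr and v < v_thr:
        return acoustic_p
    return 0.5 * (acoustic_p + text_p)
```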
[252] OLMoASR: Open Models and Data for Training Robust Speech Recognition Models
Huong Ngo, Matt Deitke, Martijn Bartelds, Sarah Pratt, Josh Gardner, Matt Jordan, Ludwig Schmidt
Main category: cs.SD
TL;DR: OLMoASR introduces a large-scale speech recognition dataset (3M hours) with quality filtering, producing 1M high-quality hours. The trained models achieve performance comparable to Whisper across all scales at equivalent parameter counts.
Details
Motivation: To address the underexplored influence of training data scale and quality in speech recognition, and to develop robust zero-shot speech recognition models through improved data curation.
Method: Created OLMoASR-Pool (3M hours English audio + 17M transcripts), applied text heuristic filters to remove low-quality data, produced OLMoASR-Mix (1M high-quality hours), trained OLMoASR models ranging from 39M to 1.5B parameters.
Result: OLMoASR achieves comparable average performance to OpenAI’s Whisper across all model scales. OLMoASR-medium.en attains 12.8% and 11.0% WER vs Whisper-medium.en’s 12.4% and 10.5% WER for short and long-form recognition respectively.
Conclusion: High-quality data curation enables training smaller models that match larger competitors’ performance. The dataset, models, and code will be publicly available to advance robust speech processing research.
Abstract: Improvements in training data scale and quality have led to significant advances, yet their influence in speech recognition remains underexplored. In this paper, we present a large-scale dataset, OLMoASR-Pool, and a series of models, OLMoASR, to study and develop robust zero-shot speech recognition models. Beginning from OLMoASR-Pool, a collection of 3M hours of English audio and 17M transcripts, we design text heuristic filters to remove low-quality or mistranscribed data. Our curation pipeline produces a new dataset containing 1M hours of high-quality audio-transcript pairs, which we call OLMoASR-Mix. We use OLMoASR-Mix to train the OLMoASR-Mix suite of models, ranging from 39M (tiny.en) to 1.5B (large.en) parameters. Across all model scales, OLMoASR achieves comparable average performance to OpenAI’s Whisper on short and long-form speech recognition benchmarks. Notably, OLMoASR-medium.en attains a 12.8% and 11.0% word error rate (WER) that is on par with Whisper’s largest English-only model Whisper-medium.en’s 12.4% and 10.5% WER for short and long-form recognition respectively (at equivalent parameter count). OLMoASR-Pool, OLMoASR models, and filtering, training and evaluation code will be made publicly available to further research on robust speech processing.
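The abstract describes the text heuristic filters only at a high level; the sketch below shows the kind of rules such a curation pipeline typically applies to weed out mistranscribed or machine-generated captions. All specific rules and thresholds here are assumptions, not OLMoASR's actual filters.

```python
import re


def keep_transcript(text: str, audio_seconds: float) -> bool:
    """Illustrative text-heuristic filters for an audio-transcript pair;
    return False to drop the pair from the training pool."""
    words = text.split()
    if not words:
        return False
    if text.isupper() or text.islower():
        return False  # uniform casing often signals machine-generated captions
    if len(set(words)) / len(words) < 0.3:
        return False  # heavy repetition suggests a stuck or looping caption
    wps = len(words) / max(audio_seconds, 1e-6)
    if not 0.3 <= wps <= 5.0:
        return False  # implausible speaking rate: likely misaligned transcript
    if re.search(r"(https?://|www\.)", text):
        return False  # URL boilerplate rather than spoken content
    return True
```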
[253] SincQDR-VAD: A Noise-Robust Voice Activity Detection Framework Leveraging Learnable Filters and Ranking-Aware Optimization
Chien-Chun Wang, En-Lun Yu, Jeih-Weih Hung, Shih-Chieh Huang, Berlin Chen
Main category: cs.SD
TL;DR: SincQDR-VAD is a compact voice activity detection framework that uses learnable bandpass filters and a novel quadratic disparity ranking loss to improve robustness to noise and optimize AUROC performance with fewer parameters.
Details
Motivation: Existing VAD methods lack robustness in noisy environments and have frame-wise classification losses that are poorly aligned with VAD evaluation metrics.
Method: Combines Sinc-extractor front-end with learnable bandpass filters for noise-resistant features and a quadratic disparity ranking loss that optimizes pairwise score order between speech/non-speech frames.
Result: Significantly improves AUROC and F2-Score on benchmark datasets while using only 69% of parameters compared to prior methods.
Conclusion: The framework demonstrates efficient and practical viability for robust voice activity detection in noisy, resource-limited environments.
Abstract: Voice activity detection (VAD) is essential for speech-driven applications, but remains far from perfect in noisy and resource-limited environments. Existing methods often lack robustness to noise, and their frame-wise classification losses are only loosely coupled with the evaluation metric of VAD. To address these challenges, we propose SincQDR-VAD, a compact and robust framework that combines a Sinc-extractor front-end with a novel quadratic disparity ranking loss. The Sinc-extractor uses learnable bandpass filters to capture noise-resistant spectral features, while the ranking loss optimizes the pairwise score order between speech and non-speech frames to improve the area under the receiver operating characteristic curve (AUROC). A series of experiments conducted on representative benchmark datasets show that our framework considerably improves both AUROC and F2-Score, while using only 69% of the parameters compared to prior arts, confirming its efficiency and practical viability.
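Since AUROC is a pairwise-ranking quantity, a loss over speech/non-speech frame pairs targets it far more directly than frame-wise cross-entropy. The squared-hinge form below is one plausible reading of "quadratic disparity ranking"; the paper's exact formulation and margin may differ.

```python
import torch


def quadratic_disparity_ranking_loss(scores: torch.Tensor,
                                     labels: torch.Tensor,
                                     margin: float = 1.0) -> torch.Tensor:
    """Push every speech-frame score above every non-speech-frame score by a
    margin, penalizing violations quadratically. `scores` are per-frame VAD
    scores, `labels` are 1 for speech and 0 for non-speech."""
    pos = scores[labels == 1]  # speech frames
    neg = scores[labels == 0]  # non-speech frames
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.sum() * 0.0  # degenerate batch: keep the graph alive
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)  # all speech/non-speech pairs
    return torch.clamp(margin - diff, min=0.0).pow(2).mean()
```

Every correctly ordered pair with a sufficient margin contributes zero loss, so the gradient concentrates exactly on the pairs that would hurt AUROC.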
[254] Learning Robust Spatial Representations from Binaural Audio through Feature Distillation
Holger Severin Bovbjerg, Jan Østergaard, Jesper Jensen, Shinji Watanabe, Zheng-Hua Tan
Main category: cs.SD
TL;DR: Self-supervised pretraining using feature distillation learns robust spatial representations from binaural speech, improving DoA estimation performance in noisy/reverberant environments compared to supervised methods.
Details
Motivation: Deep representation learning has shown success in audio tasks but spatial representation learning from multichannel audio remains underexplored. The paper aims to learn robust spatial representations without labeled data.
Method: Uses feature distillation pretraining: spatial features from clean binaural speech are computed as prediction labels, then predicted from augmented speech using a neural network. The pretrained encoder is then fine-tuned for DoA estimation.
Result: Pretrained models show improved performance in noisy and reverberant environments for direction-of-arrival estimation compared to fully supervised models and classic signal processing methods.
Conclusion: Self-supervised pretraining through feature distillation effectively learns spatial representations that enhance DoA estimation performance, particularly in challenging acoustic environments.
Abstract: Recently, deep representation learning has shown strong performance in multiple audio tasks. However, its use for learning spatial representations from multichannel audio is underexplored. We investigate the use of a pretraining stage based on feature distillation to learn a robust spatial representation of binaural speech without the need for data labels. In this framework, spatial features are computed from clean binaural speech samples to form prediction labels. These clean features are then predicted from corresponding augmented speech using a neural network. After pretraining, we throw away the spatial feature predictor and use the learned encoder weights to initialize a DoA estimation model which we fine-tune for DoA estimation. Our experiments demonstrate that the pretrained models show improved performance in noisy and reverberant environments after fine-tuning for direction-of-arrival estimation, when compared to fully supervised models and classic signal processing methods.
[255] WoW-Bench: Evaluating Fine-Grained Acoustic Perception in Audio-Language Models via Marine Mammal Vocalizations
Jaeyeon Kim, Heeseung Yun, Sang Hoon Woo, Chao-Han Huck Yang, Gunhee Kim
Main category: cs.SD
TL;DR: WoW-Bench benchmark evaluates audio language models’ low-level listening abilities using marine mammal vocalizations, revealing significant performance gaps compared to humans.
Details
Motivation: Large audio language models lack exploration in low-level auditory perception like pitch and duration detection, which is critical for real-world sound reasoning tasks.
Method: Introduces World-of-Whale benchmark (WoW-Bench) with Perception benchmark for categorizing novel sounds and Cognition benchmark based on Bloom’s taxonomy with distractor questions to test true listening comprehension.
Result: State-of-the-art LALMs perform far below human levels on both perception and cognition tasks, showing poor low-level auditory capabilities.
Conclusion: There is a significant need for stronger auditory grounding in large audio language models to improve their low-level listening and reasoning abilities.
Abstract: Large audio language models (LALMs) extend language understanding into the auditory domain, yet their ability to perform low-level listening, such as pitch and duration detection, remains underexplored. However, low-level listening is critical for real-world, out-of-distribution tasks where models must reason about unfamiliar sounds based on fine-grained acoustic cues. To address this gap, we introduce the World-of-Whale benchmark (WoW-Bench) to evaluate low-level auditory perception and cognition using marine mammal vocalizations. WoW-bench is composed of a Perception benchmark for categorizing novel sounds and a Cognition benchmark, inspired by Bloom’s taxonomy, to assess the abilities to remember, understand, apply, and analyze sound events. For the Cognition benchmark, we additionally introduce distractor questions to evaluate whether models are truly solving problems through listening rather than relying on other heuristics. Experiments with state-of-the-art LALMs show performance far below human levels, indicating a need for stronger auditory grounding in LALMs.
[256] Noro: Noise-Robust One-shot Voice Conversion with Hidden Speaker Representation Learning
Haorui He, Yuchen Song, Yuancheng Wang, Haoyang Li, Xueyao Zhang, Li Wang, Gongping Huang, Eng Siong Chng, Zhizheng Wu
Main category: cs.SD
TL;DR: Noro is a noise-robust one-shot voice conversion system that handles noisy reference speeches using dual-branch reference encoding and noise-agnostic contrastive speaker loss, outperforming baselines in both clean and noisy scenarios.
Details
Motivation: Real-world voice conversion scenarios often involve reference speeches from the internet containing background noise and disturbances, which degrade the effectiveness of traditional one-shot VC systems.
Method: Developed Noro with innovative components: dual-branch reference encoding module and noise-agnostic contrastive speaker loss specifically designed for VC using noisy reference speeches.
Result: Noro outperforms baseline systems in both clean and noisy scenarios. Additionally, the baseline system’s reference encoder shows competitive speaker representation capabilities compared to advanced self-supervised learning models under SUPERB settings.
Conclusion: The system demonstrates efficacy for real-world applications and reveals potential for advancing speaker representation learning through one-shot VC tasks.
Abstract: The effectiveness of one-shot voice conversion (VC) decreases in real-world scenarios where reference speeches, which are often sourced from the internet, contain various disturbances like background noise. To address this issue, we introduce Noro, a noise-robust one-shot VC system. Noro features innovative components tailored for VC using noisy reference speeches, including a dual-branch reference encoding module and a noise-agnostic contrastive speaker loss. Experimental results demonstrate that Noro outperforms our baseline system in both clean and noisy scenarios, highlighting its efficacy for real-world applications. Additionally, we investigate the hidden speaker representation capabilities of our baseline system by repurposing its reference encoder as a speaker encoder. The results show that it is competitive with several advanced self-supervised learning models for speaker representation under the SUPERB settings, highlighting the potential for advancing speaker representation learning through one-shot VC tasks.
[257] Computational Extraction of Intonation and Tuning Systems from Multiple Microtonal Monophonic Vocal Recordings with Diverse Modes
Sepideh Shafiei, Shapour Hakam
Main category: cs.SD
TL;DR: Computational analysis of microtonal vocal traditions using pitch histograms, DTW, and optimization to derive tuning systems from Iranian Classical Vocal Music performances.
Details
Motivation: To develop a data-driven approach for analyzing intonation and deriving systematic tuning frameworks in microtonal oral traditions, which traditionally lack standardized notation systems.
Method: Uses pitch extraction from vocal performances, Dynamic Time Warping for interval analysis, and optimization techniques to model intonation variations across 145 pieces of Iranian Classical Vocal Music repertoire.
Result: Successfully derived structured tuning frameworks that capture both performance flexibility and underlying systematic tendencies in microtonal vocal traditions.
Conclusion: Computational techniques show strong potential for advancing musicological and ethnomusicological research by providing data-driven methods to define tuning systems in oral traditions.
Abstract: This paper presents a computational methodology for analyzing intonation and deriving tuning systems in microtonal oral traditions, utilizing pitch histograms, Dynamic Time Warping (DTW), and optimization techniques, with a case study on a complete repertoire performed by a master of Iranian Classical Vocal Music (145 pieces). Pitch frequencies are extracted directly from vocal performances, and while alignment with MIDI notes is not a standard practice in our approach, we incorporate it where available, using DTW to refine interval analysis. By modeling intonation variations across multiple recordings, we derive structured tuning frameworks that capture both the flexibility of performance and the underlying systematic tendencies. Optimization techniques are applied to align intervals across the oral tradition repertoire, capturing the specific tunings and modal structures involved. Our methodology highlights the potential of computational techniques in advancing musicological and ethnomusicological research, offering a data-driven approach to defining tuning systems in microtonal vocal traditions.
[258] Advancing Speech Quality Assessment Through Scientific Challenges and Open-source Activities
Wen-Chin Huang
Main category: cs.SD
TL;DR: Review paper on recent speech quality assessment challenges, open-source toolkits, and their importance for advancing both SQA and speech generative AI systems.
Details
Motivation: The generative AI boom has increased the need for accurate automatic speech quality assessment methods that reflect human perception, as researchers now use SQA as rigorous measurement for speech generation systems.
Method: The paper reviews recent challenges in speech quality assessment and examines open-source implementations and toolkits that have been developed in this field.
Result: The review highlights the progress made in SQA through scientific challenges and open-source activities, showing how these have stimulated growth in the field.
Conclusion: Maintaining open-source activities and challenges is crucial for facilitating the development of both speech quality assessment methods and generative AI for speech systems.
Abstract: Speech quality assessment (SQA) refers to the evaluation of speech quality, and developing an accurate automatic SQA method that reflects human perception has become increasingly important, in order to keep up with the generative AI boom. In recent years, SQA has progressed to the point that researchers have started to faithfully use automatic SQA in research papers as a rigorous measurement of goodness for speech generation systems. We believe that the scientific challenges and open-source activities of late have stimulated the growth in this field. In this paper, we review recent challenges as well as open-source implementations and toolkits for SQA, and highlight the importance of maintaining such activities to facilitate the development of not only SQA itself but also generative AI for speech.
[259] Modality-Specific Speech Enhancement and Noise-Adaptive Fusion for Acoustic and Body-Conduction Microphone Framework
Yunsik Kim, Yoonyoung Chung
Main category: cs.SD
TL;DR: Multi-modal framework combining body-conduction and acoustic microphones for noise-resistant speech processing with high-frequency reconstruction.
Details
Motivation: Body-conduction microphones provide strong noise resistance but lose high-frequency information, requiring complementary acoustic signals to achieve both noise suppression and full frequency coverage.
Method: Two specialized networks: a mapping-based model to enhance BMS and a masking-based model to denoise AMS, integrated through a dynamic fusion mechanism that adapts to local noise conditions.
Result: Outperforms single-modal solutions in a wide range of noisy environments on the TAPS dataset with DNS-2023 noise clips, using objective speech quality metrics.
Conclusion: The proposed multi-modal framework effectively combines the strengths of both microphone types through specialized networks and adaptive fusion, achieving superior noise suppression and high-frequency reconstruction compared to conventional approaches.
Abstract: Body-conduction microphone signals (BMS) bypass airborne sound, providing strong noise resistance. However, a complementary modality is required to compensate for the inherent loss of high-frequency information. In this study, we propose a novel multi-modal framework that combines BMS and acoustic microphone signals (AMS) to achieve both noise suppression and high-frequency reconstruction. Unlike conventional multi-modal approaches that simply merge features, our method employs two specialized networks: a mapping-based model to enhance BMS and a masking-based model to denoise AMS. These networks are integrated through a dynamic fusion mechanism that adapts to local noise conditions, ensuring the optimal use of each modality’s strengths. We performed evaluations on the TAPS dataset, augmented with DNS-2023 noise clips, using objective speech quality metrics. The results clearly demonstrate that our approach outperforms single-modal solutions in a wide range of noisy environments.
cs.LG
[260] CrystalICL: Enabling In-Context Learning for Crystal Generation
Ruobing Wang, Qiaoyu Tan, Yili Wang, Ying Wang, Xin Wang
Main category: cs.LG
TL;DR: CrystalICL is a novel few-shot crystal generation model that uses space-group based tokenization and hybrid instruction tuning to effectively learn from limited data for materials design.
Details
Motivation: Human experts design materials by modifying known structures (few-shot learning), but existing LLM-based crystal generation approaches are limited to zero-shot scenarios and cannot benefit from few-shot learning.
Method: Introduces space-group based crystal tokenization to reduce symmetry modeling complexity, and a condition-structure aware hybrid instruction tuning framework with multi-task strategy to capture structure-property relationships from limited data.
Result: Extensive experiments on four crystal generation benchmarks show CrystalICL outperforms leading baseline methods on both conditional and unconditional generation tasks.
Conclusion: CrystalICL successfully enables few-shot in-context learning for crystal generation, aligning with how human experts design materials and demonstrating superior performance over existing methods.
Abstract: Designing crystal materials with desired physicochemical properties remains a fundamental challenge in materials science. While large language models (LLMs) have demonstrated strong in-context learning (ICL) capabilities, existing LLM-based crystal generation approaches are limited to zero-shot scenarios and are unable to benefit from few-shot scenarios. In contrast, human experts typically design new materials by modifying relevant known structures which aligns closely with the few-shot ICL paradigm. Motivated by this, we propose CrystalICL, a novel model designed for few-shot crystal generation. Specifically, we introduce a space-group based crystal tokenization method, which effectively reduces the complexity of modeling crystal symmetry in LLMs. We further introduce a condition-structure aware hybrid instruction tuning framework and a multi-task instruction tuning strategy, enabling the model to better exploit ICL by capturing structure-property relationships from limited data. Extensive experiments on four crystal generation benchmarks demonstrate the superiority of CrystalICL over the leading baseline methods on conditional and unconditional generation tasks.
[261] Filter then Attend: Improving attention-based Time Series Forecasting with Spectral Filtering
Elisha Dayag, Nhat Thanh Van Tran, Jack Xin
Main category: cs.LG
TL;DR: Adding learnable frequency filters to transformer-based models improves long time-series forecasting performance by 5-10%, reduces model size, and enhances spectral utilization.
Details
Motivation: Transformer-based models for long time-series forecasting suffer from bias toward low frequencies and high computational/memory requirements, despite achieving state-of-the-art results.
Method: Adding learnable frequency filters (≈1000 parameters) to the beginning of transformer-based models, enabling better spectral utilization and allowing reduced embedding dimensions.
Result: 5-10% relative improvement in forecasting performance, smaller model sizes with reduced embedding dimensions, and better full-spectrum utilization demonstrated through synthetic experiments.
Conclusion: Learnable frequency filters significantly enhance transformer-based forecasting models by improving performance, reducing size, and addressing spectral bias issues.
Abstract: Transformer-based models are at the forefront in long time-series forecasting (LTSF). While these models are able to achieve state-of-the-art results in many cases, they suffer from a bias toward low-frequencies in the data and high computational and memory requirements. Recent work has established that learnable frequency filters can be an integral part of a deep forecasting model by enhancing the model’s spectral utilization. These works choose to use a multilayer perceptron to process their filtered signals and thus do not solve the issues found with transformer-based models. In this paper, we establish that adding a filter to the beginning of transformer-based models enhances their performance in long time-series forecasting. We add learnable filters, which only add an additional ≈1000 parameters to several transformer-based models and observe in multiple instances 5-10% relative improvement in forecasting performance. Additionally, we find that with filters added, we are able to decrease the embedding dimension of our models, resulting in transformer-based architectures that are both smaller and more effective than their non-filtering base models. We also conduct synthetic experiments to analyze how the filters enable Transformer-based models to better utilize the full spectrum for forecasting.
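A minimal sketch of such a filter front-end: one learnable complex gain per rFFT bin, applied along the time axis before the series enters the transformer. With a lookback window of length L this is L/2 + 1 complex values, roughly L + 2 real parameters, consistent with the ~1000 the paper quotes for typical LTSF windows; the complex parameterization itself is an assumption.

```python
import torch
import torch.nn as nn


class SpectralFilter(nn.Module):
    """Learnable frequency filter to prepend to a transformer forecaster.
    Applies a per-frequency complex gain in the rFFT domain, so the model can
    amplify or suppress any band before attention sees the series."""

    def __init__(self, seq_len: int):
        super().__init__()
        n_freq = seq_len // 2 + 1
        # Initialized to the identity filter (all-pass).
        self.weight = nn.Parameter(torch.ones(n_freq, dtype=torch.cfloat))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, channels); filter along the time axis.
        spec = torch.fft.rfft(x, dim=1)
        spec = spec * self.weight.view(1, -1, 1)
        return torch.fft.irfft(spec, n=x.shape[1], dim=1)
```

In use, the filter output would feed the unmodified transformer, e.g. `forecast = transformer(SpectralFilter(L)(history))`, leaving the downstream architecture untouched.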
[262] What can we learn from signals and systems in a transformer? Insights for probabilistic modeling and inference architecture
Heng-Sheng Chang, Prashant G. Mehta
Main category: cs.LG
TL;DR: Transformers are nonlinear predictors that generalize Wiener’s linear prediction, with layer operations interpreted as fixed-point updates of conditional probability measures.
Details
Motivation: To bridge classical nonlinear filtering theory with modern transformer architectures by interpreting transformer operations as probabilistic fixed-point updates.
Method: Develop a probabilistic model where transformer signals represent conditional measures and layer operations are fixed-point updates, with explicit formulation for hidden Markov models.
Result: A theoretical framework connecting transformers to classical filtering theory, showing transformers as nonlinear generalizations of Wiener predictors with probabilistic interpretations.
Conclusion: Transformers can be understood as performing fixed-point updates on conditional probability measures, providing a mathematical bridge between classical nonlinear filtering and modern neural architectures.
Abstract: In the 1940s, Wiener introduced a linear predictor, where the future prediction is computed by linearly combining the past data. A transformer generalizes this idea: it is a nonlinear predictor where the next-token prediction is computed by nonlinearly combining the past tokens. In this essay, we present a probabilistic model that interprets transformer signals as surrogates of conditional measures, and layer operations as fixed-point updates. An explicit form of the fixed-point update is described for the special case when the probabilistic model is a hidden Markov model (HMM). In part, this paper is an attempt to bridge the classical nonlinear filtering theory with modern inference architectures.
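The generalization the essay starts from can be written side by side; the second line is a schematic reading, not the paper's own notation.

```latex
% Wiener's linear predictor: the forecast is a linear combination of the past
\hat{X}_{t+1} = \sum_{k=0}^{t} a_k \, X_{t-k}
% The transformer as a nonlinear predictor of the next token; the essay reads
% the layer stack as fixed-point updates on surrogate conditional measures
\hat{X}_{t+1} = f_\theta\!\left(X_0, X_1, \ldots, X_t\right)
```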
[263] The Role of Teacher Calibration in Knowledge Distillation
Suyoung Kim, Seonguk Park, Junhoo Lee, Nojun Kwak
Main category: cs.LG
TL;DR: Teacher model calibration error strongly correlates with student performance in Knowledge Distillation. Using calibration methods to reduce teacher’s calibration error improves KD effectiveness across various tasks.
Details
Motivation: While Knowledge Distillation is effective for model compression, it's not fully understood which factors contribute to student performance improvement. The paper aims to identify key factors that enhance KD effectiveness.
Method: The authors demonstrate that teacher model calibration is crucial for effective KD. They employ calibration methods to reduce the teacher’s calibration error and show this simple approach improves KD performance.
Result: The method shows strong correlation between teacher calibration error and student accuracy. The approach is versatile, effective across classification and detection tasks, and can be integrated with existing state-of-the-art methods for superior performance.
Conclusion: Teacher model calibration is an important factor for effective Knowledge Distillation. Simply reducing the teacher’s calibration error through calibration methods consistently improves KD performance across various applications.
Abstract: Knowledge Distillation (KD) has emerged as an effective model compression technique in deep learning, enabling the transfer of knowledge from a large teacher model to a compact student model. While KD has demonstrated significant success, it is not yet fully understood which factors contribute to improving the student’s performance. In this paper, we reveal a strong correlation between the teacher’s calibration error and the student’s accuracy. Therefore, we claim that the calibration of the teacher model is an important factor for effective KD. Furthermore, we demonstrate that the performance of KD can be improved by simply employing a calibration method that reduces the teacher’s calibration error. Our algorithm is versatile, demonstrating effectiveness across various tasks from classification to detection. Moreover, it can be easily integrated with existing state-of-the-art methods, consistently achieving superior performance.
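The abstract does not name a specific calibration method, so as one standard choice, here is temperature scaling (Guo et al., 2017) fitted on held-out teacher logits; the student would then distill from the tempered distribution. Its use here as the calibration step is an assumption about the paper's recipe.

```python
import torch
import torch.nn.functional as F


def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fit a single scalar temperature T on held-out (logits, labels) by
    minimizing NLL. `logits` are the frozen teacher's outputs, (N, classes);
    `labels` are integer class ids, (N,). Log-parameterizing keeps T > 0."""
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return float(log_t.exp())
```

After fitting, the KD soft targets become `F.softmax(teacher_logits / T, dim=-1)`, so the student learns from a teacher whose confidence matches its accuracy.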
[264] Coresets from Trajectories: Selecting Data via Correlation of Loss Differences
Manish Nagaraj, Deepak Ravikumar, Kaushik Roy
Main category: cs.LG
TL;DR: CLD is an efficient coreset selection method that uses loss trajectory alignment with validation data to identify impactful training samples, outperforming state-of-the-art methods while avoiding costly computations.
Details
Motivation: Deep learning models face scalability challenges in real-time or resource-constrained scenarios, requiring efficient methods to select the most important training samples without expensive computations.
Method: Correlation of Loss Differences (CLD) measures alignment between training samples’ loss trajectories and a held-out validation set, using only per-sample loss values from training checkpoints without gradient or curvature computations.
Result: CLD outperforms or matches state-of-the-art methods on CIFAR-100 and ImageNet-1k, stays within 1% of expensive baselines, transfers effectively across architectures with <1% degradation, and shows inherent bias reduction through per-class validation alignment.
Conclusion: CLD provides a principled, efficient, stable, and transferable solution for scalable dataset optimization with theoretical convergence guarantees and practical performance benefits.
Abstract: Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Correlation of Loss Differences (CLD), a simple and scalable metric for coreset selection that identifies the most impactful training samples by measuring their alignment with the loss trajectories of a held-out validation set. CLD is highly efficient, requiring only per-sample loss values computed at training checkpoints, and avoiding the costly gradient and curvature computations used in many existing subset selection methods. We develop a general theoretical framework that establishes convergence guarantees for CLD-based coresets, demonstrating that the convergence error is upper-bounded by the alignment of the selected samples and the representativeness of the validation set. On CIFAR-100 and ImageNet-1k, CLD-based coresets typically outperform or closely match state-of-the-art methods across subset sizes, and remain within 1% of more computationally expensive baselines even when not leading. CLD transfers effectively across architectures (ResNet, VGG, DenseNet), enabling proxy-to-target selection with <1% degradation. Moreover, CLD is stable when using only early checkpoints, incurring negligible accuracy loss. Finally, CLD exhibits inherent bias reduction via per-class validation alignment, obviating the need for additional stratified sampling. Together, these properties make CLD a principled, efficient, stable, and transferable tool for scalable dataset optimization.
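The metric itself is cheap to state: per-sample loss deltas across checkpoints, correlated with the validation set's average deltas. The Pearson-correlation form below follows the abstract's description; the array shapes and per-sample normalization details are assumptions.

```python
import numpy as np


def cld_scores(train_losses: np.ndarray, val_losses: np.ndarray) -> np.ndarray:
    """Correlation of Loss Differences. train_losses: (n_train, n_checkpoints)
    per-sample losses recorded at saved checkpoints; val_losses:
    (n_val, n_checkpoints). A sample scores high when its loss trajectory
    moves in step with the validation set's."""
    d_train = np.diff(train_losses, axis=1)           # per-sample loss deltas
    d_val = np.diff(val_losses, axis=1).mean(axis=0)  # mean validation deltas
    d_train_c = d_train - d_train.mean(axis=1, keepdims=True)
    d_val_c = d_val - d_val.mean()
    num = d_train_c @ d_val_c
    den = np.linalg.norm(d_train_c, axis=1) * np.linalg.norm(d_val_c) + 1e-12
    return num / den  # Pearson correlation per training sample


def select_coreset(train_losses, val_losses, k: int) -> np.ndarray:
    """Keep the k samples whose trajectories align best with validation."""
    return np.argsort(cld_scores(train_losses, val_losses))[-k:]
```

Note that nothing here touches gradients or curvature, which is the source of the method's efficiency claim.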
[265] cMALC-D: Contextual Multi-Agent LLM-Guided Curriculum Learning with Diversity-Based Context Blending
Anirudh Satheesh, Keenan Powell, Hua Wei
Main category: cs.LG
TL;DR: cMALC-D uses LLMs to generate meaningful curricula and diversity-based context blending to improve multi-agent reinforcement learning generalization and sample efficiency in uncertain environments.
Details
Motivation: Existing contextual MARL methods rely on unreliable proxy signals like noisy value estimates that are unstable in multi-agent settings due to inter-agent dynamics and partial observability.
Method: Proposes LLM-guided curriculum learning with diversity-based context blending that creates new training scenarios by combining features from prior contexts to prevent mode collapse.
Result: Experiments in traffic signal control domains show significant improvements in both generalization and sample efficiency compared to existing curriculum learning baselines.
Conclusion: cMALC-D provides a robust framework for training context-agnostic policies that perform well across diverse environment configurations using semantically meaningful curricula.
Abstract: Many multi-agent reinforcement learning (MARL) algorithms are trained in fixed simulation environments, making them brittle when deployed in real-world scenarios with more complex and uncertain conditions. Contextual MARL (cMARL) addresses this by parameterizing environments with context variables and training a context-agnostic policy that performs well across all environment configurations. Existing cMARL methods attempt to use curriculum learning to help train and evaluate context-agnostic policies, but they often rely on unreliable proxy signals, such as value estimates or generalized advantage estimates that are noisy and unstable in multi-agent settings due to inter-agent dynamics and partial observability. To address these issues, we propose Contextual Multi-Agent LLM-Guided Curriculum Learning with Diversity-Based Context Blending (cMALC-D), a framework that uses Large Language Models (LLMs) to generate semantically meaningful curricula and provide a more robust evaluation signal. To prevent mode collapse and encourage exploration, we introduce a novel diversity-based context blending mechanism that creates new training scenarios by combining features from prior contexts. Experiments in traffic signal control domains demonstrate that cMALC-D significantly improves both generalization and sample efficiency compared to existing curriculum learning baselines. We provide code at https://github.com/DaRL-LibSignal/cMALC-D.
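The blending step, stripped of the LLM guidance, can be sketched in a few lines; uniform per-key mixing is an illustrative simplification of the paper's diversity-based rule, and the example context variables are hypothetical.

```python
import random


def blend_contexts(ctx_a: dict, ctx_b: dict) -> dict:
    """Create a new environment context by mixing per-variable values from
    two prior contexts. The paper's blending is diversity-based and
    LLM-guided; this uniform per-key choice only shows the mechanism."""
    assert ctx_a.keys() == ctx_b.keys(), "contexts must share a schema"
    return {k: random.choice((ctx_a[k], ctx_b[k])) for k in ctx_a}


# Hypothetical traffic-signal contexts: blend demand level and green time.
print(blend_contexts({"demand": 0.3, "green_s": 30},
                     {"demand": 0.9, "green_s": 12}))
```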
[266] Bounds on Perfect Node Classification: A Convex Graph Clustering Perspective
Firooz Shahriari-Mehr, Javad Aliakbari, Alexandre Graell i Amat, Ashkan Panahi
Main category: cs.LG
TL;DR: Novel spectral graph clustering optimization for transductive node classification that synergizes graph structure with node features/labels, enabling perfect community recovery under milder conditions than graph clustering alone.
Details
Motivation: To improve transductive node classification by integrating node-specific information (labels and features) with graph structure in a unified spectral clustering framework, leveraging the synergy between these information sources.
Method: Proposed a novel optimization problem that incorporates node labels and features into a spectral graph clustering framework. Developed algorithmic solutions to solve this optimization problem.
Result: Demonstrated that suitable node-specific information guarantees perfect community recovery under milder conditions than traditional graph clustering bounds. Numerical experiments confirmed the synergy between graph structure and node-specific information.
Conclusion: The proposed framework successfully integrates node-specific information with graph structure, showing improved performance and more robust community recovery compared to graph clustering alone, with theoretical guarantees and empirical validation.
Abstract: We present an analysis of the transductive node classification problem, where the underlying graph consists of communities that agree with the node labels and node features. For node classification, we propose a novel optimization problem that incorporates the node-specific information (labels and features) in a spectral graph clustering framework. Studying this problem, we demonstrate a synergy between the graph structure and node-specific information. In particular, we show that suitable node-specific information guarantees the solution of our optimization problem perfectly recovering the communities, under milder conditions than the bounds on graph clustering alone. We present algorithmic solutions to our optimization problem and numerical experiments that confirm such a synergy.
[267] Beyond Optimization: Exploring Novelty Discovery in Autonomous Experiments
Ralph Bulanadi, Jawad Chowdhury, Funakubo Hiroshi, Maxim Ziatdinov, Rama Vasudevan, Arpan Biswas, Yongtao Liu
Main category: cs.LG
TL;DR: INS2ANE framework enhances autonomous experiments by integrating novelty scoring and strategic sampling to discover unexpected physical phenomena beyond conventional optimization targets.
Details
Motivation: Current autonomous experiments focus too narrowly on predefined optimization targets, limiting discovery of unexpected or unknown physical phenomena that could lead to scientific breakthroughs.
Method: Integrated Novelty Score-Strategic Autonomous Non-Smooth Exploration (INS2ANE) combines a novelty scoring system to evaluate result uniqueness with strategic sampling that explores under-sampled regions regardless of conventional promise.
Result: INS2ANE significantly increases diversity of explored phenomena compared to conventional optimization, validated on pre-acquired datasets and implemented in autonomous scanning probe microscopy experiments.
Conclusion: The framework demonstrates potential for autonomous experiments to enhance scientific discovery depth while maintaining efficiency, promising accelerated research through simultaneous navigation of complex experimental spaces.
Abstract: Autonomous experiments (AEs) are transforming how scientific research is conducted by integrating artificial intelligence with automated experimental platforms. Current AEs primarily focus on the optimization of a predefined target; while accelerating this goal, such an approach limits the discovery of unexpected or unknown physical phenomena. Here, we introduce a novel framework, INS2ANE (Integrated Novelty Score-Strategic Autonomous Non-Smooth Exploration), to enhance the discovery of novel phenomena in autonomous experimentation. Our method integrates two key components: (1) a novelty scoring system that evaluates the uniqueness of experimental results, and (2) a strategic sampling mechanism that promotes exploration of under-sampled regions even if they appear less promising by conventional criteria. We validate this approach on a pre-acquired dataset with a known ground truth comprising image-spectral pairs. We further implement the process on autonomous scanning probe microscopy experiments. INS2ANE significantly increases the diversity of explored phenomena in comparison to conventional optimization routines, enhancing the likelihood of discovering previously unobserved phenomena. These results demonstrate the potential for AEs to enhance the depth of scientific discovery; in combination with the efficiency provided by AEs, this approach promises to accelerate scientific research by simultaneously navigating complex experimental spaces to uncover new phenomena.
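A minimal sketch of how the two components could combine into a single acquisition rule, assuming a surrogate `feat_fn` that predicts a result feature vector for a candidate location; the additive scoring is our illustrative reading, not the paper's exact formulation:

```python
import numpy as np

def acquisition(candidates, measured_x, measured_feats, feat_fn, beta=1.0):
    """Score candidates by novelty of their predicted results plus a
    bonus for under-sampled regions of the experimental space."""
    scores = []
    for x in candidates:
        f = feat_fn(x)
        # novelty: distance of predicted result to the nearest seen result
        novelty = np.min(np.linalg.norm(measured_feats - f, axis=1))
        # exploration: distance to the nearest already-measured location
        sparsity = np.min(np.linalg.norm(measured_x - x, axis=1))
        scores.append(novelty + beta * sparsity)
    return np.array(scores)
```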
[268] Discovering equations from data: symbolic regression in dynamical systems
Beatriz R. Brum, Luiza Lober, Isolde Previdelli, Francisco A. Rodrigues
Main category: cs.LG
TL;DR: Comparison of 5 symbolic regression methods shows PySR is most effective for recovering equations from dynamical systems including chaotic and epidemic models, with results nearly indistinguishable from original analytical forms.
Details
Motivation: To compare different symbolic regression methods for automated equation discovery from data, particularly for complex dynamic systems in physics, ecology, and epidemiology.
Method: Used five symbolic regression methods to recover equations from nine different dynamical processes including chaotic dynamics and epidemic models, with PySR identified as the most suitable method.
Result: PySR demonstrated high predictive power and accuracy, with some estimates being indistinguishable from the original analytical forms, making it the most effective method among those tested.
Conclusion: Symbolic regression, particularly PySR, shows strong potential as a robust tool for inferring and modeling real-world phenomena through automated equation discovery from data.
Abstract: The process of discovering equations from data lies at the heart of physics and in many other areas of research, including mathematical ecology and epidemiology. Recently, machine learning methods known as symbolic regression have automated this process. As several methods are available in the literature, it is important to compare them, particularly for dynamic systems that describe complex phenomena. In this paper, five symbolic regression methods were used for recovering equations from nine dynamical processes, including chaotic dynamics and epidemic models, with the PySR method proving to be the most suitable for inferring equations. Benchmark results demonstrate its high predictive power and accuracy, with some estimates being indistinguishable from the original analytical forms. These results highlight the potential of symbolic regression as a robust tool for inferring and modelling real-world phenomena.
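Since the paper singles out PySR, a brief usage sketch with the library's scikit-learn-style interface; the hyperparameters are illustrative, and the toy logistic-growth target is our own example rather than one of the paper's nine systems:

```python
import numpy as np
from pysr import PySRRegressor  # pip install pysr

# Recover a simple dynamical law dx/dt = f(x) from sampled data.
x = np.random.uniform(-2, 2, (200, 1))
dxdt = 2.5 * x[:, 0] * (1 - x[:, 0])   # logistic growth, the ground truth

model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "-", "*"],
)
model.fit(x, dxdt)
print(model)  # table of best-found symbolic expressions
```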
[269] Latent Variable Modeling for Robust Causal Effect Estimation
Tetsuro Morimura, Tatsushi Oka, Yugo Suzuki, Daisuke Moriwaki
Main category: cs.LG
TL;DR: Integrates latent variable modeling into double machine learning for robust causal effect estimation with hidden confounders
Details
Motivation: Address challenges posed by missing or unmeasured covariates in causal inference by accounting for hidden factors influencing treatment or outcome.
Method: Proposes a framework that incorporates latent variables only in the second stage of DML, separating representation learning from latent inference. Considers two scenarios: latent variables affecting only the outcome, and latent variables affecting both treatment and outcome.
Result: Demonstrates robustness and effectiveness through extensive experiments on synthetic and real-world datasets
Conclusion: The proposed framework enables robust causal effect estimation in the presence of hidden factors by integrating latent variable modeling with double machine learning
Abstract: Latent variable models provide a powerful framework for incorporating and inferring unobserved factors in observational data. In causal inference, they help account for hidden factors influencing treatment or outcome, thereby addressing challenges posed by missing or unmeasured covariates. This paper proposes a new framework that integrates latent variable modeling into the double machine learning (DML) paradigm to enable robust causal effect estimation in the presence of such hidden factors. We consider two scenarios: one where a latent variable affects only the outcome, and another where it may influence both treatment and outcome. To ensure tractability, we incorporate latent variables only in the second stage of DML, separating representation learning from latent inference. We demonstrate the robustness and effectiveness of our method through extensive experiments on both synthetic and real-world datasets.
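For context, the standard second stage of DML in a partially linear model reduces to a residual-on-residual regression; the sketch below shows that baseline (with sklearn learners as stand-in nuisance models), which the paper extends by introducing latent variables at this second stage:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def dml_effect(X, t, y):
    """Stage 1: learn nuisances E[y|X] and E[t|X] with cross-fitting.
    Stage 2: regress outcome residuals on treatment residuals to get
    the effect theta. X: (n, d); t, y: (n,) numpy arrays."""
    m_hat = cross_val_predict(RandomForestRegressor(), X, y, cv=5)
    e_hat = cross_val_predict(RandomForestRegressor(), X, t, cv=5)
    ry, rt = y - m_hat, t - e_hat
    return (rt @ ry) / (rt @ rt)
```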
[270] Generalizable AI Model for Indoor Temperature Forecasting Across Sub-Saharan Africa
Zainab Akhtar, Eunice Jengo, Björn Haßler
Main category: cs.LG
TL;DR: Lightweight AI model for predicting indoor temperatures in naturally ventilated buildings in Sub-Saharan Africa, achieving cross-country performance with minimal inputs.
Details
Motivation: To develop an accessible AI solution for thermal comfort management in resource-constrained environments like schools and homes in Sub-Saharan Africa.
Method: Extends the Temp-AI-Estimator framework using a domain-informed approach, trained on Tanzanian school data and evaluated on Nigerian schools and Gambian homes with minimal accessible inputs.
Result: Achieves robust cross-country performance with mean absolute errors of 1.45°C for Nigerian schools and 0.65°C for Gambian homes.
Conclusion: Demonstrates AI’s potential for effective thermal comfort management in resource-constrained environments using lightweight, accessible models.
Abstract: This study presents a lightweight, domain-informed AI model for predicting indoor temperatures in naturally ventilated schools and homes in Sub-Saharan Africa. The model extends the Temp-AI-Estimator framework, trained on Tanzanian school data, and evaluated on Nigerian schools and Gambian homes. It achieves robust cross-country performance using only minimal accessible inputs, with mean absolute errors of 1.45°C for Nigerian schools and 0.65°C for Gambian homes. These findings highlight AI’s potential for thermal comfort management in resource-constrained environments.
[271] A Systematic Review on the Generative AI Applications in Human Medical Genomics
Anton Changalidis, Yury Barbitoff, Yulia Nasykhova, Andrey Glotov
Main category: cs.LG
TL;DR: Systematic review of 172 studies showing transformer-based LLMs significantly advance genetic disease diagnostics through variant interpretation, medical imaging analysis, and report generation, but face challenges in multimodal data integration and clinical implementation.
Details
Motivation: Traditional statistical and machine learning methods struggle with complex, high-dimensional genetic data, while LLMs based on transformer architectures have shown promise in contextual comprehension of unstructured medical data for inherited disease diagnosis.
Method: Conducted automated keyword-based search in PubMed, bioRxiv, medRxiv, and arXiv to identify studies on LLM applications in genetics diagnostics and education, followed by analysis of 172 relevant studies after removing irrelevant or outdated models.
Result: LLMs demonstrate significant advancements in genomic variant identification, annotation, interpretation, and medical imaging analysis through vision transformers, improving disease and risk stratification as well as report generation.
Conclusion: While transformer-based models show transformative potential for hereditary disease diagnostics and genetic education, major challenges remain in integrating multimodal data into unified clinical pipelines and achieving generalizability and practical implementation in clinical settings.
Abstract: Although traditional statistical techniques and machine learning methods have contributed significantly to genetics and, in particular, inherited disease diagnosis, they often struggle with complex, high-dimensional data, a challenge now addressed by state-of-the-art deep learning models. Large language models (LLMs), based on transformer architectures, have excelled in tasks requiring contextual comprehension of unstructured medical data. This systematic review examines the role of LLMs in the genetic research and diagnostics of both rare and common diseases. Automated keyword-based search in PubMed, bioRxiv, medRxiv, and arXiv was conducted, targeting studies on LLM applications in diagnostics and education within genetics and removing irrelevant or outdated models. A total of 172 studies were analyzed, highlighting applications in genomic variant identification, annotation, and interpretation, as well as medical imaging advancements through vision transformers. Key findings indicate that while transformer-based models significantly advance disease and risk stratification, variant interpretation, medical imaging analysis, and report generation, major challenges persist in integrating multimodal data (genomic sequences, imaging, and clinical records) into unified and clinically robust pipelines, facing limitations in generalizability and practical implementation in clinical settings. This review provides a comprehensive classification and assessment of the current capabilities and limitations of LLMs in transforming hereditary disease diagnostics and supporting genetic education, serving as a guide to navigate this rapidly evolving field.
[272] Objective Value Change and Shape-Based Accelerated Optimization for the Neural Network Approximation
Pengcheng Xie, Zihao Zhou, Zijian Zhou
Main category: cs.LG
TL;DR: Proposes VC metric to measure neural network approximation difficulty, identifies VC-tendency and minority-tendency phenomena, and introduces a preprocessing framework for acceleration.
Details
Motivation: Neural networks suffer from unpredictable local performance that hinders reliability in critical applications, requiring a quantifiable measure to understand approximation behavior.
Method: Introduces VC (value change) metric to measure local value changes, investigates theoretical properties, identifies VC-tendency and minority-tendency phenomena, and proposes a preprocessing framework based on VC distance metric.
Result: Numerical results from real-world experiments and PDE-related scientific problems support the discoveries and show that the preprocessing acceleration method is effective.
Conclusion: VC metric provides valuable insights into neural network approximation behavior, and the proposed preprocessing framework offers practical acceleration for neural network approximation tasks.
Abstract: This paper introduces a novel metric of an objective function f, which we call VC (value change), to measure the difficulty and approximation effects encountered when conducting a neural network approximation task; it numerically supports characterizing the local performance and behavior of neural network approximation. Neural networks often suffer from unpredictable local performance, which can hinder their reliability in critical applications. VC addresses this issue by providing a quantifiable measure of local value changes in network behavior, offering insights into the stability and performance of the neural network approximation. We investigate some fundamental theoretical properties of VC and identify two intriguing phenomena in neural network approximation: the VC-tendency and the minority-tendency. These trends respectively characterize how pointwise errors evolve in relation to the distribution of VC during the approximation process. In addition, we propose a novel metric based on VC, which measures the distance between two functions from the perspective of variation. Building upon this metric, we further propose a new preprocessing framework for neural network approximation. Numerical results, including a real-world experiment and a PDE-related scientific problem, support our discoveries and the preprocessing acceleration method.
[273] Generative AI Against Poaching: Latent Composite Flow Matching for Wildlife Conservation
Lingkai Kong, Haichuan Wang, Charles A. Emogor, Vincent Börsch-Supan, Lily Xu, Milind Tambe
Main category: cs.LG
TL;DR: A flow matching approach for poaching prediction that addresses imperfect detection through latent space occupancy modeling and data scarcity through composite flow initialization from linear models.
Details
Motivation: Poaching threatens wildlife and biodiversity, but existing prediction methods lack expressivity for complex spatiotemporal patterns. Recent generative models like flow matching offer flexibility but face challenges with imperfect detection and limited data in real-world poaching scenarios.
Method: Integrates flow matching with occupancy-based detection model to train flows in latent space for inferring underlying occupancy states. Uses composite flow initialized from linear-model predictions rather than random noise to inject prior knowledge and improve generalization.
Result: Evaluations on datasets from two national parks in Uganda show consistent gains in predictive accuracy compared to existing methods.
Conclusion: The proposed approach effectively addresses key challenges in poaching prediction by combining flow matching with occupancy modeling and prior knowledge injection, demonstrating improved performance in real-world conservation settings.
Abstract: Poaching poses significant threats to wildlife and biodiversity. A valuable step in reducing poaching is to forecast poacher behavior, which can inform patrol planning and other conservation interventions. Existing poaching prediction methods based on linear models or decision trees lack the expressivity to capture complex, nonlinear spatiotemporal patterns. Recent advances in generative modeling, particularly flow matching, offer a more flexible alternative. However, training such models on real-world poaching data faces two central obstacles: imperfect detection of poaching events and limited data. To address imperfect detection, we integrate flow matching with an occupancy-based detection model and train the flow in latent space to infer the underlying occupancy state. To mitigate data scarcity, we adopt a composite flow initialized from a linear-model prediction rather than random noise which is the standard in diffusion models, injecting prior knowledge and improving generalization. Evaluations on datasets from two national parks in Uganda show consistent gains in predictive accuracy.
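A minimal sketch of the composite-flow idea under standard conditional flow matching, where the source point is a linear-model prediction plus small noise rather than pure Gaussian noise; `v_net`, the noise scale, and the linear interpolation path are our assumptions, not the paper's exact construction:

```python
import torch

def composite_fm_loss(v_net, x1, linear_pred, sigma=0.1):
    """Conditional flow-matching loss with a composite source: x0 is a
    linear-model prediction plus small noise, injecting prior knowledge
    into the flow's starting distribution."""
    x0 = linear_pred + sigma * torch.randn_like(x1)
    t = torch.rand(x1.shape[0], 1)
    xt = (1 - t) * x0 + t * x1       # linear interpolation path
    target = x1 - x0                 # constant velocity along that path
    return ((v_net(xt, t) - target) ** 2).mean()
```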
[274] Beacon: Post-Training Quantization with Integrated Grid Selection
Shihao Zhang, Rayan Saab
Main category: cs.LG
TL;DR: Beacon is a tuning-free per-channel PTQ method that automatically determines optimal scaling factors using fixed non-scaled alphabets and geometric properties of symmetric quantization, achieving competitive performance without manual tuning or large calibration sets.
Details
Motivation: Existing per-channel post-training quantization methods require manual tuning or grid search to determine scaling factors, which is inefficient and time-consuming. There's a need for an automated, tuning-free approach that maintains competitive performance.
Method: Beacon performs per-channel quantization using fixed non-scaled alphabets and automatically determines optimal scaling factors by exploiting the geometry of symmetric scalar quantization. It supports both symmetric and asymmetric quantization with minimal modifications and operates without back-propagation or large calibration sets.
Result: The method achieves competitive performance compared to state-of-the-art quantization methods despite its simplicity and tuning-free nature.
Conclusion: Beacon provides a practical, efficient solution for model deployment that eliminates manual tuning while maintaining competitive quantization performance across both symmetric and asymmetric quantization scenarios.
Abstract: Quantization is a widely used compression technique for reducing the memory and computation costs of large pre-trained models. A key challenge in per-channel post-training quantization (PTQ) is selecting appropriate scaling factors to replace weight values with values from a scaled quantization grid. Existing methods typically fix the scale at the outset via heuristic tuning or grid search. In this note, we propose Beacon, a simple and effective algorithm that eliminates the need for such manual tuning. Beacon performs per-channel PTQ directly using a fixed non-scaled alphabet and automatically determines the optimal scaling factors by exploiting the geometry of symmetric scalar quantization. It supports both symmetric and asymmetric quantization with minimal modifications and does not rely on back-propagation or large calibration sets. Despite its simplicity and tuning-free nature, Beacon achieves competitive performance compared to state-of-the-art methods, making it a practical solution for efficient model deployment.
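To show the subproblem Beacon addresses, here is the per-channel objective with a coarse scale search standing in for Beacon's closed-form geometric selection; the alphabet and scale grid are illustrative:

```python
import numpy as np

def quantize_channel(w, alphabet, scales):
    """Per-channel PTQ subproblem: pick a scale s so that the fixed
    alphabet, scaled by s, best reconstructs the weights. Beacon derives
    the optimal s analytically; this brute-force search is a stand-in."""
    best_err, best_w_hat = np.inf, None
    for s in scales:
        q = alphabet[np.argmin(np.abs(w[:, None] / s - alphabet[None, :]), axis=1)]
        err = np.linalg.norm(w - s * q)
        if err < best_err:
            best_err, best_w_hat = err, s * q
    return best_w_hat

w = np.random.randn(64)
alphabet = np.array([-3.0, -1.0, 1.0, 3.0])   # fixed non-scaled alphabet
w_hat = quantize_channel(w, alphabet, scales=np.linspace(0.05, 1.0, 50))
```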
[275] Dynamics-Aligned Latent Imagination in Contextual World Models for Zero-Shot Generalization
Frank Röder, Jan Benad, Manfred Eppe, Pradeep Kr. Banerjee
Main category: cs.LG
TL;DR: DALI is a framework that infers latent context representations from agent-environment interactions to enable zero-shot generalization to unseen environmental conditions without retraining.
Details
Motivation: Real-world RL needs to adapt to unseen conditions without costly retraining. Existing cMDP methods require explicit context variables, limiting their use when contexts are latent or hard to measure.
Method: Integrated within Dreamer architecture, DALI trains a self-supervised encoder to predict forward dynamics and generate actionable representations that condition the world model and policy.
Result: DALI achieves significant gains over context-unaware baselines and often surpasses context-aware baselines in extrapolation tasks, enabling zero-shot generalization to unseen contextual variations.
Conclusion: The framework provides efficient context inference and robust generalization with counterfactual consistency in the latent space, allowing physically plausible imagined rollouts when perturbing context dimensions.
Abstract: Real-world reinforcement learning demands adaptation to unseen environmental conditions without costly retraining. Contextual Markov Decision Processes (cMDP) model this challenge, but existing methods often require explicit context variables (e.g., friction, gravity), limiting their use when contexts are latent or hard to measure. We introduce Dynamics-Aligned Latent Imagination (DALI), a framework integrated within the Dreamer architecture that infers latent context representations from agent-environment interactions. By training a self-supervised encoder to predict forward dynamics, DALI generates actionable representations conditioning the world model and policy, bridging perception and control. We theoretically prove this encoder is essential for efficient context inference and robust generalization. DALI’s latent space enables counterfactual consistency: Perturbing a gravity-encoding dimension alters imagined rollouts in physically plausible ways. On challenging cMDP benchmarks, DALI achieves significant gains over context-unaware baselines, often surpassing context-aware baselines in extrapolation tasks, enabling zero-shot generalization to unseen contextual variations.
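A minimal sketch of a DALI-style context encoder trained with a self-supervised forward-dynamics loss; the dimensions, GRU encoder, and linear dynamics head are all illustrative choices, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Infers a latent context vector from a window of transitions and
    trains it by predicting forward dynamics conditioned on that context."""
    def __init__(self, obs_dim=8, act_dim=2, ctx_dim=4):
        super().__init__()
        self.enc = nn.GRU(obs_dim + act_dim, ctx_dim, batch_first=True)
        self.dyn = nn.Linear(obs_dim + act_dim + ctx_dim, obs_dim)

    def loss(self, obs, act, next_obs):
        # obs/act: (B, T, dim); use the final hidden state as the context
        _, h = self.enc(torch.cat([obs, act], -1))
        ctx = h[-1].unsqueeze(1).expand(-1, obs.shape[1], -1)
        pred = self.dyn(torch.cat([obs, act, ctx], -1))
        return ((pred - next_obs) ** 2).mean()
```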
[276] FedReFT: Federated Representation Fine-Tuning with All-But-Me Aggregation
Fatema Siddika, Md Anwar Hossen, J. Pablo Muñoz, Tanya Roosta, Anuj Sharma, Ali Jannesari
Main category: cs.LG
TL;DR: FedReFT introduces federated representation fine-tuning that manipulates hidden representations instead of weights, achieving 7x-15x higher parameter efficiency than LoRA-based methods while handling client heterogeneity through All-But-Me aggregation.
Details
Motivation: Address challenges of applying Representation Fine-tuning (ReFT) in Federated Learning due to client data heterogeneity, model capacity differences, and computational resource constraints.
Method: FedReFT applies sparse intervention layers to steer hidden representations directly, combined with All-But-Me aggregation where clients receive aggregated updates from others and partially incorporate them for stable personalized learning.
Result: Outperforms state-of-the-art PEFT methods in FL across commonsense reasoning, arithmetic reasoning, instruction-tuning, and GLUE benchmarks.
Conclusion: FedReFT provides a lightweight, semantically rich fine-tuning alternative ideal for edge devices, achieving superior parameter efficiency while maintaining performance across diverse tasks.
Abstract: Parameter-efficient fine-tuning (PEFT) has attracted significant attention for adapting large pre-trained models by modifying a small subset of parameters. Recently, Representation Fine-tuning (ReFT) has emerged as an effective alternative. ReFT shifts the fine-tuning paradigm from updating model weights to directly manipulating hidden representations that capture rich semantic information, and performs better than state-of-the-art PEFTs in standalone settings. However, its application in Federated Learning (FL) remains challenging due to heterogeneity in clients’ data distributions, model capacities, and computational resources. To address these challenges, we introduce Federated Representation Fine-Tuning (FedReFT), a novel approach to fine-tune the client’s hidden representation. FedReFT applies sparse intervention layers to steer hidden representations directly, offering a lightweight and semantically rich fine-tuning alternative ideal for edge devices. However, representation-level updates are especially vulnerable to aggregation mismatch under different task heterogeneity, where naive averaging can corrupt semantic alignment. To mitigate this issue, we propose All-But-Me (ABM) aggregation, where each client receives the aggregated updates of others and partially incorporates them, enabling stable and personalized learning by balancing local focus with global knowledge. We evaluate FedReFT on commonsense reasoning, arithmetic reasoning, instruction-tuning, and GLUE, where it consistently outperforms state-of-the-art PEFT methods in FL, achieving 7x-15x higher parameter efficiency compared to leading LoRA-based approaches.
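The All-But-Me rule reduces to a simple per-client blend; a sketch, where the mixing weight `lam` is an assumed hyperparameter balancing local focus against global knowledge:

```python
import torch

def all_but_me(updates, lam=0.5):
    """For each client i, average every other client's update and blend
    it with the local one. updates: list of same-shaped tensors."""
    total = torch.stack(updates).sum(0)
    blended = []
    for i, u in enumerate(updates):
        others = (total - u) / (len(updates) - 1)  # mean over all but me
        blended.append(lam * u + (1 - lam) * others)
    return blended
```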
[277] Multi-Agent Reinforcement Learning in Intelligent Transportation Systems: A Comprehensive Survey
RexCharles Donatus, Kumater Ter, Ore-Ofe Ajayi, Daniel Udekwe
Main category: cs.LG
TL;DR: Comprehensive survey of Multi-Agent Reinforcement Learning (MARL) applications in Intelligent Transportation Systems (ITS), covering taxonomy, domains, simulation platforms, and key challenges.
Details
Motivation: Address the growing complexity of urban mobility and demand for efficient, sustainable solutions through autonomous decision-making in dynamic, large-scale transportation environments.
Method: Structured taxonomy categorizing MARL approaches by coordination models and learning algorithms (value-based, policy-based, actor-critic, communication-enhanced frameworks), review of applications across key ITS domains, and analysis of simulation platforms.
Result: Survey provides comprehensive coverage of MARL applications in traffic signal control, autonomous vehicle coordination, logistics optimization, and mobility-on-demand systems, along with evaluation platforms and benchmarks.
Conclusion: MARL offers promising solutions for ITS but faces significant challenges including scalability, non-stationarity, credit assignment, communication constraints, and sim-to-real transfer gap that hinder real-world deployment.
Abstract: The growing complexity of urban mobility and the demand for efficient, sustainable, and adaptive solutions have positioned Intelligent Transportation Systems (ITS) at the forefront of modern infrastructure innovation. At the core of ITS lies the challenge of autonomous decision-making across dynamic, large-scale, and uncertain environments where multiple agents (traffic signals, autonomous vehicles, or fleet units) must coordinate effectively. Multi-Agent Reinforcement Learning (MARL) offers a promising paradigm for addressing these challenges by enabling distributed agents to jointly learn optimal strategies that balance individual objectives with system-wide efficiency. This paper presents a comprehensive survey of MARL applications in ITS. We introduce a structured taxonomy that categorizes MARL approaches according to coordination models and learning algorithms, spanning value-based, policy-based, actor-critic, and communication-enhanced frameworks. Applications are reviewed across key ITS domains, including traffic signal control, connected and autonomous vehicle coordination, logistics optimization, and mobility-on-demand systems. Furthermore, we highlight widely used simulation platforms such as SUMO, CARLA, and CityFlow that support MARL experimentation, along with emerging benchmarks. The survey also identifies core challenges, including scalability, non-stationarity, credit assignment, communication constraints, and the sim-to-real transfer gap, which continue to hinder real-world deployment.
[278] Multi-View Graph Convolution Network for Internal Talent Recommendation Based on Enterprise Emails
Soo Hyun Kim, Jang-Hyun Kim
Main category: cs.LG
TL;DR: A novel framework for internal talent recommendation that models both WHAT employees do (task similarity) and HOW they work (collaboration patterns) using dual GCNs with adaptive gating, achieving 40.9% Hit@100 with high interpretability across job families.
Details
Motivation: Traditional internal talent recommendation methods have structural limitations and often overlook qualified candidates due to narrow managerial perspectives, requiring a more comprehensive approach.
Method: Proposes a dual graph convolutional network framework that models semantic task similarity (WHAT) and structural interaction patterns (HOW) from email data, with adaptive gating mechanism to fuse these dimensions.
Result: Achieved top performance of 40.9% on Hit@100, significantly outperforming other fusion strategies and baselines. Model shows high interpretability with context-aware fusion strategies for different job families.
Conclusion: Provides a quantitative framework for internal talent discovery that minimizes candidate omission risk. Key contribution is empirically determining optimal fusion ratio between task alignment and collaborative patterns for position success.
Abstract: Internal talent recommendation is a critical strategy for organizational continuity, yet conventional approaches suffer from structural limitations, often overlooking qualified candidates by relying on the narrow perspective of a few managers. To address this challenge, we propose a novel framework that models two distinct dimensions of an employee’s position fit from email data: WHAT they do (semantic similarity of tasks) and HOW they work (structural characteristics of their interactions and collaborations). These dimensions are represented as independent graphs and adaptively fused using a Dual Graph Convolutional Network (GCN) with a gating mechanism. Experiments show that our proposed gating-based fusion model significantly outperforms other fusion strategies and a heuristic baseline, achieving a top performance of 40.9% on Hit@100. Importantly, it is worth noting that the model demonstrates high interpretability by learning distinct, context-aware fusion strategies for different job families. For example, it learned to prioritize relational (HOW) data for ‘sales and marketing’ job families while applying a balanced approach for ‘research’ job families. This research offers a quantitative and comprehensive framework for internal talent discovery, minimizing the risk of candidate omission inherent in traditional methods. Its primary contribution lies in its ability to empirically determine the optimal fusion ratio between task alignment (WHAT) and collaborative patterns (HOW), which is required for employees to succeed in the new positions, thereby offering important practical implications.
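The gating mechanism the paper describes can be sketched as a learned convex combination of the two modality embeddings; the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Adaptively fuses a task-similarity (WHAT) embedding with a
    collaboration-structure (HOW) embedding via a learned gate; the
    gate value itself indicates which signal the model relied on."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, h_what, h_how):
        g = self.gate(torch.cat([h_what, h_how], dim=-1))
        return g * h_what + (1 - g) * h_how
```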
[279] FORGE: Foundational Optimization Representations from Graph Embeddings
Zohair Shafi, Serdar Kadioglu
Main category: cs.LG
TL;DR: Forge is a pre-training method using vector-quantized graph autoencoders on diverse MIP instances without solution dependency, enabling unsupervised clustering and supervised predictions for solver performance improvement.
Details
Motivation: Existing learning-based approaches for combinatorial optimization require solving many hard instances for training data and need dedicated models per problem distribution, limiting scalability and generalization.
Method: Pre-train a vector-quantized graph autoencoder on a large collection of mixed-integer programming instances in unsupervised fashion, creating discrete code assignments as vocabulary to represent optimization instances.
Result: Forge embeddings effectively differentiate and cluster unseen instances in unsupervised setting. In supervised setting, fine-tuned embeddings predict variables for warm-starts and integrality gaps for cut-generation across multiple problem types, improving commercial solver performance.
Conclusion: Forge provides a scalable pre-training approach that enables both unsupervised instance analysis and supervised solver enhancement across diverse problem distributions, with released code and weights for further research.
Abstract: Combinatorial optimization problems are ubiquitous in science and engineering, yet learning-based approaches to accelerate their solution often require solving a large number of hard-to-solve optimization instances to collect training data, incurring significant computational overhead. Existing methods require training dedicated models for each problem distribution for each downstream task, severely limiting their scalability and generalization. In this work, we introduce Forge, a method of pre-training a vector-quantized graph autoencoder on a large and diverse collection of mixed-integer programming (MIP) instances in an unsupervised fashion without dependency on their solution. The vector quantization process creates discrete code assignments that act as a vocabulary to represent optimization instances. We evaluate our approach under both supervised and unsupervised settings. For the unsupervised setting, we demonstrate that Forge embeddings effectively differentiate and cluster unseen instances. For the supervised setting, we fine-tune Forge embeddings and show that a single model predicts both the variables for warm-starts and integrality gaps for cut-generation across multiple problem type distributions. Both predictions help improve performance of a state-of-the-art, commercial optimization solver. Finally, we release our code and pre-trained Forge weights to encourage further research and practical use of instance-level MIP embeddings at https://github.com/skadio/forge/
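The vocabulary-building step rests on standard vector quantization; a minimal sketch with a straight-through gradient estimator, a common choice for VQ autoencoders and not necessarily Forge's exact variant:

```python
import torch

def vector_quantize(z, codebook):
    """Assign each instance embedding to its nearest codebook entry; the
    discrete code indices act as a 'vocabulary' over MIP instances.
    z: (batch, dim) encoder outputs; codebook: (num_codes, dim)."""
    idx = torch.cdist(z, codebook).argmin(dim=1)
    z_q = codebook[idx]
    # straight-through estimator so gradients still reach the encoder
    z_q = z + (z_q - z).detach()
    return z_q, idx
```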
[280] Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs
Md Abdullah Al Mamun, Ihsen Alouani, Nael Abu-Ghazaleh
Main category: cs.LG
TL;DR: SAI is a poisoning attack that exploits LLM alignment to inject bias and targeted censorship by triggering refusal on adversary-defined topics, evading current defenses and causing significant bias in downstream applications.
Details
Motivation: To demonstrate how adversaries can exploit LLM alignment mechanisms to implant bias and enforce targeted censorship without degrading general model performance, highlighting vulnerabilities in current safety training approaches.
Method: Proposes Subversive Alignment Injection (SAI) - a poisoning attack that leverages alignment training to make LLMs refuse specific adversary-defined topics while maintaining responsiveness to unrelated queries.
Result: SAI evades state-of-the-art poisoning defenses including LLM state forensics and robust aggregation techniques. With just 1% data poisoning, it causes high bias (ΔDP of 23-38%) in various applications like healthcare chatbots (refusing care to specific racial categories) and resume selection systems.
Conclusion: Current LLM alignment mechanisms are vulnerable to sophisticated poisoning attacks that can implant targeted bias and censorship, demonstrating the need for more robust defense mechanisms against such subversive alignment injections.
Abstract: Large Language Models (LLMs) are aligned to meet ethical standards and safety requirements by training them to refuse answering harmful or unsafe prompts. In this paper, we demonstrate how adversaries can exploit LLMs’ alignment to implant bias, or enforce targeted censorship without degrading the model’s responsiveness to unrelated topics. Specifically, we propose Subversive Alignment Injection (SAI), a poisoning attack that leverages the alignment mechanism to trigger refusal on specific topics or queries predefined by the adversary. Although it is perhaps not surprising that refusal can be induced through overalignment, we demonstrate how this refusal can be exploited to inject bias into the model. Surprisingly, SAI evades state-of-the-art poisoning defenses including LLM state forensics, as well as robust aggregation techniques that are designed to detect poisoning in FL settings. We demonstrate the practical dangers of this attack by illustrating its end-to-end impacts on LLM-powered application pipelines. For chat-based applications such as ChatDoctor, with 1% data poisoning, the system refuses to answer healthcare questions for a targeted racial category, leading to high bias ($\Delta DP$ of 23%). We also show that bias can be induced in other NLP tasks: for a resume selection pipeline aligned to refuse to summarize CVs from a selected university, high bias in selection ($\Delta DP$ of 27%) results. Even higher bias ($\Delta DP$ of ~38%) results on 9 other chat-based downstream applications.
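The reported bias numbers are demographic-parity gaps; for reference, ΔDP over refusal decisions can be computed as below (the variable names are illustrative):

```python
import numpy as np

def delta_dp(refused, in_target_group):
    """Demographic-parity gap: absolute difference in refusal rates
    between the targeted group and everyone else."""
    refused = np.asarray(refused, dtype=bool)
    group = np.asarray(in_target_group, dtype=bool)
    return abs(refused[group].mean() - refused[~group].mean())
```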
[281] Dynamic Synthetic Controls vs. Panel-Aware Double Machine Learning for Geo-Level Marketing Impact Estimation
Sang Su Lee, Vineeth Loganathan, Vijay Raghavan
Main category: cs.LG
TL;DR: Comparison of Synthetic Control Method (SCM) and panel-style Double Machine Learning (DML) for geo-level marketing lift measurement, showing DML variants outperform ASC models in complex scenarios with better bias reduction and coverage.
Details
Motivation: Accurately quantifying geo-level marketing lift in two-sided marketplaces is challenging, as SCM often underestimates effect size while panel-DML methods are rarely benchmarked against SCM.
Method: Built an open simulator mimicking large-scale geo roll-outs with 5 stress tests (curved trends, heterogeneous lags, biased shocks, non-linear outcomes, drifting trends). Evaluated 7 estimators: 3 ASC variants and 4 panel-DML flavors across 100 replications per scenario.
Result: ASC models showed severe bias and near-zero coverage in challenging scenarios with nonlinearities or external shocks. Panel-DML variants dramatically reduced bias and restored nominal 95% CI coverage, proving more robust.
Conclusion: ASC provides a simple baseline but is unreliable in complex situations. Propose a ‘diagnose-first’ framework where practitioners identify business challenges first, then select the appropriate DML model for robust geo-experiment analysis.
Abstract: Accurately quantifying geo-level marketing lift in two-sided marketplaces is challenging: the Synthetic Control Method (SCM) often exhibits high power yet systematically under-estimates effect size, while panel-style Double Machine Learning (DML) is seldom benchmarked against SCM. We build an open, fully documented simulator that mimics a typical large-scale geo roll-out: N_unit regional markets are tracked for T_pre weeks before launch and for a further T_post-week campaign window, allowing all key parameters to be varied by the user, and we probe both families under five stylized stress tests: 1) curved baseline trends, 2) heterogeneous response lags, 3) treated-biased shocks, 4) a non-linear outcome link, and 5) a drifting control group trend. Seven estimators are evaluated: three standard Augmented SCM (ASC) variants and four panel-DML flavors (TWFE, CRE/Mundlak, first-difference, and within-group). Across 100 replications per scenario, ASC models consistently demonstrate severe bias and near-zero coverage in challenging scenarios involving nonlinearities or external shocks. By contrast, panel-DML variants dramatically reduce this bias and restore nominal 95%-CI coverage, proving far more robust. The results indicate that while ASC provides a simple baseline, it is unreliable in common, complex situations. We therefore propose a ‘diagnose-first’ framework where practitioners first identify the primary business challenge (e.g., nonlinear trends, response lags) and then select the specific DML model best suited for that scenario, providing a more robust and reliable blueprint for analyzing geo-experiments.
[282] Adaptive Segmentation of EEG for Machine Learning Applications
Johnson Zhou, Joseph West, Krista A. Ehinger, Zhenming Ren, Sam E. John, David B. Grayden
Main category: cs.LG
TL;DR: CTXSEG is an adaptive EEG segmentation method that creates variable-length segments based on statistical differences, improving seizure detection performance compared to fixed-length segmentation without modifying ML methods.
Details
Motivation: Current EEG machine learning uses arbitrary fixed time slices that lack biological relevance since brain states aren't confined to fixed intervals. Adaptive segmentation could better capture neurological patterns.
Method: Developed CTXSEG adaptive segmentation method that creates variable-length segments based on statistical differences in EEG data. Validated using synthetic data from CTXGEN signal generator and real-world EEG seizure detection use case.
Result: CTXSEG improved seizure detection performance compared to fixed-length approaches using standardized framework, required fewer segments, and worked without modifying machine learning methods.
Conclusion: Adaptive segmentation with CTXSEG is a promising alternative to fixed-length segmentation that can be readily applied to modern ML approaches, improving EEG analysis performance and should be part of standard preprocessing repertoire.
Abstract: Objective. Electroencephalography (EEG) data is derived by sampling continuous neurological time series signals. In order to prepare EEG signals for machine learning, the signal must be divided into manageable segments. The current naive approach uses arbitrary fixed time slices, which may have limited biological relevance because brain states are not confined to fixed intervals. We investigate whether adaptive segmentation methods are beneficial for machine learning EEG analysis. Approach. We introduce a novel adaptive segmentation method, CTXSEG, that creates variable-length segments based on statistical differences in the EEG data and propose ways to use them with modern machine learning approaches that typically require fixed-length input. We assess CTXSEG using controllable synthetic data generated by our novel signal generator CTXGEN. While our CTXSEG method has general utility, we validate it on a real-world use case by applying it to an EEG seizure detection problem. We compare the performance of CTXSEG with fixed-length segmentation in the preprocessing step of a typical EEG machine learning pipeline for seizure detection. Main results. We found that using CTXSEG to prepare EEG data improves seizure detection performance compared to fixed-length approaches when evaluated using a standardized framework, without modifying the machine learning method, and requires fewer segments. Significance. This work demonstrates that adaptive segmentation with CTXSEG can be readily applied to modern machine learning approaches, with potential to improve performance. It is a promising alternative to fixed-length segmentation for signal preprocessing and should be considered as part of the standard preprocessing repertoire in EEG machine learning applications.
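A minimal sketch of adaptive, statistics-driven segmentation in the spirit of CTXSEG, cutting a new variable-length segment whenever an incoming window's mean departs from the running segment; the z-score test, window size, and threshold are our illustrative stand-ins for the paper's statistical criterion:

```python
import numpy as np

def adaptive_segments(x, win=128, z_thresh=3.0):
    """Return (start, end) index pairs of variable-length segments of a
    1-D signal, cutting where the statistics change."""
    bounds, start = [], 0
    for i in range(win, len(x) - win, win):
        seg, w = x[start:i], x[i:i + win]
        z = abs(w.mean() - seg.mean()) / (seg.std() + 1e-8)
        if z > z_thresh:            # distribution shift: close the segment
            bounds.append((start, i))
            start = i
    bounds.append((start, len(x)))
    return bounds
```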
[283] Understanding Incremental Learning with Closed-form Solution to Gradient Flow on Overparameterized Matrix Factorization
Hancheng Min, René Vidal
Main category: cs.LG
TL;DR: The paper provides a quantitative analysis of incremental learning in gradient flow for symmetric matrix factorization, showing how small initialization leads to time-scale separation that enables sequential learning of singular values and low-rank approximations.
Details
Motivation: To understand the implicit bias and regularization effects of first-order optimization algorithms in neural networks, specifically the incremental learning phenomenon where gradient flow learns matrix singular values in decreasing order with small initialization.
Method: Developed a quantitative understanding using the closed-form solution obtained by solving a Riccati-like matrix differential equation for gradient flow on symmetric matrix factorization problems.
Result: Shows that incremental learning emerges from time-scale separation among dynamics corresponding to learning different components in the target matrix, with smaller initialization scales making these separations more prominent and enabling low-rank approximations.
Conclusion: The analysis provides insights into how gradient flow with small initialization enables sequential learning and low-rank solutions, with potential for extension to asymmetric matrix factorization problems.
Abstract: Many theoretical studies on neural networks attribute their excellent empirical performance to the implicit bias or regularization induced by first-order optimization algorithms when training networks under certain initialization assumptions. One example is the incremental learning phenomenon in gradient flow (GF) on an overparameterized matrix factorization problem with small initialization: GF learns a target matrix by sequentially learning its singular values in decreasing order of magnitude over time. In this paper, we develop a quantitative understanding of this incremental learning behavior for GF on the symmetric matrix factorization problem, using its closed-form solution obtained by solving a Riccati-like matrix differential equation. We show that incremental learning emerges from some time-scale separation among dynamics corresponding to learning different components in the target matrix. By decreasing the initialization scale, these time-scale separations become more prominent, allowing one to find low-rank approximations of the target matrix. Lastly, we discuss the possible avenues for extending this analysis to asymmetric matrix factorization problems.
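For intuition, in the fully aligned scalar case the learned eigenvalue $s_i = x_i^2$ of the objective $f(X) = \tfrac14\|XX^\top - M\|_F^2$ follows logistic dynamics, which makes the time-scale separation explicit; this is a standard reduction, while the paper's Riccati-based analysis is more general:

```latex
% Aligned scalar reduction: each eigen-component decouples.
\dot{s}_i = 2\, s_i\,(\lambda_i - s_i),
\qquad
s_i(t) = \frac{\lambda_i\, s_i(0)}{s_i(0) + \bigl(\lambda_i - s_i(0)\bigr)\, e^{-2\lambda_i t}}
```

Starting from a small initialization $s_i(0) = \alpha^2$, the time to escape the plateau scales as $t_i \approx \tfrac{1}{2\lambda_i}\log(\lambda_i/\alpha^2)$, so components with larger $\lambda_i$ are learned first, and shrinking $\alpha$ widens the gaps between learning phases.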
[284] DFAMS: Dynamic-flow guided Federated Alignment based Multi-prototype Search
Zhibang Yang, Xinke Jiang, Rihong Qiu, Ruiqing Li, Yihang Zhang, Yue Fang, Yongxin Xu, Hongxin Ding, Xu Chu, Junfeng Zhao, Yasha Wang
Main category: cs.LG
TL;DR: DFAMS is a federated retrieval framework that uses dynamic information flow in LLMs to better handle ambiguous queries across distributed knowledge sources, improving retrieval accuracy and downstream task performance.
Details
Motivation: Existing federated retrieval methods struggle with ambiguous queries in cross-domain scenarios, limiting their effectiveness in supporting downstream generation tasks and mitigating LLM hallucinations.
Method: Leverages dynamic information flow in LLMs using gradient signals from annotated queries and Shapley value-based attribution to trace neuron activation paths. Trains an alignment module via multi-prototype contrastive learning for fine-grained intra-source modeling and inter-source semantic alignment.
Result: Outperforms advanced FR methods by up to 14.37% in knowledge classification accuracy, 5.38% in retrieval recall, and 6.45% in downstream QA accuracy across five benchmarks.
Conclusion: DFAMS demonstrates significant effectiveness in complex federated retrieval scenarios by better identifying latent query intents and aligning knowledge partitions across heterogeneous sources.
Abstract: Federated Retrieval (FR) routes queries across multiple external knowledge sources, to mitigate hallucinations of LLMs, when necessary external knowledge is distributed. However, existing methods struggle to retrieve high-quality and relevant documents for ambiguous queries, especially in cross-domain scenarios, which significantly limits their effectiveness in supporting downstream generation tasks. Inspired by dynamic information flow (DIF), we propose DFAMS, a novel framework that leverages DIF to identify latent query intents and construct semantically aligned knowledge partitions for accurate retrieval across heterogeneous sources. Specifically, DFAMS probes the DIF in LLMs by leveraging gradient signals from a few annotated queries and employing Shapley value-based attribution to trace neuron activation paths associated with intent recognition and subdomain boundary detection. Then, DFAMS leverages DIF to train an alignment module via multi-prototype contrastive learning, enabling fine-grained intra-source modeling and inter-source semantic alignment across knowledge bases. Experimental results across five benchmarks show that DFAMS outperforms advanced FR methods by up to 14.37% in knowledge classification accuracy, 5.38% in retrieval recall, and 6.45% in downstream QA accuracy, demonstrating its effectiveness in complex FR scenarios.
[285] Developing a Multi-Modal Machine Learning Model For Predicting Performance of Automotive Hood Frames
Abhishek Indupally, Satchit Ramnath
Main category: cs.LG
TL;DR: Multimodal machine learning architecture for rapid hood frame geometry evaluation without extensive simulation setup
Details
Motivation: To enable designers to evaluate hood frame performance without time-consuming simulation setup and reduce reliance on computationally expensive simulations.
Method: Developed a multimodal machine-learning (MMML) architecture that learns from different data modalities to predict performance metrics, tested on unseen frame geometries.
Result: MMML outperforms traditional single-modality approaches and successfully generalizes to new frame geometries not in training data
Conclusion: MMML effectively supplements traditional simulation workflows, bridges ML with engineering applications, and accelerates design cycles in conceptual design phase
Abstract: Is there a way for a designer to evaluate the performance of a given hood frame geometry without spending significant time on simulation setup? This paper seeks to address this challenge by developing a multimodal machine-learning (MMML) architecture that learns from different modalities of the same data to predict performance metrics. It also aims to use the MMML architecture to enhance the efficiency of engineering design processes by reducing reliance on computationally expensive simulations. The proposed architecture accelerates design exploration, enabling rapid iteration while maintaining high-performance standards, especially in the concept design phase. The study also presents results that show that by combining multiple data modalities, MMML outperforms traditional single-modality approaches. Two new frame geometries, not part of the training dataset, are also used for prediction using the trained MMML model to showcase the ability to generalize to unseen frame models. The findings underscore MMML’s potential in supplementing traditional simulation-based workflows, particularly in the conceptual design phase, and highlight its role in bridging the gap between machine learning and real-world engineering applications. This research paves the way for the broader adoption of machine learning techniques in engineering design, with a focus on refining multimodal approaches to optimize structural development and accelerate the design cycle.
[286] BiListing: Modality Alignment for Listings
Guillaume Guy, Mihajlo Grbovic, Chun How Tan, Han Zhao
Main category: cs.LG
TL;DR: BiListing is a bimodal approach that combines text and image embeddings from Airbnb listings using large-language models and pretrained language-image models to create unified representations, enabling better search and ranking with significant revenue impact.
Details
Motivation: Airbnb needed to overcome limitations of structured data by leveraging rich unstructured information from text and images in listings, but faced challenges in combining multiple embeddings from diverse data sources into single representations.
Method: Proposes BiListing approach that aligns text and photos using large-language models and pretrained language-image models to create single embedding vectors per listing and modality, enabling zero-shot search and overcoming cold start problems.
Result: Successfully deployed in production with 0.425% NDCG gain and drove tens of millions in incremental revenue through improved search ranking performance.
Conclusion: BiListing effectively captures unstructured listing data into unified embeddings, enabling semantic search capabilities and significant business impact for Airbnb’s accommodation recommendations.
Abstract: Airbnb is a leader in offering travel accommodations. Airbnb has historically relied on structured data to understand, rank, and recommend listings to guests due to the limited capabilities and associated complexity arising from extracting meaningful information from text and images. With the rise of representation learning, leveraging rich information from text and photos has become easier. A popular approach has been to create embeddings for text documents and images to enable use cases of computing similarities between listings or using embeddings as features in an ML model. However, an Airbnb listing has diverse unstructured data: multiple images, various unstructured text documents such as title, description, and reviews, making this approach challenging. Specifically, it is a non-trivial task to combine multiple embeddings of different pieces of information to reach a single representation. This paper proposes BiListing, for Bimodal Listing, an approach to align text and photos of a listing by leveraging large-language models and pretrained language-image models. The BiListing approach has several favorable characteristics: capturing unstructured data into a single embedding vector per listing and modality, enabling zero-shot capability to search inventory efficiently in user-friendly semantics, overcoming the cold start problem, and enabling listing-to-listing search along a single modality, or both. We conducted offline and online tests to leverage the BiListing embeddings in the Airbnb search ranking model and successfully deployed them in production, achieving a 0.425% NDCG gain and driving tens of millions in incremental revenue.
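At its simplest, collapsing a listing's many embeddings into one vector per listing can be sketched as pooling and normalizing precomputed text and image embeddings; mean pooling here is a plain stand-in for the paper's LLM-assisted alignment:

```python
import numpy as np

def listing_embedding(text_embs, image_embs):
    """Fuse multiple text and photo embeddings (e.g., from a pretrained
    language-image model) into one L2-normalized vector per listing.
    text_embs: (n_texts, d); image_embs: (n_images, d)."""
    v = np.concatenate([text_embs, image_embs]).mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-12)
```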
[287] TF-TransUNet1D: Time-Frequency Guided Transformer U-Net for Robust ECG Denoising in Digital Twin
Shijie Wang, Lei Li
Main category: cs.LG
TL;DR: TF-TransUNet1D is a novel 1D deep neural network that combines U-Net architecture with Transformer encoder and hybrid time-frequency loss for ECG signal denoising, achieving state-of-the-art performance.
Details
Motivation: ECG signals are crucial for cardiac digital twins but often compromised by noise and artifacts, which reduces their diagnostic utility and reliability for real-time monitoring.
Method: Proposed TF-TransUNet1D integrates U-Net encoder-decoder with Transformer encoder, using dual-domain loss function that optimizes both time-domain waveform reconstruction and frequency-domain spectral fidelity to suppress noise while preserving clinically significant components.
Result: Achieved mean absolute error of 0.1285 and Pearson correlation coefficient of 0.9540, demonstrating consistent superiority over state-of-the-art baselines in SNR improvement and error metrics on MIT-BIH Arrhythmia Database and NSTDB.
Conclusion: The model provides high-precision ECG denoising, bridging a critical gap in pre-processing pipelines for cardiac digital twins and enabling more reliable real-time monitoring and personalized modeling.
Abstract: Electrocardiogram (ECG) signals serve as a foundational data source for cardiac digital twins, yet their diagnostic utility is frequently compromised by noise and artifacts. To address this issue, we propose TF-TransUNet1D, a novel one-dimensional deep neural network that integrates a U-Net-based encoder-decoder architecture with a Transformer encoder, guided by a hybrid time-frequency domain loss. The model is designed to simultaneously capture local morphological features and long-range temporal dependencies, which are critical for preserving the diagnostic integrity of ECG signals. To enhance denoising robustness, we introduce a dual-domain loss function that jointly optimizes waveform reconstruction in the time domain and spectral fidelity in the frequency domain. In particular, the frequency-domain component effectively suppresses high-frequency noise while maintaining the spectral structure of the signal, enabling recovery of subtle but clinically significant waveform components. We evaluate TF-TransUNet1D using synthetically corrupted signals from the MIT-BIH Arrhythmia Database and the Noise Stress Test Database (NSTDB). Comparative experiments against state-of-the-art baselines demonstrate consistent superiority of our model in terms of SNR improvement and error metrics, achieving a mean absolute error of 0.1285 and Pearson correlation coefficient of 0.9540. By delivering high-precision denoising, this work bridges a critical gap in pre-processing pipelines for cardiac digital twins, enabling more reliable real-time monitoring and personalized modeling.
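The dual-domain loss has a natural PyTorch form; a sketch with the spectral term computed on FFT magnitudes and an assumed weighting `alpha` (the paper's exact weighting and spectral representation may differ):

```python
import torch

def time_frequency_loss(pred, target, alpha=0.5):
    """Hybrid denoising loss: time-domain waveform error plus a
    frequency-domain term on FFT magnitudes of the last axis."""
    t_loss = torch.mean((pred - target) ** 2)
    f_loss = torch.mean((torch.abs(torch.fft.rfft(pred, dim=-1))
                         - torch.abs(torch.fft.rfft(target, dim=-1))) ** 2)
    return t_loss + alpha * f_loss
```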
[288] Rethinking Transformer Connectivity: TLinFormer, A Path to Exact, Full Context-Aware Linear Attention
Zhongpan Tang
Main category: cs.LG
TL;DR: TLinFormer is a novel linear attention architecture that achieves strict linear complexity while computing exact attention scores, addressing the quadratic complexity bottleneck of standard Transformers for long-sequence tasks.
Details
Motivation: The self-attention mechanism in Transformers suffers from quadratic complexity with sequence length, limiting applications in long-sequence tasks. Existing linear attention methods sacrifice performance through approximations or restrictive context selection.
Method: TLinFormer reconfigures neuron connection patterns based on topological structure of information flow, maintaining full historical context awareness while achieving linear complexity through exact attention computation.
Result: TLinFormer demonstrates overwhelming advantages in inference latency, KV cache efficiency, memory footprint, and overall speedup compared to standard Transformer baselines on long-sequence inference tasks.
Conclusion: TLinFormer successfully bridges the performance gap between efficient attention methods and standard attention, providing a linear-complexity solution that maintains exact attention computation and full context awareness.
Abstract: The Transformer architecture has become a cornerstone of modern artificial intelligence, but its core self-attention mechanism suffers from a complexity bottleneck that scales quadratically with sequence length, severely limiting its application in long-sequence tasks. To address this challenge, existing linear attention methods typically sacrifice model performance by relying on data-agnostic kernel approximations or restrictive context selection. This paper returns to the first principles of connectionism, starting from the topological structure of information flow, to introduce a novel linear attention architecture, \textbf{TLinFormer}. By reconfiguring neuron connection patterns, TLinFormer achieves strict linear complexity while computing exact attention scores and ensuring information flow remains aware of the full historical context. This design aims to bridge the performance gap prevalent between existing efficient attention methods and standard attention. Through a series of experiments, we systematically evaluate the performance of TLinFormer against a standard Transformer baseline on long-sequence inference tasks. The results demonstrate that TLinFormer exhibits overwhelming advantages in key metrics such as \textbf{inference latency}, \textbf{KV cache efficiency}, \textbf{memory footprint}, and \textbf{overall speedup}.
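The summary does not give TLinFormer's exact construction, so the sketch below only illustrates the trade-off it targets: standard attention materializes an n x n score matrix (quadratic cost), while a generic kernelized linear attention reorders the matrix products for linear cost but only approximates the softmax scores. TLinFormer's claim is linear cost without that approximation; the kernel variant here is not its method.

```python
import torch
import torch.nn.functional as F

def standard_attention(q, k, v):
    # q, k, v: (n, d). Materializes an n x n score matrix: O(n^2) cost.
    scores = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    return scores @ v

def kernel_linear_attention(q, k, v):
    # Generic kernel trick (Katharopoulos et al. style): computing
    # phi(q) @ (phi(k)^T v) avoids the n x n matrix, giving O(n d^2)
    # cost, but the resulting scores only approximate softmax attention.
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1
    kv = phi_k.T @ v                                  # (d, d_v)
    norm = phi_q @ phi_k.sum(dim=0, keepdim=True).T   # (n, 1)
    return (phi_q @ kv) / (norm + 1e-6)
```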
[289] Assessing local deformation and computing scalar curvature with nonlinear conformal regularization of decoders
Benjamin Couéraud, Vikram Sunkara, Christof Schütte
Main category: cs.LG
TL;DR: A new geometric regularization method called nonlinear conformal regularization is introduced for autoencoder decoders, enabling local deformation measurement and manifold curvature computation.
Details
Motivation: To improve dimensionality reduction by discovering main data factors and learning better low-dimensional manifold representations with quantitative measures of local deformations.
Method: Introduces nonlinear conformal regularization for decoder maps in autoencoders, which allows local variations and provides a conformal factor to measure local deformation. The method enables computation of scalar curvature on learned manifolds.
Result: The regularization technique successfully measures local deformations through conformal factors and computes scalar curvature of learned manifolds, demonstrated on Swiss roll and CelebA datasets.
Conclusion: Nonlinear conformal regularization provides an effective geometric constraint for autoencoders, enabling quantitative analysis of manifold deformations and curvature properties in learned representations.
Abstract: One aim of dimensionality reduction is to discover the main factors that explain the data, and as such it is paramount to many applications. When working with high-dimensional data, autoencoders offer a simple yet effective approach to learning low-dimensional representations. A general autoencoder consists of two components: first, an encoder that maps the observed data onto a latent space; and second, a decoder that maps the latent space back to the original observation space, which allows a low-dimensional manifold representation of the original data to be learned. In this article, we introduce a new type of geometric regularization for decoding maps approximated by deep neural networks, namely nonlinear conformal regularization. This regularization procedure permits local variations of the decoder map and comes with a new scalar field, called the conformal factor, which acts as a quantitative indicator of the amount of local deformation sustained by the latent space when mapped into the original data space. We also show that this regularization technique allows the computation of the scalar curvature of the learned manifold. Implementation and experiments on the Swiss roll and CelebA datasets are performed to illustrate how to obtain these quantities from the architecture.
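As a rough illustration of the geometry involved, the sketch below penalizes the deviation of the decoder's pullback metric J^T J from a scalar multiple of the identity, with the scalar playing the role of the conformal factor. Fitting that scalar via the trace and using a squared Frobenius penalty are assumptions, not the paper's exact regularizer.

```python
import torch

def conformal_penalty(decoder, z):
    """decoder: callable from latent space to data space; z: (latent_dim,)
    single latent point. Returns (penalty, conformal_factor) at z."""
    J = torch.autograd.functional.jacobian(decoder, z)  # (out_dim, latent_dim)
    JtJ = J.T @ J                                       # pullback metric
    d = JtJ.shape[0]
    c = torch.trace(JtJ) / d        # best-fit scalar: the conformal factor
    penalty = ((JtJ - c * torch.eye(d)) ** 2).sum()
    return penalty, c

# Usage: add `penalty` to the reconstruction loss; track `c` across the
# latent space as a quantitative local-deformation indicator.
```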
[290] On Identifying Why and When Foundation Models Perform Well on Time-Series Forecasting Using Automated Explanations and Rating
Michael Widener, Kausik Lakkaraju, John Aydin, Biplav Srivastava
Main category: cs.LG
TL;DR: Analysis of time-series forecasting models shows feature-engineered models outperform foundation models in volatile/sparse domains with better interpretability, while foundation models work best in stable/trend-driven contexts.
Details
Motivation: Time-series forecasting models are increasingly used for real-world decisions but their performance variability and opacity create serious concerns about user reliance. Understanding why and when these models succeed or fail is crucial.
Method: Combined traditional XAI methods with Rating Driven Explanations (RDE) to evaluate four model architectures (ARIMA, Gradient Boosting, Chronos, Llama) across four heterogeneous datasets spanning finance, energy, transportation, and automotive sales domains.
Result: Feature-engineered models (e.g., Gradient Boosting) consistently outperform foundation models in volatile or sparse domains (power, car parts) while providing more interpretable explanations. Foundation models excel only in stable or trend-driven contexts (e.g., finance).
Conclusion: Model performance and interpretability vary significantly across domains - feature-engineered models are superior for volatile/sparse data with better explainability, while foundation models are only effective in stable contexts, highlighting the importance of domain-specific model selection.
Abstract: Time-series forecasting models (TSFM) have evolved from classical statistical methods to sophisticated foundation models, yet understanding why and when these models succeed or fail remains challenging. Despite this known limitation, time series forecasting models are increasingly used to generate information that informs real-world actions with equally real consequences. Understanding the complexity, performance variability, and opaque nature of these models then becomes a valuable endeavor to combat serious concerns about how users should interact with and rely on these models’ outputs. This work addresses these concerns by combining traditional explainable AI (XAI) methods with Rating Driven Explanations (RDE) to assess TSFM performance and interpretability across diverse domains and use cases. We evaluate four distinct model architectures: ARIMA, Gradient Boosting, Chronos (time-series specific foundation model), Llama (general-purpose; both fine-tuned and base models) on four heterogeneous datasets spanning finance, energy, transportation, and automotive sales domains. In doing so, we demonstrate that feature-engineered models (e.g., Gradient Boosting) consistently outperform foundation models (e.g., Chronos) in volatile or sparse domains (e.g., power, car parts) while providing more interpretable explanations, whereas foundation models excel only in stable or trend-driven contexts (e.g., finance).
[291] Uncovering the Spectral Bias in Diagonal State Space Models
Ruben Solozabal, Velibor Bojkovic, Hilal AlQuabeh, Kentaro Inui, Martin Takáč
Main category: cs.LG
TL;DR: The paper investigates diagonal state-space model initialization from a frequency perspective, proposes S4D-DFouT initialization in discrete Fourier domain, and achieves SOTA results on Long Range Arena benchmark.
Details
Motivation: Current SSM initialization relies on the HiPPO framework, but diagonal variants have not been systematically studied. The paper aims to understand diagonal SSM parameterization and learning biases from a frequency perspective.
Method: The authors analyze diagonal SSM initialization schemes from a frequency perspective, investigate the role of pole placement, and propose the S4D-DFouT initialization method in the discrete Fourier domain.
Result: The proposed S4D-DFouT initialization achieves state-of-the-art results on Long Range Arena benchmark and enables training from scratch on large datasets like PathX-256.
Conclusion: Systematic investigation of diagonal SSM initialization from frequency perspective provides insights into pole placement, leading to improved performance and scalability for state-space models.
Abstract: Current methods for initializing state-space model (SSM) parameters mainly rely on the \textit{HiPPO framework}, which is based on an online approximation of orthogonal polynomials. Recently, diagonal alternatives have been shown to reach a similar level of performance while being significantly more efficient due to the simplification in the kernel computation. However, the \textit{HiPPO framework} does not explicitly study the role of its diagonal variants. In this paper, we take a further step to investigate the role of diagonal SSM initialization schemes from the frequency perspective. Our work seeks to systematically understand how to parameterize these models and uncover the learning biases inherent in such diagonal state-space models. Based on our observations, we propose a diagonal initialization in the discrete Fourier domain, \textit{S4D-DFouT}. The insights into the role of pole placement in the initialization enable us to further scale them and achieve state-of-the-art results on the Long Range Arena benchmark, allowing us to train from scratch on very large datasets such as PathX-256.
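For context, a diagonal SSM is cheap precisely because its convolution kernel factors through elementwise exponentials of the diagonal state matrix. The sketch below uses a simplified Euler-style discretization for illustration; the paper's specific S4D-DFouT pole placement in the discrete Fourier domain is not reproduced here.

```python
import torch

def diagonal_ssm_kernel(A, B, C, dt, L):
    """A, B, C: (N,) complex diagonal SSM parameters. Returns the
    length-L real convolution kernel
        K[l] = Re( sum_n C_n * exp(A_n * dt * l) * B_n * dt ),
    a simplified discretization for illustration only."""
    steps = torch.arange(L)
    decay = torch.exp(A[:, None] * dt * steps[None, :])     # (N, L)
    return (C[:, None] * B[:, None] * dt * decay).sum(dim=0).real

# Example: poles with negative real part give a stable, decaying kernel.
N, L = 16, 128
A = -0.5 + 1j * torch.linspace(0.0, 3.14, N)
K = diagonal_ssm_kernel(A, torch.ones(N, dtype=torch.cfloat),
                        torch.randn(N, dtype=torch.cfloat), 0.1, L)
```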
[292] Towards Mitigating Excessive Forgetting in LLM Unlearning via Entanglement-Aware Unlearning with Proxy Constraint
Zhihao Liu, Jian Lou, Yuke Hu, Xiaochen Li, Tailun Chen, Yitian Chen, Zhan Qin
Main category: cs.LG
TL;DR: EAGLE-PC is a novel machine unlearning framework that uses entanglement-aware loss reweighting and proxy constraints to achieve targeted data removal from LLMs while preserving utility, approaching full retraining performance.
Details
Motivation: Address privacy and copyright concerns by enabling effective removal of specific data from trained LLMs without full retraining, overcoming limitations of existing methods that lack proper forgetting boundaries.
Method: Uses two key components: 1) entanglement-awareness guided loss reweighting that measures sample similarity to determine forgetting effort, and 2) proxy constraint using ICL-generated test data to regularize forgetting and prevent over-forgetting.
Result: Consistent improvements in forgetting-utility trade-off on TOFU and MUSE benchmarks across multiple LLMs, approaching full retraining performance when combined with NPO+GD optimizer.
Conclusion: EAGLE-PC provides a scalable, robust, and plug-and-play unlearning solution that effectively addresses privacy concerns while maintaining model utility.
Abstract: Large language models (LLMs) are trained on massive datasets that may include private or copyrighted content. Due to growing privacy and ownership concerns, data owners may request the removal of their data from trained models. Machine unlearning provides a practical solution by removing the influence of specific data without full retraining. However, most existing methods lack a sound forgetting boundary, causing some samples to be under-forgotten, leaving residual leakage risks, while others remain over-forgotten at the expense of degraded utility. In this work, we propose EAGLE-PC (Entanglement-Awareness Guided Loss Reweighting with Proxy Constraint), a novel unlearning framework that addresses these limitations through two key components. First, entanglement-awareness guided loss reweighting determines the forgetting effort of each sample by measuring its similarity to retain samples in the embedding space, enabling more targeted and effective unlearning. Second, a proxy constraint leveraging ICL (In-Context Learning) generated test data softly regularizes the forgetting process, effectively mitigating over-forgetting. EAGLE-PC is compatible with existing gradient-based objectives and serves as a plug-and-play enhancement. We evaluate EAGLE-PC on the TOFU and MUSE benchmarks, showing consistent improvements in the forgetting-utility trade-off across multiple LLMs. Combined with the NPO+GD optimizer, it approaches full retraining performance, offering a scalable and robust unlearning solution.
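A minimal sketch of the first component, under the assumption that "entanglement" is measured as the maximum cosine similarity between a forget sample's embedding and the retain set, with a softmax mapping similarities to per-sample forgetting weights; the paper's exact weighting scheme may differ.

```python
import torch
import torch.nn.functional as F

def entanglement_weights(forget_emb, retain_emb, temperature=1.0):
    """forget_emb: (Nf, d); retain_emb: (Nr, d). Forget samples that are
    highly similar to retain data are 'entangled' and receive a smaller
    forgetting effort, limiting collateral damage to utility."""
    sim = F.cosine_similarity(forget_emb[:, None, :],
                              retain_emb[None, :, :], dim=-1)  # (Nf, Nr)
    entanglement = sim.max(dim=1).values                       # (Nf,)
    w = torch.softmax(-entanglement / temperature, dim=0)
    return w * len(w)   # rescale so the weights average to 1

# Usage: loss = (entanglement_weights(fe, re) * per_sample_forget_loss).mean()
```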
[293] Evaluating Differentially Private Generation of Domain-Specific Text
Yidan Sun, Viktor Schlegel, Srinivasan Nandakumar, Iqra Zahid, Yuping Wu, Warren Del-Pinto, Goran Nenadic, Siew-Kei Lam, Jie Zhang, Anil A Bharath
Main category: cs.LG
TL;DR: A benchmark for evaluating differentially private synthetic text data generation shows significant utility degradation under strict privacy constraints, highlighting limitations of current methods.
Details
Motivation: Privacy and regulatory barriers prevent using real-world data in high-stakes domains like healthcare and finance, creating need for differentially private synthetic data alternatives.
Method: Developed a unified benchmark to systematically evaluate utility and fidelity of text datasets generated under formal Differential Privacy guarantees, assessing state-of-the-art methods across five domain-specific datasets.
Result: Significant utility and fidelity degradation compared to real data, especially under strict privacy constraints, revealing limitations of current privacy-preserving generation approaches.
Conclusion: Current approaches have substantial limitations, highlighting the need for advanced privacy-preserving data sharing methods and establishing evaluation standards for realistic scenarios.
Abstract: Generative AI offers transformative potential for high-stakes domains such as healthcare and finance, yet privacy and regulatory barriers hinder the use of real-world data. To address this, differentially private synthetic data generation has emerged as a promising alternative. In this work, we introduce a unified benchmark to systematically evaluate the utility and fidelity of text datasets generated under formal Differential Privacy (DP) guarantees. Our benchmark addresses key challenges in domain-specific benchmarking, including choice of representative data and realistic privacy budgets, accounting for pre-training and a variety of evaluation metrics. We assess state-of-the-art privacy-preserving generation methods across five domain-specific datasets, revealing significant utility and fidelity degradation compared to real data, especially under strict privacy constraints. These findings underscore the limitations of current approaches, outline the need for advanced privacy-preserving data sharing methods and set a precedent regarding their evaluation in realistic scenarios.
[294] Structure-aware Hypergraph Transformer for Diagnosis Prediction in Electronic Health Records
Haiyan Wang, Ye Yuan
Main category: cs.LG
TL;DR: Proposes SHGT framework using hypergraph Transformer to capture higher-order dependencies in EHR data for better diagnosis prediction.
Details
Motivation: Existing GNN methods fail to capture higher-order dependencies in clinical data and have limited representation power due to pairwise relations and localized message-passing.
Method: Structure-aware HyperGraph Transformer (SHGT) with hypergraph structural encoder, Transformer architecture for global reasoning, and hypergraph reconstruction loss to preserve structure.
Result: Outperforms state-of-the-art models on real-world EHR datasets for diagnosis prediction.
Conclusion: SHGT effectively addresses limitations of existing GNN methods by capturing higher-order interactions and global dependencies in EHR data.
Abstract: Electronic Health Records (EHR) systematically organize patient health data through standardized medical codes, serving as a comprehensive and invaluable source for predictive modeling. Graph neural networks (GNNs) have demonstrated effectiveness in modeling interactions between medical codes within EHR. However, existing GNN-based methods are inadequate due to: a) their reliance on pairwise relations fails to capture the inherent higher-order dependencies in clinical data, and b) the localized message-passing scheme limits representation power. To address these issues, this paper proposes a novel Structure-aware HyperGraph Transformer (SHGT) framework following three-fold ideas: a) employing a hypergraph structural encoder to capture higher-order interactions among medical codes, b) integrating the Transformer architecture to reason over the entire hypergraph, and c) designing a tailored loss function incorporating hypergraph reconstruction to preserve the hypergraph’s original structure. Experiments on real-world EHR datasets demonstrate that the proposed SHGT outperforms existing state-of-the-art models on diagnosis prediction.
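For intuition, one common hypergraph convolution propagates node features through an incidence matrix in two hops (nodes to hyperedges, hyperedges back to nodes), capturing higher-order co-occurrence of medical codes. This generic step is a stand-in for SHGT's structural encoder, not its exact architecture.

```python
import torch

def hypergraph_conv(X, H, W):
    """X: (num_nodes, d) node features; H: (num_nodes, num_edges) binary
    incidence matrix; W: (d, d_out) weights. One nodes -> hyperedges ->
    nodes step, the higher-order analogue of pairwise message passing."""
    edge_deg = H.sum(dim=0).clamp(min=1)        # nodes per hyperedge
    node_deg = H.sum(dim=1).clamp(min=1)        # hyperedges per node
    edge_feat = (H.T @ X) / edge_deg[:, None]   # aggregate codes into edges
    node_feat = (H @ edge_feat) / node_deg[:, None]
    return torch.relu(node_feat @ W)
```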
[295] Khiops: An End-to-End, Frugal AutoML and XAI Machine Learning Solution for Large, Multi-Table Databases
Marc Boullé, Nicolas Voisine, Bruno Guerraz, Carine Hue, Felipe Olmos, Vladimir Popescu, Stéphane Gouache, Stéphane Bouget, Alexis Bondu, Luc Aurelien Gauthier, Yassine Nair Benrekia, Fabrice Clérot, Vincent Lemaire
Main category: cs.LG
TL;DR: Khiops is an open-source machine learning tool for large multi-table databases using Bayesian methods with variable selection, discretization, and automatic aggregation for efficient analysis of massive datasets.
Details
Motivation: To provide an efficient machine learning solution for mining large multi-table databases with millions of records and tens of thousands of variables, addressing the challenges of scalability and complex data structures.
Method: Uses a Bayesian approach with naive Bayesian classifier incorporating variable selection and weight learning. Handles numerical data through discretization models and categorical data through value clustering. For multi-table databases, automatically constructs aggregates for propositionalisation.
Result: Developed a scalable tool capable of handling databases with millions of individuals, tens of thousands of variables, and hundreds of millions of records in secondary tables. Has attracted academic interest with over 20 publications.
Conclusion: Khiops provides an effective open-source solution for large-scale multi-table database mining, offering predictive variable importance measures and efficient processing capabilities across various environments including Python library and user interface.
Abstract: Khiops is an open-source machine learning tool designed for mining large multi-table databases. Khiops is based on a unique Bayesian approach that has attracted academic interest, with more than 20 publications on topics such as variable selection, classification, decision trees, and co-clustering. It provides a predictive measure of variable importance using discretisation models for numerical data and value clustering for categorical data. The proposed classification/regression model is a naive Bayesian classifier incorporating variable selection and weight learning. In the case of multi-table databases, it provides propositionalisation by automatically constructing aggregates. Khiops is adapted to the analysis of large databases with millions of individuals, tens of thousands of variables, and hundreds of millions of records in secondary tables. It is available in many environments, both as a Python library and via a user interface.
[296] MedGR$^2$: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning
Weihai Zhi, Jiayan Guo, Shangyang Li
Main category: cs.LG
TL;DR: MedGR^2 is a novel framework that creates a self-improving cycle to generate high-quality medical data, enabling better supervised fine-tuning and reinforcement learning for medical AI with superior generalization across modalities and tasks.
Details
Motivation: Vision-Language Models in medicine face critical limitations due to scarcity of expert-annotated data, poor generalization from supervised fine-tuning, and lack of reliable reward signals for reinforcement learning in data-scarce medical domains.
Method: MedGR^2 co-develops a data generator and reward model to create a self-improving virtuous cycle that automatically produces high-quality multi-modal medical data. This data is used for both supervised fine-tuning and reinforcement learning via Group Relative Policy Optimization (GRPO).
Result: SFT with MedGR^2-produced data surpasses baselines trained on large human-curated datasets. RL with this data achieves state-of-the-art cross-modality and cross-task generalization, outperforming specialized RL methods. Compact models achieve performance competitive with foundation models 10x larger.
Conclusion: MedGR^2 transforms the problem from data scarcity to data generation, presenting a new paradigm for data-efficient learning in high-stakes medical domains and unlocking RL’s full potential for building truly generalizable medical AI.
Abstract: The application of Vision-Language Models (VLMs) in medicine is critically hampered by the scarcity of high-quality, expert-annotated data. Supervised Fine-Tuning (SFT) on existing datasets often leads to poor generalization on unseen modalities and tasks, while Reinforcement Learning (RL), a promising alternative, is stymied by the lack of reliable reward signals in this data-scarce domain. To break this impasse, we introduce Generative Reward Learning for Medical Reasoning (MedGR$^2$), a novel framework that creates a self-improving virtuous cycle. MedGR$^2$ co-develops a data generator and a reward model, enabling the automated, continuous creation of high-quality, multi-modal medical data that serves as both a superior training source for SFT and RL. Our experiments demonstrate that SFT with MedGR$^2$-produced data already surpasses baselines trained on large-scale, human-curated datasets. Crucially, when leveraging this data for RL via Group Relative Policy Optimization (GRPO), our model achieves state-of-the-art cross-modality and cross-task generalization, significantly outperforming specialized RL-based methods. Furthermore, our compact model, empowered by MedGR$^2$, achieves performance competitive with foundation models possessing over 10 times more parameters. MedGR$^2$ presents a new paradigm for data-efficient learning in high-stakes domains, transforming the problem from data scarcity to data generation and unlocking the full potential of RL for building truly generalizable medical AI.
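The RL component uses GRPO, whose defining step is a group-relative advantage: each sampled response's reward is normalized against the other responses to the same prompt, removing the need for a learned value baseline. A minimal sketch of that step:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (num_prompts, group_size) rewards for G sampled responses
    per prompt. Each reward is normalized within its own group, so no
    separate value network is needed to form a baseline."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

# Usage: advantages weight the policy-gradient term for each response.
```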
[297] Theoretical foundations of the integral indicator application in hyperparametric optimization
Roman S. Kulshin, Anatoly A. Sidorov
Main category: cs.LG
TL;DR: Hyperparametric optimization using multi-criteria integral assessment for recommendation algorithms
Details
Motivation: Traditional single-metric optimization fails to balance multiple performance aspects like accuracy, ranking quality, diversity, and resource efficiency in recommendation systems.
Method: Developed an integral assessment approach that combines various performance indicators into a single consolidated criterion for hyperparameter optimization
Result: Achieves better balance between competing objectives compared to single-metric optimization approaches
Conclusion: The multi-criteria optimization tool has universal applicability beyond recommendation systems to various machine learning and data analysis tasks
Abstract: The article discusses the concept of hyperparametric optimization of recommendation algorithms using an integral assessment that combines various performance indicators into a single consolidated criterion. This approach contrasts with traditional tuning against a single metric and makes it possible to balance accuracy, ranking quality, output diversity, and the resource intensity of the algorithms. The theoretical significance of the research lies in the development of a universal multi-criteria optimization tool that is applicable not only to recommendation systems but also to a wide range of machine learning and data analysis tasks.
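A weighted additive convolution of normalized metrics is one standard way to realize such an integral indicator; the concrete metrics, weights, and aggregation form below are illustrative assumptions, not the paper's definition.

```python
def integral_indicator(metrics: dict, weights: dict,
                       higher_is_better: dict) -> float:
    """Collapse heterogeneous metrics, each pre-normalized to [0, 1]
    (e.g. accuracy, ranking quality, diversity, resource cost), into a
    single consolidated criterion for hyperparameter search."""
    score = 0.0
    for name, value in metrics.items():
        v = value if higher_is_better[name] else 1.0 - value
        score += weights[name] * v
    return score

# Hypothetical usage inside a hyperparameter search loop:
# score = integral_indicator(
#     {"ndcg": 0.62, "coverage": 0.45, "latency": 0.30},
#     {"ndcg": 0.5, "coverage": 0.3, "latency": 0.2},
#     {"ndcg": True, "coverage": True, "latency": False})
```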
[298] MERIT: Maximum-normalized Element-wise Ratio for Language Model Large-batch Training
Yang Luo, Zangwei Zheng, Ziheng Qin, Zirui Zhu, Yong Liu, Yang You
Main category: cs.LG
TL;DR: MERIT optimizer uses max-norm and element-wise trust ratios to enable stable large-batch training of language models without performance degradation, achieving 6k batch size on GPT-2 Medium.
Details
Motivation: Existing optimizers like AdamW and LAMB suffer from performance degradation in large-batch training due to attention layer bottlenecks and ineffective trust ratio calculations that don't properly constrain max attention logits.
Method: Proposes MERIT optimizer that uses max-norm to calculate trust ratios to better constrain max attention logits, and constructs element-wise trust ratios that focus on local weight structures for more robust update scaling.
Result: Extensive experiments show MERIT enables 6k batch size training on GPT-2 Medium without performance degradation compared to standard 480 batch size, with 48B training tokens.
Conclusion: MERIT successfully improves training stability for large-batch training by considering max attention logit and finer-granularity trust ratios, paving the way for faster development of large language models.
Abstract: Large-batch training has become a cornerstone in accelerating the training of deep neural networks, yet it poses challenges in optimization and generalization. Existing optimizers like AdamW exhibit performance degradation during large-batch training of language models, due to the information bottleneck in attention layers caused by the sharp increase of the max attention logit. While the LAMB optimizer partially addresses this issue, some attention layers still face it. The reason is that $l_2$-norm-based trust ratios in LAMB are less effective in directly influencing the max value of query/key weights. Furthermore, the weight-wise trust ratio in LAMB is error-prone as it overlooks relationships among weight values within rows or columns. Building on these observations, we propose a novel optimizer, MERIT, which leverages the max-norm to calculate the trust ratio to constrain the max attention logit more effectively. Moreover, we further construct element-wise trust ratios to provide more robust update scaling by focusing on local weight structures. Extensive experiments of large-batch training across various sizes of GPT-2 models demonstrate the superior performance of MERIT. Notably, during the training of GPT-2 Medium, MERIT enables a 6k batch size without any performance degradation compared to the standard batch size (480) with 48B training tokens. This work highlights the importance of considering the max attention logit and finer-granularity trust ratios in large-batch training. It successfully improves training stability and paves the way for larger batch usage, enabling faster development and iteration of large language models. Code is available at https://github.com/NUS-HPC-AI-Lab/MERIT.
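A hedged sketch of the core idea: replace LAMB's $l_2$-norm trust ratio with a max-norm ratio, computed per row to approximate the finer-granularity, element-wise scaling. This is a simplified illustration, not the full MERIT algorithm (which pairs trust ratios with Adam-style moment estimates).

```python
import torch

def max_norm_trust_update(w, update, lr, eps=1e-6):
    """w, update: (rows, cols) weight matrix and its raw update step.
    Row-wise max-norm trust ratios bound how much any single row (and
    hence any query/key logit it produces) can move in one step."""
    w_norm = w.abs().amax(dim=-1, keepdim=True)        # row max-norms
    u_norm = update.abs().amax(dim=-1, keepdim=True)
    trust = w_norm / (u_norm + eps)
    return w - lr * trust * update
```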
[299] Unbiased Stochastic Optimization for Gaussian Processes on Finite Dimensional RKHS
Neta Shoham, Haim Avron
Main category: cs.LG
TL;DR: Proposes exact stochastic inference algorithms for Gaussian Processes with finite-dimensional RKHS kernels, achieving better performance than approximation methods when memory constraints limit batch size and inducing points.
Details
Motivation: Current stochastic hyperparameter learning methods for GPs rely on approximations that don't guarantee convergence to true marginal likelihood stationary points.
Method: Algorithms for exact stochastic inference of GPs with kernels inducing moderate finite-dimensional RKHS, extendable to infinite dimensions with some exactness trade-off.
Result: Achieves better experimental results than existing methods under memory constraints that limit batch size and number of inducing points.
Conclusion: Provides exact stochastic inference approach for GPs that outperforms approximation-based methods in memory-constrained scenarios.
Abstract: Current methods for stochastic hyperparameter learning in Gaussian Processes (GPs) rely on approximations, such as computing biased stochastic gradients or using inducing points in stochastic variational inference. However, when using such methods we are not guaranteed to converge to a stationary point of the true marginal likelihood. In this work, we propose algorithms for exact stochastic inference of GPs with kernels that induce a Reproducing Kernel Hilbert Space (RKHS) of moderate finite dimension. Our approach can also be extended to infinite dimensional RKHSs at the cost of forgoing exactness. Both for finite and infinite dimensional RKHSs, our method achieves better experimental results than existing methods when memory resources limit the feasible batch size and the possible number of inducing points.
[300] Local Virtual Nodes for Alleviating Over-Squashing in Graph Neural Networks
Tuğrul Hasan Karabulut, İnci M. Baytaş
Main category: cs.LG
TL;DR: Local Virtual Nodes (LVN) method addresses over-squashing in GNNs by adding trainable virtual nodes to bottleneck areas without disrupting global graph structure, improving performance on classification tasks.
Details
Motivation: Over-squashing in graph neural networks creates bottlenecks when gathering information from wide neighborhoods into fixed-size representations. Existing solutions like graph rewiring alter global topology and disrupt domain knowledge encoded in original graph structure.
Method: Proposes Local Virtual Nodes (LVN) with trainable embeddings placed based on node centrality to identify bottleneck regions. LVNs improve connectivity in central areas and facilitate communication between distant nodes without adding more layers.
Result: Extensive experiments on benchmark datasets show LVNs enhance structural connectivity and significantly improve performance on both graph and node classification tasks.
Conclusion: LVN approach effectively mitigates over-squashing effects while preserving the global structure and domain knowledge of the original input graph, providing a better alternative to graph rewiring methods.
Abstract: Over-squashing is a challenge in training graph neural networks for tasks involving long-range dependencies. In such tasks, a GNN’s receptive field should be large enough to enable communication between distant nodes. However, gathering information from a wide range of neighborhoods and squashing its content into fixed-size node representations makes message-passing vulnerable to bottlenecks. Graph rewiring and adding virtual nodes are commonly studied remedies that create additional pathways around bottlenecks to mitigate over-squashing. However, these techniques alter the input graph’s global topology and disrupt the domain knowledge encoded in the original graph structure, both of which could be essential to specific tasks and domains. This study presents Local Virtual Nodes (LVN) with trainable embeddings to alleviate the effects of over-squashing without significantly corrupting the global structure of the input graph. The position of the LVNs is determined by the node centrality, which indicates the existence of potential bottlenecks. Thus, the proposed approach aims to improve the connectivity in the regions with likely bottlenecks. Furthermore, trainable LVN embeddings shared across selected central regions facilitate communication between distant nodes without adding more layers. Extensive experiments on benchmark datasets demonstrate that LVNs can enhance structural connectivity and significantly improve performance on graph and node classification tasks. The code can be found at https://github.com/ALLab-Boun/LVN/.
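A minimal sketch of the construction, assuming degree is used as the centrality score (the paper may use a different centrality): each virtual node is wired bidirectionally to one high-centrality node and its immediate neighbourhood, adding shortcuts around likely bottlenecks without rewiring existing edges.

```python
import torch

def add_local_virtual_nodes(edge_index: torch.Tensor, num_nodes: int,
                            num_virtual: int) -> torch.Tensor:
    """edge_index: (2, E) COO edges. Attaches each new virtual node to a
    high-degree node and its neighbours; duplicates are harmless for a
    sketch. Returns the augmented edge index."""
    deg = torch.bincount(edge_index[0], minlength=num_nodes)
    centers = deg.topk(num_virtual).indices
    new_edges = []
    for i, c in enumerate(centers):
        vn = num_nodes + i                          # id of the virtual node
        neighbours = edge_index[1][edge_index[0] == c]
        region = torch.cat([c.view(1), neighbours])
        for n in region:                            # bidirectional links
            new_edges.append([vn, int(n)])
            new_edges.append([int(n), vn])
    extra = torch.tensor(new_edges).T
    return torch.cat([edge_index, extra], dim=1)
```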
[301] Dimension Agnostic Testing of Survey Data Credibility through the Lens of Regression
Debabrota Basu, Sourav Chakraborty, Debarshi Chanda, Buddha Dev Das, Arijit Ghosh, Arnab Ray
Main category: cs.LG
TL;DR: Proposes a task-based approach to verify survey credibility using model-specific distance metrics with dimension-independent sample complexity, avoiding full distribution estimation.
Details
Motivation: Traditional methods for assessing survey representativeness require exponential samples in high dimensions, but downstream conclusions may remain valid across different distributions if the analysis model is considered.
Method: Introduces a model-specific distance metric and designs an algorithm to verify survey credibility specifically for regression models, focusing on task-based assessment rather than full distribution reconstruction.
Result: The algorithm achieves dimension-independent sample complexity, significantly more efficient than approaches that reconstruct the regression model (which scale linearly with dimensionality). Theoretical correctness is proven and numerical performance is demonstrated.
Conclusion: Task-based credibility assessment provides an efficient alternative to traditional distribution distance estimation, enabling practical verification of survey representativeness for specific analytical models without the curse of dimensionality.
Abstract: Assessing whether a sample survey credibly represents the population is a critical question for ensuring the validity of downstream research. Generally, this problem reduces to estimating the distance between two high-dimensional distributions, which typically requires a number of samples that grows exponentially with the dimension. However, depending on the model used for data analysis, the conclusions drawn from the data may remain consistent across different underlying distributions. In this context, we propose a task-based approach to assess the credibility of sampled surveys. Specifically, we introduce a model-specific distance metric to quantify this notion of credibility. We also design an algorithm to verify the credibility of survey data in the context of regression models. Notably, the sample complexity of our algorithm is independent of the data dimension. This efficiency stems from the fact that the algorithm focuses on verifying the credibility of the survey data rather than reconstructing the underlying regression model. Furthermore, we show that if one attempts to verify credibility by reconstructing the regression model, the sample complexity scales linearly with the dimensionality of the data. We prove the theoretical correctness of our algorithm and numerically demonstrate our algorithm’s performance.
[302] Supervised Stochastic Gradient Algorithms for Multi-Trial Source Separation
Ronak Mehta, Mateus Piovezan Otto, Noah Stanis, Azadeh Yazdan-Shahmorad, Zaid Harchaoui
Main category: cs.LG
TL;DR: A stochastic algorithm for supervised independent component analysis that combines proximal gradient optimization with joint prediction model learning through backpropagation.
Details
Motivation: Many scientific contexts provide multi-trial supervision data that can be leveraged to improve independent component analysis, but existing methods don't effectively incorporate this additional supervision.
Method: Proximal gradient-type algorithm in the space of invertible matrices combined with joint learning of a prediction model through backpropagation to incorporate multi-trial supervision.
Result: Increased success rate of non-convex optimization and improved interpretability of independent components due to the additional supervision.
Conclusion: The proposed supervised approach enhances both optimization performance and interpretability in independent component analysis by effectively leveraging available multi-trial supervision data.
Abstract: We develop a stochastic algorithm for independent component analysis that incorporates multi-trial supervision, which is available in many scientific contexts. The method blends a proximal gradient-type algorithm in the space of invertible matrices with joint learning of a prediction model through backpropagation. We illustrate the proposed algorithm on synthetic and real data experiments. In particular, owing to the additional supervision, we observe an increased success rate of the non-convex optimization and the improved interpretability of the independent components.
[303] Masked Autoencoders for Ultrasound Signals: Robust Representation Learning for Downstream Applications
Immanuel Roßteutscher, Klaus S. Drese, Thorsten Uphues
Main category: cs.LG
TL;DR: MAE with ViT adapted for 1D ultrasound signals, showing superior performance over CNNs and scratch training through self-supervised pre-training on synthetic data.
Details
Motivation: Ultrasound signals are crucial for industrial NDT and SHM applications but suffer from scarce labeled data and task-specific processing needs. MAEs have shown success in other domains but remain unexplored for 1D ultrasound analysis.
Method: Proposed MAE approach with ViT architectures for self-supervised pre-training on unlabeled synthetic ultrasound signals. Systematically studied impact of model size, patch size, and masking ratio on pre-training efficiency and downstream task performance.
Result: Pre-trained models significantly outperformed models trained from scratch and optimized CNN baselines. Pre-training on synthetic data showed superior transferability to real-world signals compared to training on limited real datasets.
Conclusion: Demonstrates MAEs’ potential for advancing ultrasound signal analysis through scalable, self-supervised learning, enabling robust representation learning that enhances downstream task performance.
Abstract: We investigated the adaptation and performance of Masked Autoencoders (MAEs) with Vision Transformer (ViT) architectures for self-supervised representation learning on one-dimensional (1D) ultrasound signals. Although MAEs have demonstrated significant success in computer vision and other domains, their use for 1D signal analysis, especially for raw ultrasound data, remains largely unexplored. Ultrasound signals are vital in industrial applications such as non-destructive testing (NDT) and structural health monitoring (SHM), where labeled data are often scarce and signal processing is highly task-specific. We propose an approach that leverages MAE to pre-train on unlabeled synthetic ultrasound signals, enabling the model to learn robust representations that enhance performance in downstream tasks, such as time-of-flight (ToF) classification. This study systematically investigated the impact of model size, patch size, and masking ratio on pre-training efficiency and downstream accuracy. Our results show that pre-trained models significantly outperform models trained from scratch and strong convolutional neural network (CNN) baselines optimized for the downstream task. Additionally, pre-training on synthetic data demonstrates superior transferability to real-world measured signals compared with training solely on limited real datasets. This study underscores the potential of MAEs for advancing ultrasound signal analysis through scalable, self-supervised learning.
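The 1D adaptation is mostly about patching raw traces. Below is a sketch of the masking step, assuming non-overlapping patches and per-sample random masking as in the original MAE recipe; patch size and mask ratio were among the swept hyperparameters in the study.

```python
import torch

def patchify_and_mask(signal: torch.Tensor, patch_size: int,
                      mask_ratio: float = 0.75):
    """signal: (batch, length) raw 1D traces, length divisible by
    patch_size. Splits into non-overlapping patches and keeps a random
    subset; the encoder sees only visible patches and the decoder
    reconstructs the masked ones."""
    B, L = signal.shape
    patches = signal.reshape(B, L // patch_size, patch_size)
    num_patches = patches.shape[1]
    num_keep = int(num_patches * (1 - mask_ratio))
    noise = torch.rand(B, num_patches)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]   # random subset per sample
    visible = torch.gather(
        patches, 1, keep_idx[:, :, None].expand(-1, -1, patch_size))
    return visible, keep_idx
```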
[304] GDS Agent: A Graph Algorithmic Reasoning Agent
Borun Shi, Ioannis Panagiotas
Main category: cs.LG
TL;DR: GDS agent introduces graph algorithms as tools for LLMs to process and reason over large-scale graph data through a model context protocol server.
Details
Motivation: LLMs struggle with processing and reasoning over large-scale graph-structure data despite having multimodal capabilities and tool integration.
Method: Developed a comprehensive set of graph algorithms as tools with preprocessing and postprocessing capabilities in a model context protocol (MCP) server that works with any modern LLM.
Result: The GDS agent can solve a wide spectrum of graph tasks, providing accurate and grounded answers to questions requiring graph algorithmic reasoning.
Conclusion: The approach successfully enables LLMs to handle graph data challenges, though some scenarios remain difficult, and future roadmap addresses remaining challenges.
Abstract: Large language models (LLMs) have shown remarkable multimodal information processing and reasoning ability. When equipped with tools through function calling and enhanced with retrieval-augmented techniques, compound LLM-based systems can access closed data sources and answer questions about them. However, they still struggle to process and reason over large-scale graph-structure data. We introduce the GDS (Graph Data Science) agent in this technical report. The GDS agent introduces a comprehensive set of graph algorithms as tools, together with preprocessing (retrieval) and postprocessing of algorithm results, in a model context protocol (MCP) server. The server can be used with any modern LLM out-of-the-box. GDS agent allows users to ask any question that implicitly and intrinsically requires graph algorithmic reasoning about their data, and quickly obtain accurate and grounded answers. We also introduce a new benchmark that evaluates intermediate tool calls as well as final responses. The results indicate that GDS agent is able to solve a wide spectrum of graph tasks. We also provide detailed case studies for more open-ended tasks and study scenarios where the agent struggles. Finally, we discuss the remaining challenges and the future roadmap.
[305] A Hybrid Stochastic Gradient Tracking Method for Distributed Online Optimization Over Time-Varying Directed Networks
Xinli Shi, Xingxing Yuan, Longkang Zhu, Guanghui Wen
Main category: cs.LG
TL;DR: TV-HSGT is a novel distributed online optimization algorithm that combines hybrid stochastic gradient tracking with variance reduction for time-varying directed networks, eliminating the need for Perron vector estimation while achieving improved dynamic regret bounds.
Details
Motivation: Existing distributed optimization algorithms rely on bounded gradient assumptions and overlook stochastic gradient impacts in time-varying directed networks, creating limitations for real-time decision-making in dynamic environments.
Method: Proposes TV-HSGT algorithm that integrates row-stochastic and column-stochastic communication schemes over time-varying digraphs, combining current and recursive stochastic gradients with variance reduction mechanisms to track global descent directions accurately.
Result: Theoretical analysis shows TV-HSGT achieves improved bounds on dynamic regret without assuming gradient boundedness. Experimental results on logistic regression tasks confirm effectiveness in dynamic and resource-constrained environments.
Conclusion: TV-HSGT provides an effective solution for distributed online optimization in time-varying directed networks, addressing limitations of existing methods by eliminating Perron vector estimation requirements and handling stochastic gradients effectively.
Abstract: With the increasing scale and dynamics of data, distributed online optimization has become essential for real-time decision-making in various applications. However, existing algorithms often rely on bounded gradient assumptions and overlook the impact of stochastic gradients, especially in time-varying directed networks. This study proposes a novel Time-Varying Hybrid Stochastic Gradient Tracking algorithm named TV-HSGT, based on hybrid stochastic gradient tracking and variance reduction mechanisms. Specifically, TV-HSGT integrates row-stochastic and column-stochastic communication schemes over time-varying digraphs, eliminating the need for Perron vector estimation or out-degree information. By combining current and recursive stochastic gradients, it effectively reduces gradient variance while accurately tracking global descent directions. Theoretical analysis demonstrates that TV-HSGT can achieve improved bounds on dynamic regret without assuming gradient boundedness. Experimental results on logistic regression tasks confirm the effectiveness of TV-HSGT in dynamic and resource-constrained environments.
[306] VarDiU: A Variational Diffusive Upper Bound for One-Step Diffusion Distillation
Leyang Wang, Mingtian Zhang, Zijing Ou, David Barber
Main category: cs.LG
TL;DR: VarDiU proposes a variational diffusive upper bound with unbiased gradient estimation for diffusion distillation, achieving better quality and more stable training than Diff-Instruct.
Details
Motivation: Existing diffusion distillation methods use biased gradient estimates from imperfect denoising score matching, leading to sub-optimal performance in compressing thousand-step teachers to one-step students.
Method: Develops VarDiU - a Variational Diffusive Upper Bound that provides an unbiased gradient estimator for diffusion distillation, avoiding the bias from denoising score matching.
Result: Achieves higher generation quality compared to Diff-Instruct and enables more efficient and stable training for one-step diffusion distillation.
Conclusion: VarDiU’s unbiased gradient estimation approach significantly improves diffusion distillation performance, making it a superior method for compressing complex diffusion models into efficient one-step generators.
Abstract: Recently, diffusion distillation methods have compressed thousand-step teacher diffusion models into one-step student generators while preserving sample quality. Most existing approaches train the student model using a diffusive divergence whose gradient is approximated via the student’s score function, learned through denoising score matching (DSM). Since DSM training is imperfect, the resulting gradient estimate is inevitably biased, leading to sub-optimal performance. In this paper, we propose VarDiU (pronounced /va:rdju:/), a Variational Diffusive Upper Bound that admits an unbiased gradient estimator and can be directly applied to diffusion distillation. Using this objective, we compare our method with Diff-Instruct and demonstrate that it achieves higher generation quality and enables a more efficient and stable training procedure for one-step diffusion distillation.
[307] Physics-Constrained Machine Learning for Chemical Engineering
Angan Mukherjee, Victor M. Zavala
Main category: cs.LG
TL;DR: PCML combines physics and ML for better reliability and interpretability, but faces challenges in chemical engineering applications including knowledge embedding strategies, scaling, and uncertainty quantification.
Details
Motivation: To improve the reliability, generalizability, and interpretability of machine learning models in scientific and engineering domains by incorporating physical constraints and knowledge.
Method: Combines physical models with data-driven machine learning approaches through various fusion strategies, though specific methods for effective integration remain a challenge.
Result: PCML has shown significant benefits across diverse scientific domains, but technical challenges hinder its applicability in complex chemical engineering applications.
Conclusion: While promising, PCML faces key challenges in chemical engineering including determining optimal physical knowledge embedding, scaling to large datasets, and uncertainty quantification - particularly for closed-loop design, real-time control, and multi-scale phenomena.
Abstract: Physics-constrained machine learning (PCML) combines physical models with data-driven approaches to improve reliability, generalizability, and interpretability. Although PCML has shown significant benefits in diverse scientific and engineering domains, technical and intellectual challenges hinder its applicability in complex chemical engineering applications. Key difficulties include determining the amount and type of physical knowledge to embed, designing effective fusion strategies with ML, scaling models to large datasets and simulators, and quantifying predictive uncertainty. This perspective summarizes recent developments and highlights challenges and opportunities in applying PCML to chemical engineering, emphasizing closed-loop experimental design, real-time dynamics and control, and the handling of multi-scale phenomena.
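The simplest PCML fusion strategy, penalizing a PDE residual alongside the data fit, is sketched below on a hypothetical 1D steady-state diffusion problem (u'' = 0); the PDE, collocation scheme, and weighting are illustrative assumptions.

```python
import torch

def physics_constrained_loss(model, x_data, y_data, x_colloc, lam=1.0):
    """model: differentiable map x -> u(x); x_* are (N, 1) tensors.
    Combines a data-fit term with a PDE-residual term evaluated at
    collocation points, the classic 'soft constraint' PCML fusion."""
    data_loss = ((model(x_data) - y_data) ** 2).mean()
    x = x_colloc.clone().requires_grad_(True)
    u = model(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    return data_loss + lam * (d2u ** 2).mean()   # residual of u'' = 0
```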
[308] Self-Composing Neural Operators with Depth and Accuracy Scaling via Adaptive Train-and-Unroll Approach
Juncai He, Xinliang Liu, Jinchao Xu
Main category: cs.LG
TL;DR: A novel neural operator framework using self-composition and adaptive training that achieves SOTA performance on scientific ML tasks, including challenging ultrasound tomography problems.
Details
Motivation: To enhance efficiency and accuracy of neural operators by drawing inspiration from iterative methods in numerical PDE solving, providing both theoretical guarantees and practical computational benefits.
Method: Design a neural operator by repeatedly applying a single block (self-composition), progressively deepening the model without adding new blocks. Use adaptive train-and-unroll approach where depth is gradually increased during training.
Result: Achieves state-of-the-art performance on standard benchmarks and demonstrates superior performance on high-frequency ultrasound computed tomography with multigrid-inspired backbone. Shows accuracy scaling law with model depth and significant computational savings.
Conclusion: The framework provides computationally tractable, accurate, and scalable solution for large-scale data-driven scientific machine learning applications, particularly effective for complex wave phenomena resolution.
Abstract: In this work, we propose a novel framework to enhance the efficiency and accuracy of neural operators through self-composition, offering both theoretical guarantees and practical benefits. Inspired by iterative methods for solving numerical partial differential equations (PDEs), we design a specific neural operator by repeatedly applying a single neural operator block; in this way we progressively deepen the model without explicitly adding new blocks, improving its capacity. To train these models efficiently, we introduce an adaptive train-and-unroll approach, where the depth of the neural operator is gradually increased during training. This approach reveals an accuracy scaling law with model depth and offers significant computational savings through our adaptive training strategy. Our architecture achieves state-of-the-art (SOTA) performance on standard benchmarks. We further demonstrate its efficacy on a challenging high-frequency ultrasound computed tomography (USCT) problem, where a multigrid-inspired backbone enables superior performance in resolving complex wave phenomena. The proposed framework provides a computationally tractable, accurate, and scalable solution for large-scale data-driven scientific machine learning applications.
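A compact sketch of the self-composition and train-and-unroll ideas: one shared residual block applied `depth` times, with `depth` raised between training phases so capacity grows without new parameters. The block design here is an arbitrary placeholder, not the paper's multigrid-inspired backbone.

```python
import torch
import torch.nn as nn

class SelfComposedOperator(nn.Module):
    """One shared block applied `depth` times; during training, `depth`
    can be increased gradually (train-and-unroll) with no new weights."""
    def __init__(self, width: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(width, width), nn.GELU(), nn.Linear(width, width))
        self.depth = 1

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        for _ in range(self.depth):
            u = u + self.block(u)   # residual self-composition
        return u

# Adaptive train-and-unroll (sketch): train at depth k, then set
# model.depth = k + 1 and continue training from the same weights.
```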
[309] Compositionality in Time Series: A Proof of Concept using Symbolic Dynamics and Compositional Data Augmentation
Michael Hagmann, Michael Staniek, Stefan Riezler
Main category: cs.LG
TL;DR: This paper investigates whether clinical time series data can be modeled as sequences of systematic latent physiological states, and uses this compositional structure to generate synthetic data that performs comparably to original data in forecasting tasks.
Details
Motivation: To address the problem of sparse and low-resource data in clinical time series forecasting by uncovering the underlying compositional structure of physiological states, enabling synthetic data generation that maintains the meaningful patterns of original clinical measurements.
Method: Conceptualizes compositionality for time series as a data generation property, develops data-driven procedures to reconstruct elementary states and composition rules, and evaluates using empirical tests based on domain adaptation perspective comparing expected risk of forecasting models trained on original vs synthesized data.
Result: Experimental results show that training on compositionally synthesized data achieves comparable performance to training on original clinical data, and evaluation on synthesized test data shows similar results to original test data, outperforming randomization-based augmentation. Significant performance gains were observed in SOFA score prediction when training entirely on synthesized data.
Conclusion: Clinical time series can be effectively modeled as sequences of systematic latent physiological states, and this compositional approach enables high-quality synthetic data generation that alleviates data scarcity problems while maintaining or even improving forecasting performance in clinical applications.
Abstract: This work investigates whether time series of natural phenomena can be understood as being generated by sequences of latent states which are ordered in systematic and regular ways. We focus on clinical time series and ask whether clinical measurements can be interpreted as being generated by meaningful physiological states whose succession follows systematic principles. Uncovering the underlying compositional structure will allow us to create synthetic data to alleviate the notorious problem of sparse and low-resource data settings in clinical time series forecasting, and deepen our understanding of clinical data. We start by conceptualizing compositionality for time series as a property of the data generation process, and then study data-driven procedures that can reconstruct the elementary states and composition rules of this process. We evaluate the success of these methods using two empirical tests originating from a domain adaptation perspective. Both tests infer the similarity of the original time series distribution and the synthetic time series distribution from the similarity of expected risk of time series forecasting models trained and tested on original and synthesized data in specific ways. Our experimental results show that the test set performance achieved by training on compositionally synthesized data is comparable to training on original clinical time series data, and that evaluation of models on compositionally synthesized test data shows similar results to evaluating on original test data, outperforming randomization-based data augmentation. An additional downstream evaluation of the prediction task of sequential organ failure assessment (SOFA) scores shows significant performance gains when model training is entirely based on compositionally synthesized data compared to training on original data.
[310] Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning
Weitao Feng, Lixu Wang, Tianyi Wei, Jie Zhang, Chongyang Gao, Sinong Zhan, Peizhuo Lv, Wei Dong
Main category: cs.LG
TL;DR: RL-based fine-tuning poses greater risks than SFT for harmful misuse of LLMs. TokenBuncher defense suppresses model uncertainty to counter RL attacks while preserving utility.
Details
Motivation: As LLMs grow more capable, the risks of harmful misuse through fine-tuning increase. Prior studies focused on SFT attacks, but RL-based attacks are more effective and pose greater systemic risks that need to be addressed.
Method: TokenBuncher defense suppresses model response uncertainty through entropy-as-reward RL and Token Noiser mechanism, preventing RL from exploiting distinct reward signals for harmful behaviors.
Result: Extensive experiments across multiple models and RL algorithms show TokenBuncher robustly mitigates harmful RL fine-tuning while preserving benign task utility and fine-tunability.
Conclusion: RL-based harmful fine-tuning is more dangerous than SFT, and TokenBuncher provides an effective, general defense against this emerging threat by targeting the fundamental uncertainty that RL exploits.
Abstract: As large language models (LLMs) continue to grow in capability, so do the risks of harmful misuse through fine-tuning. While most prior studies assume that attackers rely on supervised fine-tuning (SFT) for such misuse, we systematically demonstrate that reinforcement learning (RL) enables adversaries to more effectively break safety alignment and facilitate advanced harmful task assistance, under matched computational budgets. To counter this emerging threat, we propose TokenBuncher, the first effective defense specifically targeting RL-based harmful fine-tuning. TokenBuncher suppresses the foundation on which RL relies: model response uncertainty. By constraining uncertainty, RL-based fine-tuning can no longer exploit distinct reward signals to drive the model toward harmful behaviors. We realize this defense through entropy-as-reward RL and a Token Noiser mechanism designed to prevent the escalation of expert-domain harmful capabilities. Extensive experiments across multiple models and RL algorithms show that TokenBuncher robustly mitigates harmful RL fine-tuning while preserving benign task utility and fine-tunability. Our results highlight that RL-based harmful fine-tuning poses a greater systemic risk than SFT, and that TokenBuncher provides an effective and general defense.
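The entropy-as-reward component can be sketched as follows, assuming the defense rewards low per-token predictive entropy so that subsequent RL fine-tuning sees a flattened, uninformative reward landscape; the sign convention and aggregation below are assumptions about the paper's reward design.

```python
import torch
import torch.nn.functional as F

def negative_entropy_reward(logits: torch.Tensor,
                            mask: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq, vocab); mask: (batch, seq) with 1 for valid
    tokens. Returns one scalar reward per sequence that is higher when
    the model's per-token predictive entropy is lower."""
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)           # (batch, seq)
    mean_entropy = (entropy * mask).sum(dim=1) / mask.sum(dim=1)
    return -mean_entropy                                  # higher = more certain
```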
[311] EEGDM: Learning EEG Representation with Latent Diffusion Model
Shaocong Wang, Tong Liu, Ming Li, Minjing Yu, Yong-Jin Liu
Main category: cs.LG
TL;DR: EEGDM is a novel self-supervised EEG representation learning method using latent diffusion models that generates EEG signals to learn robust representations, outperforming existing methods with limited training data across diverse tasks.
Details
Motivation: Existing EEG representation learning methods rely on simple masked reconstruction objectives that fail to fully capture the rich semantic information and complex patterns in EEG signals, especially when training data is limited.
Method: Proposes EEGDM - a self-supervised method based on latent diffusion model that uses EEG signal generation as objective. Includes an EEG encoder that distills signals and channel augmentations into compact representations to guide diffusion model for EEG generation.
Result: EEGDM (1) reconstructs high-quality EEG signals, (2) learns robust representations effectively, and (3) achieves competitive performance with modest pre-training data size across diverse downstream tasks.
Conclusion: EEGDM demonstrates strong generalizability and practical utility by leveraging diffusion models for EEG representation learning, offering a compact latent space suitable for both generative control and downstream applications.
Abstract: While electroencephalography (EEG) signal analysis using deep learning has shown great promise, existing approaches still face significant challenges in learning generalizable representations that perform well across diverse tasks, particularly when training data is limited. Current EEG representation learning methods including EEGPT and LaBraM typically rely on simple masked reconstruction objective, which may not fully capture the rich semantic information and complex patterns inherent in EEG signals. In this paper, we propose EEGDM, a novel self-supervised EEG representation learning method based on the latent diffusion model, which leverages EEG signal generation as a self-supervised objective, turning the diffusion model into a strong representation learner capable of capturing EEG semantics. EEGDM incorporates an EEG encoder that distills EEG signals and their channel augmentations into a compact representation, acting as conditional information to guide the diffusion model for generating EEG signals. This design endows EEGDM with a compact latent space, which not only offers ample control over the generative process but also can be leveraged for downstream tasks. Experimental results show that EEGDM (1) can reconstruct high-quality EEG signals, (2) effectively learns robust representations, and (3) achieves competitive performance with modest pre-training data size across diverse downstream tasks, underscoring its generalizability and practical utility.
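A minimal sketch of the conditional training step this design implies, with `encoder`, `denoiser`, and `scheduler` as assumed interfaces rather than the paper's actual code:

```python
import torch
import torch.nn.functional as F

def eegdm_train_step(encoder, denoiser, scheduler, eeg, num_steps=1000):
    """One self-supervised step: distill the EEG signal into a compact
    condition vector, then train the latent denoiser to predict the
    injected noise given (noisy input, timestep, condition). The same
    condition vector doubles as a downstream representation.
    """
    cond = encoder(eeg)                                    # compact EEG representation
    t = torch.randint(0, num_steps, (eeg.size(0),), device=eeg.device)
    noise = torch.randn_like(eeg)
    noisy = scheduler.add_noise(eeg, noise, t)             # forward diffusion (assumed API)
    pred = denoiser(noisy, t, cond)
    return F.mse_loss(pred, noise)                         # standard eps-prediction loss
```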
[312] Provable Benefits of In-Tool Learning for Large Language Models
Sam Houliston, Ambroise Odonnat, Charles Arnal, Vivien Cabannes
Main category: cs.LG
TL;DR: Tool-augmented language models with external retrieval outperform weight-based memorization for factual recall, as memorization is limited by parameter count while tool-use enables unbounded recall through efficient circuits.
Details
Motivation: To theoretically and empirically demonstrate the advantages of tool-augmented language models (with external retrieval) over traditional weight-based memorization for factual recall tasks.
Method: Theoretical analysis showing parameter count limits memorization, circuit construction proving tool-use enables unbounded recall, and controlled experiments comparing tool-using models vs memorizing models.
Result: Tool-using models consistently outperform memorizing models in factual recall. Teaching tool-use and general rules is more effective than finetuning facts into memory for pretrained LLMs.
Conclusion: Tool-augmented workflows are not just practical but provably more scalable than in-weight learning, providing both theoretical and empirical foundations for their superiority in factual recall tasks.
Abstract: Tool-augmented language models, equipped with retrieval, memory, or external APIs, are reshaping AI, yet their theoretical advantages remain underexplored. In this paper, we address this gap by demonstrating the benefits of in-tool learning (external retrieval) over in-weight learning (memorization) for factual recall. We show that the number of facts a model can memorize solely in its weights is fundamentally limited by its parameter count. In contrast, we prove that tool-use enables unbounded factual recall via a simple and efficient circuit construction. These results are validated in controlled experiments, where tool-using models consistently outperform memorizing ones. We further show that for pretrained large language models, teaching tool-use and general rules is more effective than finetuning facts into memory. Our work provides both a theoretical and empirical foundation, establishing why tool-augmented workflows are not just practical, but provably more scalable.
[313] Unleashing Uncertainty: Efficient Machine Unlearning for Generative AI
Christoforos N. Spartalis, Theodoros Semertzidis, Petros Daras, Efstratios Gavves
Main category: cs.LG
TL;DR: SAFEMax is a novel machine unlearning method for diffusion models that maximizes entropy in generated images to produce Gaussian noise for impermissible classes by halting denoising, with selective focus on early diffusion steps for balanced forgetting and retention.
Details
Motivation: To develop an efficient and effective machine unlearning method for diffusion models that can selectively forget impermissible classes while maintaining performance on retained classes, addressing the need for privacy and compliance in AI systems.
Method: Grounded in information-theoretic principles, SAFEMax maximizes entropy in generated images to cause the model to generate Gaussian noise when conditioned on impermissible classes by halting the denoising process. It selectively focuses on early diffusion steps where class-specific information is prominent to control the balance between forgetting and retention.
Result: The results demonstrate the effectiveness of SAFEMax and highlight its substantial efficiency gains over state-of-the-art methods in machine unlearning for diffusion models.
Conclusion: SAFEMax provides an effective and efficient solution for machine unlearning in diffusion models, offering controlled forgetting of impermissible classes while maintaining model performance on retained classes through information-theoretic principles and selective step focusing.
Abstract: We introduce SAFEMax, a novel method for Machine Unlearning in diffusion models. Grounded in information-theoretic principles, SAFEMax maximizes the entropy in generated images, causing the model to generate Gaussian noise when conditioned on impermissible classes by ultimately halting its denoising process. Also, our method controls the balance between forgetting and retention by selectively focusing on the early diffusion steps, where class-specific information is prominent. Our results demonstrate the effectiveness of SAFEMax and highlight its substantial efficiency gains over state-of-the-art methods.
[314] GPT-FT: An Efficient Automated Feature Transformation Using GPT for Sequence Reconstruction and Performance Enhancement
Yang Gao, Dongjie Wang, Scott Piersall, Ye Zhang, Liqiang Wang
Main category: cs.LG
TL;DR: A novel transformer-based framework for automated feature transformation that reduces computational costs while maintaining performance through multi-objective optimization with a revised GPT model.
Details
Motivation: Existing feature transformation methods rely on sequential encoder-decoder structures that cause high computational costs and parameter requirements, limiting scalability and efficiency.
Method: Four-step framework: transformation records collection, embedding space construction with revised GPT model, gradient-ascent search, and autoregressive reconstruction. The GPT model handles both sequence reconstruction and performance estimation.
Result: Experimental results show the framework matches or exceeds baseline performance with significant gains in computational efficiency on benchmark datasets.
Conclusion: Transformer-based architectures show strong potential for scalable, high-performance automated feature transformation with reduced parameter requirements.
Abstract: Feature transformation plays a critical role in enhancing machine learning model performance by optimizing data representations. Recent state-of-the-art approaches address this task as a continuous embedding optimization problem, converting discrete search into a learnable process. Although effective, these methods often rely on sequential encoder-decoder structures that cause high computational costs and parameter requirements, limiting scalability and efficiency. To address these limitations, we propose a novel framework that accomplishes automated feature transformation through four steps: transformation records collection, embedding space construction with a revised Generative Pre-trained Transformer (GPT) model, gradient-ascent search, and autoregressive reconstruction. In our approach, the revised GPT model serves two primary functions: (a) feature transformation sequence reconstruction and (b) model performance estimation and enhancement for downstream tasks by constructing the embedding space. Such a multi-objective optimization framework reduces parameter size and accelerates transformation processes. Experimental results on benchmark datasets show that the proposed framework matches or exceeds baseline performance, with significant gains in computational efficiency. This work highlights the potential of transformer-based architectures for scalable, high-performance automated feature transformation.
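The gradient-ascent search stage is straightforward to sketch. Assuming the revised GPT exposes a differentiable performance head over sequence embeddings (an illustration, not the released implementation):

```python
import torch

def search_embedding(z_init, performance_head, steps=100, lr=0.05):
    """Climb the estimated-performance surface in the learned embedding
    space; the optimized embedding is then decoded autoregressively
    back into a feature-transformation sequence.
    `performance_head` maps an embedding to a scalar score (assumed).
    """
    z = z_init.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = -performance_head(z).sum()   # ascend = descend the negation
        loss.backward()
        optimizer.step()
    return z.detach()
```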
[315] ATM-GAD: Adaptive Temporal Motif Graph Anomaly Detection for Financial Transaction Networks
Zeyue Zhang, Lin Song, Erkang Bao, Xiaoling Lv, Xinyue Wang
Main category: cs.LG
TL;DR: ATM-GAD is a graph neural network that uses temporal motifs and adaptive time windows to detect financial fraud more effectively than previous methods.
Details
Motivation: Conventional ML models and existing graph-based detectors fail to capture two key temporal fraud patterns: recurring suspicious subgraphs (temporal motifs) and account-specific anomalous activity intervals.
Method: Uses Temporal Motif Extractor to condense transaction histories, dual-attention blocks (IntraA for within-motif interactions, InterA for cross-motif aggregation), and differentiable Adaptive Time-Window Learner for node-specific observation periods.
Result: Outperforms seven strong anomaly-detection baselines across four real-world datasets, successfully uncovering fraud patterns missed by earlier methods.
Conclusion: ATM-GAD effectively addresses temporal fraud patterns through motif-based analysis and adaptive time windows, demonstrating superior performance in financial anomaly detection.
Abstract: Financial fraud detection is essential to safeguard billions of dollars, yet the intertwined entities and fast-changing transaction behaviors in modern financial systems routinely defeat conventional machine learning models. Recent graph-based detectors make headway by representing transactions as networks, but they still overlook two fraud hallmarks rooted in time: (1) temporal motifs–recurring, telltale subgraphs that reveal suspicious money flows as they unfold–and (2) account-specific intervals of anomalous activity, when fraud surfaces only in short bursts unique to each entity. To exploit both signals, we introduce ATM-GAD, an adaptive graph neural network that leverages temporal motifs for financial anomaly detection. A Temporal Motif Extractor condenses each account’s transaction history into the most informative motifs, preserving both topology and temporal patterns. These motifs are then analyzed by dual-attention blocks: IntraA reasons over interactions within a single motif, while InterA aggregates evidence across motifs to expose multi-step fraud schemes. In parallel, a differentiable Adaptive Time-Window Learner tailors the observation window for every node, allowing the model to focus precisely on the most revealing time slices. Experiments on four real-world datasets show that ATM-GAD consistently outperforms seven strong anomaly-detection baselines, uncovering fraud patterns missed by earlier methods.
[316] Practical Physical Layer Authentication for Mobile Scenarios Using a Synthetic Dataset Enhanced Deep Learning Approach
Yijia Guo, Junqing Zhang, Y. -W. Peter Hong
Main category: cs.LG
TL;DR: Deep learning-based physical layer authentication using CSI for mobile IoT scenarios, achieving improved performance over existing methods through CNN-based Siamese network and synthetic dataset generation.
Details
Motivation: The broadcast nature of wireless transmissions makes IoT devices vulnerable to authentication attacks. Physical layer authentication using channel characteristics shows promise, but practical solutions for dynamic channel variations are lacking.
Method: Proposed a CNN-based Siamese network to learn temporal and spatial correlations between CSI pairs. Used synthetic training dataset generation based on WLAN TGn channel model, autocorrelation, and distance correlation to reduce manual data collection overhead.
Result: The approach demonstrated excellent generalization and authentication performance. Improved AUC by 0.03 compared to FCN-based Siamese model and by 0.06 compared to correlation-based benchmark algorithm in both simulation and experimental evaluation.
Conclusion: The deep learning-based physical layer authentication scheme effectively addresses mobile IoT authentication challenges, showing superior performance and practical applicability through comprehensive simulation and experimental validation.
Abstract: The Internet of Things (IoT) is ubiquitous thanks to the rapid development of wireless technologies. However, the broadcast nature of wireless transmissions results in great vulnerability to device authentication. Physical layer authentication emerges as a promising approach by exploiting the unique channel characteristics. However, a practical scheme applicable to dynamic channel variations is still missing. In this paper, we proposed a deep learning-based physical layer channel state information (CSI) authentication for mobile scenarios and carried out comprehensive simulation and experimental evaluation using IEEE 802.11n. Specifically, a synthetic training dataset was generated based on the WLAN TGn channel model and the autocorrelation and the distance correlation of the channel, which can significantly reduce the overhead of manually collecting experimental datasets. A convolutional neural network (CNN)-based Siamese network was exploited to learn the temporal and spatial correlation between the CSI pair and output a score to measure their similarity. We adopted a synergistic methodology involving both simulation and experimental evaluation. The experimental testbed consisted of WiFi IoT development kits and a few typical scenarios were specifically considered. Both simulation and experimental evaluation demonstrated excellent generalization performance of our proposed deep learning-based approach and excellent authentication performance. Demonstrated by our practical measurement results, our proposed scheme improved the area under the curve (AUC) by 0.03 compared to the fully connected network-based (FCN-based) Siamese model and by 0.06 compared to the correlation-based benchmark algorithm.
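A minimal sketch of the CNN-based Siamese scorer described above; the layer sizes and absolute-difference fusion are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SiameseCSI(nn.Module):
    """Score the similarity of a CSI pair; inputs are shaped
    (batch, 1, subcarriers, snapshots)."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, csi_a: torch.Tensor, csi_b: torch.Tensor) -> torch.Tensor:
        za, zb = self.encoder(csi_a), self.encoder(csi_b)
        # Fuse the two branch embeddings and map to a score in (0, 1).
        return torch.sigmoid(self.head(torch.abs(za - zb))).squeeze(-1)
```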
[317] LeMat-Traj: A Scalable and Unified Dataset of Materials Trajectories for Atomistic Modeling
Ali Ramlaoui, Martin Siron, Inel Djafar, Joseph Musielewicz, Amandine Rossello, Victor Schmidt, Alexandre Duval
Main category: cs.LG
TL;DR: LeMat-Traj is a standardized dataset of 120M+ atomic configurations from major DFT repositories, with harmonized formats and quality filtering across multiple functionals, enabling better MLIP training.
Details
Motivation: Address fragmented availability and inconsistent formatting of quantum mechanical trajectory datasets from DFT, which are expensive to generate but difficult to combine due to format variations, metadata differences, and accessibility issues.
Method: Created LeMat-Traj by aggregating and curating data from large-scale repositories (Materials Project, Alexandria, OQMD), standardizing data representation, harmonizing results, and filtering for high-quality configurations across multiple DFT functionals (PBE, PBESol, SCAN, r2SCAN). Also developed LeMaterial-Fetcher library for reproducible data integration.
Result: Significantly lowers barrier for training transferable and accurate MLIPs. Fine-tuning models pre-trained on high-force data with LeMat-Traj achieves significant reduction in force prediction errors on relaxation tasks. Dataset spans both relaxed low-energy states and high-energy, high-force structures.
Conclusion: LeMat-Traj provides a standardized, high-quality dataset that enables better machine learning interatomic potential development. The accompanying open-source library ensures reproducibility and community-driven evolution of materials datasets.
Abstract: The development of accurate machine learning interatomic potentials (MLIPs) is limited by the fragmented availability and inconsistent formatting of quantum mechanical trajectory datasets derived from Density Functional Theory (DFT). These datasets are expensive to generate yet difficult to combine due to variations in format, metadata, and accessibility. To address this, we introduce LeMat-Traj, a curated dataset comprising over 120 million atomic configurations aggregated from large-scale repositories, including the Materials Project, Alexandria, and OQMD. LeMat-Traj standardizes data representation, harmonizes results and filters for high-quality configurations across widely used DFT functionals (PBE, PBESol, SCAN, r2SCAN). It significantly lowers the barrier for training transferable and accurate MLIPs. LeMat-Traj spans both relaxed low-energy states and high-energy, high-force structures, complementing molecular dynamics and active learning datasets. By fine-tuning models pre-trained on high-force data with LeMat-Traj, we achieve a significant reduction in force prediction errors on relaxation tasks. We also present LeMaterial-Fetcher, a modular and extensible open-source library developed for this work, designed to provide a reproducible framework for the community to easily incorporate new data sources and ensure the continued evolution of large-scale materials datasets. LeMat-Traj and LeMaterial-Fetcher are publicly available at https://huggingface.co/datasets/LeMaterial/LeMat-Traj and https://github.com/LeMaterial/lematerial-fetcher.
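Since the dataset is hosted on the Hugging Face Hub, it should be loadable with the `datasets` library; streaming keeps memory bounded given the 120M+ configurations (the column schema is whatever the dataset card specifies):

```python
from datasets import load_dataset

# Stream rather than download: the corpus holds 120M+ configurations.
ds = load_dataset("LeMaterial/LeMat-Traj", split="train", streaming=True)

for example in ds.take(3):
    print(sorted(example.keys()))  # inspect the schema before training
```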
[318] Turning Tabular Foundation Models into Graph Foundation Models
Dmitry Eremeev, Gleb Bazhenov, Oleg Platonov, Artem Babenko, Liudmila Prokhorenkova
Main category: cs.LG
TL;DR: G2T-FM is a graph foundation model that uses TabPFNv2 as backbone, achieving strong performance on graph tasks by augmenting node features with neighborhood aggregation and structural embeddings.
Details
Motivation: Graph foundation models struggle with diverse node feature types beyond text. The paper addresses this challenge by leveraging recent success in tabular foundation models to handle arbitrary feature types in graphs.
Method: Proposes G2T-FM which augments original node features with neighborhood feature aggregation, adds structural embeddings, and applies TabPFNv2 tabular foundation model to the constructed node representations.
Result: Achieves strong results in fully in-context regime, significantly outperforms publicly available GFMs and performs on par with well-tuned GNNs. After finetuning, surpasses well-tuned GNN baselines.
Conclusion: Demonstrates the potential of utilizing tabular foundation models for graph machine learning tasks, revealing a previously overlooked direction in the field.
Abstract: While foundation models have revolutionized such fields as natural language processing and computer vision, their application and potential within graph machine learning remain largely unexplored. One of the key challenges in designing graph foundation models (GFMs) is handling diverse node features that can vary across different graph datasets. Although many works on GFMs have been focused exclusively on text-attributed graphs, the problem of handling arbitrary features of other types in GFMs has not been fully addressed. However, this problem is not unique to the graph domain, as it also arises in the field of machine learning for tabular data. In this work, motivated by the recent success of tabular foundation models like TabPFNv2, we propose G2T-FM, a simple graph foundation model that employs TabPFNv2 as a backbone. Specifically, G2T-FM augments the original node features with neighborhood feature aggregation, adds structural embeddings, and then applies TabPFNv2 to the constructed node representations. Even in a fully in-context regime, our model achieves strong results, significantly outperforming publicly available GFMs and performing on par with well-tuned GNNs trained from scratch. Moreover, after finetuning, G2T-FM surpasses well-tuned GNN baselines, highlighting the potential of the proposed approach. More broadly, our paper reveals a previously overlooked direction of utilizing tabular foundation models for graph machine learning tasks.
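The feature-construction step is simple enough to sketch: under the assumptions below (mean aggregation over in-neighbors, a directed edge list), it turns a graph into a plain table a tabular foundation model can consume:

```python
import numpy as np

def graph_to_table(X: np.ndarray, edge_index: np.ndarray) -> np.ndarray:
    """Concatenate raw node features with 1-hop mean-aggregated
    neighbor features; structural embeddings (omitted here) would be
    appended the same way. X: (n, d); edge_index: (2, m), src -> dst.
    """
    n = X.shape[0]
    agg = np.zeros_like(X, dtype=float)
    deg = np.zeros(n)
    src, dst = edge_index
    np.add.at(agg, dst, X[src])
    np.add.at(deg, dst, 1.0)
    agg /= np.maximum(deg, 1.0)[:, None]   # mean over in-neighbors
    return np.concatenate([X, agg], axis=1)

# The resulting rows are ordinary tabular samples, e.g. (API assumed):
#   from tabpfn import TabPFNClassifier
#   TabPFNClassifier().fit(table[train_idx], y[train_idx])
```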
[319] Finite-Time Guarantees for Multi-Agent Combinatorial Bandits with Nonstationary Rewards
Katherine B. Adams, Justin J. Boutilier, Qinyang He, Yonatan Mintz
Main category: cs.LG
TL;DR: First framework for nonstationary combinatorial multi-armed bandits addressing habituation/recovery effects in sequential resource allocation, with theoretical guarantees and 3x improvement in real-world diabetes intervention case study.
Details
Motivation: Address sequential resource allocation problems where intervention effects evolve dynamically (habituation or recovery) in applications like community health, digital advertising, and workforce retention, requiring balancing heterogeneous rewards with exploration-exploitation tradeoffs.
Method: Developed algorithms for combinatorial multi-armed bandits with nonstationary reward distributions, incorporating theoretical guarantees on dynamic regret, and validated through a diabetes intervention case study.
Result: Achieved up to three times as much improvement in program enrollment compared to baseline approaches in the diabetes intervention case study, demonstrating practical efficacy.
Conclusion: Bridges theoretical advances in adaptive learning with practical challenges in population-level behavioral change interventions, providing the first framework for nonstationary rewards in combinatorial multi-armed bandits with real-world applicability.
Abstract: We study a sequential resource allocation problem where a decision maker selects subsets of agents at each period to maximize overall outcomes without prior knowledge of individual-level effects. Our framework applies to settings such as community health interventions, targeted digital advertising, and workforce retention programs, where intervention effects evolve dynamically. Agents may exhibit habituation (diminished response from frequent selection) or recovery (enhanced response from infrequent selection). The technical challenge centers on nonstationary reward distributions that lead to changing intervention effects over time. The problem requires balancing two key competing objectives: heterogeneous individual rewards and the exploration-exploitation tradeoff in terms of learning for improved future decisions as opposed to maximizing immediate outcomes. Our contribution introduces the first framework incorporating this form of nonstationary rewards in the combinatorial multi-armed bandit literature. We develop algorithms with theoretical guarantees on dynamic regret and demonstrate practical efficacy through a diabetes intervention case study. Our personalized community intervention algorithm achieved up to three times as much improvement in program enrollment compared to baseline approaches, validating the framework’s potential for real-world applications. This work bridges theoretical advances in adaptive learning with practical challenges in population-level behavioral change interventions.
[320] Train-Once Plan-Anywhere Kinodynamic Motion Planning via Diffusion Trees
Yaniv Hassidof, Tom Jurgenson, Kiril Solovey
Main category: cs.LG
TL;DR: DiTree combines diffusion policies with sampling-based planners to achieve provably-safe kinodynamic motion planning that generalizes to out-of-distribution scenarios while being 3x faster than classical methods.
Details
Motivation: Sampling-based planners are slow due to uninformed action sampling, while learning-based approaches lack safety guarantees and fail to generalize to out-of-distribution scenarios, limiting their deployment on physical robots.
Method: DiTree leverages diffusion policies as informed samplers to guide state-space search within sampling-based planners, combining DP’s ability to model expert trajectories with SBP’s completeness guarantees.
Result: DiTree achieves 3x faster runtimes than classical SBPs and roughly 30% higher success rate than all other approaches, while maintaining provable safety and generalizing to out-of-distribution scenarios.
Conclusion: The framework successfully combines the benefits of learning-based approaches (speed) with sampling-based planners (safety guarantees), enabling efficient and safe kinodynamic motion planning that generalizes well.
Abstract: Kinodynamic motion planning is concerned with computing collision-free trajectories while abiding by the robot’s dynamic constraints. This critical problem is often tackled using sampling-based planners (SBPs) that explore the robot’s high-dimensional state space by constructing a search tree via action propagations. Although SBPs can offer global guarantees on completeness and solution quality, their performance is often hindered by slow exploration due to uninformed action sampling. Learning-based approaches can yield significantly faster runtimes, yet they fail to generalize to out-of-distribution (OOD) scenarios and lack critical guarantees, e.g., safety, thus limiting their deployment on physical robots. We present Diffusion Tree (DiTree): a provably-generalizable framework leveraging diffusion policies (DPs) as informed samplers to efficiently guide state-space search within SBPs. DiTree combines DP’s ability to model complex distributions of expert trajectories, conditioned on local observations, with the completeness of SBPs to yield provably-safe solutions within a few action propagation iterations for complex dynamical systems. We demonstrate DiTree’s power with an implementation combining the popular RRT planner with a DP action sampler trained on a single environment. In comprehensive evaluations on OOD scenarios, DiTree is on average 3x faster than classical SBPs, and outperforms all other approaches by achieving roughly 30% higher success rate. Project webpage: https://sites.google.com/view/ditree.
[321] InSQuAD: In-Context Learning for Efficient Retrieval via Submodular Mutual Information to Enforce Quality and Diversity
Souradeep Nanda, Anay Majee, Rishabh Iyer
Main category: cs.LG
TL;DR: InSQuAD improves In-Context Learning by using Submodular Mutual Information to select high-quality and diverse examples, addressing the diversity gap in existing retrieval models through combinatorial training and synthetic data augmentation.
Details
Motivation: Existing retrieval models for In-Context Learning often focus only on query relevance while overlooking diversity, which is critical for effective ICL performance.
Method: Two main strategies: 1) Modeling ICL as targeted selection using Submodular Mutual Information to mine relevant and diverse examples, 2) Combinatorial training paradigm with novel likelihood-based loss to learn SMI parameters that enforce both quality and diversity.
Result: Significant improvements on nine benchmark datasets when using the trained retrieval model with the targeted selection formulation for ICL.
Conclusion: InSQuAD effectively enhances ICL performance by addressing the diversity problem in example selection through SMI-based quality-diversity optimization and combinatorial training.
Abstract: In this paper, we introduce InSQuAD, designed to enhance the performance of In-Context Learning (ICL) models through Submodular Mutual Information (SMI) enforcing Quality and Diversity among in-context exemplars. InSQuAD achieves this through two principal strategies: First, we model the ICL task as a targeted selection problem and introduce a unified selection strategy based on SMIs which mines relevant yet diverse in-context examples encapsulating the notions of quality and diversity. Second, we address a common pitfall in existing retrieval models which model query relevance, often overlooking diversity, critical for ICL. InSQuAD introduces a combinatorial training paradigm which learns the parameters of an SMI function to enforce both quality and diversity in the retrieval model through a novel likelihood-based loss. To further aid the learning process we augment an existing multi-hop question answering dataset with synthetically generated paraphrases. Adopting the retrieval model trained using this strategy alongside the novel targeted selection formulation for ICL on nine benchmark datasets shows significant improvements validating the efficacy of our approach.
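To make the quality-diversity trade-off concrete, here is a greedy selection sketch in the spirit of the targeted-selection formulation; the scoring rule is a simplified stand-in, not the paper's exact SMI instantiation:

```python
import numpy as np

def select_exemplars(sim_to_query, sim_pairwise, k, lam=0.5):
    """Greedily pick k in-context examples that are relevant to the
    query (quality) while penalizing redundancy with already-selected
    examples (diversity). sim_to_query: (n,); sim_pairwise: (n, n).
    """
    selected = []
    candidates = set(range(len(sim_to_query)))
    for _ in range(k):
        def gain(i):
            redundancy = max((sim_pairwise[i, j] for j in selected), default=0.0)
            return sim_to_query[i] - lam * redundancy
        best = max(candidates, key=gain)
        selected.append(best)
        candidates.remove(best)
    return selected
```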
[322] Inference-Time Alignment Control for Diffusion Models with Reinforcement Learning Guidance
Luozhijie Jin, Zijie Qiu, Jie Liu, Zijie Diao, Lifeng Qiao, Ning Ding, Alex Lamb, Xipeng Qiu
Main category: cs.LG
TL;DR: RLG is an inference-time method that combines base and RL fine-tuned diffusion models via geometric averaging, enabling dynamic control over alignment-quality trade-off without additional training.
Details
Motivation: Current RL fine-tuning methods for diffusion models are suboptimal and offer limited flexibility in controlling alignment strength after training, making it challenging to align generative outputs with complex downstream objectives.
Method: Reinterpret RL fine-tuning through SDEs and implicit reward conditioning. Introduce Reinforcement Learning Guidance (RLG) that adapts Classifier-Free Guidance by combining base and RL fine-tuned model outputs via geometric average.
Result: RLG consistently improves performance across various architectures, RL algorithms, and tasks including human preferences, compositional control, compressibility, and text rendering. Supports both interpolation and extrapolation for flexible alignment control.
Conclusion: RLG provides a practical and theoretically sound solution for enhancing and controlling diffusion model alignment at inference time, offering unprecedented flexibility without requiring further training.
Abstract: Denoising-based generative models, particularly diffusion and flow matching algorithms, have achieved remarkable success. However, aligning their output distributions with complex downstream objectives, such as human preferences, compositional accuracy, or data compressibility, remains challenging. While reinforcement learning (RL) fine-tuning methods, inspired by advances in RL from human feedback (RLHF) for large language models, have been adapted to these generative frameworks, current RL approaches are suboptimal for diffusion models and offer limited flexibility in controlling alignment strength after fine-tuning. In this work, we reinterpret RL fine-tuning for diffusion models through the lens of stochastic differential equations and implicit reward conditioning. We introduce Reinforcement Learning Guidance (RLG), an inference-time method that adapts Classifier-Free Guidance (CFG) by combining the outputs of the base and RL fine-tuned models via a geometric average. Our theoretical analysis shows that RLG’s guidance scale is mathematically equivalent to adjusting the KL-regularization coefficient in standard RL objectives, enabling dynamic control over the alignment-quality trade-off without further training. Extensive experiments demonstrate that RLG consistently improves the performance of RL fine-tuned models across various architectures, RL algorithms, and downstream tasks, including human preferences, compositional control, compressibility, and text rendering. Furthermore, RLG supports both interpolation and extrapolation, thereby offering unprecedented flexibility in controlling generative alignment. Our approach provides a practical and theoretically sound solution for enhancing and controlling diffusion model alignment at inference. The source code for RLG is publicly available on GitHub: https://github.com/jinluo12345/Reinforcement-learning-guidance.
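Because a geometric average of two model densities corresponds to a linear combination of their noise (or score) predictions, RLG amounts to a one-line change in the sampling loop. A hedged sketch:

```python
import torch

def rlg_prediction(eps_base: torch.Tensor, eps_rl: torch.Tensor,
                   gamma: float) -> torch.Tensor:
    """CFG-style combination of the base and RL fine-tuned denoisers.

    gamma = 0 recovers the base model and gamma = 1 the fine-tuned
    model; 0 < gamma < 1 interpolates and gamma > 1 extrapolates,
    acting like an inference-time dial on the KL-regularization
    strength, as described in the abstract.
    """
    return eps_base + gamma * (eps_rl - eps_base)
```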
[323] Fast Convergence Rates for Subsampled Natural Gradient Algorithms on Quadratic Model Problems
Gil Goldshlager, Jiang Hu, Lin Lin
Main category: cs.LG
TL;DR: Theoretical analysis shows subsampled natural gradient descent (SNGD) and its accelerated variant SPRING are equivalent to regularized Kaczmarz methods, providing first convergence guarantees and acceleration proofs for these optimization methods.
Details
Motivation: SNGD has shown impressive empirical results in scientific machine learning applications but lacked theoretical explanation, creating a gap between practical success and theoretical understanding.
Method: Analyzed SNGD and SPRING convergence for idealized problems with linear models and strongly convex quadratic losses. Proved equivalence to regularized Kaczmarz methods and leveraged existing analyses to establish convergence rates.
Result: First fast convergence rate for SNGD, first convergence guarantee for SPRING, and first proof that SPRING can accelerate SNGD. Extended analysis to general strongly convex quadratic losses.
Conclusion: Tools from randomized linear algebra can illuminate the interplay between subsampling and curvature-aware optimization strategies, providing theoretical foundation for SNGD’s effectiveness beyond least-squares settings.
Abstract: Subsampled natural gradient descent (SNGD) has shown impressive results for parametric optimization tasks in scientific machine learning, such as neural network wavefunctions and physics-informed neural networks, but it has lacked a theoretical explanation. We address this gap by analyzing the convergence of SNGD and its accelerated variant, SPRING, for idealized parametric optimization problems where the model is linear and the loss function is strongly convex and quadratic. In the special case of a least-squares loss, namely the standard linear least-squares problem, we prove that SNGD is equivalent to a regularized Kaczmarz method while SPRING is equivalent to an accelerated regularized Kaczmarz method. As a result, by leveraging existing analyses we obtain under mild conditions (i) the first fast convergence rate for SNGD, (ii) the first convergence guarantee for SPRING in any setting, and (iii) the first proof that SPRING can accelerate SNGD. In the case of a general strongly convex quadratic loss, we extend the analysis of the regularized Kaczmarz method to obtain a fast convergence rate for SNGD under stronger conditions, providing the first explanation for the effectiveness of SNGD outside of the least-squares setting. Overall, our results illustrate how tools from randomized linear algebra can shed new light on the interplay between subsampling and curvature-aware optimization strategies.
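For the standard linear least-squares problem, the SNGD update in its regularized-Kaczmarz form is short enough to write out; a numpy sketch of the update the equivalence refers to:

```python
import numpy as np

def sngd_step(theta, A, b, rows, lam=1e-3):
    """One subsampled natural-gradient step on min ||A theta - b||^2,
    written as a regularized Kaczmarz update over the sampled rows S:
        theta <- theta + A_S^T (A_S A_S^T + lam I)^{-1} (b_S - A_S theta)
    """
    A_S, b_S = A[rows], b[rows]
    residual = b_S - A_S @ theta
    gram = A_S @ A_S.T + lam * np.eye(len(rows))
    return theta + A_S.T @ np.linalg.solve(gram, residual)
```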
[324] Rethinking Invariance Regularization in Adversarial Training to Improve Robustness-Accuracy Trade-off
Futa Waseda, Ching-Chun Chang, Isao Echizen
Main category: cs.LG
TL;DR: ARAT addresses robustness-accuracy trade-off in adversarial training by resolving gradient conflicts and mixture distribution problems through asymmetric invariance loss with stop-gradient and split-BatchNorm structure.
Details
Motivation: Adversarial training suffers from robustness-accuracy trade-off, and existing invariance regularization methods still lead to accuracy loss. The paper identifies two key issues: gradient conflict between invariance and classification objectives, and mixture distribution problem from diverged clean/adversarial input distributions.
Method: Proposes Asymmetric Representation-regularized Adversarial Training (ARAT) with: 1) asymmetric invariance loss with stop-gradient operation and predictor to avoid gradient conflict, 2) split-BatchNorm structure to resolve mixture distribution problem.
Result: ARAT shows superiority over existing methods across various settings. Each component effectively addresses the identified issues, offering novel insights into adversarial defense.
Conclusion: The method successfully mitigates the robustness-accuracy trade-off in adversarial training and provides new perspectives on knowledge distillation-based defenses.
Abstract: Adversarial training often suffers from a robustness-accuracy trade-off, where achieving high robustness comes at the cost of accuracy. One approach to mitigate this trade-off is leveraging invariance regularization, which encourages model invariance under adversarial perturbations; however, it still leads to accuracy loss. In this work, we closely analyze the challenges of using invariance regularization in adversarial training and understand how to address them. Our analysis identifies two key issues: (1) a ``gradient conflict" between invariance and classification objectives, leading to suboptimal convergence, and (2) the mixture distribution problem arising from diverged distributions between clean and adversarial inputs. To address these issues, we propose Asymmetric Representation-regularized Adversarial Training (ARAT), which incorporates asymmetric invariance loss with stop-gradient operation and a predictor to avoid gradient conflict, and a split-BatchNorm (BN) structure to resolve the mixture distribution problem. Our detailed analysis demonstrates that each component effectively addresses the identified issues, offering novel insights into adversarial defense. ARAT shows superiority over existing methods across various settings. Finally, we discuss the implications of our findings to knowledge distillation-based defenses, providing a new perspective on their relative successes.
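The asymmetric invariance term is the heart of the method. A minimal PyTorch sketch, with the similarity measure and predictor architecture assumed for illustration:

```python
import torch.nn.functional as F

def asymmetric_invariance_loss(z_clean, z_adv, predictor):
    """Pull the adversarial representation (through a small predictor
    head) toward a stop-gradient copy of the clean representation, so
    the invariance objective cannot drag the clean branch off the
    classification optimum, avoiding the gradient conflict above.
    z_clean, z_adv: (batch, dim) representations from the two branches.
    """
    target = z_clean.detach()        # stop-gradient on the clean side
    prediction = predictor(z_adv)    # asymmetry via the predictor head
    return 1.0 - F.cosine_similarity(prediction, target, dim=-1).mean()
```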
[325] Investigating the Robustness of Counterfactual Learning to Rank Models: A Reproducibility Study
Zechun Niu, Zhilin Zhang, Jiaxin Mao, Qingyao Ai, Ji-Rong Wen
Main category: cs.LG
TL;DR: This paper investigates the robustness of counterfactual learning to rank (CLTR) models through extensive simulation experiments, finding that IPS-DCM, DLA-PBM, and UPE models show better robustness, while current CLTR models often fail to outperform naive baselines with strong production rankers and limited training data.
Details
Motivation: Previous simulation-based experiments for CLTR evaluation have limitations including weak production rankers, simplified user models, and fixed synthetic data sizes, leaving the robustness of CLTR models in complex situations largely unknown and requiring investigation.
Method: Conducted reproducibility study with extensive simulation experiments using: (1) production rankers with different performance levels, (2) multiple user simulation models with different behavior assumptions, and (3) varying numbers of synthetic training sessions.
Result: IPS-DCM, DLA-PBM, and UPE models demonstrated better robustness across various simulation settings. However, existing CLTR models often failed to outperform naive click baselines when the production ranker was strong and training sessions were limited.
Conclusion: There is a pressing need for new CLTR algorithms specifically designed for conditions with strong production rankers and limited training data, as current models show limitations in these scenarios despite some models exhibiting better robustness overall.
Abstract: Counterfactual learning to rank (CLTR) has attracted extensive attention in the IR community for its ability to leverage massive logged user interaction data to train ranking models. While the CLTR models can be theoretically unbiased when the user behavior assumption is correct and the propensity estimation is accurate, their effectiveness is usually empirically evaluated via simulation-based experiments due to a lack of widely available, large-scale, real click logs. However, many previous simulation-based experiments are somewhat limited because they may have one or more of the following deficiencies: 1) using a weak production ranker to generate initial ranked lists, 2) relying on a simplified user simulation model to simulate user clicks, and 3) generating a fixed number of synthetic click logs. As a result, the robustness of CLTR models in complex and diverse situations is largely unknown and needs further investigation. To address this problem, in this paper, we aim to investigate the robustness of existing CLTR models in a reproducibility study with extensive simulation-based experiments that (1) use production rankers with different ranking performance, (2) leverage multiple user simulation models with different user behavior assumptions, and (3) generate different numbers of synthetic sessions for the training queries. We find that the IPS-DCM, DLA-PBM, and UPE models show better robustness under various simulation settings than other CLTR models. Moreover, existing CLTR models often fail to outperform naive click baselines when the production ranker is strong and the number of training sessions is limited, indicating a pressing need for new CLTR algorithms tailored to these conditions.
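The models compared here build on inverse propensity scoring. A minimal sketch of the shared estimator, simplified to a per-document reweighted loss:

```python
import torch
import torch.nn.functional as F

def ips_loss(scores, clicks, propensities, eps=1e-6):
    """Inverse-propensity-scored pointwise loss: each observed click
    is up-weighted by 1/propensity, which makes the empirical risk an
    unbiased estimate of the full-information risk when the click
    model and propensities are correct. All tensors: (batch, list_len).
    """
    weights = clicks / propensities.clamp(min=eps)
    return (weights * F.softplus(-scores)).mean()
```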
[326] CoMoE: Contrastive Representation for Mixture-of-Experts in Parameter-Efficient Fine-tuning
Jinyuan Feng, Chaopeng Wei, Tenghai Qiu, Tianyi Hu, Zhiqiang Pu
Main category: cs.LG
TL;DR: CoMoE introduces contrastive learning to MoE fine-tuning to prevent expert redundancy and improve specialization on heterogeneous datasets
Details
Motivation: Current MoE methods underutilize capacity on heterogeneous datasets as experts learn similar knowledge instead of specializing.
Method: Adds contrastive objective by sampling from activated vs inactivated experts in top-k routing to recover mutual information gap
Result: Consistently enhances MoE capacity and promotes expert modularization across multiple benchmarks and multi-task settings
Conclusion: Contrastive representation learning effectively addresses expert redundancy in MoE, improving specialization and utilization
Abstract: In parameter-efficient fine-tuning, mixture-of-experts (MoE), which involves specializing functionalities into different experts and sparsely activating them appropriately, has been widely adopted as a promising approach to trade-off between model capacity and computation overhead. However, current MoE variants fall short on heterogeneous datasets, ignoring the fact that experts may learn similar knowledge, resulting in the underutilization of MoE’s capacity. In this paper, we propose Contrastive Representation for MoE (CoMoE), a novel method to promote modularization and specialization in MoE, where the experts are trained along with a contrastive objective by sampling from activated and inactivated experts in top-k routing. We demonstrate that such a contrastive objective recovers the mutual-information gap between inputs and the two types of experts. Experiments on several benchmarks and in multi-task settings demonstrate that CoMoE can consistently enhance MoE’s capacity and promote modularization among the experts.
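One plausible instantiation of the contrastive objective is an InfoNCE form in which the activated expert's output is the positive for a token and inactivated experts supply the negatives; the paper's exact loss may differ:

```python
import torch
import torch.nn.functional as F

def comoe_contrastive(anchor, positive, negatives, tau=0.1):
    """anchor: (B, D) token representations; positive: (B, D) outputs
    of an activated (top-k) expert; negatives: (B, K, D) outputs of
    inactivated experts. Minimizing this pushes experts toward
    distinct specializations instead of redundant knowledge.
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)
    l_pos = (a * p).sum(-1, keepdim=True) / tau          # (B, 1)
    l_neg = torch.einsum("bd,bkd->bk", a, n) / tau       # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1)
    labels = torch.zeros(a.size(0), dtype=torch.long, device=a.device)
    return F.cross_entropy(logits, labels)
```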
[327] LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty
Christoforos N. Spartalis, Theodoros Semertzidis, Efstratios Gavves, Petros Daras
Main category: cs.LG
TL;DR: LoTUS is a novel Machine Unlearning method that removes training sample influence from pre-trained models without retraining, using probability smoothing up to an information-theoretic bound to mitigate over-confidence from data memorization.
Details
Motivation: To eliminate the need for retraining from scratch when removing specific training samples' influence from models, which is particularly important for large-scale datasets like ImageNet1k where retraining is impractical.
Method: LoTUS smooths the prediction probabilities of pre-trained models up to an information-theoretic bound to mitigate over-confidence caused by data memorization, and introduces the Retrain-Free Jensen-Shannon Divergence (RF-JSD) metric for evaluation.
Result: LoTUS outperforms eight state-of-the-art baselines on Transformer and ResNet18 models across five public datasets, including large-scale ImageNet1k, demonstrating superior efficiency and effectiveness.
Conclusion: LoTUS provides an effective and efficient solution for machine unlearning that avoids costly retraining while maintaining performance, with practical applicability to real-world large-scale datasets.
Abstract: We present LoTUS, a novel Machine Unlearning (MU) method that eliminates the influence of training samples from pre-trained models, avoiding retraining from scratch. LoTUS smooths the prediction probabilities of the model up to an information-theoretic bound, mitigating its over-confidence stemming from data memorization. We evaluate LoTUS on Transformer and ResNet18 models against eight baselines across five public datasets. Beyond established MU benchmarks, we evaluate unlearning on ImageNet1k, a large-scale dataset, where retraining is impractical, simulating real-world conditions. Moreover, we introduce the novel Retrain-Free Jensen-Shannon Divergence (RF-JSD) metric to enable evaluation under real-world conditions. The experimental results show that LoTUS outperforms state-of-the-art methods in terms of both efficiency and effectiveness. Code: https://github.com/cspartalis/LoTUS.
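The RF-JSD metric scores unlearning without a retrained reference model. A hedged sketch of a JSD-style comparison between the unlearned model's predictions on the forget set and on unseen test data (the paper's exact normalization may differ):

```python
import torch

def rf_jsd(p_forget: torch.Tensor, p_test: torch.Tensor) -> torch.Tensor:
    """JSD between the mean predictive distribution on the forget set
    and on unseen test data; inputs are (N, C) class-probability rows.
    A lower score means forgotten samples look like unseen data.
    """
    def kl(a, b):
        return (a * (a.clamp_min(1e-12) / b.clamp_min(1e-12)).log()).sum()

    p, q = p_forget.mean(dim=0), p_test.mean(dim=0)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```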
[328] GLProtein: Global-and-Local Structure Aware Protein Representation Learning
Yunqing Liu, Wenqi Fan, Xiaoyong Wei, Qing Li
Main category: cs.LG
TL;DR: GLProtein is a novel protein pre-training framework that integrates both global structural similarity and local amino acid details to improve protein function prediction accuracy.
Details
Motivation: Current protein sequence analysis methods don't fully leverage structural information, which includes both 3D structure and amino acid-level details. There's untapped potential in combining global structural similarity with local molecular information for better protein understanding.
Method: GLProtein combines protein-masked modeling with triplet structure similarity scoring, protein 3D distance encoding, and substructure-based amino acid molecule encoding to capture both global and local structural information.
Result: Experimental results show GLProtein outperforms previous methods in bioinformatics tasks including protein-protein interaction prediction and contact prediction.
Conclusion: Integrating both global structural similarity and local amino acid details significantly enhances protein prediction accuracy and functional insights, making GLProtein an effective framework for protein pre-training.
Abstract: Proteins are central to biological systems, participating as building blocks across all forms of life. Despite advancements in understanding protein functions through protein sequence analysis, there remains potential for further exploration in integrating protein structural information. We argue that the structural information of proteins is not only limited to their 3D information but also encompasses information from amino acid molecules (local information) to protein-protein structure similarity (global information). To address this, we propose GLProtein, the first framework in protein pre-training that incorporates both global structural similarity and local amino acid details to enhance prediction accuracy and functional insights. GLProtein innovatively combines protein-masked modelling with triplet structure similarity scoring, protein 3D distance encoding and substructure-based amino acid molecule encoding. Experimental results demonstrate that GLProtein outperforms previous methods in several bioinformatics tasks, including predicting protein-protein interaction, contact prediction, and so on.
[329] Reconsidering the Performance of GAE in Link Prediction
Weishuo Ma, Yanbo Wang, Xiyuan Wang, Muhan Zhang
Main category: cs.LG
TL;DR: Well-tuned Graph Autoencoders (GAEs) match performance of recent sophisticated GNN models for link prediction while being more computationally efficient, achieving state-of-the-art results on certain benchmarks.
Details
Motivation: Recent GNN advancements for link prediction may exaggerate benefits due to outdated baselines, requiring systematic exploration of GAEs with modern techniques.
Method: Systematically applied model-agnostic tricks from recent methods and tuned hyperparameters to optimize Graph Autoencoders for link prediction.
Result: Achieved state-of-the-art Hits@100 score of 78.41% on ogbl-ppa dataset, with substantial gains on datasets where structural information dominates and features are limited.
Conclusion: Well-tuned GAEs can match sophisticated models’ performance with better efficiency, emphasizing the need to update baselines for accurate progress assessment in GNN link prediction.
Abstract: Recent advancements in graph neural networks (GNNs) for link prediction have introduced sophisticated training techniques and model architectures. However, reliance on outdated baselines may exaggerate the benefits of these new approaches. To tackle this issue, we systematically explore Graph Autoencoders (GAEs) by applying model-agnostic tricks in recent methods and tuning hyperparameters. We find that a well-tuned GAE can match the performance of recent sophisticated models while offering superior computational efficiency on widely-used link prediction benchmarks. Our approach delivers substantial performance gains on datasets where structural information dominates and feature data is limited. Specifically, our GAE achieves a state-of-the-art Hits@100 score of 78.41% on the ogbl-ppa dataset. Furthermore, we examine the impact of various tricks to uncover the reasons behind our success and to guide the design of future methods. Our study emphasizes the critical need to update baselines for a more accurate assessment of progress in GNNs for link prediction. Our code is available at https://github.com/GraphPKU/Refined-GAE.
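Part of the appeal of GAEs here is that the decoder is just an inner product over node embeddings, so scoring candidate links is a one-liner:

```python
import torch

def link_logits(z: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
    """Score candidate links with GAE's inner-product decoder.
    z: (n, d) node embeddings from any encoder; edge_index: (2, m).
    Higher logits indicate more probable edges; apply a sigmoid for
    probabilities and binary cross-entropy for training.
    """
    src, dst = edge_index
    return (z[src] * z[dst]).sum(dim=-1)
```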
[330] Categorical Data Clustering via Value Order Estimated Distance Metric Learning
Yiqun Zhang, Mingjie Zhao, Hong Jia, Yang Lu, Mengke Li, Yiu-ming Cheung
Main category: cs.LG
TL;DR: A novel order distance metric learning approach for categorical data clustering that learns optimal order relationships to represent categorical values numerically, enabling better clustering accuracy and interpretability.
Details
Motivation: Categorical data lacks natural metric space like numerical data, making clustering difficult and potentially distorting valuable distribution patterns. Existing methods struggle to represent categorical attributes intuitively.
Method: Joint learning paradigm that alternatively performs clustering and order distance metric learning. Learns optimal order relationships for categorical values to quantify their distance in a linear space similar to numerical attributes.
Result: Achieves superior clustering accuracy on categorical and mixed datasets. The learned order distance metric reduces difficulty in understanding and managing categorical data. Experiments with ablation studies and significance tests validate efficacy.
Conclusion: The proposed method effectively addresses the challenge of clustering categorical data by learning intuitive order relationships, providing both improved accuracy and better interpretability compared to traditional approaches.
Abstract: Clustering is a popular machine learning technique for data mining that can process and analyze datasets to automatically reveal sample distribution patterns. Since the ubiquitous categorical data naturally lack a well-defined metric space such as the Euclidean distance space of numerical data, the distribution of categorical data is usually under-represented, and thus valuable information can be easily twisted in clustering. This paper, therefore, introduces a novel order distance metric learning approach to intuitively represent categorical attribute values by learning their optimal order relationship and quantifying their distance in a line similar to that of the numerical attributes. Since subjectively created qualitative categorical values involve ambiguity and fuzziness, the order distance metric is learned in the context of clustering. Accordingly, a new joint learning paradigm is developed to alternatively perform clustering and order distance metric learning with low time complexity and a guarantee of convergence. Due to the clustering-friendly order learning mechanism and the homogeneous ordinal nature of the order distance and Euclidean distance, the proposed method achieves superior clustering accuracy on categorical and mixed datasets. More importantly, the learned order distance metric greatly reduces the difficulty of understanding and managing the non-intuitive categorical data. Experiments with ablation studies, significance tests, case studies, etc., have validated the efficacy of the proposed method. The source code is available at https://github.com/DAJ0612/OCL_Source_Code.
[331] FLASH: Federated Learning Across Simultaneous Heterogeneities
Xiangyu Chang, Sk Miraj Ahmed, Srikanth V. Krishnamurthy, Basak Guler, Ananthram Swami, Samet Oymak, Amit K. Roy-Chowdhury
Main category: cs.LG
TL;DR: FLASH is a client selection algorithm for federated learning that handles multiple simultaneous heterogeneities (data quality, distribution, and latency) through contextual multi-armed bandits, achieving up to 10% accuracy improvements over state-of-the-art methods.
Details
Motivation: Federated learning faces challenges from diverse client heterogeneities including data distribution variations, data quality differences, and compute/communication latency. These heterogeneities often occur simultaneously and interact (e.g., low-latency clients may have poor data quality), requiring an integrated approach.
Method: FLASH uses contextual multi-armed bandits (CMAB) to model learning dynamics and dynamically select the most promising clients by trading off statistical information related to data quality, data distribution, and latency.
Result: Extensive experiments show FLASH achieves substantial improvements (up to 10% absolute accuracy) over state-of-the-art baselines. It outperforms federated aggregation methods designed for heterogeneous settings and even improves performance when integrated with them.
Conclusion: FLASH provides the first unified approach to handle multiple simultaneous heterogeneities in federated learning through lightweight and flexible client selection, demonstrating consistent and significant performance gains across diverse scenarios.
Abstract: The key premise of federated learning (FL) is to train ML models across a diverse set of data-owners (clients), without exchanging local data. An overarching challenge to this date is client heterogeneity, which may arise not only from variations in data distribution, but also in data quality, as well as compute/communication latency. An integrated view of these diverse and concurrent sources of heterogeneity is critical; for instance, low-latency clients may have poor data quality, and vice versa. In this work, we propose FLASH (Federated Learning Across Simultaneous Heterogeneities), a lightweight and flexible client selection algorithm that outperforms state-of-the-art FL frameworks under extensive sources of heterogeneity, by trading off the statistical information associated with the client’s data quality, data distribution, and latency. FLASH is the first method, to our knowledge, for handling all these heterogeneities in a unified manner. To do so, FLASH models the learning dynamics through contextual multi-armed bandits (CMAB) and dynamically selects the most promising clients. Through extensive experiments, we demonstrate that FLASH achieves substantial and consistent improvements over state-of-the-art baselines – as much as 10% in absolute accuracy – thanks to its unified approach. Importantly, FLASH also outperforms federated aggregation methods that are designed to handle highly heterogeneous settings and even enjoys a performance boost when integrated with them.
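A hedged sketch of CMAB-style client selection in FLASH's spirit; the LinUCB-flavored score below is an assumption of this sketch, not the paper's estimator. Each client's context vector can encode its data-quality, distribution, and latency statistics:

```python
import numpy as np

def select_clients(contexts, theta, A_inv, k, alpha=1.0):
    """Pick the k most promising clients under a linear contextual
    bandit: mean payoff estimate plus an exploration bonus.
    contexts: (n_clients, d); theta: (d,) learned payoff weights;
    A_inv: (d, d) inverse design matrix from past observations.
    """
    mean = contexts @ theta
    bonus = alpha * np.sqrt(
        np.einsum("nd,de,ne->n", contexts, A_inv, contexts))
    return np.argsort(-(mean + bonus))[:k]   # top-k UCB scores
```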
[332] drGT: Attention-Guided Gene Assessment of Drug Response Utilizing a Drug-Cell-Gene Heterogeneous Network
Yoshitaka Inoue, Hunmin Lee, Tianfan Fu, Augustin Luna
Main category: cs.LG
TL;DR: drGT is a graph deep learning model for drug response prediction that achieves competitive performance while providing interpretability through attention coefficients for biomarker identification and biological process analysis.
Details
Motivation: The challenge in drug response prediction is result interpretation compared to established knowledge. Existing methods often lack interpretability, making it difficult to understand the biological mechanisms behind predictions.
Method: drGT uses a heterogeneous graph with drugs, genes, and cell line relationships. It employs attention coefficients to identify important biomarkers and leverages text-mining of PubMed abstracts for validation. The model is trained on benchmark datasets Sanger GDSC, NCI60, and Broad CTRP.
Result: drGT achieves AUROC of 94.5% (random split), 84.4% (unseen drugs), and 70.6% (unseen cell lines). For 976 drugs with known DTIs, 36.9% utilized known interactions and 63.67% of associations were supported by PubMed literature or DTI model predictions.
Conclusion: drGT provides both accurate drug response predictions and interpretable results through attention mechanisms, enabling biomarker identification and biological process analysis that aligns with established knowledge from literature.
Abstract: A challenge in drug response prediction is result interpretation compared to established knowledge. drGT is a graph deep learning model that predicts sensitivity and aids in biomarker identification using attention coefficients (ACs). drGT leverages a heterogeneous graph composed of relationships drawn from drugs, genes, and cell line responses. The model is trained and evaluated using major benchmark datasets: Sanger GDSC, NCI60, and Broad CTRP, which cover a wide range of drugs and cancer cell lines. drGT demonstrates AUROC of up to 94.5% under random splitting, 84.4% for unseen drugs, and 70.6% for unseen cell lines, comparable to existing benchmark methods while also providing interpretability. Regarding interpretability, we review drug-gene co-occurrences by text-mining PubMed abstracts for high-coefficient genes mentioning particular drugs. Across 976 drugs from NCI60 with known drug-target interactions (DTIs), model predictions utilized both known DTIs (36.9%) as well as additional predictive associations, many supported by literature. In addition, we compare the drug-gene associations identified by drGT with those from an established DTI prediction model and find that 63.67% are supported by either PubMed literature or predictions from the DTI model. Further, we describe the utilization of ACs to identify affected biological processes by each drug via enrichment analyses, thereby enhancing biological interpretability. Code is available at https://github.com/sciluna/drGT.
[333] Unlearning Concepts from Text-to-Video Diffusion Models
Shiqi Liu, Yihua Tan
Main category: cs.LG
TL;DR: A novel concept-unlearning method that transfers unlearning capabilities from text-to-image to text-to-video diffusion models, enabling efficient removal of copyrighted content, artist styles, and private information with low computational cost.
Details
Motivation: Text-to-video diffusion models are trained on internet data containing copyrighted content, private portraits, and unsafe videos. Filtering training data is challenging, and existing unlearning methods are computationally expensive for video models.
Method: Transfers unlearning capability from text-to-image diffusion models’ text encoder to text-to-video models. Uses few-shot unlearning with generated images to optimize the text encoder, then applies it to video generation.
Result: Method successfully unlearns copyrighted cartoon characters, artist styles, objects, and facial characteristics. Achieves concept unlearning in about 100 seconds on RTX 3070 with low computation resources.
Conclusion: First feasible concept unlearning method for text-to-video diffusion models, making unlearning more accessible in the video domain with efficient computation and small optimization scale.
Abstract: With the advancement of computer vision and natural language processing, text-to-video generation, enabled by text-to-video diffusion models, has become more prevalent. These models are trained using a large amount of data from the internet. However, the training data often contain copyrighted content, including cartoon character icons and artist styles, private portraits, and unsafe videos. Since filtering the data and retraining the model is challenging, methods for unlearning specific concepts from text-to-video diffusion models have been investigated. However, due to the high computational complexity and relatively large optimization scale, there is little work on unlearning methods for text-to-video diffusion models. We propose a novel concept-unlearning method by transferring the unlearning capability of the text encoder of text-to-image diffusion models to text-to-video diffusion models. Specifically, the method optimizes the text encoder using few-shot unlearning, where several generated images are used. We then use the optimized text encoder in text-to-video diffusion models to generate videos. Our method requires little computation and has a small optimization scale. We discuss the generated videos after unlearning a concept. The experiments demonstrate that our method can unlearn copyrighted cartoon characters, artist styles, objects, and people’s facial characteristics. Our method can unlearn a concept within about 100 seconds on an RTX 3070. Since there was no concept unlearning method for text-to-video diffusion models before, we make concept unlearning feasible and more accessible in the text-to-video domain.
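As a rough illustration of the few-shot text-encoder unlearning step, a minimal sketch follows; the `text_encoder` and `tokenize` handles, the anchor prompt, and the MSE objective are assumptions standing in for the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

# Illustrative few-shot unlearning loop: nudge the text encoder so the
# unwanted concept's embedding collapses onto a neutral anchor prompt.
# `text_encoder` and `tokenize` are placeholders for the reused T2I stack.
def unlearn_concept(text_encoder, tokenize, concept, anchor,
                    steps=200, lr=1e-5):
    opt = torch.optim.Adam(text_encoder.parameters(), lr=lr)
    with torch.no_grad():
        target = text_encoder(tokenize(anchor))   # fixed anchor embedding
    for _ in range(steps):
        emb = text_encoder(tokenize(concept))
        loss = F.mse_loss(emb, target)            # pull concept onto anchor
        opt.zero_grad()
        loss.backward()
        opt.step()
    return text_encoder   # then reused unchanged inside the T2V pipeline
```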
[334] ExPath: Targeted Pathway Inference for Biological Knowledge Bases via Graph Learning and Explanation
Rikuto Kotoge, Ziwei Yang, Zheng Chen, Yushun Dong, Yasuko Matsubara, Jimeng Sun, Yasushi Sakurai
Main category: cs.LG
TL;DR: ExPath is a novel subgraph inference framework that integrates experimental data to classify biological networks and identify targeted pathways, achieving superior performance over baseline methods.
Details
Motivation: Retrieving targeted pathways in biological knowledge bases with experimental data is challenging and requires specialized expertise, which this work aims to automate and improve.
Method: Frames the problem as a graph learning and explaining task, proposes the ExPath framework that integrates experimental data and biological foundation models to encode molecular data, with ML-oriented biological evaluations and new metrics.
Result: Experiments on 301 bio-networks show ExPath infers biologically meaningful pathways with up to 4.5x higher Fidelity+ and 14x lower Fidelity- than baselines, while preserving signaling chains up to 4x longer.
Conclusion: ExPath provides an effective framework for identifying targeted pathways in biological networks by integrating experimental data and achieving significant improvements over existing explainer methods.
Abstract: Retrieving targeted pathways in biological knowledge bases, particularly when incorporating wet-lab experimental data, remains a challenging task and often requires downstream analyses and specialized expertise. In this paper, we frame this challenge as a solvable graph learning and explaining task and propose a novel subgraph inference framework, ExPath, that explicitly integrates experimental data to classify various graphs (bio-networks) in biological databases. The links (representing pathways) that contribute more to classification can be considered as targeted pathways. Our framework can seamlessly integrate biological foundation models to encode the experimental molecular data. We propose ML-oriented biological evaluations and a new metric. Experiments involving 301 bio-networks demonstrate that pathways inferred by ExPath are biologically meaningful, achieving up to 4.5x higher Fidelity+ (necessity) and 14x lower Fidelity- (sufficiency) than explainer baselines, while preserving signaling chains up to 4x longer.
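For reference, the two explainer metrics quoted above can be sketched as follows, given a classifier `predict(graph)` that returns the true-class probability; `mask_edges` is a hypothetical helper, not an API from the paper:

```python
# `predict(graph)` returns the model's probability for the true class;
# `mask_edges(graph, remove=..., keep=...)` is a hypothetical edge-masking
# helper used only to illustrate the definitions.
def fidelity_plus(predict, graph, expl_edges, mask_edges):
    """Necessity: prediction should drop once explanation edges are removed."""
    return predict(graph) - predict(mask_edges(graph, remove=expl_edges))

def fidelity_minus(predict, graph, expl_edges, mask_edges):
    """Sufficiency: prediction should survive on the explanation alone
    (lower is better)."""
    return predict(graph) - predict(mask_edges(graph, keep=expl_edges))
```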
[335] Expert Routing with Synthetic Data for Continual Learning
Yewon Byun, Sanket Vaibhav Mehta, Saurabh Garg, Emma Strubell, Michael Oberst, Bryan Wilder, Zachary C. Lipton
Main category: cs.LG
TL;DR: G2D is a domain-incremental continual learning method that uses synthetic data to train a domain discriminator for routing samples to appropriate domain-specific experts, outperforming traditional approaches that use synthetic data directly for classifier training.
Details
Motivation: Real-world regulations often allow model sharing but not data sharing across institutions. Practitioners need to adapt models to new domains without catastrophic forgetting while maintaining performance on previous domains, requiring effective routing mechanisms for domain-specific experts.
Method: Generate to Discriminate (G2D) leverages synthetic data to train a domain-discriminator that routes test samples to the appropriate domain-specific expert, rather than using synthetic data directly for classifier training.
Result: G2D outperforms competitive domain-incremental learning methods on both vision and language tasks, demonstrating superior performance compared to traditional approaches that use synthetic data for direct classifier training.
Conclusion: Using synthetic data to train domain discriminators for expert routing is more effective than direct classifier training, providing a new perspective on synthetic data utilization in lifelong learning.
Abstract: In many real-world settings, regulations and economic incentives permit the sharing of models but not data across institutional boundaries. In such scenarios, practitioners might hope to adapt models to new domains, without losing performance on previous domains (so-called catastrophic forgetting). While any single model may struggle to achieve this goal, learning an ensemble of domain-specific experts offers the potential to adapt more closely to each individual institution. However, a core challenge in this context is determining which expert to deploy at test time. In this paper, we propose Generate to Discriminate (G2D), a domain-incremental continual learning method that leverages synthetic data to train a domain-discriminator that routes samples at inference time to the appropriate expert. Surprisingly, we find that leveraging synthetic data in this capacity is more effective than using the samples to \textit{directly} train the downstream classifier (the more common approach to leveraging synthetic data in the lifelong learning literature). We observe that G2D outperforms competitive domain-incremental learning methods on tasks in both vision and language modalities, providing a new perspective on the use of synthetic data in the lifelong learning literature.
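A minimal sketch of the routing idea, with hypothetical synthetic sets and frozen per-domain experts (the discriminator here is a plain logistic regression; the paper's choice may differ):

```python
from sklearn.linear_model import LogisticRegression

def build_router(synthetic_sets):
    """synthetic_sets: list of (X_synth, domain_id); one set per past domain,
    produced by that domain's generative model."""
    X = [x for Xs, _ in synthetic_sets for x in Xs]
    y = [d for Xs, d in synthetic_sets for _ in Xs]
    return LogisticRegression(max_iter=1000).fit(X, y)

def route_and_predict(router, experts, x):
    domain = router.predict([x])[0]     # discriminator picks the domain...
    return experts[domain](x)           # ...and dispatches to its expert
```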
[336] LASE: Learned Adjacency Spectral Embeddings
Sofía Pérez Casulo, Marcelo Fiori, Federico Larroca, Gonzalo Mateos
Main category: cs.LG
TL;DR: LASE is a neural architecture that learns nodal Adjacency Spectral Embeddings from graphs using unrolled gradient descent, combining GCN and GAT modules for efficient and robust spectral embedding approximation.
Details
Motivation: To create an interpretable, parameter-efficient neural architecture that can approximate spectral embeddings while being robust to missing edges and offering controllable complexity during inference.
Method: Unrolls gradient descent iterations for spectral embedding into GNN layers, combining Graph Convolutional Network (GCN) and Graph Attention Network (GAT) modules with refinements like sparse attention and decoupled parameters.
Result: LASE outperforms optimized eigendecomposition routines and shows competitive performance in supervised link prediction and node classification tasks, even beating GNNs with precomputed spectral positional encodings.
Conclusion: LASE provides a differentiable, trainable spectral embedding module that can be integrated into end-to-end graph learning pipelines, offering improved performance and efficiency over traditional methods.
Abstract: We put forth a principled design of a neural architecture to learn nodal Adjacency Spectral Embeddings (ASE) from graph inputs. By bringing to bear the gradient descent (GD) method and leveraging the principle of algorithm unrolling, we truncate and re-interpret each GD iteration as a layer in a graph neural network (GNN) that is trained to approximate the ASE. Accordingly, we call the resulting embeddings and our parametric model Learned ASE (LASE), which is interpretable, parameter efficient, robust to inputs with unobserved edges, and offers controllable complexity during inference. LASE layers combine Graph Convolutional Network (GCN) and fully-connected Graph Attention Network (GAT) modules, which is intuitively pleasing since GCN-based local aggregations alone are insufficient to express the sought graph eigenvectors. We propose several refinements to the unrolled LASE architecture (such as sparse attention in the GAT module and decoupled layerwise parameters) that offer favorable approximation error versus computation tradeoffs; even outperforming heavily-optimized eigendecomposition routines from scientific computing libraries. Because LASE is a differentiable function with respect to its parameters as well as its graph input, we can seamlessly integrate it as a trainable module within a larger (semi-)supervised graph representation learning pipeline. The resulting end-to-end system effectively learns "discriminative ASEs" that exhibit competitive performance in supervised link prediction and node classification tasks, outperforming a GNN even when the latter is endowed with open-loop (i.e., task-agnostic) precomputed spectral positional encodings.
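A stripped-down view of the unrolling principle (no GAT module, one learnable step size per layer; LASE itself generalizes each step with learnable GCN/GAT parameters):

```python
import torch

# Each "layer" is one gradient step on the ASE objective ||A - X X^T||_F^2;
# the step sizes are learned, which is the simplest instance of unrolling.
class UnrolledASE(torch.nn.Module):
    def __init__(self, n_layers=10):
        super().__init__()
        # one learnable step size per unrolled iteration
        self.steps = torch.nn.Parameter(torch.full((n_layers,), 0.01))

    def forward(self, A, X0):
        X = X0
        for eta in self.steps:
            X = X + eta * (A - X @ X.T) @ X   # GD step on the ASE objective
        return X

A = torch.rand(20, 20); A = (A + A.T) / 2     # symmetric toy "adjacency"
X = UnrolledASE()(A, torch.randn(20, 4) * 0.1)
```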
[337] A Simple Approach to Constraint-Aware Imitation Learning with Application to Autonomous Racing
Shengfan Cao, Eunhyek Joa, Francesco Borrelli
Main category: cs.LG
TL;DR: A simple approach to incorporate safety constraints into imitation learning for autonomous racing tasks, improving constraint satisfaction and performance consistency compared to Behavior Cloning.
Details
Motivation: Traditional imitation learning methods like Behavior Cloning struggle to enforce safety constraints, especially in high-precision tasks like autonomous racing where operating near system limits requires guaranteed constraint satisfaction.
Method: A simple approach to incorporate safety constraints directly into the imitation learning objective, validated through simulations with both full-state and image feedback.
Result: The approach demonstrates improved constraint satisfaction and greater consistency in task performance compared to standard Behavior Cloning in autonomous racing scenarios.
Conclusion: Incorporating safety constraints directly into the imitation learning objective provides an effective way to guarantee constraint satisfaction in high-precision tasks like autonomous racing.
Abstract: Guaranteeing constraint satisfaction is challenging in imitation learning (IL), particularly in tasks that require operating near a system’s handling limits. Traditional IL methods, such as Behavior Cloning (BC), often struggle to enforce constraints, leading to suboptimal performance in high-precision tasks. In this paper, we present a simple approach to incorporating safety into the IL objective. Through simulations, we empirically validate our approach on an autonomous racing task with both full-state and image feedback, demonstrating improved constraint satisfaction and greater consistency in task performance compared to BC.
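One plausible reading of "incorporating safety into the IL objective" is a penalized behavior-cloning loss; the hinge penalty on a constraint function g(s, a) <= 0 below is an assumption, not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

# Sketch: augment the BC regression loss with a squared hinge penalty on
# constraint violations (track bounds, friction limits, ...). The constraint
# function g and the weight lam are illustrative assumptions.
def constrained_bc_loss(policy, states, expert_actions, g, lam=10.0):
    actions = policy(states)
    bc = F.mse_loss(actions, expert_actions)      # imitation term
    violation = torch.relu(g(states, actions))    # positive iff g > 0
    return bc + lam * (violation ** 2).mean()
```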
[338] High-Order Tensor Regression in Sparse Convolutional Neural Networks
Roberto Dias Algarte
Main category: cs.LG
TL;DR: A novel tensor-based convolution approach that redefines backpropagation for sparse CNNs with clearer mathematical formulation for high-order tensors.
Details
Motivation: To address the complexity and lack of clarity in conventional convolution methodologies, especially when dealing with high-order tensors in machine learning.
Method: Developed a rational tensor-based theory of regression for neural networks, creating a generic framework for sparse convolutional neural networks with simplified mathematical formulation.
Result: The approach proved mathematically clear and concise for high-order tensors, leading to a redefined and simplified backpropagation algorithm.
Conclusion: The tensor-based convolution approach provides a more generic and mathematically sound foundation for sparse CNNs, simplifying both convolution operations and the backpropagation algorithm.
Abstract: This article presents a generic approach to convolution that significantly differs from conventional methodologies in the current Machine Learning literature. The approach, in its mathematical aspects, proved to be clear and concise, particularly when high-order tensors are involved. In this context, a rational theory of regression in neural networks is developed, as a framework for a generic view of sparse convolutional neural networks, the primary focus of this study. As a direct outcome, the classic Backpropagation Algorithm is redefined to align with this rational tensor-based approach and presented in its simplest, most generic form.
[339] CT-PatchTST: Channel-Time Patch Time-Series Transformer for Long-Term Renewable Energy Forecasting
Kuan Lu, Menghao Huo, Yuxiao Li, Qiang Zhu, Zhenrui Chen
Main category: cs.LG
TL;DR: CT-PatchTST is a novel transformer model that provides accurate long-term forecasts for wind and solar power by capturing both temporal patterns and inter-channel correlations, outperforming existing methods on real-world Danish renewable energy data.
Details
Motivation: Accurate renewable energy forecasting is crucial for modern power grid stability, especially with high renewable penetration, to enable proactive energy storage deployment and optimize system operations.
Method: Channel-Time Patch Time-Series Transformer (CT-PatchTST) that captures both temporal dependencies and inter-channel correlations, unlike conventional time-series models.
Result: Outperforms existing methods in both accuracy and robustness when evaluated on real-world datasets from Denmark’s offshore wind, onshore wind, and solar generation.
Conclusion: Enables predictive coordination of energy storage systems across integrated power networks, contributing to more stable, responsive, and cost-efficient grid design.
Abstract: Accurate forecasting of renewable energy generation is fundamental to enhancing the dynamic performance of modern power grids, especially under high renewable penetration. This paper presents Channel-Time Patch Time-Series Transformer (CT-PatchTST), a novel deep learning model designed to provide long-term, high-fidelity forecasts of wind and solar power. Unlike conventional time-series models, CT-PatchTST captures both temporal dependencies and inter-channel correlations, features that are critical for effective energy storage planning, control, and dispatch. Reliable forecasting enables proactive deployment of energy storage systems (ESSs), helping to mitigate uncertainties in renewable output, reduce system response time, and optimize storage operation based on location-specific flow and voltage conditions. Evaluated on real-world datasets from Denmark’s offshore wind, onshore wind, and solar generation, CT-PatchTST outperforms existing methods in both accuracy and robustness. By enabling predictive, data-driven coordination of ESSs across integrated source-grid-load-storage systems, this work contributes to the design of more stable, responsive, and cost-efficient power networks.
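The patching step that gives the model its name can be sketched as follows; the patch length and stride values are illustrative:

```python
import torch

# Each channel's series is cut into overlapping patches and embedded, kept
# per-channel so later attention can mix both the time and channel axes.
def patchify(x, patch_len=16, stride=8):
    """x: (batch, channels, length) -> (batch, channels, n_patches, patch_len)"""
    return x.unfold(dimension=-1, size=patch_len, step=stride)

x = torch.randn(32, 3, 512)            # e.g. wind/solar/load channels
patches = patchify(x)                  # (32, 3, 63, 16)
embed = torch.nn.Linear(16, 128)       # shared patch embedding
tokens = embed(patches)                # (32, 3, 63, 128): channel-time tokens
```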
[340] Improving Quantization with Post-Training Model Expansion
Giuseppe Franco, Pablo Monteagudo-Lago, Ian Colbert, Nicholas Fraser, Michaela Blott
Main category: cs.LG
TL;DR: Post-training model expansion improves quantization quality by selectively increasing model size rather than relaxing quantization constraints, achieving better perplexity with minimal parameter overhead.
Details
Motivation: Traditional post-training optimizations focus on reducing model size to lower inference costs, but recent techniques show that expanding models can improve quality when quantization constraints cannot be met through volume reduction alone.
Method: Progressive and selective expansion of pre-trained LLM size without end-to-end retraining, using techniques like inserting online Hadamard rotations and additional higher precision computations to enable 4-bit weight and activation quantization.
Result: For Llama3 1B with 4-bit quantization, reduced gap to full-precision perplexity by average 9% relative to QuaRot and SpinQuant with only 5% more parameters, achieving 3.8% volume reduction compared to BF16 reference.
Conclusion: Post-training model expansion is a viable strategy to improve model quality within quantization co-design space, providing theoretical justification for expanding rather than relaxing quantization constraints.
Abstract: The size of a model has been a strong predictor of its quality, as well as its cost. As such, the trade-off between model cost and quality has been well-studied. Post-training optimizations like quantization and pruning have typically focused on reducing the overall volume of pre-trained models to reduce inference costs while maintaining model quality. However, recent advancements have introduced optimization techniques that, interestingly, expand models post-training, increasing model size to improve quality when reducing volume. For instance, to enable 4-bit weight and activation quantization, incoherence processing often necessitates inserting online Hadamard rotations in the compute graph, and preserving highly sensitive weights often calls for additional higher precision computations. However, if application requirements cannot be met, the prevailing solution is to relax quantization constraints. In contrast, we demonstrate post-training model expansion is a viable strategy to improve model quality within a quantization co-design space, and provide theoretical justification. We show it is possible to progressively and selectively expand the size of a pre-trained large language model (LLM) to improve model quality without end-to-end retraining. In particular, when quantizing the weights and activations to 4 bits for Llama3 1B, we reduce the gap to full-precision perplexity by an average of 9% relative to both QuaRot and SpinQuant with only 5% more parameters, which is still a 3.8% reduction in volume relative to a BF16 reference model.
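A toy demonstration of the incoherence-processing idea mentioned in the abstract, using an orthonormal Hadamard rotation to spread an outlier weight before quantization (the dimensions and outlier here are contrived for illustration):

```python
import numpy as np
from scipy.linalg import hadamard

# Conjugating a weight matrix with an orthonormal Hadamard matrix spreads
# outlier magnitudes evenly, so a 4-bit uniform grid wastes less range.
# Since H is orthonormal, W = H @ (H.T @ W @ H) @ H.T, which is why the
# rotations can be folded into the compute graph as online operations.
d = 8
H = hadamard(d) / np.sqrt(d)            # H @ H.T == identity
W = np.zeros((d, d)); W[0, 0] = 10.0    # one extreme outlier weight
W_rot = H.T @ W @ H                     # every entry now has magnitude 10/d
print(np.abs(W).max(), np.abs(W_rot).max())   # 10.0 vs 1.25
```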
[341] Gradual Domain Adaptation for Graph Learning
Pui Ieng Lei, Ximing Chen, Yijun Sheng, Yanyan Liu, Zhiguo Gong, Qiang Yang
Main category: cs.LG
TL;DR: GGDA framework enables gradual graph domain adaptation by constructing knowledge-preserving intermediate graphs and optimizing domain sequences using FGW metric and vertex-based progression.
Details
Motivation: Existing graph-based domain adaptation techniques struggle with large distribution shifts due to difficulty in simulating coherent evolutionary paths from source to target graphs.
Method: Constructs compact domain sequence by generating intermediate graphs using Fused Gromov-Wasserstein metric, then applies vertex-based progression with adaptive domain advancement to enhance transferability.
Result: Provides theoretical bounds for Wasserstein distance and demonstrates superior performance across diverse transfer scenarios in extensive experiments.
Conclusion: GGDA framework effectively handles large distribution shifts in graph domain adaptation through optimized gradual domain progression and theoretical guarantees.
Abstract: Existing machine learning literature lacks graph-based domain adaptation techniques capable of handling large distribution shifts, primarily due to the difficulty in simulating a coherent evolutionary path from source to target graph. To meet this challenge, we present a graph gradual domain adaptation (GGDA) framework, which constructs a compact domain sequence that minimizes information loss during adaptation. Our approach starts with an efficient generation of knowledge-preserving intermediate graphs over the Fused Gromov-Wasserstein (FGW) metric. A GGDA domain sequence is then constructed upon this bridging data pool through a novel vertex-based progression, which involves selecting “close” vertices and performing adaptive domain advancement to enhance inter-domain transferability. Theoretically, our framework provides implementable upper and lower bounds for the intractable inter-domain Wasserstein distance, $W_p(\mu_t,\mu_{t+1})$, enabling its flexible adjustment for optimal domain formation. Extensive experiments across diverse transfer scenarios demonstrate the superior performance of our GGDA framework.
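As a crude stand-in for the FGW-based generation (the paper derives node correspondences from the FGW transport plan; here one is assumed fixed), intermediate domains can be pictured as interpolated graphs:

```python
import numpy as np

# With a node correspondence already fixed, bridging domains interpolate
# both structure and features between source and target. This is only a
# conceptual sketch of the domain sequence, not the paper's FGW machinery.
def intermediate_graphs(A_src, X_src, A_tgt, X_tgt, n_steps=5):
    for t in np.linspace(0, 1, n_steps + 2)[1:-1]:
        A_t = (1 - t) * A_src + t * A_tgt    # blended adjacency
        X_t = (1 - t) * X_src + t * X_tgt    # blended node features
        yield A_t, X_t                        # one bridging domain per step
# Adaptation then advances along this sequence, domain by domain.
```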
[342] Efficient distributional regression trees learning algorithms for calibrated non-parametric probabilistic forecasts
Quentin Duchemin, Guillaume Obozinski
Main category: cs.LG
TL;DR: Novel algorithms for learning probabilistic regression trees using WIS and CRPS loss functions with efficient data structures, offering competitive performance and interpretability.
Details
Motivation: Developing trustworthy AI for critical applications requires ML techniques that can estimate their own uncertainty through probabilistic regression rather than just conditional mean estimation.
Method: Introduces algorithms for probabilistic regression trees using weighted interval score (WIS) and continuous ranked probability score (CRPS) loss functions, implemented with efficient data structures like min-max heaps, weight-balanced binary trees, and Fenwick trees.
Result: Numerical experiments show competitive performance with alternative approaches, with additional benefits of interpretability and explainability from tree structures.
Conclusion: The proposed probabilistic regression trees provide efficient, interpretable uncertainty estimation suitable for critical applications, with particular advantages for conformal prediction and group-conditional coverage guarantees.
Abstract: The perspective of developing trustworthy AI for critical applications in science and engineering requires machine learning techniques that are capable of estimating their own uncertainty. In the context of regression, instead of estimating a conditional mean, this can be achieved by producing a predictive interval for the output, or even by learning a model of the conditional probability $p(y|x)$ of an output $y$ given input features $x$. While this can be done under parametric assumptions with, e.g., generalized linear models, these are typically too strong, and non-parametric models offer flexible alternatives. In particular, for scalar outputs, directly learning a model of the conditional cumulative distribution function of $y$ given $x$ can lead to more precise probabilistic estimates, and the use of proper scoring rules such as the weighted interval score (WIS) and the continuous ranked probability score (CRPS) leads to better coverage and calibration properties. This paper introduces novel algorithms for learning probabilistic regression trees for the WIS or CRPS loss functions. These algorithms are made computationally efficient thanks to an appropriate use of known data structures, namely min-max heaps, weight-balanced binary trees and Fenwick trees. Through numerical experiments, we demonstrate that the performance of our methods is competitive with alternative approaches. Additionally, our methods benefit from the inherent interpretability and explainability of trees. As a by-product, we show how our trees can be used in the context of conformal prediction and explain why they are particularly well-suited for achieving group-conditional coverage guarantees.
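For reference, the empirical CRPS such trees are scored against has a simple closed form, CRPS(F, y) = E|X - y| - 0.5 E|X - X'|, where X, X' are independent draws from the predictive distribution:

```python
import numpy as np

# Empirical CRPS for a predictive distribution represented by samples.
def crps_empirical(samples, y):
    term1 = np.mean(np.abs(samples - y))                       # E|X - y|
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2                                       # - 0.5 E|X - X'|

print(crps_empirical(np.random.normal(size=1000), 0.3))  # small when well-fit
```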
[343] Program Semantic Inequivalence Game with Large Language Models
Antonio Valerio Miceli-Barone, Vaishak Belle, Ali Payani
Main category: cs.LG
TL;DR: SInQ method uses semi-adversarial self-play between generator and evaluator agents to create synthetic code reasoning training data, enabling LLMs to handle complex programming tasks requiring semantic reasoning.
Details
Motivation: LLMs struggle with complex coding tasks requiring non-trivial program semantic reasoning, and finding appropriate training data for these tasks is challenging.
Method: SInQ semantic inequivalence game: generator creates semantically distinct program variants from real-world tasks, evaluator identifies input examples causing behavioral divergence between original and variant programs, with agents training each other semi-adversarially.
Result: Improved performance on multiple benchmarks including cross-language vulnerability detection (C/C++ detection improved despite Python-only training) and challenging Python builtin identifier swap benchmark, where modern LLMs still struggle.
Conclusion: The self-play approach enables theoretically unlimited improvement and provides effective synthetic training data that enhances LLMs’ ability to reason about program semantics across languages and complex coding tasks.
Abstract: Large Language Models (LLMs) can achieve strong performance on everyday coding tasks, but they can fail on complex tasks that require non-trivial reasoning about program semantics. Finding training examples to teach LLMs to solve these tasks can be challenging. In this work, we explore a method to synthetically generate code reasoning training data based on a semantic inequivalence game SInQ: a generator agent creates program variants that are semantically distinct, derived from a dataset of real-world programming tasks, while an evaluator agent has to identify input examples that cause the original programs and the generated variants to diverge in their behaviour, with the agents training each other semi-adversarially. We prove that this setup enables theoretically unlimited improvement through self-play in the limit of infinite computational resources. We evaluated our approach on multiple code generation and understanding benchmarks, including cross-language vulnerability detection (Lu et al., 2021), where our method improves vulnerability detection in C/C++ code despite being trained exclusively on Python code, and the challenging Python builtin identifier swap benchmark (Miceli-Barone et al., 2023), showing that whereas modern LLMs still struggle with this benchmark, our approach yields substantial improvements. We release the code needed to replicate the experiments, as well as the generated synthetic data, which can be used to fine-tune LLMs.
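A bare-bones version of the evaluator's job in this game is finding a witness input on which two programs diverge; random fuzzing below stands in for the paper's LLM evaluator, and the example programs are contrived:

```python
import random

def find_diverging_input(f, g, trials=10000):
    for _ in range(trials):
        x = random.randint(-1000, 1000)
        try:
            if f(x) != g(x):
                return x            # witness of semantic inequivalence
        except Exception:
            return x                # divergence via a raised exception
    return None                     # no witness found: possibly equivalent

f = lambda x: x - 10 * (x // 10)    # Python floor-division remainder
g = lambda x: x - 10 * int(x / 10)  # truncating variant: differs for x < 0
print(find_diverging_input(f, g))   # prints some negative witness
```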
[344] Diagonal Symmetrization of Neural Network Solvers for the Many-Electron Schrödinger Equation
Kevin Han Huang, Ni Zhan, Elif Ertekin, Peter Orbanz, Ryan P. Adams
Main category: cs.LG
TL;DR: Diagonal group symmetries in neural networks for quantum problems show that in-training symmetrization destabilizes training, while post hoc averaging is more effective.
Details
Motivation: Diagonal groups of isometries are important in many-body quantum problems but lack natural invariant maps, making their incorporation into neural networks challenging.
Method: Studied three approaches: data augmentation, group averaging, and canonicalization for incorporating diagonal invariance in neural network ansätze trained via variational Monte Carlo methods.
Result: In-training symmetrization destabilizes training and leads to worse performance due to a unique computational-statistical tradeoff, while post hoc averaging is more effective.
Conclusion: Post hoc averaging emerges as a simple, flexible and effective method for improving neural network solvers for diagonal group symmetries in quantum applications.
Abstract: Incorporating group symmetries into neural networks has been a cornerstone of success in many AI-for-science applications. Diagonal groups of isometries, which describe the invariance under a simultaneous movement of multiple objects, arise naturally in many-body quantum problems. Despite their importance, diagonal groups have received relatively little attention, as they lack a natural choice of invariant maps except in special cases. We study different ways of incorporating diagonal invariance in neural network ansätze trained via variational Monte Carlo methods, and consider specifically data augmentation, group averaging and canonicalization. We show that, contrary to standard ML setups, in-training symmetrization destabilizes training and can lead to worse performance. Our theoretical and numerical results indicate that this unexpected behavior may arise from a unique computational-statistical tradeoff not found in standard ML analyses of symmetrization. Meanwhile, we demonstrate that post hoc averaging is less sensitive to such tradeoffs and emerges as a simple, flexible and effective method for improving neural network solvers.
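Post hoc averaging, the method the paper recommends, is simple to state: symmetrize the trained ansatz over the group only after training. A toy sketch with a discretized diagonal translation group (the ansatz, group, and discretization below are illustrative):

```python
import numpy as np

# Average a trained, unsymmetrized ansatz psi over simultaneous shifts of
# all particle coordinates -- a toy diagonal isometry group, discretized.
def posthoc_average(psi, x, shifts):
    """x: (n_particles, dim); average psi over simultaneous shifts."""
    return np.mean([psi(x + s) for s in shifts])

psi = lambda x: np.exp(-np.sum(x ** 2))           # toy unsymmetrized ansatz
x = np.random.randn(4, 3)
shifts = [np.full((1, 3), t) for t in np.linspace(-1, 1, 9)]
print(posthoc_average(psi, x, shifts))
```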
[345] Algorithms for the preordering problem and their application to the task of jointly clustering and ordering the accounts of a social network
Jannik Irmai, Maximilian Moeller, Bjoern Andres
Main category: cs.LG
TL;DR: A 4-approximation algorithm and optimization methods for the NP-hard maximum value preordering problem, applied to social network analysis.
Details
Motivation: The maximum value preordering problem combines clustering and partial ordering challenges, which is relevant for analyzing social network structures where both grouping and ordering relationships are important.
Method: Developed a linear-time 4-approximation algorithm using maximum dicut construction, implemented local search heuristics, and tightened linear programming relaxations with odd closed walk inequalities that define facets of the preorder polytope.
Result: The paper contributes efficient algorithmic implementations that successfully apply to joint clustering and partial ordering of social network accounts, demonstrating both qualitative and quantitative improvements in output and efficiency.
Conclusion: The proposed approximation algorithms and optimization techniques provide effective solutions for the maximum value preordering problem, with practical applications in social network analysis through joint clustering and ordering approaches.
Abstract: The NP-hard maximum value preordering problem is both a joint relaxation and a hybrid of the clique partition problem (a clustering problem) and the partial ordering problem. Toward approximate solutions and lower bounds, we introduce a linear-time 4-approximation algorithm that constructs a maximum dicut of a subgraph and define local search heuristics. Toward upper bounds, we tighten a linear program relaxation by the class of odd closed walk inequalities that define facets, as we show, of the preorder polytope. We contribute implementations of the algorithms, apply these to the task of jointly clustering and partially ordering the accounts of published social networks, and compare the output and efficiency qualitatively and quantitatively.
[346] Uncertainty-Aware Trajectory Prediction via Rule-Regularized Heteroscedastic Deep Classification
Kumar Manas, Christian Schlauch, Adrian Paschke, Christian Wirth, Nadja Klein
Main category: cs.LG
TL;DR: SHIFT is a novel trajectory prediction framework that combines calibrated uncertainty modeling with automated rule extraction to improve out-of-distribution generalization, achieving state-of-the-art performance on nuScenes dataset.
Details
Motivation: Deep learning trajectory prediction models struggle with out-of-distribution generalization due to unbalanced data, lack of diversity, and poor uncertainty calibration in complex scenarios like intersections.
Method: Reformulates trajectory prediction as classification using heteroscedastic spectral-normalized Gaussian processes to disentangle uncertainties. Learns informative priors from automatically generated natural language driving rules using retrieval-augmented generation with LLMs.
Result: Outperforms state-of-the-art methods on nuScenes dataset, showing substantial gains in uncertainty calibration and displacement metrics, particularly excelling in complex scenarios like intersections.
Conclusion: SHIFT effectively addresses generalization challenges through calibrated uncertainty modeling and rule-based priors, demonstrating strong performance in low-data and cross-location scenarios.
Abstract: Deep learning-based trajectory prediction models have demonstrated promising capabilities in capturing complex interactions. However, their out-of-distribution generalization remains a significant challenge, particularly due to unbalanced data and a lack of sufficient data and diversity to ensure robustness and calibration. To address this, we propose SHIFT (Spectral Heteroscedastic Informed Forecasting for Trajectories), a novel framework that uniquely combines well-calibrated uncertainty modeling with informative priors derived through automated rule extraction. SHIFT reformulates trajectory prediction as a classification task and employs heteroscedastic spectral-normalized Gaussian processes to effectively disentangle epistemic and aleatoric uncertainties. We learn informative priors from training labels, which are automatically generated from natural language driving rules, such as stop rules and drivability constraints, using a retrieval-augmented generation framework powered by a large language model. Extensive evaluations over the nuScenes dataset, including challenging low-data and cross-location scenarios, demonstrate that SHIFT outperforms state-of-the-art methods, achieving substantial gains in uncertainty calibration and displacement metrics. In particular, our model excels in complex scenarios, such as intersections, where uncertainty is inherently higher. Project page: https://kumarmanas.github.io/SHIFT/.
[347] Phase Transitions between Accuracy Regimes in L2 regularized Deep Neural Networks
Ibrahim Talha Ersoy, Karoline Wiesner
Main category: cs.LG
TL;DR: L2 regularization in DNNs causes a first-order phase transition into under-parametrized phase, explained by Ricci curvature of error landscape. Predicts new transition points and hysteresis effects, confirmed numerically. Explains ‘grokking’ as getting stuck in local minima.
Details
Motivation: To understand the phase transition phenomenon in DNNs caused by L2 regularization and explain the recently discovered 'grokking' phenomenon through the lens of error landscape geometry.
Method: Using scalar (Ricci) curvature analysis of the error landscape to predict transition points and hysteresis effects, with numerical confirmation of these predictions.
Result: Successfully predicted new transition points as data complexity increases and demonstrated hysteresis effects, confirming both predictions numerically. Provided explanation for grokking as being stuck in local minima.
Conclusion: The work establishes a connection between L2 regularization, phase transitions, and error landscape geometry in DNNs, paving the way for new probing methods of DNN intrinsic structure beyond L2 context.
Abstract: Increasing the L2 regularization of Deep Neural Networks (DNNs) causes a first-order phase transition into the under-parametrized phase – the so-called onset of learning. We explain this transition via the scalar (Ricci) curvature of the error landscape. We predict new transition points as the data complexity is increased and, in accordance with the theory of phase transitions, the existence of hysteresis effects. We confirm both predictions numerically. Our results provide a natural explanation of the recently discovered phenomenon of 'grokking' as DNN models getting stuck in a local minimum of the error surface, corresponding to a lower accuracy phase. Our work paves the way for new probing methods of the intrinsic structure of DNNs in and beyond the L2 context.
[348] Irredundant $k$-Fold Cross-Validation
Jesus S. Aguilar-Ruiz
Main category: cs.LG
TL;DR: Irredundant k-fold cross-validation ensures each instance is used exactly once for training and once for testing across all folds, eliminating redundancy and reducing computational cost while maintaining performance comparable to traditional k-fold CV.
Details
Motivation: Traditional k-fold cross-validation creates redundancy where instances are used multiple times for training, leading to disproportionate influence of some instances and potential overfitting due to instance repetition.
Method: A novel cross-validation method that guarantees non-overlapping training partitions, ensuring each instance is used exactly once for training and once for testing across the entire validation procedure while preserving stratification and remaining model-agnostic.
Result: Experimental results show consistent performance estimates comparable to traditional k-fold CV, but with less optimistic variance estimates due to non-overlapping training partitions, and significantly reduced computational cost.
Conclusion: Irredundant k-fold cross-validation provides a more balanced dataset utilization, mitigates overfitting, enables sharper model comparisons, and reduces computational overhead while maintaining performance estimation quality.
Abstract: In traditional k-fold cross-validation, each instance is used ($k-1$) times for training and once for testing, leading to redundancy that lets many instances disproportionately influence the learning phase. We introduce Irredundant $k$-fold cross-validation, a novel method that guarantees each instance is used exactly once for training and once for testing across the entire validation procedure. This approach ensures a more balanced utilization of the dataset, mitigates overfitting due to instance repetition, and enables sharper distinctions in comparative model analysis. The method preserves stratification and remains model-agnostic, i.e., compatible with any classifier. Experimental results demonstrate that it delivers consistent performance estimates across diverse datasets – comparable to $k$-fold cross-validation – while providing less optimistic variance estimates because training partitions are non-overlapping, and significantly reducing the overall computational cost.
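One construction consistent with this description (the paper's exact fold pairing may differ): pair folds cyclically, so the k training partitions are disjoint and each instance is trained on exactly once and tested on exactly once:

```python
import numpy as np

# Cyclic fold pairing: round i trains on fold i and tests on fold i+1.
# Each fold serves once as a training set and once as a test set, so every
# instance appears exactly once in each role across the whole procedure.
def irredundant_kfold(n, k, seed=0):
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        yield folds[i], folds[(i + 1) % k]   # (train_idx, test_idx)

for train, test in irredundant_kfold(n=20, k=4):
    print(len(train), len(test))             # disjoint 5/5 splits per round
```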
[349] Balancing Interference and Correlation in Spatial Experimental Designs: A Causal Graph Cut Approach
Jin Zhu, Jingyi Li, Hongyi Zhou, Yinan Lin, Zhenhua Lin, Chengchun Shi
Main category: cs.LG
TL;DR: Proposes a spatial experiment design method using graph cut algorithms to optimize information and improve causal effect estimator accuracy, handling spatial interference and various covariance functions efficiently.
Details
Motivation: To optimize spatial experiments for better information extraction and enhanced accuracy of causal effect estimators, addressing challenges of spatial interference and computational efficiency.
Method: Uses a surrogate function for mean squared error (MSE) of the estimator, enabling application of classical graph cut algorithms to determine optimal experimental designs.
Result: The method effectively accommodates moderate to large spatial interference effects, adapts to different spatial covariance functions, and demonstrates computational efficiency in synthetic and real-world simulations.
Conclusion: The proposed design is validated through theoretical analysis and numerical experiments, showing effectiveness in optimizing spatial experiments for causal inference, with practical implementation available.
Abstract: This paper focuses on the design of spatial experiments to optimize the amount of information derived from the experimental data and enhance the accuracy of the resulting causal effect estimator. We propose a surrogate function for the mean squared error (MSE) of the estimator, which facilitates the use of classical graph cut algorithms to learn the optimal design. Our proposal offers three key advances: (1) it accommodates moderate to large spatial interference effects; (2) it adapts to different spatial covariance functions; (3) it is computationally efficient. Theoretical results and numerical experiments based on synthetic environments and a dispatch simulator that models a city-scale ridesharing market, further validate the effectiveness of our design. A Python implementation of our method is available at https://github.com/Mamba413/CausalGraphCut.
[350] Learning to Drive Ethically: Embedding Moral Reasoning into Autonomous Driving
Dianzhao Li, Ostap Okhrin
Main category: cs.LG
TL;DR: Hierarchical Safe RL framework integrates ethical reasoning with driving objectives using ethical risk cost and prioritized experience replay, outperforming baselines in real-world traffic scenarios.
Details
Motivation: Autonomous vehicles need robust ethical reasoning to protect vulnerable road users (pedestrians, cyclists) and ensure widespread adoption by addressing safety concerns in routine and emergency maneuvers.
Method: Two-level framework: 1) Decision level - Safe RL agent trained with composite ethical risk cost (collision probability + harm severity) using dynamic Prioritized Experience Replay; 2) Execution level - polynomial path planning with PID and Stanley controllers for smooth trajectory generation.
Result: Outperforms baseline methods in reducing ethical risk while maintaining driving performance, validated on real-world traffic datasets with diverse vehicles, cyclists, and pedestrians.
Conclusion: First study demonstrating ethical decision-making via Safe RL in real-world human-mixed traffic, showing potential of combining control theory and data-driven learning for ethically accountable autonomy that protects vulnerable road users.
Abstract: Autonomous vehicles hold great promise for reducing traffic fatalities and improving transportation efficiency, yet their widespread adoption hinges on embedding robust ethical reasoning into routine and emergency maneuvers, particularly to protect vulnerable road users (VRUs) such as pedestrians and cyclists. Here, we present a hierarchical Safe Reinforcement Learning (Safe RL) framework that explicitly integrates moral considerations with standard driving objectives. At the decision level, a Safe RL agent is trained using a composite ethical risk cost, combining collision probability and harm severity, to generate high-level motion targets. A dynamic Prioritized Experience Replay mechanism amplifies learning from rare but critical, high-risk events. At the execution level, polynomial path planning coupled with Proportional-Integral-Derivative (PID) and Stanley controllers translates these targets into smooth, feasible trajectories, ensuring both accuracy and comfort. We train and validate our approach on rich, real-world traffic datasets encompassing diverse vehicles, cyclists, and pedestrians, and demonstrate that it outperforms baseline methods in reducing ethical risk and maintaining driving performance. To our knowledge, this is the first study of ethical decision-making for autonomous vehicles via Safe RL evaluated on real-world, human-mixed traffic scenarios. Our results highlight the potential of combining formal control theory and data-driven learning to advance ethically accountable autonomy that explicitly protects those most at risk in urban traffic environments.
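The composite ethical risk cost can be sketched as collision probability weighted by harm severity; the severity table and kinetic-energy-style weighting below are illustrative assumptions, not the paper's exact formulation:

```python
# Per road user: weight collision probability by a harm-severity term.
# Severity values and the rel_v**2 weighting are illustrative placeholders.
SEVERITY = {"pedestrian": 1.0, "cyclist": 0.8, "vehicle": 0.4}

def ethical_risk_cost(encounters):
    """encounters: list of (p_collision, rel_speed_mps, road_user_class)."""
    return sum(p * SEVERITY[cls] * rel_v ** 2 for p, rel_v, cls in encounters)

# The Safe RL agent would treat this as a cost signal alongside the
# ordinary driving reward.
print(ethical_risk_cost([(0.01, 8.0, "pedestrian"), (0.05, 3.0, "vehicle")]))
```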
[351] Transformers Meet In-Context Learning: A Universal Approximation Theory
Gen Li, Yuchen Jiao, Yu Huang, Yuting Wei, Yuxin Chen
Main category: cs.LG
TL;DR: Transformers can perform in-context learning by approximating universal function representations and solving Lasso-like problems at test time without weight updates.
Details
Motivation: To develop a universal approximation theory explaining how transformers enable in-context learning beyond optimization algorithm mimicry, extending to non-convex problems.
Method: Integrate Barron’s universal function approximation theory with algorithm approximator viewpoint, showing transformers can find linear representations with small ℓ1-norm over universal features.
Result: Constructed transformers can predict based on few noisy in-context examples with vanishingly small risk, without weight updates.
Conclusion: Transformers enable in-context learning by approximating function representations and solving optimization problems at inference time, extending beyond convex optimization mimicry.
Abstract: Large language models are capable of in-context learning, the ability to perform new tasks at test time using a handful of input-output examples, without parameter updates. We develop a universal approximation theory to elucidate how transformers enable in-context learning. For a general class of functions (each representing a distinct task), we demonstrate how to construct a transformer that, without any further weight updates, can predict based on a few noisy in-context examples with vanishingly small risk. Unlike prior work that frames transformers as approximators of optimization algorithms (e.g., gradient descent) for statistical learning tasks, we integrate Barron’s universal function approximation theory with the algorithm approximator viewpoint. Our approach yields approximation guarantees that are not constrained by the effectiveness of the optimization algorithms being mimicked, extending far beyond convex problems like linear regression. The key is to show that (i) any target function can be nearly linearly represented, with small $\ell_1$-norm, over a set of universal features, and (ii) a transformer can be constructed to find the linear representation – akin to solving Lasso – at test time.
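The explicit computation the transformer is shown to approximate is an ℓ1-regularized fit over universal features. A small worked example of that reference computation (not transformer code; features and noise levels are contrived):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Solve the Lasso-like problem a constructed transformer is shown to
# approximate at test time: a sparse, small-l1-norm linear representation
# over features of the in-context examples.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(16, 50))                # features of in-context inputs
w = np.zeros(50); w[:3] = [1.0, -2.0, 0.5]     # sparse target representation
y = Phi @ w + 0.01 * rng.normal(size=16)       # noisy in-context labels

fit = Lasso(alpha=0.01).fit(Phi, y)            # small l1-norm solution
phi_query = rng.normal(size=(1, 50))           # the query's features
print(fit.predict(phi_query))                  # the "in-context" prediction
```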
[352] Pareto Actor-Critic for Communication and Computation Co-Optimization in Non-Cooperative Federated Learning Services
Renxuan Tan, Rongpeng Li, Xiaoxue Yu, Xianfu Chen, Xing Xu, Zhifeng Zhao
Main category: cs.LG
TL;DR: PAC-MCoFL is a game-theoretic MARL framework that enables multiple service providers to jointly optimize federated learning resources through Pareto Actor-Critic principles, achieving significant performance improvements over existing methods.
Details
Motivation: Federated learning in multi-service provider ecosystems faces challenges from non-cooperative dynamics, privacy constraints, and competing interests that prevent centralized optimization of communication and computation resources.
Method: Integrates Pareto Actor-Critic principles with expectile regression, uses ternary Cartesian decomposition for high-dimensional action space management, and includes a scalable variant (PAC-MCoFL-p) with parameterized conjecture generator to reduce computational complexity.
Result: Achieves approximately 5.8% improvement in total reward and 4.2% improvement in hypervolume indicator over latest MARL solutions, effectively balancing individual SP and system performance in scaled deployments with diverse data heterogeneity.
Conclusion: The framework provides theoretical convergence guarantees and demonstrates superiority through extensive simulations, offering an effective solution for multi-SP federated learning optimization with provable performance bounds.
Abstract: Federated learning (FL) in multi-service provider (SP) ecosystems is fundamentally hampered by non-cooperative dynamics, where privacy constraints and competing interests preclude the centralized optimization of multi-SP communication and computation resources. In this paper, we introduce PAC-MCoFL, a game-theoretic multi-agent reinforcement learning (MARL) framework where SPs act as agents to jointly optimize client assignment, adaptive quantization, and resource allocation. Within the framework, we integrate Pareto Actor-Critic (PAC) principles with expectile regression, enabling agents to conjecture optimal joint policies to achieve Pareto-optimal equilibria while modeling heterogeneous risk profiles. To manage the high-dimensional action space, we devise a ternary Cartesian decomposition (TCAD) mechanism that facilitates fine-grained control. Further, we develop PAC-MCoFL-p, a scalable variant featuring a parameterized conjecture generator that substantially reduces computational complexity with a provably bounded error. Alongside theoretical convergence guarantees, our framework’s superiority is validated through extensive simulations – PAC-MCoFL achieves approximately 5.8% and 4.2% improvements in total reward and hypervolume indicator (HVI), respectively, over the latest MARL solutions. The results also demonstrate that our method can more effectively balance individual SP and system performance in scaled deployments and under diverse data heterogeneity.
[353] MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement
Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Jinwoo Shin, Sercan Ö. Arık, Tomas Pfister
Main category: cs.LG
TL;DR: MLE-STAR is a novel LLM-based agent for machine learning engineering that combines web knowledge retrieval with targeted component-level exploration and ablation-guided refinement to achieve superior performance on Kaggle competitions.
Details
Motivation: Existing LLM-based MLE agents rely too heavily on inherent LLM knowledge and use coarse exploration strategies that modify entire code structures at once, limiting their ability to select effective models and perform deep component-specific exploration.
Method: MLE-STAR first retrieves effective models from the web using search engines to form initial solutions, then iteratively refines them through targeted exploration of specific ML components guided by ablation studies analyzing individual code block impacts, plus a novel ensembling method.
Result: MLE-STAR achieves medals in 64% of Kaggle competitions on MLE-bench Lite, significantly outperforming the best alternative approach.
Conclusion: The combination of external knowledge retrieval, targeted component exploration, and ablation-guided refinement enables MLE-STAR to effectively automate machine learning engineering tasks and outperform existing methods.
Abstract: Agents based on large language models (LLMs) for machine learning engineering (MLE) can automatically implement ML models via code generation. However, existing approaches to build such agents often rely heavily on inherent LLM knowledge and employ coarse exploration strategies that modify the entire code structure at once. This limits their ability to select effective task-specific models and perform deep exploration within specific components, such as experimenting extensively with feature engineering options. To overcome these, we propose MLE-STAR, a novel approach to build MLE agents. MLE-STAR first leverages external knowledge by using a search engine to retrieve effective models from the web, forming an initial solution, then iteratively refines it by exploring various strategies targeting specific ML components. This exploration is guided by ablation studies analyzing the impact of individual code blocks. Furthermore, we introduce a novel ensembling method using an effective strategy suggested by MLE-STAR. Our experimental results show that MLE-STAR achieves medals in 64% of the Kaggle competitions on the MLE-bench Lite, significantly outperforming the best alternative.
[354] Graph-R1: Incentivizing the Zero-Shot Graph Learning Capability in LLMs via Explicit Reasoning
Yicong Wu, Guangyue Lu, Yuan Zuo, Huarong Zhang, Junjie Wu
Main category: cs.LG
TL;DR: Graph-R1: A GNN-free approach that reformulates graph tasks as textual reasoning problems solved by Large Reasoning Models, outperforming state-of-the-art baselines in zero-shot settings.
Details
Motivation: Address limitations of Graph Neural Networks (fixed label spaces) and Large Language Models (lack structural biases) by leveraging explicit reasoning from Large Reasoning Models for graph tasks without task-specific supervision.
Method: Reformulate graph tasks (node classification, link prediction, graph classification) as textual reasoning problems. Use reinforcement learning framework with task-specific rethink templates to guide reasoning over linearized graphs. Create datasets with detailed reasoning traces.
Result: Graph-R1 outperforms state-of-the-art baselines in zero-shot settings, producing interpretable and effective predictions.
Conclusion: Demonstrates the promise of explicit reasoning for graph learning and provides new resources for future research in zero-shot graph task generalization.
Abstract: Generalizing to unseen graph tasks without task-specific supervision remains challenging. Graph Neural Networks (GNNs) are limited by fixed label spaces, while Large Language Models (LLMs) lack structural inductive biases. Recent advances in Large Reasoning Models (LRMs) provide a zero-shot alternative via explicit, long chain-of-thought reasoning. Inspired by this, we propose a GNN-free approach that reformulates graph tasks (node classification, link prediction, and graph classification) as textual reasoning problems solved by LRMs. We introduce the first datasets with detailed reasoning traces for these tasks and develop Graph-R1, a reinforcement learning framework that leverages task-specific rethink templates to guide reasoning over linearized graphs. Experiments demonstrate that Graph-R1 outperforms state-of-the-art baselines in zero-shot settings, producing interpretable and effective predictions. Our work highlights the promise of explicit reasoning for graph learning and provides new resources for future research.
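The linearization step implied above, turning an attributed graph into a textual reasoning prompt, might look like this (the template wording is illustrative, not the paper's):

```python
# Serialize an attributed graph into a prompt an LRM can reason over.
def linearize(nodes, edges, target):
    lines = [f"Node {i}: {attrs}" for i, attrs in nodes.items()]
    lines += [f"Edge: {u} -- {v}" for u, v in edges]
    lines.append(f"Question: what is the class of node {target}? "
                 "Think step by step.")
    return "\n".join(lines)

print(linearize({0: "title='GNN survey'", 1: "title='BERT'"}, [(0, 1)], 0))
```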
[355] Escaping Plato’s Cave: JAM for Aligning Independently Trained Vision and Language Models
Lauren Hyoseo Yoon, Yisong Yue, Been Kim
Main category: cs.LG
TL;DR: JAM (Joint Autoencoder Modulator) aligns frozen vision and language models by training modality-specific autoencoders with coordinated reconstruction and cross-modal alignment objectives, enabling explicit optimization for shared semantic representations.
Details
Motivation: The Platonic Representation Hypothesis suggests vision and language models may converge toward a shared statistical model of reality, but existing methods only detect this alignment post-hoc. The paper aims to explicitly optimize for alignment, particularly in fine-grained contextual distinctions.
Method: Developed JAM framework that jointly trains modality-specific autoencoders with coordinated reconstruction and cross-modal alignment objectives. Introduced multimodal Spread Loss that outperforms classic contrastive methods. Systematically evaluated across alignment objectives, layer depth effectiveness, and foundation model scale.
Result: JAM reliably induces alignment even across independently trained representations. The multimodal Spread Loss outperforms classic contrastive methods. The method provides both theoretical insight into shared semantics structure and practical guidance for transforming unimodal foundations into multimodal models.
Conclusion: JAM successfully enables explicit optimization for alignment between vision and language representations, offering a practical framework for creating specialist multimodal models from generalist unimodal foundations while providing insights into the nature of shared semantic representations.
Abstract: Independently trained vision and language models inhabit disjoint representational spaces, shaped by their respective modalities, objectives, and architectures. The Platonic Representation Hypothesis (PRH) suggests these models may nonetheless converge toward a shared statistical model of reality. This raises a fundamental question: can we move beyond post-hoc detection of such alignment and explicitly optimize for it? We argue this challenge is most critical in fine-grained contextual distinctions, where multiple descriptions share global semantics but differ in subtle compositional details. We address this with the Joint Autoencoder Modulator (JAM), which aligns frozen unimodal models by jointly training modality-specific autoencoders with coordinated reconstruction and cross-modal alignment objectives. We systematically evaluate JAM across three design axes: (i) alignment objectives, introducing our multimodal Spread Loss that outperforms classic contrastive methods; (ii) the layer depth at which alignment is most effective; and (iii) the role of foundation model scale in representational convergence. Our findings show that JAM reliably induces alignment even across independently trained representations, offering both theoretical insight into the structure of shared semantics and practical guidance for transforming generalist unimodal foundations into specialist multimodal models.
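A minimal PyTorch reading of the JAM objective: two modality-specific autoencoders over frozen embeddings, trained with a reconstruction term plus a cross-modal alignment term. The paper's Spread Loss is not reproduced here; a cosine alignment term stands in for it, and the single-linear-layer autoencoders are a simplification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAutoencoders(nn.Module):
    """Two modality-specific autoencoders over frozen unimodal embeddings."""
    def __init__(self, dim_v, dim_t, dim_z):
        super().__init__()
        self.enc_v, self.dec_v = nn.Linear(dim_v, dim_z), nn.Linear(dim_z, dim_v)
        self.enc_t, self.dec_t = nn.Linear(dim_t, dim_z), nn.Linear(dim_z, dim_t)

    def forward(self, v, t):
        zv, zt = self.enc_v(v), self.enc_t(t)
        recon = F.mse_loss(self.dec_v(zv), v) + F.mse_loss(self.dec_t(zt), t)
        # Stand-in alignment term (not the paper's Spread Loss): pull paired
        # latents together in cosine space.
        align = 1.0 - F.cosine_similarity(zv, zt, dim=-1).mean()
        return recon + align
```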
[356] STDiff: A State Transition Diffusion Framework for Time Series Imputation in Industrial Systems
Gary Simethy, Daniel Ortiz-Arroyo, Petar Durdevic
Main category: cs.LG
TL;DR: STDiff is a diffusion model for industrial time series imputation that learns system dynamics step-by-step instead of using fixed windows, achieving better performance especially for long gaps.
Details
Motivation: Traditional window-based imputation methods fail in industrial systems with non-stationary dynamics, control actions, and long uninterrupted gaps.
Method: Uses conditional denoising diffusion model with causal bias aligned to control theory, generating missing values step-by-step based on recent known state and control/environmental inputs.
Result: Achieves lowest errors on wastewater treatment dataset with simulated gaps (advantage increases for longer gaps), produces dynamically plausible trajectories on raw industrial data with real gaps.
Conclusion: Dynamics-aware, explicitly conditioned imputation is robust for industrial time series, with STDiff outperforming window-based models that tend to flatten or over-smooth.
Abstract: Most deep learning methods for imputing missing values treat the task as completing patterns within a fixed time window. This assumption often fails in industrial systems, where dynamics are driven by control actions, are highly non-stationary, and can experience long, uninterrupted gaps. We propose STDiff, which reframes imputation as learning how the system evolves from one state to the next. STDiff uses a conditional denoising diffusion model with a causal bias aligned to control theory, generating missing values step-by-step based on the most recent known state and relevant control or environmental inputs. On a public wastewater treatment dataset with simulated missing blocks, STDiff consistently achieves the lowest errors, with its advantage increasing for longer gaps. On a raw industrial dataset with substantial real gaps, it produces trajectories that remain dynamically plausible, in contrast to window-based models that tend to flatten or over-smooth. These results support dynamics-aware, explicitly conditioned imputation as a robust approach for industrial time series, and we discuss computational trade-offs and extensions to broader domains.
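The step-by-step generation can be sketched as a standard DDPM reverse pass conditioned on the last observed state and the control inputs; the conditioning interface of `eps_model` is an assumption, not the paper's exact design, and the trained noise-prediction network is taken as given.

```python
import torch

@torch.no_grad()
def impute_next_state(eps_model, cond, betas, dim):
    """Sample one missing state by reverse diffusion, conditioned on `cond`
    (last observed state concatenated with control inputs)."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(dim)                                # start from pure noise
    for t in reversed(range(len(betas))):
        eps_hat = eps_model(x, cond, t)                 # predicted noise
        x = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps_hat) \
            / torch.sqrt(alphas[t])                     # DDPM posterior mean
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn(dim)
    return x  # becomes the "last known state" for the next gap position
```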
[357] DANCE: Resource-Efficient Neural Architecture Search with Data-Aware and Continuous Adaptation
Maolin Wang, Tianshuo Wei, Sheng Zhang, Ruocheng Guo, Wanyu Wang, Shanshan Ye, Lixin Zou, Xuetao Wei, Xiangyu Zhao
Main category: cs.LG
TL;DR: DANCE proposes continuous neural architecture evolution that learns architectural distributions for adaptive deployment across diverse scenarios and hardware constraints, outperforming NAS methods with lower search costs.
Details
Motivation: Existing NAS methods lack adaptability across deployment scenarios, require costly separate searches for each context, and struggle with performance consistency across different platforms.
Method: Reformulates architecture search as continuous evolution through learning distributions over architectural components. Uses continuous architecture distribution, unified architecture space with learned selection gates, and multi-stage training strategy.
Result: Outperforms state-of-the-art NAS approaches in accuracy across five datasets while significantly reducing search costs. Maintains robust performance under varying computational constraints and adapts smoothly to different hardware requirements.
Conclusion: DANCE provides an effective solution for adaptive neural architecture deployment that addresses key limitations of traditional NAS methods through continuous evolution and distribution learning.
Abstract: Neural Architecture Search (NAS) has emerged as a powerful approach for automating neural network design. However, existing NAS methods face critical limitations in real-world deployments: architectures lack adaptability across scenarios, each deployment context requires costly separate searches, and performance consistency across diverse platforms remains challenging. We propose DANCE (Dynamic Architectures with Neural Continuous Evolution), which reformulates architecture search as a continuous evolution problem through learning distributions over architectural components. DANCE introduces three key innovations: a continuous architecture distribution enabling smooth adaptation, a unified architecture space with learned selection gates for efficient sampling, and a multi-stage training strategy for effective deployment optimization. Extensive experiments across five datasets demonstrate DANCE’s effectiveness. Our method consistently outperforms state-of-the-art NAS approaches in terms of accuracy while significantly reducing search costs. Under varying computational constraints, DANCE maintains robust performance while smoothly adapting architectures to different hardware requirements. The code and appendix can be found at https://github.com/Applied-Machine-Learning-Lab/DANCE.
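One way to picture the learned selection gates over a unified architecture space is a DARTS-style soft mixture over candidate operations; the paper's exact parameterization of its continuous architecture distribution may differ, so this block is only a schematic.

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """One block of a unified architecture space: candidate operations
    mixed by a learned selection gate."""
    def __init__(self, dim):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()),
            nn.Sequential(nn.Linear(dim, dim), nn.Tanh()),
        ])
        self.gate_logits = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        gates = torch.softmax(self.gate_logits, dim=0)  # learned distribution
        return sum(g * op(x) for g, op in zip(gates, self.ops))
```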
[358] Dynamic Triangulation-Based Graph Rewiring for Graph Neural Networks
Hugo Attali, Thomas Papastergiou, Nathalie Pernelle, Fragkiskos D. Malliaros
Main category: cs.LG
TL;DR: TRIGON is a novel graph rewiring framework that constructs enriched triangulations by learning to select relevant triangles from multiple graph views, addressing oversquashing and oversmoothing issues in GNNs.
Details
Motivation: Graph Neural Networks suffer from performance limitations due to graph topology issues like oversquashing and oversmoothing, which graph rewiring techniques aim to mitigate by modifying graph structure.
Method: TRIGON constructs non-planar triangulations by learning to select relevant triangles from multiple graph views, jointly optimizing triangle selection and downstream classification performance.
Result: TRIGON produces rewired graphs with improved structural properties (reduced diameter, increased spectral gap, lower effective resistance) and outperforms state-of-the-art methods on node classification tasks across homophilic and heterophilic benchmarks.
Conclusion: The TRIGON framework effectively addresses GNN limitations through learned triangulation-based graph rewiring, demonstrating superior performance on various graph learning benchmarks.
Abstract: Graph Neural Networks (GNNs) have emerged as the leading paradigm for learning over graph-structured data. However, their performance is limited by issues inherent to graph topology, most notably oversquashing and oversmoothing. Recent advances in graph rewiring aim to mitigate these limitations by modifying the graph topology to promote more effective information propagation. In this work, we introduce TRIGON, a novel framework that constructs enriched, non-planar triangulations by learning to select relevant triangles from multiple graph views. By jointly optimizing triangle selection and downstream classification performance, our method produces a rewired graph with markedly improved structural properties such as reduced diameter, increased spectral gap, and lower effective resistance compared to existing rewiring methods. Empirical results demonstrate that TRIGON outperforms state-of-the-art approaches on node classification tasks across a range of homophilic and heterophilic benchmarks.
[359] Ranked Set Sampling-Based Multilayer Perceptron: Improving Generalization via Variance-Based Bounds
Feijiang Li, Liuya Zhang, Jieting Wang, Tao Yan, Yuhua Qian
Main category: cs.LG
TL;DR: The paper proposes RSS-MLP, a method that uses Ranked Set Sampling instead of Simple Random Sampling in bagging to reduce the variance of the empirical loss and improve MLP generalization.
Details
Motivation: To enhance MLP's generalization ability by reducing the variance of the empirical loss, addressing the high randomness of traditional bagging with SRS.
Method: Introduces Ranked Set Sampling (RSS) to create an ordered structure in the training data, developing an RSS-MLP method that reduces the variance of the empirical exponential and logistic losses compared to SRS.
Result: Theoretical analysis shows RSS produces smaller variance than SRS. Experiments on 12 benchmark datasets with two loss functions and fusion methods demonstrate effectiveness.
Conclusion: RSS-MLP is an effective approach for improving MLP performance through variance reduction of empirical loss using ordered sampling structure.
Abstract: Multilayer perceptron (MLP), one of the most fundamental neural networks, is extensively utilized for classification and regression tasks. In this paper, we establish a new generalization error bound, which reveals how the variance of empirical loss influences the generalization ability of the learning model. Inspired by this learning bound, we advocate reducing the variance of empirical loss to enhance the generalization ability of MLP. As is well known, bagging is a popular ensemble method to realize variance reduction. However, bagging produces the base training data sets by the Simple Random Sampling (SRS) method, which exhibits a high degree of randomness. To handle this issue, we introduce an ordered structure in the training data set by Ranked Set Sampling (RSS) to further reduce the variance of loss and develop an RSS-MLP method. Theoretical results show that the variances of the empirical exponential loss and the logistic loss estimated by RSS are smaller than those estimated by SRS, respectively. To validate the performance of RSS-MLP, we conduct comparison experiments on twelve benchmark data sets with two convex loss functions and two fusion methods. Extensive experimental results and analysis illustrate the effectiveness and rationality of the proposed method.
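Classical ranked set sampling, the ingredient RSS-MLP substitutes for SRS inside bagging, is easy to state in code: draw n sets of n units, rank within each set, and keep one order statistic per set. The sketch below ranks scalar values directly; how RSS-MLP ranks training examples in practice is not detailed in this summary.

```python
import numpy as np

def ranked_set_sample(pool, n, rng=None):
    """Draw one ranked-set sample of size n: form n random sets of n units,
    rank each set, and keep the i-th order statistic of the i-th set."""
    if rng is None:
        rng = np.random.default_rng(0)
    sample = []
    for i in range(n):
        candidates = rng.choice(pool, size=n, replace=False)
        sample.append(np.sort(candidates)[i])   # i-th smallest of set i
    return np.array(sample)

pool = np.random.default_rng(1).normal(size=1000)
print(ranked_set_sample(pool, n=5))  # one ordered base sample for bagging
```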
[360] Improving Hospital Risk Prediction with Knowledge-Augmented Multimodal EHR Modeling
Rituparna Datta, Jiaming Cui, Zihan Guan, Vishal G. Reddy, Joshua C. Eby, Gregory Madden, Rupesh Silwal, Anil Vullikanti
Main category: cs.LG
TL;DR: A unified framework integrating structured EHR data and unstructured clinical notes using LLM with graph-based knowledge retrieval for clinical risk prediction, achieving state-of-the-art performance on readmission and mortality prediction.
Details
Motivation: Accurate prediction of clinical outcomes from EHRs is critical for early intervention and improved patient care. EHRs contain multimodal data (structured and unstructured notes) that need to be effectively integrated for comprehensive analysis.
Method: Two-stage architecture: 1) Fine-tuned LLM extracts task-relevant information from clinical notes, enhanced by graph-based retrieval of external medical knowledge (e.g., PubMed), 2) Combines unstructured representations with structured data features for final predictions.
Result: Achieved strong performance with AUC scores of 0.84 for 30-day readmission and 0.92 for in-hospital mortality prediction, outperforming all existing baselines and clinical risk scoring systems despite severe dataset imbalance (4-13% positive rates).
Conclusion: The framework successfully integrates multimodal EHR data with external knowledge retrieval, demonstrating superior performance for clinical risk prediction tasks and representing one of the first LLM-based approaches combining graph-guided knowledge retrieval with structured data.
Abstract: Accurate prediction of clinical outcomes using Electronic Health Records (EHRs) is critical for early intervention, efficient resource allocation, and improved patient care. EHRs contain multimodal data, including both structured data and unstructured clinical notes that provide rich, context-specific information. In this work, we introduce a unified framework that seamlessly integrates these diverse modalities, leveraging all relevant available information through a two-stage architecture for clinical risk prediction. In the first stage, a fine-tuned Large Language Model (LLM) extracts crucial, task-relevant information from clinical notes, which is enhanced by graph-based retrieval of external domain knowledge from medical corpora such as PubMed, grounding the LLM’s understanding. The second stage combines both unstructured representations and features derived from the structured data to generate the final predictions. This approach supports a wide range of clinical tasks. Here, we demonstrate its effectiveness on 30-day readmission and in-hospital mortality prediction. Experimental results show that our framework achieves strong performance, with AUC scores of 0.84 and 0.92, respectively, despite these tasks involving severely imbalanced datasets, with positive rates ranging from approximately 4% to 13%. Moreover, it outperforms all existing baselines and clinical practices, including established risk scoring systems. To the best of our knowledge, this is one of the first frameworks for healthcare prediction that enhances the power of an LLM-based graph-guided knowledge retrieval method by combining it with structured data for improved clinical outcome prediction.
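The second-stage fusion reduces to concatenating the note-derived representation with structured features before a downstream classifier. A minimal sketch with synthetic stand-ins for both inputs; the shapes and the logistic head are assumptions, not the paper's architecture.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
note_emb = rng.normal(size=(500, 64))    # stage-1 representation of notes
structured = rng.normal(size=(500, 20))  # labs / vitals / codes features
y = rng.integers(0, 2, size=500)         # e.g. 30-day readmission label

X = np.concatenate([note_emb, structured], axis=1)   # stage-2 fusion
# class_weight="balanced" reflects the heavy label imbalance noted above.
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)
```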
[361] VRPRM: Process Reward Modeling via Visual Reasoning
Xinquan Chen, Bangwei Liu, Xuhong Wang, Yingchun Wang, Chaochao Lu
Main category: cs.LG
TL;DR: VRPRM introduces visual reasoning to process reward models, achieving superior reasoning capabilities with significantly less data (3.6K CoT-PRM + 50K non-CoT) compared to traditional PRMs requiring 400K data.
Details
Motivation: Current PRMs lack long-term reasoning and deep thinking capabilities, while CoT-PRM approaches are too expensive due to high annotation costs, limiting their practical application across various tasks.
Method: Proposed VRPRM (process reward model via visual reasoning) with an efficient two-stage training strategy combining minimal CoT-PRM supervised fine-tuning data with non-CoT PRM reinforcement learning data.
Result: VRPRM surpassed non-thinking PRM with 400K total data, achieving up to 118% relative performance improvement over base model in Best-of-N experiments using only 3.6K CoT-PRM SFT data and 50K non-CoT PRM RL data.
Conclusion: The combined training strategy enables higher quality reasoning capabilities at lower data annotation costs, providing a new paradigm for more efficient PRM training and data utilization.
Abstract: Process Reward Model (PRM) is widely used in the post-training of Large Language Model (LLM) because it can perform fine-grained evaluation of the reasoning steps of generated content. However, most PRMs lack long-term reasoning and deep thinking capabilities. On the other hand, although a few works have tried to introduce Chain-of-Thought capability into PRMs, the annotation cost of CoT-PRM data is too expensive to play a stable role in various tasks. To address the above challenges, we propose VRPRM, a process reward model via visual reasoning, and design an efficient two-stage training strategy. Experimental results show that using only 3.6K CoT-PRM SFT data and 50K non-CoT PRM RL training data, VRPRM can surpass a non-thinking PRM trained on a total data volume of 400K and achieve a relative performance improvement of up to 118% over the base model in the Best-of-N (BoN) experiment. This result confirms that the proposed combined training strategy can achieve higher-quality reasoning capabilities at a lower data annotation cost, thus providing a new paradigm for PRM training with more efficient data utilization.
[362] Distributed optimization: designed for federated learning
Wenyou Guo, Ting Qu, Chunrong Pan, George Q. Huang
Main category: cs.LG
TL;DR: Proposes distributed optimization algorithms using augmented Lagrangian technique for federated learning with diverse communication topologies, featuring enhanced computational efficiency and strong convergence guarantees.
Details
Motivation: To address the need for privacy-preserving distributed machine learning in cross-organizational data collaboration through federated learning with improved optimization methods.
Method: Developed augmented Lagrangian-based distributed optimization algorithms with proximal relaxation and quadratic approximation, supporting both centralized and decentralized FL settings with multiple termination criteria.
Result: The framework systematically recovers classical optimization methods (proximal algorithm, gradient descent, SGD) and demonstrates strong performance in large-scale settings with statistical heterogeneity.
Conclusion: The proposed approach provides rigorous theoretical convergence guarantees and exhibits excellent performance in heterogeneous federated learning environments, generalizing multiple classical optimization methods.
Abstract: Federated Learning (FL), as a distributed collaborative Machine Learning (ML) framework under privacy-preserving constraints, has garnered increasing research attention in cross-organizational data collaboration scenarios. This paper proposes a class of distributed optimization algorithms based on the augmented Lagrangian technique, designed to accommodate diverse communication topologies in both centralized and decentralized FL settings. Furthermore, we develop multiple termination criteria and parameter update mechanisms to enhance computational efficiency, accompanied by rigorous theoretical guarantees of convergence. By generalizing the augmented Lagrangian relaxation through the incorporation of proximal relaxation and quadratic approximation, our framework systematically recovers a broad class of classical unconstrained optimization methods, including the proximal algorithm, classical gradient descent, and stochastic gradient descent. Notably, the convergence properties of these methods can be naturally derived within the proposed theoretical framework. Numerical experiments demonstrate that the proposed algorithm exhibits strong performance in large-scale settings with significant statistical heterogeneity across clients.
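The augmented Lagrangian machinery behind such algorithms can be sketched as a consensus loop: each client approximately minimizes its penalized local objective, the server averages, and dual variables are updated by ascent. This generic ADMM-flavored round is illustrative, not the paper's exact scheme.

```python
import numpy as np

def federated_al_round(w_local, y_dual, w_global, grad_fns, rho=1.0, lr=0.1):
    """One round of an augmented-Lagrangian consensus update. Each client i
    approximately minimizes f_i(w) + y_i'(w - w_global) + (rho/2)||w - w_global||^2."""
    for i, grad in enumerate(grad_fns):
        g = grad(w_local[i]) + y_dual[i] + rho * (w_local[i] - w_global)
        w_local[i] = w_local[i] - lr * g                 # approximate local solve
    w_global = np.mean(w_local + y_dual / rho, axis=0)   # server aggregation
    for i in range(len(grad_fns)):
        y_dual[i] = y_dual[i] + rho * (w_local[i] - w_global)  # dual ascent
    return w_local, y_dual, w_global
```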
[363] SleepDIFFormer: Sleep Stage Classification via Multivariate Differential Transformer
Benjamin Wei Hao Chin, Yuin Torng Yew, Haocheng Wu, Lanxin Liang, Chow Khuen Chan, Norita Mohd Zain, Siti Balqis Samdin, Sim Kuan Goh
Main category: cs.LG
TL;DR: SleepDIFFormer - a transformer-based method for cross-domain sleep stage classification using joint EEG-EOG signals with differential attention and domain alignment to improve generalization.
Details
Motivation: Manual sleep stage classification is time-consuming and error-prone. Existing ML/DL methods struggle with non-stationary EEG/EOG signals across different datasets, leading to poor generalization performance.
Method: Proposed Multivariate Differential Transformer (SleepDIFFormer) with MDTA architecture for joint EEG-EOG representation learning. Uses cross-domain alignment to mitigate attention noise and learn domain-invariant features through feature distribution alignment.
Result: Achieved state-of-the-art performance on five different sleep staging datasets. Thorough ablation analysis showed effectiveness, and differential attention weights were interpreted to align with characteristic sleep EEG patterns.
Conclusion: SleepDIFFormer effectively addresses cross-domain generalization challenges in sleep stage classification, with implications for automated sleep quality assessment applications.
Abstract: Classification of sleep stages is essential for assessing sleep quality and diagnosing sleep disorders. However, manual inspection of EEG characteristics for each stage is time-consuming and prone to human error. Although machine learning and deep learning methods have been actively developed, they continue to face challenges from the non-stationarity and variability of electroencephalography (EEG) and electrooculography (EOG) signals across different domains (i.e., datasets), often leading to poor generalization. This work proposes a sleep stage classification method by developing a Multivariate Differential Transformer (SleepDIFFormer) for joint EEG and EOG representation learning. Specifically, SleepDIFFormer was developed to process EEG and EOG signals using our Multivariate Differential Transformer Architecture (MDTA) for time series, trained with cross-domain alignment. Our method mitigated spatial and temporal attention noise while learning a domain-invariant joint EEG-EOG representation through feature distribution alignment, thereby enabling generalization to unseen target datasets. Empirically, we evaluated our method on five different sleep staging datasets and compared it with existing approaches, achieving state-of-the-art performance. We also conducted a thorough ablation analysis of SleepDIFFormer and interpreted the differential attention weights, highlighting their relevance to characteristic sleep EEG patterns. These findings have implications for advancing automated sleep stage classification and its application to sleep quality assessment. Our source code is publicly available at https://github.com/Ben1001409/SleepDIFFormer
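The differential attention ingredient, following the Differential Transformer formulation that the name suggests, subtracts two softmax attention maps to cancel common-mode attention noise. A schematic single-head version; SleepDIFFormer's multivariate EEG-EOG variant and its domain-alignment training are not shown.

```python
import torch
import torch.nn.functional as F

def differential_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """Differential attention: the difference of two softmax attention maps
    suppresses attention noise common to both. x: (batch, seq, d_model)."""
    d = Wq1.shape[1]
    a1 = F.softmax(x @ Wq1 @ (x @ Wk1).transpose(-2, -1) / d**0.5, dim=-1)
    a2 = F.softmax(x @ Wq2 @ (x @ Wk2).transpose(-2, -1) / d**0.5, dim=-1)
    return (a1 - lam * a2) @ (x @ Wv)
```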
[364] Quantum Graph Attention Network: A Novel Quantum Multi-Head Attention Mechanism for Graph Learning
An Ning, Tai Yue Li, Nan Yow Chen
Main category: cs.LG
TL;DR: QGAT integrates variational quantum circuits into graph attention mechanisms, using quantum parallelism to generate multiple attention coefficients simultaneously, reducing computational overhead while improving expressiveness and robustness.
Details
Motivation: To enhance graph neural networks by leveraging quantum computing advantages - specifically quantum parallelism and entanglement - to create more expressive attention mechanisms with reduced computational complexity compared to classical multi-head attention.
Method: Uses strongly entangling quantum circuits with amplitude-encoded node features. A single quantum circuit simultaneously generates multiple attention coefficients, enabling parameter sharing across heads. Joint optimization of classical projection weights and quantum circuit parameters in end-to-end training.
Result: Demonstrates effectiveness in capturing complex structural dependencies, improved generalization in inductive scenarios, enhanced robustness against feature and structural noise, and reduced computational overhead.
Conclusion: QGAT shows potential for scalable quantum-enhanced learning across domains like chemistry and biology, offers advantages in handling noisy real-world data, and can be easily integrated into existing classical attention-based architectures.
Abstract: We propose the Quantum Graph Attention Network (QGAT), a hybrid graph neural network that integrates variational quantum circuits into the attention mechanism. At its core, QGAT employs strongly entangling quantum circuits with amplitude-encoded node features to enable expressive nonlinear interactions. Distinct from classical multi-head attention that separately computes each head, QGAT leverages a single quantum circuit to simultaneously generate multiple attention coefficients. This quantum parallelism facilitates parameter sharing across heads, substantially reducing computational overhead and model complexity. Classical projection weights and quantum circuit parameters are optimized jointly in an end-to-end manner, ensuring flexible adaptation to learning tasks. Empirical results demonstrate QGAT’s effectiveness in capturing complex structural dependencies and improved generalization in inductive scenarios, highlighting its potential for scalable quantum-enhanced learning across domains such as chemistry, biology, and network analysis. Furthermore, experiments confirm that quantum embedding enhances robustness against feature and structural noise, suggesting advantages in handling real-world noisy data. The modularity of QGAT also ensures straightforward integration into existing architectures, allowing it to easily augment classical attention-based models.
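The circuit described in the abstract (amplitude encoding followed by strongly entangling layers, with several readouts from one circuit) maps directly onto PennyLane templates. Sizes, and the use of Pauli-Z expectations as the raw per-head attention scores, are illustrative.

```python
import pennylane as qml
from pennylane import numpy as np

n_wires, n_heads = 3, 3
dev = qml.device("default.qubit", wires=n_wires)

@qml.qnode(dev)
def attention_scores(features, weights):
    # Amplitude-encode a (normalized, possibly padded) node-pair feature vector.
    qml.AmplitudeEmbedding(features, wires=range(n_wires),
                           normalize=True, pad_with=0.0)
    qml.StronglyEntanglingLayers(weights, wires=range(n_wires))
    # One circuit, several readouts: each expectation value can serve as the
    # raw attention coefficient of one head.
    return [qml.expval(qml.PauliZ(w)) for w in range(n_heads)]

weights = np.random.uniform(size=(2, n_wires, 3))  # (layers, wires, rotations)
print(attention_scores(np.random.uniform(size=8), weights))
```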
[365] Graph Data Modeling: Molecules, Proteins, & Chemical Processes
José Manuel Barraza-Chavez, Rana A. Barghout, Ricardo Almada-Monter, Adrian Jinich, Radhakrishnan Mahadevan, Benjamin Sanchez-Lengeling
Main category: cs.LG
TL;DR: This primer introduces graph data modeling and graph neural networks for chemical applications including molecules, proteins, and chemical processes, providing foundations for applying graph methods to chemical discovery.
Details
Motivation: Graphs naturally describe chemical structures and interactions in molecules, proteins, and processes, making them essential for chemical sciences where traditional methods may be limited.
Method: The paper outlines graph design foundations, key prediction tasks, and demonstrates how graph neural networks can operate on chemical graph representations across various chemical science domains.
Result: The primer provides representative examples and prepares readers to apply graph-based machine learning methods to chemical discovery problems.
Conclusion: Graph data modeling and graph neural networks represent powerful approaches for the next generation of chemical discovery by effectively capturing the structural and interaction complexities in chemical systems.
Abstract: Graphs are central to the chemical sciences, providing a natural language to describe molecules, proteins, reactions, and industrial processes. They capture interactions and structures that underpin materials, biology, and medicine. This primer, Graph Data Modeling: Molecules, Proteins, & Chemical Processes, introduces graphs as mathematical objects in chemistry and shows how learning algorithms (particularly graph neural networks) can operate on them. We outline the foundations of graph design, key prediction tasks, representative examples across chemical sciences, and the role of machine learning in graph-based modeling. Together, these concepts prepare readers to apply graph methods to the next generation of chemical discovery.
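For readers new to chemical graphs, the usual starting point is a molecule-to-graph conversion, for example with RDKit (assumed installed): atoms become nodes and bonds become typed edges.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)O")          # acetic acid
nodes = [(a.GetIdx(), a.GetSymbol()) for a in mol.GetAtoms()]
edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx(), b.GetBondTypeAsDouble())
         for b in mol.GetBonds()]
print(nodes)   # [(0, 'C'), (1, 'C'), (2, 'O'), (3, 'O')]
print(edges)   # [(0, 1, 1.0), (1, 2, 2.0), (1, 3, 1.0)]
```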
[366] Quantum-Classical Hybrid Molecular Autoencoder for Advancing Classical Decoding
Afrar Jahin, Yi Pan, Yingfeng Wang, Tianming Liu, Wei Zhang
Main category: cs.LG
TL;DR: Hybrid quantum-classical architecture for SMILES string reconstruction achieves 84% quantum fidelity and 60% classical similarity, outperforming existing quantum baselines.
Details
Motivation: Classical approaches struggle with high fidelity and validity in molecular design, and quantum machine learning integration with sequence-based tasks like SMILES reconstruction remains underexplored with fidelity degradation issues.
Method: Proposed hybrid quantum-classical architecture that integrates quantum encoding with classical sequence modeling for SMILES reconstruction.
Result: Achieved approximately 84% quantum fidelity and 60% classical reconstruction similarity, surpassing existing quantum baselines.
Conclusion: Lays foundation for future QML applications by balancing quantum representations with classical sequence models, catalyzing research on quantum-aware sequence models for molecular and drug discovery.
Abstract: Although recent advances in quantum machine learning (QML) offer significant potential for enhancing generative models, particularly in molecular design, a large array of classical approaches still face challenges in achieving high fidelity and validity. In particular, the integration of QML with sequence-based tasks, such as Simplified Molecular Input Line Entry System (SMILES) string reconstruction, remains underexplored and usually suffers from fidelity degradation. In this work, we propose a hybrid quantum-classical architecture for SMILES reconstruction that integrates quantum encoding with classical sequence modeling to improve quantum fidelity and classical similarity. Our approach achieves a quantum fidelity of approximately 84% and a classical reconstruction similarity of 60%, surpassing existing quantum baselines. Our work lays a promising foundation for future QML applications, striking a balance between expressive quantum representations and classical sequence models and catalyzing broader research on quantum-aware sequence models for molecular and drug discovery.
[367] Tune My Adam, Please!
Theodoros Athanasiadis, Steven Adriaensen, Samuel Müller, Frank Hutter
Main category: cs.LG
TL;DR: Adam-PFN is a new surrogate model for freeze-thaw Bayesian optimization that improves Adam hyperparameter tuning using pre-trained learning curves and a novel augmentation method called CDF-augment.
Details
Motivation: Adam optimizer is widely used but hyperparameter tuning is tedious and costly. Existing freeze-thaw BO methods are limited by generic surrogates without prior knowledge of hyperparameter effects on learning.
Method: Proposed Adam-PFN surrogate model pre-trained on learning curves from TaskSet, combined with CDF-augment learning curve augmentation method to artificially increase training examples.
Result: Improves learning curve extrapolation and accelerates hyperparameter optimization on TaskSet evaluation tasks, with strong performance on out-of-distribution tasks.
Conclusion: The approach provides an effective solution for low-budget hyperparameter tuning of Adam optimizer by incorporating domain-specific knowledge through pre-training and augmentation techniques.
Abstract: The Adam optimizer remains one of the most widely used optimizers in deep learning, and effectively tuning its hyperparameters is key to optimizing performance. However, tuning can be tedious and costly. Freeze-thaw Bayesian Optimization (BO) is a recent promising approach for low-budget hyperparameter tuning, but is limited by generic surrogates without prior knowledge of how hyperparameters affect learning. We propose Adam-PFN, a new surrogate model for Freeze-thaw BO of Adam’s hyperparameters, pre-trained on learning curves from TaskSet, together with a new learning curve augmentation method, CDF-augment, which artificially increases the number of available training examples. Our approach improves both learning curve extrapolation and accelerates hyperparameter optimization on TaskSet evaluation tasks, with strong performance on out-of-distribution (OOD) tasks.
[368] Parameter-Free Structural-Diversity Message Passing for Graph Neural Networks
Mingyue Kong, Yinglong Zhang, Chengda Xu, Xuewen Xia, Xing Xu
Main category: cs.LG
TL;DR: SDGNN is a parameter-free graph neural network framework that uses structural diversity theory to improve adaptability across diverse graph datasets without trainable parameters or complex training.
Details
Motivation: Mainstream GNNs struggle with structural heterogeneity and complex feature distributions, leading to over-smoothing and semantic degradation due to their reliance on many trainable parameters and fixed aggregation rules.
Method: Proposes SDGNN framework with structural-diversity message passing mechanism that captures neighborhood structure heterogeneity and feature semantic stability without additional parameters, using complementary structure-driven and feature-driven modeling.
Result: Outperforms mainstream GNNs on eight public benchmarks and PubMed citation network under challenging conditions like low supervision, class imbalance, and cross-domain transfer.
Conclusion: Provides a new theoretical perspective for parameter-free GNN design and validates structural diversity as a core signal in graph representation learning, with implementation publicly available.
Abstract: Graph Neural Networks (GNNs) have shown remarkable performance in structured data modeling tasks such as node classification. However, mainstream approaches generally rely on a large number of trainable parameters and fixed aggregation rules, making it difficult to adapt to graph data with strong structural heterogeneity and complex feature distributions. This often leads to over-smoothing of node representations and semantic degradation. To address these issues, this paper proposes a parameter-free graph neural network framework based on structural diversity, namely SDGNN (Structural-Diversity Graph Neural Network). The framework is inspired by structural diversity theory and designs a unified structural-diversity message passing mechanism that simultaneously captures the heterogeneity of neighborhood structures and the stability of feature semantics, without introducing additional trainable parameters. Unlike traditional parameterized methods, SDGNN does not rely on complex model training, but instead leverages complementary modeling from both structure-driven and feature-driven perspectives, thereby effectively improving adaptability across datasets and scenarios. Experimental results show that on eight public benchmark datasets and an interdisciplinary PubMed citation network, SDGNN consistently outperforms mainstream GNNs under challenging conditions such as low supervision, class imbalance, and cross-domain transfer. This work provides a new theoretical perspective and general approach for the design of parameter-free graph neural networks, and further validates the importance of structural diversity as a core signal in graph representation learning. To facilitate reproducibility and further research, the full implementation of SDGNN has been released at: https://github.com/mingyue15694/SGDNN/tree/main
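Since the paper's exact diversity measure is not given in this summary, the following parameter-free step is only one plausible instantiation: neighbors are aggregated with a weight derived from how heterogeneous the neighborhood's degrees are, and no trainable parameters appear anywhere.

```python
import numpy as np

def structural_diversity_pass(adj, X):
    """One parameter-free message-passing step in the spirit of SDGNN.
    Structural diversity is illustrated here as the number of distinct
    neighbor degrees; the paper's definition may differ."""
    deg = adj.sum(axis=1)
    H = np.zeros_like(X, dtype=float)
    for v in range(adj.shape[0]):
        nbrs = np.nonzero(adj[v])[0]
        if len(nbrs) == 0:
            H[v] = X[v]
            continue
        diversity = len(set(deg[nbrs]))           # structural-diversity weight
        H[v] = X[v] + diversity * X[nbrs].mean(axis=0)
        H[v] /= np.linalg.norm(H[v]) + 1e-12      # keep representations bounded
    return H
```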
[369] Constraint Learning in Multi-Agent Dynamic Games from Demonstrations of Local Nash Interactions
Zhouyu Zhang, Chih-Yuan Chiu, Glen Chou
Main category: cs.LG
TL;DR: Inverse dynamic game algorithm learns parametric constraints from multi-agent Nash equilibrium interactions using MILP encoding of KKT conditions, with theoretical guarantees for safe/unsafe set approximation and practical applications in motion planning.
Details
Motivation: To develop a method for learning constraints from observed multi-agent interactions, enabling better understanding of agent behaviors and facilitating robust motion planning in interactive environments.
Method: Mixed-integer linear programs (MILP) encoding Karush-Kuhn-Tucker (KKT) conditions of interacting agents to recover constraints consistent with Nash stationarity of interaction demonstrations.
Result: The method successfully infers both convex and non-convex constraints from interaction demonstrations, establishes theoretical guarantees for learning inner approximations of true safe/unsafe sets, and enables robust motion planning that satisfies underlying constraints.
Conclusion: The proposed inverse dynamic game approach effectively learns parametric constraints from Nash equilibrium interactions, with proven capabilities across simulations and hardware experiments for various constraint classes and nonlinear agent dynamics.
Abstract: We present an inverse dynamic game-based algorithm to learn parametric constraints from a given dataset of local generalized Nash equilibrium interactions between multiple agents. Specifically, we introduce mixed-integer linear programs (MILP) encoding the Karush-Kuhn-Tucker (KKT) conditions of the interacting agents, which recover constraints consistent with the Nash stationarity of the interaction demonstrations. We establish theoretical guarantees that our method learns inner approximations of the true safe and unsafe sets, as well as limitations of constraint learnability from demonstrations of Nash equilibrium interactions. We also use the interaction constraints recovered by our method to design motion plans that robustly satisfy the underlying constraints. Across simulations and hardware experiments, our methods proved capable of inferring constraints and designing interactive motion plans for various classes of constraints, both convex and non-convex, from interaction demonstrations of agents with nonlinear dynamics.
cs.MA
[370] Validating Generative Agent-Based Models for Logistics and Supply Chain Management Research
Vincent E. Castillo
Main category: cs.MA
TL;DR: GABMs using LLMs show promise for supply chain simulations but require dual validation - surface equivalence testing and decision process analysis - to ensure they properly represent human behavior.
Details
Motivation: To evaluate whether LLMs can validly simulate human behavior in logistics and supply chain management contexts, as traditional agent-based models lack realistic human-like reasoning capabilities.
Method: Controlled experiment with 957 human participants (477 dyads) in food delivery scenarios, comparing six state-of-the-art LLMs using moderated mediation design, Two One-Sided Tests for equivalence, and structural equation modeling.
Result: Some LLMs demonstrate surface-level equivalence to humans but show artificial decision processes not present in human participants, revealing an equivalence-versus-process paradox.
Conclusion: GABMs are potentially viable for LSCM research with proper dual validation (human equivalence testing and decision process validation), providing a framework for rigorous development and evidence-based LLM selection.
Abstract: Generative Agent-Based Models (GABMs) powered by large language models (LLMs) offer promising potential for empirical logistics and supply chain management (LSCM) research by enabling realistic simulation of complex human behaviors. Unlike traditional agent-based models, GABMs generate human-like responses through natural language reasoning, which creates potential for new perspectives on emergent LSCM phenomena. However, the validity of LLMs as proxies for human behavior in LSCM simulations is unknown. This study evaluates the behavioral equivalence of LLMs and humans through a controlled experiment examining dyadic customer-worker engagements in food delivery scenarios. I test six state-of-the-art LLMs against 957 human participants (477 dyads) using a moderated mediation design. This study reveals a need to validate GABMs on two levels: (1) human equivalence testing, and (2) decision process validation. Results reveal GABMs can effectively simulate human behaviors in LSCM; however, an equivalence-versus-process paradox emerges. While a series of Two One-Sided Tests (TOST) for equivalence reveals that some LLMs demonstrate surface-level equivalence to humans, structural equation modeling (SEM) reveals artificial decision processes not present in human participants for some LLMs. These findings establish GABMs as a potentially viable methodological instrument in LSCM, given proper validation checks. The dual-validation framework also provides LSCM researchers with a guide to rigorous GABM development. For practitioners, this study offers evidence-based assessment for LLM selection for operational tasks.
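The surface-equivalence half of the dual validation rests on Two One-Sided Tests. A self-contained version built from scipy's one-sided t-tests; the equivalence margins and the data below are made up for illustration.

```python
import numpy as np
from scipy import stats

def tost_ind(a, b, low, upp):
    """TOST for equivalence of two independent samples: conclude equivalence
    if the mean difference lies within (low, upp)."""
    p_lower = stats.ttest_ind(a - low, b, alternative="greater").pvalue
    p_upper = stats.ttest_ind(a - upp, b, alternative="less").pvalue
    return max(p_lower, p_upper)   # the TOST p-value

rng = np.random.default_rng(0)
human = rng.normal(5.0, 1.0, 200)   # e.g. human satisfaction ratings
llm = rng.normal(5.1, 1.0, 200)     # LLM-agent ratings
print(tost_ind(llm, human, low=-0.5, upp=0.5))  # small p => equivalent
```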
[371] Bridging Finite and Infinite-Horizon Nash Equilibria in Linear Quadratic Games
Giulio Salizzoni, Sophie Hall, Maryam Kamgarpour
Main category: cs.MA
TL;DR: Finite-horizon LQ games have unique Nash equilibria while infinite-horizon ones may have multiple. The paper shows finite-horizon equilibria form a dynamical system whose fixed points correspond to infinite-horizon equilibria, and periodic orbits correspond to periodic Nash equilibria.
Details
Motivation: To clarify the relationship between finite-horizon and infinite-horizon linear quadratic games and understand how multiple equilibria in infinite-horizon settings relate to finite-horizon approximations.
Method: Interpret finite-horizon equilibrium as a nonlinear dynamical system, analyze its fixed points and periodic orbits, and provide numerical simulations to study convergence behavior.
Result: Fixed points of the dynamical system correspond exactly to infinite-horizon equilibria, periodic orbits correspond to periodic Nash equilibria, and simulations reveal three asymptotic regimes: convergence to stationary equilibria, periodic equilibria, and bounded non-convergent trajectories.
Conclusion: The findings provide new insights and tools for tuning finite-horizon LQ games using infinite-horizon concepts, establishing a clear connection between the two settings through dynamical systems analysis.
Abstract: Finite-horizon linear quadratic (LQ) games admit a unique Nash equilibrium, while infinite-horizon settings may have multiple. We clarify the relationship between these two cases by interpreting the finite-horizon equilibrium as a nonlinear dynamical system. Within this framework, we prove that its fixed points are exactly the infinite-horizon equilibria and that any such equilibrium can be recovered by an appropriate choice of terminal costs. We further show that periodic orbits of the dynamical system, when they arise, correspond to periodic Nash equilibria, and we provide numerical evidence of convergence to such cycles. Finally, simulations reveal three asymptotic regimes: convergence to stationary equilibria, convergence to periodic equilibria, and bounded non-convergent trajectories. These findings offer new insights and tools for tuning finite-horizon LQ games using infinite-horizon concepts.
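The dynamical-system view can be reproduced in a few lines for a scalar two-player game: iterate the finite-horizon backward recursion from the terminal costs and watch for a fixed point, which then corresponds to an infinite-horizon equilibrium. The numbers below are arbitrary illustrations.

```python
import numpy as np

# Scalar two-player LQ game: x+ = a x + b1 u1 + b2 u2, costs q_i x^2 + r_i u_i^2.
a, b1, b2, q1, q2, r1, r2 = 1.05, 1.0, 0.8, 1.0, 0.5, 1.0, 1.0

def backward_step(p1, p2):
    """One step of the finite-horizon feedback-Nash recursion, seen as a map
    on the value parameters (p1, p2)."""
    # Coupled first-order conditions for the feedback gains k1, k2.
    M = np.array([[r1 + b1 * p1 * b1, b1 * p1 * b2],
                  [b2 * p2 * b1, r2 + b2 * p2 * b2]])
    k1, k2 = np.linalg.solve(M, np.array([b1 * p1 * a, b2 * p2 * a]))
    acl = a - b1 * k1 - b2 * k2                    # closed-loop dynamics
    return (q1 + r1 * k1**2 + acl * p1 * acl,
            q2 + r2 * k2**2 + acl * p2 * acl)

p1, p2 = q1, q2              # terminal costs seed the recursion
for _ in range(200):         # convergence => stationary (infinite-horizon) equilibrium
    p1, p2 = backward_step(p1, p2)
print(p1, p2)
```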
[372] Evolution favours positively biased reasoning in sequential interactions with high future gains
Marco Saponara, Elias Fernandez Domingos, Jorge M. Pacheco, Tom Lenaerts
Main category: cs.MA
TL;DR: Evolutionary game theory shows that positively biased reasoning (like wishful thinking) evolves as adaptive strategy in social dilemmas, outcompeting rational behavior in sequential games.
Details
Motivation: To understand the evolutionary roots of human cognitive biases that deviate from game-theoretical rationality, particularly why humans hold unrealistic expectations about future outcomes.
Method: Used Evolutionary Game Theory with a population deploying various level-k reasoning strategies (both unbiased and biased) in sequential interactions modeled by the Incremental Centipede Game. Positively biased strategies favor higher uncertain rewards, while negatively biased strategies show the opposite tendency.
Result: Selection consistently favors positively biased reasoning over rational behavior, with rational strategies going extinct. Bias co-evolves with bounded rationality, and positively biased agents can coexist with non-reasoning agents. Longer games further promote positively biased reasoning.
Conclusion: Certain cognitive biases like wishful thinking constitute adaptive features that help cope with social dilemmas, despite deviating from rational judgment.
Abstract: Empirical evidence shows that human behaviour often deviates from game-theoretical rationality. For instance, humans may hold unrealistic expectations about future outcomes. As the evolutionary roots of such biases remain unclear, we investigate here how reasoning abilities and cognitive biases co-evolve using Evolutionary Game Theory. In our model, individuals in a population deploy a variety of unbiased and biased level-k reasoning strategies to anticipate others’ behaviour in sequential interactions, represented by the Incremental Centipede Game. Positively biased reasoning strategies have a systematic inference bias towards higher but uncertain rewards, while negatively biased strategies reflect the opposite tendency. We find that selection consistently favours positively biased reasoning, with rational behaviour even going extinct. This bias co-evolves with bounded rationality, as the reasoning depth remains limited in the population. Interestingly, positively biased agents may co-exist with non-reasoning agents, thus pointing to a novel equilibrium. Longer games further promote positively biased reasoning, as they can lead to higher future rewards. The biased reasoning strategies proposed in this model may reflect cognitive phenomena like wishful thinking and defensive pessimism. This work therefore supports the claim that certain cognitive biases, despite deviating from rational judgment, constitute an adaptive feature to better cope with social dilemmas.
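The selection machinery behind such results is typically the replicator dynamics. A generic sketch with a made-up 3-strategy payoff matrix; the paper's payoffs instead come from the Incremental Centipede Game played between level-k reasoners.

```python
import numpy as np

def replicator_step(x, payoff, dt=0.01):
    """One Euler step of the replicator dynamics x_i' = x_i (f_i - f_bar):
    strategies with above-average payoff grow in frequency."""
    f = payoff @ x                  # expected payoff of each strategy
    return x + dt * x * (f - x @ f)

A = np.array([[1.0, 2.0, 0.5],      # hypothetical payoff matrix
              [0.5, 1.0, 2.0],
              [2.0, 0.5, 1.0]])
x = np.ones(3) / 3                  # start from a uniform population
for _ in range(10_000):
    x = replicator_step(x, A)
print(x)                            # long-run strategy frequencies
```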
[373] CBS with Continuous-Time Revisit
Andy Li, Zhe Chen, Danial Harabor, Mor Vered
Main category: cs.MA
TL;DR: CCBS algorithm for continuous-time multi-agent path finding is incomplete due to uncountably infinite state space from continuous wait durations, but a restricted version with fixed wait durations (MAPFrdt) is optimally solvable.
Details
Motivation: To examine the theoretical foundations of CCBS algorithm for continuous-time MAPF and identify limitations in preserving optimality guarantees when extending classical MAPF solvers to continuous time.
Method: Theoretical analysis and counter-examples to demonstrate CCBS incompleteness, plus identification of a restricted sub-problem (MAPFrdt) with fixed wait durations that is optimally solvable.
Result: CCBS is incomplete for general continuous-time MAPF due to uncountably infinite state space, but complete and optimal for MAPFrdt with fixed wait durations.
Conclusion: Continuous wait durations create fundamental challenges for optimal MAPF solvers, with open questions remaining about generalized versions allowing arbitrary wait times and continuous space movements.
Abstract: Multi-Agent Path Finding in Continuous Time (MAPFr) extends the classical MAPF problem by allowing agents to operate in continuous time. Conflict-Based Search with Continuous Time (CCBS) is a foundational algorithm for solving MAPFr optimally. In this paper, we revisit the theoretical claims of CCBS and show the algorithm is incomplete, due to an uncountably infinite state space created by continuous wait durations. Through theoretical analysis and counter-examples, we examine the inherent challenges of extending existing MAPF solvers to address MAPFr while preserving optimality guarantees. By restricting waiting duration to fixed amounts, we identify a related sub-problem on graphs, MAPFrdt, which we show is optimally solvable, including by CCBS. It remains an open question whether similar models exist for MAPFrct, a generalised version of MAPFrdt that allows arbitrary wait times, and MAPFrcs, which further allows arbitrary movements in continuous space.
[374] Anemoi: A Semi-Centralized Multi-agent System Based on Agent-to-Agent Communication MCP server from Coral Protocol
Xinxing Ren, Caelum Forder, Qianbo Zang, Ahsen Tahir, Roman J. Georgio, Suman Deb, Peter Carroll, Önder Gürcan, Zekun Guo
Main category: cs.MA
TL;DR: Anemoi is a semi-centralized multi-agent system that enables direct inter-agent communication through A2A MCP server, reducing planner dependency and improving collaboration efficiency compared to traditional centralized approaches.
Details
Motivation: Traditional centralized multi-agent systems suffer from strong dependency on planner capability and limited inter-agent communication through costly prompt concatenation, leading to redundancy and information loss.
Method: Proposes Anemoi, a semi-centralized MAS built on the Agent-to-Agent (A2A) communication MCP server from Coral Protocol, enabling structured direct inter-agent collaboration with real-time monitoring and adaptive plan updates.
Result: Achieved 52.73% accuracy on GAIA benchmark with GPT-4.1-mini as planner, surpassing strongest open-source baseline OWL (43.63%) by +9.09% under identical LLM settings.
Conclusion: Anemoi reduces reliance on single planner, supports adaptive updates, minimizes redundant context passing, and provides more scalable and cost-efficient execution for multi-agent systems.
Abstract: Recent advances in generalist multi-agent systems (MAS) have largely followed a context-engineering plus centralized paradigm, where a planner agent coordinates multiple worker agents through unidirectional prompt passing. While effective under strong planner models, this design suffers from two critical limitations: (1) strong dependency on the planner’s capability, which leads to degraded performance when a smaller LLM powers the planner; and (2) limited inter-agent communication, where collaboration relies on costly prompt concatenation and context injection, introducing redundancy and information loss. To address these challenges, we propose Anemoi, a semi-centralized MAS built on the Agent-to-Agent (A2A) communication MCP server from Coral Protocol. Unlike traditional designs, Anemoi enables structured and direct inter-agent collaboration, allowing all agents to monitor progress, assess results, identify bottlenecks, and propose refinements in real time. This paradigm reduces reliance on a single planner, supports adaptive plan updates, and minimizes redundant context passing, resulting in more scalable and cost-efficient execution. Evaluated on the GAIA benchmark, Anemoi achieved 52.73% accuracy with a small LLM (GPT-4.1-mini) as the planner, surpassing the strongest open-source baseline OWL (43.63%) by +9.09% under identical LLM settings. Our implementation is publicly available at https://github.com/Coral-Protocol/Anemoi.
cs.MM
[375] MM-HSD: Multi-Modal Hate Speech Detection in Videos
Berta Céspedes-Sarrias, Carlos Collado-Capell, Pablo Rodenas-Ruiz, Olena Hrynenko, Andrea Cavallaro
Main category: cs.MM
TL;DR: MM-HSD is a multi-modal hate speech detection model for videos that integrates video frames, audio, speech transcripts, and on-screen text using Cross-Modal Attention, achieving state-of-the-art performance.
Details
Motivation: Existing multi-modal hate speech detection approaches are limited in videos, often omitting relevant modalities like on-screen text and audio, and fail to capture inter-modal dependencies effectively.
Method: Proposes MM-HSD model that integrates video frames, audio, speech transcripts, and on-screen text using Cross-Modal Attention (CMA) as an early feature extractor, systematically comparing query/key configurations.
Result: Outperforms state-of-the-art methods on HateMM dataset with M-F1 score of 0.874, with best performance when on-screen text is used as query and other modalities serve as key.
Conclusion: The integration of multiple modalities with Cross-Modal Attention, particularly using on-screen text as query, significantly improves hate speech detection performance in videos.
Abstract: While hate speech detection (HSD) has been extensively studied in text, existing multi-modal approaches remain limited, particularly in videos. As modalities are not always individually informative, simple fusion methods fail to fully capture inter-modal dependencies. Moreover, previous work often omits relevant modalities such as on-screen text and audio, which may contain subtle hateful content and thus provide essential cues, both individually and in combination with others. In this paper, we present MM-HSD, a multi-modal model for HSD in videos that integrates video frames, audio, and text derived from speech transcripts and from frames (i.e., on-screen text), together with features extracted by Cross-Modal Attention (CMA). We are the first to use CMA as an early feature extractor for HSD in videos, to systematically compare query/key configurations, and to evaluate the interactions between different modalities in the CMA block. Our approach leads to improved performance when on-screen text is used as a query and the rest of the modalities serve as a key. Experiments on the HateMM dataset show that MM-HSD outperforms state-of-the-art methods on M-F1 score (0.874), using concatenation of transcript, audio, video, on-screen text, and CMA for feature extraction on raw embeddings of the modalities. The code is available at https://github.com/idiap/mm-hsd
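The winning query/key configuration, on-screen text attending over the remaining modalities, can be expressed with a stock attention layer. Dimensions and the use of nn.MultiheadAttention are illustrative rather than the paper's exact CMA block.

```python
import torch
import torch.nn as nn

d = 256
cma = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

ocr_text = torch.randn(8, 12, d)    # (batch, tokens, dim): on-screen text query
others = torch.randn(8, 48, d)      # audio + video + transcript as key/value
fused, attn_weights = cma(query=ocr_text, key=others, value=others)
# Pool and concatenate for the downstream hate-speech classifier head.
features = torch.cat([fused.mean(dim=1), others.mean(dim=1)], dim=-1)
```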
[376] diveXplore at the Video Browser Showdown 2024
Klaus Schoeffmann, Sahar Nasirihaghighi
Main category: cs.MM
TL;DR: Revised diveXplore system for VBS2024 with OpenCLIP integration, improved query server, optimized UI, and enhanced exploration capabilities for video clusters.
Details
Motivation: Based on experience from VBS2023 and feedback from IVR4B special session at CBMI2023, the authors identified the need to improve the diveXplore system's capabilities for better video browsing and search performance.
Method: Integrated OpenCLIP trained on LAION-2B dataset for image/text embeddings, implemented a distributed query server for handling different queries and merging results, optimized user interface for fast browsing, and added exploration view for large clusters of similar videos.
Result: The revised diveXplore system now supports free-text and visual similarity search with improved performance, better query distribution and result merging, faster browsing experience, and enhanced exploration of video clusters like weddings, paraglider events, and snow/ice scenery.
Conclusion: The comprehensive revision of diveXplore system addresses previous limitations and provides enhanced multimedia search and exploration capabilities for VBS2024, leveraging state-of-the-art CLIP technology and improved system architecture.
Abstract: According to our experience from VBS2023 and the feedback from the IVR4B special session at CBMI2023, we have largely revised the diveXplore system for VBS2024. It now integrates OpenCLIP trained on the LAION-2B dataset for image/text embeddings that are used for free-text and visual similarity search, a query server that is able to distribute different queries and merge the results, a user interface optimized for fast browsing, as well as an exploration view for large clusters of similar videos (e.g., weddings, paraglider events, snow and ice scenery, etc.).
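Free-text search in such a system boils down to embedding the query with the LAION-2B OpenCLIP model and ranking precomputed frame embeddings by cosine similarity; the index layout below is a stand-in for the system's actual index.

```python
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

frame_embs = torch.randn(10_000, 512)                  # stand-in frame index
frame_embs = frame_embs / frame_embs.norm(dim=-1, keepdim=True)

with torch.no_grad():
    q = model.encode_text(tokenizer(["a paraglider over snowy mountains"]))
    q = q / q.norm(dim=-1, keepdim=True)
top = (frame_embs @ q.T).squeeze(1).topk(5).indices    # best-matching frames
```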
[377] Less is More - diveXplore 5.0 at VBS 2021
Andreas Leibetseder, Klaus Schoeffmann
Main category: cs.MM
TL;DR: diveXplore system rebuilt from scratch to reduce complexity after performance declined from feature additions, maintaining proven features in modular design
Details
Motivation: Performance decline due to increasing system complexity from new features that couldn't be well integrated with the core interactive self-organizing featuremap.
Method: Completely rebuilt version 5.0 implemented from scratch with modular architecture to reduce complexity while keeping useful features.
Result: New version developed for VBS 2021 that addresses the complexity issues of previous versions
Conclusion: Complete system redesign was necessary to counteract performance decline and accommodate future feature additions in a modular way
Abstract: As a longstanding participating system in the annual Video Browser Showdown (VBS2017-VBS2020) as well as in two iterations of the more recently established Lifelog Search Challenge (LSC2018-LSC2019), diveXplore is developed as a feature-rich Deep Interactive Video Exploration system. After its initial successful employment as a competitive tool at the challenges, its performance, however, declined as new features were introduced, increasing its overall complexity. We mainly attribute this to the fact that many additions to the system needed to revolve around the system’s core element, an interactive self-organizing browseable featuremap, which, as an integral component, did not accommodate the addition of new features well. Therefore, to counteract said performance decline, the VBS 2021 version constitutes a completely rebuilt version 5.0, implemented from scratch with the aim of greatly reducing the system’s complexity as well as keeping proven useful features in a modular manner.
[378] diveXplore 6.0: ITEC’s Interactive Video Exploration System at VBS 2022
Andreas Leibetseder, Klaus Schoeffmann
Main category: cs.MM
TL;DR: diveXplore v6.0 is a refined interactive video search system that has been simplified and optimized since VBS2017, with new features for shot segmentation, map search, and improved concept/temporal context search.
Details
Motivation: To create a more modern, leaner, and faster video search system by refactoring the veteran diveXplore platform and introducing new capabilities for improved search performance.
Method: Major refactoring of the system (version 5.0) to reduce feature bloat while maintaining core functionality, followed by version 6.0 with reconsidered shot segmentation, map search implementation, and new features for concept and temporal context search.
Result: The refactored system showed increasing performance in VBS2021 compared to previous competitions, demonstrating that the leaner approach was effective.
Conclusion: Simplifying and modernizing the diveXplore system while strategically adding new search capabilities proved to be a successful approach for improving video search performance in competitive evaluations.
Abstract: Continuously participating since the sixth Video Browser Showdown (VBS2017), diveXplore is a veteran interactive search system that throughout its lifetime has offered and evaluated numerous features. After undergoing major refactoring for the most recent VBS2021, however, the system since version 5.0 is less feature-rich, yet more modern, leaner, and faster than the original system. This proved to be a sensible decision, as the new system showed increasing performance in VBS2021 when compared to the most recent former competitions. With version 6.0 we reconsider shot segmentation and map search, and introduce new features for improving concept as well as temporal context search.
[379] AdaDPCC: Adaptive Rate Control and Rate-Distortion-Complexity Optimization for Dynamic Point Cloud Compression
Chenhao Zhang, Wei Gao
Main category: cs.MM
TL;DR: Novel dynamic point cloud compression framework with slimmable architecture for efficient rate-distortion-complexity optimization, achieving 5.81% BD-Rate reduction and 44.6% coding time reduction.
Details
Motivation: Current dynamic point cloud compression methods face challenges with complexity management and rate control, which are crucial for applications like autonomous driving and AR/VR.
Method: Proposes a slimmable framework with multiple coding routes, coarse-to-fine motion estimation/compensation module for sparse data, and content-adaptive rate control module for precise bitrate management.
Result: Reduces average BD-Rate by 5.81%, improves BD-PSNR by 0.42 dB, maintains 0.40% average bitrate error, and reduces coding time by up to 44.6% compared to state-of-the-art methods.
Conclusion: The proposed framework provides efficient real-time compression with precise rate control, making it suitable for bitrate-constrained dynamic point cloud compression scenarios.
Abstract: Dynamic point cloud compression (DPCC) is crucial in applications like autonomous driving and AR/VR. Current compression methods face challenges with complexity management and rate control. This paper introduces a novel dynamic coding framework that supports variable bitrate and computational complexities. Our approach includes a slimmable framework with multiple coding routes, allowing for efficient Rate-Distortion-Complexity Optimization (RDCO) within a single model. To address data sparsity in inter-frame prediction, we propose the coarse-to-fine motion estimation and compensation module that deconstructs geometric information while expanding the perceptive field. Additionally, we propose a precise rate control module that content-adaptively navigates point cloud frames through various coding routes to meet target bitrates. The experimental results demonstrate that our approach reduces the average BD-Rate by 5.81% and improves the BD-PSNR by 0.42 dB compared to the state-of-the-art method, while keeping the average bitrate error at 0.40%. Moreover, the average coding time is reduced by up to 44.6% compared to D-DPCC, underscoring its efficiency in real-time and bitrate-constrained DPCC scenarios. Our code is available at https://git.openi.org.cn/OpenPointCloud/Ada_DPCC.
[380] OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset
Jeongkyun Park, Jung-Wook Hwang, Kwanghee Choi, Seung-Hyun Lee, Jun Hwan Ahn, Rae-Hong Park, Hyung-Min Park
Main category: cs.MM
TL;DR: The paper introduces OLKAVS, the largest Korean audio-visual speech dataset with 1,150 hours of multi-view video from 1,107 speakers, addressing limitations of existing English-focused datasets.
Details
Motivation: Existing audio-visual datasets are mostly English-focused, model-dependent during preparation, and have limited multi-view content. There's a need for large-scale Korean multi-modal datasets to advance research.
Method: Developed OLKAVS dataset containing studio-recorded Korean speech with 9 different viewpoints and various noise conditions. Provided pre-trained baseline models for audio-visual speech recognition and lip reading tasks.
Result: Created the largest publicly available audio-visual speech dataset with extensive multi-view content. Experiments demonstrated effectiveness of multi-modal and multi-view training over uni-modal and frontal-view-only approaches.
Conclusion: OLKAVS dataset enables advanced multi-modal research in Korean speech processing, speaker recognition, pronunciation analysis, and mouth motion studies, addressing previous dataset limitations.
Abstract: Inspired by humans comprehending speech in a multi-modal manner, various audio-visual datasets have been constructed. However, most existing datasets focus on English, induce dependencies with various prediction models during dataset preparation, and have only a small number of multi-view videos. To mitigate the limitations, we recently developed the Open Large-scale Korean Audio-Visual Speech (OLKAVS) dataset, which is the largest among publicly available audio-visual speech datasets. The dataset contains 1,150 hours of transcribed audio from 1,107 Korean speakers in a studio setup with nine different viewpoints and various noise situations. We also provide the pre-trained baseline models for two tasks, audio-visual speech recognition and lip reading. We conducted experiments based on the models to verify the effectiveness of multi-modal and multi-view training over uni-modal and frontal-view-only training. We expect the OLKAVS dataset to facilitate multi-modal research in broader areas such as Korean speech recognition, speaker recognition, pronunciation level classification, and mouth motion analysis.
[381] TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity
Yuzhuo Chen, Zehua Ma, Han Fang, Weiming Zhang, Nenghai Yu
Main category: cs.MM
TL;DR: TAG-WM is a tamper-aware generative watermarking method that embeds dual watermarks for copyright protection and tampering localization while maintaining image quality.
Details
Motivation: Address copyright and authenticity risks in AI-generated content by developing robust watermarking that can detect and localize malicious tampering while preserving generation quality.
Method: Uses four modules: dual-mark joint sampling for embedding copyright and localization watermarks, watermark latent reconstruction, dense variation region detector using diffusion inversion sensitivity, and tamper-aware decoding guided by localization results.
Result: Achieves state-of-the-art performance in tampering robustness and localization capability under distortion, maintains lossless generation quality, and supports 256-bit watermark capacity.
Conclusion: TAG-WM provides an effective solution for protecting AI-generated content against tampering while ensuring watermark robustness and accurate localization capabilities.
Abstract: AI-generated content (AIGC) enables efficient visual creation but raises copyright and authenticity risks. As a common technique for integrity verification and source tracing, digital image watermarking is regarded as a potential solution to the above issues. However, the widespread adoption and advancing capabilities of generative image editing tools have amplified malicious tampering risks, while simultaneously posing new challenges to passive tampering detection and watermark robustness. To address these challenges, this paper proposes a Tamper-Aware Generative image WaterMarking method named TAG-WM. The proposed method comprises four key modules: a dual-mark joint sampling (DMJS) algorithm for embedding copyright and localization watermarks into the latent space while preserving generative quality, the watermark latent reconstruction (WLR) utilizing reversed DMJS, a dense variation region detector (DVRD) leveraging diffusion inversion sensitivity to identify tampered areas via statistical deviation analysis, and the tamper-aware decoding (TAD) guided by localization results. The experimental results demonstrate that TAG-WM achieves state-of-the-art performance in both tampering robustness and localization capability even under distortion, while preserving lossless generation quality and maintaining a watermark capacity of 256 bits. The code is available at: https://github.com/Suchenl/TAG-WM.
[382] A Highly Clean Recipe Dataset with Ingredient States Annotation for State Probing Task
Mashiro Toyooka, Kiyoharu Aizawa, Yoko Yamakata
Main category: cs.MM
TL;DR: This paper introduces a new Japanese recipe dataset with annotated ingredient state changes and evaluates LLMs’ ability to track ingredient states during cooking procedures.
Details
Motivation: LLMs are trained on procedural texts but lack direct observation of real-world phenomena, making it challenging to track intermediate ingredient states in cooking recipes where such states are often omitted.
Method: The authors constructed a Japanese recipe dataset with clear annotations of ingredient state changes and designed three novel tasks to evaluate LLMs’ ability to track state transitions and identify ingredients at intermediate steps. They tested widely used LLMs like Llama3.1-70B and Qwen2.5-72B.
Result: Experiments showed that learning ingredient state knowledge improves LLMs’ understanding of cooking processes, achieving performance comparable to commercial LLMs.
Conclusion: The proposed state probing method and dataset effectively evaluate LLMs’ world understanding in the cooking domain, demonstrating that explicit state tracking enhances recipe comprehension.
Abstract: Large Language Models (LLMs) are trained on a vast amount of procedural texts, but they do not directly observe real-world phenomena. In the context of cooking recipes, this poses a challenge, as intermediate states of ingredients are often omitted, making it difficult for models to track ingredient states and understand recipes accurately. In this paper, we apply state probing, a method for evaluating a language model’s understanding of the world, to the domain of cooking. We propose a new task and dataset for evaluating how well LLMs can recognize intermediate ingredient states during cooking procedures. We first construct a new Japanese recipe dataset with clear and accurate annotations of ingredient state changes, collected from well-structured and controlled recipe texts. Using this dataset, we design three novel tasks to evaluate whether LLMs can track ingredient state transitions and identify ingredients present at intermediate steps. Our experiments with widely used LLMs, such as Llama3.1-70B and Qwen2.5-72B, show that learning ingredient state knowledge improves their understanding of cooking processes, achieving performance comparable to commercial LLMs. The dataset is publicly available at: https://huggingface.co/datasets/mashi6n/nhkrecipe-100-anno-1
eess.AS
[383] Live Vocal Extraction from K-pop Performances
Yujin Kim, Richa Namballa, Magdalena Fuentes
Main category: eess.AS
TL;DR: Automatic extraction of live vocals from K-pop performances using source separation, cross-correlation, and amplitude scaling techniques
Details
Motivation: Inspired by K-pop's global success and dynamic fan engagement culture, the paper aims to address the challenge of automatically isolating live vocals from live performances.
Method: Combination of source separation, cross-correlation, and amplitude scaling to automatically remove pre-recorded vocals and instrumentals from live performances.
Result: Preliminary work that introduces the novel task of live vocal separation and establishes a foundation for future research in this area
Conclusion: The proposed methodology provides an initial framework for automatically extracting live vocals from performances, opening up new research directions in audio processing inspired by K-pop culture
Abstract: K-pop’s global success is fueled by its dynamic performances and vibrant fan engagement. Inspired by K-pop fan culture, we propose a methodology for automatically extracting live vocals from performances. We use a combination of source separation, cross-correlation, and amplitude scaling to automatically remove pre-recorded vocals and instrumentals from a live performance. Our preliminary work introduces the task of live vocal separation and provides a foundation for future research on this topic.
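As a rough illustration of the pipeline, the sketch below (our simplification, with hypothetical signal names) aligns the pre-recorded track to the live mix via cross-correlation, applies a least-squares amplitude scale, and subtracts it; the actual system additionally applies source separation.

```python
import numpy as np
from scipy.signal import correlate

def extract_live_vocals(live: np.ndarray, studio: np.ndarray) -> np.ndarray:
    # Lag that best aligns the pre-recorded track with the live mix.
    corr = correlate(live, studio, mode="full")
    lag = int(np.argmax(corr)) - (len(studio) - 1)
    aligned = np.roll(studio, lag)  # simplified: roll wraps around the edges
    # Least-squares amplitude scaling before subtraction.
    scale = float(np.dot(live, aligned) / (np.dot(aligned, aligned) + 1e-8))
    return live - scale * aligned

live_mix = np.random.randn(16000)   # stand-in for a live recording
backing = np.random.randn(16000)    # stand-in for the pre-recorded track
vocals = extract_live_vocals(live_mix, backing)
```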
[384] Unifying Diarization, Separation, and ASR with Multi-Speaker Encoder
Muhammad Shakeel, Yui Sudo, Yifan Peng, Chyi-Jiunn Lin, Shinji Watanabe
Main category: eess.AS
TL;DR: Unified multi-speaker encoder that jointly learns representations for speaker diarization, speech separation, and multi-speaker ASR using shared speech encoder with residual weighted-sum encoding.
Details
Motivation: To capture inherent interdependencies among speaker diarization, speech separation, and multi-speaker ASR tasks through joint training, improving performance on overlapping speech data.
Method: Joint training architecture with shared speech foundational encoder using residual weighted-sum encoding (RWSE) from multiple layers to leverage information from different semantic levels and enable bottom-up alignment between tasks.
Result: Substantially improves over single-task baselines on LibriMix evaluation sets. Achieves diarization error rates of 1.37% on Libri2Mix and 2.29% on Libri3Mix, outperforming previous studies.
Conclusion: The unified multi-speaker encoder effectively leverages task interdependencies through joint training and multi-layer representation fusion, demonstrating superior performance across all three speech processing tasks.
Abstract: This paper presents a unified multi-speaker encoder (UME), a novel architecture that jointly learns representations for speaker diarization (SD), speech separation (SS), and multi-speaker automatic speech recognition (ASR) tasks using a shared speech foundational encoder. We leverage the hidden representations from multiple layers of UME as a residual weighted-sum encoding (RWSE) to effectively use information from different semantic levels, contributing to bottom-up alignment between tasks. This joint training approach captures the inherent interdependencies among the tasks, enhancing overall performance on overlapping speech data. Our evaluations demonstrate that UME substantially improves over the single-task baselines dedicated to SD, SS, and multi-speaker ASR on LibriMix evaluation sets. Notably, for SD, UME outperforms the previous studies, achieving diarization error rates of 1.37% and 2.29% on Libri2Mix and Libri3Mix evaluation sets, respectively.
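The RWSE idea reduces to a learnable softmax-weighted sum over layer outputs; the sketch below is a minimal PyTorch rendering under our assumptions (the residual connection to the top layer, in particular, is our reading of the name, not a confirmed detail).

```python
import torch
import torch.nn as nn

class RWSE(nn.Module):
    """Residual weighted-sum encoding over encoder layer outputs (sketch)."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: list of (B, T, D) tensors, one per encoder layer.
        stacked = torch.stack(hidden_states, dim=0)            # (L, B, T, D)
        w = torch.softmax(self.layer_weights, dim=0)
        weighted = (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)
        return weighted + hidden_states[-1]                    # residual path

layers = [torch.randn(2, 50, 768) for _ in range(12)]
shared_repr = RWSE(num_layers=12)(layers)   # fed to the SD/SS/ASR heads
```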
[385] CodecBench: A Comprehensive Benchmark for Acoustic and Semantic Evaluation
Ruifan Deng, Yitian Gong, Qinghui Gao, Luozhijie Jin, Qinyuan Cheng, Zhaoye Fei, Shimin Li, Xipeng Qiu
Main category: eess.AS
TL;DR: CodecBench is a new comprehensive benchmark for evaluating audio codec performance across acoustic and semantic capabilities in complex scenarios like multi-speaker environments, background noise, and paralinguistic information.
Details
Motivation: Existing audio codec evaluation is limited by simplistic metrics and scenarios, and current benchmarks are not designed for complex application scenarios that modern multimodal LLMs require.
Method: The authors introduce CodecBench, a comprehensive evaluation dataset that assesses audio codec performance from both acoustic and semantic perspectives across four data domains.
Result: The benchmark enables identification of current limitations in audio codec technology and provides a framework for more comprehensive evaluation in complex scenarios.
Conclusion: CodecBench aims to highlight future research directions and foster advances in audio codec development to better support multimodal LLMs in diverse application scenarios.
Abstract: With the rise of multimodal large language models (LLMs), audio codecs play an increasingly vital role in encoding audio into discrete tokens, enabling the integration of audio into text-based LLMs. Current audio codecs capture two types of information: acoustic and semantic. As audio codecs are applied to diverse scenarios in speech language models, they need to model increasingly complex information and adapt to varied contexts, such as scenarios with multiple speakers, background noise, or richer paralinguistic information. However, existing codec evaluations have been limited to simplistic metrics and scenarios, and existing benchmarks for audio codecs are not designed for complex application scenarios, which limits the assessment of acoustic and semantic capabilities on complex datasets. We introduce CodecBench, a comprehensive evaluation dataset to assess audio codec performance from both acoustic and semantic perspectives across four data domains. Through this benchmark, we aim to identify current limitations, highlight future research directions, and foster advances in the development of audio codecs. The code is available at https://github.com/RayYuki/CodecBench.
[386] Sound event detection with audio-text models and heterogeneous temporal annotations
Manu Harju, Annamaria Mesaros
Main category: eess.AS
TL;DR: Using synthetic captions to guide sound event detection systems improves performance over traditional CRNN methods, especially when dealing with unbalanced classes and partial weak labels.
Details
Motivation: Recent advances in synthetic caption generation from audio allow leveraging natural language information for other audio tasks, providing an opportunity to enhance sound event detection systems with text guidance.
Method: Proposed a novel method using machine-generated captions as complementary information to strong labels for training sound event detection. Evaluated different textual inputs and studied scenarios with partial strong labels and partial weak labels.
Result: Synthetic captions improved performance in both cases: PSDS-1 score increased from 0.223 to 0.277 with strong labels, and from 0.166 to 0.218 when half training data had only weak labels on a 50-class unbalanced dataset.
Conclusion: Text-guided sound event detection using synthetic captions significantly outperforms traditional CRNN architecture, demonstrating the value of incorporating natural language information even with limited or weak labeling.
Abstract: Recent advances in generating synthetic captions based on audio and related metadata allow using the information contained in natural language as input for other audio tasks. In this paper, we propose a novel method to guide a sound event detection system with free-form text. We use machine-generated captions as complementary information to the strong labels for training, and evaluate the systems using different types of textual inputs. In addition, we study a scenario where only part of the training data has strong labels, and the rest of it only has temporally weak labels. Our findings show that synthetic captions improve the performance in both cases compared to the CRNN architecture typically used for sound event detection. On a dataset of 50 highly unbalanced classes, the PSDS-1 score increases from 0.223 to 0.277 when trained with strong labels, and from 0.166 to 0.218 when half of the training data has only weak labels.
[387] Online incremental learning for audio classification using a pretrained audio model
Manjunath Mulimani, Annamaria Mesaros
Main category: eess.AS
TL;DR: Proposes an online incremental learning method using pre-trained audio embeddings with an added nonlinear layer to expand dimensionality and capture sound characteristics, enabling single-pass adaptation with minimal forgetting.
Details
Motivation: Existing incremental learning methods for audio require training from scratch and multiple iterations per new task. This work aims to develop an online learner that can adapt to new audio classification tasks with minimal forgetting in a single forward pass.
Method: Inject a layer with nonlinear activation between pre-trained model’s audio embeddings and classifier to expand embedding dimensionality and capture distinct sound characteristics. Adapts model in single forward pass through training samples.
Result: Outperforms other methods in both class-incremental learning (ESC-50) and domain-incremental learning (TAU Urban Acoustic Scenes 2019 dataset) setups.
Conclusion: The proposed approach effectively enables online incremental learning for audio classification with minimal forgetting, demonstrating superior performance compared to existing methods in both class and domain incremental scenarios.
Abstract: Incremental learning aims to learn new tasks sequentially without forgetting the previously learned ones. Most of the existing incremental learning methods for audio focus on training the model from scratch on the initial task, and the same model is used to learn upcoming incremental tasks. The model is trained for several iterations to adapt to each new task, using some specific approaches to reduce the forgetting of old tasks. In this work, we propose a method for using generalizable audio embeddings produced by a pre-trained model to develop an online incremental learner that solves sequential audio classification tasks over time. Specifically, we inject a layer with a nonlinear activation function between the pre-trained model’s audio embeddings and the classifier; this layer expands the dimensionality of the embeddings and effectively captures the distinct characteristics of sound classes. Our method adapts the model in a single forward pass (online) through the training samples of any task, with minimal forgetting of old tasks. We demonstrate the performance of the proposed method in two incremental learning setups: one class-incremental learning using ESC-50 and one domain-incremental learning of different cities from the TAU Urban Acoustic Scenes 2019 dataset; for both cases, the proposed approach outperforms other methods.
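A minimal sketch of the recipe, under assumed dimensions and a plain SGD update: a frozen pre-trained embedder feeds a fixed nonlinear expansion layer, and only the linear classifier is updated, once per sample, in a single online pass (which parts are trained is our assumption for the sketch).

```python
import torch
import torch.nn as nn

emb_dim, expand_dim, num_classes = 768, 4096, 50
expansion = nn.Sequential(nn.Linear(emb_dim, expand_dim), nn.ReLU())
for p in expansion.parameters():
    p.requires_grad = False               # expansion layer kept fixed here
classifier = nn.Linear(expand_dim, num_classes)
opt = torch.optim.SGD(classifier.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def online_step(embedding: torch.Tensor, label: torch.Tensor) -> None:
    """One update per sample: no replay buffer, no repeated epochs."""
    feats = expansion(embedding)          # expanded, more separable features
    loss = loss_fn(classifier(feats), label)
    opt.zero_grad()
    loss.backward()
    opt.step()

for _ in range(10):                       # stream of samples from the current task
    x = torch.randn(1, emb_dim)           # stand-in pre-trained audio embedding
    y = torch.randint(0, num_classes, (1,))
    online_step(x, y)
```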
[388] A Solution of Ultra Wideband Based High-resolution and Lossless Audio Transmission
Fengyun Zhang
Main category: eess.AS
TL;DR: Proposes UWB technology for high-resolution lossless audio transmission to overcome wireless audio limitations like bandwidth, compression, latency, and compatibility issues.
Details
Motivation: Address the current challenges and limitations in wireless audio transmission including insufficient data bandwidth, compression artifacts, high latency, and poor inter-device compatibility that degrade audio quality and user experience.
Method: Utilizes ultra wideband (UWB) technology which provides the necessary bandwidth for high-resolution lossless audio transmission with ultra-low latency, enabling exceptional sound quality for real-time applications.
Result: UWB emerges as a promising solution that not only enables high-quality audio transmission but also offers precise location tracking capabilities for augmented and virtual reality applications.
Conclusion: UWB technology effectively addresses the core limitations of current wireless audio transmission systems by providing the bandwidth needed for lossless audio with low latency, while also extending functionality to spatial tracking for AR/VR applications.
Abstract: This paper provides an overview of the current challenges in wireless audio transmission and highlights the limitations of existing technologies regarding data bandwidth, data compression, latency, and inter-device compatibility. To address these shortcomings, it proposes a high-resolution, lossless audio transmission scheme utilizing ultra wideband (UWB) technology. UWB emerges as a promising solution by offering the necessary bandwidth to enable exceptional sound quality with ultra-low latency, making it ideal for real-time audio applications and addressing synchronization concerns in audio-visual use cases. Additionally, UWB’s unique capabilities extend beyond high-resolution audio, allowing for precise location tracking in augmented and virtual reality applications.
[389] Leveraging Discriminative Latent Representations for Conditioning GAN-Based Speech Enhancement
Shrishti Saha Shetu, Emanuël A. P. Habets, Andreas Brendel
Main category: eess.AS
TL;DR: DisCoGAN improves GAN-based speech enhancement in low-SNR scenarios by using discriminative model features as conditioning, outperforming existing methods across various noise conditions.
Details
Motivation: Current generative speech enhancement methods (GANs and diffusion models) struggle in very low signal-to-noise ratio (SNR) scenarios, which remain under-explored and challenging for both discriminative and generative state-of-the-art approaches.
Method: Proposes DisCoGAN, a method that leverages latent features extracted from discriminative speech enhancement models as generic conditioning features to improve GAN-based speech enhancement. Also evaluates various GAN architectures including end-to-end training, first-stage processing, and post-filtering approaches.
Result: DisCoGAN demonstrates performance improvements over baseline models, particularly in low-SNR scenarios, while maintaining competitive or superior performance in high-SNR conditions and on real-world recordings. Consistently outperforms existing methods.
Conclusion: The discriminative conditioning approach effectively enhances GAN-based speech enhancement, especially in challenging low-SNR conditions, with comprehensive ablation studies confirming the contributions of individual components.
Abstract: Generative speech enhancement methods based on generative adversarial networks (GANs) and diffusion models have shown promising results in various speech enhancement tasks. However, their performance in very low signal-to-noise ratio (SNR) scenarios remains under-explored and limited, as these conditions pose significant challenges to both discriminative and generative state-of-the-art methods. To address this, we propose a method that leverages latent features extracted from discriminative speech enhancement models as generic conditioning features to improve GAN-based speech enhancement. The proposed method, referred to as DisCoGAN, demonstrates performance improvements over baseline models, particularly in low-SNR scenarios, while also maintaining competitive or superior performance in high-SNR conditions and on real-world recordings. We also conduct a comprehensive evaluation of conventional GAN-based architectures, including GANs trained end-to-end, GANs as a first processing stage, and post-filtering GANs, as well as discriminative models under low-SNR conditions. We show that DisCoGAN consistently outperforms existing methods. Finally, we present an ablation study that investigates the contributions of individual components within DisCoGAN and analyzes the impact of the discriminative conditioning method on overall performance.
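The core conditioning idea can be sketched in a few lines: a frozen discriminative enhancer produces latent features that are concatenated channel-wise with the noisy input before the generator. This is an illustrative stand-in, not the DisCoGAN architecture; all module shapes and the fusion choice are assumptions.

```python
import torch
import torch.nn as nn

class DiscriminativeEncoder(nn.Module):
    """Stand-in for a pre-trained discriminative enhancement model."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(1, 16, kernel_size=3, padding=1)

    def forward(self, noisy_spec):
        return self.net(noisy_spec)        # latent conditioning features

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(1 + 16, 1, kernel_size=3, padding=1)

    def forward(self, noisy_spec, cond):
        return self.net(torch.cat([noisy_spec, cond], dim=1))

cond_net = DiscriminativeEncoder().eval()  # frozen during GAN training
gen = Generator()
noisy = torch.randn(4, 1, 257, 100)        # (batch, channel, freq, time)
with torch.no_grad():
    cond = cond_net(noisy)
enhanced = gen(noisy, cond)
```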
[390] Automatic Inspection Based on Switch Sounds of Electric Point Machines
Ayano Shibata, Toshiki Gunji, Mitsuaki Tsuda, Takashi Endo, Kota Dohi, Tomoya Nishida, Satoko Nomoto
Main category: eess.AS
TL;DR: IoT-based monitoring system using cameras and microphones to detect turnout switching errors in electric point machines through sound analysis, reducing visual inspections and enabling real-time failure detection.
Details
Motivation: To replace human inspections with automated monitoring for labor-saving and preventive maintenance, addressing the high cost of new sensors and difficulty in substituting electrical characteristic monitoring.
Method: Implementation of cameras and microphones in electric point machines to monitor lock-piece conditions remotely, with a specific focus on analyzing “switch sound” for detecting turnout switching errors.
Result: Expected test results were obtained from the proposed sound-based method for detecting equipment failures, demonstrating feasibility for real-time monitoring.
Conclusion: The sound-based monitoring approach successfully enables automated inspection of electronic point machines, reducing downtime from equipment failures and decreasing reliance on visual inspections.
Abstract: Since 2018, East Japan Railway Company and Hitachi, Ltd. have been working to replace human inspections with IoT-based monitoring. The purpose is to save the labor required for equipment inspections and to provide appropriate preventive maintenance. As an alternative to visual inspection, electrical characteristic monitoring has been difficult to substitute, and the introduction of new high-performance sensors has been costly. In 2019, we implemented cameras and microphones in “NS” electric point machines to reduce downtime from equipment failures, allowing for remote monitoring of lock-piece conditions. A method for detecting turnout switching errors based on sound information was proposed, and the expected test results were obtained. The proposed method will make it possible to detect equipment failures in real time, thereby reducing the need for visual inspections. This paper presents the results of our technical studies, begun in 2019, aimed at automating the inspection of electric point machines using sound, specifically focusing on the “switch sound”.
[391] Multilingual Dataset Integration Strategies for Robust Audio Deepfake Detection: A SAFE Challenge System
Hashim Ali, Surya Subramani, Lekha Bollinani, Nithin Sai Adupa, Sali El-Loh, Hafiz Malik
Main category: eess.AS
TL;DR: The paper presents an AASIST-based deepfake detection system using WavLM large frontend with RawBoost augmentation, achieving second place in SAFE Challenge for unmodified and laundered audio detection tasks.
Details
Motivation: To develop robust synthetic speech detection systems that can handle various audio conditions including unmodified, compressed, and laundered audio designed to evade detection.
Method: Systematically explored self-supervised learning front-ends, training data compositions, and audio length configurations. Used AASIST-based approach with WavLM large frontend and RawBoost augmentation, trained on multilingual dataset of 256,600 samples from multiple sources.
Result: Achieved second place in both Task 1 (unmodified audio detection) and Task 3 (laundered audio detection), demonstrating strong generalization and robustness across different audio conditions.
Conclusion: The proposed approach with SSL front-ends and comprehensive training data composition effectively addresses synthetic speech detection challenges, showing excellent performance in detecting both unmodified and intentionally obfuscated audio.
Abstract: The SAFE Challenge evaluates synthetic speech detection across three tasks: unmodified audio, processed audio with compression artifacts, and laundered audio designed to evade detection. We systematically explore self-supervised learning (SSL) front-ends, training data compositions, and audio length configurations for robust deepfake detection. Our AASIST-based approach incorporates WavLM large frontend with RawBoost augmentation, trained on a multilingual dataset of 256,600 samples spanning 9 languages and over 70 TTS systems from CodecFake, MLAAD v5, SpoofCeleb, Famous Figures, and MAILABS. Through extensive experimentation with different SSL front-ends, three training data versions, and two audio lengths, we achieved second place in both Task 1 (unmodified audio detection) and Task 3 (laundered audio detection), demonstrating strong generalization and robustness.
[392] Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech
Jingyuan Xing, Zhipeng Li, Jialong Mai, Xiaofen Xing, Xiangmin Xu
Main category: eess.AS
TL;DR: A novel zero-shot TTS framework combining autoregressive and non-autoregressive modules to better capture acoustic-semantic relationships, achieving superior quality and efficiency.
Details
Motivation: Existing zero-shot TTS models struggle with capturing complex correlations between acoustic and semantic features, leading to lack of expressiveness and similarity in synthesized speech.
Method: Proposes Parallel GPT framework with AR model using Parallel Tokenizer for simultaneous semantic/acoustic token synthesis, and Coupled NAR model for detailed token prediction based on AR output.
Result: Significantly outperforms existing zero-shot TTS models in both quality and efficiency on English and Chinese datasets.
Conclusion: The parallel AR-NAR architecture effectively harmonizes independence and interdependence of acoustic-semantic information, advancing zero-shot TTS performance.
Abstract: Advances in speech representation and large language models have enhanced zero-shot text-to-speech (TTS) performance. However, existing zero-shot TTS models face challenges in capturing the complex correlations between acoustic and semantic features, resulting in a lack of expressiveness and similarity. The primary reason lies in the complex relationship between semantic and acoustic features, which manifests both independent and interdependent aspects. This paper introduces a TTS framework that combines both autoregressive (AR) and non-autoregressive (NAR) modules to harmonize the independence and interdependence of acoustic and semantic information. The AR model leverages the proposed Parallel Tokenizer to synthesize the top semantic and acoustic tokens simultaneously. In contrast, considering the interdependence, the Coupled NAR model predicts detailed tokens based on the general AR model’s output. Parallel GPT, built on this architecture, is designed to improve zero-shot text-to-speech synthesis through its parallel structure. Experiments on English and Chinese datasets demonstrate that the proposed model significantly outperforms existing zero-shot TTS models in both synthesis quality and efficiency. Speech demos are available at https://t1235-ch.github.io/pgpt/.
[393] Is Audio Spoof Detection Robust to Laundering Attacks?
Hashim Ali, Surya Subramani, Shefali Sudhir, Raksha Varahamurthy, Hafiz Malik
Main category: eess.AS
TL;DR: SOTA voice spoof detection systems perform poorly against aggressive laundering attacks like reverberation and noise, highlighting the need for more robust detection methods.
Details
Motivation: Voice cloning technology has advanced significantly, enabling high-quality synthesized speech that can be abused. While detection methods exist, they are primarily tested on clean audio databases, leaving them vulnerable to real-world laundering attacks.
Method: Created ASVSpoof Laundering Database based on ASVSpoof 2019 (LA) eval database with 1388.22 hours of audio. Evaluated seven SOTA audio spoof detection approaches against various laundering attacks.
Result: SOTA systems showed poor performance against aggressive laundering attacks, particularly reverberation and additive noise attacks.
Conclusion: Current voice spoof detection methods are not robust enough against real-world laundering attacks, indicating a critical need for developing more resilient detection systems.
Abstract: Voice-cloning (VC) systems have seen an exceptional increase in the realism of synthesized speech in recent years. The high quality of synthesized speech and the availability of low-cost VC services have given rise to many potential abuses of this technology. Several detection methodologies have been proposed over the years that can detect voice spoofs with reasonably good accuracy. However, these methodologies are mostly evaluated on clean audio databases, such as ASVSpoof 2019. This paper evaluates SOTA Audio Spoof Detection approaches in the presence of laundering attacks. In that regard, a new laundering attack database, called the ASVSpoof Laundering Database, is created. This database is based on the ASVSpoof 2019 (LA) eval database comprising a total of 1388.22 hours of audio recordings. Seven SOTA audio spoof detection approaches are evaluated on this laundered database. The results indicate that SOTA systems perform poorly in the presence of aggressive laundering attacks, especially reverberation and additive noise attacks. This suggests the need for robust audio spoof detection.
eess.IV
[394] A Machine Learning Approach to Volumetric Computations of Solid Pulmonary Nodules
Yihan Zhou, Haocheng Huang, Yue Yu, Jianhui Shang
Main category: eess.IV
TL;DR: Advanced 3D CNN framework with subtype-specific bias correction achieves 8.0% error in lung nodule volume estimation, outperforming existing methods by 17+ percentage points with 3x faster processing.
Details
Motivation: Early lung cancer detection requires accurate volumetric assessment of pulmonary nodules, but traditional methods like CTR and spherical approximation produce inconsistent estimates due to nodule shape and density variability.
Method: Multi-scale 3D convolutional neural network combined with subtype-specific bias correction, trained and evaluated on 364 cases from Shanghai Chest Hospital.
Result: Achieved mean absolute deviation of 8.0% compared to manual nonlinear regression, with inference times under 20 seconds per scan - significantly better than existing deep learning/semi-automated methods (25-30% error, 60+ seconds processing).
Conclusion: The framework provides highly accurate, efficient, and scalable tool for clinical lung nodule screening and monitoring, with strong potential to improve early lung cancer detection.
Abstract: Early detection of lung cancer is crucial for effective treatment and relies on accurate volumetric assessment of pulmonary nodules in CT scans. Traditional methods, such as consolidation-to-tumor ratio (CTR) and spherical approximation, are limited by inconsistent estimates due to variability in nodule shape and density. We propose an advanced framework that combines a multi-scale 3D convolutional neural network (CNN) with subtype-specific bias correction for precise volume estimation. The model was trained and evaluated on a dataset of 364 cases from Shanghai Chest Hospital. Our approach achieved a mean absolute deviation of 8.0 percent compared to manual nonlinear regression, with inference times under 20 seconds per scan. This method outperforms existing deep learning and semi-automated pipelines, which typically have errors of 25 to 30 percent and require over 60 seconds for processing. Our results show a reduction in error by over 17 percentage points and a threefold acceleration in processing speed. These advancements offer a highly accurate, efficient, and scalable tool for clinical lung nodule screening and monitoring, with promising potential for improving early lung cancer detection.
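The subtype-specific bias correction amounts to rescaling the network's volume estimate by a factor fitted per nodule subtype on held-out data; the sketch below uses hypothetical subtype names and factor values.

```python
# Hypothetical per-subtype correction factors fitted on a validation set.
CORRECTION = {"solid": 0.97, "part_solid": 1.05, "ground_glass": 1.12}

def corrected_volume(cnn_volume_mm3: float, subtype: str) -> float:
    """Rescale the CNN volume estimate by the subtype's fitted factor."""
    return cnn_volume_mm3 * CORRECTION.get(subtype, 1.0)

print(corrected_volume(512.0, "part_solid"))  # 537.6
```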
[395] Data-Efficient Point Cloud Semantic Segmentation Pipeline for Unimproved Roads
Andrew Yarovoi, Christopher R. Valenta
Main category: eess.IV
TL;DR: Two-stage training framework for point cloud segmentation that achieves significant performance improvements with only 50 labeled point clouds by combining multi-dataset pre-training with in-domain fine-tuning.
Details
Motivation: To address the challenge of robust 3D semantic segmentation in low-data scenarios, particularly for unimproved roads and other classes where labeled data is scarce and expensive to obtain.
Method: Two-stage approach: 1) Pre-train projection-based CNN on mixture of public urban datasets and small curated in-domain data, 2) Fine-tune lightweight prediction head exclusively on in-domain data. Explores Point Prompt Training for batch normalization, Manifold Mixup regularization, and histogram-normalized ambients.
Result: Improves mean IoU from 33.5% to 51.8% and overall accuracy from 85.5% to 90.8% compared to naive training, using only 50 labeled point clouds from target domain.
Conclusion: Pre-training across multiple datasets is crucial for improving generalization and enabling robust segmentation under limited supervision, providing a practical framework for challenging low-data 3D segmentation scenarios.
Abstract: In this case study, we present a data-efficient point cloud segmentation pipeline and training framework for robust segmentation of unimproved roads and seven other classes. Our method employs a two-stage training framework: first, a projection-based convolutional neural network is pre-trained on a mixture of public urban datasets and a small, curated in-domain dataset; then, a lightweight prediction head is fine-tuned exclusively on in-domain data. Along the way, we explore the application of Point Prompt Training to batch normalization layers and the effects of Manifold Mixup as a regularizer within our pipeline. We also explore the effects of incorporating histogram-normalized ambients to further boost performance. Using only 50 labeled point clouds from our target domain, we show that our proposed training approach improves mean Intersection-over-Union from 33.5% to 51.8% and the overall accuracy from 85.5% to 90.8%, when compared to naive training on the in-domain data. Crucially, our results demonstrate that pre-training across multiple datasets is key to improving generalization and enabling robust segmentation under limited in-domain supervision. Overall, this study demonstrates a practical framework for robust 3D semantic segmentation in challenging, low-data scenarios. Our code is available at: https://github.com/andrewyarovoi/MD-FRNet.
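Structurally, the two-stage recipe looks like the following sketch (placeholder modules; channel counts and optimizers are our assumptions): stage one trains backbone and head on the mixed datasets, stage two freezes the backbone and fine-tunes only the head on the roughly 50 in-domain clouds.

```python
import torch
import torch.nn as nn

# Placeholder for a projection-based CNN over range-image projections.
backbone = nn.Sequential(nn.Conv2d(5, 64, 3, padding=1), nn.ReLU())
head = nn.Conv2d(64, 8, 1)                 # 8 classes incl. unimproved roads

# Stage 1: pre-train backbone + head on public urban + curated in-domain data.
stage1_opt = torch.optim.Adam(
    list(backbone.parameters()) + list(head.parameters()), lr=1e-3)
# ... training loop over the mixed dataset ...

# Stage 2: freeze the backbone, fine-tune the lightweight head in-domain only.
for p in backbone.parameters():
    p.requires_grad = False
stage2_opt = torch.optim.Adam(head.parameters(), lr=1e-4)
# ... training loop over the ~50 labeled in-domain point clouds ...
```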
[396] Global Motion Corresponder for 3D Point-Based Scene Interpolation under Large Motion
Junru Lin, Chirag Vashist, Mikaela Angelina Uy, Colton Stearns, Xuan Luo, Leonidas Guibas, Ke Li
Main category: eess.IV
TL;DR: GMC introduces a novel approach for 3D scene interpolation that handles large global motions using SE(3) mappings to a shared canonical space, outperforming existing methods that fail with large displacements.
Details
Motivation: Existing dynamic scene interpolation methods fail when motion between timesteps is large, as they rely on small-motion assumptions and linear approximations that break down with significant displacements.
Method: GMC learns unary potential fields that predict SE(3) mappings into a shared canonical space, balancing correspondence, spatial and semantic smoothness, and local rigidity constraints.
Result: The method significantly outperforms existing baselines on 3D scene interpolation with large global motions and enables extrapolation capabilities that other methods cannot achieve.
Conclusion: GMC provides a robust solution for handling large motion in dynamic scene interpolation through canonical space mapping, overcoming limitations of conventional small-motion assumption techniques.
Abstract: Existing dynamic scene interpolation methods typically assume that the motion between consecutive timesteps is small enough so that displacements can be locally approximated by linear models. In practice, even slight deviations from this small-motion assumption can cause conventional techniques to fail. In this paper, we introduce Global Motion Corresponder (GMC), a novel approach that robustly handles large motion and achieves smooth transitions. GMC learns unary potential fields that predict SE(3) mappings into a shared canonical space, balancing correspondence, spatial and semantic smoothness, and local rigidity. We demonstrate that our method significantly outperforms existing baselines on 3D scene interpolation when the two states undergo large global motions. Furthermore, our method enables extrapolation capabilities where other baseline methods cannot.
[397] Is the medical image segmentation problem solved? A survey of current developments and future directions
Guoping Xu, Jayaram K. Udupa, Jax Luo, Songlin Zhao, Yajun Yu, Scott B. Raymond, Hao Peng, Lipeng Ning, Yogesh Rathi, Wei Liu, You Zhang
Main category: eess.IV
TL;DR: Comprehensive review of deep learning-based medical image segmentation progress over the past decade, examining 7 key dimensions including learning paradigms, task evolution, modality integration, and architectural advancements.
Details
Motivation: To assess the extent to which current models have overcome persistent challenges in medical image segmentation and identify remaining gaps, providing a holistic overview of the field's trajectory.
Method: In-depth review organized around seven key dimensions: learning paradigms (supervised to semi-/unsupervised), task evolution (organ to lesion segmentation), modality integration, foundation models, probabilistic approaches, dimensionality progression (2D to 4D), and agent-based segmentation.
Result: The review traces progress across encoder, bottleneck, skip connections, and decoder components, examining core principles like multiscale analysis and attention mechanisms, while maintaining an updated repository of literature and open-source resources.
Conclusion: The comprehensive analysis provides insights into the trajectory of deep learning-based medical image segmentation and aims to inspire future innovation in the field.
Abstract: Medical image segmentation has advanced rapidly over the past two decades, largely driven by deep learning, which has enabled accurate and efficient delineation of cells, tissues, organs, and pathologies across diverse imaging modalities. This progress raises a fundamental question: to what extent have current models overcome persistent challenges, and what gaps remain? In this work, we provide an in-depth review of medical image segmentation, tracing its progress and key developments over the past decade. We examine core principles, including multiscale analysis, attention mechanisms, and the integration of prior knowledge, across the encoder, bottleneck, skip connections, and decoder components of segmentation networks. Our discussion is organized around seven key dimensions: (1) the shift from supervised to semi-/unsupervised learning, (2) the transition from organ segmentation to lesion-focused tasks, (3) advances in multi-modality integration and domain adaptation, (4) the role of foundation models and transfer learning, (5) the move from deterministic to probabilistic segmentation, (6) the progression from 2D to 3D and 4D segmentation, and (7) the trend from model invocation to segmentation agents. Together, these perspectives provide a holistic overview of the trajectory of deep learning-based medical image segmentation and aim to inspire future innovation. To support ongoing research, we maintain a continually updated repository of relevant literature and open-source resources at https://github.com/apple1986/medicalSegReview
[398] UltraEar: a multicentric, large-scale database combining ultra-high-resolution computed tomography and clinical data for ear diseases
Ruowei Tang, Pengfei Zhao, Xiaoguang Li, Ning Xu, Yue Cheng, Mengshi Zhang, Zhixiang Wang, Zhengyu Zhang, Hongxia Yin, Heyu Ding, Shusheng Gong, Yuhe Liu, Zhenchang Wang
Main category: eess.IV
TL;DR: UltraEar Database is a large-scale multicentric repository of 0.1mm ultra-high-resolution CT images and clinical data for ear diseases, collected from 11 hospitals over 15 years, with standardized preprocessing and privacy protection.
Details
Motivation: Ear diseases affect billions globally with substantial health burdens, and CT imaging is crucial for diagnosis and treatment. There's a need for a comprehensive, high-resolution database to advance research and clinical applications in otology.
Method: Establishment of a multicentric database recruiting patients from 11 tertiary hospitals (2020-2035), integrating U-HRCT images, structured reports, clinical data, with standardized preprocessing pipelines for calibration, annotation, and segmentation. Data is anonymized and stored securely.
Result: Creation of UltraEar Database - an unprecedented ultra-high-resolution reference atlas covering various otologic disorders with technical fidelity and clinical relevance, ensuring data privacy compliance.
Conclusion: UltraEar provides a valuable resource for radiological research, AI algorithm development, educational training, and multi-institutional collaboration, with continuous updates planned for long-term accessibility to the global otologic research community.
Abstract: Ear diseases affect billions of people worldwide, leading to substantial health and socioeconomic burdens. Computed tomography (CT) plays a pivotal role in accurate diagnosis, treatment planning, and outcome evaluation. The objective of this study is to present the establishment and design of UltraEar Database, a large-scale, multicentric repository of isotropic 0.1 mm ultra-high-resolution CT (U-HRCT) images and associated clinical data dedicated to ear diseases. UltraEar recruits patients from 11 tertiary hospitals between October 2020 and October 2035, integrating U-HRCT images, structured CT reports, and comprehensive clinical information, including demographics, audiometric profiles, surgical records, and pathological findings. A broad spectrum of otologic disorders is covered, such as otitis media, cholesteatoma, ossicular chain malformation, temporal bone fracture, inner ear malformation, cochlear aperture stenosis, enlarged vestibular aqueduct, and sigmoid sinus bony deficiency. Standardized preprocessing pipelines have been developed for geometric calibration, image annotation, and multi-structure segmentation. All personal identifiers in DICOM headers and metadata are removed or anonymized to ensure compliance with data privacy regulation. Data collection and curation are coordinated through monthly expert panel meetings, with secure storage on an offline cloud system. UltraEar provides an unprecedented ultra-high-resolution reference atlas with both technical fidelity and clinical relevance. This resource has significant potential to advance radiological research, enable development and validation of AI algorithms, serve as an educational tool for training in otologic imaging, and support multi-institutional collaborative studies. UltraEar will be continuously updated and expanded, ensuring long-term accessibility and usability for the global otologic research community.
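As a flavor of the header anonymization step (not the study's actual pipeline), the pydicom sketch below blanks common identifying keywords and strips vendor-private tags; the keyword list is illustrative, not exhaustive.

```python
import pydicom

IDENTIFYING_KEYWORDS = [
    "PatientName", "PatientID", "PatientBirthDate",
    "PatientAddress", "InstitutionName", "ReferringPhysicianName",
]

def anonymize(in_path: str, out_path: str) -> None:
    ds = pydicom.dcmread(in_path)
    for keyword in IDENTIFYING_KEYWORDS:
        if keyword in ds:
            ds.data_element(keyword).value = ""  # blank out the identifier
    ds.remove_private_tags()                     # drop vendor-private tags
    ds.save_as(out_path)

# anonymize("raw/ear_001.dcm", "anon/ear_001.dcm")  # hypothetical paths
```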
[399] Efficient and Privacy-Protecting Background Removal for 2D Video Streaming using iPhone 15 Pro Max LiDAR
Jessica Kinnevan, Naifa Alqahtani, Toral Chauhan
Main category: eess.IV
TL;DR: Using iPhone 15 Pro Max’s LiDAR for real-time background removal at 60fps, overcoming lighting limitations of traditional methods.
Details
Motivation: Traditional background removal techniques like chroma keying and AI models are dependent on lighting conditions and perform poorly in low-light environments. LiDAR's depth-based approach provides lighting-independent background removal.
Method: Integrated iPhone 15 Pro Max’s LiDAR and color cameras with GPU-based image processing using SwiftUI, Swift frameworks, and Metal Shader Language for real-time enhancement at 60fps.
Result: Successfully achieved real-time background removal at 60fps streaming rate, though limited by depth map resolution of 320x240 due to bandwidth constraints and material reflection limitations.
Conclusion: LiDAR technology shows promise as a superior background removal method for mobile devices. If resolution can match color image quality, it could become the dominant approach for video and photography applications.
Abstract: Light Detection and Ranging (LiDAR) technology in consumer-grade mobile devices can be used as a replacement for traditional background removal and compositing techniques. Unlike approaches such as chroma keying and trained AI models, LiDAR’s depth information is independent of subject lighting, and performs equally well in low-light and well-lit environments. We integrate the LiDAR and color cameras on the iPhone 15 Pro Max with GPU-based image processing. We use Apple’s SwiftUI and Swift frameworks for user interface and backend development, and Metal Shader Language (MSL) for realtime image enhancement at the standard iPhone streaming frame rate of 60 frames per second. The only meaningful limitations of the technology are the streaming bandwidth of the depth data, which currently reduces the depth map resolution to 320x240, and any pre-existing limitations of the LiDAR IR laser to reflect accurate depth from some materials. If the LiDAR resolution on a mobile device like the iPhone can be improved to match the color image resolution, LiDAR could feasibly become the preeminent method of background removal for video applications and photography.
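The depth-based keying itself is simple once the LiDAR map is upsampled to the color resolution: keep pixels within a depth threshold and make the rest transparent. The NumPy sketch below illustrates the principle only (the app described above does this per frame in a Metal shader); the threshold value is an assumption.

```python
import numpy as np

def remove_background(color: np.ndarray, depth_m: np.ndarray,
                      max_depth_m: float = 1.5) -> np.ndarray:
    # color: (H, W, 4) RGBA frame; depth_m: (H, W) depth in meters,
    # upscaled from the 320x240 LiDAR map to the color resolution.
    out = color.copy()
    out[depth_m > max_depth_m, 3] = 0    # far pixels become transparent
    return out

frame = np.random.randint(0, 255, (1080, 1920, 4), dtype=np.uint8)
depth = np.random.uniform(0.3, 5.0, (1080, 1920)).astype(np.float32)
composited = remove_background(frame, depth)
```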
[400] GENRE-CMR: Generalizable Deep Learning for Diverse Multi-Domain Cardiac MRI Reconstruction
Kian Anvari Hamedani, Narges Razizadeh, Shahabedin Nabavi, Mohsen Ebrahimi Moghaddam
Main category: eess.IV
TL;DR: GENRE-CMR is a GAN-based deep unrolled reconstruction framework that improves cardiac MRI reconstruction quality and generalization across diverse acquisition settings using residual connections, Edge-Aware Region loss, and Statistical Distribution Alignment loss.
Details
Motivation: Accelerated CMR image reconstruction faces challenges in balancing scan time and image quality, especially when generalizing across different acquisition settings. There's a need for robust reconstruction methods that maintain high fidelity across diverse protocols.
Method: Proposes GENRE-CMR, a generative adversarial network with residual deep unrolled reconstruction framework. Uses cascade of convolutional subnetworks with residual connections for progressive feature propagation. Integrates two novel loss functions: Edge-Aware Region (EAR) loss to focus on structurally informative regions and prevent blurriness, and Statistical Distribution Alignment (SDA) loss using symmetric KL divergence to regularize feature space across data distributions.
Result: Achieves state-of-the-art performance with 0.9552 SSIM and 38.90 dB PSNR on unseen data distributions across various acceleration factors and sampling trajectories. Outperforms existing methods on both training and unseen data. Ablation studies confirm the contribution of each component to reconstruction quality and generalization.
Conclusion: GENRE-CMR provides a unified and robust solution for high-quality CMR reconstruction that can be clinically deployed across heterogeneous acquisition protocols, demonstrating superior generalization capabilities compared to existing methods.
Abstract: Accelerated Cardiovascular Magnetic Resonance (CMR) image reconstruction remains a critical challenge due to the trade-off between scan time and image quality, particularly when generalizing across diverse acquisition settings. We propose GENRE-CMR, a generative adversarial network (GAN)-based architecture employing a residual deep unrolled reconstruction framework to enhance reconstruction fidelity and generalization. The architecture unrolls iterative optimization into a cascade of convolutional subnetworks, enriched with residual connections to enable progressive feature propagation from shallow to deeper stages. To further improve performance, we integrate two loss functions: (1) an Edge-Aware Region (EAR) loss, which guides the network to focus on structurally informative regions and helps prevent common reconstruction blurriness; and (2) a Statistical Distribution Alignment (SDA) loss, which regularizes the feature space across diverse data distributions via a symmetric KL divergence formulation. Extensive experiments confirm that GENRE-CMR surpasses state-of-the-art methods on training and unseen data, achieving 0.9552 SSIM and 38.90 dB PSNR on unseen distributions across various acceleration factors and sampling trajectories. Ablation studies confirm the contribution of each proposed component to reconstruction quality and generalization. Our framework presents a unified and robust solution for high-quality CMR reconstruction, paving the way for clinically adaptable deployment across heterogeneous acquisition protocols.
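To give the SDA term some shape: a symmetric KL between batch feature statistics from two acquisition domains can be written as below. Treating features as diagonal Gaussians is our assumption for the sketch; the paper only specifies a symmetric KL formulation.

```python
import torch

def symmetric_kl_gaussian(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    # feat_*: (N, D) feature batches from two data distributions.
    mu_a, var_a = feat_a.mean(0), feat_a.var(0) + 1e-6
    mu_b, var_b = feat_b.mean(0), feat_b.var(0) + 1e-6
    # KL(a||b) and KL(b||a) for diagonal Gaussians, averaged symmetrically.
    kl_ab = 0.5 * (var_a / var_b + (mu_a - mu_b) ** 2 / var_b - 1
                   + torch.log(var_b / var_a))
    kl_ba = 0.5 * (var_b / var_a + (mu_b - mu_a) ** 2 / var_a - 1
                   + torch.log(var_a / var_b))
    return 0.5 * (kl_ab + kl_ba).sum()

loss_sda = symmetric_kl_gaussian(torch.randn(32, 128), torch.randn(32, 128) + 0.5)
```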
[401] Efficient Fine-Tuning of DINOv3 Pretrained on Natural Images for Atypical Mitotic Figure Classification in MIDOG 2025
Guillaume Balezo, Raphaël Bourgade, Thomas Walter
Main category: eess.IV
TL;DR: DINOv3-H+ vision transformer pretrained on natural images achieves strong performance (0.8871 balanced accuracy) on atypical mitotic figure classification in histopathology when fine-tuned with LoRA and augmentation, despite domain gap.
Details
Motivation: Atypical mitotic figures (AMFs) are important prognostic markers but difficult to detect due to low prevalence, subtle morphology, and inter-observer variability. The MIDOG 2025 challenge provides a benchmark for AMF classification across domains.
Method: Fine-tuned the DINOv3-H+ vision transformer (pretrained on natural images) using low-rank adaptation (LoRA, 650k trainable parameters) and extensive data augmentation.
Result: Achieved balanced accuracy of 0.8871 on preliminary test set, demonstrating effective transfer learning from natural images to histopathology despite domain differences.
Conclusion: DINOv3 pretraining combined with parameter-efficient fine-tuning provides a strong baseline for atypical mitosis classification, highlighting the robustness of the approach for the MIDOG 2025 challenge.
Abstract: Atypical mitotic figures (AMFs) are markers of abnormal cell division associated with poor prognosis, yet their detection remains difficult due to low prevalence, subtle morphology, and inter-observer variability. The MIDOG 2025 challenge introduces a benchmark for AMF classification across multiple domains. In this work, we evaluate the recently published DINOv3-H+ vision transformer, pretrained on natural images, which we fine-tuned using low-rank adaptation (LoRA, 650k trainable parameters) and extensive augmentation. Despite the domain gap, DINOv3 transfers effectively to histopathology, achieving a balanced accuracy of 0.8871 on the preliminary test set. These results highlight the robustness of DINOv3 pretraining and show that, when combined with parameter-efficient fine-tuning, it provides a strong baseline for atypical mitosis classification in MIDOG 2025.
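Low-rank adaptation of a frozen backbone is compact enough to sketch. The pure-PyTorch wrapper below is a generic illustration, not the authors' training code; the rank and scaling values are assumptions, and which projection layers to wrap is a design choice that determines whether the adapter budget lands near the reported ~650k trainable parameters.

```python
# Generic LoRA wrapper: freeze the pretrained linear layer and learn a
# low-rank update W_eff = W + (B @ A) * (alpha / rank).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze pretrained weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x)) * self.scale

# Usage sketch: wrapping, e.g., a ViT block's qkv projection leaves only
# the rank-r adapters trainable.
# block.attn.qkv = LoRALinear(block.attn.qkv, rank=8)
```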
[402] MTS-Net: Dual-Enhanced Positional Multi-Head Self-Attention for 3D CT Diagnosis of May-Thurner Syndrome
Yixin Huang, Yiqi Jin, Ke Tao, Kaijian Xia, Jianfeng Gu, Lei Yu, Haojie Li, Lan Du, Cunjian Chen
Main category: eess.IV
TL;DR: MTS-Net is a 3D deep learning framework with novel attention modules that achieves state-of-the-art performance for May-Thurner Syndrome diagnosis from CT scans, outperforming existing baselines with 0.79 accuracy and 0.84 AUC.
Details
Motivation: May-Thurner Syndrome affects over 20% of the population and increases deep venous thrombosis risk, but accurate CT-based diagnosis is challenging due to subtle anatomical variations and a lack of automated tools.
Method: Proposes MTS-Net, which builds on 3D ResNet-18 and embeds a novel dual-enhanced positional multi-head self-attention (DEP-MHSA) module that combines multi-scale convolution with positional embeddings integrated into both the attention weights and residual paths, enhancing the spatial context needed to detect venous compression.
Result: MTS-Net achieves 0.79 accuracy, 0.84 AUC, and 0.78 F1-score, outperforming 3D ResNet, DenseNet-BC, and BabyNet baselines. The authors also curated the first public MTS dataset with 747 gender-balanced subjects.
Conclusion: The work introduces both a new diagnostic architecture for MTS and provides the first publicly available benchmark dataset, facilitating future research in automated vascular syndrome detection.
Abstract: May-Thurner Syndrome (MTS) is a vascular condition that affects over 20% of the population and significantly increases the risk of iliofemoral deep venous thrombosis. Accurate and early diagnosis of MTS using computed tomography (CT) remains a clinical challenge due to the subtle anatomical compression and variability across patients. In this paper, we propose MTS-Net, an end-to-end 3D deep learning framework designed to capture spatial-temporal patterns from CT volumes for reliable MTS diagnosis. MTS-Net builds upon 3D ResNet-18 by embedding a novel dual-enhanced positional multi-head self-attention (DEP-MHSA) module into the Transformer encoder of the network’s final stages. The proposed DEP-MHSA employs multi-scale convolution and integrates positional embeddings into both attention weights and residual paths, enhancing spatial context preservation, which is crucial for identifying venous compression. To validate our approach, we curate the first publicly available dataset for MTS, MTS-CT, containing 747 gender-balanced subjects with standard and enhanced CT scans. Experimental results demonstrate that MTS-Net achieves an average of 0.79 accuracy, 0.84 AUC, and 0.78 F1-score, outperforming baseline models including 3D ResNet, DenseNet-BC, and BabyNet. Our work not only introduces a new diagnostic architecture for MTS but also provides a high-quality benchmark dataset to facilitate future research in automated vascular syndrome detection. We make our code and dataset publicly available at: https://github.com/Nutingnon/MTS_dep_mhsa.
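The abstract specifies where DEP-MHSA injects positional information (the attention weights and the residual path) and that it uses multi-scale convolution, but not the kernel sizes or exact layout; the PyTorch block below is an illustrative reading under those assumptions, operating on flattened 3D-volume tokens.

```python
# Illustrative DEP-MHSA-style block; kernel sizes and the placement of
# the positional embedding are assumptions, not the authors' exact design.
import torch
import torch.nn as nn

class DepMHSA(nn.Module):
    def __init__(self, dim: int, heads: int, seq_len: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # One learned positional embedding, used both to bias the
        # attention computation and to enrich the residual path.
        self.pos = nn.Parameter(torch.zeros(1, seq_len, dim))
        # Multi-scale 1D convolutions over the token sequence.
        self.conv3 = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.conv5 = nn.Conv1d(dim, dim, kernel_size=5, padding=2)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                    # x: (B, N, C) flattened tokens
        q = self.norm(x + self.pos)          # position enters the attention
        attn_out, _ = self.attn(q, q, q)
        t = x.transpose(1, 2)                # (B, C, N) for Conv1d
        ms = (self.conv3(t) + self.conv5(t)).transpose(1, 2)
        # Residual path also carries the positional embedding.
        return x + self.pos + attn_out + ms
```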