Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 141]
- cs.CV [Total: 304]
- cs.AI [Total: 79]
- cs.SD [Total: 18]
- cs.LG [Total: 220]
- cs.MA [Total: 6]
- cs.MM [Total: 4]
- eess.AS [Total: 10]
- eess.IV [Total: 22]
cs.CL
[1] Advancing Mental Disorder Detection: A Comparative Evaluation of Transformer and LSTM Architectures on Social Media
Khalid Hasan, Jamil Saquer, Mukulika Ghosh
Main category: cs.CL
TL;DR: The study evaluates transformer models (BERT, RoBERTa, etc.) and LSTM approaches for mental health disorder classification on Reddit, showing transformers’ superiority, especially RoBERTa, with high F1 scores. LSTM with BERT embeddings also performed well, balancing accuracy and resource use.
Details
Motivation: The increasing prevalence of mental health disorders calls for automated, scalable tools for early detection and monitoring, leveraging NLP advancements.
Method: The study compares transformer models (BERT, RoBERTa, etc.) and LSTM approaches using a large annotated Reddit dataset, validated via statistical and topic modeling analysis.
Result: RoBERTa achieved the highest F1 scores (99.54% on hold-out, 96.05% on external test sets). LSTM with BERT embeddings also performed competitively (F1 >94%) with lower computational costs.
Conclusion: Transformer models, particularly RoBERTa, are highly effective for mental health monitoring, with LSTM+BERT offering a resource-efficient alternative. Findings support real-time clinical and digital mental health applications.
Abstract: The rising prevalence of mental health disorders necessitates the development of robust, automated tools for early detection and monitoring. Recent advances in Natural Language Processing (NLP), particularly transformer-based architectures, have demonstrated significant potential in text analysis. This study provides a comprehensive evaluation of state-of-the-art transformer models (BERT, RoBERTa, DistilBERT, ALBERT, and ELECTRA) against Long Short-Term Memory (LSTM) based approaches using different text embedding techniques for mental health disorder classification on Reddit. We construct a large annotated dataset, validating its reliability through statistical judgmental analysis and topic modeling. Experimental results demonstrate the superior performance of transformer models over traditional deep-learning approaches. RoBERTa achieved the highest classification performance, with a 99.54% F1 score on the hold-out test set and a 96.05% F1 score on the external test set. Notably, LSTM models augmented with BERT embeddings proved highly competitive, achieving F1 scores exceeding 94% on the external dataset while requiring significantly fewer computational resources. These findings highlight the effectiveness of transformer-based models for real-time, scalable mental health monitoring. We discuss the implications for clinical applications and digital mental health interventions, offering insights into the capabilities and limitations of state-of-the-art NLP methodologies in mental disorder detection.
[2] Setting The Table with Intent: Intent-aware Schema Generation and Editing for Literature Review Tables
Vishakh Padmakumar, Joseph Chee Chang, Kyle Lo, Doug Downey, Aakanksha Naik
Main category: cs.CL
TL;DR: The paper addresses ambiguity and lack of refinement in schema generation for academic literature using LLMs, introducing a dataset with synthesized intents and editing techniques to improve schema quality.
Details
Motivation: The need to organize and compare academic documents efficiently, hindered by ambiguity in evaluations and lack of refinement methods.
Method: Augments unannotated table corpora with synthesized intents for dataset creation, benchmarks LLM-based schema generation, and proposes editing techniques.
Result: Incorporating table intents improves schema reconstruction; fine-tuned smaller models compete with prompted LLMs, and editing further enhances schemas.
Conclusion: The proposed methods reduce ambiguity and improve schema generation, demonstrating the potential of smaller models and editing techniques.
Abstract: The increasing volume of academic literature makes it essential for researchers to organize, compare, and contrast collections of documents. Large language models (LLMs) can support this process by generating schemas defining shared aspects along which to compare papers. However, progress on schema generation has been slow due to: (i) ambiguity in reference-based evaluations, and (ii) lack of editing/refinement methods. Our work is the first to address both issues. First, we present an approach for augmenting unannotated table corpora with synthesized intents and apply it to create a dataset for studying schema generation conditioned on a given information need, thus reducing ambiguity. With this dataset, we show how incorporating table intents significantly improves baseline performance in reconstructing reference schemas. Next, we propose several LLM-based schema editing techniques. We start by comprehensively benchmarking several single-shot schema generation methods, including prompted LLM workflows and fine-tuned models, showing that smaller, open-weight models can be fine-tuned to be competitive with state-of-the-art prompted LLMs. Then we demonstrate that our editing techniques can further improve schemas generated by these methods.
[3] Mind the Language Gap in Digital Humanities: LLM-Aided Translation of SKOS Thesauri
Felix Kraus, Nicolas Blumenröhr, Danah Tonne, Achim Streit
Main category: cs.CL
TL;DR: WOKIE is an open-source pipeline for automated translation of SKOS thesauri, combining external translation services and LLMs to improve accessibility and interoperability in Digital Humanities.
Details
Motivation: Addresses language diversity barriers in DH, limiting access and reuse of knowledge resources.
Method: Uses external translation services and LLMs for refinement, balancing quality, scalability, and cost. Evaluated across 15 languages.
Result: Improves accessibility, reuse, and cross-lingual interoperability of thesauri.
Conclusion: WOKIE supports inclusive, multilingual research infrastructures through hurdle-free automated translation and improved ontology matching.
Abstract: We introduce WOKIE, an open-source, modular, and ready-to-use pipeline for the automated translation of SKOS thesauri. This work addresses a critical need in the Digital Humanities (DH), where language diversity can limit access, reuse, and semantic interoperability of knowledge resources. WOKIE combines external translation services with targeted refinement using Large Language Models (LLMs), balancing translation quality, scalability, and cost. Designed to run on everyday hardware and be easily extended, the application requires no prior expertise in machine translation or LLMs. We evaluate WOKIE across several DH thesauri in 15 languages with different parameters, translation services and LLMs, systematically analysing translation quality, performance, and ontology matching improvements. Our results show that WOKIE is suitable to enhance the accessibility, reuse, and cross-lingual interoperability of thesauri by hurdle-free automated translation and improved ontology matching performance, supporting more inclusive and multilingual research infrastructures.
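To make the two-step design concrete, here is a minimal, hypothetical sketch of a WOKIE-style label translation step in Python. The function names, prompt wording, and service interfaces are assumptions for illustration, not the project's actual API.

```python
# Hypothetical sketch of a WOKIE-style two-step translation of one SKOS label:
# a cheap external MT service produces a draft, then an LLM refines it.
# `mt_translate` and `llm_complete` stand in for whatever services are configured.
def translate_label(label: str, target_lang: str, mt_translate, llm_complete) -> str:
    draft = mt_translate(label, target_lang)  # external translation service
    prompt = (
        f"The thesaurus term '{label}' was machine-translated into "
        f"{target_lang} as '{draft}'. If the draft is an accurate, idiomatic "
        f"thesaurus label, return it unchanged; otherwise return a corrected "
        f"label. Return only the label."
    )
    return llm_complete(prompt).strip()  # targeted LLM refinement
```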
[4] Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning
Shengyuan Wang, Jie Feng, Tianhui Liu, Dan Pei, Yong Li
Main category: cs.CL
TL;DR: The paper addresses geospatial hallucinations in LLMs, proposing an evaluation framework and a KTO-based mitigation method that improves benchmark performance by over 29.6%.
Details
Motivation: LLMs often generate inaccurate geospatial knowledge, compromising reliability, yet systematic evaluation and mitigation of such hallucinations are underexplored.
Method: A comprehensive evaluation framework using geospatial knowledge graphs and a dynamic factuality aligning method (KTO-based) to mitigate hallucinations.
Result: Evaluation of 20 advanced LLMs revealed geospatial hallucinations, and the KTO method improved performance by over 29.6%.
Conclusion: The proposed benchmark and learning algorithm effectively enhance LLM trustworthiness in geospatial tasks.
Abstract: Large language models (LLMs) possess extensive world knowledge, including geospatial knowledge, which has been successfully applied to various geospatial tasks such as mobility prediction and social indicator prediction. However, LLMs often generate inaccurate geospatial knowledge, leading to geospatial hallucinations (incorrect or inconsistent representations of geospatial information) that compromise their reliability. While the phenomenon of general knowledge hallucination in LLMs has been widely studied, the systematic evaluation and mitigation of geospatial hallucinations remain largely unexplored. To address this gap, we propose a comprehensive evaluation framework for geospatial hallucinations, leveraging structured geospatial knowledge graphs for controlled assessment. Through extensive evaluation across 20 advanced LLMs, we uncover the hallucinations in their geospatial knowledge. Building on these insights, we introduce a dynamic factuality aligning method based on Kahneman-Tversky Optimization (KTO) to mitigate geospatial hallucinations in LLMs, leading to a performance improvement of over 29.6% on the proposed benchmark. Extensive experimental results demonstrate the effectiveness of our benchmark and learning algorithm in enhancing the trustworthiness of LLMs in geospatial knowledge and reasoning tasks.
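For readers unfamiliar with KTO, the sketch below shows a simplified version of the Kahneman-Tversky Optimization objective, which needs only binary desirable/undesirable labels rather than preference pairs. The weighting and the dynamic, factuality-specific elements of the paper's method are not reflected here; this is a generic KTO sketch under common formulations.

```python
import torch

# Simplified KTO-style loss (a sketch, not the paper's exact objective).
# logp_policy / logp_ref: per-sample log-likelihoods under the trained and
# reference models; desirable: boolean tensor marking factual responses.
def kto_loss(logp_policy, logp_ref, desirable, beta=0.1, kl_estimate=0.0):
    reward = beta * (logp_policy - logp_ref)     # implicit reward
    value = torch.where(
        desirable,
        torch.sigmoid(reward - kl_estimate),     # desirable: reward above reference point
        torch.sigmoid(kl_estimate - reward),     # undesirable: penalized symmetrically
    )
    return (1.0 - value).mean()                  # maximize the KT value function
```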
[5] Efficient Attention Mechanisms for Large Language Models: A Survey
Yutao Sun, Zhenyu Li, Yike Zhang, Tengyu Pan, Bowen Dong, Yuyi Guo, Jianyong Wang
Main category: cs.CL
TL;DR: A survey on efficient attention mechanisms in Transformer-based models to address quadratic complexity issues, covering linear and sparse attention methods, and their integration into large-scale language models.
Details
Motivation: The quadratic time and memory complexity of self-attention in Transformers hinders efficient long-context modeling, prompting research into scalable solutions.
Method: The paper reviews two categories of efficient attention: linear attention (kernel approximations, recurrent formulations) and sparse attention (fixed patterns, clustering). It also discusses hybrid designs and hardware considerations.
Result: The survey systematically integrates algorithmic innovations and practical deployment strategies for efficient attention in large-scale models.
Conclusion: This work serves as a foundational reference for designing scalable and efficient language models by aligning theory with practice.
Abstract: Transformer-based architectures have become the prevailing backbone of large language models. However, the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling. To address this limitation, recent research has introduced two principal categories of efficient attention mechanisms. Linear attention methods achieve linear complexity through kernel approximations, recurrent formulations, or fast-weight dynamics, thereby enabling scalable inference with reduced computational overhead. Sparse attention techniques, in contrast, limit attention computation to selected subsets of tokens based on fixed patterns, block-wise routing, or clustering strategies, enhancing efficiency while preserving contextual coverage. This survey provides a systematic and comprehensive overview of these developments, integrating both algorithmic innovations and hardware-level considerations. In addition, we analyze the incorporation of efficient attention into large-scale pre-trained language models, including both architectures built entirely on efficient attention and hybrid designs that combine local and global components. By aligning theoretical foundations with practical deployment strategies, this work aims to serve as a foundational reference for advancing the design of scalable and efficient language models.
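As an illustration of the linear-attention family the survey covers, the sketch below implements kernelized attention with the common elu(x)+1 feature map (one choice among many); by reassociating the matrix products it runs in O(N·d²) instead of O(N²·d).

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """Kernelized linear attention sketch: softmax(QK^T)V is replaced by
    phi(Q) (phi(K)^T V), with phi(x) = elu(x) + 1. Shapes: (batch, seq, dim)."""
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)            # sum_n phi(k_n) v_n^T
    z = torch.einsum("bnd,bd->bn", q, k.sum(dim=1))    # per-query normalizer
    return torch.einsum("bnd,bde->bne", q, kv) / (z.unsqueeze(-1) + 1e-6)
```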
[6] MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?
Muntasir Wahed, Xiaona Zhou, Kiet A. Nguyen, Tianjiao Yu, Nirav Diwan, Gang Wang, Dilek Hakkani-Tür, Ismini Lourentzou
Main category: cs.CL
TL;DR: The paper explores vulnerabilities in LLMs’ code generation under multi-turn adversarial prompts and introduces a benchmark for evaluation. Fine-tuning on MOCHA improves robustness.
Details
Motivation: To address underexplored robustness of LLMs against adversarial misuse in code generation, particularly through multi-turn malicious prompts.
Method: Introduces code decomposition attacks and a benchmark (MOCHA) to evaluate LLMs’ robustness. Uses fine-tuning on MOCHA for improvement.
Result: Empirical results show persistent vulnerabilities, especially in multi-turn scenarios. Fine-tuning on MOCHA enhances rejection rates by up to 32.4%.
Conclusion: Fine-tuning on MOCHA effectively improves LLM robustness against adversarial prompts without additional supervision.
Abstract: Recent advancements in Large Language Models (LLMs) have significantly enhanced their code generation capabilities. However, their robustness against adversarial misuse, particularly through multi-turn malicious coding prompts, remains underexplored. In this work, we introduce code decomposition attacks, where a malicious coding task is broken down into a series of seemingly benign subtasks across multiple conversational turns to evade safety filters. To facilitate systematic evaluation, we introduce MOCHA, a large-scale benchmark designed to evaluate the robustness of code LLMs against both single-turn and multi-turn malicious prompts. Empirical results across open- and closed-source models reveal persistent vulnerabilities, especially under multi-turn scenarios. Fine-tuning on MOCHA improves rejection rates while preserving coding ability, and importantly, enhances robustness on external adversarial datasets with up to 32.4% increase in rejection rates without any additional supervision.
[7] HITSZ’s End-To-End Speech Translation Systems Combining Sequence-to-Sequence Auto Speech Recognition Model and Indic Large Language Model for IWSLT 2025 in Indic Track
Xuchen Wei, Yangxin Wu, Yaoyin Zhang, Henglyu Liu, Kehai Chen, Xuefeng Bai, Min Zhang
Main category: cs.CL
TL;DR: HITSZ’s IWSLT 2025 submission introduces an end-to-end ST system combining Whisper ASR and Krutrim LLM for English-Indic translation, achieving notable BLEU scores, and explores CoT for further improvements.
Details
Motivation: Addressing low-resource challenges in English-Indic speech-to-text translation by leveraging pre-trained models.
Method: Integration of Whisper ASR and Krutrim LLM in an end-to-end system, with additional exploration of Chain-of-Thought (CoT) for quality enhancement.
Result: Average BLEU scores of 28.88 (English-to-Indic) and 27.86 (Indic-to-English); CoT showed potential (e.g., +13.84 BLEU for Tamil-English) but faced consistency issues.
Conclusion: The proposed system is effective for low-resource ST, with CoT offering promising yet inconsistent improvements.
Abstract: This paper presents HITSZ’s submission for the IWSLT 2025 Indic track, focusing on speech-to-text translation (ST) for English-to-Indic and Indic-to-English language pairs. To enhance translation quality in this low-resource scenario, we propose an end-to-end system integrating the pre-trained Whisper automated speech recognition (ASR) model with Krutrim, an Indic-specialized large language model (LLM). Experimental results demonstrate that our end-to-end system achieved average BLEU scores of 28.88 for English-to-Indic directions and 27.86 for Indic-to-English directions. Furthermore, we investigated the Chain-of-Thought (CoT) method. While this method showed potential for significant translation quality improvements on successfully parsed outputs (e.g. a 13.84 BLEU increase for Tamil-to-English), we observed challenges in ensuring the model consistently adheres to the required CoT output format.
[8] MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, Jan Niehues
Main category: cs.CL
TL;DR: MCIF is a new multilingual, multimodal benchmark for evaluating MLLMs across languages and modalities using human-annotated scientific talks.
Details
Motivation: Existing benchmarks lack comprehensive evaluation of multilingual and multimodal capabilities in MLLMs, limiting progress in the field.
Method: MCIF introduces a benchmark based on scientific talks, covering speech, vision, and text across four languages (English, German, Italian, Chinese) for short- and long-form contexts.
Result: MCIF enables thorough assessment of MLLMs’ instruction-following abilities in crosslingual, multimodal settings.
Conclusion: MCIF addresses gaps in current benchmarks and promotes open research in MLLM development.
Abstract: Recent advances in large language models have catalyzed the development of multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems to general-purpose instruction-following models, a key frontier lies in evaluating their multilingual and multimodal capabilities over both long and short contexts. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on one single modality at a time, rely on short-form contexts, or lack human annotations – hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first multilingual human-annotated benchmark based on scientific talks that is designed to evaluate instruction-following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities – speech, vision, and text – and four diverse languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of MLLMs’ abilities to interpret instructions across languages and combine them with multimodal contextual information. MCIF is released under a CC-BY 4.0 license to encourage open research and progress in MLLMs development.
[9] RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams
Andrei Vlad Man, Răzvan-Alexandru Smădu, Cristian-George Craciun, Dumitru-Clementin Cercel, Florin Pop, Mihaela-Claudia Cercel
Main category: cs.CL
TL;DR: The paper evaluates LLMs and VLMs for Romanian driving law tasks using the RoD-TAL dataset, showing improvements with fine-tuning and reasoning models but noting challenges in visual reasoning.
Details
Motivation: Address the need for AI tools in legal education for under-resourced languages like Romanian, focusing on driving law understanding.
Method: Introduces RoD-TAL dataset, uses RAG pipelines, dense retrievers, and reasoning-optimized models for IR, QA, Visual IR, and Visual QA tasks.
Result: Domain-specific fine-tuning boosts retrieval; reasoning models improve QA accuracy, exceeding exam passing grades, but visual reasoning lags.
Conclusion: LLMs and VLMs show promise for legal education but face limitations in visual tasks, indicating areas for future improvement.
Abstract: The intersection of AI and legal systems presents a growing need for tools that support legal education, particularly in under-resourced languages such as Romanian. In this work, we aim to evaluate the capabilities of Large Language Models (LLMs) and Vision-Language Models (VLMs) in understanding and reasoning about Romanian driving law through textual and visual question-answering tasks. To facilitate this, we introduce RoD-TAL, a novel multimodal dataset comprising Romanian driving test questions, text-based and image-based, alongside annotated legal references and human explanations. We implement and assess retrieval-augmented generation (RAG) pipelines, dense retrievers, and reasoning-optimized models across tasks including Information Retrieval (IR), Question Answering (QA), Visual IR, and Visual QA. Our experiments demonstrate that domain-specific fine-tuning significantly enhances retrieval performance. At the same time, chain-of-thought prompting and specialized reasoning models improve QA accuracy, surpassing the minimum grades required to pass driving exams. However, visual reasoning remains challenging, highlighting the potential and the limitations of applying LLMs and VLMs to legal education.
[10] ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models
Kaizhi Qian, Xulin Fan, Junrui Ni, Slava Shechtman, Mark Hasegawa-Johnson, Chuang Gan, Yang Zhang
Main category: cs.CL
TL;DR: ProsodyLM improves speech language models by introducing a tokenization scheme that better captures prosody, enabling diverse prosody processing capabilities through pre-training.
Details
Motivation: Existing methods for training speech language models inadequately capture prosody, limiting their ability to understand and generate nuanced speech.
Method: ProsodyLM uses a tokenization scheme where speech is transcribed into text followed by word-level prosody tokens, retaining more prosody information.
Result: ProsodyLM learns diverse prosody capabilities, such as handling contrastive focus, understanding emotion/stress, and maintaining prosody consistency in long contexts.
Conclusion: ProsodyLM demonstrates that a simple tokenization scheme can significantly enhance prosody learning in speech language models.
Abstract: Speech language models refer to language models with speech processing and understanding capabilities. One key desirable capability for speech language models is the ability to capture the intricate interdependency between content and prosody. The existing mainstream paradigm of training speech language models, which converts speech into discrete tokens before feeding them into LLMs, is sub-optimal in learning prosody information – we find that the resulting LLMs do not exhibit obvious emerging prosody processing capabilities via pre-training alone. To overcome this, we propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. Each speech utterance is first transcribed into text, followed by a sequence of word-level prosody tokens. Compared with conventional speech tokenization schemes, the proposed tokenization scheme retains more complete prosody information, and is more understandable to text-based LLMs. We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone, ranging from harnessing the prosody nuances in generated speech, such as contrastive focus, understanding emotion and stress in an utterance, to maintaining prosody consistency in long contexts.
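The tokenization idea is easy to picture with a toy example. The sketch below serializes an utterance as transcript tokens followed by one quantized prosody token per word; the particular feature and bin count are assumptions, since the abstract only specifies text followed by word-level prosody tokens.

```python
# Toy sketch of a ProsodyLM-style sequence: text first, then word-level
# prosody tokens. Binned mean pitch is an assumed stand-in for whatever
# word-level prosody features the model actually encodes.
def build_sequence(words, pitch_per_word, n_bins=16):
    text = list(words)
    prosody = [f"<p{min(int(p * n_bins), n_bins - 1)}>" for p in pitch_per_word]
    return text + prosody

# Contrastive focus on "she" shows up as an outlying prosody token:
seq = build_sequence(
    ["I", "never", "said", "she", "stole", "it"],
    [0.31, 0.35, 0.33, 0.82, 0.34, 0.30],  # normalized pitch in [0, 1)
)
# -> ['I', 'never', 'said', 'she', 'stole', 'it',
#     '<p4>', '<p5>', '<p5>', '<p13>', '<p5>', '<p4>']
```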
[11] Towards Inclusive NLP: Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks
Maitha Alshehhi, Ahmed Sharshar, Mohsen Guizani
Main category: cs.CL
TL;DR: Benchmarking multilingual and monolingual LLMs in low-resource languages reveals cross-lingual transfer benefits, effective quantization for efficiency, and performance trade-offs with pruning.
Details
Motivation: To understand LLM performance in low-resource languages like Kannada and Arabic and evaluate the impact of model compression strategies.
Method: Benchmarked multilingual and monolingual LLMs (e.g., BLOOMZ, AceGPT) across Arabic, English, and Indic languages, testing pruning and quantization effects.
Result: Multilingual models outperform monolingual ones; quantization maintains accuracy, but aggressive pruning harms performance, especially in larger models.
Conclusion: Key strategies for scalable multilingual NLP include leveraging cross-lingual transfer and careful compression, with interventions needed for low-resource challenges.
Abstract: Although LLMs have attained significant success in high-resource languages, their capacity in low-resource linguistic environments like Kannada and Arabic is not yet fully understood. This work benchmarks the performance of multilingual and monolingual Large Language Models (LLMs) across Arabic, English, and Indic languages, with particular emphasis on the effects of model compression strategies such as pruning and quantization. Findings show significant performance differences driven by linguistic diversity and resource availability on SOTA LLMs such as BLOOMZ, AceGPT, Jais, LLaMA-2, XGLM, and AraGPT2. We find that multilingual versions of the model outperform their language-specific counterparts across the board, indicating substantial cross-lingual transfer benefits. Quantization (4-bit and 8-bit) is effective in maintaining model accuracy while promoting efficiency, but aggressive pruning significantly compromises performance, especially in bigger models. Our findings pinpoint key strategies to construct scalable and fair multilingual NLP solutions and underscore the need for interventions to address hallucination and generalization errors in the low-resource setting.
[12] Ta-G-T: Subjectivity Capture in Table to Text Generation via RDF Graphs
Ronak Upasham, Tathagata Dey, Pushpak Bhattacharyya
Main category: cs.CL
TL;DR: The paper introduces a novel pipeline for Table-to-Text (T2T) generation that combines objective and subjective text by leveraging intermediate RDF triples, outperforming larger models like GPT-3.5 in some metrics.
Details
Motivation: Existing T2T approaches lack subjectivity, focusing only on objective descriptions. This work aims to bridge the gap by enriching generated text with interpretations beyond raw data.
Method: A three-stage pipeline: 1) RDF triple extraction, 2) text aggregation into narratives, and 3) subjectivity infusion. Uses fine-tuned T5 models instead of large LLMs.
Result: Achieves comparable performance to GPT-3.5 and outperforms Mistral-7B and Llama-2 in certain metrics, balancing factual accuracy with subjectivity.
Conclusion: The proposed pipeline is the first to integrate intermediate representations for enhancing both factual correctness and subjectivity in T2T generation.
Abstract: In Table-to-Text (T2T) generation, existing approaches predominantly focus on providing objective descriptions of tabular data. However, generating text that incorporates subjectivity, where subjectivity refers to interpretations beyond raw numerical data, remains underexplored. To address this, we introduce a novel pipeline that leverages intermediate representations to generate both objective and subjective text from tables. Our three-stage pipeline consists of: 1) extraction of Resource Description Framework (RDF) triples, 2) aggregation of text into coherent narratives, and 3) infusion of subjectivity to enrich the generated text. By incorporating RDFs, our approach enhances factual accuracy while maintaining interpretability. Unlike large language models (LLMs) such as GPT-3.5, Mistral-7B, and Llama-2, our pipeline employs smaller, fine-tuned T5 models while achieving comparable performance to GPT-3.5 and outperforming Mistral-7B and Llama-2 in several metrics. We evaluate our approach through quantitative and qualitative analyses, demonstrating its effectiveness in balancing factual accuracy with subjective interpretation. To the best of our knowledge, this is the first work to propose a structured pipeline for T2T generation that integrates intermediate representations to enhance both factual correctness and subjectivity.
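To make stage 1 concrete, here is a minimal sketch of the kind of intermediate representation the pipeline builds; the entity and values are invented for illustration, and in the paper the extraction itself is performed by a fine-tuned T5 model rather than rule-based code.

```python
from dataclasses import dataclass

# Sketch of the RDF-triple intermediate representation (stage 1 of the pipeline).
@dataclass
class Triple:
    subject: str
    predicate: str
    obj: str

def row_to_triples(entity: str, row: dict) -> list:
    """Flatten one table row into (subject, predicate, object) triples."""
    return [Triple(entity, column, str(value)) for column, value in row.items()]

triples = row_to_triples("Acme_FC", {"wins": 21, "losses": 3, "season": "2023"})
# Stage 2 aggregates triples into a narrative ("Acme FC won 21 games..."),
# and stage 3 infuses subjectivity ("an impressive 21-win season").
```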
[13] Basic Reading Distillation
Zhi Zhou, Sirui Miao, Xiangyu Duan, Hao Yang, Min Zhang
Main category: cs.CL
TL;DR: Proposes Basic Reading Distillation (BRD) to train small models by imitating LLMs’ basic reading behaviors, improving performance on tasks despite smaller size.
Details
Motivation: Addresses the high computational demands of LLMs by focusing on educating small models on generic, task-unrelated texts, which prior distillation methods overlooked.
Method: Introduces BRD, where small models learn basic reading behaviors (e.g., named entity recognition, Q&A) from LLMs, then apply to downstream tasks.
Result: Small models trained with BRD outperform or match 20x larger LLMs on benchmarks like language inference and BIG-bench tasks.
Conclusion: BRD effectively modifies small models’ probability distributions and complements existing distillation techniques, offering a scalable solution.
Abstract: Large language models (LLMs) have demonstrated remarkable abilities in various natural language processing areas, but they demand high computation resources, which limits their deployment in real-world settings. Distillation is one technique to solve this problem through either knowledge distillation or task distillation. Both distillation approaches train small models to imitate specific features of LLMs, but they all neglect basic reading education for small models on generic texts that are unrelated to downstream tasks. In this paper, we propose basic reading distillation (BRD), which educates a small model to imitate LLMs’ basic reading behaviors, such as named entity recognition, question raising and answering, on each sentence. After such basic education, we apply the small model to various tasks including language inference benchmarks and BIG-bench tasks. It shows that the small model can outperform or perform comparably to LLMs over 20x larger. Analysis reveals that BRD effectively influences the probability distribution of the small model, and has orthogonality to either knowledge distillation or task distillation.
[14] JT-Math: A Multi-Stage Framework for Advanced Mathematical Reasoning in Large Language Models
Yifan Hao, Fangning Chao, Yaqian Hao, Zhaojun Cui, Huan Bai, Haiyu Zhang, Yankai Liu, Chao Deng, Junlan Feng
Main category: cs.CL
TL;DR: JT-Math-8B is an open-source model series designed for advanced mathematical reasoning, outperforming competitors like GPT-4o through multi-stage optimization and a high-quality dataset.
Details
Motivation: Addressing the limitations of current LLMs in complex mathematical reasoning, which requires deep understanding and multi-step deliberation.
Method: Uses a multi-stage optimization framework: base, instruct (SFT + GRPO-based RL), and thinking (Long CoT + multi-stage RL curriculum) models, trained on a 210B-token dataset.
Result: Achieves state-of-the-art performance among open-source models, surpassing GPT-4o in competition-level mathematics.
Conclusion: JT-Math-8B demonstrates superior mathematical reasoning capabilities, setting a new benchmark for open-source models.
Abstract: Mathematical reasoning is a cornerstone of artificial general intelligence and a primary benchmark for evaluating the capabilities of Large Language Models (LLMs). While state-of-the-art models show promise, they often falter when faced with complex problems that demand deep conceptual understanding and intricate, multi-step deliberation. To address this challenge, we introduce JT-Math-8B, a series of open-source models comprising base, instruct, and thinking versions, built upon a systematic, multi-stage optimization framework. Our pre-training corpus is a high-quality, 210B-token dataset curated through a dedicated data pipeline that uses model-based validation to ensure quality and diversity. The Instruct Model is optimized for direct, concise answers through Supervised Fine-Tuning (SFT) and a GRPO-based reinforcement learning (RL) method. The Thinking Model is trained for complex problem-solving using a Long Chain-of-Thought (Long CoT) approach, combining SFT with a novel, multi-stage RL curriculum that progressively increases task difficulty and context length up to 32K tokens. JT-Math-8B achieves state-of-the-art results among open-source models of similar size, surpassing prominent models like OpenAI’s O1-mini and GPT-4o, and demonstrating superior performance on competition-level mathematics.
[15] Are You There God? Lightweight Narrative Annotation of Christian Fiction with LMs
Rebecca M. M. Hicke, Brian Haggard, Mia Ferrante, Rayhan Khanna, David Mimno
Main category: cs.CL
TL;DR: The paper explores the cultural and literary side of the American Evangelical movement, focusing on Christian Fiction. It uses computational tools to analyze divine acts in the genre, revealing differences between the Left Behind series and broader Christian Fiction, as well as between male and female authors.
Details
Motivation: Christian Fiction is understudied, with most attention on the Left Behind series. The paper aims to provide a broader understanding of the genre and its depiction of divine acts.
Method: Human annotators developed definitions for ‘acts of God,’ which were adapted for a lightweight language model (LM) with a larger model’s assistance. The LM matched human annotations for subtle tasks.
Result: The analysis revealed significant differences between the Left Behind series and other Christian Fiction, as well as between books by male and female authors.
Conclusion: The study highlights the utility of computational tools in literary analysis and uncovers nuanced distinctions within Christian Fiction.
Abstract: In addition to its more widely studied political activities, the American Evangelical movement has a well-developed but less externally visible cultural and literary side. Christian Fiction, however, has been little studied, and what scholarly attention there is has focused on the explosively popular Left Behind series. In this work, we use computational tools to provide both a broad topical overview of Christian Fiction as a genre and a more directed exploration of how its authors depict divine acts. Working with human annotators we first developed definitions and a codebook for “acts of God.” We then adapted those instructions designed for human annotators for use by a recent, lightweight LM with the assistance of a much larger model. The laptop-scale LM is capable of matching human annotations, even when the task is subtle and challenging. Using these annotations, we show that significant and meaningful differences exist between the Left Behind books and Christian Fiction more broadly and between books by male and female authors.
[16] UloRL: An Ultra-Long Output Reinforcement Learning Approach for Advancing Large Language Models’ Reasoning Abilities
Dong Du, Shulin Liu, Tao Yang, Shaohua Chen, Yang Li
Main category: cs.CL
TL;DR: The paper proposes UloRL, a method to enhance LLMs’ reasoning by efficiently handling ultra-long outputs via segmented decoding and dynamic masking, achieving significant performance gains.
Details
Motivation: Traditional RL frameworks struggle with inefficiencies in ultra-long output sequences due to long-tail distributions and entropy collapse, limiting LLMs' reasoning potential.
Method: UloRL divides ultra-long outputs into short segments for efficient training and introduces dynamic masking of well-Mastered Positive Tokens (MPTs) to prevent entropy collapse.
Result: On Qwen3-30B-A3B, UloRL achieved 2.06x faster training and improved performance on AIME2025 (70.9% to 85.1%) and BeyondAIME (50.7% to 61.9%), surpassing larger models.
Conclusion: UloRL effectively advances LLMs’ reasoning with ultra-long sequences, demonstrating practical improvements and potential for broader application.
Abstract: Recent advances in large language models (LLMs) have highlighted the potential of reinforcement learning with verifiable rewards (RLVR) to enhance reasoning capabilities through extended output sequences. However, traditional RL frameworks face inefficiencies when handling ultra-long outputs due to long-tail sequence distributions and entropy collapse during training. To address these challenges, we propose an Ultra-Long Output Reinforcement Learning (UloRL) approach for advancing large language models’ reasoning abilities. Specifically, we divide ultra-long output decoding into short segments, enabling efficient training by mitigating delays caused by long-tail samples. Additionally, we introduce dynamic masking of well-Mastered Positive Tokens (MPTs) to prevent entropy collapse. Experimental results demonstrate the effectiveness of our approach. On the Qwen3-30B-A3B model, RL with segment rollout achieved a 2.06x increase in training speed, while RL training with 128k-token outputs improves the model’s performance on AIME2025 from 70.9% to 85.1% and on BeyondAIME from 50.7% to 61.9%, even surpassing Qwen3-235B-A22B with remarkable gains. These findings underscore the potential of our methods to advance the reasoning capabilities of LLMs with ultra-long sequence generation. We will release our code and model for further use by the community.
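A rough sketch of segment rollout follows, under assumptions about the generation interface (the `model.generate` call and segment size are placeholders): each ultra-long sample is produced in fixed-size chunks, so long-tail samples stop stalling the batch.

```python
# Illustrative segment-rollout loop (not the authors' implementation).
def segment_rollout(model, prompt_ids, segment_len=2048, max_new=131072, eos_id=2):
    """Generate up to max_new tokens in segment_len chunks. Finished chunks
    can enter RL training immediately instead of waiting for the slowest
    (longest) sample in the batch to complete."""
    output, segments = list(prompt_ids), []
    while len(output) - len(prompt_ids) < max_new:
        chunk = model.generate(output, max_new_tokens=segment_len)  # assumed interface
        segments.append(chunk)
        output.extend(chunk)
        if eos_id in chunk:  # sample terminated early
            break
    return segments
```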
[17] Flora: Effortless Context Construction to Arbitrary Length and Scale
Tianxiang Chen, Zhentao Tan, Xiaofan Bo, Yue Wu, Tao Gong, Qi Chu, Jieping Ye, Nenghai Yu
Main category: cs.CL
TL;DR: Flora is a human/LLM-free strategy to enhance long-context performance in LLMs by assembling short instructions, maintaining short-context abilities.
Details
Motivation: Challenges in handling long contexts due to rarity, computational demands, and forgetting short-context abilities in LLMs.
Method: Flora constructs long contexts by assembling short instructions based on categories and using meta-instructions for LLM responses.
Result: Improved long-context performance in benchmarks (Llama3-8B-Instruct, QwQ-32B) with minimal short-context performance drop.
Conclusion: Flora offers a scalable, diverse, and efficient solution for long-context LLM enhancement without human/LLM intervention.
Abstract: Effectively handling long contexts is challenging for Large Language Models (LLMs) due to the rarity of long texts, high computational demands, and substantial forgetting of short-context abilities. Recent approaches have attempted to construct long contexts for instruction tuning, but these methods often require LLMs or human interventions, which are both costly and limited in length and diversity. Also, the drop in short-context performances of present long-context LLMs remains significant. In this paper, we introduce Flora, an effortless (human/LLM-free) long-context construction strategy. Flora can markedly enhance the long-context performance of LLMs by arbitrarily assembling short instructions based on categories and instructing LLMs to generate responses based on long-context meta-instructions. This enables Flora to produce contexts of arbitrary length and scale with rich diversity, while only slightly compromising short-context performance. Experiments on Llama3-8B-Instruct and QwQ-32B show that LLMs enhanced by Flora excel in three long-context benchmarks while maintaining strong performances in short-context tasks. Our data-construction code is available at https://github.com/txchen-USTC/Flora.
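The assembly step is simple enough to sketch. Below is a hypothetical Flora-style context builder: short instructions are sampled by category and concatenated under a long-context meta-instruction. The template text is invented; the paper specifies the strategy, not this exact prompt.

```python
import random

# Hypothetical Flora-style long-context assembly (template text is assumed).
def build_long_context(pool: dict, target_chars: int = 100_000, seed: int = 0) -> str:
    rng = random.Random(seed)
    parts, total = [], 0
    while total < target_chars:
        category = rng.choice(list(pool))           # pool: {category: [instructions]}
        instruction = rng.choice(pool[category])
        parts.append(f"[{category}] {instruction}")
        total += len(parts[-1])
    meta = ("You will receive many short instructions. Answer each one in order, "
            "treating all earlier instructions and answers as shared context.")
    return meta + "\n\n" + "\n".join(parts)
```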
[18] HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs
Dongquan Yang, Yifan Yang, Xiaotian Yu, Xianbiao Qi, Rong Xiao
Main category: cs.CL
TL;DR: HCAttention is a framework for efficient long-context processing in LLMs, reducing KV cache memory to 25% without accuracy loss.
Details
Motivation: Addressing the challenge of high memory requirements for KV cache in LLMs during inference, especially under extreme constraints.
Method: Uses key quantization, value offloading, and dynamic KV eviction in a GPU-CPU collaboration framework.
Result: Achieves full-attention accuracy with 25% KV cache memory and extends Llama-3-8B to process 4M tokens on an A100 GPU.
Conclusion: HCAttention sets a new SOTA in KV cache compression, enabling efficient long-context processing without model fine-tuning.
Abstract: Processing long-context inputs with large language models presents a significant challenge due to the enormous memory requirements of the Key-Value (KV) cache during inference. Existing KV cache compression methods exhibit noticeable performance degradation when memory is reduced by more than 85%. Additionally, strategies that leverage GPU-CPU collaboration for approximate attention remain underexplored in this setting. We propose HCAttention, a heterogeneous attention computation framework that integrates key quantization, value offloading, and dynamic KV eviction to enable efficient inference under extreme memory constraints. The method is compatible with existing transformer architectures and does not require model fine-tuning. Experimental results on the LongBench benchmark demonstrate that our approach preserves the accuracy of full-attention model while shrinking the KV cache memory footprint to 25% of its original size. Remarkably, it stays competitive with only 12.5% of the cache, setting a new state-of-the-art in LLM KV cache compression. To the best of our knowledge, HCAttention is the first to extend the Llama-3-8B model to process 4 million tokens on a single A100 GPU with 80GB memory.
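The sketch below illustrates the heterogeneous split at decode time: 8-bit keys stay on the GPU for scoring, full-precision values live on the CPU, and only the top-scoring ones are fetched back. It is a loose reading of the abstract; the paper's quantizer, eviction policy, and GPU-CPU scheduling are certainly more sophisticated.

```python
import torch

# Loose sketch of heterogeneous KV attention for single-token decoding.
class HeterogeneousKVCache:
    def __init__(self, top_k=256):
        self.top_k, self.k_q, self.k_scale, self.v_cpu = top_k, [], [], []

    def append(self, k, v):  # k, v: (dim,) for one new token
        scale = k.abs().max() / 127.0 + 1e-12
        self.k_q.append((k / scale).round().to(torch.int8))  # quantized key on GPU
        self.k_scale.append(scale)
        self.v_cpu.append(v.to("cpu"))                       # value offloaded to CPU

    def attend(self, q):  # q: (dim,) current decoding query
        k = torch.stack([kq.float() * s for kq, s in zip(self.k_q, self.k_scale)])
        scores = (k @ q) / q.shape[-1] ** 0.5                # score against all keys
        idx = scores.topk(min(self.top_k, scores.numel())).indices
        v = torch.stack([self.v_cpu[i] for i in idx.tolist()]).to(q.device)
        return torch.softmax(scores[idx], dim=-1) @ v        # attend over fetched values
```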
[19] DRIVE: Disfluency-Rich Synthetic Dialog Data Generation Framework for Intelligent Vehicle Environments
Anshul Chavda, M Jagadeesh, Chintalapalli Raja Kullayappa, B Jayaprakash, Medchalimi Sruthi, Pushpak Bhattacharyya
Main category: cs.CL
TL;DR: DiscoDrive is a synthetic corpus of 3500 multi-turn dialogs with disfluencies, improving conversational AI performance in automotive domains.
Details
Motivation: Existing datasets lack real-world disfluencies in driver-AI dialogs, limiting AI effectiveness.
Method: A two-stage, prompt-driven pipeline dynamically integrates disfluencies during dialog synthesis.
Result: DiscoDrive outperforms KVRET-trained models in metrics like BLEU-4, METEOR, and human evaluations.
Conclusion: DiscoDrive bridges the gap in datasets, enhancing AI robustness for real-world in-car interactions.
Abstract: In-car conversational AI is becoming increasingly critical as autonomous vehicles and smart assistants gain widespread adoption. Yet, existing datasets fail to capture the spontaneous disfluencies such as hesitations, false starts, repetitions, and self-corrections that characterize real driver-AI dialogs. To address this, we introduce DiscoDrive, a synthetic corpus of 3500 multi-turn dialogs across seven automotive domains, generated using a two-stage, prompt-driven pipeline that dynamically integrates disfluencies during synthesis. We show that DiscoDrive is effective both as a training resource, enabling DialoGPT-Medium and T5-Base to match or exceed KVRET-trained models on the MultiWOZ 2.2 and Schema-Guided Dialogue (SGD) relevant test sets (BLEU-4 improvements of 0.26 to 0.61; METEOR +2.10; ROUGE-L +3.48; BERTScore F1 improvements of 1.35 to 3.48), and as a data augmentation resource in low-resource scenarios, delivering additional gains of up to BLEU-4 +0.38, METEOR +1.95, ROUGE-L +2.87, and BERTScore F1 +4.00 when combined with 10 percent of KVRET. Human evaluations further confirm that dialogs sampled from DiscoDrive are rated higher than KVRET’s human-collected dialogs in naturalness (3.8 vs 3.6) and coherence (4.1 vs 4.0), and are perceived as more context-appropriate than leading post-hoc methods (such as LARD), without compromising clarity. DiscoDrive fills a critical gap in existing resources and serves as a versatile corpus for both training and augmenting conversational AI, enabling robust handling of real-world, disfluent in-car interactions.
[20] Scaling Analysis of Interleaved Speech-Text Language Models
Gallil Maimon, Michael Hassid, Amit Roth, Yossi Adi
Main category: cs.CL
TL;DR: Interleaved Speech Language Models (SLMs) scale more efficiently than textless-SLMs, requiring less compute and data while achieving comparable performance.
Details
Motivation: To investigate if interleaved SLMs, initialized from pre-trained TextLMs, scale more efficiently than textless-SLMs.
Method: Conducted scaling analysis by training dozens of interleaved SLMs and analyzing trends, focusing on compute allocation and synthetic data.
Result: Interleaved SLMs scale more efficiently, with dynamics differing from textless-SLMs, favoring model size over training tokens. Scaled models match leading models’ performance with less compute.
Conclusion: Interleaved SLMs are a feasible and efficient alternative to textless-SLMs, with potential unlocked by synthetic data and TextLM families.
Abstract: Existing Speech Language Model (SLM) scaling analysis paints a bleak picture. It predicts that SLMs require much more compute and data compared to text, leading some to question the feasibility of training high-quality SLMs. However, modern SLMs are often initialised from pre-trained TextLMs using speech-text interleaving to allow knowledge transfer. This raises the question: “Do interleaved SLMs scale more efficiently than textless-SLMs?” In this paper we answer a resounding yes! We conduct scaling analysis of interleaved SLMs by training several dozen and analysing the scaling trends. We see that under this setup SLMs scale more efficiently with compute. Additionally, our results indicate that the scaling dynamics significantly differ from textless-SLMs, suggesting one should allocate notably more of the compute budget to increasing model size over training tokens. We also study the role of synthetic data and TextLM model families in unlocking this potential. Results suggest that our scaled-up model achieves comparable semantic speech performance to leading models, while using less compute and data. We open-source models, samples, and data: https://pages.cs.huji.ac.il/adiyoss-lab/sims/
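For intuition about what interleaving means in the training data, here is a toy sketch: text spans and discrete speech units alternate in one sequence, so a pre-trained TextLM can transfer knowledge to the speech modality. The tag names and unit format are assumptions.

```python
# Toy sketch of speech-text interleaving (tag names and unit ids are assumed).
def interleave(text_spans, speech_spans):
    """Alternate text tokens with discrete speech units in one sequence."""
    seq = []
    for text, units in zip(text_spans, speech_spans):
        seq.extend(text)                               # e.g. ["the", "quick", "fox"]
        seq.append("<speech>")
        seq.extend(f"<unit_{u}>" for u in units)       # e.g. k-means cluster ids
        seq.append("</speech>")
    return seq

sample = interleave([["the", "cat"], ["sat"]], [[17, 4, 92], [8, 8, 31]])
```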
[21] The Polish Vocabulary Size Test: A Novel Adaptive Test for Receptive Vocabulary Assessment
Danil Fokin, Monika Płużyczka, Grigory Golovin
Main category: cs.CL
TL;DR: The Polish Vocabulary Size Test (PVST) is a new tool for measuring vocabulary size in native and non-native Polish speakers, using adaptive testing for accuracy and efficiency.
Details
Motivation: To create an accurate and efficient tool for assessing vocabulary size in Polish speakers, addressing the need for such a resource.
Method: Based on Item Response Theory and Computerized Adaptive Testing, PVST dynamically adjusts to the test-taker’s proficiency.
Result: A pilot study with 1,475 participants showed native speakers had larger vocabularies, with size correlating positively with age.
Conclusion: PVST is a validated, efficient tool for vocabulary assessment, available online.
Abstract: We present the Polish Vocabulary Size Test (PVST), a novel tool for assessing the receptive vocabulary size of both native and non-native Polish speakers. Based on Item Response Theory and Computerized Adaptive Testing, PVST dynamically adjusts to each test-taker’s proficiency level, ensuring high accuracy while keeping the test duration short. To validate the test, a pilot study was conducted with 1,475 participants. Native Polish speakers demonstrated significantly larger vocabularies compared to non-native speakers. For native speakers, vocabulary size showed a strong positive correlation with age. The PVST is available online at myvocab.info/pl.
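The adaptive mechanism can be sketched with the standard two-parameter logistic (2PL) IRT model: pick the item with maximal Fisher information at the current ability estimate, update the estimate after each response, and stop when the estimate is precise enough. PVST's actual item parameters and stopping rule are not given in the abstract; this is the textbook scheme.

```python
import math

# Textbook 2PL adaptive-testing sketch (not PVST's calibrated parameters).
def p_correct(theta, a, b):
    """2PL model: P(correct | ability theta, discrimination a, difficulty b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def next_item(theta, unanswered):
    """Select the item (a, b) most informative at the current ability estimate."""
    return max(unanswered, key=lambda item: fisher_information(theta, *item))
```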
[22] ChildGuard: A Specialized Dataset for Combatting Child-Targeted Hate Speech
Gautam Siddharth Kashyap, Mohammad Anas Azeez, Rafiq Ali, Zohaib Hasan Siddiqui, Jiechao Gao, Usman Naseem
Main category: cs.CL
TL;DR: ChildGuard is a new dataset for detecting hate speech targeting children, addressing gaps in existing datasets by providing age-specific labels and nuanced linguistic cues.
Details
Motivation: Current NLP systems fail to detect child-directed hate speech due to inadequate datasets focused on adults, lacking age-specific labels and nuanced cues.
Method: ChildGuard includes 351,877 annotated examples from social media, split into contextual and lexical subsets for fine-grained analysis.
Result: Benchmarking shows state-of-the-art models perform poorly on ChildGuard, underscoring the challenge of detecting child-directed hate speech.
Conclusion: ChildGuard fills a critical gap, enabling better detection of hate speech targeting children and highlighting the need for age-specific NLP solutions.
Abstract: Hate speech targeting children on social media is a serious and growing problem, yet current NLP systems struggle to detect it effectively. This gap exists mainly because existing datasets focus on adults, lack age-specific labels, miss nuanced linguistic cues, and are often too small for robust modeling. To address this, we introduce ChildGuard, the first large-scale English dataset dedicated to hate speech aimed at children. It contains 351,877 annotated examples from X (formerly Twitter), Reddit, and YouTube, labeled by three age groups: younger children (under 11), pre-teens (11–12), and teens (13–17). The dataset is split into two subsets for fine-grained analysis: a contextual subset (157K) focusing on discourse-level features, and a lexical subset (194K) emphasizing word-level sentiment and vocabulary. Benchmarking state-of-the-art hate speech models on ChildGuard reveals notable drops in performance, highlighting the challenges of detecting child-directed hate speech.
[23] Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice
Shanbo Cheng, Yu Bao, Zhichao Huang, Yu Lu, Ningxin Peng, Lu Xu, Runsheng Yu, Rong Cao, Yujiao Du, Ting Han, Yuxiang Hu, Zeyang Li, Sitong Liu, Shengtao Ma, Shiguang Pan, Jiongchen Xiao, Nuo Xu, Meng Yang, Rong Ye, Yiming Yu, Jun Zhang, Ruofei Zhang, Wanyi Zhang, Wenhao Zhu, Liehao Zou, Lu Lu, Yuxuan Wang, Yonghui Wu
Main category: cs.CL
TL;DR: Seed-LiveInterpret 2.0 is an advanced end-to-end SI model addressing key challenges like transcription quality and latency, achieving high accuracy and near-real-time performance.
Details
Motivation: To overcome persistent issues in automatic simultaneous interpretation (SI), such as poor transcription, high latency, and multi-speaker confusion.
Method: Uses a novel duplex speech-to-speech understanding-generating framework, leveraging large-scale pretraining and reinforcement learning.
Result: Achieves over 70% correctness in complex scenarios, reduces latency from 10s to 3s (70% improvement), and outperforms commercial SI solutions.
Conclusion: Seed-LiveInterpret 2.0 significantly advances SI technology, offering high-fidelity, low-latency translation with practical usability.
Abstract: Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation, especially in long-form discourses. In this study, we introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities. As a fully operational product-level solution, Seed-LiveInterpret 2.0 tackles these challenges head-on through our novel duplex speech-to-speech understanding-generating framework. Experimental results demonstrate that through large-scale pretraining and reinforcement learning, the model achieves a significantly better balance between translation accuracy and latency, validated by human interpreters to exceed 70% correctness in complex scenarios. Notably, Seed-LiveInterpret 2.0 outperforms commercial SI solutions by significant margins in translation quality, while slashing the average latency of cloned speech from nearly 10 seconds to a near-real-time 3 seconds, an almost 70% reduction that drastically enhances practical usability.
[24] Zero-shot Performance of Generative AI in Brazilian Portuguese Medical Exam
Cesar Augusto Madid Truyts, Amanda Gomes Rabelo, Gabriel Mesquita de Souza, Daniel Scaldaferri Lages, Adriano Jose Pereira, Uri Adrian Prync Flato, Eduardo Pontes dos Reis, Joaquim Edson Vieira, Paulo Sergio Panse Silveira, Edson Amaro Junior
Main category: cs.CL
TL;DR: The study evaluates six LLMs and four MLLMs on Brazilian Portuguese medical exam questions, finding performance gaps, especially in multimodal tasks, and highlighting language biases in AI healthcare applications.
Details
Motivation: To assess the performance of LLMs and MLLMs in non-English medical contexts, addressing biases and gaps in AI healthcare applications.
Method: Benchmarked six LLMs and four MLLMs against human candidates using Brazilian Portuguese medical exam questions, analyzing accuracy, processing time, and explanation coherence.
Result: Some models (Claude-3.5-Sonnet, Claude-3-Opus) matched human accuracy, but gaps persisted, especially in multimodal tasks. Language disparities were evident.
Conclusion: Emphasizes the need for evaluating AI in diverse linguistic and clinical settings, with future research focusing on better training, multimodal reasoning, and real-world integration.
Abstract: Artificial intelligence (AI) has shown the potential to revolutionize healthcare by improving diagnostic accuracy, optimizing workflows, and personalizing treatment plans. Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have achieved notable advancements in natural language processing and medical applications. However, the evaluation of these models has focused predominantly on the English language, leading to potential biases in their performance across different languages. This study investigates the capability of six LLMs (GPT-4.0 Turbo, LLaMA-3-8B, LLaMA-3-70B, Mixtral 8x7B Instruct, Titan Text G1-Express, and Command R+) and four MLLMs (Claude-3.5-Sonnet, Claude-3-Opus, Claude-3-Sonnet, and Claude-3-Haiku) to answer questions written in Brazilian Portuguese from the medical residency entrance exam of the Hospital das Clínicas da Faculdade de Medicina da Universidade de São Paulo (HCFMUSP) - the largest health complex in South America. The performance of the models was benchmarked against human candidates, analyzing accuracy, processing time, and coherence of the generated explanations. The results show that while some models, particularly Claude-3.5-Sonnet and Claude-3-Opus, achieved accuracy levels comparable to human candidates, performance gaps persist, particularly in multimodal questions requiring image interpretation. Furthermore, the study highlights language disparities, emphasizing the need for further fine-tuning and dataset augmentation for non-English medical AI applications. Our findings reinforce the importance of evaluating generative AI in various linguistic and clinical settings to ensure a fair and reliable deployment in healthcare. Future research should explore improved training methodologies, improved multimodal reasoning, and real-world clinical integration of AI-driven medical assistance.
[25] A Gold Standard Dataset and Evaluation Framework for Depression Detection and Explanation in Social Media using LLMs
Prajval Bolegave, Pushpak Bhattacharya
Main category: cs.CL
TL;DR: The paper introduces a fine-grained dataset for detecting depression from social media posts and evaluates LLMs’ ability to generate clinical explanations.
Details
Motivation: To enable early detection of depression and improve AI-driven mental health interventions by providing a detailed, expert-annotated dataset.
Method: Developed a dataset of 1,017 posts labeled with depressive spans and symptoms. Evaluated LLMs (GPT-4.1, Gemini 2.5 Pro, Claude 3.7 Sonnet) using zero-shot and few-shot prompting for clinical explanations.
Result: Significant differences in LLM performance on clinical tasks, highlighting the importance of human expertise.
Conclusion: The work advances safer, more transparent AI for mental health by integrating human-guided LLM evaluations.
Abstract: Early detection of depression from online social media posts holds promise for providing timely mental health interventions. In this work, we present a high-quality, expert-annotated dataset of 1,017 social media posts labeled with depressive spans and mapped to 12 depression symptom categories. Unlike prior datasets that primarily offer coarse post-level labels (Cohan et al., 2018), our dataset enables fine-grained evaluation of both model predictions and generated explanations. We develop an evaluation framework that leverages this clinically grounded dataset to assess the faithfulness and quality of natural language explanations generated by large language models (LLMs). Through carefully designed prompting strategies, including zero-shot and few-shot approaches with domain-adapted examples, we evaluate state-of-the-art proprietary LLMs including GPT-4.1, Gemini 2.5 Pro, and Claude 3.7 Sonnet. Our comprehensive empirical analysis reveals significant differences in how these models perform on clinical explanation tasks under zero-shot and few-shot prompting. Our findings underscore the value of human expertise in guiding LLM behavior and offer a step toward safer, more transparent AI systems for psychological well-being.
[26] CaliDrop: KV Cache Compression with Calibration
Yi Su, Quantong Qiu, Yuechi Zhou, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
Main category: cs.CL
TL;DR: The paper introduces CaliDrop, a calibration-enhanced token eviction strategy to reduce memory usage in LLMs without significant accuracy loss.
Details
Motivation: The KV cache in LLMs grows memory-intensive with sequence length, batch size, and model size, creating bottlenecks. Existing token eviction methods degrade accuracy under high compression.
Method: CaliDrop leverages high query similarity at nearby positions to calibrate discarded tokens, mitigating accuracy loss from eviction.
Result: Experiments show CaliDrop significantly improves accuracy over existing token eviction methods.
Conclusion: CaliDrop effectively balances memory efficiency and accuracy in LLM generation.
Abstract: Large Language Models (LLMs) require substantial computational resources during generation. While the Key-Value (KV) cache significantly accelerates this process by storing attention intermediates, its memory footprint grows linearly with sequence length, batch size, and model size, creating a bottleneck in long-context scenarios. Various KV cache compression techniques, including token eviction, quantization, and low-rank projection, have been proposed to mitigate this bottleneck, often complementing each other. This paper focuses on enhancing token eviction strategies. Token eviction leverages the observation that the attention patterns are often sparse, allowing for the removal of less critical KV entries to save memory. However, this reduction usually comes at the cost of notable accuracy degradation, particularly under high compression ratios. To address this issue, we propose CaliDrop, a novel strategy that enhances token eviction through calibration. Our preliminary experiments show that queries at nearby positions exhibit high similarity. Building on this observation, CaliDrop performs speculative calibration on the discarded tokens to mitigate the accuracy loss caused by token eviction. Extensive experiments demonstrate that CaliDrop significantly improves the accuracy of existing token eviction methods.
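To make the eviction-plus-calibration idea concrete, here is a toy sketch (ours, not CaliDrop's actual algorithm): evict low-attention KV entries, but use the current query, standing in for the similar queries at nearby positions, to fold the evicted entries' estimated contribution into a single compensation pair.

```python
import torch

torch.manual_seed(0)
T, d = 64, 32                        # cached tokens, head dimension
K, V = torch.randn(T, d), torch.randn(T, d)
q = torch.randn(d)                   # current query; nearby queries are similar

attn = torch.softmax(K @ q / d**0.5, dim=0)

# Evict all but the top-k entries by attention mass ...
k = 16
keep = attn.topk(k).indices
keep_set = set(keep.tolist())
evict = torch.tensor([i for i in range(T) if i not in keep_set])

# ... but calibrate: fold the evicted entries' contribution, weighted by
# the current query's attention, into one compensation (key, value) pair.
w = attn[evict] / attn[evict].sum()
K_c = torch.cat([K[keep], (w[:, None] * K[evict]).sum(0, keepdim=True)])
V_c = torch.cat([V[keep], (w[:, None] * V[evict]).sum(0, keepdim=True)])

full = attn @ V
compressed = torch.softmax(K_c @ q / d**0.5, dim=0) @ V_c
print(torch.dist(full, compressed))  # stays close despite 4x fewer entries
```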
[27] KLAAD: Refining Attention Mechanisms to Reduce Societal Bias in Generative Language Models
Seorin Kim, Dongyoung Lee, Jaejin Lee
Main category: cs.CL
TL;DR: KLAAD is an attention-based debiasing framework for LLMs that aligns attention distributions between biased and unbiased contexts without altering model weights, improving bias mitigation while preserving language quality.
Details
Motivation: Address societal biases in LLM outputs to ensure fairness and reduce harm.
Method: Proposes KLAAD, using a composite training objective (Cross-Entropy, KL divergence, Triplet losses) to align attention distributions implicitly.
Result: Improved bias mitigation on BBQ and BOLD benchmarks with minimal impact on language quality.
Conclusion: Attention-level alignment is a principled approach for bias mitigation in generative language models.
Abstract: Large language models (LLMs) often exhibit societal biases in their outputs, prompting ethical concerns regarding fairness and harm. In this work, we propose KLAAD (KL-Attention Alignment Debiasing), an attention-based debiasing framework that implicitly aligns attention distributions between stereotypical and anti-stereotypical sentence pairs without directly modifying model weights. KLAAD introduces a composite training objective combining Cross-Entropy, KL divergence, and Triplet losses, guiding the model to consistently attend across biased and unbiased contexts while preserving fluency and coherence. Experimental evaluation of KLAAD demonstrates improved bias mitigation on both the BBQ and BOLD benchmarks, with minimal impact on language modeling quality. The results indicate that attention-level alignment offers a principled solution for mitigating bias in generative language models.
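A minimal sketch of what such a composite objective could look like in PyTorch; the loss weights, inputs, and alignment granularity below are our assumptions, since the abstract names only the three loss families.

```python
import torch.nn.functional as F

def klaad_style_loss(lm_logits, labels, attn_stereo, attn_anti,
                     anchor, positive, negative, lam_kl=1.0, lam_tri=1.0):
    """Cross-entropy keeps fluency; KL pulls the attention distribution on a
    stereotypical sentence toward its anti-stereotypical counterpart; a
    triplet term separates embeddings. lam_* weights are placeholders."""
    ce = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)), labels.view(-1))
    kl = F.kl_div(attn_stereo.clamp_min(1e-8).log(), attn_anti,
                  reduction="batchmean")
    tri = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)
    return ce + lam_kl * kl + lam_tri * tri
```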
[28] Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text
Mizanur Rahman, Md Tahmid Rahman Laskar, Shafiq Joty, Enamul Hoque
Main category: cs.CL
TL;DR: Text2Vis is a benchmark for evaluating text-to-visualization models, addressing gaps in current evaluation methods. It includes diverse chart types and data queries, benchmarking 11 models and proposing a novel framework to improve performance.
Details
Motivation: The lack of comprehensive benchmarks for evaluating LLMs in generating visualizations from text limits rigorous assessment of their capabilities.
Method: Text2Vis introduces 1,985 samples with data tables, queries, answers, and visualization code. It benchmarks 11 models and proposes a cross-modal actor-critic framework to refine outputs.
Result: The framework improves GPT-4o's pass rate from 26% to 42% and enhances chart quality. An automated LLM-based evaluation method is also introduced.
Conclusion: Text2Vis provides a robust benchmark and framework for advancing text-to-visualization models, with potential for scalable, automated evaluation.
Abstract: Automated data visualization plays a crucial role in simplifying data interpretation, enhancing decision-making, and improving efficiency. While large language models (LLMs) have shown promise in generating visualizations from natural language, the absence of comprehensive benchmarks limits the rigorous evaluation of their capabilities. We introduce Text2Vis, a benchmark designed to assess text-to-visualization models, covering 20+ chart types and diverse data science queries, including trend analysis, correlation, outlier detection, and predictive analytics. It comprises 1,985 samples, each with a data table, natural language query, short answer, visualization code, and annotated charts. The queries involve complex reasoning, conversational turns, and dynamic data retrieval. We benchmark 11 open-source and closed-source models, revealing significant performance gaps, highlighting key challenges, and offering insights for future advancements. To close this gap, we propose the first cross-modal actor-critic agentic framework that jointly refines the textual answer and visualization code, increasing GPT-4o's pass rate from 26% to 42% over the direct approach and improving chart quality. We also introduce an automated LLM-based evaluation framework that enables scalable assessment across thousands of samples without human annotation, measuring answer correctness, code execution success, visualization readability, and chart accuracy. We release Text2Vis at https://github.com/vis-nlp/Text2Vis.
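As a rough illustration of an actor-critic refinement loop of this kind (the llm() callable, the prompts, and the stopping signal are all hypothetical, not the paper's implementation):

```python
def refine_visualization(table, query, llm, max_rounds=3):
    """Actor drafts an answer plus chart code; a critic reviews both and the
    actor revises. A sketch only: prompts and the "LGTM" convergence token
    are assumptions, not Text2Vis's actual protocol."""
    draft = llm(f"Data table:\n{table}\nAnswer the query '{query}' and "
                "write matplotlib code for an appropriate chart.")
    for _ in range(max_rounds):
        critique = llm("Act as a critic. Check this answer and chart code "
                       f"for correctness and readability:\n{draft}")
        if "LGTM" in critique:          # assumed approval signal
            break
        draft = llm(f"Revise the draft.\nDraft:\n{draft}\nCritique:\n{critique}")
    return draft
```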
[29] Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory
Dan Song, Won-Chan Lee, Hong Jiao
Main category: cs.CL
TL;DR: The study evaluates reliability of LLMs in scoring AP Chinese writing tasks, finding human raters more reliable but LLMs consistent for story narration. Hybrid scoring improves reliability.
Details
Motivation: To assess the reliability of LLMs in scoring writing tasks compared to human raters, focusing on AP Chinese exams.
Method: Used generalizability theory to compare score consistency between human and AI raters for two writing tasks (story narration, email response). Essays were scored by 2 humans and 7 AI raters, with holistic and analytic scores.
Result: Human raters were more reliable overall, but LLMs showed consistency for story narration. Hybrid scoring (human + AI) improved reliability.
Conclusion: Hybrid scoring models combining human and AI raters may enhance reliability in large-scale writing assessments.
Abstract: This study investigates the estimation of reliability for large language models (LLMs) in scoring writing tasks from the AP Chinese Language and Culture Exam. Using generalizability theory, the research evaluates and compares score consistency between human and AI raters across two types of AP Chinese free-response writing tasks: story narration and email response. These essays were independently scored by two trained human raters and seven AI raters. Each essay received four scores: one holistic score and three analytic scores corresponding to the domains of task completion, delivery, and language use. Results indicate that although human raters produced more reliable scores overall, LLMs demonstrated reasonable consistency under certain conditions, particularly for story narration tasks. Composite scoring that incorporates both human and AI raters improved reliability, which suggests that hybrid scoring models may offer benefits for large-scale writing assessments.
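For readers unfamiliar with generalizability theory, the standard single-facet persons-by-raters coefficient below (our gloss, not the paper's notation) shows why pooling raters helps: averaging over more raters shrinks the error term, which is the mechanism by which adding AI raters alongside humans can raise composite reliability.

```latex
% Generalizability coefficient for a persons x raters design:
%   \sigma^2_p      = true variance in essay quality across examinees
%   \sigma^2_{pr,e} = residual essay-by-rater variance (error)
%   n_r             = number of raters averaged per essay
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pr,e} / n_r}
```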
[30] VLQA: The First Comprehensive, Large, and High-Quality Vietnamese Dataset for Legal Question Answering
Tan-Minh Nguyen, Hoang-Trung Nguyen, Trong-Khoi Dao, Xuan-Hieu Phan, Ha-Thanh Nguyen, Thi-Hai-Yen Vuong
Main category: cs.CL
TL;DR: The paper introduces the VLQA dataset for Vietnamese legal NLP, addressing resource scarcity and evaluating its effectiveness with state-of-the-art models.
Details
Motivation: Legal NLP faces challenges in low-resource languages like Vietnamese due to lack of annotated data, despite the potential of LLMs.
Method: The authors create the VLQA dataset and analyze it statistically, then test it on legal information retrieval and question-answering tasks.
Result: The VLQA dataset proves effective for legal NLP tasks in Vietnamese, as demonstrated by experiments with advanced models.
Conclusion: The VLQA dataset fills a critical gap for Vietnamese legal NLP, though broader automation of legal tasks remains a challenge.
Abstract: The advent of large language models (LLMs) has led to significant achievements in various domains, including legal text processing. Leveraging LLMs for legal tasks is a natural evolution and an increasingly compelling choice. However, their capabilities are often portrayed as greater than they truly are. Despite the progress, we are still far from the ultimate goal of fully automating legal tasks using artificial intelligence (AI) and natural language processing (NLP). Moreover, legal systems are deeply domain-specific and exhibit substantial variation across different countries and languages. The need for building legal text processing applications for different natural languages is, therefore, large and urgent. However, there is a big challenge for legal NLP in low-resource languages such as Vietnamese due to the scarcity of resources and annotated data. The need for labeled legal corpora for supervised training, validation, and supervised fine-tuning is critical. In this paper, we introduce the VLQA dataset, a comprehensive and high-quality resource tailored for the Vietnamese legal domain. We also conduct a comprehensive statistical analysis of the dataset and evaluate its effectiveness through experiments with state-of-the-art models on legal information retrieval and question-answering tasks.
[31] Anomaly Detection in Human Language via Meta-Learning: A Few-Shot Approach
Saurav Singla, Aarav Singla, Advik Gupta, Parnika Gupta
Main category: cs.CL
TL;DR: A meta-learning framework for few-shot anomaly detection in human language across domains with limited labeled data, outperforming baselines in F1 and AUC.
Details
Motivation: Addressing the challenge of detecting sparse and variable language anomalies (e.g., spam, fake news, hate speech) with minimal labeled data.
Method: Treats anomaly detection as a few-shot binary classification, using meta-learning with episodic training, prototypical networks, and domain resampling.
Result: Outperforms baselines in F1 and AUC scores on datasets like SMS spam, COVID-19 fake news, and hate speech.
Conclusion: The framework generalizes well to unseen tasks with minimal labeled anomalies, and code/benchmarks are released for further research.
Abstract: We propose a meta-learning framework for detecting anomalies in human language across diverse domains with limited labeled data. Anomalies in language ranging from spam and fake news to hate speech pose a major challenge due to their sparsity and variability. We treat anomaly detection as a few-shot binary classification problem and leverage meta-learning to train models that generalize across tasks. Using datasets from domains such as SMS spam, COVID-19 fake news, and hate speech, we evaluate model generalization on unseen tasks with minimal labeled anomalies. Our method combines episodic training with prototypical networks and domain resampling to adapt quickly to new anomaly detection tasks. Empirical results show that our method outperforms strong baselines in F1 and AUC scores. We also release the code and benchmarks to facilitate further research in few-shot text anomaly detection.
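A compact sketch of one prototypical-network episode, the generic few-shot machinery the method builds on (the linear encoder and episode sizes here are placeholders, not the paper's setup):

```python
import torch

def proto_episode(encoder, support_x, support_y, query_x):
    """Class prototypes are mean support embeddings; queries are scored by
    negative Euclidean distance and can be trained with cross-entropy."""
    z_s, z_q = encoder(support_x), encoder(query_x)
    classes = support_y.unique()
    protos = torch.stack([z_s[support_y == c].mean(0) for c in classes])
    return -torch.cdist(z_q, protos)          # [n_query, n_classes] logits

# Toy usage: a linear encoder over 300-dim text features, 2-way episode
# (normal vs. anomalous), 10 support and 5 query examples.
enc = torch.nn.Linear(300, 64)
logits = proto_episode(enc, torch.randn(10, 300),
                       torch.randint(0, 2, (10,)), torch.randn(5, 300))
```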
[32] FAEDKV: Infinite-Window Fourier Transform for Unbiased KV Cache Compression
Runchao Li, Yao Fu, Mu Sheng, Xianxuan Long, Haotian Yu, Pan Li
Main category: cs.CL
TL;DR: FAEDKV is a training-free KV cache compression framework using frequency-domain transformation to retain unbiased information in LLMs, outperforming existing methods by up to 22%.
Details
Motivation: Addressing the biased representations and high computational costs of current KV cache compression methods in LLMs.
Method: Transforms KV cache into the frequency domain using Infinite-Window Fourier Transform (IWDFT) for equalized token contribution and targeted compression.
Result: Achieves up to 22% better performance on LongBench and superior position-agnostic retrieval accuracy.
Conclusion: FAEDKV offers an efficient, unbiased solution for KV cache compression without requiring retraining.
Abstract: The efficacy of Large Language Models (LLMs) in long-context tasks is often hampered by the substantial memory footprint and computational demands of the Key-Value (KV) cache. Current compression strategies, including token eviction and learned projections, frequently lead to biased representations – either by overemphasizing recent/high-attention tokens or by repeatedly degrading information from earlier context – and may require costly model retraining. We present FAEDKV (Frequency-Adaptive Infinite-Window for KV cache), a novel, training-free KV cache compression framework that ensures unbiased information retention. FAEDKV operates by transforming the KV cache into the frequency domain using a proposed Infinite-Window Fourier Transform (IWDFT). This approach allows for the equalized contribution of all tokens to the compressed representation, effectively preserving both early and recent contextual information. A preliminary frequency ablation study identifies critical spectral components for layer-wise, targeted compression. Experiments on the LongBench benchmark demonstrate FAEDKV's superiority over existing methods by up to 22%. In addition, our method shows superior, position-agnostic retrieval accuracy on the Needle-In-A-Haystack task compared to compression-based approaches.
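The frequency-domain intuition can be illustrated with a toy spectral truncation along the token axis (ours; the paper's infinite-window transform and layer-wise component selection are more involved):

```python
import torch

def fft_truncate(kv, keep_ratio=0.25):
    """Keep only the lowest-frequency components of a [T, d] cache along the
    token axis. Every token position contributes to every kept coefficient,
    unlike eviction, which drops some positions entirely."""
    spec = torch.fft.rfft(kv, dim=0)
    spec[max(1, int(spec.size(0) * keep_ratio)):] = 0   # zero high frequencies
    return torch.fft.irfft(spec, n=kv.size(0), dim=0)

K = torch.randn(128, 64)
print(((fft_truncate(K) - K) ** 2).mean())   # reconstruction error at 4x compression
```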
[33] Infogen: Generating Complex Statistical Infographics from Documents
Akash Ghosh, Aparna Garimella, Pritika Ramu, Sambaran Bandyopadhyay, Sriparna Saha
Main category: cs.CL
TL;DR: The paper introduces Infogen, a framework for generating complex statistical infographics from text-heavy documents, addressing a gap in AI capabilities. It includes a benchmark dataset (Infodat) and achieves state-of-the-art performance.
Details
Motivation: Existing AI tools generate only simple charts, lacking the ability to create complex infographics from text-heavy documents requiring deep content understanding.
Method: Proposes Infogen, a two-stage framework: fine-tuned LLMs generate metadata (title, insights, sub-chart details), which is then converted into infographic code.
Result: Infogen outperforms closed and open-source LLMs in generating contextually accurate and visually aligned infographics, as demonstrated on the Infodat dataset.
Conclusion: The work advances AI-driven infographic generation, offering a robust solution for complex, multi-chart infographics from text documents.
Abstract: Statistical infographics are powerful tools that simplify complex data into visually engaging and easy-to-understand formats. Despite advancements in AI, particularly with LLMs, existing efforts have been limited to generating simple charts, with no prior work addressing the creation of complex infographics from text-heavy documents that demand a deep understanding of the content. We address this gap by introducing the task of generating statistical infographics composed of multiple sub-charts (e.g., line, bar, pie) that are contextually accurate, insightful, and visually aligned. To achieve this, we define infographic metadata that includes its title and textual insights, along with sub-chart-specific details such as their corresponding data and alignment. We also present Infodat, the first benchmark dataset for text-to-infographic metadata generation, where each sample links a document to its metadata. We propose Infogen, a two-stage framework where fine-tuned LLMs first generate metadata, which is then converted into infographic code. Extensive evaluations on Infodat demonstrate that Infogen achieves state-of-the-art performance, outperforming both closed and open-source LLMs in text-to-statistical infographic generation.
[34] A Tensor-Based Compiler and a Runtime for Neuron-Level DNN Certifier Specifications
Avaljot Singh, Yamin Chandini Sarita, Aditya Mishra, Ishaan Goyal, Gagandeep Singh, Charith Mendis
Main category: cs.CL
TL;DR: A compiler framework is proposed to bridge the gap between neuron-level DNN certifier designs and tensor-level implementations, enabling easier development and modification of certifiers with performance comparable to hand-optimized solutions.
Details
Motivation: The difficulty in developing and modifying DNN certifiers due to the semantic gap between neuron-level design and tensor-level implementation.
Method: A compiler framework with a stack-based intermediate representation (IR) and shape analysis to translate neuron-level specifications into tensor-based implementations, along with a novel double-compression format (g-BCSR) for sparse tensors.
Result: The framework simplifies certifier development, supports diverse DNNs, and achieves performance comparable to hand-optimized implementations.
Conclusion: The proposed compiler and g-BCSR format effectively address the semantic gap and sparsity challenges, facilitating practical DNN certification.
Abstract: The uninterpretability of DNNs has led to the adoption of abstract interpretation-based certification as a practical means to establish trust in real-world systems that rely on DNNs. However, the current landscape supports only a limited set of certifiers, and developing new ones or modifying existing ones for different applications remains difficult. This is because the mathematical design of certifiers is expressed at the neuron level, while their implementations are optimized and executed at the tensor level. This mismatch creates a semantic gap between design and implementation, making manual bridging both complex and expertise-intensive – requiring deep knowledge in formal methods, high-performance computing, etc. We propose a compiler framework that automatically translates neuron-level specifications of DNN certifiers into tensor-based, layer-level implementations. This is enabled by two key innovations: a novel stack-based intermediate representation (IR) and a shape analysis that infers the implicit tensor operations needed to simulate the neuron-level semantics. During lifting, the shape analysis creates tensors in the minimal shape required to perform the corresponding operations. The IR also enables domain-specific optimizations as rewrites. At runtime, the resulting tensor computations exhibit sparsity tied to the DNN architecture. This sparsity does not align well with existing formats. To address this, we introduce g-BCSR, a double-compression format that represents tensors as collections of blocks of varying sizes, each possibly internally sparse. Using our compiler and g-BCSR, we make it easy to develop new certifiers and analyze their utility across diverse DNNs. Despite its flexibility, the compiler achieves performance comparable to hand-optimized implementations.
[35] RAG in the Wild: On the (In)effectiveness of LLMs with Mixture-of-Knowledge Retrieval Augmentation
Ran Xu, Yuchen Zhuang, Yue Yu, Haoyu Wang, Wenqi Shi, Carl Yang
Main category: cs.CL
TL;DR: RAG enhances LLMs with external knowledge but shows limitations in diverse retrieval scenarios, such as benefiting smaller models more and struggling with heterogeneous sources. Adaptive strategies are needed for real-world use.
Details
Motivation: To evaluate RAG's effectiveness beyond general-domain benchmarks and identify its limitations in diverse, realistic retrieval scenarios.
Method: Evaluated RAG systems using MassiveDS, a large-scale datastore with mixed knowledge, analyzing retrieval benefits, reranker value, and source consistency.
Result: Retrieval mainly aids smaller models, rerankers add little value, no single source excels, and LLMs struggle with heterogeneous knowledge routing.
Conclusion: Adaptive retrieval strategies are essential before deploying RAG in real-world applications due to identified limitations.
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieved at inference time. While RAG demonstrates strong performance on benchmarks largely derived from general-domain corpora like Wikipedia, its effectiveness under realistic, diverse retrieval scenarios remains underexplored. We evaluated RAG systems using MassiveDS, a large-scale datastore with a mixture of knowledge, and identified critical limitations: retrieval mainly benefits smaller models, rerankers add minimal value, and no single retrieval source consistently excels. Moreover, current LLMs struggle to route queries across heterogeneous knowledge sources. These findings highlight the need for adaptive retrieval strategies before deploying RAG in real-world settings. Our code and data can be found at https://github.com/ritaranx/RAG_in_the_Wild.
[36] AI-Driven Generation of Old English: A Framework for Low-Resource Languages
Rodrigo Gabriel Salazar Alva, Matías Nuñez, Cristian López, Javier Martín Arista
Main category: cs.CL
TL;DR: A scalable framework using LLMs to generate high-quality Old English texts, combining LoRA, backtranslation, and a dual-agent pipeline, achieving significant improvements in translation quality.
Details
Motivation: Old English is under-resourced, limiting NLP accessibility; the goal is to preserve cultural and linguistic heritage.
Method: Parameter-efficient fine-tuning (LoRA), data augmentation via backtranslation, and a dual-agent pipeline for content generation and translation.
Result: BLEU scores increased from 26 to over 65; expert human assessment confirms high grammatical accuracy and stylistic fidelity.
Conclusion: The method expands the Old English corpus and provides a blueprint for revitalizing other endangered languages, merging AI innovation with cultural preservation.
Abstract: Preserving ancient languages is essential for understanding humanity’s cultural and linguistic heritage, yet Old English remains critically under-resourced, limiting its accessibility to modern natural language processing (NLP) techniques. We present a scalable framework that uses advanced large language models (LLMs) to generate high-quality Old English texts, addressing this gap. Our approach combines parameter-efficient fine-tuning (Low-Rank Adaptation, LoRA), data augmentation via backtranslation, and a dual-agent pipeline that separates the tasks of content generation (in English) and translation (into Old English). Evaluation with automated metrics (BLEU, METEOR, and CHRF) shows significant improvements over baseline models, with BLEU scores increasing from 26 to over 65 for English-to-Old English translation. Expert human assessment also confirms high grammatical accuracy and stylistic fidelity in the generated texts. Beyond expanding the Old English corpus, our method offers a practical blueprint for revitalizing other endangered languages, effectively uniting AI innovation with the goals of cultural preservation.
[37] Sem-DPO: Mitigating Semantic Inconsistency in Preference Optimization for Prompt Engineering
Anas Mohamed, Azal Ahmad Khan, Xinran Wang, Ahmad Faraz Khan, Shuwen Ge, Saman Bahzad Khan, Ayaan Ahmad, Ali Anwar
Main category: cs.CL
TL;DR: Sem-DPO improves DPO by adding semantic consistency, achieving better results in prompt optimization.
Details
Motivation: Address semantic drift in DPO for prompt engineering, ensuring prompts stay true to user intent.
Method: Sem-DPO scales DPO loss by cosine distance in embedding space to down-weight mismatched prompts.
Result: 8-12% higher CLIP similarity and 5-9% higher human-preference scores than DPO.
Conclusion: Sem-DPO sets a new standard for prompt optimization and enables semantics-aware preference tuning.
Abstract: Generative AI can now synthesize strikingly realistic images from text, yet output quality remains highly sensitive to how prompts are phrased. Direct Preference Optimization (DPO) offers a lightweight, off-policy alternative to RL for automatic prompt engineering, but its token-level regularization leaves semantic inconsistency unchecked as prompts that win higher preference scores can still drift away from the user’s intended meaning. We introduce Sem-DPO, a variant of DPO that preserves semantic consistency yet retains its simplicity and efficiency. Sem-DPO scales the DPO loss by an exponential weight proportional to the cosine distance between the original prompt and winning candidate in embedding space, softly down-weighting training signals that would otherwise reward semantically mismatched prompts. We provide the first analytical bound on semantic drift for preference-tuned prompt generators, showing that Sem-DPO keeps learned prompts within a provably bounded neighborhood of the original text. On three standard text-to-image prompt-optimization benchmarks and two language models, Sem-DPO achieves 8-12% higher CLIP similarity and 5-9% higher human-preference scores (HPSv2.1, PickScore) than DPO, while also outperforming state-of-the-art baselines. These findings suggest that strong flat baselines augmented with semantic weighting should become the new standard for prompt-optimization studies and lay the groundwork for broader, semantics-aware preference optimization in language models.
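Reading off the abstract, the loss is standard DPO multiplied by an exponential semantic weight; a sketch under assumed names, with gamma as our placeholder temperature:

```python
import torch
import torch.nn.functional as F

def sem_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                 emb_prompt, emb_winner, beta=0.1, gamma=1.0):
    """DPO preference loss, down-weighted by exp(-gamma * cosine distance)
    between original-prompt and winning-candidate embeddings, so pairs that
    drift semantically contribute less to the gradient."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo = -F.logsigmoid(margin)                          # standard DPO term
    cos_dist = 1 - F.cosine_similarity(emb_prompt, emb_winner, dim=-1)
    return (torch.exp(-gamma * cos_dist) * dpo).mean()
```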
[38] Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG
Baiyu Chen, Wilson Wongso, Xiaoqian Hu, Yue Tan, Flora Salim
Main category: cs.CL
TL;DR: Team CRUISE’s solution for KDD Cup 2025 CRAG-MM challenge addresses VLMs’ hallucination issues with a multi-stage framework, achieving 3rd place by prioritizing factual accuracy.
Details
Motivation: Modern VLMs often hallucinate with egocentric imagery, long-tail entities, and multi-hop questions, limiting real-world applicability for fact-seeking queries.
Method: A multi-stage framework with query routing, retrieval, summarization, dual-pathways generation, and post-hoc verification to minimize hallucinations.
Result: Achieved 3rd place in Task 1, validating the framework’s effectiveness in ensuring answer reliability.
Conclusion: Prioritizing factual accuracy over completeness in multi-modal RAG systems reduces hallucinations and improves performance.
Abstract: This paper presents the technical solution developed by team CRUISE for the KDD Cup 2025 Meta Comprehensive RAG Benchmark for Multi-modal, Multi-turn (CRAG-MM) challenge. The challenge aims to address a critical limitation of modern Vision Language Models (VLMs): their propensity to hallucinate, especially when faced with egocentric imagery, long-tail entities, and complex, multi-hop questions. This issue is particularly problematic in real-world applications where users pose fact-seeking queries that demand high factual accuracy across diverse modalities. To tackle this, we propose a robust, multi-stage framework that prioritizes factual accuracy and truthfulness over completeness. Our solution integrates a lightweight query router for efficiency, a query-aware retrieval and summarization pipeline, a dual-pathways generation and a post-hoc verification. This conservative strategy is designed to minimize hallucinations, which incur a severe penalty in the competition’s scoring metric. Our approach achieved 3rd place in Task 1, demonstrating the effectiveness of prioritizing answer reliability in complex multi-modal RAG systems. Our implementation is available at https://github.com/Breezelled/KDD-Cup-2025-Meta-CRAG-MM .
[39] Multi-Agent Interactive Question Generation Framework for Long Document Understanding
Kesen Wang, Daulet Toibazar, Abdulrahman Alfulayt, Abdulaziz S. Albadawi, Ranya A. Alkahtani, Asma A. Ibrahim, Haneen A. Alhomoud, Sherif Mohamed, Pedro J. Moreno
Main category: cs.CL
TL;DR: The paper addresses the challenge of Document Understanding (DU) in long-context scenarios, proposing an automated multi-agent framework to generate high-quality questions for training LVLMs, particularly for low-resource languages like Arabic.
Details
Motivation: The performance of Large Vision-Language Models (LVLMs) declines in long-context DU tasks due to limited fine-grained training data, especially for low-resource languages. Human annotation is costly and inefficient.
Method: A fully automated, multi-agent interactive framework is introduced to generate long-context questions efficiently for English and Arabic documents.
Result: The generated questions (AraEngLongBench) are challenging for major LVLMs, demonstrating the framework’s effectiveness.
Conclusion: The proposed framework facilitates LVLM development with enhanced long-context understanding, offering a scalable solution to data scarcity.
Abstract: Document Understanding (DU) in long-contextual scenarios with complex layouts remains a significant challenge in vision-language research. Although Large Vision-Language Models (LVLMs) excel at short-context DU tasks, their performance declines in long-context settings. A key limitation is the scarcity of fine-grained training data, particularly for low-resource languages such as Arabic. Existing state-of-the-art techniques rely heavily on human annotation, which is costly and inefficient. We propose a fully automated, multi-agent interactive framework to generate long-context questions efficiently. Our approach efficiently generates high-quality single- and multi-page questions for extensive English and Arabic documents, covering hundreds of pages across diverse domains. This facilitates the development of LVLMs with enhanced long-context understanding ability. Experimental results in this work have shown that our generated English and Arabic questions (AraEngLongBench) are quite challenging to major open- and closed-source LVLMs. The code and data proposed in this work can be found at https://github.com/wangk0b/Multi_Agentic_QA_Long_Doc.git. Sample Question and Answer (QA) pairs and structured system prompts can be found in the Appendix.
[40] Goal Alignment in LLM-Based User Simulators for Conversational AI
Shuhaib Mehri, Xiaocheng Yang, Takyoung Kim, Gokhan Tur, Shikib Mehri, Dilek Hakkani-Tür
Main category: cs.CL
TL;DR: The paper introduces User Goal State Tracking (UGST) to improve goal-oriented behavior in user simulators for conversational AI, demonstrating significant improvements in benchmarks.
Details
Motivation: Current Large Language Models (LLMs) struggle with consistent goal-oriented behavior in multi-turn conversations, limiting their reliability in applications.
Method: The authors propose UGST, a framework for tracking user goal progression, and a three-stage methodology for developing goal-aligned user simulators.
Result: The approach shows substantial improvements in goal alignment across benchmarks (MultiWOZ 2.4 and τ-Bench).
Conclusion: UGST addresses a critical gap in conversational AI, providing a framework for developing reliable, goal-aligned user simulators.
Abstract: User simulators are essential to conversational AI, enabling scalable agent development and evaluation through simulated interactions. While current Large Language Models (LLMs) have advanced user simulation capabilities, we reveal that they struggle to consistently demonstrate goal-oriented behavior across multi-turn conversations–a critical limitation that compromises their reliability in downstream applications. We introduce User Goal State Tracking (UGST), a novel framework that tracks user goal progression throughout conversations. Leveraging UGST, we present a three-stage methodology for developing user simulators that can autonomously track goal progression and reason to generate goal-aligned responses. Moreover, we establish comprehensive evaluation metrics for measuring goal alignment in user simulators, and demonstrate that our approach yields substantial improvements across two benchmarks (MultiWOZ 2.4 and τ-Bench). Our contributions address a critical gap in conversational AI and establish UGST as an essential framework for developing goal-aligned user simulators.
[41] SGPO: Self-Generated Preference Optimization based on Self-Improver
Hyeonji Lee, Daejin Jo, Seohwan Yun, Sungwoong Kim
Main category: cs.CL
TL;DR: SGPO is a self-improving alignment framework for LLMs that generates its own preference data, outperforming DPO and baselines without external data.
Details
Motivation: Address limitations of conventional alignment methods like off-policy learning and human-annotated datasets, which cause distribution shifts and narrow applicability.
Method: SGPO uses an on-policy self-improving mechanism where a unified model refines responses to generate preference data for direct optimization.
Result: SGPO significantly improves performance on AlpacaEval 2.0 and Arena-Hard over DPO and baseline methods.
Conclusion: SGPO offers a scalable and effective alignment solution for LLMs without relying on external preference data.
Abstract: Large language models (LLMs), despite their extensive pretraining on diverse datasets, require effective alignment to human preferences for practical and reliable deployment. Conventional alignment methods typically employ off-policy learning and depend on human-annotated datasets, which limits their broad applicability and introduces distribution shift issues during training. To address these challenges, we propose Self-Generated Preference Optimization based on Self-Improver (SGPO), an innovative alignment framework that leverages an on-policy self-improving mechanism. Specifically, the improver refines responses from a policy model to self-generate preference data for direct preference optimization (DPO) of the policy model. Here, the improver and policy are unified into a single model, and in order to generate higher-quality preference data, this self-improver learns to make incremental yet discernible improvements to the current responses by referencing supervised fine-tuning outputs. Experimental results on AlpacaEval 2.0 and Arena-Hard show that the proposed SGPO significantly improves performance over DPO and baseline self-improving methods without using external preference data.
[42] SessionIntentBench: A Multi-task Inter-session Intention-shift Modeling Benchmark for E-commerce Customer Behavior Understanding
Yuqi Yang, Weiqi Wang, Baixuan Xu, Wei Fan, Qing Zong, Chunkit Chan, Zheye Deng, Xin Liu, Yifan Gao, Changlong Yu, Chen Luo, Yang Li, Zheng Li, Qingyu Yin, Bing Yin, Yangqiu Song
Main category: cs.CL
TL;DR: The paper introduces an intention tree concept and a dataset curation pipeline to model customer intentions in E-commerce sessions, addressing gaps in prior work. It also presents SessionIntentBench, a multimodal benchmark for evaluating intention understanding.
Details
Motivation: Prior works inadequately model customer intentions due to insufficient data exploitation and lack of benchmarks. The paper aims to fill these gaps by leveraging session history data.
Method: Proposes an intention tree and dataset curation pipeline. Constructs SessionIntentBench with four subtasks to evaluate intention understanding using mined session data.
Result: The benchmark includes 1,952,177 intention entries and 1,132,145 session trajectories. Experiments show current models struggle with intention understanding, but injecting intention improves performance.
Conclusion: The work provides a scalable solution for intention modeling in E-commerce sessions, highlighting the need for better intention-aware models.
Abstract: Session history is a common way of recording user interaction behaviors throughout a browsing activity involving multiple products. For example, if a user clicks a product webpage and then leaves, it might be because certain features do not satisfy them, which serves as an important indicator of on-the-spot user preferences. However, prior works fail to capture and model customer intention effectively because of insufficient information exploitation: only surface information such as descriptions and titles is used. There is also a lack of data and a corresponding benchmark for explicitly modeling intention in E-commerce product purchase sessions. To address these issues, we introduce the concept of an intention tree and propose a dataset curation pipeline. Together, we construct a sibling multimodal benchmark, SessionIntentBench, that evaluates L(V)LMs' capability to understand inter-session intention shift with four subtasks. With 1,952,177 intention entries, 1,132,145 session intention trajectories, and 13,003,664 available tasks mined using 10,905 sessions, we provide a scalable way to exploit existing session data for customer intention understanding. We conduct human annotation to collect ground-truth labels for a subset of the collected data, forming an evaluation gold set. Extensive experiments on the annotated data further confirm that current L(V)LMs fail to capture and utilize intention across complex session settings. Further analysis shows that injecting intention enhances LLMs' performance.
[43] Diversity-Enhanced Reasoning for Subjective Questions
Yumeng Wang, Zhiyuan Fan, Jiayu Liu, Yi R. Fung
Main category: cs.CL
TL;DR: MultiRole-R1 enhances reasoning diversity in LRMs for subjective tasks by incorporating multiple role perspectives and using diversity as a reward signal in reinforcement learning.
Details
Motivation: LRMs struggle with subjective questions due to homogeneous reasoning caused by reliance on a single ground truth. Increasing role perspectives improves performance.
Method: Proposes MultiRole-R1, featuring unsupervised data construction for diverse reasoning chains and GRPO reinforcement learning with diversity rewards.
Result: Improves accuracy and diversity in subjective reasoning, showing a positive link between diversity and accuracy.
Conclusion: MultiRole-R1 effectively enhances reasoning in LRMs, demonstrating the value of diversity-enhanced training.
Abstract: Large reasoning models (LRMs) with long chain-of-thought (CoT) capabilities have shown strong performance on objective tasks, such as math reasoning and coding. However, their effectiveness on subjective questions that may have different responses from different perspectives is still limited by a tendency towards homogeneous reasoning, introduced by the reliance on a single ground truth in supervised fine-tuning and verifiable reward in reinforcement learning. Motivated by the finding that increasing role perspectives consistently improves performance, we propose MultiRole-R1, a diversity-enhanced framework with multiple role perspectives, to improve the accuracy and diversity in subjective reasoning tasks. MultiRole-R1 features an unsupervised data construction pipeline that generates reasoning chains that incorporate diverse role perspectives. We further employ reinforcement learning via Group Relative Policy Optimization (GRPO) with reward shaping, by taking diversity as a reward signal in addition to the verifiable reward. With specially designed reward functions, we successfully promote perspective diversity and lexical diversity, uncovering a positive relation between reasoning diversity and accuracy. Our experiments on six benchmarks demonstrate MultiRole-R1's effectiveness and generalizability in enhancing both subjective and objective reasoning, showcasing the potential of diversity-enhanced training in LRMs.
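The reward-shaping idea reduces to adding a diversity bonus to the verifiable reward; a toy version with a lexical-novelty stand-in metric (the paper's actual diversity measures and the weight lam are not specified here, so treat both as assumptions):

```python
def shaped_reward(answer, group_answers, verify, lam=0.5):
    """Verifiable task reward plus a bonus for words not already used by
    the other rollouts in the same GRPO group; lam balances the two."""
    words = set(answer.split())
    seen = set()
    for other in group_answers:
        seen |= set(other.split())
    novelty = len(words - seen) / max(1, len(words))
    return verify(answer) + lam * novelty

# Toy usage with a hypothetical keyword verifier.
r = shaped_reward("scattering favors shorter wavelengths",
                  ["the sky is blue", "blue because of Rayleigh"],
                  lambda a: float("scattering" in a))
```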
[44] IQ Test for LLMs: An Evaluation Framework for Uncovering Core Skills in LLMs
Aviya Maimon, Amir DN Cohen, Gal Vishne, Shauli Ravfogel, Reut Tsarfaty
Main category: cs.CL
TL;DR: The paper proposes a new evaluation paradigm using factor analysis to identify latent skills in LLMs, addressing the limitations of current benchmark-based assessments.
Details
Motivation: Current benchmark scores for LLMs lack interpretability and fail to capture holistic model strengths and limitations, necessitating a better evaluation method.
Method: The authors use factor analysis on a dataset of 60 LLMs evaluated on 44 tasks to identify latent skills driving performance.
Result: A small set of latent skills is identified, explaining most of the performance variance, and practical tools are developed for task redundancy, model selection, and skill profiling.
Conclusion: The proposed paradigm offers a more nuanced and interpretable way to evaluate LLMs, moving beyond simplistic averaged benchmark scores.
Abstract: Current evaluations of large language models (LLMs) rely on benchmark scores, but it is difficult to interpret what these individual scores reveal about a model's overall skills. Specifically, as a community we lack understanding of how tasks relate to one another, what they measure in common, how they differ, or which ones are redundant. As a result, models are often assessed via a single score averaged across benchmarks, an approach that fails to capture the models' holistic strengths and limitations. Here, we propose a new evaluation paradigm that uses factor analysis to identify latent skills driving performance across benchmarks. We apply this method to a comprehensive new leaderboard showcasing the performance of 60 LLMs on 44 tasks, and identify a small set of latent skills that largely explain performance. Finally, we turn these insights into practical tools that identify redundant tasks, aid in model selection, and profile models along each latent skill.
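The core analysis is classical exploratory factor analysis over a models-by-tasks score matrix; a sketch with scikit-learn on synthetic scores (the number of latent skills below is a placeholder, not the paper's finding):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
scores = rng.random((60, 44))          # stand-in for 60 models x 44 tasks

fa = FactorAnalysis(n_components=5, random_state=0)
skills = fa.fit_transform(scores)      # per-model latent-skill profile [60, 5]
loadings = fa.components_              # task-skill loading matrix [5, 44]

# Tasks with near-identical loading columns measure the same skill and are
# candidates for pruning; models can be compared per skill, not one average.
print(skills.shape, loadings.shape)
```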
[45] Co-NAML-LSTUR: A Combined Model with Attentive Multi-View Learning and Long- and Short-term User Representations for News Recommendation
Minh Hoang Nguyen, Thuat Thien Nguyen, Minh Nhat Ta
Main category: cs.CL
TL;DR: Co-NAML-LSTUR is a hybrid news recommendation framework combining multi-view news modeling and dual-scale user interest representation, outperforming state-of-the-art baselines.
Details
Motivation: Addressing the challenge of modeling multi-view news representations and dynamic user interests (short- and long-term) in news recommendation systems.
Method: Integrates NAML for multi-view news modeling, LSTUR for dual-scale user representation, and BERT-based embeddings for semantic feature extraction.
Result: Achieves significant improvements over baselines on MIND-small and MIND-large benchmarks.
Conclusion: Combining multi-view news representations with dual-scale user modeling is effective for news recommendation.
Abstract: News recommendation systems play a vital role in mitigating information overload by delivering personalized news content. A central challenge is to effectively model both multi-view news representations and the dynamic nature of user interests, which often span both short- and long-term preferences. Existing methods typically rely on single-view features of news articles (e.g., titles or categories) or fail to comprehensively capture user preferences across time scales. In this work, we propose Co-NAML-LSTUR, a hybrid news recommendation framework that integrates NAML for attentive multi-view news modeling and LSTUR for capturing both long- and short-term user representations. Our model also incorporates BERT-based word embeddings to enhance semantic feature extraction. We evaluate Co-NAML-LSTUR on two widely used benchmarks, MIND-small and MIND-large. Experimental results show that Co-NAML-LSTUR achieves substantial improvements over most state-of-the-art baselines on both MIND-small and MIND-large. These results demonstrate the effectiveness of combining multi-view news representations with dual-scale user modeling. The implementation of our model is publicly available at https://github.com/MinhNguyenDS/Co-NAML-LSTUR.
[46] Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models
Yi Feng, Jiaqi Wang, Wenxuan Zhang, Zhuang Chen, Yutong Shen, Xiyao Xiao, Minlie Huang, Liping Jing, Jian Yu
Main category: cs.CL
TL;DR: The paper introduces INT and IMA, a framework for simulating narrative therapy using LLMs, outperforming standard models in therapeutic quality.
Details
Motivation: Current LLM-based mental health support lacks realism and therapeutic progression, while narrative therapy is underutilized due to limited access and social stigma.
Method: INT simulates expert narrative therapists with planned stages and expert-like responses; IMA evaluates therapy progress via 'Innovative Moments.'
Result: INT outperforms standard LLMs in therapeutic quality and depth, validated on simulated and human participants.
Conclusion: The framework enhances narrative therapy simulation and evaluation, offering high-quality support for social applications.
Abstract: Recent progress in large language models (LLMs) has opened new possibilities for mental health support, yet current approaches lack realism in simulating specialized psychotherapy and fail to capture therapeutic progression over time. Narrative therapy, which helps individuals transform problematic life stories into empowering alternatives, remains underutilized due to limited access and social stigma. We address these limitations through a comprehensive framework with two core components. First, INT (Interactive Narrative Therapist) simulates expert narrative therapists by planning therapeutic stages, guiding reflection levels, and generating contextually appropriate expert-like responses. Second, IMA (Innovative Moment Assessment) provides a therapy-centric evaluation method that quantifies effectiveness by tracking “Innovative Moments” (IMs), critical narrative shifts in client speech signaling therapy progress. Experimental results on 260 simulated clients and 230 human participants reveal that INT consistently outperforms standard LLMs in therapeutic quality and depth. We further demonstrate the effectiveness of INT in synthesizing high-quality support conversations to facilitate social applications.
[47] Modeling Professionalism in Expert Questioning through Linguistic Differentiation
Giulia D’Agostino, Chung-Chi Chen
Main category: cs.CL
TL;DR: The paper explores professionalism in expert communication, focusing on linguistic features to model and evaluate it in financial analyst questions. It introduces an annotation framework and datasets, showing linguistic features correlate with professionalism judgments and authorship. A classifier based on these features outperforms baselines.
Details
Motivation: Professionalism is underexplored in expert communication, especially in high-stakes domains like finance. The study aims to model and evaluate professionalism using linguistic features.
Method: Introduces an annotation framework for structural and pragmatic elements in financial analyst questions. Uses human-authored and LLM-generated questions to create datasets annotated for professionalism and authorship.
Result: Linguistic features correlate with both human judgments and authorship origin. A classifier based on these features outperforms gemini-2.0 and SVM baselines.
Conclusion: Professionalism is a learnable, domain-general construct that can be captured through linguistically grounded modeling.
Abstract: Professionalism is a crucial yet underexplored dimension of expert communication, particularly in high-stakes domains like finance. This paper investigates how linguistic features can be leveraged to model and evaluate professionalism in expert questioning. We introduce a novel annotation framework to quantify structural and pragmatic elements in financial analyst questions, such as discourse regulators, prefaces, and request types. Using both human-authored and large language model (LLM)-generated questions, we construct two datasets: one annotated for perceived professionalism and one labeled by question origin. We show that the same linguistic features correlate strongly with both human judgments and authorship origin, suggesting a shared stylistic foundation. Furthermore, a classifier trained solely on these interpretable features outperforms gemini-2.0 and SVM baselines in distinguishing expert-authored questions. Our findings demonstrate that professionalism is a learnable, domain-general construct that can be captured through linguistically grounded modeling.
[48] Post-Completion Learning for Language Models
Xiang Fei, Siqi Wang, Shu Wei, Yuxiang Nie, Wei Shi, Hao Feng, Can Huang
Main category: cs.CL
TL;DR: Post-Completion Learning (PCL) extends training beyond the end-of-sequence (<eos>) token, improving output quality without sacrificing deployment efficiency.
Details
Motivation: Traditional training stops at the end-of-sequence (<eos>) token, leaving the post-completion space unused as a learning signal.
Method: Uses white-box reinforcement learning for self-evaluation and reward alignment. Combines dual-track SFT and RL for multi-objective optimization.
Result: Consistent improvements over SFT and RL methods across datasets and models.
Conclusion: PCL offers a novel training approach, improving output quality without sacrificing deployment efficiency.
Abstract: Current language model training paradigms typically terminate learning upon reaching the end-of-sequence (<eos>) token.
[49] EMBRACE: Shaping Inclusive Opinion Representation by Aligning Implicit Conversations with Social Norms
Abeer Aldayel, Areej Alokaili
Main category: cs.CL
TL;DR: The paper critiques current methods for inclusive representation in conversation models, proposing a framework to evaluate implicit opinions via stance analysis for fairer alignment.
Details
Motivation: Existing methods focus on overt demographic mentions, missing nuanced opinions and risking harmful stereotypes. The study aims to address this gap by evaluating implicit opinion representation.
Method: Introduces an alignment evaluation framework using stance of responses as opinion proxies. Evaluated via PU learning and instruction-tuned language models.
Result: Provides insights into implicit opinion misrepresentation and pathways for more inclusive model behavior.
Conclusion: The framework offers a reflective approach to align models with diverse social viewpoints, improving inclusivity.
Abstract: Shaping inclusive representations that embrace diversity and ensure fair participation and reflections of values is at the core of many conversation-based models. However, many existing methods rely on surface inclusion using mention of user demographics or behavioral attributes of social groups. Such methods overlook the nuanced, implicit expression of opinion embedded in conversations. Furthermore, the over-reliance on overt cues can exacerbate misalignment and reinforce harmful or stereotypical representations in model outputs. Thus, we took a step back and recognized that equitable inclusion needs to account for the implicit expression of opinion and use the stance of responses to validate the normative alignment. This study aims to evaluate how opinions are represented in NLP or computational models by introducing an alignment evaluation framework that foregrounds implicit, often overlooked conversations and evaluates the normative social views and discourse. Our approach models the stance of responses as a proxy for the underlying opinion, enabling a considerate and reflective representation of diverse social viewpoints. We evaluate the framework using both (i) positive-unlabeled (PU) online learning with base classifiers, and (ii) instruction-tuned language models to assess post-training alignment. Through this, we provide a lens on how implicit opinions are (mis)represented and offer a pathway toward more inclusive model behavior.
[50] MoL-RL: Distilling Multi-Step Environmental Feedback into LLMs for Feedback-Independent Reasoning
Kang Yang, Jingxue Chen, Qingkun Tang, Tianxiang Zhang, Qianchun Lu
Main category: cs.CL
TL;DR: MoL-RL is a new training paradigm for LLMs that integrates multi-step environmental feedback (EF) signals, enabling robust feedback-independent reasoning without external loops. It outperforms existing methods on benchmarks.
Details
Motivation: Existing methods for leveraging EF in LLMs either lose contextual information or fail to exploit the multi-step nature of EF, limiting their effectiveness.
Method: MoL-RL combines MoL continual training (decoupling EF signals and general language capabilities) with GRPO-based post-training to distill sequential EF into single-step inferences.
Result: MoL-RL achieves state-of-the-art performance on mathematical reasoning and code generation benchmarks, maintaining strong generalization across model scales.
Conclusion: MoL-RL offers a promising approach to enhance LLMs’ reasoning capabilities by effectively leveraging multi-step textual feedback.
Abstract: Large language models (LLMs) face significant challenges in effectively leveraging sequential environmental feedback (EF) signals, such as natural language evaluations, for feedback-independent chain-of-thought (CoT) reasoning. Existing approaches either convert EF into scalar rewards, losing rich contextual information, or employ refinement datasets, failing to exploit the multi-step and discrete nature of EF interactions. To address these limitations, we propose MoL-RL, a novel training paradigm that integrates multi-step EF signals into LLMs through a dual-objective optimization framework. Our method combines MoL (Mixture-of-Losses) continual training, which decouples domain-specific EF signals (optimized via cross-entropy loss) and general language capabilities (preserved via Kullback-Leibler divergence), with GRPO-based post-training to distill sequential EF interactions into single-step inferences. This synergy enables robust feedback-independent reasoning without relying on external feedback loops. Experimental results on mathematical reasoning (MATH-500, AIME24/AIME25) and code generation (CodeAgent-Test) benchmarks demonstrate that MoL-RL achieves state-of-the-art performance with the Qwen3-8B model, while maintaining strong generalization across model scales (Qwen3-4B). This work provides a promising approach for leveraging multi-step textual feedback to enhance LLMs’ reasoning capabilities in diverse domains.
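A minimal reading of the Mixture-of-Losses idea, with the per-batch gating and the weight lam as our assumptions (the paper additionally layers GRPO-based post-training on top of this continual-training objective):

```python
import torch.nn.functional as F

def mol_loss(logits, labels, base_logits, is_feedback_batch, lam=1.0):
    """Cross-entropy on environment-feedback data teaches the EF signal;
    KL toward a frozen base model on general data preserves language
    capabilities. lam is a placeholder trade-off weight."""
    if is_feedback_batch:
        return F.cross_entropy(logits.view(-1, logits.size(-1)),
                               labels.view(-1))
    return lam * F.kl_div(F.log_softmax(logits, dim=-1),
                          F.softmax(base_logits, dim=-1),
                          reduction="batchmean")
```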
[51] What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations
Katharina Trinley, Toshiki Nakai, Tatiana Anikina, Tanja Baeumel
Main category: cs.CL
TL;DR: Aya-23-8B, a multilingual LLM, processes code-mixed, cloze, and translation tasks differently than monolingual models, with typologically related language representations and script similarity influencing its internal mechanisms.
Details
Motivation: To understand how multilingual training in LLMs like Aya-23-8B affects internal language processing compared to monolingual models.
Method: Analyzed Aya-23-8B using logit lens and neuron specialization techniques, comparing it to monolingual models (Llama 3, Chinese-LLaMA-2) on code-mixed, cloze, and translation tasks.
Result: Aya-23 activates typologically related languages during translation, shows base-language-dependent code-mixed neuron activation, and has language-specific neurons in final layers. Script similarity and typology influence processing.
Conclusion: Multilingual training shapes LLM internals uniquely, offering insights for cross-lingual transfer research.
Abstract: Large language models (LLMs) excel at multilingual tasks, yet their internal language processing remains poorly understood. We analyze how Aya-23-8B, a decoder-only LLM trained on balanced multilingual data, handles code-mixed, cloze, and translation tasks compared to predominantly monolingual models like Llama 3 and Chinese-LLaMA-2. Using logit lens and neuron specialization analyses, we find: (1) Aya-23 activates typologically related language representations during translation, unlike English-centric models that rely on a single pivot language; (2) code-mixed neuron activation patterns vary with mixing rates and are shaped more by the base language than the mixed-in one; and (3) Aya-23's language-specific neurons for code-mixed inputs concentrate in final layers, diverging from prior findings on decoder-only models. Neuron overlap analysis further shows that script similarity and typological relations impact processing across model types. These findings reveal how multilingual training shapes LLM internals and inform future cross-lingual transfer research.
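For reference, a logit-lens probe amounts to decoding intermediate hidden states through the output head; a sketch assuming a Hugging Face Llama-style module layout (model.model.norm and model.lm_head are assumptions about the architecture, not the paper's code):

```python
def logit_lens(model, hidden_states, tokenizer, k=5):
    """Print each layer's top-k next-token candidates at the last position,
    revealing which language's tokens dominate at which depth."""
    for layer, h in enumerate(hidden_states):
        logits = model.lm_head(model.model.norm(h[:, -1]))
        top = logits.topk(k).indices[0].tolist()
        print(layer, tokenizer.convert_ids_to_tokens(top))

# Usage: out = model(input_ids, output_hidden_states=True)
#        logit_lens(model, out.hidden_states, tokenizer)
```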
[52] Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation
Abdullah Alabdullah, Lifeng Han, Chenghua Lin
Main category: cs.CL
TL;DR: The paper addresses challenges in Dialectal Arabic (DA) NLP by evaluating training-free prompting techniques and developing a resource-efficient fine-tuning pipeline for DA-MSA translation, achieving notable results with GPT-4o and quantized models.
Details
Motivation: The linguistic divide between DA and MSA limits digital access and hinders Arabic machine translation, necessitating solutions for low-resource settings.
Method: Evaluated prompting techniques (few-shot, zero-shot, chain-of-thought, Ara-TEaR) across six LLMs and developed a fine-tuning pipeline with quantized models.
Result: GPT-4o performed best in prompting, while a quantized Gemma2-9B model achieved a CHrF++ score of 49.88. Multi-dialect training and 4-bit quantization improved performance and efficiency.
Conclusion: The study provides a practical approach to enhance DA-MSA translation, demonstrating feasibility in low-resource settings and promoting inclusive Arabic NLP.
Abstract: Dialectal Arabic (DA) poses a persistent challenge for natural language processing (NLP), as most everyday communication in the Arab world occurs in dialects that diverge significantly from Modern Standard Arabic (MSA). This linguistic divide limits access to digital services and educational resources and impedes progress in Arabic machine translation. This paper presents two core contributions to advancing DA-MSA translation for the Levantine, Egyptian, and Gulf dialects, particularly in low-resource and computationally constrained settings: a comprehensive evaluation of training-free prompting techniques, and the development of a resource-efficient fine-tuning pipeline. Our evaluation of prompting strategies across six large language models (LLMs) found that few-shot prompting consistently outperformed zero-shot, chain-of-thought, and our proposed Ara-TEaR method. GPT-4o achieved the highest performance across all prompting settings. For fine-tuning, a quantized Gemma2-9B model achieved a CHrF++ score of 49.88, outperforming zero-shot GPT-4o (44.58). Joint multi-dialect trained models outperformed single-dialect counterparts by over 10% CHrF++, and 4-bit quantization reduced memory usage by 60% with less than 1% performance loss. The results and insights of our experiments offer a practical blueprint for improving dialectal inclusion in Arabic NLP, showing that high-quality DA-MSA machine translation is achievable even with limited resources and paving the way for more inclusive language technologies.
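For reference, CHrF++ scores like those above can be computed with sacrebleu, where chrF++ is chrF with word bigrams enabled; the sentences below are placeholders.

from sacrebleu.metrics import CHRF

chrfpp = CHRF(word_order=2)  # word_order=2 turns chrF into chrF++
hypotheses = ["model output in MSA"]      # placeholder system output
references = [["gold MSA translation"]]   # one reference stream
print(chrfpp.corpus_score(hypotheses, references))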
[53] DYNARTmo: A Dynamic Articulatory Model for Visualization of Speech Movement Patterns
Bernd J. Kröger
Main category: cs.CL
TL;DR: DYNARTmo is a dynamic articulatory model for visualizing speech articulation in 2D, built on UK-DYNAMO, with applications in education and therapy.
Details
Motivation: To enhance understanding and training in speech articulation processes through a dynamic, visually accessible model.
Method: Integrates articulatory underspecification, segmental/gestural control, and coarticulation, simulating six articulators via continuous/discrete parameters.
Result: Implemented in a web app (SpeechArticulationTrainer) with multiple views for phonetics education and speech therapy.
Conclusion: Current focus is static modeling; future work will expand to dynamic movement and articulatory-acoustic integration.
Abstract: We present DYNARTmo, a dynamic articulatory model designed to visualize speech articulation processes in a two-dimensional midsagittal plane. The model builds upon the UK-DYNAMO framework and integrates principles of articulatory underspecification, segmental and gestural control, and coarticulation. DYNARTmo simulates six key articulators based on ten continuous and six discrete control parameters, allowing for the generation of both vocalic and consonantal articulatory configurations. The current implementation is embedded in a web-based application (SpeechArticulationTrainer) that includes sagittal, glottal, and palatal views, making it suitable for use in phonetics education and speech therapy. While this paper focuses on the static modeling aspects, future work will address dynamic movement generation and integration with articulatory-acoustic modules.
[54] RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing
Hao Xiang, Tianyi Tang, Yang Su, Bowen Yu, An Yang, Fei Huang, Yichang Zhang, Yaojie Lu, Hongyu Lin, Xianpei Han, Jingren Zhou, Junyang Lin, Le Sun
Main category: cs.CL
TL;DR: RMTBench is a user-centric bilingual role-playing benchmark for evaluating LLMs, featuring diverse characters and multi-turn dialogues based on user motivations.
Details
Motivation: Existing benchmarks are character-centric and oversimplify interactions, failing to reflect real-world applications. RMTBench addresses this by focusing on user intentions.
Method: RMTBench includes 80 diverse characters and 8,000+ dialogue rounds, with explicit user motivations and a multi-turn dialogue simulation mechanism.
Result: The benchmark provides a more effective framework for assessing LLM role-playing capabilities by aligning with practical user needs.
Conclusion: RMTBench bridges the gap between academic evaluation and practical deployment, offering a comprehensive tool for LLM role-playing assessment.
Abstract: Recent advancements in Large Language Models (LLMs) have shown outstanding potential for role-playing applications. Evaluating these capabilities is becoming crucial yet remains challenging. Existing benchmarks mostly adopt a character-centric approach, simplify user-character interactions to isolated Q&A tasks, and fail to reflect real-world applications. To address this limitation, we introduce RMTBench, a comprehensive user-centric bilingual role-playing benchmark featuring 80 diverse characters and over 8,000 dialogue rounds. RMTBench includes custom characters with detailed backgrounds and abstract characters defined by simple traits, enabling evaluation across various user scenarios. Our benchmark constructs dialogues based on explicit user motivations rather than character descriptions, ensuring alignment with practical user applications. Furthermore, we construct an authentic multi-turn dialogue simulation mechanism. With carefully selected evaluation dimensions and LLM-based scoring, this mechanism captures the complex intention of conversations between the user and the character. By shifting focus from character background to user intention fulfillment, RMTBench bridges the gap between academic evaluation and practical deployment requirements, offering a more effective framework for assessing role-playing capabilities in LLMs. All code and datasets will be released soon.
[55] Length Representations in Large Language Models
Sangjun Moon, Dasom Choi, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura
Main category: cs.CL
TL;DR: LLMs encode output sequence length in internal representations, with multi-head attention playing a key role. Length control is partially disentangled from semantics, and hidden units adjust to length-specific prompts.
Details
Motivation: To understand the internal mechanisms in LLMs that control output sequence length, which remain unexplored despite their capabilities.
Method: Empirical analysis of LLMs’ internal representations, focusing on multi-head attention and hidden units, to study how output length is encoded and adjusted.
Result: Multi-head attention is critical for length control, and specific hidden units can scale to adjust length without losing semantic quality. Hidden units reflect length awareness in prompts.
Conclusion: LLMs have adaptable internal mechanisms for output length control, showing disentanglement of length and semantic information.
Abstract: Large language models (LLMs) have shown remarkable capabilities across various tasks, learned from massive amounts of text-based data. Although LLMs can control output sequence length, particularly in instruction-based settings, the internal mechanisms behind this control have remained unexplored. In this study, we provide empirical evidence on how output sequence length information is encoded within the internal representations in LLMs. In particular, our findings show that multi-head attention mechanisms are critical in determining output sequence length, which can be adjusted in a disentangled manner. By scaling specific hidden units within the model, we can control the output sequence length without losing the informativeness of the generated text, thereby indicating that length information is partially disentangled from semantic information. Moreover, some hidden units become increasingly active as prompts become more length-specific, thus reflecting the model’s internal awareness of this attribute. Our findings suggest that LLMs have learned robust and adaptable internal mechanisms for controlling output length without any external control.
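A minimal sketch of the kind of intervention described above: scaling a handful of length-associated hidden units through a PyTorch forward hook. The layer index and unit ids are hypothetical; the paper identifies such units empirically.

import torch

def make_scaler(unit_ids, factor):
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h[..., unit_ids] *= factor  # up- or down-weight candidate length units
        return output
    return hook

# Hypothetical usage on a Hugging Face-style decoder:
# handle = model.model.layers[20].register_forward_hook(make_scaler([113, 407], 1.5))
# ... model.generate(...) ...
# handle.remove()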
[56] Cognitive Chain-of-Thought: Structured Multimodal Reasoning about Social Situations
Eunkyu Park, Wesley Hanwen Deng, Gunhee Kim, Motahhare Eslami, Maarten Sap
Main category: cs.CL
TL;DR: CoCoT, a three-stage cognitive prompting strategy, outperforms CoT and direct prompting in visual tasks by enhancing interpretability and social awareness in VLMs.
Details
Motivation: Address the breakdown of flat CoT in tasks requiring simultaneous perception, understanding, and judgment, especially in social contexts.
Method: Introduces CoCoT, a strategy with three stages: perception, situation, and norm, inspired by cognitive processes.
Result: CoCoT consistently outperforms CoT and direct prompting by +8% on average across multimodal benchmarks.
Conclusion: Cognitively grounded reasoning stages improve VLMs’ interpretability and social awareness, leading to safer and more reliable systems.
Abstract: Chain-of-Thought (CoT) prompting helps models think step by step. But what happens when they must see, understand, and judge, all at once? In visual tasks grounded in social context, where bridging perception with norm-grounded judgments is essential, flat CoT often breaks down. We introduce Cognitive Chain-of-Thought (CoCoT), a prompting strategy that scaffolds VLM reasoning through three cognitively inspired stages: perception, situation, and norm. Our experiments show that, across multiple multimodal benchmarks (including intent disambiguation, commonsense reasoning, and safety), CoCoT consistently outperforms CoT and direct prompting (+8% on average). Our findings demonstrate that cognitively grounded reasoning stages enhance interpretability and social awareness in VLMs, paving the way for safer and more reliable multimodal systems.
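A hedged reconstruction of what a CoCoT-style prompt could look like, using the three stage names from the summary; the exact wording is an assumption, not the authors' template.

COCOT_PROMPT = """You are shown an image and asked a question.
Step 1 (Perception): List the salient objects, people, and actions in the image.
Step 2 (Situation): Describe the social situation these observations imply.
Step 3 (Norm): State the social norms or expectations relevant to that situation.
Using Steps 1-3, answer the question.

Question: {question}"""

print(COCOT_PROMPT.format(question="Is the speaker's remark appropriate here?"))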
[57] CONCAP: Seeing Beyond English with Concepts Retrieval-Augmented Captioning
George Ibrahim, Rita Ramos, Yova Kementchedjhieva
Main category: cs.CL
TL;DR: CONCAP improves multilingual image captioning by integrating retrieved captions with image-specific concepts, reducing data needs and performance gaps.
Details
Motivation: Multilingual vision-language models lag behind English ones due to limited training data and costly parameterization. RAG helps but introduces biases via translated captions.
Method: CONCAP combines retrieved captions with image-specific concepts for better contextualization and grounding in multilingual captioning.
Result: Experiments on XM3600 show CONCAP performs well on low- and mid-resource languages with reduced data requirements.
Conclusion: Concept-aware retrieval augmentation effectively bridges multilingual performance gaps.
Abstract: Multilingual vision-language models have made significant strides in image captioning, yet they still lag behind their English counterparts due to limited multilingual training data and costly large-scale model parameterization. Retrieval-augmented generation (RAG) offers a promising alternative by conditioning caption generation on retrieved examples in the target language, reducing the need for extensive multilingual training. However, multilingual RAG captioning models often depend on retrieved captions translated from English, which can introduce mismatches and linguistic biases relative to the source language. We introduce CONCAP, a multilingual image captioning model that integrates retrieved captions with image-specific concepts, enhancing the contextualization of the input image and grounding the captioning process across different languages. Experiments on the XM3600 dataset indicate that CONCAP enables strong performance on low- and mid-resource languages, with highly reduced data requirements. Our findings highlight the effectiveness of concept-aware retrieval augmentation in bridging multilingual performance gaps.
[58] Survey of NLU Benchmarks Diagnosing Linguistic Phenomena: Why not Standardize Diagnostics Benchmarks?
Khloud AL Jallad, Nada Ghneim, Ghaida Rebdawi
Main category: cs.CL
TL;DR: A survey reviewing NLU benchmarks, focusing on diagnostics datasets and linguistic phenomena, highlighting gaps like lack of standardized evaluation metrics.
Details
Motivation: To analyze and compare NLU benchmarks, especially diagnostics datasets, and address the lack of standardized evaluation metrics.
Method: Comprehensive review and comparison of English, Arabic, and Multilingual NLU benchmarks, focusing on diagnostics datasets and linguistic phenomena.
Result: Identified gaps in standardization, proposed research question on evaluation metrics, and suggested a global hierarchy for linguistic phenomena.
Conclusion: Standardized evaluation metrics for diagnostics benchmarks are needed to improve NLU model comparisons and insights.
Abstract: Natural Language Understanding (NLU) is a basic task in Natural Language Processing (NLP). The evaluation of NLU capabilities has become a trending research topic that has attracted researchers in recent years, resulting in the development of numerous benchmarks. These benchmarks include various tasks and datasets in order to evaluate the results of pretrained models via public leaderboards. Notably, several benchmarks contain diagnostics datasets designed for investigation and fine-grained error analysis across a wide range of linguistic phenomena. This survey provides a comprehensive review of available English, Arabic, and Multilingual NLU benchmarks, with a particular emphasis on their diagnostics datasets and the linguistic phenomena they cover. We present a detailed comparison and analysis of these benchmarks, highlighting their strengths and limitations in evaluating NLU tasks and providing in-depth error analysis. When highlighting the gaps in the state of the art, we noted that there is no naming convention for macro and micro categories, nor even a standard set of linguistic phenomena that should be covered. Consequently, we formulated a research question regarding the evaluation metrics of diagnostics benchmarks: “Why don’t we have an evaluation standard for NLU diagnostics benchmarks?”, analogous to ISO standards in industry. We conducted a deep analysis and comparison of the covered linguistic phenomena in order to support experts in building a global hierarchy of linguistic phenomena in the future. We believe that having evaluation metrics for diagnostics evaluation could yield more insights when comparing the results of the studied models on different diagnostics benchmarks.
[59] CodeNER: Code Prompting for Named Entity Recognition
Sungwoo Han, Hyeyeon Kim, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura
Main category: cs.CL
TL;DR: A novel code-based prompting method improves LLMs’ NER performance by embedding detailed BIO schema instructions, outperforming text-based prompting across multiple languages.
Details
Motivation: Existing NER methods using LLMs like ChatGPT rely only on input context, missing detailed labeling requirements.
Method: Proposes code-based prompting to embed BIO schema instructions, leveraging LLMs’ ability to understand programming languages.
Result: Outperforms text-based prompting on ten benchmarks in English, Arabic, Finnish, Danish, and German datasets.
Conclusion: Code-based prompting with chain-of-thought further enhances NER performance, proving the value of structured instructions.
Abstract: Recent studies have explored various approaches for treating candidate named entity spans as both source and target sequences in named entity recognition (NER) by leveraging large language models (LLMs). Although previous approaches have successfully generated candidate named entity spans with suitable labels, they rely solely on input context information when using LLMs, particularly ChatGPT. However, NER inherently requires capturing detailed labeling requirements with input context information. To address this issue, we propose a novel method that leverages code-based prompting to improve the capabilities of LLMs in understanding and performing NER. By embedding code within prompts, we provide detailed BIO schema instructions for labeling, thereby exploiting the ability of LLMs to comprehend long-range scopes in programming languages. Experimental results demonstrate that the proposed code-based prompting method outperforms conventional text-based prompting on ten benchmarks across English, Arabic, Finnish, Danish, and German datasets, indicating the effectiveness of explicitly structuring NER instructions. We also verify that combining the proposed code-based prompting method with the chain-of-thought prompting further improves performance.
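To illustrate the idea of code-based prompting, here is a sketch of a prompt that frames BIO labeling as filling in a Python list; the template only mirrors the described approach and is not the paper's exact prompt.

def build_code_prompt(sentence: str) -> str:
    tokens = sentence.split()
    return (
        "# Task: named entity recognition with the BIO scheme.\n"
        "# B-X begins an entity of type X, I-X continues it, O is outside.\n"
        "# Entity types: PER, ORG, LOC, MISC.\n"
        f"tokens: list[str] = {tokens!r}\n"
        "labels: list[str] = [  # one BIO label per token\n"
    )

print(build_code_prompt("Barack Obama visited Berlin"))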
[60] Speaking in Words, Thinking in Logic: A Dual-Process Framework in QA Systems
Tuan Bui, Trong Le, Phat Thai, Sang Nguyen, Minh Hua, Ngan Pham, Thang Bui, Tho Quan
Main category: cs.CL
TL;DR: Text-JEPA is a lightweight framework for converting natural language to first-order logic, improving efficiency and explainability in closed-domain QA systems.
Details
Motivation: Address inefficiencies in translating natural language to formal logic in neural-symbolic frameworks for specialized domains like education, healthcare, and law.
Method: Introduces Text-JEPA, inspired by dual-system cognitive theory, combining efficient logic generation (System 1) with the Z3 solver for robust inference (System 2). Evaluated using custom metrics: conversion, reasoning, and Spearman rho scores.
Result: Text-JEPA achieves competitive performance with lower computational overhead compared to larger LLM-based systems.
Conclusion: Structured, interpretable frameworks like Text-JEPA are promising for efficient and explainable QA in specialized domains.
Abstract: Recent advances in large language models (LLMs) have significantly enhanced question-answering (QA) capabilities, particularly in open-domain contexts. However, in closed-domain scenarios such as education, healthcare, and law, users demand not only accurate answers but also transparent reasoning and explainable decision-making processes. While neural-symbolic (NeSy) frameworks have emerged as a promising solution, leveraging LLMs for natural language understanding and symbolic systems for formal reasoning, existing approaches often rely on large-scale models and exhibit inefficiencies in translating natural language into formal logic representations. To address these limitations, we introduce Text-JEPA (Text-based Joint-Embedding Predictive Architecture), a lightweight yet effective framework for converting natural language into first-order logic (NL2FOL). Drawing inspiration from dual-system cognitive theory, Text-JEPA emulates System 1 by efficiently generating logic representations, while the Z3 solver operates as System 2, enabling robust logical inference. To rigorously evaluate the NL2FOL-to-reasoning pipeline, we propose a comprehensive evaluation framework comprising three custom metrics: conversion score, reasoning score, and Spearman rho score, which collectively capture the quality of logical translation and its downstream impact on reasoning accuracy. Empirical results on domain-specific datasets demonstrate that Text-JEPA achieves competitive performance with significantly lower computational overhead compared to larger LLM-based systems. Our findings highlight the potential of structured, interpretable reasoning frameworks for building efficient and explainable QA systems in specialized domains.
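The System-2 half of such a pipeline can be pictured with Z3's Python bindings: once the LLM emits logic, entailment is checked by asserting the premises together with the negated conclusion. The toy facts are illustrative, not from the paper's datasets.

from z3 import Bool, Implies, Not, Solver, unsat

enrolled = Bool("enrolled")
passed_exam = Bool("passed_exam")

s = Solver()
s.add(Implies(enrolled, passed_exam))  # premise: every enrolled student passed
s.add(enrolled)                        # premise: this student is enrolled
s.add(Not(passed_exam))                # negation of the conclusion

# Premises plus negated conclusion unsatisfiable => the conclusion is entailed.
print("entailed" if s.check() == unsat else "not entailed")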
[61] AQUA: A Large Language Model for Aquaculture & Fisheries
Praneeth Narisetty, Uday Kumar Reddy Kattamanchi, Lohit Akshant Nimma, Sri Ram Kaushik Karnati, Shiva Nagendra Babu Kore, Mounika Golamari, Tejashree Nageshreddy
Main category: cs.CL
TL;DR: AQUA, a large language model (LLM) for aquaculture, addresses industry challenges like disease and inefficiency using AQUADAPT, a framework for synthetic data generation.
Details
Motivation: Aquaculture faces challenges like disease outbreaks and inefficiencies; existing AI methods lack domain-specific solutions.
Method: Introduces AQUA, an aquaculture-specific LLM, and AQUADAPT, a framework combining expert knowledge and automated techniques for synthetic data generation.
Result: AQUA and AQUADAPT provide a foundation for LLM-driven innovations in aquaculture research and decision-making.
Conclusion: The work bridges the gap in AI applications for aquaculture, offering tailored solutions for industry challenges.
Abstract: Aquaculture plays a vital role in global food security and coastal economies by providing sustainable protein sources. As the industry expands to meet rising demand, it faces growing challenges such as disease outbreaks, inefficient feeding practices, rising labor costs, logistical inefficiencies, and critical hatchery issues, including high mortality rates and poor water quality control. Although artificial intelligence has made significant progress, existing machine learning methods fall short of addressing the domain-specific complexities of aquaculture. To bridge this gap, we introduce AQUA, the first large language model (LLM) tailored for aquaculture, designed to support farmers, researchers, and industry practitioners. Central to this effort is AQUADAPT (Data Acquisition, Processing and Tuning), an Agentic Framework for generating and refining high-quality synthetic data using a combination of expert knowledge, large-scale language models, and automated evaluation techniques. Our work lays the foundation for LLM-driven innovations in aquaculture research, advisory systems, and decision-making tools.
[62] SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers
Chaitanya Manem, Pratik Prabhanjan Brahma, Prakamya Mishra, Zicheng Liu, Emad Barsoum
Main category: cs.CL
TL;DR: SAND-Math introduces a pipeline to generate and enhance difficult math problems, improving LLM performance by 17.85 points on AIME25.
Details
Motivation: Addressing the scarcity of challenging math training data for LLMs.
Method: Generates problems from scratch and elevates complexity via Difficulty Hiking.
Result: Boosts baseline performance by 17.85 points; Difficulty Hiking increases average problem difficulty and performance.
Conclusion: SAND-Math provides a scalable toolkit for enhancing mathematical reasoning in LLMs.
Abstract: The demand for Large Language Models (LLMs) capable of sophisticated mathematical reasoning is growing across industries. However, the development of performant mathematical LLMs is critically bottlenecked by the scarcity of difficult, novel training data. We introduce SAND-Math (Synthetic Augmented Novel and Difficult Mathematics problems and solutions), a pipeline that addresses this by first generating high-quality problems from scratch and then systematically elevating their complexity via a new Difficulty Hiking step. We demonstrate the effectiveness of our approach through two key findings. First, augmenting a strong baseline with SAND-Math data significantly boosts performance, outperforming the next-best synthetic dataset by 17.85 absolute points on the AIME25 benchmark. Second, in a dedicated ablation study, we show our Difficulty Hiking process is highly effective: by increasing average problem difficulty from 5.02 to 5.98, this step lifts AIME25 performance from 46.38% to 49.23%. The full generation pipeline, final dataset, and a fine-tuned model form a practical and scalable toolkit for building more capable and efficient mathematical reasoning LLMs. The SAND-Math dataset is released at https://huggingface.co/datasets/amd/SAND-MATH
[63] Dialogues of Dissent: Thematic and Rhetorical Dimensions of Hate and Counter-Hate Speech in Social Media Conversations
Effi Levi, Gal Ron, Odelia Oshri, Shaul R. Shenhav
Main category: cs.CL
TL;DR: A multi-labeled scheme for joint annotation of hate and counter-hate speech in social media, analyzing thematic and rhetorical dimensions using Aristotle’s Logos, Ethos, and Pathos.
Details
Motivation: To understand the interaction between hate and counter-hate speech in social media conversations and their impact on online behavior.
Method: Annotated 92 conversations (720 tweets) and conducted statistical analyses incorporating public metrics.
Result: Revealed patterns of interaction between thematic and rhetorical dimensions in hate and counter-hate speech.
Conclusion: Provides insights into hate message spread, counter strategies, and their online impact.
Abstract: We introduce a novel multi-labeled scheme for joint annotation of hate and counter-hate speech in social media conversations, categorizing hate and counter-hate messages into thematic and rhetorical dimensions. The thematic categories outline different discursive aspects of each type of speech, while the rhetorical dimension captures how hate and counter messages are communicated, drawing on Aristotle’s Logos, Ethos and Pathos. We annotate a sample of 92 conversations, consisting of 720 tweets, and conduct statistical analyses, incorporating public metrics, to explore patterns of interaction between the thematic and rhetorical dimensions within and between hate and counter-hate speech. Our findings provide insights into the spread of hate messages on social media, the strategies used to counter them, and their potential impact on online behavior.
[64] Enhancing Hallucination Detection via Future Context
Joosung Lee, Cheonbok Park, Hwiyeol Jo, Jeonghoon Kim, Joonsuk Park, Kang Min Yoo
Main category: cs.CL
TL;DR: A framework for detecting hallucinations in black-box LLM-generated text by sampling future contexts to improve detection accuracy.
Details
Motivation: The challenge of detecting hallucinations in black-box LLM outputs due to their hidden generation process.
Method: Sampling future contexts to detect hallucinations, integrating with existing sampling-based methods.
Result: Performance improvements across multiple methods using the proposed sampling approach.
Conclusion: The framework effectively enhances hallucination detection in black-box LLM outputs.
Abstract: Large Language Models (LLMs) are widely used to generate plausible text on online platforms, without revealing the generation process. As users increasingly encounter such black-box outputs, detecting hallucinations has become a critical challenge. To address this challenge, we focus on developing a hallucination detection framework for black-box generators. Motivated by the observation that hallucinations, once introduced, tend to persist, we sample future contexts. The sampled future contexts provide valuable clues for hallucination detection and can be effectively integrated with various sampling-based methods. We extensively demonstrate performance improvements across multiple methods using our proposed sampling approach.
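A minimal sketch of the core recipe under stated assumptions: sample several future continuations and treat their failure to support the candidate claim as a hallucination signal. Both generate_continuations and supports are hypothetical stand-ins for an LLM sampler and an entailment scorer.

def hallucination_score(context, candidate, generate_continuations, supports, k=5):
    # Hallucinations, once introduced, tend to persist in sampled futures.
    futures = generate_continuations(context + candidate, n=k)
    support = sum(supports(premise=f, hypothesis=candidate) for f in futures) / k
    return 1.0 - support  # higher = more likely hallucinated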
[65] ZSE-Cap: A Zero-Shot Ensemble for Image Retrieval and Prompt-Guided Captioning
Duc-Tai Dinh, Duc Anh Khoa Dinh
Main category: cs.CL
TL;DR: ZSE-Cap is a zero-shot ensemble system for image retrieval and captioning, achieving 4th place in EVENTA by combining CLIP, SigLIP, and DINOv2 for retrieval and prompting Gemma 3 for captioning.
Details
Motivation: To develop a zero-shot system for image retrieval and captioning without finetuning, leveraging ensemble methods and prompting for high performance.
Method: Ensembles similarity scores from CLIP, SigLIP, and DINOv2 for retrieval; uses engineered prompts with Gemma 3 for captioning.
Result: Achieved a score of 0.42002, ranking 4th on the private test set.
Conclusion: Combining foundation models through ensembling and prompting is effective for zero-shot tasks.
Abstract: We present ZSE-Cap (Zero-Shot Ensemble for Captioning), our 4th place system in the Event-Enriched Image Analysis (EVENTA) shared task on article-grounded image retrieval and captioning. Our zero-shot approach requires no finetuning on the competition’s data. For retrieval, we ensemble similarity scores from CLIP, SigLIP, and DINOv2. For captioning, we leverage a carefully engineered prompt to guide the Gemma 3 model, enabling it to link high-level events from the article to the visual content in the image. Our system achieved a final score of 0.42002, securing a top-4 position on the private test set, demonstrating the effectiveness of combining foundation models through ensembling and prompting. Our code is available at https://github.com/ductai05/ZSE-Cap.
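The retrieval ensemble can be sketched as score fusion: per-model similarity matrices are z-scored per query so no single encoder's scale dominates, then averaged. Equal weights and z-normalization are assumptions, not the system's reported configuration.

import numpy as np

def fuse_and_rank(score_matrices, weights=None):
    """score_matrices: list of (n_queries, n_images) arrays, e.g. from CLIP, SigLIP, DINOv2."""
    weights = weights or [1.0] * len(score_matrices)
    fused = np.zeros_like(score_matrices[0], dtype=float)
    for w, s in zip(weights, score_matrices):
        z = (s - s.mean(axis=1, keepdims=True)) / (s.std(axis=1, keepdims=True) + 1e-8)
        fused += w * z
    return np.argsort(-fused, axis=1)  # ranked image indices per query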
[66] Before the Outrage: Challenges and Advances in Predicting Online Antisocial Behavior
Anaïs Ollagnier
Main category: cs.CL
TL;DR: A systematic review of 49 studies on antisocial behavior (ASB) prediction in social media, proposing a taxonomy of five core tasks and analyzing trends in modeling techniques and dataset challenges.
Details
Motivation: Address the fragmentation in ASB prediction research by providing a unified taxonomy and synthesis of methods to improve platform safety and societal wellbeing.
Method: Review and analyze 49 studies, categorizing them into five core task types, and examining modeling techniques and dataset characteristics.
Result: Identified five task types, trends in modeling (e.g., pre-trained language models), and challenges like dataset scarcity and temporal drift.
Conclusion: The review provides a framework to guide future research toward robust and socially responsible ASB prediction, highlighting emerging directions like multilingual modeling.
Abstract: Antisocial behavior (ASB) on social media, including hate speech, harassment, and trolling, poses growing challenges for platform safety and societal wellbeing. While prior work has primarily focused on detecting harmful content after it appears, predictive approaches aim to forecast future harmful behaviors, such as hate speech propagation, conversation derailment, or user recidivism, before they fully unfold. Despite increasing interest, the field remains fragmented, lacking a unified taxonomy or clear synthesis of existing methods. This paper presents a systematic review of over 49 studies on ASB prediction, offering a structured taxonomy of five core task types: early harm detection, harm emergence prediction, harm propagation prediction, behavioral risk prediction, and proactive moderation support. We analyze how these tasks differ by temporal framing, prediction granularity, and operational goals. In addition, we examine trends in modeling techniques, from classical machine learning to pre-trained language models, and assess the influence of dataset characteristics on task feasibility and generalization. Our review highlights methodological challenges, such as dataset scarcity, temporal drift, and limited benchmarks, while outlining emerging research directions including multilingual modeling, cross-platform generalization, and human-in-the-loop systems. By organizing the field around a coherent framework, this survey aims to guide future work toward more robust and socially responsible ASB prediction.
[67] Ontology-Enhanced Knowledge Graph Completion using Large Language Models
Wenbin Guo, Xin Wang, Jiaoyan Chen, Zhao Li, Zirui Chen
Main category: cs.CL
TL;DR: OL-KGC integrates neural-perceptual structural information with ontological knowledge to enhance LLM-based KGC, outperforming existing methods.
Details
Motivation: Current LLM-based KGC methods rely on implicit knowledge, propagating errors and lacking decisive reasoning. The goal is to combine structural and ontological knowledge for deeper understanding.
Method: OL-KGC embeds structural information into the textual space using neural perceptual mechanisms and extracts ontological knowledge from KGs, converting it into LLM-readable text for logic guidance.
Result: OL-KGC outperforms mainstream KGC methods on benchmarks (FB15K-237, UMLS, WN18RR), achieving state-of-the-art performance.
Conclusion: OL-KGC effectively combines structural and ontological knowledge, improving LLM-based KGC with superior results.
Abstract: Large Language Models (LLMs) have been extensively adopted in Knowledge Graph Completion (KGC), showcasing significant research advancements. However, as black-box models driven by deep neural architectures, current LLM-based KGC methods rely on implicit knowledge representation with parallel propagation of erroneous knowledge, thereby hindering their ability to produce conclusive and decisive reasoning outcomes. We aim to integrate neural-perceptual structural information with ontological knowledge, leveraging the powerful capabilities of LLMs to achieve a deeper understanding of the intrinsic logic of the knowledge. We propose an ontology enhanced KGC method using LLMs – OL-KGC. It first leverages neural perceptual mechanisms to effectively embed structural information into the textual space, and then uses an automated extraction algorithm to retrieve ontological knowledge from the knowledge graphs (KGs) that needs to be completed, which is further transformed into a textual format comprehensible to LLMs for providing logic guidance. We conducted extensive experiments on three widely-used benchmarks – FB15K-237, UMLS and WN18RR. The experimental results demonstrate that OL-KGC significantly outperforms existing mainstream KGC methods across multiple evaluation metrics, achieving state-of-the-art performance.
[68] Geometric-Mean Policy Optimization
Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, Fang Wan, Furu Wei
Main category: cs.CL
TL;DR: GMPO improves GRPO by optimizing the geometric mean of token-level rewards for stability and outperforms GRPO on benchmarks.
Details
Motivation: GRPO's instability due to outlier importance-weighted rewards during policy updates.
Method: GMPO maximizes the geometric mean of token-level rewards, reducing sensitivity to outliers.
Result: GMPO-7B outperforms GRPO by 4.1% on math benchmarks and 1.4% on multimodal reasoning.
Conclusion: GMPO offers stable training and better performance than GRPO.
Abstract: Recent advancements, such as Group Relative Policy Optimization (GRPO), have enhanced the reasoning capabilities of large language models by optimizing the arithmetic mean of token-level rewards. However, GRPO suffers from unstable policy updates when processing tokens with outlier importance-weighted rewards, which manifests as extreme importance sampling ratios during training, i.e., the ratio between the sampling probabilities assigned to a token by the current and old policies. In this work, we propose Geometric-Mean Policy Optimization (GMPO), a stabilized variant of GRPO. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and maintains a more stable range of importance sampling ratios. In addition, we provide comprehensive theoretical and experimental analysis to justify the design and stability benefits of GMPO. Beyond improved stability, GMPO-7B outperforms GRPO by an average of 4.1% on multiple mathematical benchmarks (AIME24, AMC, MATH500, OlympiadBench, and Minerva) and 1.4% on the multimodal reasoning benchmark Geometry3K. Code is available at https://github.com/callsys/GMPO.
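The stabilizing effect of the geometric mean is easy to see numerically: exp(mean(log r)) is pulled far less by one outlier token than the arithmetic mean. The reward values below are made up for illustration and assume positive token-level rewards.

import torch

r = torch.tensor([1.0, 1.1, 0.9, 8.0])   # one outlier importance-weighted reward
print(r.mean())                           # arithmetic mean: 2.75, dragged by the outlier
print(torch.exp(torch.log(r).mean()))     # geometric mean: ~1.68, far more stable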
[69] When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification
Hanna Shcharbakova, Tatiana Anikina, Natalia Skachkova, Josef van Genabith
Main category: cs.CL
TL;DR: Smaller language models like XLM-R outperform larger LLMs in multilingual fact verification, achieving 57.7% macro-F1, a 15.8% improvement over previous benchmarks.
Details
Motivation: Addressing the need for robust automated fact verification systems to combat multilingual misinformation with nuanced classification.
Method: Comprehensive evaluation of five state-of-the-art language models (XLM-R, mT5, Llama 3.1, Qwen 2.5, Mistral Nemo) on the X-Fact dataset (25 languages, 7 veracity categories) using prompting and fine-tuning.
Result: XLM-R (270M parameters) outperformed larger LLMs (7-12B parameters) with 57.7% macro-F1 vs. 16.9%, revealing LLM biases and evidence-leveraging issues.
Conclusion: Specialized smaller models may be more effective than general-purpose LLMs for fine-grained multilingual fact verification, impacting practical deployment.
Abstract: The rapid spread of multilingual misinformation requires robust automated fact verification systems capable of handling fine-grained veracity assessments across diverse languages. While large language models have shown remarkable capabilities across many NLP tasks, their effectiveness for multilingual claim verification with nuanced classification schemes remains understudied. We conduct a comprehensive evaluation of five state-of-the-art language models on the X-Fact dataset, which spans 25 languages with seven distinct veracity categories. Our experiments compare small language models (encoder-based XLM-R and mT5) with recent decoder-only LLMs (Llama 3.1, Qwen 2.5, Mistral Nemo) using both prompting and fine-tuning approaches. Surprisingly, we find that XLM-R (270M parameters) substantially outperforms all tested LLMs (7-12B parameters), achieving 57.7% macro-F1 compared to the best LLM performance of 16.9%. This represents a 15.8% improvement over the previous state-of-the-art (41.9%), establishing new performance benchmarks for multilingual fact verification. Our analysis reveals problematic patterns in LLM behavior, including systematic difficulties in leveraging evidence and pronounced biases toward frequent categories in imbalanced data settings. These findings suggest that for fine-grained multilingual fact verification, smaller specialized models may be more effective than general-purpose large models, with important implications for practical deployment of fact-checking systems.
[70] Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models
Gabriel Downer, Sean Craven, Damian Ruck, Jake Thomas
Main category: cs.CL
TL;DR: Text2VLM is a pipeline converting text-only datasets into multimodal formats to evaluate VLMs’ resilience against typographic prompt injection attacks, revealing vulnerabilities in current models.
Details
Motivation: Existing datasets focus on text-only prompts, leaving visual vulnerabilities in VLMs under-evaluated.
Method: Text2VLM adapts text datasets into multimodal formats, converting harmful text into typographic images for VLM evaluation.
Result: Open-source VLMs show increased susceptibility to prompt injection with visual inputs, with a performance gap compared to closed-source models.
Conclusion: Text2VLM enhances multimodal safety evaluation, aiding robust VLM deployment in real-world applications.
Abstract: The increasing integration of Visual Language Models (VLMs) into AI systems necessitates robust model alignment, especially when handling multimodal content that combines text and images. Existing evaluation datasets lean heavily towards text-only prompts, leaving visual vulnerabilities under-evaluated. To address this gap, we propose Text2VLM, a novel multi-stage pipeline that adapts text-only datasets into multimodal formats, specifically designed to evaluate the resilience of VLMs against typographic prompt injection attacks. The Text2VLM pipeline identifies harmful content in the original text and converts it into a typographic image, creating a multimodal prompt for VLMs. Our evaluation of open-source VLMs highlights their increased susceptibility to prompt injection when visual inputs are introduced, revealing critical weaknesses in current models’ alignment, in addition to a significant performance gap compared to closed-source frontier models. We validate Text2VLM through human evaluations, confirming that extracted salient concepts, text summarization, and output classification align with human expectations. Text2VLM provides a scalable tool for comprehensive safety assessment, contributing to the development of more robust safety mechanisms for VLMs. By enhancing the evaluation of multimodal vulnerabilities, Text2VLM plays a role in advancing the safe deployment of VLMs in diverse, real-world applications.
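The typographic conversion at the heart of the pipeline can be pictured with a few lines of PIL: text is rendered onto a blank canvas and paired with a benign instruction. Canvas size and the default font are assumptions; the pipeline's own rendering details are not specified in the summary.

from PIL import Image, ImageDraw

def text_to_typographic_image(text: str, size=(512, 256)) -> Image.Image:
    img = Image.new("RGB", size, "white")
    ImageDraw.Draw(img).text((10, 10), text, fill="black")  # default bitmap font
    return img

text_to_typographic_image("placeholder extracted text").save("typographic_prompt.png")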
[71] Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study
Yiran Huang, Lukas Thede, Massimiliano Mancini, Wenjia Xu, Zeynep Akata
Main category: cs.CL
TL;DR: The paper proposes compressing Multimodal Large Language Models (MLLMs) via structural pruning and recovery training, showing widthwise pruning and minimal data suffice for effective performance retention.
Details
Motivation: Current parameter reduction techniques for MLLMs are inflexible and computationally intensive, limiting practical deployment.
Method: Structural pruning (layerwise and widthwise) combined with supervised finetuning and knowledge distillation, tested on LLaVA-v1.5-7B and Bunny-v1.0-3B.
Result: Widthwise pruning performs better in low-resource scenarios; recovery training with 5% data retains 95% performance.
Conclusion: The study provides practical insights for compressing MLLMs efficiently with minimal resources or data.
Abstract: While Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose significant barriers to practical deployment. Current parameter reduction techniques primarily involve training MLLMs from Small Language Models (SLMs), but these methods offer limited flexibility and remain computationally intensive. To address this gap, we propose to directly compress existing MLLMs through structural pruning combined with efficient recovery training. Specifically, we investigate two structural pruning paradigms, layerwise and widthwise pruning, applied to the language model backbone of MLLMs, alongside supervised finetuning and knowledge distillation. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise pruning generally maintains better performance in low-resource scenarios with limited computational resources or insufficient finetuning data. As for the recovery training, finetuning only the multimodal projector is sufficient at small compression levels (< 20%). Furthermore, a combination of supervised finetuning and hidden-state distillation yields optimal recovery across various pruning levels. Notably, effective recovery can be achieved with as little as 5% of the original training data, while retaining over 95% of the original performance. Through this empirical study on two representative MLLMs, i.e., LLaVA-v1.5-7B and Bunny-v1.0-3B, we offer actionable insights for practitioners aiming to compress MLLMs effectively without extensive computational resources or abundant data.
[72] Multilingual Self-Taught Faithfulness Evaluators
Carlo Alfano, Aymen Al Marjani, Zeno Jonke, Amin Mantrach, Saab Mansour, Marcello Federico
Main category: cs.CL
TL;DR: A framework for multilingual faithfulness evaluation of LLMs using synthetic data and cross-lingual transfer learning, outperforming existing methods.
Details
Motivation: Address the lack of multilingual faithfulness evaluators for LLMs without relying on expensive human-labeled data.
Method: Uses synthetic multilingual summarization data and cross-lingual transfer learning, comparing language-specific and mixed-language fine-tuning.
Result: Shows consistent relationship between LLM language capabilities and evaluation performance, outperforming baselines.
Conclusion: The framework provides an effective solution for multilingual faithfulness evaluation without extensive labeled data.
Abstract: The growing use of large language models (LLMs) has increased the need for automatic evaluation systems, particularly to address the challenge of information hallucination. Although existing faithfulness evaluation approaches have shown promise, they are predominantly English-focused and often require expensive human-labeled training data for fine-tuning specialized models. As LLMs see increased adoption in multilingual contexts, there is a need for accurate faithfulness evaluators that can operate across languages without extensive labeled data. This paper presents Self-Taught Evaluators for Multilingual Faithfulness, a framework that learns exclusively from synthetic multilingual summarization data while leveraging cross-lingual transfer learning. Through experiments comparing language-specific and mixed-language fine-tuning approaches, we demonstrate a consistent relationship between an LLM’s general language capabilities and its performance in language-specific evaluation tasks. Our framework shows improvements over existing baselines, including state-of-the-art English evaluators and machine translation-based approaches.
[73] On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey
Meishan Zhang, Xin Zhang, Xinping Zhao, Shouzheng Huang, Baotian Hu, Min Zhang
Main category: cs.CL
TL;DR: A survey on general-purpose text embeddings (GPTE) in the era of pretrained language models (PLMs), covering their architecture, roles of PLMs, advanced applications, and future research directions.
Details
Motivation: To provide a comprehensive overview of GPTE, highlighting the roles of PLMs in their development and exploring future potential.
Method: Examines fundamental architecture, basic and advanced roles of PLMs, and outlines future research directions.
Result: Identifies key roles of PLMs in GPTE, such as embedding extraction and multilingual support, and suggests future improvements like bias mitigation.
Conclusion: The survey serves as a reference for understanding GPTE’s current state and future potential, emphasizing PLMs’ impact.
Abstract: Text embeddings have attracted growing interest due to their effectiveness across a wide range of natural language processing (NLP) tasks, such as retrieval, classification, clustering, bitext mining, and summarization. With the emergence of pretrained language models (PLMs), general-purpose text embeddings (GPTE) have gained significant traction for their ability to produce rich, transferable representations. The general architecture of GPTE typically leverages PLMs to derive dense text representations, which are then optimized through contrastive learning on large-scale pairwise datasets. In this survey, we provide a comprehensive overview of GPTE in the era of PLMs, focusing on the roles PLMs play in driving its development. We first examine the fundamental architecture and describe the basic roles of PLMs in GPTE, i.e., embedding extraction, expressivity enhancement, training strategies, learning objectives, and data construction. Then, we describe advanced roles enabled by PLMs, such as multilingual support, multimodal integration, code understanding, and scenario-specific adaptation. Finally, we highlight potential future research directions that move beyond traditional improvement goals, including ranking integration, safety considerations, bias mitigation, structural information incorporation, and the cognitive extension of embeddings. This survey aims to serve as a valuable reference for both newcomers and established researchers seeking to understand the current state and future potential of GPTE.
[74] Automating Thematic Review of Prevention of Future Deaths Reports: Replicating the ONS Child Suicide Study using Large Language Models
Sam Osian, Arpan Dutta, Sahil Bhandari, Iain E. Buchan, Dan W. Joyce
Main category: cs.CL
TL;DR: An automated language-model pipeline (PFD Toolkit) was developed to analyze coroners’ PFD reports, identifying child-suicide cases more efficiently and reliably than manual methods.
Details
Motivation: To address the inefficiency and limitations of manual curation and coding of PFD reports, which are critical for identifying systemic hazards.
Method: A fully automated, open-source “text-to-table” language-model pipeline processed 4,249 PFD reports, identifying child-suicide cases and coding them for themes.
Result: The PFD Toolkit identified 72 child-suicide cases (double the manual count) with high reliability (Cohen’s κ = 0.82) and reduced processing time from months to minutes.
Conclusion: Automated LLM analysis can efficiently and reliably replicate manual thematic reviews, offering scalable and timely insights for public health.
Abstract: Prevention of Future Deaths (PFD) reports, issued by coroners in England and Wales, flag systemic hazards that may lead to further loss of life. Analysis of these reports has previously been constrained by the manual effort required to identify and code relevant cases. In 2025, the Office for National Statistics (ONS) published a national thematic review of child-suicide PFD reports (≤ 18 years), identifying 37 cases from January 2015 to November 2023 - a process based entirely on manual curation and coding. We evaluated whether a fully automated, open source “text-to-table” language-model pipeline (PFD Toolkit) could reproduce the ONS’s identification and thematic analysis of child-suicide PFD reports, and assessed gains in efficiency and reliability. All 4,249 PFD reports published from July 2013 to November 2023 were processed via PFD Toolkit’s large language model pipelines. Automated screening identified cases where the coroner attributed death to suicide in individuals aged 18 or younger, and eligible reports were coded for recipient category and 23 concern sub-themes, replicating the ONS coding frame. PFD Toolkit identified 72 child-suicide PFD reports - almost twice the ONS count. Three blinded clinicians adjudicated a stratified sample of 144 reports to validate the child-suicide screening. Against the post-consensus clinical annotations, the LLM-based workflow showed substantial to almost-perfect agreement (Cohen’s κ = 0.82, 95% CI: 0.66-0.98, raw agreement = 91%). The end-to-end script runtime was 8m 16s, transforming a process that previously took months into one that can be completed in minutes. This demonstrates that automated LLM analysis can reliably and efficiently replicate manual thematic reviews of coronial data, enabling scalable, reproducible, and timely insights for public health and safety. The PFD Toolkit is openly available for future research.
[75] Latent Inter-User Difference Modeling for LLM Personalization
Yilun Qiu, Tianhao Shi, Xiaoyan Zhao, Fengbin Zhu, Yang Zhang, Fuli Feng
Main category: cs.CL
TL;DR: DEP is a framework for personalized LLM outputs by modeling inter-user differences in latent space, outperforming baselines in review generation.
Details
Motivation: Current methods overlook inter-user differences and rely on ineffective language prompts for personalization.
Method: DEP uses latent space embeddings to contrast user behavior with peers, filters features with a sparse autoencoder, and injects them into a frozen LLM.
Result: DEP outperforms baseline methods in personalized review generation across multiple metrics.
Conclusion: DEP effectively models inter-user differences for better personalization in LLMs, with demonstrated success in review generation.
Abstract: Large language models (LLMs) are increasingly integrated into users’ daily lives, leading to a growing demand for personalized outputs. Previous work focuses on leveraging a user’s own history, overlooking inter-user differences that are crucial for effective personalization. While recent work has attempted to model such differences, the reliance on language-based prompts often hampers the effective extraction of meaningful distinctions. To address these issues, we propose Difference-aware Embedding-based Personalization (DEP), a framework that models inter-user differences in the latent space instead of relying on language prompts. DEP constructs soft prompts by contrasting a user’s embedding with those of peers who engaged with similar content, highlighting relative behavioral signals. A sparse autoencoder then filters and compresses both user-specific and difference-aware embeddings, preserving only task-relevant features before injecting them into a frozen LLM. Experiments on personalized review generation show that DEP consistently outperforms baseline methods across multiple metrics. Our code is available at https://github.com/SnowCharmQ/DEP.
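A minimal sketch of the difference-aware step, assuming peer embeddings are already retrieved: the relative signal is the user's deviation from the peer mean, which DEP then passes through a sparse autoencoder (stubbed out here) before injection.

import torch

def difference_embedding(user_emb, peer_embs):
    # Relative behavioral signal: how this user deviates from similar peers.
    return user_emb - peer_embs.mean(dim=0)

user = torch.randn(768)          # hypothetical user embedding
peers = torch.randn(16, 768)     # embeddings of peers with similar histories
diff = difference_embedding(user, peers)  # -> sparse autoencoder -> soft prompt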
[76] A survey of diversity quantification in natural language processing: The why, what, where and how
Louis Estève, Marie-Catherine de Marneffe, Nurit Melnik, Agata Savary, Olha Kanishcheva
Main category: cs.CL
TL;DR: The paper surveys diversity in NLP, proposing a unified taxonomy and framework inspired by ecology and economy to standardize its measurement.
Details
Motivation: To address the ad hoc and inconsistent treatment of diversity in NLP and link it to better-theorized domains.
Method: Surveyed ACL Anthology papers from the past 6 years with “diversity” or “diverse” in their title, analyzing and categorizing diversity measures.
Result: Identified inconsistent terminology and specialized settings, proposing a unified taxonomy and framework (variety, balance, disparity) for measuring diversity.
Conclusion: The study aims to improve formalization, understanding, and comparability of diversity in NLP.
Abstract: The concept of diversity has received increased consideration in Natural Language Processing (NLP) in recent years. This is due to various motivations, such as promoting inclusion, approximating human linguistic behavior, and increasing systems’ performance. Diversity has, however, often been addressed in an ad hoc manner in NLP, with few explicit links to other domains where this notion is better theorized. We survey articles in the ACL Anthology from the past 6 years with “diversity” or “diverse” in their title. We find a wide range of settings in which diversity is quantified, often highly specialized and using inconsistent terminology. We put forward a unified taxonomy of why, on what, where, and how diversity is measured in NLP. Diversity measures are cast upon a unified framework from ecology and economy (Stirling, 2007) with three dimensions of diversity: variety, balance, and disparity. We discuss the trends which emerge due to this systematized approach. We believe that this study paves the way towards a better formalization of diversity in NLP, which should bring a better understanding of this notion and a better comparability between various approaches.
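The three Stirling dimensions can be made concrete on a toy category distribution; the pairwise disparity matrix below is a placeholder for whatever domain-specific distance a study adopts.

import numpy as np

p = np.array([0.5, 0.3, 0.2])        # proportions of three categories
d = np.array([[0.0, 0.4, 0.9],       # pairwise disparity between categories
              [0.4, 0.0, 0.6],
              [0.9, 0.6, 0.0]])

variety = len(p)                                    # how many categories
balance = -(p * np.log(p)).sum() / np.log(len(p))   # normalized Shannon evenness
disparity = (np.outer(p, p) * d).sum()              # abundance-weighted distance
print(variety, round(balance, 3), round(disparity, 3))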
[77] Leveraging Open-Source Large Language Models for Clinical Information Extraction in Resource-Constrained Settings
Luc Builtjes, Joeran Bosma, Mathias Prokop, Bram van Ginneken, Alessa Hering
Main category: cs.CL
TL;DR: Open-source LLMs like Phi-4-14B and Llama-3.3-70B perform well on Dutch clinical info extraction, with native-language processing being key. A framework, llm_extractinator, enables scalable, privacy-conscious solutions.
Details
Motivation: Proprietary LLMs lack transparency and raise privacy concerns, limiting their use in healthcare. Open-source alternatives are needed for clinical NLP.
Method: Evaluated nine open-source LLMs on the DRAGON benchmark (28 Dutch clinical tasks) using the llm_extractinator framework in a zero-shot setting.
Result: 14B-parameter models (Phi-4-14B, Qwen-2.5-14B, DeepSeek-R1-14B) were competitive; Llama-3.3-70B performed slightly better but was costlier. English translation hurt performance.
Conclusion: Open-source LLMs with native-language processing offer effective, scalable, and privacy-friendly solutions for clinical info extraction in low-resource settings.
Abstract: Medical reports contain rich clinical information but are often unstructured and written in domain-specific language, posing challenges for information extraction. While proprietary large language models (LLMs) have shown promise in clinical natural language processing, their lack of transparency and data privacy concerns limit their utility in healthcare. This study therefore evaluates nine open-source generative LLMs on the DRAGON benchmark, which includes 28 clinical information extraction tasks in Dutch. We developed llm_extractinator, a publicly available framework for information extraction using open-source generative LLMs, and used it to assess model performance in a zero-shot setting. Several 14-billion-parameter models, Phi-4-14B, Qwen-2.5-14B, and DeepSeek-R1-14B, achieved competitive results, while the larger Llama-3.3-70B model achieved slightly higher performance at greater computational cost. Translation to English prior to inference consistently degraded performance, highlighting the need for native-language processing. These findings demonstrate that open-source LLMs, when used with our framework, offer effective, scalable, and privacy-conscious solutions for clinical information extraction in low-resource settings.
[78] Soft Injection of Task Embeddings Outperforms Prompt-Based In-Context Learning
Jungwon Park, Wonjong Rhee
Main category: cs.CL
TL;DR: Soft injection of task embeddings outperforms prompt-based ICL across 57 tasks and 12 LLMs while reducing memory and compute costs at inference time.
Details
Motivation: To improve the efficiency and effectiveness of task conditioning in LLMs beyond multi-example prompting.Method: Constructs task embeddings once from few-shot ICL prompts and softly mixes them with attention head activations using pre-optimized parameters.
Result: Outperforms 10-shot ICL by 10.1%-13.9% across 12 LLMs, with reduced memory and compute costs.
Conclusion: Soft Injection shifts task conditioning from prompts to activations, offering a new paradigm for efficient and effective task performance.
Abstract: In-Context Learning (ICL) enables Large Language Models (LLMs) to perform tasks by conditioning on input-output examples in the prompt, without requiring any update in model parameters. While widely adopted, it remains unclear whether prompting with multiple examples is the most effective and efficient way to convey task information. In this work, we propose Soft Injection of task embeddings. The task embeddings are constructed only once using few-shot ICL prompts and repeatedly used during inference. Soft injection is performed by softly mixing task embeddings with attention head activations using pre-optimized mixing parameters, referred to as soft head-selection parameters. This method not only allows a desired task to be performed without in-prompt demonstrations but also significantly outperforms existing ICL approaches while reducing memory usage and compute cost at inference time. An extensive evaluation is performed across 57 tasks and 12 LLMs, spanning four model families of sizes from 4B to 70B. Averaged across 57 tasks, our method outperforms 10-shot ICL by 10.1%-13.9% across 12 LLMs. Additional analyses show that our method also serves as an insightful tool for analyzing task-relevant roles of attention heads, revealing that task-relevant head positions selected by our method transfer across similar tasks but not across dissimilar ones – underscoring the task-specific nature of head functionality. Our soft injection method opens a new paradigm for reducing prompt length and improving task performance by shifting task conditioning from the prompt space to the activation space.
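A minimal sketch of the mixing step, assuming per-head task embeddings have already been extracted from few-shot prompts; the shapes and the sigmoid gating here are illustrative, not the paper's exact parameterization:

```python
import torch

def soft_inject(head_acts, task_emb, head_selection):
    """Softly mix cached task embeddings into attention head activations.
    head_acts:      (batch, n_heads, head_dim) activations at one layer
    task_emb:       (n_heads, head_dim), built once from few-shot ICL prompts
    head_selection: (n_heads,) pre-optimized soft head-selection parameters
    """
    alpha = torch.sigmoid(head_selection).view(1, -1, 1)   # per-head mixing weight
    return (1 - alpha) * head_acts + alpha * task_emb.unsqueeze(0)

acts = torch.randn(2, 8, 64)                 # toy activations for 8 heads
task = torch.randn(8, 64)                    # cached task embeddings
sel = torch.zeros(8, requires_grad=True)     # optimized once, reused at inference
print(soft_inject(acts, task, sel).shape)    # torch.Size([2, 8, 64])
```

Because the task embeddings are computed once and reused, inference prompts need no in-context demonstrations, which is where the memory and compute savings come from.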
[79] MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation
Adrien Bazoge
Main category: cs.CL
TL;DR: MediQAl is a French medical QA dataset with 32,603 questions across 41 subjects, featuring three task types and cognitive labels. It evaluates 14 language models, revealing a gap between factual recall and reasoning performance.
Details
Motivation: To address the lack of multilingual medical QA resources and evaluate language models' capabilities in factual recall and reasoning in French clinical scenarios.Method: The dataset includes 32,603 questions from French medical exams, categorized into three tasks (multiple-choice unique/multiple answers, open-ended short-answer) and labeled as Understanding or Reasoning. Evaluated 14 language models.
Result: A significant performance gap was observed between factual recall and reasoning tasks, highlighting model limitations in reasoning.
Conclusion: MediQAl provides a benchmark for French medical QA, filling a gap in multilingual resources and revealing model performance disparities.
Abstract: This work introduces MediQAl, a French medical question answering dataset designed to evaluate the capabilities of language models in factual medical recall and reasoning over real-world clinical scenarios. MediQAl contains 32,603 questions sourced from French medical examinations across 41 medical subjects. The dataset includes three tasks: (i) Multiple-Choice Question with Unique answer, (ii) Multiple-Choice Question with Multiple answer, and (iii) Open-Ended Question with Short-Answer. Each question is labeled as Understanding or Reasoning, enabling a detailed analysis of models’ cognitive capabilities. We validate the MediQAl dataset through extensive evaluation with 14 large language models, including recent reasoning-augmented models, and observe a significant performance gap between factual recall and reasoning tasks. Our evaluation provides a comprehensive benchmark for assessing language models’ performance on French medical question answering, addressing a crucial gap in multilingual resources for the medical domain.
[80] FHSTP@EXIST 2025 Benchmark: Sexism Detection with Transparent Speech Concept Bottleneck Models
Roberto Labadie-Tamayo, Adrian Jaques Böck, Djordje Slijepčević, Xihui Chen, Andreas Babic, Matthias Zeppelzauer
Main category: cs.CL
TL;DR: The paper presents solutions for identifying and classifying sexism in social media posts, using three models (SCBM, SCBMT, and XLM-RoBERTa) for subtasks in the EXIST challenge. Results show competitive performance, with interpretability and metadata integration explored.
Details
Motivation: To address the widespread issue of sexism in online conversations by developing interpretable and effective models for sexism identification and classification.Method: Three models are implemented: SCBM (using human-interpretable adjectives), SCBMT (combining adjectives with transformer embeddings), and a fine-tuned XLM-RoBERTa. Metadata like annotators’ demographics is also explored.
Result: In the Soft-Soft evaluation, XLM-RoBERTa ranks 6th for English and Spanish and 4th for English; SCBMT ranks 7th for English and Spanish and 6th for Spanish.
Conclusion: The proposed models offer competitive performance and interpretability, with potential for leveraging metadata to improve sexism detection in social media.
Abstract: Sexism has become widespread on social media and in online conversation. To help address this issue, the fifth Sexism Identification in Social Networks (EXIST) challenge was initiated at CLEF 2025. Among this year’s international benchmarks, we concentrate on solving the first task, aiming to identify and classify sexism in social media textual posts. In this paper, we describe our solutions and report results for three subtasks: Subtask 1.1 - Sexism Identification in Tweets, Subtask 1.2 - Source Intention in Tweets, and Subtask 1.3 - Sexism Categorization in Tweets. We implement three models to address each subtask, which constitute three individual runs: Speech Concept Bottleneck Model (SCBM), Speech Concept Bottleneck Model with Transformer (SCBMT), and a fine-tuned XLM-RoBERTa transformer model. SCBM uses descriptive adjectives as human-interpretable bottleneck concepts: it leverages large language models (LLMs) to encode input texts into a human-interpretable representation over adjectives, which is then used to train a lightweight classifier for downstream tasks. SCBMT extends SCBM by fusing the adjective-based representation with contextual embeddings from transformers to balance interpretability and classification performance. Beyond competitive results, these two models offer fine-grained explanations at both instance (local) and class (global) levels. We also investigate how additional metadata, e.g., annotators’ demographic profiles, can be leveraged. For Subtask 1.1, XLM-RoBERTa, fine-tuned on provided data augmented with prior datasets, ranks 6th for English and Spanish and 4th for English in the Soft-Soft evaluation. Our SCBMT achieves 7th for English and Spanish and 6th for Spanish.
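A toy sketch of the adjective-bottleneck idea: a scoring function (standing in for the LLM prompt that rates how well each adjective describes a post) produces an interpretable feature vector, on which a lightweight classifier is trained. The adjective list and `dummy_score` below are invented placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical adjective inventory; the paper curates its own list.
ADJECTIVES = ["hostile", "demeaning", "objectifying", "ironic", "supportive"]

def adjective_profile(text, score_fn):
    """Encode a post as LLM-judged relevance scores over interpretable adjectives."""
    return np.array([score_fn(text, adj) for adj in ADJECTIVES])

# Stand-in for the LLM scoring call (the real system prompts an LLM per adjective).
def dummy_score(text, adjective):
    return float(adjective in text.lower())

posts = ["such a hostile, demeaning reply", "what a supportive comment"]
labels = [1, 0]  # 1 = sexist, 0 = not sexist
X = np.stack([adjective_profile(p, dummy_score) for p in posts])
clf = LogisticRegression().fit(X, labels)
# clf.coef_ yields per-adjective weights: global, human-readable explanations.
```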
[81] FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models
Likun Tan, Kuan-Wei Huang, Kevin Wu
Main category: cs.CL
TL;DR: The paper introduces a method to detect and edit factual inaccuracies in large language model responses, focusing on finance, using synthetic datasets and fine-tuned models like Phi-4 and Qwen3.
Details
Motivation: Addressing hallucinations in large language models is critical for factual reliability, especially in high-stakes domains like finance.Method: Constructs a synthetic dataset with tagged errors, fine-tunes four models (Phi-4, Phi-4-mini, Qwen3-4B, Qwen3-14B) for detection and editing.
Result: Fine-tuned Phi-4 improves binary F1 by 8% and overall detection by 30% over OpenAI-o3; Phi-4-mini remains competitive with minimal performance drop.
Conclusion: The approach offers a practical solution for factual inconsistency detection and editing in finance, with a generalizable framework for broader applications.
Abstract: Hallucinations in large language models pose a critical challenge for applications requiring factual reliability, particularly in high-stakes domains such as finance. This work presents an effective approach for detecting and editing factually incorrect content in model-generated responses based on the provided context. Given a user-defined domain-specific error taxonomy, we construct a synthetic dataset by inserting tagged errors into financial question-answering corpora and then fine-tune four language models, Phi-4, Phi-4-mini, Qwen3-4B, and Qwen3-14B, to detect and edit these factual inaccuracies. Our best-performing model, fine-tuned Phi-4, achieves an 8% improvement in binary F1 score and a 30% gain in overall detection performance compared to OpenAI-o3. Notably, our fine-tuned Phi-4-mini model, despite having only 4 billion parameters, maintains competitive performance with just a 2% drop in binary detection and a 0.1% decline in overall detection compared to OpenAI-o3. Our work provides a practical solution for detecting and editing factual inconsistencies in financial text generation while introducing a generalizable framework that can enhance the trustworthiness and alignment of large language models across diverse applications beyond finance. Our code and data are available at https://github.com/pegasi-ai/fine-grained-editting.
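A minimal sketch of the synthetic-corruption step under an invented two-entry taxonomy (the paper's taxonomy is user-defined and finance-specific, and its tagging scheme may differ):

```python
# Invented mini-taxonomy mapping error types to (correct, corrupted) spans.
CORRUPTIONS = {
    "numeric":  ("12%", "21%"),
    "temporal": ("Q3 2023", "Q1 2023"),
}

def make_training_pair(answer, error_type):
    """Return (corrupted_answer, edit_target): the model learns to detect the
    tagged error span and edit it back to the context-grounded value."""
    good, bad = CORRUPTIONS[error_type]
    corrupted = answer.replace(good, bad, 1)
    target = answer.replace(good, f"<{error_type}>{bad} -> {good}</{error_type}>", 1)
    return corrupted, target

answer = "Revenue grew 12% in Q3 2023."
print(make_training_pair(answer, "numeric"))
# ('Revenue grew 21% in Q3 2023.', 'Revenue grew <numeric>21% -> 12%</numeric> in Q3 2023.')
```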
[82] Mind the Gap: Conformative Decoding to Improve Output Diversity of Instruction-Tuned Large Language Models
Max Peeperkorn, Tom Kouwenhoven, Dan Brown, Anna Jordanous
Main category: cs.CL
TL;DR: Instruction-tuning reduces output diversity in LLMs, especially for creative tasks. A new decoding strategy, conformative decoding, helps reintroduce diversity while maintaining quality.
Details
Motivation: To investigate the 'diversity gap' caused by instruction-tuning in LLMs and propose a solution to mitigate it.Method: Analyze diversity metrics for various LLMs, study fine-tuning stages (e.g., OLMo models), and introduce conformative decoding.
Result: Instruction-tuning significantly reduces diversity, with DPO having the most impact. Conformative decoding increases diversity without compromising quality.
Conclusion: Conformative decoding effectively addresses the diversity loss in instruction-tuned LLMs, offering a practical solution for creative tasks.
Abstract: Instruction-tuning large language models (LLMs) reduces the diversity of their outputs, which has implications for many tasks, particularly for creative tasks. This paper investigates the “diversity gap” for a writing prompt narrative generation task. This gap emerges as measured by current diversity metrics for various open-weight and open-source LLMs. The results show significant decreases in diversity due to instruction-tuning. We explore the diversity loss at each fine-tuning stage for the OLMo and OLMo 2 models to further understand how output diversity is affected. The results indicate that DPO has the most substantial impact on diversity. Motivated by these findings, we present a new decoding strategy, conformative decoding, which guides an instruct model using its more diverse base model to reintroduce output diversity. We show that conformative decoding typically increases diversity and even maintains or improves quality.
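The abstract does not spell out the decoding rule; one plausible sketch is log-linear interpolation between the instruct model's and base model's next-token distributions, with the exact form in the paper possibly differing:

```python
import torch
import torch.nn.functional as F

def conformative_step(instruct_logits, base_logits, beta=0.5):
    """One decoding step pulled toward the more diverse base model.
    beta=0 recovers pure instruct-model sampling; beta=1 pure base sampling."""
    mixed = (1 - beta) * F.log_softmax(instruct_logits, dim=-1) \
          + beta * F.log_softmax(base_logits, dim=-1)
    return torch.multinomial(mixed.softmax(dim=-1), num_samples=1)

vocab = 50_000
next_token = conformative_step(torch.randn(1, vocab), torch.randn(1, vocab))
```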
[83] Memorization in Fine-Tuned Large Language Models
Danil Savine, Muni Sreenivas Pydi, Jamal Atif, Olivier Cappé
Main category: cs.CL
TL;DR: The study explores memorization in fine-tuned LLMs, focusing on medical data. It uses membership inference and generation tasks to analyze memorization, revealing how the choice of adapted weight matrices, model perplexity, and LoRA rank shape it.
Details
Motivation: To understand how fine-tuning affects memorization in LLMs, especially in privacy-sensitive domains like medicine, to balance performance and privacy.Method: Uses membership inference attacks and generation tasks with the PHEE dataset, analyzing weight matrices, perplexity, and LoRA rank effects.
Result: Value and Output matrices contribute more to memorization; lower perplexity increases memorization; higher LoRA ranks boost memorization with diminishing returns.
Conclusion: The study highlights trade-offs between performance and privacy in fine-tuned LLMs, guiding responsible adaptation strategies.
Abstract: This study investigates the mechanisms and factors influencing memorization in fine-tuned large language models (LLMs), with a focus on the medical domain due to its privacy-sensitive nature. We examine how different aspects of the fine-tuning process affect a model’s propensity to memorize training data, using the PHEE dataset of pharmacovigilance events. Our research employs two main approaches: a membership inference attack to detect memorized data, and a generation task with prompted prefixes to assess verbatim reproduction. We analyze the impact of adapting different weight matrices in the transformer architecture, the relationship between perplexity and memorization, and the effect of increasing the rank in low-rank adaptation (LoRA) fine-tuning. Key findings include: (1) Value and Output matrices contribute more significantly to memorization compared to Query and Key matrices; (2) Lower perplexity in the fine-tuned model correlates with increased memorization; (3) Higher LoRA ranks lead to increased memorization, but with diminishing returns at higher ranks. These results provide insights into the trade-offs between model performance and privacy risks in fine-tuned LLMs. Our findings have implications for developing more effective and responsible strategies for adapting large language models while managing data privacy concerns.
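A simplified loss-threshold sketch of the membership-inference idea, assuming HuggingFace-style causal LMs; the study's actual attack may be more elaborate:

```python
import torch

@torch.no_grad()
def mean_nll(model, input_ids):
    """Mean per-token negative log-likelihood of a candidate record."""
    return model(input_ids, labels=input_ids).loss.item()

def is_member(nll_finetuned, nll_base, margin=0.5):
    """Flag a record as a likely training member if fine-tuning lowered its
    NLL far more than a reference margin, i.e. the model has memorized it.
    The margin is a hypothetical calibration parameter."""
    return (nll_base - nll_finetuned) > margin
```

This also illustrates the paper's second finding in miniature: the lower the fine-tuned model's perplexity on a record, the stronger the memorization signal.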
[84] Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation
Jiaju Chen, Yuxuan Lu, Xiaojie Wang, Huimin Zeng, Jing Huang, Jiri Gesi, Ying Xu, Bingsheng Yao, Dakuo Wang
Main category: cs.CL
TL;DR: MAJ-EVAL is a framework using LLM agents to simulate human evaluators by creating diverse personas from documents, enabling multi-dimensional feedback through group debates.
Details
Motivation: Real human evaluators are scarce and costly, and existing LLM-as-a-judge methods lack well-designed personas and generalizability.Method: MAJ-EVAL automatically constructs evaluator personas from documents, instantiates LLM agents, and uses group debates for multi-dimensional feedback.
Result: MAJ-EVAL aligns better with human expert ratings than conventional metrics and existing LLM-as-a-judge methods.
Conclusion: MAJ-EVAL offers a scalable and generalizable solution for multi-dimensional NLP evaluation.
Abstract: Nearly all human work is collaborative; thus, the evaluation of real-world NLP applications often requires multiple dimensions that align with diverse human perspectives. As real human evaluator resources are often scarce and costly, the emerging “LLM-as-a-judge” paradigm sheds light on a promising approach to leverage LLM agents to believably simulate human evaluators. Yet, to date, existing LLM-as-a-judge approaches face two limitations: persona descriptions of agents are often arbitrarily designed, and the frameworks are not generalizable to other tasks. To address these challenges, we propose MAJ-EVAL, a Multi-Agent-as-Judge evaluation framework that can automatically construct multiple evaluator personas with distinct dimensions from relevant text documents (e.g., research papers), instantiate LLM agents with the personas, and engage the agents in group debates to generate multi-dimensional feedback. Our evaluation experiments in both the educational and medical domains demonstrate that MAJ-EVAL can generate evaluation results that better align with human experts’ ratings compared with conventional automated evaluation metrics and existing LLM-as-a-judge methods.
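A minimal sketch of the persona-debate loop; `llm(prompt) -> str` is a hypothetical chat-completion callable, and the prompt template is invented rather than the paper's:

```python
from dataclasses import dataclass

@dataclass
class Persona:
    name: str
    dimension: str        # e.g. "pedagogical value", distilled from documents
    description: str

def group_debate(personas, artifact, llm, rounds=2):
    """Agents take turns reacting to the artifact and to each other."""
    transcript = []
    for _ in range(rounds):
        for p in personas:
            prompt = (
                f"You are {p.name}, an evaluator focused on {p.dimension}.\n"
                f"{p.description}\n\nArtifact to evaluate:\n{artifact}\n\n"
                "Debate so far:\n" + "\n".join(transcript) +
                "\n\nRespond to the others and give a 1-5 score for your dimension."
            )
            transcript.append(f"{p.name}: {llm(prompt)}")
    return transcript   # one score per persona -> multi-dimensional feedback
```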
[85] Cheap Learning: Maximising Performance of Language Models for Social Data Science Using Minimal Data
Leonardo Castro-Gonzalez, Yi-Ling Chung, Hannah Rose Kirk, John Francis, Angus R. Williams, Pica Johansson, Jonathan Bright
Main category: cs.CL
TL;DR: The paper reviews three cost-effective machine learning techniques (weak supervision, transfer learning, prompt engineering) for social sciences, demonstrating their effectiveness in six applications and highlighting the low-cost potential of prompting large language models.
Details
Motivation: To address the challenge of limited labeled data in social sciences by leveraging cheaper machine learning techniques.Method: Review and application of weak supervision, transfer learning, and prompt engineering (including zero-shot prompting) across six social science tasks.
Result: All techniques perform well, with prompt engineering showing high accuracy at very low cost.
Conclusion: The paper aims to encourage adoption of these techniques in social sciences, supported by a code repository for reproducibility.
Abstract: The field of machine learning has recently made significant progress in reducing the requirements for labelled training data when building new models. These ‘cheaper’ learning techniques hold significant potential for the social sciences, where development of large labelled training datasets is often a significant practical impediment to the use of machine learning for analytical tasks. In this article we review three ‘cheap’ techniques that have developed in recent years: weak supervision, transfer learning and prompt engineering. For the latter, we also review the particular case of zero-shot prompting of large language models. For each technique we provide a guide of how it works and demonstrate its application across six different realistic social science applications (two different tasks paired with three different dataset makeups). We show good performance for all techniques, and in particular we demonstrate how prompting of large language models can achieve high accuracy at very low cost. Our results are accompanied by a code repository to make it easy for others to duplicate our work and use it in their own research. Overall, our article is intended to stimulate further uptake of these techniques in the social sciences.
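A minimal zero-shot prompting sketch in the spirit of the third reviewed technique; the label set and prompt wording are invented, and `llm` is any completion callable (API client or local model):

```python
LABELS = ["hate speech", "not hate speech"]

def zero_shot_label(text, llm):
    """Label a post with any completion function `llm(prompt) -> str`."""
    prompt = (f"Classify the social media post as one of: {', '.join(LABELS)}.\n"
              f"Post: {text}\nLabel:")
    answer = llm(prompt).strip().lower()
    # Match longest label first so "not hate speech" isn't read as "hate speech".
    for label in sorted(LABELS, key=len, reverse=True):
        if label in answer:
            return label
    return None
```

The appeal for social science is exactly what the loop shows: no labelled training set is needed, only a label schema and a prompt.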
[86] Juru: Legal Brazilian Large Language Model from Reputable Sources
Roseval Malaquias Junior, Ramon Pires, Roseli Romero, Rodrigo Nogueira
Main category: cs.CL
TL;DR: Specializing Mistral-7B with Brazilian legal data improves legal task performance but reduces general knowledge accuracy.
Details
Motivation: High compute costs for pretraining large language models limit research; domain specialization and high-quality data are explored to reduce costs.Method: Specialized Mistral-7B with 1.9B legal tokens; evaluated on legal and general knowledge tests.
Result: Juru excels in legal benchmarks but shows forgetting in general knowledge tasks.
Conclusion: Domain specialization enhances performance in targeted areas but sacrifices generalizability, supporting cost-effective model exploration.
Abstract: The high compute cost associated with pretraining large language models limits research on them. Two strategies have emerged to address this issue: domain specialization and pretraining with high-quality data. To explore these strategies, we specialized the Mistral-7B model with 1.9 billion unique tokens from reputable Brazilian legal sources and conducted few-shot evaluations on legal and general knowledge test suites. Our model, Juru, demonstrates the benefits of domain specialization by achieving improved performance on legal benchmarks, even with a reduced amount of pretraining data. However, this domain specialization through continued pretraining comes at the cost of increased forgetting in unrelated domains, as evidenced by performance degradation on general knowledge test suites in both Portuguese and English. This study contributes to the growing body of scientific evidence showing that pretraining data selection may enhance the performance of large language models, enabling the exploration of these models at a lower cost. Juru is publicly available at https://huggingface.co/roseval/Juru-7B .
[87] Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training
Maximillian Chen, Ruoxi Sun, Tomas Pfister, Sercan Ö. Arık
Main category: cs.CL
TL;DR: The paper introduces Action-Based Contrastive Self-Training (ACT), a method to improve LLMs’ conversational skills, particularly disambiguation, using data-efficient dialogue policy learning.
Details
Motivation: LLMs often lack conversational skills like disambiguation, and high-quality training samples are limited, hindering optimal dialogue policy learning.Method: ACT, a quasi-online preference optimization algorithm based on DPO, is proposed for data-efficient dialogue policy learning in multi-turn conversations.
Result: ACT outperforms standard tuning methods like supervised fine-tuning and DPO in tasks like tabular QA, MRC, and AmbigSQL.
Conclusion: ACT effectively enhances LLMs’ conversational modeling, especially in recognizing and reasoning about ambiguity, even with limited data.
Abstract: Large language models (LLMs), optimized through human feedback, have rapidly emerged as a leading paradigm for developing intelligent conversational assistants. However, despite their strong performance across many benchmarks, LLM-based agents might still lack conversational skills such as disambiguation – when they are faced with ambiguity, they often overhedge or implicitly guess users’ true intents rather than asking clarification questions. Under task-specific settings, high-quality conversation samples are often limited, constituting a bottleneck for LLMs’ ability to learn optimal dialogue action policies. We propose Action-Based Contrastive Self-Training (ACT), a quasi-online preference optimization algorithm based on Direct Preference Optimization (DPO), that enables data-efficient dialogue policy learning in multi-turn conversation modeling. We demonstrate ACT’s efficacy in data-efficient tuning scenarios, even when there is no action label available, using multiple real-world conversational tasks: tabular-grounded question-answering, machine reading comprehension, and AmbigSQL, a novel task for disambiguating information-seeking requests for complex SQL generation towards data analysis agents. Additionally, we propose evaluating LLMs’ ability to function as conversational agents by examining whether they can implicitly recognize and reason about ambiguity in conversation. ACT demonstrates substantial conversation modeling improvements over standard tuning approaches like supervised fine-tuning and DPO.
[88] Language Models Resist Alignment: Evidence From Data Compression
Jiaming Ji, Kaile Wang, Tianyi Qiu, Boyuan Chen, Jiayi Zhou, Changye Li, Hantao Lou, Juntao Dai, Yunhuai Liu, Yaodong Yang
Main category: cs.CL
TL;DR: The paper explores the ’elasticity’ of post-alignment LLMs, showing they revert to pre-training behaviors upon further fine-tuning, undermining alignment efforts.
Details
Motivation: To investigate whether alignment fine-tuning has robust effects or is superficial, given anomalies in LLM behavior post-alignment.Method: Combines theoretical analysis using compression theory with empirical experiments on models of varying scales and types.
Result: Demonstrates elasticity: models revert to pre-training behaviors, with decline rates dropping post-reversion. Elasticity increases with model size and pre-training data.
Conclusion: Highlights the need to address LLM elasticity to improve alignment robustness.
Abstract: Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the elasticity of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we reveal that elasticity positively correlates with increased model size and the expansion of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment. The model weights and code are available at pku-lm-resist-alignment.github.io.
[89] DoubleDipper: Improving Long-Context LLMs via Context Recycling
Arie Cattan, Alon Jacovi, Alex Fabrikant, Jonathan Herzig, Roee Aharoni, Hannah Rashkin, Dror Marcus, Avinatan Hassidim, Yossi Matias, Idan Szpektor, Avi Caciularu
Main category: cs.CL
TL;DR: DoubleDipper improves LLM performance on long-context QA tasks by recycling contexts to generate few-shot examples, reducing token usage and enhancing attribution.
Details
Motivation: Address sub-optimal LLM performance on long-context tasks by efficiently leveraging the same context for demonstrations.Method: Generates few-shot examples from the input context, identifies relevant paragraphs for attribution, and introduces minimal token overhead.
Result: Achieves +16 absolute point improvement on QA datasets and generalizes to multi-hop tasks despite single-hop examples.
Conclusion: DoubleDipper effectively enhances LLM performance on long-context QA with minimal overhead and improved attribution.
Abstract: Despite recent advancements in Large Language Models (LLMs), their performance on tasks involving long contexts remains sub-optimal. In this work, we propose DoubleDipper, a novel In-Context-Learning method that automatically generates few-shot examples for long context QA tasks by recycling contexts. Specifically, given a long input context (1-3k tokens) and a query, we generate additional query-output pairs from the given context as few-shot examples, while introducing the context only once. This ensures that the demonstrations are leveraging the same context as the target query while only adding a small number of tokens to the prompt. We further enhance each demonstration by instructing the model to explicitly identify the relevant paragraphs before the answer, which improves performance while providing fine-grained attribution to the answer source. We apply our method on multiple LLMs and obtain substantial improvements (+16 absolute points on average across models) on various QA datasets with long context. Surprisingly, despite introducing only single-hop ICL examples, LLMs successfully generalize to multi-hop long-context QA using our approach.
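A sketch of the prompt layout implied by the abstract: the long context appears once, followed by self-generated QA demonstrations that each name their supporting paragraph. The exact template and field names are assumptions:

```python
def double_dipper_prompt(context, qa_demos, query):
    """Recycle one long context into its own few-shot demonstrations.
    qa_demos: (question, relevant_paragraph, answer) triples generated from
    `context` itself; since the context is included only once, the demos add
    few extra tokens while matching the target query's context exactly."""
    demos = "\n\n".join(
        f"Q: {q}\nRelevant paragraph: {p}\nA: {a}" for q, p, a in qa_demos
    )
    return f"{context}\n\n{demos}\n\nQ: {query}\nRelevant paragraph:"
```

Ending the prompt at "Relevant paragraph:" nudges the model to cite its evidence before answering, which is the attribution mechanism the abstract describes.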
[90] The Impact of LoRA Adapters on LLMs for Clinical Text Classification Under Computational and Data Constraints
Thanh-Dung Le, Ti Ti Nguyen, Vu Nguyen Ha, Symeon Chatzinotas, Philippe Jouvet, Rita Noumeir
Main category: cs.CL
TL;DR: The study evaluates adapter techniques for fine-tuning LLMs in clinical NLP under resource constraints, finding lightweight Transformers outperform adapter-augmented LLMs. GRN is the best adapter.
Details
Motivation: Address challenges of domain gap, limited data, and hardware constraints in clinical NLP.Method: Evaluated four adapter techniques (Adapter, Lightweight, TinyAttention, GRN) on biomedical LLMs and lightweight Transformers under strict GPU limits.
Result: Lightweight Transformers outperformed adapter-augmented LLMs; GRN was the best adapter (accuracy, precision, recall, F1 = 0.88).
Conclusion: In low-resource clinical settings, lightweight Transformers are more practical than large LLMs; GRN is viable for minimal adaptation.
Abstract: Fine-tuning Large Language Models (LLMs) for clinical Natural Language Processing (NLP) poses significant challenges due to domain gap, limited data, and stringent hardware constraints. In this study, we evaluate four adapter techniques (Adapter, Lightweight, TinyAttention, and Gated Residual Network (GRN)), equivalent to Low-Rank Adaptation (LoRA), for clinical note classification under real-world, resource-constrained conditions. All experiments were conducted on a single NVIDIA Quadro P620 GPU (2 GB VRAM, 512 CUDA cores, 1.386 TFLOPS FP32), limiting batch sizes to <8 sequences and maximum sequence length to 256 tokens. Our clinical corpus comprises only 580,000 tokens, several orders of magnitude smaller than standard LLM pre-training datasets. We fine-tuned three biomedical pre-trained LLMs (CamemBERT-bio, AliBERT, DrBERT) and two lightweight Transformer models trained from scratch. Results show that 1) adapter structures provide no consistent gains when fine-tuning biomedical LLMs under these constraints, and 2) simpler Transformers, with minimal parameter counts and training times under six hours, outperform adapter-augmented LLMs, which required over 1000 GPU-hours. Among adapters, GRN achieved the best metrics (accuracy, precision, recall, F1 = 0.88). These findings demonstrate that, in low-resource clinical settings with limited data and compute, lightweight Transformers trained from scratch offer a more practical and efficient solution than large LLMs, while GRN remains a viable adapter choice when minimal adaptation is needed.
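One plausible reading of a GRN-style adapter, sketched as a gated residual block; the bottleneck width, activation, and placement inside the frozen backbone are all assumptions, not the paper's stated configuration:

```python
import torch
import torch.nn as nn

class GRNAdapter(nn.Module):
    """Gated residual adapter sketch: a small bottleneck MLP whose update is
    gated (GLU-style) before a residual add and LayerNorm."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, hidden), nn.ELU(),
                                  nn.Linear(hidden, dim))
        self.gate = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, seq, dim) from a frozen backbone
        update = self.proj(x) * torch.sigmoid(self.gate(x))
        return self.norm(x + update)

print(GRNAdapter(768)(torch.randn(4, 256, 768)).shape)  # torch.Size([4, 256, 768])
```

Only the adapter's few parameters are trained, which is what makes such modules attractive under a 2 GB VRAM budget.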
[91] A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio
Ningyuan Xi, Yetao Wu, Kun Fan, Teng Chen, Qingqing Gu, Luo Ji
Main category: cs.CL
TL;DR: The paper explores optimal hyper-parameters for Continual Pre-Training (CPT) of LLMs, focusing on mixture ratios and learning rates, and demonstrates improved performance in Chinese and specific domains.
Details
Motivation: To bridge the gap between theoretical hyper-parameter choices and actual model performance in CPT, especially for enhancing Chinese language skills in LLMs.Method: CPT on Llama-3 8B and 70B models, studying the Additional Language Mixture Ratio (ALMR) and Learning Rate (LR) correlation, followed by fine-tuning.
Result: Improved performance on Chinese benchmarks and domains like math, coding, and emotional intelligence, with successful deployment of the 70B model in a chat system.
Conclusion: Optimal hyper-parameter selection in CPT enhances model capabilities and practical deployment, validating the approach for large-scale LLMs.
Abstract: Large Language Models (LLMs) often need Continual Pre-Training (CPT) to acquire unfamiliar language skills or adapt to new domains. The huge training cost of CPT demands a cautious choice of key hyper-parameters, such as the mixture ratio of the extra language or domain corpus. However, no systematic study bridges the gap between the optimal mixture ratio and actual model performance, or between experimental scaling laws and deployment at full model size. In this paper, we perform CPT on Llama-3 8B and 70B to enhance their Chinese ability. We study the optimal correlation between the Additional Language Mixture Ratio (ALMR) and the Learning Rate (LR) at the 8B size, which directly indicates the optimal experimental setup. Through a careful choice of hyper-parameters and subsequent fine-tuning, model capability improves not only on Chinese-related benchmarks but also in specific domains including math, coding, and emotional intelligence. We deploy the final 70B version of the LLM in a real-life chat system, where it obtains satisfying performance.
[92] MeTHanol: Modularized Thinking Language Models with Intermediate Layer Thinking, Decoding and Bootstrapping Reasoning
Ningyuan Xi, Xiaoyu Wang, Yetao Wu, Teng Chen, Qingqing Gu, Yue Zhao, Jinxian Qu, Zhonglin Jiang, Yong Chen, Luo Ji
Main category: cs.CL
TL;DR: The paper introduces MeTHanol, a modularized thinking language model that enhances LLM’s cognitive abilities by mimicking human brain architecture through dual-layer fine-tuning and a two-pass inference mechanism.
Details
Motivation: To improve the thinking and reasoning capabilities of large language models by adopting a modular approach inspired by human cognition.Method: Dual-layer fine-tuning using annotated (query, thought, answer) samples and a two-pass inference mechanism to generate thoughts and formal responses.
Result: MeTHanol demonstrates improved cognitive behaviors, including planning, self-reflection, and human-like responses, even on unseen tasks.
Conclusion: The modular approach shows promise for significant cognitive gains in LLMs, with potential applications in personalized and open-domain tasks.
Abstract: Current research efforts are focused on enhancing the thinking and reasoning capability of large language models (LLMs) by prompting, data-driven emergence and inference-time computation. In this study, we consider stimulating a language model’s thinking and cognitive abilities from a modular perspective, which mimics the human brain architecture. We select a specific intermediate attention layer with newly implemented language heads. We conduct dual-layer fine-tuning on annotated (query, thought, answer) samples and show that the intermediate layer can also learn to decode fluent and reasonable language tokens. A two-pass inference mechanism is designed to generate thoughts and then formal responses. The entire framework is called modularized thinking language model (MeTHanol), which can enhance an LLM’s cognitive behaviors as indicated by Theory of Mind (ToM) and vignette-based experiments. Case studies also show that MeTHanol can plan, self-reflect, and generate human-like thoughts and answers, even on unseen and open-domain tasks. MeTHanol can also adapt to a personalized prompt and behave as the specified character. Our study holds promise for significant cognitive gains from a modular perspective. Our code, model and data are available at https://bachozean.github.io/methanol-page
[93] Real-time Factuality Assessment from Adversarial Feedback
Sanxing Chen, Yukun Huang, Bhuwan Dhingra
Main category: cs.CL
TL;DR: Existing factuality evaluations for news are flawed; a new pipeline using RAG-based feedback creates deceptive variants to better test LLMs, revealing vulnerabilities in retrieval-free detectors.
Details
Motivation: Current evaluations for news factuality are inadequate as they rely on outdated or shallow patterns, failing to test reasoning about current events.Method: Developed a pipeline using RAG-based feedback to iteratively modify real-time news into deceptive variants, challenging LLMs.
Result: Decreased binary classification ROC-AUC by an absolute 17.5 percent for a RAG-based GPT-4o detector; retrieval-free LLM detectors proved vulnerable to unseen events and adversarial attacks.
Conclusion: RAG is crucial for evaluating and generating challenging news examples, as retrieval-free detectors are prone to adversarial attacks and unseen events.
Abstract: We show that existing evaluations for assessing the factuality of news from conventional sources, such as claims on fact-checking websites, result in high accuracies over time for LLM-based detectors, even after their knowledge cutoffs. This suggests that recent popular false information from such sources can be easily identified due to its likely presence in pre-training/retrieval corpora or the emergence of salient, yet shallow, patterns in these datasets. Instead, we argue that a proper factuality evaluation dataset should test a model’s ability to reason about current events by retrieving and reading related evidence. To this end, we develop a novel pipeline that leverages natural language feedback from a RAG-based detector to iteratively modify real-time news into deceptive variants that challenge LLMs. Our iterative rewrite decreases the binary classification ROC-AUC by an absolute 17.5 percent for a strong RAG-based GPT-4o detector. Our experiments reveal the important role of RAG in both evaluating and generating challenging news examples, as retrieval-free LLM detectors are vulnerable to unseen events and adversarial attacks, while feedback from RAG-based evaluation helps discover more deceitful patterns.
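A minimal sketch of the feedback loop described in the abstract; `detector(text) -> (flagged, feedback)` and `rewriter(text, feedback) -> text` are hypothetical LLM-backed callables standing in for the paper's pipeline:

```python
def adversarial_rewrite(article, detector, rewriter, max_iters=5):
    """Iteratively turn real-time news into a deceptive variant using the
    detector's natural-language feedback, stopping once the variant evades it."""
    variant = rewriter(article, "introduce a subtle, plausible factual error")
    for _ in range(max_iters):
        flagged, feedback = detector(variant)   # RAG detector reads retrieved evidence
        if not flagged:                         # deceptive variant now passes as factual
            return variant
        variant = rewriter(variant, feedback)   # feedback guides the next rewrite
    return variant
```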
[94] Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs
Yanzhu Guo, Simone Conia, Zelin Zhou, Min Li, Saloni Potdar, Henry Xiao
Main category: cs.CL
TL;DR: The paper addresses English-centric biases in multilingual LLMs, introduces metrics to assess naturalness, and proposes an alignment method to improve non-English outputs.
Details
Motivation: Current LLMs exhibit English-centric biases, leading to unnatural outputs in non-English languages, which has been understudied.Method: Introduces automatic corpus-level metrics to evaluate lexical and syntactic naturalness, and proposes an alignment method to enhance naturalness in target languages.
Result: Evaluation on French and Chinese shows English-influenced patterns; the alignment method improves naturalness without harming general performance.
Conclusion: Highlights the need for multilingual metrics and methods to address biases in LLMs.
Abstract: Current Large Language Models (LLMs) are predominantly designed with English as the primary language, and even the few that are multilingual tend to exhibit strong English-centric biases. Much like speakers who might produce awkward expressions when learning a second language, LLMs often generate unnatural outputs in non-English languages, reflecting English-centric patterns in both vocabulary and grammar. Despite the importance of this issue, the naturalness of multilingual LLM outputs has received limited attention. In this paper, we address this gap by introducing novel automatic corpus-level metrics to assess the lexical and syntactic naturalness of LLM outputs in a multilingual context. Using our new metrics, we evaluate state-of-the-art LLMs on a curated benchmark in French and Chinese, revealing a tendency towards English-influenced patterns. To mitigate this issue, we also propose a simple and effective alignment method to improve the naturalness of an LLM in a target language and domain, achieving consistent improvements in naturalness without compromising the performance on general-purpose benchmarks. Our work highlights the importance of developing multilingual metrics, resources and methods for the new wave of multilingual LLMs.
[95] What is Wrong with Perplexity for Long-context Language Modeling?
Lizhe Fang, Yifei Wang, Zhaoyang Liu, Chenheng Zhang, Stefanie Jegelka, Jinyang Gao, Bolin Ding, Yisen Wang
Main category: cs.CL
TL;DR: The paper identifies the unreliability of perplexity (PPL) for evaluating long-context capabilities in LLMs and proposes LongPPL, a novel metric focusing on key tokens, and LongCE loss for improved fine-tuning.
Details
Motivation: Current evaluation metrics like PPL fail to accurately assess long-context understanding in LLMs, necessitating a better approach.Method: Proposes LongPPL, a metric using long-short context contrast to identify key tokens, and LongCE loss for fine-tuning.
Result: LongPPL shows strong correlation (-0.96 Pearson) with long-context benchmarks, outperforming PPL. LongCE improves performance across benchmarks.
Conclusion: The work provides insights into PPL’s limitations and introduces effective solutions for evaluating and enhancing LLMs’ long-context capabilities.
Abstract: Handling long-context inputs is crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. While recent approaches have extended the context windows of LLMs and employed perplexity (PPL) as a standard evaluation metric, PPL has proven unreliable for assessing long-context capabilities. The underlying cause of this limitation has remained unclear. In this work, we provide a comprehensive explanation for this issue. We find that PPL overlooks key tokens, which are essential for long-context understanding, by averaging across all tokens and thereby obscuring the true performance of models in long-context scenarios. To address this, we propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them. Our experiments demonstrate that LongPPL strongly correlates with performance on various long-context benchmarks (e.g., Pearson correlation of -0.96), significantly outperforming traditional PPL in predictive accuracy. Additionally, we introduce LongCE (Long-context Cross-Entropy) loss, a re-weighting strategy for fine-tuning that prioritizes key tokens, leading to consistent improvements across diverse benchmarks. In summary, these contributions offer deeper insights into the limitations of PPL and present effective solutions for accurately evaluating and enhancing the long-context capabilities of LLMs. Code is available at https://github.com/PKU-ML/LongPPL.
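A simplified sketch of the key-token idea: identify tokens whose loss drops sharply when the long context is visible (the long-short contrast), then compute perplexity only over those. The thresholding here follows the paper only loosely:

```python
import torch

def long_ppl(nll_eval, nll_long, nll_short, threshold=1.0):
    """LongPPL-style metric (simplified): perplexity of the evaluated model
    restricted to 'key' tokens, found by long-short context contrast under
    a reference model."""
    key = (nll_short - nll_long) > threshold    # tokens that truly need long context
    return torch.exp(nll_eval[key].mean())

# toy per-token NLLs aligned over the same answer span
nll_eval = torch.tensor([2.0, 0.5, 3.0])
nll_long = torch.tensor([1.0, 0.4, 0.9])
nll_short = torch.tensor([1.2, 0.5, 3.5])
print(long_ppl(nll_eval, nll_long, nll_short))  # averages over the key token only
```

The LongCE loss applies the same intuition during fine-tuning, up-weighting key tokens in the cross-entropy instead of filtering them at evaluation time.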
[96] Summarization of Opinionated Political Documents with Varied Perspectives
Nicholas Deas, Kathleen McKeown
Main category: cs.CL
TL;DR: The paper introduces a dataset and task for summarizing political perspectives in opinionated news articles to reduce polarization. It evaluates 11 models, finding that even advanced models like GPT-4o struggle with faithfulness to perspectives.
Details
Motivation: To address rising global partisan hostility and polarization, especially around elections, by exposing users to diverse perspectives through accurate summaries.Method: Proposes a framework for evaluating perspective summaries, benchmarks 11 models (including LLMs) via automatic and human evaluation, and analyzes extraction behavior.
Result: While models like GPT-4o perform well on this task, all models struggle to generate summaries faithful to the intended perspective.
Conclusion: The study highlights challenges in perspective summarization and suggests further improvements are needed for model faithfulness.
Abstract: Global partisan hostility and polarization have increased, and this polarization is heightened around presidential elections. Models capable of generating accurate summaries of diverse perspectives can help reduce such polarization by exposing users to alternative perspectives. In this work, we introduce a novel dataset and task for independently summarizing each political perspective in a set of passages from opinionated news articles. For this task, we propose a framework for evaluating different dimensions of perspective summary performance. We benchmark 11 summarization models and LLMs of varying sizes and architectures through both automatic and human evaluation. While recent models like GPT-4o perform well on this task, we find that all models struggle to generate summaries that are faithful to the intended perspective. Our analysis of summaries focuses on how extraction behavior is impacted by features of the input documents.
[97] Benchmarking Linguistic Diversity of Large Language Models
Yanzhu Guo, Guokan Shang, Chloé Clavel
Main category: cs.CL
TL;DR: The paper highlights the gap in evaluating LLMs’ linguistic diversity (lexical, syntactic, semantic) and proposes a framework to assess it, benchmarking state-of-the-art models and analyzing development impacts.
Details
Motivation: Current LLM evaluations focus on task-solving, ignoring linguistic diversity, despite the rise of machine-generated content. This paper addresses this gap.Method: Proposes a framework to evaluate LLMs across lexical, syntactic, and semantic diversity dimensions, benchmarking models and analyzing development choices.
Result: Benchmarks show variations in linguistic diversity among LLMs, with an in-depth case study on syntactic diversity.
Conclusion: The study underscores the need to prioritize linguistic diversity in LLM development to better mimic human language richness.
Abstract: The development and evaluation of Large Language Models (LLMs) has primarily focused on their task-solving capabilities, with recent models even surpassing human performance in some areas. However, this focus often neglects whether machine-generated language matches the human level of diversity, in terms of vocabulary choice, syntactic construction, and expression of meaning, raising questions about whether the fundamentals of language generation have been fully addressed. This paper emphasizes the importance of examining the preservation of human linguistic richness by language models, given the concerning surge in online content produced or aided by LLMs. We propose a comprehensive framework for evaluating LLMs from various linguistic diversity perspectives including lexical, syntactic, and semantic dimensions. Using this framework, we benchmark several state-of-the-art LLMs across all diversity dimensions, and conduct an in-depth case study for syntactic diversity. Finally, we analyze how different development and deployment choices impact the linguistic diversity of LLM outputs.
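For a concrete sense of the lexical dimension, a type-token ratio is the simplest such measure; the paper's framework layers syntactic and semantic diversity measures on top of metrics of this kind:

```python
def type_token_ratio(tokens):
    """Lexical diversity only: distinct words over total words."""
    return len(set(tokens)) / len(tokens)

outputs = ["the cat sat on the mat".split(), "a dog ran in the park".split()]
print([round(type_token_ratio(t), 2) for t in outputs])  # [0.83, 1.0]
```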
[98] Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models
Shamus Sim, Tyrone Chen
Main category: cs.CL
TL;DR: The paper highlights the need to understand reasoning behavior in medical LLMs for explainable AI (XAI), surveys current methods, proposes frameworks for insight, and identifies open challenges.
Details
Motivation: To address the lack of studies on reasoning behavior in medical LLMs, emphasizing its importance for XAI in healthcare.Method: Adapts reasoning behavior concepts, surveys state-of-the-art approaches, and proposes theoretical frameworks for insight into medical LLMs.
Result: Provides frameworks to understand reasoning in medical LLMs and outlines key challenges for future development.
Conclusion: Increased transparency and trust in medical AI will accelerate its integration and development in healthcare.
Abstract: Background: Despite the current ubiquity of Large Language Models (LLMs) across the medical domain, there is a surprising lack of studies which address their reasoning behaviour. We emphasise the importance of understanding reasoning behaviour as opposed to high-level prediction accuracies, since it is equivalent to explainable AI (XAI) in this context. In particular, achieving XAI in medical LLMs used in the clinical domain will have a significant impact across the healthcare sector. Results: Therefore, in this work, we adapt the existing concept of reasoning behaviour and articulate its interpretation within the specific context of medical LLMs. We survey and categorise current state-of-the-art approaches for modeling and evaluating reasoning behaviour in medical LLMs. Additionally, we propose theoretical frameworks which can empower medical professionals or machine learning engineers to gain insight into the low-level reasoning operations of these previously obscure models. We also outline key open challenges facing the development of Large Reasoning Models. Conclusion: The subsequent increased transparency and trust in medical machine learning models by clinicians as well as patients will accelerate the integration, application as well as further development of medical AI for the healthcare system as a whole.
[99] Computational Analysis of Character Development in Holocaust Testimonies
Esther Shizgal, Eitan Wagner, Renana Keydar, Omri Abend
Main category: cs.CL
TL;DR: A computational method analyzes character development in Holocaust survivor testimonies, focusing on religious belief and practice trajectories, revealing common patterns.
Details
Motivation: To understand the inner and outer changes in protagonists' religious trajectories in narratives, using Holocaust survivor testimonies as a case study.Method: Natural language processing techniques are applied to cluster and analyze religious belief and practice trajectories in testimonies.
Result: Common structures of religiosity are identified: constant belief and oscillating practice patterns.
Conclusion: The study showcases NLP’s potential for thematic trajectory analysis in narratives, offering insights for historical and sociological research.
Abstract: This work presents a computational approach to analyze character development along the narrative timeline. The analysis characterizes the inner and outer changes the protagonist undergoes within a narrative, and the interplay between them. We consider transcripts of Holocaust survivor testimonies as a test case, each telling the story of an individual in first-person terms. We focus on the survivor’s religious trajectory, examining the evolution of their disposition toward religious belief and practice along the testimony. Clustering the resulting trajectories in the dataset, we identify common sequences in the data. Our findings highlight multiple common structures of religiosity across the narratives: in terms of belief, most present a constant disposition, while for practice, most present an oscillating structure, serving as valuable material for historical and sociological research. This work demonstrates the potential of natural language processing techniques for analyzing character evolution through thematic trajectories in narratives.
[100] Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA
Qingyun Jin, Xiaohui Song, Feng Zhou, Zengchang Qin
Main category: cs.CL
TL;DR: The paper proposes a method to convert multi-head attention (MHA) to grouped-query attention (GQA) with flexible KV head compression, using Procrustes analysis and L0 regularization, achieving significant compression with minimal performance loss.
Details
Motivation: Address the inefficiency of linearly increasing KV cache in large language models (LLMs) by introducing a cost-effective MHA-to-GQA conversion method.Method: Uses Procrustes analysis to enhance attention head similarity and L0 regularization to prune redundant parameters, adapting the model to GQA.
Result: Compresses 87.5% KV heads in LLaMA2-7B and 75% in Sheared-LLaMA-1.3B with acceptable performance degradation.
Conclusion: The method effectively reduces KV cache overhead while maintaining model performance, offering a practical solution for LLM efficiency.
Abstract: Large language models (LLMs) have demonstrated exceptional performance across diverse natural language processing tasks. However, as the model size and the input sequence’s length increase, the linearly increasing key-value (KV) cache significantly degrades inference throughput. Therefore, grouped-query attention (GQA), as an alternative to multi-head attention (MHA), has been widely introduced into LLMs. In this work, we propose a cost-effective method for converting MHA into GQA with any compression ratio of KV heads. The key point of our method lies in the application of Procrustes analysis to the attention heads, which enhances the similarity among attention heads while preserving computational invariance, thereby improving the model’s post-training performance. Subsequently, we employ L0 regularization to prune redundant parameters. The model after pruning can be adapted to the standard GQA framework. Experimental results show that our strategy can compress up to 87.5% KV heads of the LLaMA2-7B model and 75% KV heads of Sheared-LLaMA-1.3B with acceptable performance degradation. Our code is released at https://github.com/fpcsong/mha2gqa.
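The Procrustes step can be sketched directly: find the orthogonal rotation that best maps one head's projection onto a reference head's before merging the group. The merging by mean-pooling below is an illustrative choice, not necessarily the paper's:

```python
import torch

def procrustes_rotation(W_src, W_ref):
    """Orthogonal Procrustes: rotation R minimizing ||W_src @ R - W_ref||_F.
    For an orthogonal R applied to a key head (with the matching rotation on
    its query head), Q R (K R)^T = Q K^T, so the computation is preserved
    while grouped heads become more similar before merging."""
    U, _, Vt = torch.linalg.svd(W_src.T @ W_ref)
    return U @ Vt

W_src, W_ref = torch.randn(128, 64), torch.randn(128, 64)
R = procrustes_rotation(W_src, W_ref)
merged = 0.5 * (W_src @ R + W_ref)   # toy stand-in for merging a 2-head group
```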
[101] FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings
Tong Liu, Xiao Yu, Wenxuan Zhou, Jindong Gu, Volker Tresp
Main category: cs.CL
TL;DR: FocalPO, a variant of DPO, improves LLM alignment by down-weighing misranked pairs and enhancing correct ones, outperforming DPO on benchmarks.
Details
Motivation: DPO often fails to improve misranked preference pairs despite its gradient focus, prompting the need for a more effective method.Method: FocalPO introduces a modulating factor to dynamically scale DPO loss, prioritizing correctly ranked pairs.
Result: FocalPO outperforms DPO and variants on benchmarks like Alpaca Eval 2.0, with fixed hyperparameters.
Conclusion: FocalPO effectively enhances LLM alignment by focusing on correct pairs and empirically demonstrates superior performance.
Abstract: Efficient preference optimization algorithms such as Direct Preference Optimization (DPO) have become a popular approach in aligning large language models (LLMs) with human preferences. These algorithms implicitly treat the LLM as a reward model, and focus on training it to correct misranked preference pairs. However, recent work (Chen et al., 2024) empirically finds that DPO training rarely improves these misranked preference pairs, despite its gradient emphasizing these cases. We introduce FocalPO, a DPO variant that instead down-weighs misranked preference pairs and prioritizes enhancing the model’s understanding of pairs that it can already rank correctly. Inspired by Focal Loss used in vision tasks, FocalPO achieves this by adding a modulating factor to dynamically scale DPO loss. Our experiments demonstrate that FocalPO surpasses DPO and its variants on popular benchmarks like Alpaca Eval 2.0 using Mistral-Base-7B and Llama-3-Instruct-8B, with the introduced hyperparameter fixed. Additionally, we empirically reveal how FocalPO affects training on correct and incorrect sample groups, further underscoring its effectiveness.
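A sketch of the focal modulation applied to the DPO objective; the exact form of the modulating factor in the paper may differ:

```python
import torch
import torch.nn.functional as F

def focal_po_loss(logp_ratio_chosen, logp_ratio_rejected, beta=0.1, gamma=2.0):
    """Focal-modulated DPO loss (sketch). p = sigmoid(beta * margin) is the
    probability the pair is ranked correctly; the p**gamma factor shrinks the
    loss on misranked pairs (small p) and keeps weight on pairs the model
    already ranks correctly, inverting Focal Loss's usual emphasis."""
    margin = logp_ratio_chosen - logp_ratio_rejected   # policy-vs-reference log ratios
    p = torch.sigmoid(beta * margin)
    return -(p ** gamma * F.logsigmoid(beta * margin)).mean()

chosen = torch.tensor([1.2, -0.3])     # log pi/pi_ref for chosen responses
rejected = torch.tensor([0.1, 0.5])    # log pi/pi_ref for rejected responses
print(focal_po_loss(chosen, rejected))
```

Setting gamma=0 recovers plain DPO, which makes the modulating factor easy to ablate.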
[102] Large Language Models Are Human-Like Internally
Tatsuki Kuribayashi, Yohei Oseki, Souhaib Ben Taieb, Kentaro Inui, Timothy Baldwin
Main category: cs.CL
TL;DR: Larger LMs’ cognitive plausibility was underestimated due to focus on final layers; internal layers align better with human data.
Details
Motivation: Reassess claims of larger LMs' cognitive implausibility by examining internal layers, not just final ones.Method: Analyze next-word probabilities from internal layers of LMs and compare with human behavioral and neurophysiological measures.
Result: Internal layers of larger LMs align well with human data, challenging prior conclusions; layer-specific alignments with human measures identified.
Conclusion: Larger LMs’ cognitive plausibility is higher than previously thought, with layer-specific human-like processing patterns, opening new research avenues.
Abstract: Recent cognitive modeling studies have reported that larger language models (LMs) exhibit a poorer fit to human reading behavior (Oh and Schuler, 2023b; Shain et al., 2024; Kuribayashi et al., 2024), leading to claims of their cognitive implausibility. In this paper, we revisit this argument through the lens of mechanistic interpretability and argue that prior conclusions were skewed by an exclusive focus on the final layers of LMs. Our analysis reveals that next-word probabilities derived from internal layers of larger LMs align with human sentence processing data as well as, or better than, those from smaller LMs. This alignment holds consistently across behavioral (self-paced reading times, gaze durations, MAZE task processing times) and neurophysiological (N400 brain potentials) measures, challenging earlier mixed results and suggesting that the cognitive plausibility of larger LMs has been underestimated. Furthermore, we first identify an intriguing relationship between LM layers and human measures: earlier layers correspond more closely with fast gaze durations, while later layers better align with relatively slower signals such as N400 potentials and MAZE processing times. Our work opens new avenues for interdisciplinary research at the intersection of mechanistic interpretability and cognitive modeling.
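Reading next-word probabilities off internal layers is essentially the logit-lens technique; a sketch for a HuggingFace GPT-2-style model (the attribute names `transformer.ln_f` and `lm_head` assume that architecture):

```python
import torch

@torch.no_grad()
def layerwise_surprisal(model, input_ids, pos):
    """Decode next-word distributions from every layer's hidden state via the
    final norm and unembedding, yielding per-layer surprisal for pos+1."""
    out = model(input_ids, output_hidden_states=True)
    target = input_ids[0, pos + 1]
    surprisals = []
    for h in out.hidden_states:                       # embeddings + each layer
        logits = model.lm_head(model.transformer.ln_f(h[:, pos]))
        surprisals.append(-logits.log_softmax(-1)[0, target].item())
    return surprisals   # compare against gaze durations, N400 amplitudes, etc.
```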
[103] LIMO: Less is More for Reasoning
Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, Pengfei Liu
Main category: cs.CL
TL;DR: LIMO model achieves high accuracy in mathematical reasoning with minimal training data, challenging the need for massive datasets.
Details
Motivation: To disprove the assumption that complex reasoning in LLMs requires extensive training data.Method: Simple supervised fine-tuning with minimal examples.
Result: 63.3% accuracy on AIME24 and 95.6% on MATH500, outperforming models with 100x more data.
Conclusion: Proposes the LIMO Hypothesis: complex reasoning emerges with minimal, strategic demonstrations, dependent on pre-trained knowledge and effective post-training examples.
Abstract: We challenge the prevailing assumption that complex reasoning in large language models (LLMs) necessitates massive training data. We demonstrate that sophisticated mathematical reasoning can emerge with only a few examples. Specifically, through simple supervised fine-tuning, our model, LIMO, achieves 63.3% accuracy on AIME24 and 95.6% on MATH500, surpassing previous fine-tuned models (6.5% on AIME24, 59.2% on MATH500) while using only 1% of the training data required by prior approaches. Furthermore, LIMO exhibits strong out-of-distribution generalization, achieving a 45.8% absolute improvement across diverse benchmarks, outperforming models trained on 100x more data. Synthesizing these findings, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning can emerge through minimal but strategically designed demonstrations of cognitive processes. This hypothesis suggests that the threshold for eliciting complex reasoning is not dictated by task complexity but rather by two key factors: (1) the completeness of the model’s pre-trained knowledge base and (2) the effectiveness of post-training examples in serving as “cognitive templates” that guide reasoning.
[104] Improving Similar Case Retrieval Ranking Performance By Revisiting RankSVM
Yuqi Liu, Yan Zheng
Main category: cs.CL
TL;DR: The paper proposes using RankSVM, a pairwise learning-to-rank method, to improve similar case retrieval performance in Legal AI, outperforming traditional classifiers and mitigating overfitting.
Details
Motivation: To enhance ranking performance in similar case retrieval by leveraging learning-to-rank techniques instead of relying solely on language models.Method: Experiments using RankSVM as a classifier substitute for a fully connected layer, combined with language models on LeCaRDv1 and LeCaRDv2 datasets.
Result: RankSVM improves retrieval performance and mitigates overfitting due to class imbalance.
Conclusion: RankSVM is effective for optimizing ranking in similar case retrieval tasks and addresses overfitting issues.
Abstract: Given the rapid development of Legal AI, much attention has been paid to one of the most important legal AI tasks: similar case retrieval, especially approaches built on language models. In our paper, however, we try to improve the ranking performance of current models from the perspective of learning to rank rather than language modeling. Specifically, we conduct experiments using a pairwise method, RankSVM, as the classifier substituting a fully connected layer, combined with commonly used language models, on the similar case retrieval datasets LeCaRDv1 and LeCaRDv2. We conclude that RankSVM generally helps improve retrieval performance on the LeCaRDv1 and LeCaRDv2 datasets compared with the original classifiers by optimizing the precise ranking. It can also help mitigate overfitting owing to class imbalance. Our code is available at https://github.com/liuyuqi123study/RankSVM_for_SLR
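As a concrete illustration, pairwise RankSVM can be implemented by reducing ranking to binary classification on same-query feature differences and fitting a linear SVM. The sketch below, with illustrative random data, follows that standard reduction; the authors’ exact setup on LeCaRD may differ.

```python
# Pairwise RankSVM sketch: rank candidates by a linear SVM fit on
# same-query feature differences (illustrative data, not LeCaRD).
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_transform(X, y, query_ids):
    """Build difference vectors x_i - x_j for same-query candidate pairs
    with different relevance labels."""
    diffs, signs = [], []
    for q in np.unique(query_ids):
        idx = np.where(query_ids == q)[0]
        for i in idx:
            for j in idx:
                if y[i] > y[j]:  # i should rank above j
                    diffs.append(X[i] - X[j]); signs.append(1)
                    diffs.append(X[j] - X[i]); signs.append(-1)
    return np.array(diffs), np.array(signs)

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 8))             # candidate-case embeddings
y = rng.integers(0, 3, size=12)          # graded relevance labels
query_ids = np.repeat([0, 1, 2], 4)      # which query each candidate belongs to

Xp, yp = pairwise_transform(X, y, query_ids)
ranker = LinearSVC().fit(Xp, yp)
scores = X @ ranker.coef_.ravel()        # higher score = ranked higher
```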
[105] Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification
Xuansheng Wu, Wenhao Yu, Xiaoming Zhai, Ninghao Liu
Main category: cs.CL
TL;DR: A framework to identify and regularize unintended features in LLM embeddings for text classification, improving generalizability and compliance.
Details
Motivation: LLM embeddings are effective but opaque, making it hard to remove unintended features like sensitive or irrelevant ones, which affects compliance and generalizability.Method: Pre-train a sparse autoencoder (SAE) to extract interpretable features, fine-tune it on task-specific data, and use a regularizer to minimize similarity between classifier weights and unintended features.
Result: The framework improves classifier generalizability by regularizing non-task-relevant features, demonstrated on toxic chat detection, reward modeling, and disease diagnosis.
Conclusion: The work enables controllable text classification on LLM latent spaces, addressing generalizability, fairness, and privacy challenges.
Abstract: Modern text classification methods heavily rely on contextual embeddings from large language models (LLMs). Compared to human-engineered features, these embeddings provide automatic and effective representations for classification model training. However, they also introduce a challenge: we lose the ability to manually remove unintended features, such as sensitive or task-irrelevant features, to guarantee regulatory compliance or improve the generalizability of classification models. This limitation arises because LLM embeddings are opaque and difficult to interpret. In this paper, we propose a novel framework to identify and regularize unintended features in the LLM latent space. Specifically, we first pre-train a sparse autoencoder (SAE) to extract interpretable features from LLM latent spaces. To ensure the SAE can capture task-specific features, we further fine-tune it on task-specific datasets. In training the classification model, we propose a simple and effective regularizer, by minimizing the similarity between the classifier weights and the identified unintended feature, to remove the impact of these unintended features on classification. We evaluate the proposed framework on three real-world tasks, including toxic chat detection, reward modeling, and disease diagnosis. Results show that the proposed self-regularization framework can improve the classifier’s generalizability by regularizing those features that are not semantically correlated to the task. This work pioneers controllable text classification on LLM latent spaces by leveraging interpreted features to address generalizability, fairness, and privacy challenges. The code and data are publicly available at https://github.com/JacksonWuxs/Controllable_LLM_Classifier.
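A minimal sketch of the regularization idea: add a penalty that pushes the linear classifier’s weight vector orthogonal to the SAE decoder directions flagged as unintended. Variable names and the exact penalty form here are assumptions, not the paper’s code.

```python
# Self-regularization sketch: cross-entropy plus a penalty on alignment
# between classifier weights and unintended SAE feature directions.
import torch
import torch.nn.functional as F

def regularized_loss(logits, labels, clf_weight, unintended_dirs, lam=1.0):
    # clf_weight: (d,) linear classifier weights over the LLM embedding space.
    # unintended_dirs: (k, d) SAE decoder rows flagged as unintended features.
    task_loss = F.cross_entropy(logits, labels)
    cos = F.cosine_similarity(unintended_dirs, clf_weight.unsqueeze(0), dim=-1)
    reg = (cos ** 2).mean()  # squared cosine: penalize alignment in either sign
    return task_loss + lam * reg
```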
[106] FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models
Radu Marinescu, Debarun Bhattacharjya, Junkyu Lee, Tigran Tchrakian, Javier Carnerero Cano, Yufang Hou, Elizabeth Daly, Alessandra Pascale
Main category: cs.CL
TL;DR: FactReasoner improves factual accuracy in LLM-generated content by decomposing responses, retrieving contexts, and using probabilistic reasoning to assess support.
Details
Motivation: LLMs struggle with factual correctness, making them unreliable for tasks requiring accurate responses.Method: FactReasoner decomposes responses into atomic units, retrieves relevant contexts, and uses probabilistic reasoning to assess factual support.
Result: FactReasoner outperforms state-of-the-art prompt-based methods in factual precision and recall.
Conclusion: FactReasoner is a promising solution for enhancing the factual reliability of LLM-generated content.
Abstract: Large language models (LLMs) have demonstrated vast capabilities on generative tasks in recent years, yet they struggle with guaranteeing the factual correctness of the generated content. This makes these models unreliable in realistic situations where factually accurate responses are expected. In this paper, we propose FactReasoner, a new factuality assessor that relies on probabilistic reasoning to assess the factuality of a long-form generated response. Specifically, FactReasoner decomposes the response into atomic units, retrieves relevant contexts for them from an external knowledge source, and constructs a joint probability distribution over the atoms and contexts using probabilistic encodings of the logical relationships (entailment, contradiction) between the textual utterances corresponding to the atoms and contexts. FactReasoner then computes the posterior probability of whether atomic units in the response are supported by the retrieved contexts. Our experiments on labeled and unlabeled benchmark datasets demonstrate clearly that FactReasoner improves considerably over state-of-the-art prompt-based approaches in terms of both factual precision and recall.
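As a simplified stand-in for the paper’s joint probabilistic model, the sketch below aggregates per-context entailment and contradiction probabilities into a posterior that an atomic claim is supported, using a noisy-OR combination; FactReasoner’s actual graphical model is richer than this.

```python
# Noisy-OR stand-in for FactReasoner's posterior over atom support.
def support_posterior(entail_probs, contra_probs):
    # Probability that at least one retrieved context entails the atom.
    p_not_entailed = 1.0
    for p in entail_probs:
        p_not_entailed *= (1.0 - p)
    p_supported = 1.0 - p_not_entailed
    # Discount by the chance that some context contradicts the atom.
    p_not_contradicted = 1.0
    for p in contra_probs:
        p_not_contradicted *= (1.0 - p)
    return p_supported * p_not_contradicted

# Two contexts entail the atom with prob. 0.9 and 0.2; two contradict weakly.
print(support_posterior([0.9, 0.2], [0.05, 0.10]))  # ~0.787
```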
[107] In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents
Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long T. Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, Anand Iyer, Tianlong Chen, Huan Liu, Chen-Yu Lee, Tomas Pfister
Main category: cs.CL
TL;DR: Proposes Reflective Memory Management (RMM) for LLMs to improve long-term dialogue personalization by dynamically summarizing and refining memory retrieval.
Details
Motivation: Addresses limitations of rigid memory granularity and fixed retrieval in LLMs for sustained personalization in dialogues.Method: Introduces RMM with Prospective Reflection (dynamic summarization) and Retrospective Reflection (adaptive retrieval refinement via RL).
Result: Achieves over 10% accuracy improvement on LongMemEval compared to baselines.
Conclusion: RMM enhances LLM performance in long-term interactions by improving memory management.
Abstract: Large Language Models (LLMs) have made significant progress in open-ended dialogue, yet their inability to retain and retrieve relevant information from long-term interactions limits their effectiveness in applications requiring sustained personalization. External memory mechanisms have been proposed to address this limitation, enabling LLMs to maintain conversational continuity. However, existing approaches struggle with two key challenges. First, rigid memory granularity fails to capture the natural semantic structure of conversations, leading to fragmented and incomplete representations. Second, fixed retrieval mechanisms cannot adapt to diverse dialogue contexts and user interaction patterns. In this work, we propose Reflective Memory Management (RMM), a novel mechanism for long-term dialogue agents, integrating forward- and backward-looking reflections: (1) Prospective Reflection, which dynamically summarizes interactions across granularities-utterances, turns, and sessions-into a personalized memory bank for effective future retrieval, and (2) Retrospective Reflection, which iteratively refines the retrieval in an online reinforcement learning (RL) manner based on LLMs’ cited evidence. Experiments show that RMM demonstrates consistent improvement across various metrics and benchmarks. For example, RMM shows more than 10% accuracy improvement over the baseline without memory management on the LongMemEval dataset.
[108] Data Caricatures: On the Representation of African American Language in Pretraining Corpora
Nicholas Deas, Blake Vente, Amith Ananthram, Jessica A. Grieser, Desmond Patton, Shana Kleiner, James Shepard, Kathleen McKeown
Main category: cs.CL
TL;DR: The paper evaluates the representation of African American Language (AAL) in 12 English pretraining corpora, finding it underrepresented and often inappropriate or stereotypical. Automated filters also favor White Mainstream English (WME) over AAL.
Details
Motivation: To assess the quantity and quality of AAL representation in pretraining corpora, addressing underrepresentation and potential harm in language models.Method: Combines quantitative experiments, human judgments, and qualitative analyses to evaluate AAL sources, variation, and naturalness in 12 corpora.
Result: AAL is underrepresented (0.007%-0.18% of documents) and over 25% of AAL texts in C4 may reinforce harmful stereotypes. Automated filters favor WME.
Conclusion: AAL is underrepresented and often inappropriately represented in pretraining corpora, with automated systems biased toward WME, highlighting the need for better inclusion practices.
Abstract: With a combination of quantitative experiments, human judgments, and qualitative analyses, we evaluate the quantity and quality of African American Language (AAL) representation in 12 predominantly English, open-source pretraining corpora. We specifically focus on the sources, variation, and naturalness of included AAL texts representing the AAL-speaking community. We find that AAL is underrepresented in all evaluated pretraining corpora compared to US demographics, constituting as few as 0.007% and at most 0.18% of documents. We also find that more than 25% of AAL texts in C4 may be perceived as inappropriate for LLMs to generate and to reinforce harmful stereotypes. Finally, we find that most automated filters are more likely to conserve White Mainstream English (WME) texts over AAL in pretraining corpora.
[109] Automating Mathematical Proof Generation Using Large Language Model Agents and Knowledge Graphs
Vincent Li, Tim Knappe, Yule Fu, Kevin Han, Kevin Zhu
Main category: cs.CL
TL;DR: KG-Prover enhances LLMs for theorem proving using knowledge graphs, improving performance by up to 21% without additional fine-tuning.
Details
Motivation: Addressing challenges in theorem proving, such as identifying key concepts and formalizing proofs, by leveraging knowledge graphs.Method: KG-Prover augments LLMs with knowledge graphs mined from mathematical texts to construct and formalize proofs.
Result: Performance improvements of 2-21% on datasets like ProofNet and miniF2F-test, with KG-Prover achieving over 50% on miniF2F-test.
Conclusion: KG-Prover offers a scalable, effective approach to enhance proof reasoning in LLMs using knowledge graphs.
Abstract: Large language models have demonstrated remarkable capabilities in natural language processing tasks requiring multi-step logical reasoning, such as automated theorem proving. However, challenges persist within theorem proving, such as identifying key mathematical concepts, understanding their interrelationships, and formalizing proofs correctly within natural language. We present KG-Prover, a novel framework that leverages knowledge graphs mined from reputable mathematical texts to augment general-purpose LLMs in constructing and formalizing mathematical proofs. We also study the effects of scaling graph-based, test-time compute using KG-Prover, demonstrating significant performance improvements over baselines across multiple datasets. General-purpose LLMs improve by up to 21% on miniF2F-test when combined with KG-Prover, with consistent improvements of 2-11% on the ProofNet, miniF2F-test, and MUSTARD datasets without additional scaling. Furthermore, KG-Prover with o4-mini achieves over 50% on miniF2F-test. This work provides a promising approach for augmenting natural language proof reasoning with knowledge graphs without the need for additional fine-tuning.
[110] Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs
Rupak Sarkar, Neha Srikanth, Taylor Hudson, Rachel Rudinger, Claire Bonial, Philip Resnik
Main category: cs.CL
TL;DR: The paper explores how misalignment in common ground (conversational friction) affects task success in text-based technical support chats, finding LLMs limited in detecting subtle cases.
Details
Motivation: To understand the impact of conversational grounding on task-oriented conversations, particularly in technical support settings.Method: Analysis of the Ubuntu IRC dataset to identify grounding failures (conversational friction) and their correlation with task success.
Result: Conversational friction, caused by misaligned beliefs/assumptions, correlates with task success; LLMs detect overt but not subtle cases.
Conclusion: Common ground alignment is crucial for task success, but LLMs need improvement for nuanced grounding issues.
Abstract: While it is commonly accepted that maintaining common ground plays a role in conversational success, little prior research exists connecting conversational grounding to success in task-oriented conversations. We study failures of grounding in the Ubuntu IRC dataset, where participants use text-only communication to resolve technical issues. We find that disruptions in conversational flow often stem from a misalignment in common ground, driven by a divergence in beliefs and assumptions held by participants. These disruptions, which we call conversational friction, significantly correlate with task success. We find that although LLMs can identify overt cases of conversational friction, they struggle with subtler and more context-dependent instances requiring pragmatic or domain-specific reasoning.
[111] Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation
Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, Yuqian Fu, Nan Xu, Wei He, Xiangyuan Lan, Dongmei Jiang, Dongbin Zhao
Main category: cs.CL
TL;DR: DPO is a scalable, cost-effective alternative to RL for enhancing LLM reasoning, achieving RL-level performance with lower computational overhead.
Details
Motivation: High computational costs of RL-based methods for LLMs drive interest in alternatives like DPO.Method: Investigates DPO for iterative preference-based learning, including a framework for mutual improvement of generator and reward model.
Result: Single round of DPO with coarse filtering boosts mathematical reasoning; iterative DPO-VP achieves RL-level performance with lower overhead.
Conclusion: DPO is a practical, scalable solution for improving LLM reasoning in resource-limited scenarios.
Abstract: Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with RL-based approaches have led to growing interest in alternative paradigms, such as Direct Preference Optimization (DPO). In this study, we investigate the effectiveness of DPO in facilitating self-improvement for LLMs through iterative preference-based learning. We demonstrate that a single round of DPO with coarse filtering significantly enhances mathematical reasoning performance, particularly for strong base models. Furthermore, we design an iterative enhancement framework for both the generator and the reward model (RM), enabling their mutual improvement through online interaction across multiple rounds of DPO. Finally, with simple verifiable rewards, our model DPO-VP achieves RL-level performance with significantly lower computational overhead. These findings highlight DPO as a scalable and cost-effective alternative to RL, offering a practical solution for enhancing LLM reasoning in resource-constrained situations.
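A minimal sketch of the iterative loop the paper describes: in each round the generator samples responses, the reward model turns them into preference pairs, and both are updated. All object methods here are illustrative stand-ins, not the authors’ API.

```python
# Sketch of the generator/reward-model co-improvement loop (illustrative
# stand-in methods, not the authors' code).
def iterative_dpo(generator, reward_model, prompts, rounds=3, k=4):
    for _ in range(rounds):
        pairs = []
        for p in prompts:
            responses = [generator.sample(p) for _ in range(k)]
            ranked = sorted(responses, key=reward_model.score, reverse=True)
            pairs.append((p, ranked[0], ranked[-1]))  # (prompt, chosen, rejected)
        generator.dpo_update(pairs)   # one round of DPO on freshly mined pairs
        reward_model.update(pairs)    # reward model improves in tandem
    return generator
```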
[112] TIB-STC: A Large-Scale Structured Tibetan Benchmark for Low-Resource Language Modeling
Cheng Huang, Fan Gao, Yutong Liu, Nyima Tashi, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Hao Wang, Yongbin Yu
Main category: cs.CL
TL;DR: The paper introduces TIB-STC, a large-scale, expert-curated benchmark for Tibetan language LLMs, and validates its utility with a reference model, Sun-Shine, showing effectiveness in culturally aligned tasks.
Details
Motivation: To address the uneven distribution of LLM advancements for low-resource and culturally rich languages like Tibetan.Method: Developed TIB-STC, a multi-domain Tibetan benchmark, and trained Sun-Shine using a three-stage pipeline (pretraining, fine-tuning, preference optimization).
Result: Sun-Shine performed well on Tibetan-specific tasks (Ti-MMLU, Ti-SafetyBench), demonstrating robust instruction-following and cultural alignment.
Conclusion: TIB-STC advances low-resource language modeling and promotes inclusivity in multilingual NLP.
Abstract: Advancement of large language models (LLMs) has brought transformative capabilities to NLP, but such progress remains unevenly distributed, especially for low-resource and culturally rich languages like Tibetan. In this paper, we present TIB-STC, the first large-scale, expert-curated, and multi-domain benchmark specifically designed to support the development and evaluation of LLMs for the Tibetan language. Spanning over 11 billion tokens across literature, religion, medicine, law, and daily communication, TIB-STC preserves traditional grammar and stylistic richness. To validate its utility, we train a reference model, Sun-Shine, on TIB-STC through a three-stage pipeline involving pretraining, supervised fine-tuning, and preference optimization. Evaluation on TLUE Benchmark for Tibetan-specific tasks, including Ti-MMLU and Ti-SafetyBench, demonstrates the benchmark’s effectiveness in enabling robust instruction-following and culturally aligned generation. We release TIB-STC to advance research in low-resource language modeling and promote inclusivity in multilingual NLP. All data are available at: https://github.com/Vicentvankor/sun-shine
[113] Navigating the Risks of Using Large Language Models for Text Annotation in Social Science Research
Hao Lin, Yongjun Zhang
Main category: cs.CL
TL;DR: The paper evaluates the use of LLMs in computational social science, focusing on text classification in social movement studies, and provides a framework for their integration, addressing both benefits and risks.
Details
Motivation: To assess the potential and risks of LLMs in automating text analysis for social science research, particularly in social movement studies.Method: Proposes a framework for integrating LLMs into text annotation, including tools for optimizing prompts and evaluating validity, reliability, and transparency.
Result: Identifies epistemic risks (validity, reliability, replicability, transparency) and offers guidelines for using LLMs in text annotation.
Conclusion: Provides practical guidelines for LLM use in text annotation and recommendations for communicating epistemic risks in research.
Abstract: Large language models (LLMs) have the potential to revolutionize computational social science, particularly in automated textual analysis. In this paper, we conduct a systematic evaluation of the promises and risks associated with using LLMs for text classification tasks, using social movement studies as an example. We propose a framework for social scientists to incorporate LLMs into text annotation, either as the primary coding decision-maker or as a coding assistant. This framework offers researchers tools to develop the potential best-performing prompt, and to systematically examine and report the validity and reliability of LLMs as a methodological tool. Additionally, we evaluate and discuss its epistemic risks associated with validity, reliability, replicability, and transparency. We conclude with several practical guidelines for using LLMs in text annotation tasks and offer recommendations for more effectively communicating epistemic risks in research.
[114] Language Modeling for the Future of Finance: A Survey into Metrics, Tasks, and Data Opportunities
Nikita Tatarinov, Siddhant Sukhani, Agam Shah, Sudheer Chava
Main category: cs.CL
TL;DR: A review of 374 NLP papers (2017-2024) highlights trends and opportunities in applying NLP to finance, including forecasting, financial metrics, multilingual datasets, and model efficiency.
Details
Motivation: To systematically examine the application of NLP techniques in finance and identify research opportunities.Method: Review of 374 NLP papers (221 finance-focused) across 38 conferences/workshops, evaluated on 11 dimensions.
Result: Identified opportunities: expanding forecasting tasks, using financial metrics, leveraging multilingual/crisis datasets, and balancing PLMs with efficient/interpretable models.
Conclusion: Provides actionable research directions and recommendations for academia and industry, emphasizing practical applications.
Abstract: Recent advances in language modeling have led to growing interest in applying Natural Language Processing (NLP) techniques to financial problems, enabling new approaches to analysis and decision-making. To systematically examine this trend, we review 374 NLP research papers published between 2017 and 2024 across 38 conferences and workshops, with a focused analysis of 221 papers that directly address finance-related tasks. We evaluate these papers across 11 quantitative and qualitative dimensions, and our study identifies the following opportunities: (i) expanding the scope of forecasting tasks; (ii) enriching evaluation with financial metrics; (iii) leveraging multilingual and crisis-period datasets; and (iv) balancing PLMs with efficient or interpretable alternatives. We identify actionable directions for research and practice, supported by dataset and tool recommendations, with implications for both the academia and industry communities.
[115] Memorization: A Close Look at Books
Iris Ma, Ian Domingo, Alberto Krone-Martins, Pierre Baldi, Cristina V. Lopes
Main category: cs.CL
TL;DR: The paper explores extracting entire books from LLMs using Llama 3 70B and ‘prefix-prompting,’ achieving high similarity for some books like ‘Alice’s Adventures in Wonderland,’ but success varies by book popularity. It also examines the undoing of mitigations in instruction-tuned Llama 3.1, linking it to minor weight changes.
Details
Motivation: To assess the extent of verbatim memorization in LLMs and the effectiveness of regurgitation mitigation strategies.Method: Uses Llama 3 70B models and ‘prefix-prompting’ to extract books, analyzing extraction rates and the impact of fine-tuning on memorization.
Result: High extraction rates for popular books, but not all; undoing of mitigations in Llama 3.1 tied to small weight changes in lower transformer blocks.
Conclusion: Current mitigation strategies have limits, and fine-tuning affects memorization retrieval, highlighting vulnerabilities in aligned LLMs.
Abstract: To what extent can entire books be extracted from LLMs? Using the Llama 3 70B family of models, and the “prefix-prompting” extraction technique, we were able to auto-regressively reconstruct, with a very high level of similarity, one entire book (Alice’s Adventures in Wonderland) from just the first 500 tokens. We were also able to obtain high extraction rates on several other books, piece-wise. However, these successes do not extend uniformly to all books. We show that extraction rates of books correlate with book popularity and thus, likely duplication in the training data. We also confirm the undoing of mitigations in the instruction-tuned Llama 3.1, following recent work (Nasr et al., 2025). We further find that this undoing comes from changes to only a tiny fraction of weights concentrated primarily in the lower transformer blocks. Our results provide evidence of the limits of current regurgitation mitigation strategies and introduce a framework for studying how fine-tuning affects the retrieval of verbatim memorization in aligned LLMs.
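A minimal sketch of prefix-prompting extraction: feed the book’s opening tokens, decode greedily, and compare the continuation against the reference text. GPT-2 and the similarity metric below are illustrative stand-ins for the Llama 3 70B setup and the paper’s exact protocol.

```python
# Prefix-prompting extraction sketch: greedy continuation from a book's
# opening, compared against the reference text (GPT-2 as a small stand-in).
import difflib
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prefix = "Alice was beginning to get very tired of sitting by her sister"
inputs = tokenizer(prefix, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
continuation = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:])

reference = "..."  # placeholder: the corresponding span of the actual book
similarity = difflib.SequenceMatcher(None, continuation, reference).ratio()
print(round(similarity, 3))
```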
[116] When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars
Rei Higuchi, Ryotaro Kawata, Naoki Nishikawa, Kazusato Oko, Shoichiro Yamaguchi, Sosuke Kobayashi, Seiya Tokui, Kohei Hayashi, Daisuke Okanohara, Taiji Suzuki
Main category: cs.CL
TL;DR: Prepending metadata in pre-training improves downstream task performance but inconsistently affects next-token prediction. Effectiveness depends on context length for latent semantics inference.
Details
Motivation: To understand why metadata prepending improves some downstream tasks but not others, and its inconsistent impact on next-token prediction.Method: Investigated using artificial data and probabilistic context-free grammars to analyze model behavior with and without metadata.
Result: Metadata helps when context is long enough to infer latent semantics but harms performance when context lacks sufficient information.
Conclusion: The technique’s effectiveness is context-dependent, requiring sufficient information for accurate latent semantics inference.
Abstract: The ability to acquire latent semantics is one of the key properties that determines the performance of language models. One convenient approach to invoke this ability is to prepend metadata (e.g. URLs, domains, and styles) at the beginning of texts in the pre-training data, making it easier for the model to access latent semantics before observing the entire text. Previous studies have reported that this technique actually improves the performance of trained models in downstream tasks; however, this improvement has been observed only in specific downstream tasks, without consistent enhancement in average next-token prediction loss. To understand this phenomenon, we closely investigate how prepending metadata during pre-training affects model performance by examining its behavior using artificial data. Interestingly, we found that this approach produces both positive and negative effects on the downstream tasks. We demonstrate that the effectiveness of the approach depends on whether latent semantics can be inferred from the downstream task’s prompt. Specifically, through investigations using data generated by probabilistic context-free grammars, we show that training with metadata helps improve the model’s performance when the given context is long enough to infer the latent semantics. In contrast, the technique negatively impacts performance when the context lacks the necessary information to make an accurate posterior inference.
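For concreteness, metadata conditioning amounts to a simple preprocessing step over the pre-training corpus, sketched below; the tag format and field names are illustrative, not the paper’s.

```python
# Metadata conditioning sketch: prepend a tag so the model can condition on
# document-level metadata during pre-training (field names illustrative).
def prepend_metadata(doc):
    meta = f"<url:{doc['url']}> <domain:{doc['domain']}> "
    return meta + doc["text"]

doc = {"url": "example.com/recipe", "domain": "cooking",
       "text": "Whisk the eggs with the sugar until pale."}
print(prepend_metadata(doc))
```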
[117] Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLMs
Elisa Forcada Rodríguez, Olatz Perez-de-Viñaspre, Jon Ander Campos, Dietrich Klakow, Vagrant Gautam
Main category: cs.CL
TL;DR: The study examines multilingual intersecting biases (country and gender) in large language models, revealing persistent biases despite individual parity, and emphasizes the need for intersectional and multilingual fairness research.
Details
Motivation: To address limitations in fairness research, which often focuses on single biases (e.g., gender) and English, by exploring intersectional biases in multilingual contexts.Method: Constructed a benchmark with prompts in English, Spanish, and German, varying country and gender (25 countries, 4 pronoun sets), and evaluated 5 Llama-based models.
Result: Found significant gender and country biases, with intersectional biases persisting even when individual biases showed parity. Prompting language affected bias, and instruction-tuned models had the lowest bias levels.
Conclusion: Highlights the necessity for fairness research to adopt intersectional and multilingual approaches to mitigate biases in NLP systems.
Abstract: One of the goals of fairness research in NLP is to measure and mitigate stereotypical biases that are propagated by NLP systems. However, such work tends to focus on single axes of bias (most often gender) and the English language. Addressing these limitations, we contribute the first study of multilingual intersecting country and gender biases, with a focus on occupation recommendations generated by large language models. We construct a benchmark of prompts in English, Spanish and German, where we systematically vary country and gender, using 25 countries and four pronoun sets. Then, we evaluate a suite of 5 Llama-based models on this benchmark, finding that LLMs encode significant gender and country biases. Notably, we find that even when models show parity for gender or country individually, intersectional occupational biases based on both country and gender persist. We also show that the prompting language significantly affects bias, and instruction-tuned models consistently demonstrate the lowest and most stable levels of bias. Our findings highlight the need for fairness researchers to use intersectional and multilingual lenses in their work.
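To illustrate how such a benchmark grid can be constructed, the sketch below crosses countries with pronoun sets in a single English template; the countries, pronoun sets, and wording are illustrative subsets, not the paper’s actual prompts.

```python
# Benchmark-grid sketch: cross countries with pronoun sets in one template
# (values and wording illustrative, not the paper's prompts).
from itertools import product

countries = ["Colombia", "Canada", "Germany"]                # 25 in the paper
pronoun_sets = ["he/him", "she/her", "they/them", "xe/xem"]  # four sets
template = ("My friend from {country} uses {pronouns} pronouns. "
            "What occupation would you recommend for them?")

prompts = [template.format(country=c, pronouns=p)
           for c, p in product(countries, pronoun_sets)]
print(len(prompts), prompts[0])
```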
[118] Measuring Information Distortion in Hierarchical Ultra long Novel Reconstruction:The Optimal Expansion Ratio
Hanwen Shen, Ting Ying
Main category: cs.CL
TL;DR: The paper analyzes the impact of compression-expansion ratios on semantic distortion in ultra-long novel reconstruction using a two-stage framework and information-theoretic methods.
Details
Motivation: To address the lack of research on ultra-long novel reconstruction (>1M words) using the two-stage framework (outline -> section outline -> manuscript) and quantify semantic distortion under varying compression-expansion ratios.Method: Utilizes text compression techniques (e.g., LLMZip, LLM2Vec) for information-theoretic analysis to measure semantic distortion and examines the effect of outline length on information preservation.
Result: Optimal compression-expansion ratios significantly reduce semantic distortion in ultra-long novels compared to non-optimal ratios.
Conclusion: The study highlights the importance of optimizing compression-expansion ratios for better semantic preservation in ultra-long novel reconstruction.
Abstract: A two-stage novel generation framework (outline -> section outline -> manuscript) is widely used in long novel generation (e.g., DOME, Plan&Write, LongWriter), but such frameworks have received little study in ultra-long novel (>1M words) reconstruction. Building on recent text compression methods (LLMZip, LLM2Vec), we conduct an information-theoretic analysis to quantify semantic distortion under different compression-expansion ratios. We examine how outline length affects information preservation. Experiments on ultra-long novels show that the optimal compression-expansion ratio significantly reduces semantic distortion compared to non-optimal compression-expansion ratios.
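One simple way to operationalize semantic distortion at a given compression-expansion ratio is to embed the original and reconstructed text and take cosine distance. The sketch below uses a sentence-transformers model as a stand-in for the paper’s LLM2Vec-style embeddings; the actual information-theoretic measure may differ.

```python
# Semantic-distortion sketch: cosine distance between embeddings of the
# original and the reconstructed text (sentence-transformers as a stand-in).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_distortion(original, reconstructed):
    a, b = model.encode([original, reconstructed])
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(semantic_distortion("The knight rode out at dawn.",
                          "At sunrise, the knight set out on horseback."))
```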
[119] Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs
Hao Wang, Pinzhi Huang, Jihan Yang, Saining Xie, Daisuke Kawahara
Main category: cs.CL
TL;DR: The paper introduces KnowRecall and VisRecall benchmarks to evaluate cross-lingual consistency in multimodal large language models (MLLMs), revealing their struggles with multilingual and cultural knowledge.
Details
Motivation: The rapid evolution of MLLMs lacks consistent performance across languages and cultural knowledge integration, prompting the need for better evaluation tools.Method: Two benchmarks, KnowRecall (visual question answering) and VisRecall (visual memory consistency), are introduced to assess MLLMs in multiple languages.
Result: State-of-the-art MLLMs, including proprietary ones, fail to achieve cross-lingual consistency, highlighting performance gaps.
Conclusion: More robust approaches are needed to develop truly multilingual and culturally aware MLLMs.
Abstract: The rapid evolution of multimodal large language models (MLLMs) has significantly enhanced their real-world applications. However, achieving consistent performance across languages, especially when integrating cultural knowledge, remains a significant challenge. To better assess this issue, we introduce two new benchmarks: KnowRecall and VisRecall, which evaluate cross-lingual consistency in MLLMs. KnowRecall is a visual question answering benchmark designed to measure factual knowledge consistency in 15 languages, focusing on cultural and historical questions about global landmarks. VisRecall assesses visual memory consistency by asking models to describe landmark appearances in 9 languages without access to images. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, still struggle to achieve cross-lingual consistency. This underscores the need for more robust approaches that produce truly multilingual and culturally aware models.
[120] Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
Punya Syon Pandey, Samuel Simko, Kellin Pelrine, Zhijing Jin
Main category: cs.CL
TL;DR: The paper explores unexpected vulnerabilities in fine-tuned LLMs due to dataset characteristics, analyzing factors like linguistic features and toxicity, and proposes insights for adversarial defense strategies.
Details
Motivation: To investigate how fine-tuning LLMs on domain-specific data can introduce unintended vulnerabilities, focusing on dataset traits.Method: Identify correlation factors (linguistic features, semantic similarity, toxicity), evaluate adversarial robustness, and analyze persona shifts and interpretability.
Result: Findings reveal how dataset factors influence attack success rates and highlight causal relationships for adversarial defense.
Conclusion: Dataset design is critical for preserving model alignment and mitigating accidental vulnerabilities in fine-tuned LLMs.
Abstract: As large language models (LLMs) gain popularity, their vulnerability to adversarial attacks emerges as a primary concern. While fine-tuning models on domain-specific datasets is often employed to improve model performance, it can inadvertently introduce vulnerabilities within the underlying model. In this work, we investigate Accidental Vulnerability, unexpected vulnerabilities arising from characteristics of fine-tuning data. We begin by identifying potential correlation factors such as linguistic features, semantic similarity, and toxicity across multiple experimental datasets. We then evaluate the adversarial robustness of these fine-tuned models, analyzing persona shifts and interpretability traits to understand how dataset factors contribute to attack success rates. Lastly, we explore causal relationships that offer new insights into adversarial defense strategies, highlighting the crucial role of dataset design in preserving model alignment. Our code is available at https://github.com/psyonp/accidental_vulnerability.
[121] Unveil Multi-Picture Descriptions for Multilingual Mild Cognitive Impairment Detection via Contrastive Learning
Kristin Qi, Jiali Cheng, Youxiang Zhu, Hadi Amiri, Xiaohui Liang
Main category: cs.CL
TL;DR: The paper proposes a framework for detecting Mild Cognitive Impairment (MCI) in multilingual and multi-picture settings, improving performance over text-only baselines.
Details
Motivation: Prior work focused on English speakers and single pictures, leaving gaps for multilingual and multi-picture scenarios.Method: The framework includes supervised contrastive learning, image modality integration, and a Product of Experts (PoE) strategy.
Result: Achieved +7.1% UAR and +2.9% F1 score improvements over the text-only baseline.
Conclusion: The framework effectively addresses challenges in multilingual and multi-picture MCI detection.
Abstract: Detecting Mild Cognitive Impairment from picture descriptions is critical yet challenging, especially in multilingual and multiple picture settings. Prior work has primarily focused on English speakers describing a single picture (e.g., the ‘Cookie Theft’). The TAUKDIAL-2024 challenge expands this scope by introducing multilingual speakers and multiple pictures, which presents new challenges in analyzing picture-dependent content. To address these challenges, we propose a framework with three components: (1) enhancing discriminative representation learning via supervised contrastive learning, (2) involving image modality rather than relying solely on speech and text modalities, and (3) applying a Product of Experts (PoE) strategy to mitigate spurious correlations and overfitting. Our framework improves MCI detection performance, achieving a +7.1% increase in Unweighted Average Recall (UAR) (from 68.1% to 75.2%) and a +2.9% increase in F1 score (from 80.6% to 83.5%) compared to the text unimodal baseline. Notably, the contrastive learning component yields greater gains for the text modality compared to speech. These results highlight our framework’s effectiveness in multilingual and multi-picture MCI detection.
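A minimal sketch of the Product of Experts (PoE) fusion step: sum per-modality log-probabilities (equivalently, multiply probabilities) and renormalize. Modality names and shapes are assumptions for illustration; the paper’s PoE strategy for mitigating spurious correlations may add further machinery.

```python
# Product of Experts sketch: sum per-modality log-probabilities, renormalize.
import torch

def product_of_experts(*modality_logits):
    log_probs = sum(torch.log_softmax(l, dim=-1) for l in modality_logits)
    return torch.softmax(log_probs, dim=-1)  # joint prediction over classes

# Toy logits for text, speech, and image experts on a 2-class MCI task.
probs = product_of_experts(torch.randn(4, 2), torch.randn(4, 2), torch.randn(4, 2))
```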
[122] Cog-TiPRO: Iterative Prompt Refinement with LLMs to Detect Cognitive Decline via Longitudinal Voice Assistant Commands
Kristin Qi, Youxiang Zhu, Caroline Summerour, John A. Batsis, Xiaohui Liang
Main category: cs.CL
TL;DR: The paper proposes Cog-TiPRO, a framework using voice assistant systems (VAS) to detect cognitive decline by analyzing speech patterns. It combines LLM-driven linguistic feature extraction, HuBERT-based acoustic analysis, and transformer-based modeling, achieving 73.80% accuracy in detecting mild cognitive impairment (MCI).
Details
Motivation: Early detection of cognitive decline is vital for intervention, but traditional methods are labor-intensive. The study explores VAS as a non-invasive, scalable alternative for monitoring speech patterns.Method: Cog-TiPRO integrates LLM-driven prompt refinement for linguistic features, HuBERT for acoustic features, and iTransformer for temporal modeling. Data from 35 older adults over 18 months, including daily VAS interactions, was analyzed.
Result: The framework achieved 73.80% accuracy and 72.67% F1-score in detecting MCI, outperforming baselines by 27.13%. Linguistic features unique to cognitive decline were identified.
Conclusion: Cog-TiPRO demonstrates the potential of VAS for early cognitive decline detection, offering a scalable, non-invasive solution with promising accuracy.
Abstract: Early detection of cognitive decline is crucial for enabling interventions that can slow neurodegenerative disease progression. Traditional diagnostic approaches rely on labor-intensive clinical assessments, which are impractical for frequent monitoring. Our pilot study investigates voice assistant systems (VAS) as non-invasive tools for detecting cognitive decline through longitudinal analysis of speech patterns in voice commands. Over an 18-month period, we collected voice commands from 35 older adults, with 15 participants providing daily at-home VAS interactions. To address the challenges of analyzing these short, unstructured and noisy commands, we propose Cog-TiPRO, a framework that combines (1) LLM-driven iterative prompt refinement for linguistic feature extraction, (2) HuBERT-based acoustic feature extraction, and (3) transformer-based temporal modeling. Using iTransformer, our approach achieves 73.80% accuracy and 72.67% F1-score in detecting MCI, outperforming its baseline by 27.13%. Through our LLM approach, we identify linguistic features that uniquely characterize everyday command usage patterns in individuals experiencing cognitive decline.
[123] Scaling Physical Reasoning with the PHYSICS Dataset
Shenghe Zheng, Qianjia Cheng, Junchi Yao, Mengsong Wu, Haonan He, Ning Ding, Yu Cheng, Shuyue Hu, Lei Bai, Dongzhan Zhou, Ganqu Cui, Peng Ye
Main category: cs.CL
TL;DR: The paper introduces PHYSICS, a dataset of 16,568 physics problems to improve LLMs’ reasoning in physics, addressing gaps in current research and evaluation methods.
Details
Motivation: Physics is reasoning-intensive but understudied in LLMs. The paper aims to bridge this gap by providing a high-quality dataset and tailored evaluation framework.Method: PHYSICS is curated from 100+ textbooks, covering five physics domains and difficulty levels. It includes training/test splits and reasoning paths for training. A Rule+Model evaluation framework is introduced to address biases.
Result: Evaluations show current LLMs struggle with physics tasks, highlighting the need for improved models and methods.
Conclusion: The dataset and evaluation framework aim to advance LLMs’ capabilities in physics, addressing current limitations.
Abstract: Large Language Models (LLMs) have achieved remarkable progress on advanced reasoning tasks such as mathematics and coding competitions. Meanwhile, physics, despite being both reasoning-intensive and essential to real-world understanding, has received limited academic and industrial attention. This paper introduces PHYSICS, a dataset containing 16,568 high-quality physics problems spanning subjects and difficulty levels, to address this gap. Specifically, PHYSICS is curated from exercises in over 100 textbooks through a carefully designed pipeline for quality control. It covers five major physics domains: Mechanics, Electromagnetism, Thermodynamics, Optics, and Modern Physics. It also spans a wide range of difficulty levels, from high school to graduate-level physics courses. To utilize the data for improving and evaluating the model’s physical reasoning capabilities, we split the dataset into training and test sets, and provide reasoning paths generated by powerful reasoning models for the training data to facilitate model training. In addition, for the evaluation part, we find that existing evaluation frameworks exhibit biases in aspects such as units, simplification, and precision in the physics domain. To balance efficiency and accuracy, we introduce a Rule+Model evaluation framework tailored to physics problems. Our evaluations on current state-of-the-art open-source and proprietary models highlight the limitations of current models in handling physics-related tasks. We hope that our dataset and evaluation methodology will jointly advance the development of LLMs in the field of physics.
[124] Minimal Pair-Based Evaluation of Code-Switching
Igor Sterner, Simone Teufel
Main category: cs.CL
TL;DR: The paper proposes a method to evaluate how LLMs use code-switching (CS) by comparing minimal pairs of CS sentences, showing larger models align better with bilingual preferences.
Details
Motivation: Existing methods for evaluating CS in LLMs lack language coverage, diversity, or scalability.Method: Uses minimal pairs of CS sentences (natural vs. manipulated) across 11 language pairs, tested on humans and LLMs.
Result: Bilinguals prefer natural CS sentences; larger LLMs align more with this preference, especially for closed-class word manipulations.
Conclusion: Larger LLMs better mimic bilingual CS usage, supporting theoretical claims about their capabilities.
Abstract: There is a lack of an evaluation methodology that estimates the extent to which large language models (LLMs) use code-switching (CS) in the same way as bilinguals. Existing methods do not have wide language coverage, fail to account for the diverse range of CS phenomena, or do not scale. We propose an intervention based on minimal pairs of CS. Each minimal pair contains one naturally occurring CS sentence and one minimally manipulated variant. We collect up to 1,000 such pairs each for 11 language pairs. Our human experiments show that, for every language pair, bilinguals consistently prefer the naturally occurring CS sentence. Meanwhile our experiments with current LLMs show that the larger the model, the more consistently it assigns higher probability to the naturally occurring CS sentence than to the variant. In accordance with theoretical claims, the largest probability differences arise in those pairs where the manipulated material consisted of closed-class words.
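Scoring a minimal pair with a causal LM comes down to comparing total sentence log-probabilities, as sketched below. GPT-2 and the example sentences are illustrative stand-ins for the models and 11 language pairs in the study.

```python
# Minimal-pair scoring sketch: the LM "prefers" whichever sentence gets the
# higher total log-probability (GPT-2 and sentences are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean per-token NLL; convert to a total log-probability.
    return -out.loss.item() * (ids.shape[1] - 1)

natural = "I went to the mercado to buy apples."
variant = "I went to the market to comprar apples."
print(sentence_logprob(natural) > sentence_logprob(variant))
```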
[125] Code-Switching and Syntax: A Large-Scale Experiment
Igor Sterner, Simone Teufel
Main category: cs.CL
TL;DR: Syntax alone suffices for predicting code-switching patterns, matching human performance and generalizing across languages.
Details
Motivation: To test if syntax alone can explain code-switching patterns across languages, addressing gaps in large-scale, cross-phenomena experiments.Method: Designed an experiment where an automatic system predicts code-switching points using only syntactic information, tested on minimal pairs of CS.
Result: Syntax alone enables the system to match bilingual humans in distinguishing CS sentences and generalizes to unseen language pairs.
Conclusion: Syntax is a sufficient predictor of code-switching patterns, supporting theoretical claims and demonstrating cross-language applicability.
Abstract: The theoretical code-switching (CS) literature provides numerous pointwise investigations that aim to explain patterns in CS, i.e. why bilinguals switch language in certain positions in a sentence more often than in others. A resulting consensus is that CS can be explained by the syntax of the contributing languages. There is however no large-scale, multi-language, cross-phenomena experiment that tests this claim. When designing such an experiment, we need to make sure that the system that is predicting where bilinguals tend to switch has access only to syntactic information. We provide such an experiment here. Results show that syntax alone is sufficient for an automatic system to distinguish between sentences in minimal pairs of CS, to the same degree as bilingual humans. Furthermore, the learnt syntactic patterns generalise well to unseen language pairs.
[126] Pruning for Performance: Efficient Idiom and Metaphor Classification in Low-Resource Konkani Using mBERT
Timothy Do, Pranav Saran, Harshita Poojary, Pranav Prabhu, Sean O’Brien, Vasu Sharma, Kevin Zhu
Main category: cs.CL
TL;DR: A hybrid model combining mBERT, bidirectional LSTM, and a linear classifier addresses figurative language challenges in low-resource languages like Konkani, achieving 78% accuracy in metaphor classification and 83% in idiom classification using attention head pruning.
Details
Motivation: To tackle the difficulties figurative language poses for NLP systems, especially in low-resource languages such as Konkani.Method: A hybrid model integrating mBERT, bidirectional LSTM, and a linear classifier, fine-tuned on a new annotated dataset, with gradient-based attention head pruning for efficiency.
Result: 78% accuracy in metaphor classification and 83% in idiom classification.
Conclusion: Attention head pruning is effective for building efficient NLP tools in underrepresented languages.
Abstract: In this paper, we address the persistent challenges that figurative language expressions pose for natural language processing (NLP) systems, particularly in low-resource languages such as Konkani. We present a hybrid model that integrates a pre-trained Multilingual BERT (mBERT) with a bidirectional LSTM and a linear classifier. This architecture is fine-tuned on a newly introduced annotated dataset for metaphor classification, developed as part of this work. To improve the model’s efficiency, we implement a gradient-based attention head pruning strategy. For metaphor classification, the pruned model achieves an accuracy of 78%. We also applied our pruning approach to expand on an existing idiom classification task, achieving 83% accuracy. These results demonstrate the effectiveness of attention head pruning for building efficient NLP tools in underrepresented languages.
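A minimal sketch of gradient-based attention-head importance scoring on mBERT, in the spirit of Michel et al. (2019); the paper’s exact pruning criterion may differ. Heads with the lowest accumulated scores are the pruning candidates.

```python
# Gradient-based head-importance sketch on mBERT (Michel et al., 2019 style).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

batch = tokenizer(["He spilled the beans."], return_tensors="pt")
out = model(**batch, labels=torch.tensor([1]), output_attentions=True)

# Head importance ~ |attention * d(loss)/d(attention)|, summed over positions.
grads = torch.autograd.grad(out.loss, out.attentions)
scores = torch.stack([(a * g).abs().sum(dim=(0, 2, 3))
                      for a, g in zip(out.attentions, grads)])  # (layers, heads)

# Prune the two least important heads in every layer.
to_prune = {l: scores[l].argsort()[:2].tolist() for l in range(scores.shape[0])}
model.prune_heads(to_prune)
```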
[127] Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning
Xiangning Yu, Zhuohan Wang, Linyi Yang, Haoxuan Li, Anjie Liu, Xiao Xue, Jun Wang, Mengyue Yang
Main category: cs.CL
TL;DR: The paper proposes a causal framework to address challenges in Chain-of-Thought (CoT) prompting, focusing on sufficiency and necessity of reasoning steps, improving efficiency and reducing token usage without losing accuracy.
Details
Motivation: CoT prompting lacks mechanisms to ensure sufficiency and necessity of reasoning steps, limiting its effectiveness in complex reasoning tasks.Method: A causal framework using Probability of Sufficiency and Necessity to analyze and optimize CoT reasoning steps.
Result: Improved reasoning efficiency and reduced token usage on mathematical and commonsense benchmarks, maintaining accuracy.
Conclusion: The framework enhances LLM reasoning performance and cost-effectiveness, offering a scalable solution.
Abstract: Chain-of-Thought (CoT) prompting plays an indispensable role in endowing large language models (LLMs) with complex reasoning capabilities. However, CoT currently faces two fundamental challenges: (1) Sufficiency, which ensures that the generated intermediate inference steps comprehensively cover and substantiate the final conclusion; and (2) Necessity, which identifies the inference steps that are truly indispensable for the soundness of the resulting answer. We propose a causal framework that characterizes CoT reasoning through the dual lenses of sufficiency and necessity. Incorporating causal Probability of Sufficiency and Necessity allows us not only to determine which steps are logically sufficient or necessary to the prediction outcome, but also to quantify their actual influence on the final reasoning outcome under different intervention scenarios, thereby enabling the automated addition of missing steps and the pruning of redundant ones. Extensive experimental results on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work provides a promising direction for improving LLM reasoning performance and cost-effectiveness.
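For reference, the standard counterfactual definitions of the Probability of Necessity (PN) and the Probability of Sufficiency (PS) that such a framework builds on, following Pearl; how the paper adapts these to score individual CoT steps is specific to this work. Here X = x denotes that a reasoning step is present and Y = y that the final answer is correct:

```latex
\mathrm{PN} = P\big(Y_{x'} = y' \,\big|\, X = x,\; Y = y\big), \qquad
\mathrm{PS} = P\big(Y_{x} = y \,\big|\, X = x',\; Y = y'\big)
```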
[128] Intersectional Bias in Japanese Large Language Models from a Contextualized Perspective
Hitomi Yanaka, Xinqi He, Jie Lu, Namgi Han, Sunjin Oh, Ryoma Kumon, Yuma Matsuoka, Katsuhiko Watabe, Yuko Itatsu
Main category: cs.CL
TL;DR: The paper introduces inter-JBBQ, a Japanese benchmark to evaluate intersectional bias in LLMs, revealing context-dependent bias variations in GPT-4o and Swallow.
Details
Motivation: Existing studies on LLM bias often overlook intersectionality, a key aspect of social bias, prompting the need for a dedicated benchmark.Method: Constructed inter-JBBQ, a benchmark for evaluating intersectional bias in LLMs, and tested it on GPT-4o and Swallow.
Result: Biased outputs in LLMs vary contextually even with identical social attribute combinations.
Conclusion: Intersectional bias in LLMs is context-sensitive, highlighting the need for nuanced evaluation tools like inter-JBBQ.
Abstract: An increasing number of studies have examined the social bias of rapidly developed large language models (LLMs). Although most of these studies have focused on bias occurring in a single social attribute, research in social science has shown that social bias often occurs in the form of intersectionality – a constitutive, contextualized perspective on bias arising from combinations of social attributes. In this study, we construct the Japanese benchmark inter-JBBQ, designed to evaluate intersectional bias in LLMs in a question-answering setting. Using inter-JBBQ to analyze GPT-4o and Swallow, we find that biased outputs vary with context even for identical combinations of social attributes.
[129] A Structured Bangla Dataset of Disease-Symptom Associations to Improve Diagnostic Accuracy
Abdullah Al Shafi, Rowzatul Zannat, Abdul Muntakim, Mahmudul Hasan
Main category: cs.CL
TL;DR: A structured disease-symptom dataset is created from verified medical sources to improve diagnostics and AI applications, addressing a gap for the Bangla language.
Details
Motivation: The need for structured disease-symptom datasets to enhance diagnostic accuracy, early detection, and AI-driven health tools, especially for underrepresented languages like Bangla.Method: Compilation of disease-symptom relationships from peer-reviewed articles, clinical studies, and health databases, structured in a binary tabular format.
Result: A verified, structured dataset useful for machine learning, clinical support, and epidemiological studies, with a focus on Bangla language support.
Conclusion: The dataset fills a gap for Bangla and multilingual tools, with future work suggested for region-specific diseases and symptom refinement.
Abstract: Disease-symptom datasets are significant and in demand for medical research, disease diagnosis, clinical decision-making, and AI-driven health management applications. These datasets help identify symptom patterns associated with specific diseases, thus improving diagnostic accuracy and enabling early detection. The dataset presented in this study systematically compiles disease-symptom relationships from various online sources, medical literature, and publicly available health databases. The data was gathered by analyzing peer-reviewed medical articles, clinical case studies, and disease-symptom association reports. Only verified medical sources were included in the dataset, while non-peer-reviewed and anecdotal sources were excluded. The dataset is structured in a tabular format, where the first column represents diseases, and the remaining columns represent symptoms. Each symptom cell contains a binary value, indicating whether a symptom is associated with a disease. This structured representation makes the dataset useful for a wide range of applications, including machine learning-based disease prediction, clinical decision support systems, and epidemiological studies. Although there have been some advancements in the field of disease-symptom datasets, there is a significant gap in structured datasets for the Bangla language. This dataset aims to bridge that gap by facilitating the development of multilingual medical informatics tools and improving disease prediction models for underrepresented linguistic communities. Further developments should include region-specific diseases and further fine-tuning of symptom associations for better diagnostic performance.
[130] Reinforcement learning fine-tuning of language model for instruction following and math reasoning
Yifu Han, Geo Zhang
Main category: cs.CL
TL;DR: RL fine-tuning techniques (SFT, DPO, RLOO) are compared on Qwen2.5-0.5B for instruction following and math reasoning. RLOO with DeBERTa excels in alignment, while DPO is consistent. Synthetic data and best-of-N sampling boost math accuracy.
Details
Motivation: To explore effective RL fine-tuning methods for compact language models, focusing on instruction following and mathematical reasoning.
Method: Compared SFT, DPO, and RLOO with reward models. Used synthetic data augmentation and best-of-N sampling for math tasks.
Result: RLOO with DeBERTa reward modeling performed best for alignment. DPO was consistent. Math accuracy improved with synthetic data and best-of-N sampling.
Conclusion: Combining fine-tuning with inference-time tools (like verifiers) is effective. The study provides practical strategies for lightweight, task-aligned models.
Abstract: This study investigates the effectiveness of reinforcement learning (RL) fine-tuning techniques on a compact language model (Qwen2.5-0.5B Base) for two challenging tasks: instruction following and mathematical reasoning. We compare supervised fine-tuning (SFT), Direct Preference Optimization (DPO) using preference-labeled data, and Reinforce Leave-One-Out (RLOO) with reward models. Our experiments show that RLOO with DeBERTa reward modeling achieves the best alignment, while DPO provides strong and consistent results. For math reasoning tasks, synthetic data augmentation and best-of-N sampling with an external verifier significantly improve accuracy, showing the potential of combining fine-tuning with inference-time tools. This study highlights key trade-offs and practical strategies for training lightweight, task-aligned small-scale language models.
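Best-of-N sampling with an external verifier, as used here for the math tasks, reduces to a few lines. A minimal sketch, where `generate` and `verify` are stand-ins for the fine-tuned sampler and an answer checker rather than the authors' code:

```python
import random

def best_of_n(prompt, generate, verify, n=8):
    """Draw n candidate solutions and return the first one the external
    verifier accepts, falling back to a random candidate otherwise."""
    candidates = [generate(prompt) for _ in range(n)]
    accepted = [c for c in candidates if verify(prompt, c)]
    return accepted[0] if accepted else random.choice(candidates)
```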
[131] From Answers to Rationales: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought
Wentao Tan, Qiong Cao, Yibing Zhan, Chao Xue, Changxing Ding
Main category: cs.CL
TL;DR: The paper introduces SMART, a framework for enhancing multimodal reasoning in MLLMs by using answer-oriented chain-of-thought prompts to generate both positive and negative rationales automatically.
Details
Motivation: Current methods for multimodal reasoning focus on positive rationales and neglect negative reasoning, limiting model generalization and robustness.
Method: SMART uses answer-oriented chain-of-thought (AoT) prompts to construct high-quality data, leveraging both correct and incorrect answers to extract key visual information.
Result: Models trained with AoT-generated data outperform those using manual annotations, showing superior reasoning capabilities.
Conclusion: SMART significantly improves MLLMs across various architectures and datasets, establishing an iterative generation-optimization method for continuous enhancement.
Abstract: Achieving human-like reasoning capabilities in Multimodal Large Language Models (MLLMs) has long been a goal. Current methods primarily focus on synthesizing positive rationales, typically relying on manual annotations or complex systems. Moreover, they often overlook negative reasoning, which limits the model’s generalization ability and robustness in multimodal inference. To address this gap, we propose a novel framework: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought (SMART). SMART employs an answer-oriented chain-of-thought (AoT) prompt to automatically construct high-quality data. Drawing inspiration from human proof-based strategies, AoT leverages both correct and incorrect answers to extract key visual information that links questions and answers. When provided with correct answers, the model produces strong positive rationales. Conversely, when correct answers are replaced with incorrect alternatives, the model generates an erroneous yet compelling reasoning path, serving as a form of discriminative negative rationale. Models trained with AoT-generated data outperform those trained on manually annotated datasets, demonstrating superior reasoning capabilities. Consequently, SMART establishes an iterative generation-optimization method that continually enhances the model’s reasoning skills. Experiments indicate that the SMART framework significantly improves various MLLMs, regardless of model architecture, parameter size, or pre-training dataset. The code is available at https://github.com/WentaoTan/SMART.
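The AoT recipe of conditioning on a known answer to elicit a rationale can be pictured as follows; `mllm` is a placeholder for any multimodal chat model, and the prompt wording is an assumption, not the paper's template:

```python
def aot_rationales(question, image, correct, wrong, mllm):
    """Elicit a positive rationale by conditioning on the correct answer,
    and a discriminative negative rationale by conditioning on a wrong one."""
    template = "Q: {q}\nThe answer is {a}. Explain, step by step, why."
    positive = mllm(image, template.format(q=question, a=correct))
    negative = mllm(image, template.format(q=question, a=wrong))
    return positive, negative
```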
[132] Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations
A. Bochkov
Main category: cs.CL
TL;DR: The paper challenges the idea that trainable input embeddings are foundational for semantic representation in LLMs. By using frozen, non-semantic visual embeddings, the models outperform conventional ones, suggesting semantics emerge from the Transformer’s architecture, not embeddings.
Details
Motivation: To understand if semantic representation in LLMs inherently relies on trainable input embeddings or if it emerges from the Transformer's architecture.
Method: Constructed Transformer models with frozen, precomputed visual embeddings (derived from Unicode glyphs) and tested performance against models with trainable embeddings.
Result: Models with frozen visual embeddings outperformed conventional models on the MMLU reasoning benchmark, indicating semantics are not tied to embeddings.
Conclusion: High-level semantics are emergent properties of the Transformer’s architecture, not input embeddings, which should be viewed as structural primitives.
Abstract: Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational “meaning vectors.” This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to “representational interference” in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer’s compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.
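One way to picture the frozen visual embeddings: render each vocabulary item's glyph to a small bitmap and use the flattened pixels as its fixed vector. A toy per-character sketch (the paper's tokenizer and glyph rendering will differ):

```python
import torch
from PIL import Image, ImageDraw

def glyph_vector(ch, size=16):
    """Render one Unicode character to a size x size bitmap and flatten it
    into a fixed, non-semantic embedding vector."""
    img = Image.new("L", (size, size), 0)
    ImageDraw.Draw(img).text((0, 0), ch, fill=255)  # default bitmap font
    return torch.tensor(list(img.getdata()), dtype=torch.float32) / 255.0

vocab = [chr(cp) for cp in range(32, 127)]  # toy printable-ASCII vocabulary
weights = torch.stack([glyph_vector(c) for c in vocab])
embedding = torch.nn.Embedding.from_pretrained(weights, freeze=True)
# The frozen `embedding` feeds the Transformer and is never updated.
```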
[133] Checklist Engineering Empowers Multilingual LLM Judges
Mohammad Ghiasvand Mohammadkhani, Hamid Beigy
Main category: cs.CL
TL;DR: The paper introduces CE-Judge, a training-free framework for multilingual text evaluation using LLMs, outperforming baselines and matching GPT-4o performance.
Details
Motivation: Addressing the lack of exploration in multilingual LLM-as-a-Judge paradigms and the inefficiencies of proprietary models or extensive training.
Method: Proposes Checklist Engineering (CE-Judge), leveraging checklist intuition with open-source LLMs for multilingual evaluation.
Result: CE-Judge surpasses baselines and performs comparably to GPT-4o across multiple languages and datasets.
Conclusion: CE-Judge offers a cost-effective, efficient alternative for multilingual text evaluation without training.
Abstract: Automated text evaluation has long been a central issue in Natural Language Processing (NLP). Recently, the field has shifted toward using Large Language Models (LLMs) as evaluators – a trend known as the LLM-as-a-Judge paradigm. While promising and easily adaptable across tasks, this approach has seen limited exploration in multilingual contexts. Existing multilingual studies often rely on proprietary models or require extensive training data for fine-tuning, raising concerns about cost, time, and efficiency. In this paper, we propose Checklist Engineering based LLM-as-a-Judge (CE-Judge), a training-free framework that uses checklist intuition for multilingual evaluation with an open-source model. Experiments across multiple languages and three benchmark datasets, under both pointwise and pairwise settings, show that our method generally surpasses the baselines and performs on par with the GPT-4o model.
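A minimal reading of checklist-based judging: ask the open-source judge one yes/no question per checklist item and average the verdicts. Prompt wording and aggregation below are assumptions, not the paper's exact design:

```python
def checklist_judge(question, answer, checklist, llm):
    """Score `answer` as the fraction of checklist items the judge model
    says are satisfied (pointwise setting)."""
    votes = []
    for item in checklist:
        reply = llm(f"Question: {question}\nAnswer: {answer}\n"
                    f"Does the answer satisfy this criterion: {item}? "
                    "Reply yes or no.")
        votes.append(reply.strip().lower().startswith("yes"))
    return sum(votes) / len(votes)
```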
[134] Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation
Anirban Saha Anik, Xiaoying Song, Elliott Wang, Bryan Wang, Bengisu Yarimbas, Lingzi Hong
Main category: cs.CL
TL;DR: A Multi-agent Retrieval-Augmented Framework improves counterspeech generation against health misinformation by integrating multiple LLMs for optimized knowledge retrieval, evidence enhancement, and response refinement, outperforming baselines in quality and accuracy.
Details
Motivation: Current methods for generating counterspeech against misinformation rely on limited evidence and lack control over outputs, necessitating a more robust and controlled approach.
Method: Proposes a Multi-agent Retrieval-Augmented Framework using multiple LLMs to optimize knowledge retrieval, evidence enhancement (static and dynamic), and response refinement.
Result: Outperforms baselines in politeness, relevance, informativeness, and factual accuracy. Ablation and cross evaluations confirm component necessity and generalization across topics. Human evaluations show refinement enhances quality.
Conclusion: The framework effectively generates high-quality counterspeech, validated by performance metrics and human preference, demonstrating its robustness and generalizability.
Abstract: Large language models (LLMs) incorporated with Retrieval-Augmented Generation (RAG) have demonstrated powerful capabilities in generating counterspeech against misinformation. However, current studies rely on limited evidence and offer less control over final outputs. To address these challenges, we propose a Multi-agent Retrieval-Augmented Framework to generate counterspeech against health misinformation, incorporating multiple LLMs to optimize knowledge retrieval, evidence enhancement, and response refinement. Our approach integrates both static and dynamic evidence, ensuring that the generated counterspeech is relevant, well-grounded, and up-to-date. Our method outperforms baseline approaches in politeness, relevance, informativeness, and factual accuracy, demonstrating its effectiveness in generating high-quality counterspeech. To further validate our approach, we conduct ablation studies to verify the necessity of each component in our framework. Furthermore, cross evaluations show that our system generalizes well across diverse health misinformation topics and datasets. Human evaluations reveal that refinement significantly enhances counterspeech quality and yields outputs preferred by human judges.
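At a high level the framework chains specialised agents. A minimal sketch, where `retrieve`, `draft_agent`, and `refine_agent` are placeholders for the paper's LLM agents rather than its actual interfaces:

```python
def counterspeech_pipeline(claim, retrieve, draft_agent, refine_agent):
    """Retrieve static + dynamic evidence, draft a grounded rebuttal,
    then refine it for politeness and factual accuracy."""
    evidence = retrieve(claim)
    draft = draft_agent(claim=claim, evidence=evidence)
    return refine_agent(draft=draft, evidence=evidence)
```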
[135] Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training
Leiyu Pan, Bojian Xiong, Lei Yang, Renren Jin, Shaowei Zhang, Yue Chen, Ling Shi, Jiang Zhou, Junru Wu, Zhen Wang, Jianxiang Peng, Juesi Xiao, Tianyu Dong, Zhuowen Han, Zhuo Chen, Yuqi Ren, Deyi Xiong
Main category: cs.CL
TL;DR: The paper addresses the underrepresentation of Tibetan in large language models by creating the largest Tibetan pre-training corpus and enhancing a multilingual model’s Tibetan capabilities.
Details
Motivation: Tibetan is underrepresented in existing models due to scarce high-quality training data.
Method: Curated the largest Tibetan corpus, applied a tailored cleaning pipeline, and pre/post-trained a multilingual model.
Result: The model outperforms similar-scale open-source and Tibetan-tailored models across tasks.
Conclusion: The approach effectively improves Tibetan language capabilities in large language models.
Abstract: Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan. With the curated data, we continue pre/post-training a multilingual base model to enhance its generative capabilities in Tibetan. To evaluate the Tibetan capabilities of the model, we create new high-quality Tibetan benchmarks, and complement them with existing public benchmarks. Experimental results demonstrate that our model consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.
[136] A Survey of Deep Learning for Geometry Problem Solving
Jianzhe Ma, Wenxuan Wang, Qin Jin
Main category: cs.CL
TL;DR: A survey on deep learning applications in geometry problem solving, covering tasks, methods, evaluation metrics, challenges, and future directions.
Details
Motivation: Geometry problem solving is crucial in education and AI assessment, and recent advances in deep learning, especially multimodal models, have spurred research in this area.
Method: The paper reviews and summarizes deep learning methods for geometry problem solving, including tasks, techniques, evaluation metrics, and challenges.
Result: Provides a comprehensive reference for deep learning in geometry problem solving, with a GitHub repository for ongoing updates.
Conclusion: The survey aims to advance the field by offering insights into current challenges and future research directions.
Abstract: Geometry problem solving is a key area of mathematical reasoning, which is widely involved in many important fields such as education, mathematical ability assessment of artificial intelligence, and multimodal ability assessment. In recent years, the rapid development of deep learning technology, especially the rise of multimodal large language models, has triggered a widespread research boom. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our goal is to provide a comprehensive and practical reference of deep learning for geometry problem solving to promote further developments in this field. We create a continuously updated list of papers on GitHub: https://github.com/majianz/dl4gps.
[137] Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation
Hadi Mohammadi, Tina Shahedi, Pablo Mosteiro, Massimo Poesio, Ayoub Bagheri, Anastasia Giachanou
Main category: cs.CL
TL;DR: The study explores how annotator demographics affect labeling in sexism detection, finding content dominates variance. Generative AI with demographic personas often underperforms, and XAI shows models focus on content, not demographics.
Details
Motivation: To understand variability in annotations for fair NLP systems, especially in sexism detection, and assess the role of annotator demographics and Generative AI's reliability.
Method: Used a Generalized Linear Mixed Model to quantify demographic influence and evaluated Generative AI models with demographic personas, applying XAI techniques.
Result: Demographic factors account for 8% of variance, with content being dominant. Generative AI with personas often performs worse, and models rely on content-specific tokens.
Conclusion: Content-driven explanations and robust annotation protocols are more reliable for fairness than demographic persona simulation.
Abstract: Understanding the sources of variability in annotations is crucial for developing fair NLP systems, especially for tasks like sexism detection where demographic bias is a concern. This study investigates the extent to which annotator demographic features influence labeling decisions compared to text content. Using a Generalized Linear Mixed Model, we quantify this influence, finding that while statistically present, demographic factors account for a minor fraction (~8%) of the observed variance, with tweet content being the dominant factor. We then assess the reliability of Generative AI (GenAI) models as annotators, specifically evaluating if guiding them with demographic personas improves alignment with human judgments. Our results indicate that simplistic persona prompting often fails to enhance, and sometimes degrades, performance compared to baseline models. Furthermore, explainable AI (XAI) techniques reveal that model predictions rely heavily on content-specific tokens related to sexism, rather than correlates of demographic characteristics. We argue that focusing on content-driven explanations and robust annotation protocols offers a more reliable path towards fairness than persona simulation.
[138] Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation
Xinping Zhao, Shouzheng Huang, Yan Zhong, Xinshuo Hu, Meishan Zhang, Baotian Hu, Min Zhang
Main category: cs.CL
TL;DR: LEAR improves RAG by learning to extract rational evidence through explicit reasoning and conscious extraction, enhancing LLM accuracy.
Details
Motivation: Retrieval noises degrade LLM generation quality, and existing methods lack explicit reasoning, risking key clue omission and poor generalization.
Method: LEAR combines evidence reasoning and extraction into a unified response, uses knowledge token masks for disentanglement, and employs verifiable reward functions for training.
Result: LEAR outperforms on three benchmarks, providing high-quality evidence and boosting downstream task accuracy.
Conclusion: LEAR enhances RAG systems by delivering compact, high-quality evidence and improving LLM performance.
Abstract: Retrieval-Augmented Generation (RAG) effectively improves the accuracy of Large Language Models (LLMs). However, retrieval noises significantly impact the quality of LLMs’ generation, necessitating the development of denoising mechanisms. Previous methods extract evidence straightforwardly without explicit thinking, which risks filtering out key clues and struggles with generalization. To this end, we propose LEAR, which learns to extract rational evidence by (1) explicitly reasoning to identify potential cues within retrieval contents first, and then (2) consciously extracting to avoid omitting any key cues helpful for answering questions. Specifically, we frame evidence reasoning and evidence extraction into one unified response for end-to-end training; apply knowledge token masks for disentanglement to derive reasoning-based and extraction-based answers; and devise three types of verifiable reward functions, including answer, length, and format, to update the model via the policy optimization algorithm. Extensive experiments on three benchmark datasets show the effectiveness of LEAR, providing compact and high-quality evidence, improving the accuracy of downstream tasks, and promoting effective application in online RAG systems.
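The three verifiable rewards named here (answer, length, format) can be pictured as simple checks over a structured response. A sketch under assumed tags, parsing, and weights, not the authors' implementation:

```python
import re

def lear_style_reward(response, gold_answer, max_evidence_words=256):
    """Composite of answer-, length-, and format-based rewards over a
    response assumed to carry <evidence>...</evidence><answer>...</answer>."""
    m = re.search(r"<evidence>(.*?)</evidence>.*?<answer>(.*?)</answer>",
                  response, re.S)
    if m is None:
        return 0.0                        # format check fails outright
    evidence, answer = m.group(1), m.group(2)
    answer_r = float(answer.strip() == gold_answer.strip())
    length_r = float(len(evidence.split()) <= max_evidence_words)
    return 0.6 * answer_r + 0.2 * length_r + 0.2  # 0.2 for valid format
```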
[139] Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning
Yanjun Zheng, Xiyang Du, Longfei Liao, Xiaoke Zhao, Zhaowen Zhou, Jingze Song, Bo Zhang, Jiawei Liu, Xiang Qi, Zhe Li, Zhiqiang Zhang, Wei Wang, Peng Zhang
Main category: cs.CL
TL;DR: Agentar-Fin-R1 series enhances financial LLMs with improved reasoning, reliability, and domain specialization, validated by benchmarks like Fineva and FinEval.
Details
Motivation: Current LLMs lack sophisticated reasoning, trustworthiness, and domain adaptation for financial applications.
Method: Optimization integrates a financial task label system and trustworthiness framework, using label-guided optimization and a two-stage training pipeline.
Result: Agentar-Fin-R1 excels in financial tasks and general reasoning, validated by benchmarks.
Conclusion: The model is a trustworthy solution for high-stakes financial applications, with the Finova benchmark for further evaluation.
Abstract: Large Language Models (LLMs) exhibit considerable promise in financial applications; however, prevailing models frequently demonstrate limitations when confronted with scenarios that necessitate sophisticated reasoning capabilities, stringent trustworthiness criteria, and efficient adaptation to domain-specific requirements. We introduce the Agentar-Fin-R1 series of financial large language models (8B and 32B parameters), specifically engineered based on the Qwen3 foundation model to enhance reasoning capabilities, reliability, and domain specialization for financial applications. Our optimization approach integrates a high-quality, systematic financial task label system with a comprehensive multi-layered trustworthiness assurance framework. This framework encompasses high-quality trustworthy knowledge engineering, multi-agent trustworthy data synthesis, and rigorous data validation governance. Through label-guided automated difficulty-aware optimization, a two-stage training pipeline, and dynamic attribution systems, we achieve substantial improvements in training efficiency. Our models undergo comprehensive evaluation on mainstream financial benchmarks including Fineva, FinEval, and FinanceIQ, as well as general reasoning datasets such as MATH-500 and GPQA-diamond. To thoroughly assess real-world deployment capabilities, we innovatively propose the Finova evaluation benchmark, which focuses on agent-level financial reasoning and compliance verification. Experimental results demonstrate that Agentar-Fin-R1 not only achieves state-of-the-art performance on financial tasks but also exhibits exceptional general reasoning capabilities, validating its effectiveness as a trustworthy solution for high-stakes financial applications. The Finova benchmark is available at https://github.com/antgroup/Finova.
[140] TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks
Keyu Wu, Qianjin Yu, Manlin Mei, Ruiting Liu, Jun Wang, Kailai Zhang, Yelun Bao
Main category: cs.CL
TL;DR: Root Cause Analysis (RCA) in telecom networks is challenging for AI due to complex graph-based reasoning and lack of realistic benchmarks.
Details
Motivation: The paper addresses the difficulty of applying AI to RCA in telecom networks, highlighting the need for better methods and benchmarks.
Method: Not explicitly mentioned in the abstract, but likely involves AI techniques for graph-based reasoning.
Result: Not explicitly mentioned in the abstract.
Conclusion: The abstract suggests that RCA in telecom networks remains a significant challenge for AI, emphasizing the need for further research.
Abstract: Root Cause Analysis (RCA) in telecommunication networks is a critical task, yet it presents a formidable challenge for Artificial Intelligence (AI) due to its complex, graph-based reasoning requirements and the scarcity of realistic benchmarks.
[141] Detection of Adverse Drug Events in Dutch clinical free text documents using Transformer Models: benchmark study
Rachel M. Murphy, Nishant Mishra, Nicolette F. de Keizer, Dave A. Dongelmans, Kitty J. Jager, Ameen Abu-Hanna, Joanna E. Klopotowska, Iacer Calixto
Main category: cs.CL
TL;DR: The study benchmarks ADE detection in Dutch clinical texts using transformer models, with MedRoBERTa(.)nl performing best.
Details
Motivation: To establish a robust benchmark for ADE detection in Dutch clinical free-text documents using advanced models and fit-for-purpose metrics.
Method: Trained Bi-LSTM and four transformer models (BERTje, RobBERT, MedRoBERTa(.)nl, NuNER) on 102 annotated Dutch ICU notes for NER and RC tasks, with internal and external validation.
Result: MedRoBERTa(.)nl achieved the highest macro-averaged F1 scores (0.63 with gold standard, 0.62 with predicted entities) and recall (0.67-0.74) in external validation.
Conclusion: The study provides a clinically meaningful benchmark for ADE detection, emphasizing the need for task-specific performance measures.
Abstract: In this study, we establish a benchmark for adverse drug event (ADE) detection in Dutch clinical free-text documents using several transformer models, clinical scenarios, and fit-for-purpose performance measures. We trained a Bidirectional Long Short-Term Memory (Bi-LSTM) model and four transformer-based Dutch and/or multilingual encoder models (BERTje, RobBERT, MedRoBERTa(.)nl, and NuNER) for the tasks of named entity recognition (NER) and relation classification (RC) using 102 richly annotated Dutch ICU clinical progress notes. Anonymized free-text clinical progress notes of patients admitted to the intensive care unit (ICU) of one academic hospital and discharge letters of patients admitted to Internal Medicine wards of two non-academic hospitals were reused. We evaluated our ADE RC models internally using the gold standard (two-step task) and predicted entities (end-to-end task). In addition, all models were externally validated for detecting ADEs at the document level. We report both micro- and macro-averaged F1 scores, given the dataset imbalance in ADEs. Although differences between the models on the ADE RC task were small, MedRoBERTa(.)nl was the best performing model with a macro-averaged F1 score of 0.63 using the gold standard and 0.62 using predicted entities. The MedRoBERTa(.)nl models also performed the best in our external validation and achieved a recall of between 0.67 and 0.74 using predicted entities, meaning between 67% and 74% of discharge letters with ADEs were detected. Our benchmark study presents a robust and clinically meaningful approach for evaluating language models for ADE detection in clinical free-text documents. Our study highlights the need to use appropriate performance measures fit for the task of ADE detection in clinical free-text documents and for the envisioned future clinical use.
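Why the study reports both averages: with rare positives (ADEs), micro-F1 tracks overall accuracy while macro-F1 exposes errors on the minority class. A small illustration with toy labels:

```python
from sklearn.metrics import f1_score

# 10 documents, only 2 with an ADE; the model misses one of them.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
print(f1_score(y_true, y_pred, average="micro"))  # 0.90
print(f1_score(y_true, y_pred, average="macro"))  # ~0.80, minority error shows
```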
cs.CV
[142] Tuning adaptive gamma correction (TAGC) for enhancing images in low light
Ghufran Abualhail Alhamzawi, Ali Saeed Alfoudi, Ali Hakem Alsaeedi, Suha Mohammed Hadi, Amjed Abbas Ahmed, Md. Riad Hassan, Nurhizam Safie Mohd Satar, Waeel Yahya Yasseen
Main category: cs.CV
TL;DR: The paper introduces TAGC, a model for enhancing low-light images by adaptively calculating gamma correction based on color luminance, improving image quality without manual adjustments.
Details
Motivation: Low-light conditions degrade image quality, causing issues like low contrast, noise, and blur. Enhancing such images is crucial for applications like surveillance, medical imaging, and photography.
Method: The TAGC model analyzes color luminance and calculates an adaptive gamma coefficient automatically, adjusting for varying illumination levels without human intervention.
Result: TAGC effectively improves low-light images, preserving details, contrast, and color distribution while providing natural visual quality.
Conclusion: TAGC is an efficient solution for low-light image enhancement, applicable in surveillance, medical imaging, and photography.
Abstract: Enhancing images in low-light conditions is an important challenge in computer vision. Insufficient illumination negatively affects the quality of images, resulting in low contrast, intensive noise, and blurred details. This paper presents a model for enhancing low-light images called tuning adaptive gamma correction (TAGC). The model is based on analyzing the color luminance of the low-light image and calculating the average color to determine the adaptive gamma coefficient. The gamma value is calculated automatically and adaptively at different illumination levels suitable for the image, without human intervention or manual adjustment. Based on qualitative and quantitative evaluation, the TAGC model effectively improves low-light images while maintaining details, natural contrast, and correct color distribution, providing natural visual quality. It can be considered an efficient solution for processing low-light images in multiple applications such as night surveillance, improving the quality of medical images, and photography in low-light environments.
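A minimal sketch of luminance-driven adaptive gamma, assuming a mapping that pushes the image's mean brightness toward mid-grey; the paper's exact formula for the coefficient is not reproduced here:

```python
import numpy as np

def adaptive_gamma(image_u8):
    """Gamma-correct an image with a coefficient derived from its own
    average luminance: dark inputs get gamma < 1 and are brightened."""
    img = image_u8.astype(np.float32) / 255.0
    mean_luminance = img.mean()
    gamma = np.log(0.5) / np.log(mean_luminance + 1e-6)  # maps mean -> ~0.5
    return (np.clip(img ** gamma, 0.0, 1.0) * 255).astype(np.uint8)
```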
[143] Is Exchangeability better than I.I.D to handle Data Distribution Shifts while Pooling Data for Data-scarce Medical image segmentation?
Ayush Roy, Samin Enam, Jun Xia, Vishnu Suresh Lokhande, Won Hwa Kim
Main category: cs.CV
TL;DR: The paper addresses the ‘Data Addition Dilemma’ in medical imaging by proposing a method to control feature discrepancies in deep networks, improving segmentation performance across multiple datasets.
Details
Motivation: Data scarcity and distributional shifts in multi-source medical imaging datasets hinder deep learning model performance.
Method: The authors use causal frameworks to control foreground-background feature discrepancies in deep networks for medical image segmentation.
Result: State-of-the-art segmentation performance is achieved on histopathology and ultrasound images across five datasets, with qualitative improvements.
Conclusion: The proposed method effectively mitigates the challenges of data pooling and addition, enhancing segmentation accuracy.
Abstract: Data scarcity is a major challenge in medical imaging, particularly for deep learning models. While data pooling (combining datasets from multiple sources) and data addition (adding more data from a new dataset) have been shown to enhance model performance, they are not without complications. Specifically, increasing the size of the training dataset through pooling or addition can induce distributional shifts, negatively affecting downstream model performance, a phenomenon known as the “Data Addition Dilemma”. While the traditional i.i.d. assumption may not hold in multi-source contexts, assuming exchangeability across datasets provides a more practical framework for data pooling. In this work, we investigate medical image segmentation under these conditions, drawing insights from causal frameworks to propose a method for controlling foreground-background feature discrepancies across all layers of deep networks. This approach improves feature representations, which are crucial in data-addition scenarios. Our method achieves state-of-the-art segmentation performance on histopathology and ultrasound images across five datasets, including a novel ultrasound dataset that we have curated and contributed. Qualitative results demonstrate more refined and accurate segmentation maps compared to prominent baselines across three model architectures. The code will be available on GitHub.
[144] T-MPEDNet: Unveiling the Synergy of Transformer-aware Multiscale Progressive Encoder-Decoder Network with Feature Recalibration for Tumor and Liver Segmentation
Chandravardhan Singh Raghaw, Jasmer Singh Sanjotra, Mohammad Zia Ur Rehman, Shubhi Bansal, Shahid Shafi Dar, Nagendra Kumar
Main category: cs.CV
TL;DR: T-MPEDNet, a Transformer-aware Multiscale Progressive Encoder-Decoder Network, improves automated liver and tumor segmentation in CT scans by combining deep adaptive features, Transformer-inspired attention, and morphological refinement, outperforming 12 state-of-the-art methods.
Details
Motivation: Automated liver and tumor segmentation is crucial for diagnosis and treatment planning but is challenged by tumor heterogeneity and diverse liver characteristics.
Method: T-MPEDNet uses a progressive encoder-decoder with skip connections, a Transformer-inspired attention mechanism for long-range context, and multi-scale features for local details, followed by morphological boundary refinement.
Result: T-MPEDNet achieves DSC scores of 97.6% (liver) and 89.1% (tumor) on LiTS, and 98.3% (liver) and 83.3% (tumor) on 3DIRCADb, outperforming other methods.
Conclusion: T-MPEDNet is an effective and reliable framework for precise liver and tumor segmentation in CT scans.
Abstract: Precise and automated segmentation of the liver and its tumor within CT scans plays a pivotal role in swift diagnosis and the development of optimal treatment plans for individuals with liver diseases and malignancies. However, automated liver and tumor segmentation faces significant hurdles arising from the inherent heterogeneity of tumors and the diverse visual characteristics of livers across a broad spectrum of patients. Aiming to address these challenges, we present a novel Transformer-aware Multiscale Progressive Encoder-Decoder Network (T-MPEDNet) for automated segmentation of tumor and liver. T-MPEDNet leverages a deep adaptive features backbone through a progressive encoder-decoder structure, enhanced by skip connections for recalibrating channel-wise features while preserving spatial integrity. A Transformer-inspired dynamic attention mechanism captures long-range contextual relationships within the spatial domain, further enhanced by multi-scale feature utilization for refined local details, leading to accurate prediction. Morphological boundary refinement is then employed to address indistinct boundaries with neighboring organs, capturing finer details and yielding precise boundary labels. The efficacy of T-MPEDNet is comprehensively assessed on two widely utilized public benchmark datasets, LiTS and 3DIRCADb. Extensive quantitative and qualitative analyses demonstrate the superiority of T-MPEDNet compared to twelve state-of-the-art methods. On LiTS, T-MPEDNet achieves outstanding Dice Similarity Coefficients (DSC) of 97.6% and 89.1% for liver and tumor segmentation, respectively. Similar performance is observed on 3DIRCADb, with DSCs of 98.3% and 83.3% for liver and tumor segmentation, respectively. Our findings prove that T-MPEDNet is an efficacious and reliable framework for automated segmentation of the liver and its tumor in CT scans.
[145] LAVA: Language Driven Scalable and Versatile Traffic Video Analytics
Yanrui Yu, Tianfei Zhou, Jiaxin Sun, Lianpeng Qiao, Lizhong Ding, Ye Yuan, Guoren Wang
Main category: cs.CV
TL;DR: The paper introduces Lava, a language-driven video analytics system for flexible and efficient querying of large-scale video data using natural language, outperforming existing methods in accuracy and speed.
Details
Motivation: The need for scalable video analytics in urban environments with massive video data, overcoming the limitations of rigid SQL-based querying paradigms.
Method: Lava uses a multi-armed bandit sampling method, an open-world detection module, and trajectory extraction for object-level retrieval and temporal association.
Result: Lava improves F1-scores by 14%, reduces MPAE by 0.39, achieves 86% top-k precision, and processes videos 9.6x faster than baselines.
Conclusion: Lava demonstrates superior performance in flexible and efficient video analytics, validated by a novel benchmark.
Abstract: In modern urban environments, camera networks generate massive amounts of operational footage – reaching petabytes each day – making scalable video analytics essential for efficient processing. Many existing approaches adopt an SQL-based paradigm for querying such large-scale video databases; however, this constrains queries to rigid patterns with predefined semantic categories, significantly limiting analytical flexibility. In this work, we explore a language-driven video analytics paradigm aimed at enabling flexible and efficient querying of high-volume video data driven by natural language. Particularly, we build Lava, a system that accepts natural language queries and retrieves traffic targets across multiple levels of granularity and arbitrary categories. Lava comprises three main components: 1) a multi-armed bandit-based efficient sampling method for video segment-level localization; 2) a video-specific open-world detection module for object-level retrieval; and 3) a long-term object trajectory extraction scheme for temporal object association, yielding complete trajectories for objects of interest. To support comprehensive evaluation, we further develop a novel benchmark by providing diverse, semantically rich natural language predicates and fine-grained annotations for multiple videos. Experiments on this benchmark demonstrate that Lava improves F1-scores for selection queries by 14%, reduces MPAE for aggregation queries by 0.39, and achieves top-k precision of 86%, while processing videos 9.6x faster than the most accurate baseline.
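The segment-level sampler is bandit-based; a UCB1-style sketch under the assumption that each probe of a segment returns a relevance reward (the paper's actual bandit formulation may differ):

```python
import math

def ucb_sample(segments, probe, budget):
    """Spend a limited probing budget on the video segments whose probes
    (e.g., detector hit rates) look most promising so far."""
    counts = [0] * len(segments)
    totals = [0.0] * len(segments)
    for t in range(1, budget + 1):
        scores = [totals[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
                  if counts[i] else float("inf") for i in range(len(segments))]
        arm = scores.index(max(scores))
        totals[arm] += probe(segments[arm])
        counts[arm] += 1
    return counts  # probe allocation per segment
```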
[146] SurgPIS: Surgical-instrument-level Instances and Part-level Semantics for Weakly-supervised Part-aware Instance Segmentation
Meng Wei, Charlie Budd, Oluwatosin Alabi, Miaojing Shi, Tom Vercauteren
Main category: cs.CV
TL;DR: SurgPIS introduces a unified part-aware instance segmentation (PIS) model for surgical instruments, combining instrument-level and part-level segmentation with a transformer-based approach and weakly-supervised learning.
Details
Motivation: Existing methods treat instrument-level instance segmentation (IIS) and part-level semantic segmentation (PSS) separately, lacking interaction between tasks.
Method: Uses a transformer-based mask classification with part-specific queries linked to instrument instances, and a weakly-supervised learning strategy for training on disjoint datasets.
Result: Achieves state-of-the-art performance in PIS, IIS, PSS, and instrument-level semantic segmentation across multiple datasets.
Conclusion: SurgPIS effectively unifies and improves surgical instrument segmentation by integrating part-aware instance segmentation with innovative training strategies.
Abstract: Consistent surgical instrument segmentation is critical for automation in robot-assisted surgery. Yet, existing methods treat instrument-level instance segmentation (IIS) or part-level semantic segmentation (PSS) separately, without interaction between these tasks. In this work, we formulate surgical tool segmentation as a unified part-aware instance segmentation (PIS) problem and introduce SurgPIS, the first PIS model for surgical instruments. Our method adopts a transformer-based mask classification approach and introduces part-specific queries derived from instrument-level object queries, explicitly linking parts to their parent instrument instances. To address the lack of large-scale datasets with both instance- and part-level labels, we propose a weakly-supervised learning strategy for SurgPIS to learn from disjoint datasets labelled for either IIS or PSS purposes. During training, we aggregate our PIS predictions into IIS or PSS masks, thereby allowing us to compute a loss against partially labelled datasets. A student-teacher approach is developed to maintain prediction consistency for missing PIS information in the partially labelled data, e.g., parts of the IIS labelled data. Extensive experiments across multiple datasets validate the effectiveness of SurgPIS, achieving state-of-the-art performance in PIS as well as IIS, PSS, and instrument-level semantic segmentation.
[147] Object-centric Video Question Answering with Visual Grounding and Referring
Haochen Wang, Qirui Chen, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie, Stratis Gavves
Main category: cs.CV
TL;DR: The paper introduces a VideoLLM model for object-centric video understanding, featuring object referring and grounding, a Spatial-Temporal Overlay Module (STOM), and the VideoInfer dataset. It outperforms baselines in video QA and segmentation.
Details
Motivation: Existing VideoLLMs lack flexibility for object-centric, multiround interactions and are limited to text-only responses.
Method: Proposes a VideoLLM model with object referring and grounding, introduces STOM for visual prompt propagation, and curates the VideoInfer dataset.
Result: Outperforms baselines on 12 benchmarks across 6 tasks, demonstrating robustness in multimodal video understanding.
Conclusion: The model enhances object-centric video reasoning, supported by STOM and VideoInfer, achieving superior performance in diverse tasks.
Abstract: Video Large Language Models (VideoLLMs) have recently demonstrated remarkable progress in general video understanding. However, existing models primarily focus on high-level comprehension and are limited to text-only responses, restricting the flexibility for object-centric, multiround interactions. In this paper, we make three contributions: (i) we address these limitations by introducing a VideoLLM model, capable of performing both object referring for input and grounding for output in video reasoning tasks, i.e., allowing users to interact with videos using both textual and visual prompts; (ii) we propose STOM (Spatial-Temporal Overlay Module), a novel approach that propagates arbitrary visual prompts input at any single timestamp to the remaining frames within a video; (iii) we present VideoInfer, a manually curated object-centric video instruction dataset featuring question-answering pairs that require reasoning. We conduct comprehensive experiments on VideoInfer and other existing benchmarks across video question answering and referring object segmentation. The results on 12 benchmarks of 6 tasks show that our proposed model consistently outperforms baselines in both video question answering and segmentation, underscoring its robustness in multimodal, object-centric video and image understanding. Project page: https://qirui-chen.github.io/RGA3-release/.
[148] T³SVFND: Towards an Evolving Fake News Detector for Emergencies with Test-time Training on Short Video Platforms
Liyuan Zhang, Zeyun Cheng, Yan Yang, Yong Liu, Jinke Ma
Main category: cs.CV
TL;DR: The paper proposes T³SVFND, a fake news video detection framework using Test-Time Training (TTT) to improve generalization across different news events, especially emergencies.
Details
Motivation: Existing fake news video detection methods struggle with generalization due to distribution shifts between events, particularly in emergencies.
Method: The framework uses a self-supervised auxiliary task based on Masked Language Modeling (MLM) to predict masked words by combining audio and video context, adapting to test data during training.
Result: Experiments show the model’s effectiveness, particularly for emergency news detection.
Conclusion: T³SVFND enhances robustness in fake news video detection, addressing distribution shifts and improving performance for emergencies.
Abstract: The existing methods for fake news video detection may not generalize, because there is a distribution shift between short video news of different events, and the performance of such techniques drops greatly if news records come from emergencies. We propose a new fake news video detection framework (T³SVFND) using Test-Time Training (TTT) to alleviate this limitation, enhancing the robustness of fake news video detection. Specifically, we design a self-supervised auxiliary task based on Masked Language Modeling (MLM) that masks a certain percentage of words in the text and predicts these masked words by combining contextual information from different modalities (audio and video). In the test-time training phase, the model adapts to the distribution of the test data through the auxiliary task. Extensive experiments on the public benchmark demonstrate the effectiveness of the proposed model, especially for the detection of emergency news.
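Test-time training with a masked-word auxiliary loss can be sketched as a short per-sample adaptation loop. In this sketch `mask_words` is a hypothetical helper, and the mask ratio, step count, and learning rate are illustrative values, not the paper's:

```python
import torch
import torch.nn.functional as F

def test_time_adapt(model, mlm_head, mask_words, sample, steps=3, lr=1e-4):
    """Adapt the shared encoder on one test sample via the masked-word
    auxiliary objective before running the fake-news classifier."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(steps):
        tokens, labels = mask_words(sample["text"], ratio=0.15)
        logits = mlm_head(model(tokens, sample["audio"], sample["video"]))
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
        opt.zero_grad(); loss.backward(); opt.step()
    return model  # now adapted to the test-time distribution
```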
[149] Exemplar Med-DETR: Toward Generalized and Robust Lesion Detection in Mammogram Images and beyond
Sheethal Bhat, Bogdan Georgescu, Adarsh Bhandary Panambur, Mathias Zinnen, Tri-Thien Nguyen, Awais Mansoor, Karim Khalifa Elbarbary, Siming Bayer, Florin-Cristian Ghesu, Sasa Grbic, Andreas Maier
Main category: cs.CV
TL;DR: Exemplar Med-DETR, a multi-modal contrastive detector, improves lesion detection in medical images by leveraging class-specific exemplar features and cross-attention, achieving state-of-the-art results across diverse imaging modalities.
Details
Motivation: Existing methods struggle with learning effective class-specific features in medical imaging, especially in dense breast tissue, limiting their generalizability and performance.
Method: Introduces Exemplar Med-DETR, which uses cross-attention with intuitive class-specific exemplar features and an iterative training strategy.
Result: Achieves mAP of 0.7 for mass detection and 0.55 for calcifications in mammograms, with improvements in chest X-rays and angiography. Outperforms existing methods by significant margins.
Conclusion: Exemplar Med-DETR demonstrates robust and generalizable performance, advancing medical imaging detection systems.
Abstract: Detecting abnormalities in medical images poses unique challenges due to differences in feature representations and the intricate relationship between anatomical structures and abnormalities. This is especially evident in mammography, where dense breast tissue can obscure lesions, complicating radiological interpretation. Despite leveraging anatomical and semantic context, existing detection methods struggle to learn effective class-specific features, limiting their applicability across different tasks and imaging modalities. In this work, we introduce Exemplar Med-DETR, a novel multi-modal contrastive detector that enables feature-based detection. It employs cross-attention with inherently derived, intuitive class-specific exemplar features and is trained with an iterative strategy. We achieve state-of-the-art performance across three distinct imaging modalities from four public datasets. On Vietnamese dense breast mammograms, we attain an mAP of 0.7 for mass detection and 0.55 for calcifications, yielding an absolute improvement of 16 percentage points. Additionally, a radiologist-supported evaluation of 100 mammograms from an out-of-distribution Chinese cohort demonstrates a twofold gain in lesion detection performance. For chest X-rays and angiography, we achieve an mAP of 0.25 for mass and 0.37 for stenosis detection, improving results by 4 and 7 percentage points, respectively. These results highlight the potential of our approach to advance robust and generalizable detection systems for medical imaging.
[150] Pre- and Post-Treatment Glioma Segmentation with the Medical Imaging Segmentation Toolkit
Adrian Celaya, Tucker Netherton, Dawid Schellingerhout, Caroline Chung, Beatrice Riviere, David Fuentes
Main category: cs.CV
TL;DR: MIST introduces a flexible postprocessing framework for medical image segmentation, evaluated in the BraTS 2025 challenge, enabling customizable strategies for high-quality results.
Details
Motivation: Addressing the lack of standardized tooling for rigorous comparison in medical image segmentation.
Method: Extends MIST's postprocessing module with transforms like object removal, connected components extraction, and morphological operations, allowing user-defined strategies.
Result: Evaluated three strategies, showing MIST’s capability for rapid experimentation and refinement, producing high-quality segmentations.
Conclusion: MIST’s modular and open-source design supports reproducible and scalable research in medical image segmentation.
Abstract: Medical image segmentation continues to advance rapidly, yet rigorous comparison between methods remains challenging due to a lack of standardized and customizable tooling. In this work, we present the current state of the Medical Imaging Segmentation Toolkit (MIST), with a particular focus on its flexible and modular postprocessing framework designed for the BraTS 2025 pre- and post-treatment glioma segmentation challenge. Since its debut in the 2024 BraTS adult glioma post-treatment segmentation challenge, MIST’s postprocessing module has been significantly extended to support a wide range of transforms, including removal or replacement of small objects, extraction of the largest connected components, and morphological operations such as hole filling and closing. These transforms can be composed into user-defined strategies, enabling fine-grained control over the final segmentation output. We evaluate three such strategies - ranging from simple small-object removal to more complex, class-specific pipelines - and rank their performance using the BraTS ranking protocol. Our results highlight how MIST facilitates rapid experimentation and targeted refinement, ultimately producing high-quality segmentations for the BraTS 2025 challenge. MIST remains open source and extensible, supporting reproducible and scalable research in medical image segmentation.
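In the spirit of MIST's composable transforms, a two-step strategy (small-object removal, then hole filling) might look like the following with scikit-image; this is a stand-in sketch, not MIST's own API:

```python
import numpy as np
from skimage import morphology

def simple_strategy(mask, min_size=64):
    """Drop small spurious objects, then fill comparably small holes."""
    cleaned = morphology.remove_small_objects(mask.astype(bool), min_size=min_size)
    cleaned = morphology.remove_small_holes(cleaned, area_threshold=min_size)
    return cleaned.astype(np.uint8)
```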
[151] MagicAnime: A Hierarchically Annotated, Multimodal and Multitasking Dataset with Benchmarks for Cartoon Animation Generation
Shuolin Xu, Bingyuan Wang, Zeyu Cai, Fangteng Fu, Yue Ma, Tongyi Lee, Hongchuan Yu, Zeyu Wang
Main category: cs.CV
TL;DR: The paper introduces MagicAnime, a large-scale, multimodal dataset for cartoon animation tasks, addressing the scarcity of annotated data and domain gap between real-world videos and cartoons.
Details
Motivation: The challenge lies in generating high-quality cartoon animations due to complex non-human characters, diverse motions, and fine-grained emotions, compounded by scarce annotated data.
Method: Proposes MagicAnime dataset with hierarchical annotations (400k video clips for image-to-video, 50k for whole-body annotation, etc.) and benchmarks (MagicAnime-Bench) for task comparisons.
Result: Validates effectiveness in four tasks (e.g., video-driven face animation) for high-fidelity, fine-grained, and controllable generation.
Conclusion: MagicAnime bridges the domain gap and supports diverse cartoon animation tasks with its comprehensive dataset and benchmarks.
Abstract: Generating high-quality cartoon animations with multimodal control is challenging due to the complexity of non-human characters, stylistically diverse motions, and fine-grained emotions. There is a huge domain gap between real-world videos and cartoon animation, as cartoon animation is usually abstract and has exaggerated motion. Meanwhile, public multimodal cartoon data are extremely scarce due to the difficulty of large-scale automatic annotation processes compared with real-life scenarios. To bridge this gap, we propose the MagicAnime dataset, a large-scale, hierarchically annotated, and multimodal dataset designed to support multiple video generation tasks, along with the benchmarks it includes. It contains 400k video clips for image-to-video generation, 50k pairs of video clips and keypoints for whole-body annotation, 12k pairs of video clips for video-to-video face animation, and 2.9k pairs of video and audio clips for audio-driven face animation. Meanwhile, we also build a set of multi-modal cartoon animation benchmarks, called MagicAnime-Bench, to support comparisons between different methods on the tasks above. Comprehensive experiments on four tasks, including video-driven face animation, audio-driven face animation, image-to-video animation, and pose-driven character animation, validate its effectiveness in supporting high-fidelity, fine-grained, and controllable generation.
[152] Local Prompt Adaptation for Style-Consistent Multi-Object Generation in Diffusion Models
Ankit Sanjyal
Main category: cs.CV
TL;DR: The paper introduces Local Prompt Adaptation (LPA), a training-free method to improve text-to-image diffusion models by enhancing style uniformity and spatial coherence for complex prompts.
Details
Motivation: Diffusion models struggle with complex prompts involving multiple objects and style specifications, leading to inconsistent visuals.
Method: LPA decomposes prompts into content and style tokens, injecting them selectively into U-Net's attention layers at different stages.
Result: LPA outperforms baselines like Composer and SDXL in CLIP score and style consistency metrics.
Conclusion: LPA offers a promising direction for controllable and expressive diffusion-based generation.
Abstract: Diffusion models have become a powerful backbone for text-to-image generation, enabling users to synthesize high-quality visuals from natural language prompts. However, they often struggle with complex prompts involving multiple objects and global or local style specifications. In such cases, the generated scenes tend to lack style uniformity and spatial coherence, limiting their utility in creative and controllable content generation. In this paper, we propose a simple, training-free architectural method called Local Prompt Adaptation (LPA). Our method decomposes the prompt into content and style tokens, and injects them selectively into the U-Net’s attention layers at different stages. By conditioning object tokens early and style tokens later in the generation process, LPA enhances both layout control and stylistic consistency. We evaluate our method on a custom benchmark of 50 style-rich prompts across five categories and compare against strong baselines including Composer, MultiDiffusion, Attend-and-Excite, LoRA, and SDXL. Our approach outperforms prior work on both CLIP score and style consistency metrics, offering a new direction for controllable, expressive diffusion-based generation.
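The core scheduling idea (content tokens early for layout, style tokens added later for appearance) reduces to a per-step token selection. A sketch where the 50% switch point is an illustrative assumption, not the paper's tuned value:

```python
def lpa_tokens(step, total_steps, content_tokens, style_tokens, switch=0.5):
    """Select which prompt tokens condition the U-Net's cross-attention at
    a given denoising step: content only early on, content + style later."""
    if step < switch * total_steps:
        return content_tokens
    return content_tokens + style_tokens
```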
[153] SynPAIN: A Synthetic Dataset of Pain and Non-Pain Facial Expressions
Babak Taati, Muhammad Muzammil, Yasamin Zarghami, Abhishek Moturu, Airhossein Kazerouni, Hailey Reimer, Alex Mihailidis, Thomas Hadjistavropoulos
Main category: cs.CV
TL;DR: SynPAIN is a synthetic dataset for pain detection, addressing diversity gaps and algorithmic bias in existing models, improving performance on real clinical data.
Details
Motivation: Pain assessment in non-communicative patients (e.g., older adults with dementia) is challenging due to limited diverse datasets.
Method: Created a synthetic dataset (SynPAIN) using generative AI, balanced across ethnicities, ages, and genders, validated with clinical pain assessment tools.
Result: Synthetic data improved pain detection by 7.0% in precision and revealed hidden algorithmic biases in existing models.
Conclusion: SynPAIN fills gaps in pain assessment research, offering a diverse dataset and framework for bias mitigation.
Abstract: Accurate pain assessment in patients with limited ability to communicate, such as older adults with dementia, represents a critical healthcare challenge. Robust automated systems of pain detection may facilitate such assessments. Existing pain detection datasets, however, suffer from limited ethnic/racial diversity, privacy constraints, and underrepresentation of older adults who are the primary target population for clinical deployment. We present SynPAIN, a large-scale synthetic dataset containing 10,710 facial expression images (5,355 neutral/expressive pairs) across five ethnicities/races, two age groups (young: 20-35, old: 75+), and two genders. Using commercial generative AI tools, we created demographically balanced synthetic identities with clinically meaningful pain expressions. Our validation demonstrates that synthetic pain expressions exhibit expected pain patterns, scoring significantly higher than neutral and non-pain expressions using clinically validated pain assessment tools based on facial action unit analysis. We experimentally demonstrate SynPAIN’s utility in identifying algorithmic bias in existing pain detection models. Through comprehensive bias evaluation, we reveal substantial performance disparities across demographic characteristics. These performance disparities were previously undetectable with smaller, less diverse datasets. Furthermore, we demonstrate that age-matched synthetic data augmentation improves pain detection performance on real clinical data, achieving a 7.0% improvement in average precision. SynPAIN addresses critical gaps in pain assessment research by providing the first publicly available, demographically diverse synthetic dataset specifically designed for older adult pain detection, while establishing a framework for measuring and mitigating algorithmic bias. The dataset is available at https://doi.org/10.5683/SP3/WCXMAP
[154] T2VParser: Adaptive Decomposition Tokens for Partial Alignment in Text to Video Retrieval
Yili Li, Gang Xiong, Gaopeng Gou, Xiangyan Qu, Jiamin Zhuang, Zhen Li, Junzheng Shi
Main category: cs.CV
TL;DR: T2VParser improves text-to-video retrieval by extracting multiview semantic representations for adaptive alignment, addressing partial misalignment in video-text datasets.
Details
Motivation: Videos contain richer information than images, and current video-text datasets often misalign due to incomplete textual descriptions, leading to incorrect supervision.
Method: Proposes T2VParser with Adaptive Decomposition Tokens to extract and align multiview semantic representations from text and video.
Result: T2VParser achieves accurate partial alignment through cross-modal content decomposition, outperforming direct alignment methods.
Conclusion: T2VParser effectively aligns text and video representations while preserving pretrained model knowledge, improving retrieval accuracy.
Abstract: Text-to-video retrieval essentially aims to train models to align visual content with textual descriptions accurately. Due to the impressive general multimodal knowledge demonstrated by image-text pretrained models such as CLIP, existing work has primarily focused on extending CLIP knowledge for video-text tasks. However, videos typically contain richer information than images. In current video-text datasets, textual descriptions can only reflect a portion of the video content, leading to partial misalignment in video-text matching. Therefore, directly aligning text representations with video representations can result in incorrect supervision, ignoring the inequivalence of information. In this work, we propose T2VParser to extract multiview semantic representations from text and video, achieving adaptive semantic alignment rather than aligning the entire representation. To extract corresponding representations from different modalities, we introduce Adaptive Decomposition Tokens, which consist of a set of learnable tokens shared across modalities. The goal of T2VParser is to emphasize precise alignment between text and video while retaining the knowledge of pretrained models. Experimental results demonstrate that T2VParser achieves accurate partial alignment through effective cross-modal content decomposition. The code is available at https://github.com/Lilidamowang/T2VParser.
[155] Efficient Learning for Product Attributes with Compact Multimodal Models
Mandar Kulkarni
Main category: cs.CV
TL;DR: The paper explores label-efficient semi-supervised fine-tuning for compact Vision Language Models (VLMs) in e-commerce, using Direct Preference Optimization (DPO) to leverage unlabeled data for improved performance.
Details
Motivation: Supervised fine-tuning of VLMs is costly due to annotation requirements, prompting the need for label-efficient methods.
Method: Uses PEFT for initial training, then DPO with unlabeled data by generating and segregating reasoning chains for fine-tuning.
Result: DPO-based fine-tuning outperforms supervised models and improves with more unlabeled data.
Conclusion: Unlabeled data can effectively enhance VLM performance in e-commerce attribute prediction.
Abstract: Image-based product attribute prediction in e-commerce is a crucial task with numerous applications. The supervised fine-tuning of Vision Language Models (VLMs) faces significant scale challenges due to the cost of manual or API based annotation. In this paper, we investigate label-efficient semi-supervised fine-tuning strategies for compact VLMs (2B-3B parameters) that leverage unlabeled product listings through Direct Preference Optimization (DPO). Beginning with a small set labeled via API-based annotation, we first employ PEFT to train low-rank adapter modules. To update the adapter weights with unlabeled data, we generate multiple reasoning-and-answer chains per unlabeled sample and segregate these chains into preferred and dispreferred based on self-consistency. We then fine-tune the model with DPO loss and use the updated model for the next iteration. By using PEFT fine-tuning with DPO, our method achieves efficient convergence with minimal compute overhead. On a dataset spanning twelve e-commerce verticals, DPO-based fine-tuning, which utilizes only unlabeled data, demonstrates a significant improvement over the supervised model. Moreover, experiments demonstrate that accuracy with DPO training improves with more unlabeled data, indicating that a large pool of unlabeled samples can be effectively leveraged to improve performance.
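The self-consistency step above is easy to picture in code. A minimal sketch of how preference pairs might be built from sampled chains, assuming a hypothetical `generate_chains` helper that samples reasoning-and-answer chains from the current model:

```python
# Sketch of self-consistency-based preference pair construction for DPO,
# following the high-level description in the abstract. `generate_chains`
# is a hypothetical helper, not an API from the paper.
from collections import Counter

def build_preference_pairs(samples, generate_chains, k=8):
    pairs = []
    for sample in samples:
        # Each chain is a (reasoning, answer) tuple sampled from the VLM.
        chains = generate_chains(sample, num_chains=k)
        majority_answer, _ = Counter(a for _, a in chains).most_common(1)[0]
        preferred = [c for c in chains if c[1] == majority_answer]
        dispreferred = [c for c in chains if c[1] != majority_answer]
        # Pair each preferred chain with a dispreferred one for the DPO loss;
        # samples where all chains agree contribute no pairs.
        pairs += list(zip(preferred, dispreferred))
    return pairs
```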
[156] DeepJIVE: Learning Joint and Individual Variation Explained from Multimodal Data Using Deep Learning
Matthew Drexler, Benjamin Risk, James J Lah, Suprateek Kundu, Deqiang Qiu
Main category: cs.CV
TL;DR: DeepJIVE is a deep-learning method for multimodal data analysis, overcoming limitations of traditional methods by handling high-dimensional data and identifying nonlinear structures.
Details
Motivation: Traditional multimodal data integration methods struggle with high-dimensional data and nonlinear structures, prompting the need for a more advanced approach.
Method: DeepJIVE uses deep learning to perform Joint and Individual Variance Explained (JIVE), with mathematical derivations and experimental validations on synthetic and real-world datasets. Three loss functions were explored to achieve identity and orthogonality constraints.
Result: DeepJIVE successfully uncovers joint and individual variations in multimodal datasets and identifies biologically plausible patterns in Alzheimer’s Disease Neuroimaging Initiative (ADNI) data.
Conclusion: DeepJIVE is a valuable tool for multimodal data analysis, offering improved capabilities over conventional methods.
Abstract: Conventional multimodal data integration methods provide a comprehensive assessment of the shared or unique structure within each individual data type but suffer from several limitations such as the inability to handle high-dimensional data and identify nonlinear structures. In this paper, we introduce DeepJIVE, a deep-learning approach to performing Joint and Individual Variance Explained (JIVE). We perform mathematical derivation and experimental validations using both synthetic and real-world 1D, 2D, and 3D datasets. Different strategies of achieving the identity and orthogonality constraints for DeepJIVE were explored, resulting in three viable loss functions. We found that DeepJIVE can successfully uncover joint and individual variations of multimodal datasets. Our application of DeepJIVE to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) also identified biologically plausible covariation patterns between the amyloid positron emission tomography (PET) and magnetic resonance (MR) images. In conclusion, the proposed DeepJIVE can be a useful tool for multimodal data analysis.
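As a concrete illustration of the kind of constraint involved, the sketch below shows a simple cross-covariance penalty that discourages overlap between joint and individual latent scores; it is an illustrative stand-in, not one of the paper's three loss functions:

```python
# Minimal sketch of an orthogonality penalty between JIVE's joint and
# individual components; an illustrative stand-in only.
import torch

def orthogonality_loss(joint, individual):
    """Penalize correlation between joint and individual latent scores.

    joint, individual: (batch, dim) latent representations from the two
    branches of one modality's encoder.
    """
    # Center each dimension, then penalize the cross-covariance energy.
    j = joint - joint.mean(dim=0, keepdim=True)
    a = individual - individual.mean(dim=0, keepdim=True)
    cross_cov = j.T @ a / (j.shape[0] - 1)
    return (cross_cov ** 2).sum()

loss = orthogonality_loss(torch.randn(32, 16), torch.randn(32, 16))
```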
[157] Regularizing Subspace Redundancy of Low-Rank Adaptation
Yue Zhu, Haiwen Diao, Shang Gao, Jiazuo Yu, Jiawen Zhu, Yunzhi Zhuge, Shuai Hao, Xu Jia, Lu Zhang, Ying Zhang, Huchuan Lu
Main category: cs.CV
TL;DR: ReSoRA improves LoRA by reducing redundancy in projection matrices, enhancing feature adaptation without extra inference costs.
Details
Motivation: Existing LoRA variants suffer from high redundancy in projection matrices, limiting feature adaptation effectiveness.
Method: ReSoRA decomposes low-rank submatrices into subspaces and applies de-redundancy constraints to feature distributions.
Result: ReSoRA boosts performance of PETL methods across various datasets and architectures.
Conclusion: ReSoRA is a flexible, plug-and-play solution for improving LoRA-based methods.
Abstract: Low-Rank Adaptation (LoRA) and its variants have delivered strong capability in Parameter-Efficient Transfer Learning (PETL) by minimizing trainable parameters and benefiting from reparameterization. However, their projection matrices remain unrestricted during training, causing high representation redundancy and diminishing the effectiveness of feature adaptation in the resulting subspaces. While existing methods mitigate this by manually adjusting the rank or implicitly applying channel-wise masks, they lack flexibility and generalize poorly across various datasets and architectures. Hence, we propose ReSoRA, a method that explicitly models redundancy between mapping subspaces and adaptively Regularizes Subspace redundancy of Low-Rank Adaptation. Specifically, it theoretically decomposes the low-rank submatrices into multiple equivalent subspaces and systematically applies de-redundancy constraints to the feature distributions across different projections. Extensive experiments validate that our proposed method consistently facilitates existing state-of-the-art PETL methods across various backbones and datasets in vision-language retrieval and standard visual classification benchmarks. Besides, as a training supervision, ReSoRA can be seamlessly integrated into existing approaches in a plug-and-play manner, with no additional inference costs. Code is publicly available at: https://github.com/Lucenova/ReSoRA.
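To make "de-redundancy constraints across subspaces" concrete, here is a rough sketch that splits a LoRA down-projection into rank subspaces and penalizes cosine similarity between the features they produce; the paper's exact decomposition and constraint may differ:

```python
# Illustrative sketch of a de-redundancy regularizer over low-rank
# subspaces, in the spirit of ReSoRA; not the paper's exact formulation.
import torch
import torch.nn.functional as F

def subspace_redundancy_penalty(x, lora_down, num_subspaces=4):
    """Encourage the features produced by different rank subspaces of a
    LoRA down-projection to stay dissimilar.

    x:         (batch, d_in) input features
    lora_down: (rank, d_in) LoRA down-projection, rank divisible by
               num_subspaces
    """
    rank = lora_down.shape[0]
    chunk = rank // num_subspaces
    # Project inputs through each rank subspace separately and normalize.
    feats = [F.normalize(x @ lora_down[i * chunk:(i + 1) * chunk].T, dim=1)
             for i in range(num_subspaces)]
    penalty = x.new_zeros(())
    for i in range(num_subspaces):
        for j in range(i + 1, num_subspaces):
            # Mean absolute cosine similarity between subspace features.
            penalty = penalty + (feats[i] * feats[j]).sum(dim=1).abs().mean()
    return penalty

penalty = subspace_redundancy_penalty(torch.randn(16, 128), torch.randn(8, 128))
```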
[158] Co-Win: Joint Object Detection and Instance Segmentation in LiDAR Point Clouds via Collaborative Window Processing
Haichuan Li, Tomi Westerlund
Main category: cs.CV
TL;DR: Co-Win is a BEV perception framework for urban environments, combining point cloud encoding and window-based feature extraction for multi-modality understanding.
Details
Motivation: Accurate perception in complex urban settings is crucial for autonomous navigation, requiring methods that handle multi-modality and fine-grained scene decomposition.
Method: A hierarchical architecture comprising a specialized encoder, a window-based backbone, and a query-based decoder, combined with a variational approach and mask-based instance segmentation.
Result: Produces data-consistent, contextually relevant masks and interpretable, diverse instance predictions.
Conclusion: Co-Win enhances autonomous driving decision-making by improving scene understanding and perception accuracy.
Abstract: Accurate perception and scene understanding in complex urban environments is a critical challenge for ensuring safe and efficient autonomous navigation. In this paper, we present Co-Win, a novel bird’s eye view (BEV) perception framework that integrates point cloud encoding with efficient parallel window-based feature extraction to address the multi-modality inherent in environmental understanding. Our method employs a hierarchical architecture comprising a specialized encoder, a window-based backbone, and a query-based decoder head to effectively capture diverse spatial features and object relationships. Unlike prior approaches that treat perception as a simple regression task, our framework incorporates a variational approach with mask-based instance segmentation, enabling fine-grained scene decomposition and understanding. The Co-Win architecture processes point cloud data through progressive feature extraction stages, ensuring that predicted masks are both data-consistent and contextually relevant. Furthermore, our method produces interpretable and diverse instance predictions, enabling enhanced downstream decision-making and planning in autonomous driving systems.
[159] Bias Analysis for Synthetic Face Detection: A Case Study of the Impact of Facial Attribute
Asmae Lamsaf, Lucia Cascone, Hugo Proença, João Neves
Main category: cs.CV
TL;DR: The paper addresses bias in synthetic face detectors, proposing an evaluation framework to analyze bias across facial attributes and providing a case study of five detectors.
Details
Motivation: Existing synthetic face detection models and datasets may exhibit bias, leading to detection failures for certain demographic groups, raising social, legal, and ethical concerns.
Method: An evaluation framework using synthetic data with evenly distributed attributes is introduced to mitigate data skew. A case study evaluates five state-of-the-art detectors across 25 facial attributes.
Result: Synthetic face detectors are generally biased toward specific facial attributes. The study identifies origins of bias through training set analysis and detector activation maps.
Conclusion: The framework highlights bias in synthetic face detectors and provides insights into its origins, emphasizing the need for balanced training data and unbiased models.
Abstract: Bias analysis for synthetic face detection is bound to become a critical topic in the coming years. Although many detection models have been developed and several datasets have been released to reliably identify synthetic content, one crucial aspect has been largely overlooked: these models and training datasets can be biased, leading to failures in detection for certain demographic groups and raising significant social, legal, and ethical issues. In this work, we introduce an evaluation framework to contribute to the analysis of bias of synthetic face detectors with respect to several facial attributes. This framework exploits synthetic data generation, with evenly distributed attribute labels, for mitigating any skew in the data that could otherwise influence the outcomes of bias analysis. We build on the proposed framework to provide an extensive case study of the bias level of five state-of-the-art detectors in synthetic datasets with 25 controlled facial attributes. While the results confirm that, in general, synthetic face detectors are biased towards the presence/absence of specific facial attributes, our study also sheds light on the origins of the observed bias through the analysis of the correlations with the balancing of facial attributes in the training sets of the detectors, and the analysis of detector activation maps in image pairs with controlled attribute modifications.
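The measurement at the heart of such a framework is simple to sketch: evaluate a detector on attribute-balanced synthetic splits and report the per-attribute performance spread. The field names and accuracy metric below are illustrative assumptions:

```python
# Hedged sketch of per-attribute disparity measurement on balanced splits.
def attribute_bias_report(records):
    """records: list of dicts with keys 'attribute' (e.g. 'beard=yes')
    and 'correct' (bool). Returns per-attribute accuracy and the max gap."""
    per_attr = {}
    for r in records:
        hits, total = per_attr.get(r["attribute"], (0, 0))
        per_attr[r["attribute"]] = (hits + int(r["correct"]), total + 1)
    acc = {a: h / t for a, (h, t) in per_attr.items()}
    gap = max(acc.values()) - min(acc.values())
    return acc, gap

# Example: a detector that fails more often when glasses are present.
records = [{"attribute": "glasses=no", "correct": True}] * 90 \
        + [{"attribute": "glasses=no", "correct": False}] * 10 \
        + [{"attribute": "glasses=yes", "correct": True}] * 70 \
        + [{"attribute": "glasses=yes", "correct": False}] * 30
acc, gap = attribute_bias_report(records)  # gap is ~0.2 here
```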
[160] SAMwave: Wavelet-Driven Feature Enrichment for Effective Adaptation of Segment Anything Model
Saurabh Yadav, Avi Gupta, Koteswar Rao Jerripothula
Main category: cs.CV
TL;DR: SAMwave introduces a wavelet-based approach to enhance SAM’s performance in complex tasks by extracting multi-scale high-frequency features, outperforming existing methods.
Details
Motivation: Foundation models like SAM degrade in performance for untrained complex tasks, and current adapter-based fine-tuning methods are limited.
Method: SAMwave uses wavelet transforms and complex-valued adapters to extract richer spatial-frequency information.
Result: SAMwave significantly outperforms existing methods on low-level vision tasks, proving its efficiency and flexibility.
Conclusion: SAMwave offers a superior, interpretable solution for adapting SAM to complex tasks, validated across multiple backbones and tasks.
Abstract: The emergence of large foundation models has propelled significant advances in various domains. The Segment Anything Model (SAM), a leading model for image segmentation, exemplifies these advances, outperforming traditional methods. However, such foundation models often suffer from performance degradation when applied to complex tasks for which they are not trained. Existing methods typically employ adapter-based fine-tuning strategies to adapt SAM for tasks and leverage high-frequency features extracted from the Fourier domain. However, our analysis reveals that these approaches offer limited benefits due to constraints in their feature extraction techniques. To overcome this, we propose SAMwave, a novel and interpretable approach that utilizes the wavelet transform to extract richer, multi-scale high-frequency features from input data. Extending this, we introduce complex-valued adapters capable of capturing complex-valued spatial-frequency information via complex wavelet transforms. By adaptively integrating these wavelet coefficients, SAMwave enables SAM’s encoder to capture information more relevant for dense prediction. Empirical evaluations on four challenging low-level vision tasks demonstrate that SAMwave significantly outperforms existing adaptation methods. This superior performance is consistent across both the SAM and SAM2 backbones and holds for both real and complex-valued adapter variants, highlighting the efficiency, flexibility, and interpretability of our proposed method for adapting segment anything models.
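For intuition about the wavelet features involved, a minimal single-level Haar DWT is sketched below; SAMwave's actual multi-scale and complex-valued wavelet transforms are richer than this toy version:

```python
# Minimal single-level 2-D Haar DWT, shown only to make concrete what
# "high-frequency features" means here. Assumes even H and W.
import torch

def haar_dwt2(x):
    """x: (B, C, H, W). Returns (LL, LH, HL, HH): the low-pass band and
    three high-frequency detail bands at half resolution."""
    a = x[..., 0::2, 0::2]  # top-left of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2   # horizontal detail
    hl = (a + b - c - d) / 2   # vertical detail
    hh = (a - b - c + d) / 2   # diagonal detail
    return ll, lh, hl, hh

x = torch.randn(1, 3, 64, 64)
ll, lh, hl, hh = haar_dwt2(x)
high_freq = torch.cat([lh, hl, hh], dim=1)  # features an adapter could ingest
```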
[161] Quaternion-Based Robust PCA for Efficient Moving Target Detection and Background Recovery in Color Videos
Liyang Wang, Shiqian Wu, Shun Fang, Qile Zhu, Jiaxin Wu, Sos Agaian
Main category: cs.CV
TL;DR: The paper introduces uQRPCA+, a method for moving target detection and background recovery in color videos, reducing computational costs and achieving SOTA performance.
Details
Motivation: To address the high computational costs of QSVD in color video processing and improve the generalization of deep models by enriching datasets with synthetic data.
Method: Proposes uQRPCA and uQRPCA+ frameworks, utilizing a quaternion Riemannian manifold to reduce QSVD complexity and introducing CR1B for ideal low-rank background recovery.
Result: uQRPCA+ achieves SOTA performance in moving target detection and background recovery tasks.
Conclusion: The uQRPCA+ framework effectively balances segmentation and background recovery, offering computational efficiency and high performance.
Abstract: Moving target detection is a challenging computer vision task aimed at generating accurate segmentation maps in diverse in-the-wild color videos captured by static cameras. If backgrounds and targets can be simultaneously extracted and recombined, such synthetic data can significantly enrich annotated in-the-wild datasets and enhance the generalization ability of deep models. Quaternion-based RPCA (QRPCA) is a promising unsupervised paradigm for color image processing. However, in color video processing, Quaternion Singular Value Decomposition (QSVD) incurs high computational costs, and rank-1 quaternion matrix fails to yield rank-1 color channels. In this paper, we reduce the computational complexity of QSVD to O(1) by utilizing a quaternion Riemannian manifold. Furthermore, we propose the universal QRPCA (uQRPCA) framework, which achieves a balance in simultaneously segmenting targets and recovering backgrounds from color videos. Moreover, we expand to uQRPCA+ by introducing the Color Rank-1 Batch (CR1B) method to further process and obtain the ideal low-rank background across color channels. Experiments demonstrate our uQRPCA+ achieves State Of The Art (SOTA) performance on moving target detection and background recovery tasks compared to existing open-source methods. Our implementation is publicly available on GitHub at https://github.com/Ruchtech/uQRPCA
[162] Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search
Shuyu Yang, Yaxiong Wang, Li Zhu, Zhedong Zheng
Main category: cs.CV
TL;DR: A new task, text-based person anomaly search, is introduced to identify both routine and abnormal behaviors using text descriptions. A large-scale Pedestrian Anomaly Behavior (PAB) benchmark is created, and a pose-aware framework achieves 84.93% recall@1 accuracy.
Details
Motivation: Current benchmarks for text-based person search focus on common actions, ignoring the need to identify abnormal behaviors in real-world scenarios.
Method: A cross-modal pose-aware framework integrates human pose patterns with identity-based hard negative pair sampling. The PAB benchmark includes synthesized and real-world image-text pairs.
Result: The proposed method achieves 84.93% recall@1 accuracy on the PAB benchmark, outperforming other methods.
Conclusion: The PAB benchmark and pose-aware framework effectively address the gap in identifying abnormal behaviors, with synthetic data aiding fine-grained retrieval.
Abstract: Text-based person search aims to retrieve specific individuals across camera networks using natural language descriptions. However, current benchmarks often exhibit biases towards common actions like walking or standing, neglecting the critical need for identifying abnormal behaviors in real-world scenarios. To meet such demands, we propose a new task, text-based person anomaly search, locating pedestrians engaged in routine or anomalous activities via text. To enable the training and evaluation of this new task, we construct a large-scale image-text Pedestrian Anomaly Behavior (PAB) benchmark, featuring a broad spectrum of actions, e.g., running, performing, playing soccer, and the corresponding anomalies, e.g., lying, being hit, and falling of the same identity. The training set of PAB comprises 1,013,605 synthesized image-text pairs of both normalities and anomalies, while the test set includes 1,978 real-world image-text pairs. To validate the potential of PAB, we introduce a cross-modal pose-aware framework, which integrates human pose patterns with identity-based hard negative pair sampling. Extensive experiments on the proposed benchmark show that synthetic training data facilitates the fine-grained behavior retrieval, and the proposed pose-aware method achieves 84.93% recall@1 accuracy, surpassing other competitive methods. The dataset, model, and code are available at https://github.com/Shuyu-XJTU/CMP.
[163] Investigating the Effect of Spatial Context on Multi-Task Sea Ice Segmentation
Behzad Vahedi, Rafael Pires de Lima, Sepideh Jalayer, Walter N. Meier, Andrew P. Barrett, Morteza Karimzadeh
Main category: cs.CV
TL;DR: The study explores how spatial context impacts sea ice segmentation, finding optimal receptive field sizes vary by data resolution and task. Combining SAR and AMSR2 data improves performance.
Details
Motivation: To understand how spatial context affects segmentation of sea ice properties and optimize deep learning models for geospatial tasks.
Method: Uses Atrous Spatial Pyramid Pooling with varying atrous rates to control receptive field size, testing on Sentinel-1 SAR and AMSR2 data for multi-task segmentation.
Result: Smaller receptive fields work best for high-resolution Sentinel-1 data, medium for stage of development, and larger fields often reduce performance. SAR-AMSR2 fusion enhances results.
Conclusion: Appropriate spatial context selection is crucial for sea ice mapping, with insights for optimizing deep learning in geospatial applications.
Abstract: Capturing spatial context at multiple scales is crucial for deep learning-based sea ice segmentation. However, the optimal specification of spatial context based on observation resolution and task characteristics remains underexplored. This study investigates the impact of spatial context on the segmentation of sea ice concentration, stage of development, and floe size using a multi-task segmentation model. We implement Atrous Spatial Pyramid Pooling with varying atrous rates to systematically control the receptive field size of convolutional operations, and to capture multi-scale contextual information. We explore the interactions between spatial context and feature resolution for different sea ice properties and examine how spatial context influences segmentation performance across different input feature combinations from Sentinel-1 SAR and Advanced Microwave Radiometer-2 (AMSR2) for multi-task mapping. Using Gradient-weighted Class Activation Mapping, we visualize how atrous rates influence model decisions. Our findings indicate that smaller receptive fields excel for high-resolution Sentinel-1 data, while medium receptive fields yield better performance for stage of development segmentation, and larger receptive fields often lead to diminished performance. The fusion of SAR and AMSR2 enhances segmentation across all tasks. We highlight the value of lower-resolution 18.7 and 36.5 GHz AMSR2 channels in sea ice mapping. These findings highlight the importance of selecting appropriate spatial context based on observation resolution and target properties in sea ice mapping. By systematically analyzing receptive field effects in a multi-task setting, our study provides insights for optimizing deep learning models in geospatial applications.
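The receptive-field control described here rests on ASPP's dilated convolutions. A minimal sketch of such a block with configurable atrous rates (channel sizes are arbitrary placeholders):

```python
# Sketch of an ASPP block with configurable atrous rates, the mechanism
# the study varies to control receptive field size.
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # One dilated 3x3 conv per atrous rate; larger rates widen the
        # receptive field without extra parameters or downsampling.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

aspp = ASPP(in_ch=64, out_ch=32, rates=(1, 6, 12, 18))
y = aspp(torch.randn(2, 64, 128, 128))  # -> (2, 32, 128, 128)
```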
[164] Leveraging Sparse LiDAR for RAFT-Stereo: A Depth Pre-Fill Perspective
Jinsu Yoo, Sooyoung Jeon, Zanming Huang, Tai-Yu Pan, Wei-Lun Chao
Main category: cs.CV
TL;DR: The paper introduces GRAFT-Stereo, a method to improve stereo matching accuracy by using LiDAR guidance, addressing challenges with sparse LiDAR data through interpolation and early fusion techniques.
Details
Motivation: To enhance stereo matching accuracy by leveraging LiDAR depth data, especially under sparse LiDAR conditions.
Method: Injects LiDAR depth into the initial disparity map and image features, using interpolation for sparse data and distinct pre-filling approaches for each case.
Result: GRAFT-Stereo outperforms existing LiDAR-guided methods in sparse LiDAR scenarios across multiple datasets.
Conclusion: The study provides insights and solutions for effective LiDAR-guided stereo methods, inspiring further research.
Abstract: We investigate LiDAR guidance within the RAFT-Stereo framework, aiming to improve stereo matching accuracy by injecting precise LiDAR depth into the initial disparity map. We find that the effectiveness of LiDAR guidance drastically degrades when the LiDAR points become sparse (e.g., a few hundred points per frame), and we offer a novel explanation from a signal processing perspective. This insight leads to a surprisingly simple solution that enables LiDAR-guided RAFT-Stereo to thrive: pre-filling the sparse initial disparity map with interpolation. Interestingly, we find that pre-filling is also effective when injecting LiDAR depth into image features via early fusion, but for a fundamentally different reason, necessitating a distinct pre-filling approach. By combining both solutions, the proposed Guided RAFT-Stereo (GRAFT-Stereo) significantly outperforms existing LiDAR-guided methods under sparse LiDAR conditions across various datasets. We hope this study inspires more effective LiDAR-guided stereo methods.
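The pre-fill idea itself is simple to sketch: densify the sparse disparity map by interpolation before the stereo network sees it. Nearest-neighbor interpolation below is one plausible choice, not necessarily the scheme the paper uses:

```python
# Sketch of pre-filling a sparse LiDAR disparity map by interpolation.
import numpy as np
from scipy.interpolate import griddata

def prefill_disparity(sparse_disp):
    """sparse_disp: (H, W) array, 0 where no LiDAR point projects."""
    h, w = sparse_disp.shape
    ys, xs = np.nonzero(sparse_disp)
    grid_y, grid_x = np.mgrid[0:h, 0:w]
    return griddata(
        points=np.stack([ys, xs], axis=1),
        values=sparse_disp[ys, xs],
        xi=(grid_y, grid_x),
        method="nearest",
    )

sparse = np.zeros((48, 64))
sparse[10, 20], sparse[30, 50] = 12.5, 4.0  # two LiDAR hits
dense = prefill_disparity(sparse)           # every pixel gets a value
```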
[165] MTCAE-DFER: Multi-Task Cascaded Autoencoder for Dynamic Facial Expression Recognition
Peihao Xiang, Kaida Wu, Ou Bai
Main category: cs.CV
TL;DR: MTCAE-DFER enhances dynamic facial expression recognition using a cascaded decoder module with ViT and VideoMAE, improving global-local feature interaction and reducing overfitting.
Details
Motivation: To improve dynamic facial expression recognition by integrating global and local dynamic features and addressing overfitting in large models.
Method: Uses a cascaded decoder module based on ViT and VideoMAE, with decoder outputs as queries and encoder outputs as keys/values for feature interaction.
Result: Proven robustness and effectiveness through ablation experiments and SOTA comparisons on public datasets.
Conclusion: MTCAE-DFER successfully enhances recognition by leveraging global-local feature interaction and multi-task learning.
Abstract: This paper expands the cascaded network branch of the autoencoder-based multi-task learning (MTL) framework for dynamic facial expression recognition, namely Multi-Task Cascaded Autoencoder for Dynamic Facial Expression Recognition (MTCAE-DFER). MTCAE-DFER builds a plug-and-play cascaded decoder module, which is based on the Vision Transformer (ViT) architecture and employs the decoder concept of Transformer to reconstruct the multi-head attention module. The decoder output from the previous task serves as the query (Q), representing local dynamic features, while the Video Masked Autoencoder (VideoMAE) shared encoder output acts as both the key (K) and value (V), representing global dynamic features. This setup facilitates interaction between global and local dynamic features across related tasks. Additionally, this design aims to alleviate overfitting in complex large models. We utilize an autoencoder-based multi-task cascaded learning approach to explore the impact of dynamic face detection and dynamic face landmark detection on dynamic facial expression recognition, which enhances the model’s generalization ability. Extensive ablation experiments and comparisons with state-of-the-art (SOTA) methods on various public datasets for dynamic facial expression recognition demonstrate the robustness of the MTCAE-DFER model and the effectiveness of global-local dynamic feature interaction among related tasks.
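The Q/K/V wiring described above can be sketched directly with a standard cross-attention layer; dimensions below are placeholders, not the model's actual sizes:

```python
# Minimal sketch of the cascaded cross-attention: the previous task's
# decoder output queries the shared VideoMAE encoder features.
import torch
import torch.nn as nn

dim = 256
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

prev_task_tokens = torch.randn(2, 49, dim)     # Q: local dynamic features
shared_encoder_out = torch.randn(2, 196, dim)  # K and V: global dynamic features

fused, _ = cross_attn(query=prev_task_tokens,
                      key=shared_encoder_out,
                      value=shared_encoder_out)
# `fused` would feed the next task's head (e.g., landmarks -> expression).
```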
[166] The Importance of Facial Features in Vision-based Sign Language Recognition: Eyes, Mouth or Full Face?
Dinh Nam Pham, Eleftherios Avramidis
Main category: cs.CV
TL;DR: The paper explores the role of non-manual facial features in automatic sign language recognition (ASLR), identifying the mouth as the most impactful feature for improving accuracy.
Details
Motivation: Non-manual facial features are crucial in sign language but underexplored in ASLR. Prior work lacks systematic analysis of distinct facial regions.
Method: Uses two deep learning models (CNN and transformer) on an SLR dataset to evaluate contributions of eyes, mouth, and full face.
Result: The mouth is the most important non-manual feature, significantly boosting recognition accuracy.
Conclusion: Facial features, especially the mouth, are essential for improving ASLR systems.
Abstract: Non-manual facial features play a crucial role in sign language communication, yet their importance in automatic sign language recognition (ASLR) remains underexplored. While prior studies have shown that incorporating facial features can improve recognition, related work often relies on hand-crafted feature extraction and fails to go beyond the comparison of manual features versus the combination of manual and facial features. In this work, we systematically investigate the contribution of distinct facial regions (eyes, mouth, and full face) using two different deep learning models (a CNN-based model and a transformer-based model) trained on an SLR dataset of isolated signs with randomly selected classes. Through quantitative performance and qualitative saliency map evaluation, we reveal that the mouth is the most important non-manual facial feature, significantly improving accuracy. Our findings highlight the necessity of incorporating facial features in ASLR.
[167] Latest Object Memory Management for Temporally Consistent Video Instance Segmentation
Seunghun Lee, Jiwan Seo, Minwoo Choi, Kiljoon Han, Jaehoon Jeong, Zane Durante, Ehsan Adeli, Sang Hyun Park, Sunghoon Im
Main category: cs.CV
TL;DR: LOMM introduces Latest Object Memory (LOM) and Decoupled Object Association (DOA) for improved video instance segmentation, achieving state-of-the-art results.
Details
Motivation: Enhancing long-term instance tracking in video instance segmentation (VIS) by addressing challenges like object presence modeling and identity consistency.
Method: Uses Latest Object Memory (LOM) to track and update object states, and Decoupled Object Association (DOA) to manage new and existing objects separately.
Result: Achieves a state-of-the-art AP score of 54.0 on YouTube-VIS 2022, outperforming traditional methods.
Conclusion: LOMM sets a new benchmark in VIS by improving tracking accuracy and identity consistency in dynamic scenes.
Abstract: In this paper, we present Latest Object Memory Management (LOMM) for temporally consistent video instance segmentation that significantly improves long-term instance tracking. At the core of our method is Latest Object Memory (LOM), which robustly tracks and continuously updates the latest states of objects by explicitly modeling their presence in each frame. This enables consistent tracking and accurate identity management across frames, enhancing both performance and reliability through the VIS process. Moreover, we introduce Decoupled Object Association (DOA), a strategy that separately handles newly appearing and already existing objects. By leveraging our memory system, DOA accurately assigns object indices, improving matching accuracy and ensuring stable identity consistency, even in dynamic scenes where objects frequently appear and disappear. Extensive experiments and ablation studies demonstrate the superiority of our method over traditional approaches, setting a new benchmark in VIS. Notably, our LOMM achieves a state-of-the-art AP score of 54.0 on YouTube-VIS 2022, a dataset known for its challenging long videos. Project page: https://seung-hun-lee.github.io/projects/LOMM/
[168] MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance
Quanhao Li, Zhen Xing, Rui Wang, Hui Zhang, Qi Dai, Zuxuan Wu
Main category: cs.CV
TL;DR: MagicMotion is a novel image-to-video framework for trajectory-controllable video generation, addressing limitations in existing methods by supporting multi-object motion control and offering diverse trajectory formats. It introduces MagicData (a dataset) and MagicBench (a benchmark) for robust evaluation.
Details
Motivation: Existing methods struggle with complex and multi-object motion control, lack diverse trajectory formats, and suffer from poor object consistency and visual quality. The absence of a dedicated dataset or benchmark further limits progress.
Method: MagicMotion uses three levels of trajectory conditions (masks, bounding boxes, sparse boxes) to animate objects along defined paths while maintaining quality. It includes MagicData (dataset) and MagicBench (benchmark) for training and evaluation.
Result: MagicMotion outperforms previous methods in video quality and trajectory control accuracy, as demonstrated by extensive experiments.
Conclusion: MagicMotion advances trajectory-controllable video generation by addressing key challenges, supported by a new dataset and benchmark, and achieves superior performance.
Abstract: Recent advances in video generation have led to remarkable improvements in visual quality and temporal coherence. Building on this, trajectory-controllable video generation has emerged to enable precise object motion control through explicitly defined spatial paths. However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and compromised visual quality. Furthermore, these methods only support trajectory control in a single format, limiting their applicability in diverse scenarios. Additionally, there is no publicly available dataset or benchmark specifically tailored for trajectory-controllable video generation, hindering robust training and systematic evaluation. To address these challenges, we introduce MagicMotion, a novel image-to-video generation framework that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes. Given an input image and trajectories, MagicMotion seamlessly animates objects along defined trajectories while maintaining object consistency and visual quality. Furthermore, we present MagicData, a large-scale trajectory-controlled video dataset, along with an automated pipeline for annotation and filtering. We also introduce MagicBench, a comprehensive benchmark that assesses both video quality and trajectory control accuracy across different numbers of objects. Extensive experiments demonstrate that MagicMotion outperforms previous methods across various metrics. Our project page is publicly available at https://quanhaol.github.io/magicmotion-site.
[169] MoFRR: Mixture of Diffusion Models for Face Retouching Restoration
Jiaxin Liu, Qichao Ying, Zhenxing Qian, Sheng Li, Runqi Zhang, Jian Liu, Xinpeng Zhang
Main category: cs.CV
TL;DR: The paper introduces Face Retouching Restoration (FRR), a task to recover original faces from retouched images, and proposes MoFRR, a method using specialized and shared experts for restoration.
Details
Motivation: Addressing the lack of methods to accurately restore original faces from retouched images on social media.
Method: MoFRR employs a mixture of diffusion models with specialized experts for distinct retouching types and a shared expert for universal traces, using dual-branch structures for low and high-frequency restoration.
Result: MoFRR shows effectiveness on the RetouchingFFHQ++ dataset, demonstrating successful restoration of retouched faces.
Conclusion: The proposed MoFRR effectively tackles the FRR task, offering a robust solution for restoring original faces from retouched images.
Abstract: The widespread use of face retouching on social media platforms raises concerns about the authenticity of face images. While existing methods focus on detecting face retouching, how to accurately recover the original faces from the retouched ones has yet to be answered. This paper introduces Face Retouching Restoration (FRR), a novel computer vision task aimed at restoring original faces from their retouched counterparts. FRR differs from traditional image restoration tasks by addressing the complex retouching operations with various types and degrees, which focuses more on the restoration of the low-frequency information of the faces. To tackle this challenge, we propose MoFRR, Mixture of Diffusion Models for FRR. Inspired by DeepSeek’s expert isolation strategy, the MoFRR uses sparse activation of specialized experts handling distinct retouching types and the engagement of a shared expert dealing with universal retouching traces. Each specialized expert follows a dual-branch structure with a DDIM-based low-frequency branch guided by an Iterative Distortion Evaluation Module (IDEM) and a Cross-Attention-based High-Frequency branch (HFCAM) for detail refinement. Extensive experiments on a newly constructed face retouching dataset, RetouchingFFHQ++, demonstrate the effectiveness of MoFRR for FRR.
[170] Latent Multimodal Reconstruction for Misinformation Detection
Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos, Panagiotis C. Petrantonakis
Main category: cs.CV
TL;DR: The paper introduces ‘MisCaption This!’, a dataset of LVLM-generated miscaptioned images, and ‘LAMAR’, a reconstruction-based network for multimodal misinformation detection, achieving state-of-the-art results.
Details
Motivation: Addressing the lack of realistic synthetic data for multimodal misinformation detection (MMD) by leveraging Large Vision-Language Models (LVLMs) to generate diverse examples.
Method: Uses LVLMs to create synthetic miscaptioned images (‘MisCaption This!’) and proposes ‘LAMAR’, a network reconstructing truthful caption embeddings for improved detection.
Result: Models trained on ‘MisCaption This!’ generalize better to real-world misinformation; LAMAR achieves state-of-the-art performance on NewsCLIPpings and VERITE benchmarks.
Conclusion: LVLM-generated data and reconstruction-based networks like LAMAR significantly advance MMD, offering robust solutions for misinformation detection.
Abstract: Multimodal misinformation, such as miscaptioned images, where captions misrepresent an image’s origin, context, or meaning, poses a growing challenge in the digital age. To support fact-checkers, researchers have focused on developing datasets and methods for multimodal misinformation detection (MMD). Due to the scarcity of large-scale annotated MMD datasets, recent approaches rely on synthetic training data created via out-of-context pairings or named entity manipulations (e.g., altering names, dates, or locations). However, these often yield simplistic examples that lack real-world complexity, limiting model robustness. Meanwhile, Large Vision-Language Models (LVLMs) remain underexplored for generating diverse and realistic synthetic data for MMD. To address this, we introduce “MisCaption This!”, a collection of LVLM-generated miscaptioned image datasets. Additionally, we introduce “Latent Multimodal Reconstruction” (LAMAR), a network trained to reconstruct the embeddings of truthful captions, providing a strong auxiliary signal to guide detection. We explore various training strategies (end-to-end vs. large-scale pre-training) and integration mechanisms (direct, mask, gate, and attention). Extensive experiments show that models trained on “MisCaption This!” generalize better to real-world misinformation while LAMAR achieves new state-of-the-art on both NewsCLIPpings and VERITE benchmarks; highlighting the value of LVLM-generated data and reconstruction-based networks for advancing MMD. Our code is available at https://github.com/stevejpapad/miscaptioned-image-reconstruction
[171] Self-Guided Masked Autoencoder
Jeongwoo Shin, Inseo Lee, Junho Lee, Joonseok Lee
Main category: cs.CV
TL;DR: The paper analyzes MAE’s learning process, revealing it learns patch-level clustering early. A self-guided MAE is proposed, improving learning without external aids.
Details
Motivation: To uncover what and how MAE learns and enhance its self-supervised learning process.
Method: Proposes a self-guided MAE that generates informed masks using patch clustering progress, replacing random masking.
Result: The method significantly boosts learning without external models, validated by downstream task experiments.
Conclusion: Self-guided MAE improves MAE’s learning process while maintaining its self-supervised nature.
Abstract: Masked Autoencoder (MAE) is a self-supervised approach for representation learning, widely applicable to a variety of downstream tasks in computer vision. In spite of its success, what and how MAE actually learns is still not fully understood. In this paper, with an in-depth analysis, we discover that MAE intrinsically learns pattern-based patch-level clustering from surprisingly early stages of pretraining. Upon this understanding, we propose self-guided masked autoencoder, which internally generates an informed mask by utilizing its progress in patch clustering, substituting the naive random masking of the vanilla MAE. Our approach significantly boosts its learning process without relying on any external models or supplementary information, keeping the benefit of self-supervised nature of MAE intact. Comprehensive experiments on various downstream tasks verify the effectiveness of the proposed method.
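One simple way to picture "informed masking from patch clustering" is to cluster the patch embeddings and preferentially mask whole clusters, as sketched below; this is an illustrative guess at the mechanism, not the paper's exact masking rule:

```python
# Illustrative sketch: cluster patch embeddings, then mask clusters
# greedily until the mask budget is reached.
import torch

def kmeans_labels(feats, k=4, iters=10):
    """feats: (N, D). Plain Lloyd's algorithm, returns (N,) labels."""
    centers = feats[torch.randperm(feats.shape[0])[:k]]
    for _ in range(iters):
        labels = torch.cdist(feats, centers).argmin(dim=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = feats[labels == c].mean(dim=0)
    return labels

def cluster_guided_mask(patch_feats, mask_ratio=0.75, k=4):
    labels = kmeans_labels(patch_feats, k)
    n_mask = int(mask_ratio * patch_feats.shape[0])
    masked = []
    # Visit clusters in random order and mask them whole until the budget.
    for c in torch.randperm(k).tolist():
        idx = (labels == c).nonzero(as_tuple=True)[0].tolist()
        masked += idx[: n_mask - len(masked)]
        if len(masked) >= n_mask:
            break
    return torch.tensor(masked)

mask_idx = cluster_guided_mask(torch.randn(196, 128))  # 196 ViT patches
```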
[172] CP-LLM: Context and Pixel Aware Large Language Model for Video Quality Assessment
Wen Wen, Yaohong Wu, Yue Sheng, Neil Birkbeck, Balu Adsumilli, Yilin Wang
Main category: cs.CV
TL;DR: CP-LLM is a multimodal LLM for video quality assessment, combining pixel-level and contextual analysis to improve accuracy and interpretability.
Details
Motivation: Existing VQA models lack contextual understanding or sensitivity to small distortions, limiting their effectiveness.
Method: CP-LLM uses dual vision encoders for high-level and low-level analysis, with a language decoder to integrate insights, trained via multi-task learning.
Result: CP-LLM achieves state-of-the-art performance on VQA benchmarks, with superior robustness to pixel distortions.
Conclusion: CP-LLM offers a comprehensive and practical solution for real-world video quality assessment.
Abstract: Video quality assessment (VQA) is a challenging research topic with broad applications. Effective VQA necessitates sensitivity to pixel-level distortions and a comprehensive understanding of video context to accurately determine the perceptual impact of distortions. Traditional hand-crafted and learning-based VQA models mainly focus on pixel-level distortions and lack contextual understanding, while recent LLM-based models struggle with sensitivity to small distortions or handle quality scoring and description as separate tasks. To address these shortcomings, we introduce CP-LLM: a Context and Pixel aware Large Language Model. CP-LLM is a novel multimodal LLM architecture featuring dual vision encoders designed to independently analyze perceptual quality at both high-level (video context) and low-level (pixel distortion) granularity, along with a language decoder that subsequently reasons about the interplay between these aspects. This design enables CP-LLM to simultaneously produce robust quality scores and interpretable quality descriptions, with enhanced sensitivity to pixel distortions (e.g. compression artifacts). The model is trained via a multi-task pipeline optimizing for score prediction, description generation, and pairwise comparisons. Experiment results demonstrate that CP-LLM achieves state-of-the-art cross-dataset performance on established VQA benchmarks and superior robustness to pixel distortions, confirming its efficacy for comprehensive and practical video quality assessment in real-world scenarios.
[173] HydraMamba: Multi-Head State Space Model for Global Point Cloud Learning
Kanglin Qu, Pan Gao, Qun Dai, Yuanhao Sun
Main category: cs.CV
TL;DR: HydraMamba introduces a state space model-based point cloud network to improve long-range dependency modeling and locality learning in point cloud tasks.
Details
Motivation: Existing attention mechanisms in point cloud learning suffer from quadratic complexity and limited inter-point interactions, while current state space models lack proper point cloud serialization and locality learning.
Method: HydraMamba employs a shuffle serialization strategy for better adaptation to S6’s causal nature, a ConvBiS6 layer for local and global dependency capture, and MHS6 for enhanced modeling.
Result: HydraMamba achieves state-of-the-art performance on object-level and scene-level tasks.
Conclusion: HydraMamba effectively addresses the challenges of long-range dependency and locality learning in point cloud networks, demonstrating superior performance.
Abstract: The attention mechanism has become a dominant operator in point cloud learning, but its quadratic complexity leads to limited inter-point interactions, hindering long-range dependency modeling between objects. Due to excellent long-range modeling capability with linear complexity, the selective state space model (S6), as the core of Mamba, has been exploited in point cloud learning for long-range dependency interactions over the entire point cloud. Despite some significant progress, related works still suffer from imperfect point cloud serialization and lack of locality learning. To this end, we explore a state space model-based point cloud network termed HydraMamba to address the above challenges. Specifically, we design a shuffle serialization strategy, making unordered point sets better adapted to the causal nature of S6. Meanwhile, to overcome the deficiency of existing techniques in locality learning, we propose a ConvBiS6 layer, which is capable of capturing local geometries and global context dependencies synergistically. Besides, we propose MHS6 by extending the multi-head design to S6, further enhancing its modeling capability. HydraMamba achieves state-of-the-art results on various tasks at both object-level and scene-level. The code is available at https://github.com/Point-Cloud-Learning/HydraMamba.
[174] JDATT: A Joint Distillation Framework for Atmospheric Turbulence Mitigation and Target Detection
Zhiming Liu, Paul Hill, Nantheera Anantrasirichai
Main category: cs.CV
TL;DR: JDATT is a joint distillation framework combining turbulence mitigation and target detection, reducing model complexity for real-time use.
Details
Motivation: Address inefficiencies and high computational costs of separate turbulence mitigation and detection methods.
Method: Integrates AT mitigation and detection modules with hybrid distillation (CWD, MGD, and KL divergence).
Result: Achieves better visual restoration and detection accuracy with reduced model size and inference time.
Conclusion: JDATT is efficient for real-time applications in resource-constrained settings.
Abstract: Atmospheric turbulence (AT) introduces severe degradations, such as rippling, blur, and intensity fluctuations, that hinder both image quality and downstream vision tasks like target detection. While recent deep learning-based approaches have advanced AT mitigation using transformer and Mamba architectures, their high complexity and computational cost make them unsuitable for real-time applications, especially in resource-constrained settings such as remote surveillance. Moreover, the common practice of separating turbulence mitigation and object detection leads to inefficiencies and suboptimal performance. To address these challenges, we propose JDATT, a Joint Distillation framework for Atmospheric Turbulence mitigation and Target detection. JDATT integrates state-of-the-art AT mitigation and detection modules and introduces a unified knowledge distillation strategy that compresses both components while minimizing performance loss. We employ a hybrid distillation scheme: feature-level distillation via Channel-Wise Distillation (CWD) and Masked Generative Distillation (MGD), and output-level distillation via Kullback-Leibler divergence. Experiments on synthetic and real-world turbulence datasets demonstrate that JDATT achieves superior visual restoration and detection accuracy while significantly reducing model size and inference time, making it well-suited for real-time deployment.
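The two distillation signals named above are standard and easy to sketch: channel-wise distillation over feature maps and KL divergence over outputs. Temperatures, weights, and shapes below are placeholders:

```python
# Sketch of the feature-level (CWD) and output-level (KL) distillation
# losses the framework combines; MGD is omitted here.
import torch
import torch.nn.functional as F

def cwd_loss(student_feat, teacher_feat, tau=4.0):
    """Channel-Wise Distillation: match per-channel spatial distributions.
    Feature maps have shape (B, C, H, W)."""
    b, c, h, w = student_feat.shape
    s = F.log_softmax(student_feat.reshape(b, c, h * w) / tau, dim=-1)
    t = F.softmax(teacher_feat.reshape(b, c, h * w) / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau ** 2

def output_kl_loss(student_logits, teacher_logits, tau=1.0):
    s = F.log_softmax(student_logits / tau, dim=-1)
    t = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau ** 2

s_feat = torch.randn(2, 64, 32, 32)
t_feat = torch.randn(2, 64, 32, 32)
loss = cwd_loss(s_feat, t_feat) \
     + output_kl_loss(torch.randn(2, 10), torch.randn(2, 10))
```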
[175] InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing
Kun-Hsiang Lin, Yu-Wen Tseng, Kang-Yang Huang, Jhih-Ciang Wu, Wen-Huang Cheng
Main category: cs.CV
TL;DR: InstructFLIP is a novel framework for face anti-spoofing (FAS) that uses vision-language models (VLMs) and meta-domain learning to improve generalization and reduce training redundancy.
Details
Motivation: Addressing limited semantic understanding of attack types and training redundancy in cross-domain FAS.
Method: Integrates VLMs for better visual input perception and employs a meta-domain strategy for unified learning. Explicitly decouples instructions into content (spoofing semantics) and style (environment/camera variations).
Result: Outperforms SOTA models in accuracy and reduces training redundancy across diverse domains.
Conclusion: InstructFLIP effectively enhances FAS generalization and efficiency, validated by extensive experiments.
Abstract: Face anti-spoofing (FAS) aims to construct a robust system that can withstand diverse attacks. While recent efforts have concentrated mainly on cross-domain generalization, two significant challenges persist: limited semantic understanding of attack types and training redundancy across domains. We address the first by integrating vision-language models (VLMs) to enhance the perception of visual input. For the second challenge, we employ a meta-domain strategy to learn a unified model that generalizes well across multiple domains. Our proposed InstructFLIP is a novel instruction-tuned framework that leverages VLMs to enhance generalization via textual guidance trained solely on a single domain. At its core, InstructFLIP explicitly decouples instructions into content and style components, where content-based instructions focus on the essential semantics of spoofing, and style-based instructions consider variations related to the environment and camera characteristics. Extensive experiments demonstrate the effectiveness of InstructFLIP by outperforming SOTA models in accuracy and substantially reducing training redundancy across diverse domains in FAS. Project website is available at https://kunkunlin1221.github.io/InstructFLIP.
[176] TransFlow: Motion Knowledge Transfer from Video Diffusion Models to Video Salient Object Detection
Suhwan Cho, Minhyeok Lee, Jungho Lee, Sunghun Yang, Sangyoun Lee
Main category: cs.CV
TL;DR: TransFlow transfers motion knowledge from video diffusion models to generate realistic training data for video SOD, improving performance.
Details
Motivation: Training video SOD models is limited by scarce video datasets; existing methods using spatial transformations fail due to unrealistic optical flows.
Method: TransFlow leverages pre-trained video diffusion models to generate semantically-aware optical flows from static images.
Result: The method achieves improved performance across multiple benchmarks.
Conclusion: TransFlow effectively transfers motion knowledge for video SOD, enhancing model training with realistic data.
Abstract: Video salient object detection (SOD) relies on motion cues to distinguish salient objects from backgrounds, but training such models is limited by scarce video datasets compared to abundant image datasets. Existing approaches that use spatial transformations to create video sequences from static images fail for motion-guided tasks, as these transformations produce unrealistic optical flows that lack semantic understanding of motion. We present TransFlow, which transfers motion knowledge from pre-trained video diffusion models to generate realistic training data for video SOD. Video diffusion models have learned rich semantic motion priors from large-scale video data, understanding how different objects naturally move in real scenes. TransFlow leverages this knowledge to generate semantically-aware optical flows from static images, where objects exhibit natural motion patterns while preserving spatial boundaries and temporal coherence. Our method achieves improved performance across multiple benchmarks, demonstrating effective motion knowledge transfer.
[177] HLFormer: Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning
Jun Li, Jinpeng Wang, Chaolei Tan, Niu Lian, Long Chen, Yaowei Wang, Min Zhang, Shu-Tao Xia, Bin Chen
Main category: cs.CV
TL;DR: HLFormer, a hyperbolic modeling framework, improves Partially Relevant Video Retrieval (PRVR) by addressing Euclidean space limitations with hybrid space learning and hierarchical constraints.
Details
Motivation: Existing PRVR methods suffer from geometric distortion in Euclidean space, misrepresenting hierarchical video structures and semantics, leading to suboptimal temporal modeling.
Method: HLFormer uses hyperbolic space learning, integrating Lorentz and Euclidean Attention Blocks, a Mean-Guided Adaptive Interaction Module, and a Partial Order Preservation Loss for hierarchical modeling.
Result: HLFormer outperforms state-of-the-art methods in PRVR tasks.
Conclusion: Hyperbolic modeling effectively enhances hierarchical and cross-modal matching in PRVR, as demonstrated by HLFormer’s superior performance.
Abstract: Partially Relevant Video Retrieval (PRVR) addresses the critical challenge of matching untrimmed videos with text queries describing only partial content. Existing methods suffer from geometric distortion in Euclidean space that sometimes misrepresents the intrinsic hierarchical structure of videos and overlooks certain hierarchical semantics, ultimately leading to suboptimal temporal modeling. To address this issue, we propose the first hyperbolic modeling framework for PRVR, namely HLFormer, which leverages hyperbolic space learning to compensate for the suboptimal hierarchical modeling capabilities of Euclidean space. Specifically, HLFormer integrates the Lorentz Attention Block and Euclidean Attention Block to encode video embeddings in hybrid spaces, using the Mean-Guided Adaptive Interaction Module to dynamically fuse features. Additionally, we introduce a Partial Order Preservation Loss to enforce “text < video” hierarchy through Lorentzian cone constraints. This approach further enhances cross-modal matching by reinforcing partial relevance between video content and text queries. Extensive experiments show that HLFormer outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/ICCV25-HLFormer.
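For reference, the Lorentz-model machinery that such hyperbolic attention blocks build on is standard: the Lorentzian inner product and the geodesic distance it induces. The sketch below is textbook hyperbolic geometry, not HLFormer-specific code:

```python
# Lorentz (hyperboloid) model basics used by hyperbolic attention layers.
import torch

def lorentz_inner(x, y):
    """<x, y>_L = -x0*y0 + sum_i xi*yi, for points on the hyperboloid."""
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(dim=-1)

def lorentz_distance(x, y, eps=1e-7):
    # Points on the unit hyperboloid satisfy <x, x>_L = -1; the geodesic
    # distance is arccosh(-<x, y>_L).
    return torch.acosh(torch.clamp(-lorentz_inner(x, y), min=1.0 + eps))

# Lift Euclidean vectors onto the hyperboloid: x0 = sqrt(1 + ||v||^2).
v = torch.randn(4, 8)
x = torch.cat([(1 + (v ** 2).sum(-1, keepdim=True)).sqrt(), v], dim=-1)
d = lorentz_distance(x[:2], x[2:])
```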
[178] DepthFlow: Exploiting Depth-Flow Structural Correlations for Unsupervised Video Object Segmentation
Suhwan Cho, Minhyeok Lee, Jungho Lee, Donghyeong Kim, Sangyoun Lee
Main category: cs.CV
TL;DR: DepthFlow synthesizes optical flow from single images using depth maps to expand training data for unsupervised video object segmentation, achieving state-of-the-art results.
Details
Motivation: The scarcity of training data for two-stream VOS approaches limits performance. DepthFlow addresses this by leveraging depth-to-flow synthesis.
Method: Estimates depth maps from images, converts them into synthetic flow fields, and uses these to create training pairs for VOS models.
Result: Achieves state-of-the-art performance on all public VOS benchmarks.
Conclusion: DepthFlow provides a scalable solution to data scarcity, enhancing VOS model training.
Abstract: Unsupervised video object segmentation (VOS) aims to detect the most prominent object in a video. Recently, two-stream approaches that leverage both RGB images and optical flow have gained significant attention, but their performance is fundamentally constrained by the scarcity of training data. To address this, we propose DepthFlow, a novel data generation method that synthesizes optical flow from single images. Our approach is driven by the key insight that VOS models depend more on structural information embedded in flow maps than on their geometric accuracy, and that this structure is highly correlated with depth. We first estimate a depth map from a source image and then convert it into a synthetic flow field that preserves essential structural cues. This process enables the transformation of large-scale image-mask pairs into image-flow-mask training pairs, dramatically expanding the data available for network training. By training a simple encoder-decoder architecture with our synthesized data, we achieve new state-of-the-art performance on all public VOS benchmarks, demonstrating a scalable and effective solution to the data scarcity problem.
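The depth-to-flow conversion has a simple geometric reading: under a small virtual camera translation, per-pixel parallax scales with inverse depth, so a depth map already carries the structure of a flow map. A toy sketch under that assumption (constants are illustrative; the paper's synthesis preserves boundaries more carefully):

```python
# Toy depth-to-flow conversion via inverse-depth parallax.
import numpy as np

def depth_to_flow(depth, tx=0.05, ty=0.0, fx=500.0, fy=500.0, eps=1e-6):
    """depth: (H, W) metric depth. Returns flow of shape (H, W, 2)."""
    inv_depth = 1.0 / np.maximum(depth, eps)
    flow_u = fx * tx * inv_depth   # horizontal parallax
    flow_v = fy * ty * inv_depth   # vertical parallax
    return np.stack([flow_u, flow_v], axis=-1)

depth = np.random.uniform(1.0, 10.0, size=(240, 320))
flow = depth_to_flow(depth)  # nearer pixels move more, like real flow
```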
[179] Smaller, Faster, Cheaper: Architectural Designs for Efficient Machine Learning
Steven Walton
Main category: cs.CV
TL;DR: The paper explores architectural principles to enhance computer vision models’ performance while reducing computational demands, focusing on data handling, neural architecture modifications, and leveraging Normalizing Flows.
Details
Motivation: Address the need for efficient models in diverse, resource-constrained environments by reducing computational demands without sacrificing performance.
Method: Three approaches: optimizing data ingress/egress, modifying neural architecture (e.g., restricted attention in vision transformers), and leveraging Normalizing Flows for knowledge distillation.
Result: Demonstrates that careful architectural design can make models smaller, faster, and more cost-effective while maintaining high performance.
Conclusion: Efficient neural architectures can significantly reduce computational demands, enabling broader deployment in resource-constrained settings.
Abstract: Major advancements in the capabilities of computer vision models have been primarily fueled by rapid expansion of datasets, model parameters, and computational budgets, leading to ever-increasing demands on computational infrastructure. However, as these models are deployed in increasingly diverse and resource-constrained environments, there is a pressing need for architectures that can deliver high performance while requiring fewer computational resources. This dissertation focuses on architectural principles through which models can achieve increased performance while reducing their computational demands. We discuss strides towards this goal through three directions. First, we focus on data ingress and egress, investigating how information may be passed into and retrieved from our core neural processing units. This ensures that our models make the most of available data, allowing smaller architectures to become more performant. Second, we investigate modifications to the core neural architecture, applied to restricted attention in vision transformers. This section explores how removing uniform context windows in restricted attention increases the expressivity of the underlying neural architecture. Third, we explore the natural structures of Normalizing Flows and how we can leverage these properties to better distill model knowledge. These contributions demonstrate that careful design of neural architectures can increase the efficiency of machine learning algorithms, allowing them to become smaller, faster, and cheaper.
[180] ForCenNet: Foreground-Centric Network for Document Image Rectification
Peng Cai, Qiang Li, Kaicheng Yang, Dong Guo, Jia Li, Nan Zhou, Xiang An, Ninghua Yang, Jiankang Deng
Main category: cs.CV
TL;DR: ForCenNet, a Foreground-Centric Network, improves document image rectification by focusing on foreground elements, achieving state-of-the-art results on benchmarks.
Details
Motivation: Existing methods overlook foreground elements, which are crucial for geometric references and layout correction in document images.
Method: Proposes a foreground-centric label generation method, a mask mechanism, and a curvature consistency loss to enhance distortion correction.
Result: Achieves state-of-the-art performance on benchmarks (DocUNet, DIR300, WarpDoc, DocReal) and effectively corrects layout elements.
Conclusion: ForCenNet demonstrates the importance of foreground elements in document rectification, offering superior performance and practical utility.
Abstract: Document image rectification aims to eliminate geometric deformation in photographed documents to facilitate text recognition. However, existing methods often neglect the significance of foreground elements, which provide essential geometric references and layout information for document image correction. In this paper, we introduce Foreground-Centric Network (ForCenNet) to eliminate geometric distortions in document images. Specifically, we initially propose a foreground-centric label generation method, which extracts detailed foreground elements from an undistorted image. Then we introduce a foreground-centric mask mechanism to enhance the distinction between readable and background regions. Furthermore, we design a curvature consistency loss to leverage the detailed foreground labels to help the model understand the distorted geometric distribution. Extensive experiments demonstrate that ForCenNet achieves new state-of-the-art on four real-world benchmarks, such as DocUNet, DIR300, WarpDoc, and DocReal. Quantitative analysis shows that the proposed method effectively undistorts layout elements, such as text lines and table borders. The resources for further comparison are provided at https://github.com/caipeng328/ForCenNet.
[181] DS-Det: Single-Query Paradigm and Attention Disentangled Learning for Flexible Object Detection
Guiping Cao, Xiangyuan Lan, Wenjian Huang, Jianguo Zhang, Dongmei Jiang, Yaowei Wang
Main category: cs.CV
TL;DR: DS-Det introduces a flexible Single-Query paradigm to address inefficiencies in transformer detectors caused by fixed queries and Recurrent Opposing Interactions (ROT), improving decoder efficiency and object detection performance.
Details
Motivation: Existing transformer detectors use fixed queries, which suffer from inefficiencies due to ROT and query ambiguity, limiting flexibility and performance.
Method: Proposes DS-Det with a Single-Query paradigm and disentangled attention learning (Cross-Attention for box location, Self-Attention for deduplication) to resolve ROT and ambiguity. Introduces PoCoo loss for prioritizing hard samples.
Result: Demonstrates superior performance on COCO2017 and WiderPerson datasets across five backbone models.
Conclusion: DS-Det effectively addresses decoder inefficiencies and ambiguity, offering a flexible and efficient solution for object detection.
Abstract: Popular transformer detectors have achieved promising performance through query-based learning using attention mechanisms. However, the roles of existing decoder query types (e.g., content query and positional query) are still underexplored. These queries are generally predefined with a fixed number (fixed-query), which limits their flexibility. We find that the learning of these fixed-query is impaired by Recurrent Opposing inTeractions (ROT) between two attention operations: Self-Attention (query-to-query) and Cross-Attention (query-to-encoder), thereby degrading decoder efficiency. Furthermore, “query ambiguity” arises when shared-weight decoder layers are processed with both one-to-one and one-to-many label assignments during training, violating DETR’s one-to-one matching principle. To address these challenges, we propose DS-Det, a more efficient detector capable of detecting a flexible number of objects in images. Specifically, we reformulate and introduce a new unified Single-Query paradigm for decoder modeling, transforming the fixed-query into flexible. Furthermore, we propose a simplified decoder framework through attention disentangled learning: locating boxes with Cross-Attention (one-to-many process), deduplicating predictions with Self-Attention (one-to-one process), addressing “query ambiguity” and “ROT” issues directly, and enhancing decoder efficiency. We further introduce a unified PoCoo loss that leverages box size priors to prioritize query learning on hard samples such as small objects. Extensive experiments across five different backbone models on COCO2017 and WiderPerson datasets demonstrate the general effectiveness and superiority of DS-Det. The source codes are available at https://github.com/Med-Process/DS-Det/.
[182] SeeDiff: Off-the-Shelf Seeded Mask Generation from Diffusion Models
Joon Hyun Park, Kumju Jo, Sungyong Baik
Main category: cs.CV
TL;DR: SeeDiff leverages Stable Diffusion’s attention mechanisms to generate high-quality pixel-level annotation masks without additional training, prompt tuning, or pre-trained segmentation networks.
Details
Motivation: To automate pixel-level annotation mask generation without human effort by fully exploiting Stable Diffusion's capabilities.
Method: Uses cross-attention for coarse object localization (seeds) and self-attention for iterative region expansion, refining masks with background uniformity.
Result: Generates high-quality masks directly from Stable Diffusion, eliminating the need for extra steps.
Conclusion: SeeDiff is an efficient, off-the-shelf solution for semantic segmentation mask generation.
Abstract: Entrusted with the goal of pixel-level object classification, the semantic segmentation networks entail the laborious preparation of pixel-level annotation masks. To obtain pixel-level annotation masks for a given class without human efforts, recent few works have proposed to generate pairs of images and annotation masks by employing image and text relationships modeled by text-to-image generative models, especially Stable Diffusion. However, these works do not fully exploit the capability of text-guided Diffusion models and thus require a pre-trained segmentation network, careful text prompt tuning, or the training of a segmentation network to generate final annotation masks. In this work, we take a closer look at attention mechanisms of Stable Diffusion, from which we draw connections with classical seeded segmentation approaches. In particular, we show that cross-attention alone provides very coarse object localization, which however can provide initial seeds. Then, akin to region expansion in seeded segmentation, we utilize the semantic-correspondence-modeling capability of self-attention to iteratively spread the attention to the whole class from the seeds using multi-scale self-attention maps. We also observe that a simple-text-guided synthetic image often has a uniform background, which is easier to find correspondences, compared to complex-structured objects. Thus, we further refine a mask using a more accurate background mask. Our proposed method, dubbed SeeDiff, generates high-quality masks off-the-shelf from Stable Diffusion, without additional training procedure, prompt tuning, or a pre-trained segmentation network.
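The seeded-expansion step is easy to picture in code. The sketch below assumes a flattened cross-attention map as seed scores and a row-normalized self-attention matrix as pixel-to-pixel affinities; the thresholds and step count are illustrative, not the paper's settings.

```python
import torch

def expand_seeds(cross_attn, self_attn, steps=3, seed_quantile=0.95):
    """cross_attn: (HW,) class-token relevance per location (seed scores).
    self_attn: (HW, HW) row-normalized affinities (semantic correspondence).
    Iteratively spread the seed mask to correlated pixels, mirroring region
    expansion in classical seeded segmentation."""
    mask = (cross_attn >= torch.quantile(cross_attn, seed_quantile)).float()
    for _ in range(steps):
        spread = self_attn @ mask          # propagate mass to similar pixels
        mask = (spread > 0.5).float()      # re-binarize the grown region
    return mask
```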
[183] FM-LC: A Hierarchical Framework for Urban Flood Mapping by Land Cover Identification Models
Xin Hong, Longchao Da, Hua Wei
Main category: cs.CV
TL;DR: FM-LC, a hierarchical framework for flood mapping in arid regions, improves accuracy by addressing spectral confusion and refining boundaries, outperforming traditional methods.
Details
Motivation: Urban flooding in arid regions is challenging to map due to limited spectral contrast and rapid dynamics. Accurate mapping is crucial for emergency response and resilience planning.
Method: FM-LC uses a three-stage process: multi-class U-Net segmentation, binary expert model for misclassified areas, and Bayesian smoothing for refinement.
Result: Validated on the Dubai storm event, FM-LC improved F1-scores by up to 29% and provided sharper flood delineations.
Conclusion: FM-LC effectively addresses challenges in arid-region flood mapping, offering significant improvements over traditional approaches.
Abstract: Urban flooding in arid regions poses severe risks to infrastructure and communities. Accurate, fine-scale mapping of flood extents and recovery trajectories is therefore essential for improving emergency response and resilience planning. However, arid environments often exhibit limited spectral contrast between water and adjacent surfaces, rapid hydrological dynamics, and highly heterogeneous urban land covers, which challenge traditional flood-mapping approaches. High-resolution, daily PlanetScope imagery provides the temporal and spatial detail needed. In this work, we introduce FM-LC, a hierarchical framework for Flood Mapping by Land Cover identification, for this challenging task. Through a three-stage process, it first uses an initial multi-class U-Net to segment imagery into water, vegetation, built area, and bare ground classes. We identify that this method has confusion between spectrally similar categories (e.g., water vs. vegetation). Second, by early checking, the class with the major misclassified area is flagged, and a lightweight binary expert segmentation model is trained to distinguish the flagged class from the rest. Third, a Bayesian smoothing step refines boundaries and removes spurious noise by leveraging nearby pixel information. We validate the framework on the April 2024 Dubai storm event, using pre- and post-rainfall PlanetScope composites. Experimental results demonstrate average F1-score improvements of up to 29% across all land-cover classes and notably sharper flood delineations, significantly outperforming conventional single-stage U-Net baselines.
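The three-stage flow reduces to a short skeleton. All callables and the flagged-class handling below are placeholders for the paper's components (multi-class U-Net, binary expert, Bayesian smoothing), not its actual API.

```python
import numpy as np

def fm_lc(image, unet, binary_expert, bayesian_smooth, flagged_class):
    """Skeleton of the hierarchical FM-LC pipeline (placeholder callables)."""
    coarse = unet(image)                        # stage 1: water/veg/built/bare map
    confused = binary_expert(image)             # stage 2: flagged class vs. rest
    refined = np.where(confused == 1, flagged_class, coarse)
    return bayesian_smooth(refined)             # stage 3: boundary cleanup
```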
[184] AutoSign: Direct Pose-to-Text Translation for Continuous Sign Language Recognition
Samuel Ebimobowei Johnny, Blessed Guda, Andrew Blayama Stephen, Assane Gueye
Main category: cs.CV
TL;DR: AutoSign, an autoregressive decoder-only transformer, bypasses traditional alignment methods in Continuous Sign Language Recognition (CSLR), improving accuracy by 6.1% on the Isharah-1000 dataset.
Details
Motivation: To bridge communication gaps between hearing and hearing-impaired communities by addressing limitations of multi-stage CSLR pipelines, such as error propagation and vocabulary scalability.
Method: Proposes AutoSign, using a decoder-only transformer with a temporal compression module (1D CNNs) and AraGPT2 for direct pose-to-text translation.
Result: Achieves a 6.1% improvement in WER score on the Isharah-1000 dataset, with hand and body gestures identified as most discriminative.
Conclusion: AutoSign’s direct translation approach outperforms traditional alignment-based methods, offering a scalable and efficient solution for CSLR.
Abstract: Continuously recognizing sign gestures and converting them to glosses plays a key role in bridging the gap between the hearing and hearing-impaired communities. This involves recognizing and interpreting the hands, face, and body gestures of the signer, which pose a challenge as it involves a combination of all these features. Continuous Sign Language Recognition (CSLR) methods rely on multi-stage pipelines that first extract visual features, then align variable-length sequences with target glosses using CTC or HMM-based approaches. However, these alignment-based methods suffer from error propagation across stages, overfitting, and struggle with vocabulary scalability due to the intermediate gloss representation bottleneck. To address these limitations, we propose AutoSign, an autoregressive decoder-only transformer that directly translates pose sequences to natural language text, bypassing traditional alignment mechanisms entirely. The use of this decoder-only approach allows the model to directly map between the features and the glosses without the need for CTC loss while also directly learning the textual dependencies in the glosses. Our approach incorporates a temporal compression module using 1D CNNs to efficiently process pose sequences, followed by AraGPT2, a pre-trained Arabic decoder, to generate text (glosses). Through comprehensive ablation studies, we demonstrate that hand and body gestures provide the most discriminative features for signer-independent CSLR. By eliminating the multi-stage pipeline, AutoSign achieves substantial improvements on the Isharah-1000 dataset, achieving an improvement of up to 6.1% in WER score compared to the best existing method.
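A minimal sketch of the described architecture, assuming Hugging-Face-style decoder inputs; the layer sizes are illustrative, and `decoder` stands in for the pre-trained AraGPT2.

```python
import torch.nn as nn

class AutoSignSketch(nn.Module):
    """Pose sequence -> 1D-CNN temporal compression -> decoder-only LM."""
    def __init__(self, pose_dim, hidden, decoder):
        super().__init__()
        self.compress = nn.Sequential(          # ~4x temporal downsampling
            nn.Conv1d(pose_dim, hidden, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, stride=2, padding=2),
        )
        self.decoder = decoder                  # pre-trained text decoder

    def forward(self, poses):                   # poses: (B, T, pose_dim)
        x = self.compress(poses.transpose(1, 2)).transpose(1, 2)  # (B, T/4, hidden)
        # Feed compressed pose embeddings directly to the LM (assumed API).
        return self.decoder(inputs_embeds=x)
```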
[185] Knowledge Regularized Negative Feature Tuning for Out-of-Distribution Detection with Vision-Language Models
Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, Lei Zhang
Main category: cs.CV
TL;DR: KR-NFT improves OOD detection by separating ID and OOD features using Negative Feature Tuning and knowledge regularization, enhancing generalization and reducing false positives.
Details
Motivation: Addressing the reduced generalization performance of negative prompt tuning in OOD detection for vision-language models.
Method: Proposes Knowledge Regularized Negative Feature Tuning (KR-NFT), combining Negative Feature Tuning (NFT) for feature separation and knowledge-regularization (KR) for optimization.
Result: Improves ID classification and OOD detection, reducing FPR95 by 5.44% on unseen ID categories with few-shot training.
Conclusion: KR-NFT is efficient, scalable, and enhances both ID and OOD detection performance, particularly in generalization settings.
Abstract: Out-of-distribution (OOD) detection is crucial for building reliable machine learning models. Although negative prompt tuning has enhanced the OOD detection capabilities of vision-language models, these tuned models often suffer from reduced generalization performance on unseen classes and styles. To address this challenge, we propose a novel method called Knowledge Regularized Negative Feature Tuning (KR-NFT), which integrates an innovative adaptation architecture termed Negative Feature Tuning (NFT) and a corresponding knowledge-regularization (KR) optimization strategy. Specifically, NFT applies distribution-aware transformations to pre-trained text features, effectively separating positive and negative features into distinct spaces. This separation maximizes the distinction between in-distribution (ID) and OOD images. Additionally, we introduce image-conditional learnable factors through a lightweight meta-network, enabling dynamic adaptation to individual images and mitigating sensitivity to class and style shifts. Compared to traditional negative prompt tuning, NFT demonstrates superior efficiency and scalability. To optimize this adaptation architecture, the KR optimization strategy is designed to enhance the discrimination between ID and OOD sets while mitigating pre-trained knowledge forgetting. This enhances OOD detection performance on trained ID classes while simultaneously improving OOD detection on unseen ID datasets. Notably, when trained with few-shot samples from the ImageNet dataset, KR-NFT not only improves ID classification accuracy and OOD detection but also significantly reduces the FPR95 by 5.44% under an unexplored generalization setting with unseen ID categories. Codes can be found at https://github.com/ZhuWenjie98/KRNFT.
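As a rough intuition for scoring with separated positive and negative features, here is a generic sketch (a simplification, not KR-NFT's actual formulation): an image whose best match lies among the negative features is flagged as OOD.

```python
import torch
import torch.nn.functional as F

def ood_score(image_feat, pos_feats, neg_feats, tau=0.01):
    """image_feat: (D,); pos_feats: (C, D) per-class features; neg_feats: (M, D).
    Higher score = more in-distribution; threshold to reject OOD inputs."""
    img = F.normalize(image_feat, dim=-1)
    sim_pos = (img @ F.normalize(pos_feats, dim=-1).T) / tau  # (C,)
    sim_neg = (img @ F.normalize(neg_feats, dim=-1).T) / tau  # (M,)
    return sim_pos.max() - sim_neg.max()
```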
[186] Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation
Yihong Cao, Jiaming Zhang, Xu Zheng, Hao Shi, Kunyu Peng, Hang Liu, Kailun Yang, Hui Zhang
Main category: cs.CV
TL;DR: The paper introduces UNLOCK, a source-free method for panoramic image segmentation, addressing distortions and occlusions without needing source data.
Details
Motivation: Panoramic image processing is limited by distortions, occlusions, and lack of annotations. Existing methods require source data, which is impractical.
Method: UNLOCK uses Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning to adapt without source data or target labels.
Result: Achieves state-of-the-art performance (10.9 mAAP, 11.6 mAP) and +4.3 mAPQ improvement over source-only methods.
Conclusion: UNLOCK provides a practical, source-free solution for panoramic segmentation, matching source-dependent methods.
Abstract: Panoramic image processing is essential for omni-context perception, yet faces constraints like distortions, perspective occlusions, and limited annotations. Previous unsupervised domain adaptation methods transfer knowledge from labeled pinhole data to unlabeled panoramic images, but they require access to source pinhole data. To address these, we introduce a more practical task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and propose its first solution, called UNconstrained Learning Omni-Context Knowledge (UNLOCK). Specifically, UNLOCK includes two key modules: Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning. While adapting without relying on source data or target labels, this framework enhances models to achieve segmentation with 360° viewpoint coverage and occlusion-aware reasoning. Furthermore, we benchmark the proposed SFOASS task through both real-to-real and synthetic-to-real adaptation settings. Experimental results show that our source-free method achieves performance comparable to source-dependent methods, yielding state-of-the-art scores of 10.9 in mAAP and 11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the source-only method. All data and code will be made publicly available at https://github.com/yihong-97/UNLOCK.
[187] FineMotion: A Dataset and Benchmark with both Spatial and Temporal Annotation for Fine-grained Motion Generation and Editing
Bizhu Wu, Jinheng Xie, Meidan Ding, Zhe Kong, Jianfeng Ren, Ruibin Bai, Rong Qu, Linlin Shen
Main category: cs.CV
TL;DR: The paper introduces the FineMotion dataset to improve text-driven human motion generation by focusing on detailed body part movements and timing.
Details
Motivation: Existing methods for generating human motions from text often overlook specific body part movements and their timing, limiting realism and detail.
Method: The authors propose the FineMotion dataset, containing 442,000 motion snippets and 95k detailed descriptions of body part movements.
Result: The dataset improves Top-3 accuracy by 15.3% for the MDM model and supports zero-shot fine-grained motion editing.
Conclusion: The FineMotion dataset enhances text-driven motion generation and enables detailed spatial and temporal motion editing via text.
Abstract: Generating realistic human motions from textual descriptions has undergone significant advancements. However, existing methods often overlook specific body part movements and their timing. In this paper, we address this issue by enriching the textual description with more details. Specifically, we propose the FineMotion dataset, which contains over 442,000 human motion snippets - short segments of human motion sequences - and their corresponding detailed descriptions of human body part movements. Additionally, the dataset includes about 95k detailed paragraphs describing the movements of human body parts of entire motion sequences. Experimental results demonstrate the significance of our dataset on the text-driven fine-grained human motion generation task, especially with a remarkable +15.3% improvement in Top-3 accuracy for the MDM model. Notably, we further support a zero-shot pipeline of fine-grained motion editing, which focuses on detailed editing in both spatial and temporal dimensions via text. Dataset and code available at: CVI-SZU/FineMotion
[188] A Structure-aware and Motion-adaptive Framework for 3D Human Pose Estimation with Mamba
Ye Lu, Jie Wang, Jianjun Gao, Rui Gong, Chen Cai, Kim-Hui Yap
Main category: cs.CV
TL;DR: SAMA is a framework for pose-lifting that captures spatial joint topology and motion dynamics independently, outperforming Mamba-based methods with fewer computational costs.
Details
Motivation: Mamba-based methods struggle with intricate joint connections and uniform processing of motion trajectories, neglecting intrinsic motion differences.
Method: SAMA uses a Structure-aware State Integrator (SSI) for joint feature fusion and a Motion-adaptive State Modulator (MSM) for tailored motion adjustments.
Result: Extensive experiments show SAMA achieves advanced results with lower computational costs.
Conclusion: SAMA effectively addresses limitations of Mamba-based methods by integrating structure-awareness and motion-adaptivity.
Abstract: Recent Mamba-based methods for the pose-lifting task tend to model joint dependencies by 2D-to-1D mapping with diverse scanning strategies. Though effective, they struggle to model intricate joint connections and uniformly process all joint motion trajectories while neglecting the intrinsic differences across motion characteristics. In this work, we propose a structure-aware and motion-adaptive framework to capture spatial joint topology along with diverse motion dynamics independently, named as SAMA. Specifically, SAMA consists of a Structure-aware State Integrator (SSI) and a Motion-adaptive State Modulator (MSM). The Structure-aware State Integrator is tasked with leveraging dynamic joint relationships to fuse information at both the joint feature and state levels in the state space, based on pose topology rather than sequential state transitions. The Motion-adaptive State Modulator is responsible for joint-specific motion characteristics recognition, thus applying tailored adjustments to diverse motion patterns across different joints. Through the above key modules, our algorithm enables structure-aware and motion-adaptive pose lifting. Extensive experiments across multiple benchmarks demonstrate that our algorithm achieves advanced results with fewer computational costs.
[189] From General to Specialized: The Need for Foundational Models in Agriculture
Vishal Nedungadi, Xingguo Xiong, Aike Potze, Ron Van Bree, Tao Lin, Marc Rußwurm, Ioannis N. Athanasiadis
Main category: cs.CV
TL;DR: The paper evaluates existing foundational models for agricultural tasks, proposes a framework for an ideal agricultural foundation model (CropFM), and highlights the need for a dedicated model tailored to agriculture.
Details
Motivation: Addressing food security challenges by leveraging foundation models for agricultural monitoring, as current applications in agriculture remain under-explored.
Method: Quantitative evaluation of existing foundational models, development of a requirements framework (CropFM), and empirical evaluation of two models in three agricultural tasks.
Result: Existing models show potential but lack specialization for agriculture, emphasizing the need for a dedicated foundational model.
Conclusion: A tailored foundational model for agriculture (CropFM) is necessary to effectively address agricultural challenges.
Abstract: Food security remains a global concern as the population grows and climate change intensifies, demanding innovative solutions for sustainable agricultural productivity. Recent advances in foundation models have demonstrated remarkable performance in remote sensing and climate sciences, and therefore offer new opportunities for agricultural monitoring. However, their application to challenges in agriculture, such as crop type mapping, crop phenology estimation, and crop yield estimation, remains under-explored. In this work, we quantitatively evaluate existing foundational models to assess their effectiveness for a representative set of agricultural tasks. From an agricultural domain perspective, we describe a requirements framework for an ideal agricultural foundation model (CropFM). We then survey and compare existing general-purpose foundational models in this framework and empirically evaluate two exemplary models on three representative agriculture-specific tasks. Finally, we highlight the need for a dedicated foundational model tailored specifically to agriculture.
[190] RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cues for 3D Object Detection
Xiaokai Bai, Chenxu Zhou, Lianqing Zheng, Si-Yuan Cao, Jianan Liu, Xiaohan Zhang, Zhengzhuang Zhang, Hui-liang Shen
Main category: cs.CV
TL;DR: RaGS is a novel framework using 3D Gaussian Splatting to fuse 4D radar and monocular images for 3D object detection, outperforming existing methods.
Details
Motivation: Existing fusion approaches for 4D radar and monocular images lack holistic scene understanding or are constrained by rigid grid structures.
Method: RaGS employs a cascaded pipeline: Frustum-based Localization Initiation (FLI), Iterative Multimodal Aggregation (IMA), and Multi-level Gaussian Fusion (MGF) to dynamically model scenes with Gaussians.
Result: RaGS achieves state-of-the-art performance on benchmarks like View-of-Delft, TJ4DRadSet, and OmniHD-Scenes.
Conclusion: RaGS provides a flexible, resource-efficient solution for 3D object detection by focusing on sparse objects while maintaining scene perception.
Abstract: 4D millimeter-wave radar has emerged as a promising sensor for autonomous driving, but effective 3D object detection from both 4D radar and monocular images remains a challenge. Existing fusion approaches typically rely on either instance-based proposals or dense BEV grids, which either lack holistic scene understanding or are limited by rigid grid structures. To address these, we propose RaGS, the first framework to leverage 3D Gaussian Splatting (GS) as representation for fusing 4D radar and monocular cues in 3D object detection. 3D GS naturally suits 3D object detection by modeling the scene as a field of Gaussians, dynamically allocating resources on foreground objects and providing a flexible, resource-efficient solution. RaGS uses a cascaded pipeline to construct and refine the Gaussian field. It starts with the Frustum-based Localization Initiation (FLI), which unprojects foreground pixels to initialize coarse 3D Gaussian positions. Then, the Iterative Multimodal Aggregation (IMA) fuses semantics and geometry, refining the limited Gaussians to the regions of interest. Finally, the Multi-level Gaussian Fusion (MGF) renders the Gaussians into multi-level BEV features for 3D object detection. By dynamically focusing on sparse objects within scenes, RaGS enables concentration on objects while offering comprehensive scene perception. Extensive experiments on View-of-Delft, TJ4DRadSet, and OmniHD-Scenes benchmarks demonstrate its state-of-the-art performance. Code will be released.
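The frustum-based initialization rests on standard pinhole back-projection. A minimal sketch, assuming per-pixel depths for the selected foreground pixels (how RaGS obtains and refines these is the paper's contribution):

```python
import torch

def unproject_pixels(uv, depth, K):
    """uv: (N, 2) pixel coords, depth: (N,), K: (3, 3) camera intrinsics.
    Returns (N, 3) camera-frame points X = d * K^-1 [u, v, 1]^T, usable as
    coarse initial 3D Gaussian centers."""
    ones = torch.ones(uv.shape[0], 1, dtype=uv.dtype)
    pix = torch.cat([uv, ones], dim=1)            # homogeneous pixel coords
    rays = (torch.linalg.inv(K) @ pix.T).T        # back-projected ray directions
    return rays * depth.unsqueeze(1)
```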
[191] Part Segmentation of Human Meshes via Multi-View Human Parsing
James Dickens, Kamyar Hamad
Main category: cs.CV
TL;DR: The paper bridges point cloud deep learning and human parsing by enabling semantic segmentation of human meshes using geometric data, introducing a pseudo-ground truth pipeline and a memory-efficient sampling strategy.
Details
Motivation: To combine advances in point cloud deep learning and human parsing for semantic segmentation of human meshes without relying on texture.
Method: Developed a pseudo-ground truth labeling pipeline for Thuman2.1, introduced windowed iterative farthest point sampling with space-filling curve-based serialization, and used PointTransformer for geometric segmentation.
Result: The approach effectively achieves semantic parsing of human meshes, confirmed by experimental results.
Conclusion: The proposed method is accurate and efficient for semantic segmentation of human meshes using only geometric data.
Abstract: Recent advances in point cloud deep learning have led to models that achieve high per-part labeling accuracy on large-scale point clouds, using only the raw geometry of unordered point sets. In parallel, the field of human parsing focuses on predicting body part and clothing/accessory labels from images. This work aims to bridge these two domains by enabling per-vertex semantic segmentation of large-scale human meshes. To achieve this, a pseudo-ground truth labeling pipeline is developed for the Thuman2.1 dataset: meshes are first aligned to a canonical pose, segmented from multiple viewpoints, and the resulting point-level labels are then backprojected onto the original mesh to produce per-point pseudo ground truth annotations. Subsequently, a novel, memory-efficient sampling strategy is introduced, a windowed iterative farthest point sampling (FPS) with space-filling curve-based serialization to effectively downsample the point clouds. This is followed by a purely geometric segmentation using PointTransformer, enabling semantic parsing of human meshes without relying on texture information. Experimental results confirm the effectiveness and accuracy of the proposed approach.
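The memory saving of the sampling strategy comes from running farthest point sampling only inside contiguous windows of a serialized point list. A sketch under assumed details (the serialization key `order` would come from a space-filling curve such as Hilbert or Z-order):

```python
import numpy as np

def windowed_fps(points, order, window=4096, ratio=0.25):
    """points: (N, 3); order: (N,) serialization indices. Plain greedy FPS is
    run per window, so peak cost scales with the window, not the full cloud."""
    sampled = []
    pts = points[order]                           # serialize along the curve
    for s in range(0, len(pts), window):
        w = pts[s:s + window]
        k = max(1, int(len(w) * ratio))
        idx = [0]                                 # seed with the first point
        d = np.linalg.norm(w - w[0], axis=1)      # distance to selected set
        for _ in range(k - 1):
            idx.append(int(d.argmax()))           # pick the farthest point
            d = np.minimum(d, np.linalg.norm(w - w[idx[-1]], axis=1))
        sampled.append(w[idx])
    return np.concatenate(sampled)
```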
[192] OW-CLIP: Data-Efficient Visual Supervision for Open-World Object Detection via Human-AI Collaboration
Junwen Duan, Wei Xue, Ziyao Kang, Shixia Liu, Jiazhi Xia
Main category: cs.CV
TL;DR: OW-CLIP is a visual analytics system for open-world object detection (OWOD) that addresses data-hungry training, partial feature overfitting, and inflexibility by using multimodal prompt tuning, Crop-Smoothing, and dual-modal data refinement. It achieves 89% of SOTA performance with minimal self-generated data.
Details
Motivation: Traditional OWOD methods are limited by reliance on crowdsourced annotations, partial feature overfitting, and rigid architectures. OW-CLIP aims to overcome these challenges with efficient data use and adaptable training.
Method: OW-CLIP employs plug-and-play multimodal prompt tuning, Crop-Smoothing to reduce overfitting, and dual-modal data refinement using large language models and cross-modal similarity. It also includes a visualization interface for high-quality annotations.
Result: OW-CLIP achieves 89% of SOTA performance with only 3.8% self-generated data and outperforms SOTA when using equivalent data volumes.
Conclusion: OW-CLIP offers a data-efficient, flexible solution for OWOD, improving annotation quality and model performance while reducing reliance on external data.
Abstract: Open-world object detection (OWOD) extends traditional object detection to identifying both known and unknown objects, necessitating continuous model adaptation as new annotations emerge. Current approaches face significant limitations: 1) data-hungry training due to reliance on a large number of crowdsourced annotations, 2) susceptibility to “partial feature overfitting,” and 3) limited flexibility due to required model architecture modifications. To tackle these issues, we present OW-CLIP, a visual analytics system that provides curated data and enables data-efficient OWOD model incremental training. OW-CLIP implements plug-and-play multimodal prompt tuning tailored for OWOD settings and introduces a novel “Crop-Smoothing” technique to mitigate partial feature overfitting. To meet the data requirements for the training methodology, we propose dual-modal data refinement methods that leverage large language models and cross-modal similarity for data generation and filtering. Simultaneously, we develop a visualization interface that enables users to explore and deliver high-quality annotations: including class-specific visual feature phrases and fine-grained differentiated images. Quantitative evaluation demonstrates that OW-CLIP achieves competitive performance at 89% of state-of-the-art performance while requiring only 3.8% self-generated data, while outperforming SOTA approach when trained with equivalent data volumes. A case study shows the effectiveness of the developed method and the improved annotation quality of our visualization system.
[193] All-in-One Medical Image Restoration with Latent Diffusion-Enhanced Vector-Quantized Codebook Prior
Haowei Chen, Zhiwen Yang, Haotian Hou, Hui Zhang, Bingzheng Wei, Gang Zhou, Yan Xu
Main category: cs.CV
TL;DR: DiffCode is a novel framework for all-in-one medical image restoration (MedIR) that uses a latent diffusion-enhanced vector-quantized codebook prior to handle diverse task-specific degradations.
Details
Motivation: Existing methods struggle with the heterogeneity of MedIR tasks, which involve distinct degradations and information losses. DiffCode aims to unify these tasks under one model.
Method: DiffCode employs a task-adaptive codebook bank for task-specific HQ prior features and a latent diffusion strategy to refine feature distribution iteratively.
Result: DiffCode outperforms existing methods in quantitative metrics and visual quality for MRI super-resolution, CT denoising, and PET synthesis.
Conclusion: DiffCode effectively addresses the challenges of all-in-one MedIR by integrating task-specific priors and leveraging latent diffusion for superior restoration.
Abstract: All-in-one medical image restoration (MedIR) aims to address multiple MedIR tasks using a unified model, concurrently recovering various high-quality (HQ) medical images (e.g., MRI, CT, and PET) from low-quality (LQ) counterparts. However, all-in-one MedIR presents significant challenges due to the heterogeneity across different tasks. Each task involves distinct degradations, leading to diverse information losses in LQ images. Existing methods struggle to handle these diverse information losses associated with different tasks. To address these challenges, we propose a latent diffusion-enhanced vector-quantized codebook prior and develop DiffCode, a novel framework leveraging this prior for all-in-one MedIR. Specifically, to compensate for diverse information losses associated with different tasks, DiffCode constructs a task-adaptive codebook bank to integrate task-specific HQ prior features across tasks, capturing a comprehensive prior. Furthermore, to enhance prior retrieval from the codebook bank, DiffCode introduces a latent diffusion strategy that utilizes the diffusion model’s powerful mapping capabilities to iteratively refine the latent feature distribution, estimating more accurate HQ prior features during restoration. With the help of the task-adaptive codebook bank and latent diffusion strategy, DiffCode achieves superior performance in both quantitative metrics and visual quality across three MedIR tasks: MRI super-resolution, CT denoising, and PET synthesis.
[194] ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking
X. Feng, S. Hu, X. Li, D. Zhang, M. Wu, J. Zhang, X. Chen, K. Huang
Main category: cs.CV
TL;DR: ATCTrack introduces a novel vision-language tracker that aligns multimodal cues with dynamic target states for robust tracking in complex scenarios.
Details
Motivation: Existing vision-language trackers struggle with dynamic target states and diverse textual expressions, limiting their robustness in real-world conditions.
Method: ATCTrack employs temporal visual target-context modeling and precise target word identification with adaptive context word calibration.
Result: ATCTrack achieves state-of-the-art performance on mainstream benchmarks.
Conclusion: The proposed tracker effectively addresses dynamic target-context alignment and outperforms existing methods.
Abstract: Vision-language tracking aims to locate the target object in the video sequence using a template patch and a language description provided in the initial frame. To achieve robust tracking, especially in complex long-term scenarios that reflect real-world conditions as recently highlighted by MGIT, it is essential not only to characterize the target features but also to utilize the context features related to the target. However, the visual and textual target-context cues derived from the initial prompts generally align only with the initial target state. Due to their dynamic nature, target states are constantly changing, particularly in complex long-term sequences. It is intractable for these cues to continuously guide Vision-Language Trackers (VLTs). Furthermore, for the text prompts with diverse expressions, our experiments reveal that existing VLTs struggle to discern which words pertain to the target or the context, complicating the utilization of textual cues. In this work, we present a novel tracker named ATCTrack, which can obtain multimodal cues Aligned with the dynamic target states through comprehensive Target-Context feature modeling, thereby achieving robust tracking. Specifically, (1) for the visual modality, we propose an effective temporal visual target-context modeling approach that provides the tracker with timely visual cues. (2) For the textual modality, we achieve precise target words identification solely based on textual content, and design an innovative context words calibration method to adaptively utilize auxiliary context words. (3) We conduct extensive experiments on mainstream benchmarks and ATCTrack achieves a new SOTA performance. The code and models will be released at: https://github.com/XiaokunFeng/ATCTrack.
[195] Efficient Self-Supervised Neuro-Analytic Visual Servoing for Real-time Quadrotor Control
Sebastian Mocanu, Sebastian-Ion Nae, Mihai-Eugen Barbu, Marius Leordeanu
Main category: cs.CV
TL;DR: A self-supervised neuro-analytical model for quadrotor control uses a small student ConvNet to learn from an improved IBVS teacher, achieving 11x faster inference with similar accuracy.
Details
Motivation: To enable efficient, stable, and accurate visual-based quadrotor control without relying on explicit geometric models or fiducial markers.
Method: Knowledge distillation from an analytical IBVS teacher to a student ConvNet, combined with a two-stage segmentation pipeline for robust feature detection.
Result: The student model achieves 11x faster inference than the teacher, with similar control accuracy and lower computational cost.
Conclusion: The proposed method enables real-time, vision-only quadrotor control in GPS-denied environments, outperforming classical approaches.
Abstract: This work introduces a self-supervised, neuro-analytical, cost-efficient model for visual-based quadrotor control in which a small 1.7M-parameter student ConvNet learns automatically from an analytical teacher, an improved image-based visual servoing (IBVS) controller. Our IBVS system solves numerical instabilities by reducing the classical visual servoing equations and enabling efficient stable image feature detection. Through knowledge distillation, the student model achieves 11x faster inference compared to the teacher IBVS pipeline, while demonstrating similar control accuracy at a significantly lower computational and memory cost. Our vision-only self-supervised neuro-analytic control, enables quadrotor orientation and movement without requiring explicit geometric models or fiducial markers. The proposed methodology leverages simulation-to-reality transfer learning and is validated on a small drone platform in GPS-denied indoor environments. Our key contributions include: (1) an analytical IBVS teacher that solves numerical instabilities inherent in classical approaches, (2) a two-stage segmentation pipeline combining YOLOv11 with a U-Net-based mask splitter for robust anterior-posterior vehicle segmentation to correctly estimate the orientation of the target, and (3) an efficient knowledge distillation dual-path system, which transfers geometric visual servoing capabilities from the analytical IBVS teacher to a compact and small student neural network that outperforms the teacher, while being suitable for real-time onboard deployment.
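For context, the classical IBVS control law that the teacher improves upon is compact (textbook form; the paper's reduced, stabilized variant differs):

```python
import numpy as np

def ibvs_velocity(L, s, s_star, lam=0.5):
    """L: (2k, 6) interaction matrix for k image features; s, s_star: (2k,)
    measured and desired feature coordinates. Returns the 6-DoF camera twist
    v = -lambda * pinv(L) @ (s - s_star) that drives the feature error to zero."""
    return -lam * np.linalg.pinv(L) @ (s - s_star)
```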
[196] FedS2R: One-Shot Federated Domain Generalization for Synthetic-to-Real Semantic Segmentation in Autonomous Driving
Tao Lian, Jose L. Gómez, Antonio M. López
Main category: cs.CV
TL;DR: FedS2R is a one-shot federated domain generalization framework for synthetic-to-real semantic segmentation in autonomous driving, combining data augmentation and knowledge distillation to outperform individual client models.
Details
Motivation: The potential of federated domain generalization in semantic segmentation for autonomous driving is underexplored, prompting the development of FedS2R.
Method: FedS2R uses inconsistency-driven data augmentation for unstable classes and multi-client knowledge distillation with feature fusion to create a global model.
Result: The global model outperforms individual client models and is only 2 mIoU points behind a model trained with all client data.
Conclusion: FedS2R effectively addresses synthetic-to-real semantic segmentation in autonomous driving under federated learning.
Abstract: Federated domain generalization has shown promising progress in image classification by enabling collaborative training across multiple clients without sharing raw data. However, its potential in the semantic segmentation of autonomous driving remains underexplored. In this paper, we propose FedS2R, the first one-shot federated domain generalization framework for synthetic-to-real semantic segmentation in autonomous driving. FedS2R comprises two components: an inconsistency-driven data augmentation strategy that generates images for unstable classes, and a multi-client knowledge distillation scheme with feature fusion that distills a global model from multiple client models. Experiments on five real-world datasets, Cityscapes, BDD100K, Mapillary, IDD, and ACDC, show that the global model significantly outperforms individual client models and is only 2 mIoU points behind the model trained with simultaneous access to all client data. These results demonstrate the effectiveness of FedS2R in synthetic-to-real semantic segmentation for autonomous driving under federated learning.
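The one-shot distillation step can be sketched as averaging client predictions into a soft teacher (the logit-averaging below is an assumption; the paper additionally fuses features):

```python
import torch
import torch.nn.functional as F

def multi_client_distill_loss(student_logits, client_logits, T=2.0):
    """student_logits: (B, C, H, W); client_logits: list of same-shaped
    tensors from frozen client models. Standard temperature-scaled KD."""
    teacher = torch.stack(client_logits).mean(0)          # fused soft labels
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
```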
[197] Interpretable Open-Vocabulary Referring Object Detection with Reverse Contrast Attention
Drandreb Earl O. Juanico, Rowel O. Atienza, Jeffrey Kenneth Go
Main category: cs.CV
TL;DR: RCA enhances object localization in vision-language transformers by reweighting attention, improving performance in 11 out of 15 models without retraining.
Details
Motivation: To improve object localization in vision-language transformers by addressing extreme attention values and amplifying mid-level activations.
Method: RCA reweights final-layer attention, suppressing extremes and boosting mid-level activations to highlight semantically relevant tokens.
Result: RCA improves FitAP (a new metric) in 11 out of 15 models, with gains up to +26.6%. Late-fusion models benefit most, but others like DeepSeek-VL2 also improve.
Conclusion: RCA provides interpretability and performance gains for multimodal transformers, with effectiveness tied to attention sharpness and fusion timing.
Abstract: We propose Reverse Contrast Attention (RCA), a plug-in method that enhances object localization in vision-language transformers without retraining. RCA reweights final-layer attention by suppressing extremes and amplifying mid-level activations to let semantically relevant but subdued tokens guide predictions. We evaluate it on Open Vocabulary Referring Object Detection (OV-RefOD), introducing FitAP, a confidence-free average precision metric based on IoU and box area. RCA improves FitAP in 11 out of 15 open-source VLMs, with gains up to +26.6%. Effectiveness aligns with attention sharpness and fusion timing; while late-fusion models benefit consistently, models like DeepSeek-VL2 also improve, pointing to capacity and disentanglement as key factors. RCA offers both interpretability and performance gains for multimodal transformers.
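One plausible reading of the reweighting, sketched below with assumed quantile thresholds (the paper's exact transform may differ): clamp both tails of the final-layer attention and renormalize, so mid-level activations carry the localization signal.

```python
import torch

def reverse_contrast(attn, low_q=0.2, high_q=0.9):
    """attn: tensor of final-layer attention weights (any shape)."""
    lo = torch.quantile(attn, low_q)
    hi = torch.quantile(attn, high_q)
    squeezed = attn.clamp(lo, hi)                 # suppress extremes, both tails
    return (squeezed - lo) / (hi - lo + 1e-6)     # amplify the mid band to [0, 1]
```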
[198] TrackAny3D: Transferring Pretrained 3D Models for Category-unified 3D Point Cloud Tracking
Mengmeng Wang, Haonan Wang, Yulong Li, Xiangjie Kong, Jiaxin Du, Guojiang Shen, Feng Xia
Main category: cs.CV
TL;DR: TrackAny3D is a framework for category-agnostic 3D single object tracking (SOT) using pretrained 3D models, achieving state-of-the-art performance with strong generalization.
Details
Motivation: Current category-specific 3D SOT methods are impractical for real-world use due to limited generalization and the need for separate models per category.
Method: The framework integrates parameter-efficient adapters, a Mixture-of-Geometry-Experts (MoGE) architecture, and a temporal context optimization strategy with learnable tokens and dynamic mask weighting.
Result: TrackAny3D outperforms existing methods on three benchmarks, demonstrating strong generalization and competitiveness.
Conclusion: The work highlights the potential of unified models and large-scale pretrained models in 3D SOT, encouraging further research in this direction.
Abstract: 3D LiDAR-based single object tracking (SOT) relies on sparse and irregular point clouds, posing challenges from geometric variations in scale, motion patterns, and structural complexity across object categories. Current category-specific approaches achieve good accuracy but are impractical for real-world use, requiring separate models for each category and showing limited generalization. To tackle these issues, we propose TrackAny3D, the first framework to transfer large-scale pretrained 3D models for category-agnostic 3D SOT. We first integrate parameter-efficient adapters to bridge the gap between pretraining and tracking tasks while preserving geometric priors. Then, we introduce a Mixture-of-Geometry-Experts (MoGE) architecture that adaptively activates specialized subnetworks based on distinct geometric characteristics. Additionally, we design a temporal context optimization strategy that incorporates learnable temporal tokens and a dynamic mask weighting module to propagate historical information and mitigate temporal drift. Experiments on three commonly-used benchmarks show that TrackAny3D establishes new state-of-the-art performance on category-agnostic 3D SOT, demonstrating strong generalization and competitiveness. We hope this work will enlighten the community on the importance of unified models and further expand the use of large-scale pretrained models in this field.
[199] DriveIndia: An Object Detection Dataset for Diverse Indian Traffic Scenes
Rishav Kumar, D. Santhosh Reddy, P. Rajalakshmi
Main category: cs.CV
TL;DR: DriveIndia is a large-scale dataset for object detection in Indian traffic, featuring 66,986 images across 24 categories, collected under diverse conditions. Baseline results using YOLO models achieve 78.7% mAP.
Details
Motivation: To address the complexity and unpredictability of Indian traffic environments for autonomous driving research.
Method: The dataset includes 66,986 high-resolution images annotated in YOLO format, covering varied weather, illumination, road infrastructure, and traffic patterns. Baseline performance is evaluated using YOLO family models.
Result: The top-performing YOLO variant achieves a mAP of 78.7%, demonstrating the dataset’s utility for robust object detection.
Conclusion: DriveIndia serves as a benchmark for autonomous driving challenges and will be publicly available for research.
Abstract: We introduce DriveIndia, a large-scale object detection dataset purpose-built to capture the complexity and unpredictability of Indian traffic environments. The dataset contains 66,986 high-resolution images annotated in YOLO format across 24 traffic-relevant object categories, encompassing diverse conditions such as varied weather (fog, rain), illumination changes, heterogeneous road infrastructure, and dense, mixed traffic patterns, collected over 120+ hours and covering 3,400+ kilometers across urban, rural, and highway routes. DriveIndia offers a comprehensive benchmark for real-world autonomous driving challenges. We provide baseline results using state-of-the-art YOLO family models, with the top-performing variant achieving a mAP50 of 78.7%. Designed to support research in robust, generalizable object detection under uncertain road conditions, DriveIndia will be publicly available via the TiHAN-IIT Hyderabad dataset repository (https://tihan.iith.ac.in/tiand-datasets/).
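For reference, YOLO-format annotation stores one object per line, with box coordinates normalized to [0, 1]; the class IDs and values below are made up for illustration.

```
# <class_id> <x_center> <y_center> <width> <height>
3 0.512 0.634 0.110 0.205
17 0.248 0.401 0.056 0.098
```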
[200] A mini-batch training strategy for deep subspace clustering networks
Yuxuan Jiang, Chenwei Yu, Zhi Lin, Xiaolan Liu
Main category: cs.CV
TL;DR: The paper introduces a mini-batch training strategy for deep subspace clustering (DSC) using a memory bank for global features and a decoder-free framework with contrastive learning, achieving scalable training and competitive performance.
Details
Motivation: Existing DSC methods rely on full-batch processing due to the self-expressive module, limiting scalability for high-resolution images.
Method: Proposes a mini-batch training strategy with a memory bank and a decoder-free framework using contrastive learning.
Result: Achieves performance comparable to full-batch methods and outperforms state-of-the-art methods on COIL100 and ORL datasets.
Conclusion: The approach enables scalable DSC training and efficient fine-tuning of pre-trained encoders, offering a practical solution for large-scale subspace clustering.
Abstract: Mini-batch training is a cornerstone of modern deep learning, offering computational efficiency and scalability for training complex architectures. However, existing deep subspace clustering (DSC) methods, which typically combine an autoencoder with a self-expressive layer, rely on full-batch processing. The bottleneck arises from the self-expressive module, which requires representations of the entire dataset to construct a self-representation coefficient matrix. In this work, we introduce a mini-batch training strategy for DSC by integrating a memory bank that preserves global feature representations. Our approach enables scalable training of deep architectures for subspace clustering with high-resolution images, overcoming previous limitations. Additionally, to efficiently fine-tune large-scale pre-trained encoders for subspace clustering, we propose a decoder-free framework that leverages contrastive learning instead of autoencoding for representation learning. This design not only eliminates the computational overhead of decoder training but also provides competitive performance. Extensive experiments demonstrate that our approach not only achieves performance comparable to full-batch methods, but also outperforms other state-of-the-art subspace clustering methods on the COIL100 and ORL datasets by fine-tuning deep networks.
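The memory-bank idea can be sketched as follows (an assumed form of the objective, not the paper's exact loss): each mini-batch is expressed over the bank's global features, so the self-expressive coefficients never require the full dataset in memory at once.

```python
import torch
import torch.nn.functional as F

def self_expressive_loss(batch_feats, memory_bank, C, lam=1e-2):
    """batch_feats: (B, D) current encodings; memory_bank: (N, D) stored
    global features; C: (B, N) learnable self-representation coefficients."""
    recon = C @ memory_bank                        # reconstruct batch from bank
    return F.mse_loss(recon, batch_feats) + lam * C.abs().sum()  # sparsity reg
```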
[201] HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial, Appearance, and Motion Anomaly
Chang Liu, Yunfan Ye, Fan Zhang, Qingyang Zhou, Yuchuan Luo, Zhiping Cai
Main category: cs.CV
TL;DR: HumanSAM is a framework for classifying human-centric video forgeries into spatial, appearance, and motion anomalies, outperforming state-of-the-art methods.
Details
Motivation: Addressing the lack of fine-grained understanding of forgery types in human-centric videos, which is critical for reliability and interpretability in real-world applications.
Method: HumanSAM fuses video understanding and spatial depth to capture geometry, semantics, and spatiotemporal consistency, using a rank-based confidence enhancement strategy with prior scores.
Result: HumanSAM achieves promising results in binary and multi-class forgery classification, validated on the Human-centric Forgery Video (HFV) dataset.
Conclusion: HumanSAM advances forgery detection by providing fine-grained classification and robust representation, supported by a new benchmark dataset.
Abstract: Numerous synthesized videos from generative models, especially human-centric ones that simulate realistic human actions, pose significant threats to human information security and authenticity. While progress has been made in binary forgery video detection, the lack of fine-grained understanding of forgery types raises concerns regarding both reliability and interpretability, which are critical for real-world applications. To address this limitation, we propose HumanSAM, a new framework that builds upon the fundamental challenges of video generation models. Specifically, HumanSAM aims to classify human-centric forgeries into three distinct types of artifacts commonly observed in generated content: spatial, appearance, and motion anomalies. To better capture the features of geometry, semantics and spatiotemporal consistency, we propose to generate the human forgery representation by fusing two branches of video understanding and spatial depth. We also adopt a rank-based confidence enhancement strategy during the training process to learn more robust representation by introducing three prior scores. For training and evaluation, we construct the first public benchmark, the Human-centric Forgery Video (HFV) dataset, with all types of forgeries carefully annotated semi-automatically. In our experiments, HumanSAM yields promising results in comparison with state-of-the-art methods, both in binary and multi-class forgery classification.
[202] MambaVesselNet++: A Hybrid CNN-Mamba Architecture for Medical Image Segmentation
Qing Xu, Yanming Chen, Yue Li, Ziyu Liu, Zhenye Lou, Yixuan Zhang, Xiangjian He
Main category: cs.CV
TL;DR: MambaVesselNet++ is a hybrid CNN-Mamba framework for medical image segmentation, combining local feature capture with efficient long-range dependency modeling, outperforming existing methods.
Details
Motivation: Address the computational inefficiency of vision transformers in medical image segmentation while maintaining global context modeling.
Method: Uses a hybrid image encoder (Hi-Encoder) with texture-aware layers for low-level features and Mamba for long-range dependencies, coupled with a bifocal fusion decoder (BF-Decoder) for mask generation.
Result: Outperforms convolution-based, transformer-based, and Mamba-based methods in 2D, 3D, and instance segmentation tasks.
Conclusion: MambaVesselNet++ efficiently combines CNN and Mamba strengths for superior medical image segmentation.
Abstract: Medical image segmentation plays an important role in computer-aided diagnosis. Traditional convolution-based U-shape segmentation architectures are usually limited by the local receptive field. Existing vision transformers have been widely applied to diverse medical segmentation frameworks due to their superior capabilities of capturing global contexts. Despite the advantage, the real-world application of vision transformers is challenged by their non-linear self-attention mechanism, requiring huge computational costs. To address this issue, the selective state space model (SSM) Mamba has gained recognition for its adeptness in modeling long-range dependencies in sequential data, particularly noted for its efficient memory costs. In this paper, we propose MambaVesselNet++, a Hybrid CNN-Mamba framework for medical image segmentation. Our MambaVesselNet++ is comprised of a hybrid image encoder (Hi-Encoder) and a bifocal fusion decoder (BF-Decoder). In Hi-Encoder, we first devise the texture-aware layer to capture low-level semantic features by leveraging convolutions. Then, we utilize Mamba to effectively model long-range dependencies with linear complexity. The BF-Decoder adopts skip connections to combine local and global information of the Hi-Encoder for the accurate generation of segmentation masks. Extensive experiments demonstrate that MambaVesselNet++ outperforms current convolution-based, transformer-based, and Mamba-based state-of-the-arts across diverse medical 2D, 3D, and instance segmentation tasks. The code is available at https://github.com/CC0117/MambaVesselNet.
[203] LLMControl: Grounded Control of Text-to-Image Diffusion-based Synthesis with Multimodal LLMs
Jiaze Wang, Rui Chen, Haowang Cui
Main category: cs.CV
TL;DR: LLM_Control improves spatial control in text-to-image diffusion models by using a multimodal LLM to enhance grounding and attention modulation.
Details
Motivation: Existing methods struggle with complex spatial compositions and multiple objects in textual prompts.
Method: LLM_Control employs a multimodal LLM to arrange layouts, augment semantics, and inject control signals into the denoising network.
Result: Achieves competitive synthesis quality across various T2I models, handling challenging inputs better than existing methods.
Conclusion: LLM_Control effectively addresses spatial control challenges in T2I generation.
Abstract: Recent spatial control methods for text-to-image (T2I) diffusion models have shown compelling results. However, these methods still fail to precisely follow the control conditions and generate the corresponding images, especially when encountering textual prompts that contain multiple objects or have complex spatial compositions. In this work, we present an LLM-guided framework called LLM_Control to address the challenges of the controllable T2I generation task. By improving grounding capabilities, LLM_Control is introduced to accurately modulate the pre-trained diffusion models, where visual conditions and textual prompts influence the structures and appearance generation in a complementary way. We utilize the multimodal LLM as a global controller to arrange spatial layouts, augment semantic descriptions and bind object attributes. The obtained control signals are injected into the denoising network to refocus and enhance attention maps according to novel sampling constraints. Extensive qualitative and quantitative experiments have demonstrated that LLM_Control achieves competitive synthesis quality compared to other state-of-the-art methods across various pre-trained T2I models. It is noteworthy that LLM_Control handles the challenging input conditions on which most of the existing methods fail.
[204] SCALAR: Scale-wise Controllable Visual Autoregressive Learning
Ryan Xu, Dongyang Jin, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun, Xiangxiang Chu
Main category: cs.CV
TL;DR: SCALAR introduces a scale-wise conditional decoding method for controllable image synthesis in VAR models, addressing inefficiencies in control encoding and injection.
Details
Motivation: Controllable image synthesis in VAR models is challenging due to hierarchical prediction and inefficient control mechanisms.
Method: SCALAR uses a novel scale-wise conditional decoding mechanism to improve control encoding and injection.
Result: The method enhances fidelity and efficiency in controllable generation for VAR models.
Conclusion: SCALAR provides a promising solution for fine-grained control in visual autoregressive models.
Abstract: Controllable image synthesis, which enables fine-grained control over generated outputs, has emerged as a key focus in visual generative modeling. However, controllable generation remains challenging for Visual Autoregressive (VAR) models due to their hierarchical, next-scale prediction style. Existing VAR-based methods often suffer from inefficient control encoding and disruptive injection mechanisms that compromise both fidelity and efficiency. In this work, we present SCALAR, a controllable generation method based on VAR, incorporating a novel Scale-wise Conditional Decoding mechanism.
[205] UniCT Depth: Event-Image Fusion Based Monocular Depth Estimation with Convolution-Compensated ViT Dual SA Block
Luoxi Jing, Dianxi Shi, Zhe Liu, Songchang Jin, Chunping Qiu, Ziteng Qiao, Yuxian Li, Jianqiang Xia
Main category: cs.CV
TL;DR: UniCT Depth combines CNNs and Transformers for event-image fusion, outperforming existing methods in monocular depth estimation.
Details
Motivation: Challenges in depth estimation with image-based methods (struggle in tough scenarios) and event cameras (sparse data issues) motivate a fusion approach.
Method: Proposes UniCT Depth with CcViT-DA Block (CMSA for spatial dependencies, MFSA for cross-modal fusion) and DCC Block for detail enhancement.
Result: UniCT Depth surpasses existing image, event, and fusion-based methods in key metrics.
Conclusion: The unified CNN-Transformer approach effectively addresses fusion challenges, improving depth estimation accuracy.
Abstract: Depth estimation plays a crucial role in 3D scene understanding and is extensively used in a wide range of vision tasks. Image-based methods struggle in challenging scenarios, while event cameras offer high dynamic range and temporal resolution but face difficulties with sparse data. Combining event and image data provides significant advantages, yet effective integration remains challenging. Existing CNN-based fusion methods struggle with occlusions and depth disparities due to limited receptive fields, while Transformer-based fusion methods often lack deep modality interaction. To address these issues, we propose UniCT Depth, an event-image fusion method that unifies CNNs and Transformers to model local and global features. We propose the Convolution-compensated ViT Dual SA (CcViT-DA) Block, designed for the encoder, which integrates Context Modeling Self-Attention (CMSA) to capture spatial dependencies and Modal Fusion Self-Attention (MFSA) for effective cross-modal fusion. Furthermore, we design the tailored Detail Compensation Convolution (DCC) Block to improve texture details and enhance edge representations. Experiments show that UniCT Depth outperforms existing image, event, and fusion-based monocular depth estimation methods across key metrics.
[206] AF-CLIP: Zero-Shot Anomaly Detection via Anomaly-Focused CLIP Adaptation
Qingqing Fang, Wenxi Lv, Qinliang Su
Main category: cs.CV
TL;DR: AF-CLIP enhances CLIP for zero-/few-shot anomaly detection by optimizing visual features for local anomalies and introducing multi-scale aggregation and learnable prompts.
Details
Motivation: Existing methods require large training samples and ignore local anomaly optimization in CLIP-based approaches.
Method: Introduces a lightweight adapter for anomaly-focused visual features, multi-scale spatial aggregation, and learnable textual prompts.
Result: Demonstrates strong zero-shot detection capability and extends to few-shot scenarios with memory banks.
Conclusion: AF-CLIP is effective and generalizable across industrial and medical datasets.
Abstract: Visual anomaly detection has been widely used in industrial inspection and medical diagnosis. Existing methods typically demand substantial training samples, limiting their utility in zero-/few-shot scenarios. While recent efforts have leveraged CLIP’s zero-shot recognition capability for this task, they often ignore optimizing visual features to focus on local anomalies, reducing their efficacy. In this work, we propose AF-CLIP (Anomaly-Focused CLIP) by dramatically enhancing its visual representations to focus on local defects. Our approach introduces a lightweight adapter that emphasizes anomaly-relevant patterns in visual features, simultaneously optimizing both class-level features for image classification and patch-level features for precise localization. To capture anomalies of different sizes and improve detection accuracy, prior to the adapter, we develop a multi-scale spatial aggregation mechanism to effectively consolidate neighborhood context. Complementing these visual enhancements, we design learnable textual prompts that generically characterize normal and abnormal states. After optimization on auxiliary datasets using a composite objective function, AF-CLIP demonstrates strong zero-shot detection capability. Our method is also extended to few-shot scenarios by extra memory banks. Experimental results across diverse industrial and medical datasets demonstrate the effectiveness and generalization of our proposed method. Code is available at https://github.com/Faustinaqq/AF-CLIP.
[207] RARE: Refine Any Registration of Pairwise Point Clouds via Zero-Shot Learning
Chengyu Zheng, Jin Huang, Honghua Chen, Mingqiang Wei
Main category: cs.CV
TL;DR: A zero-shot method refines point cloud registration using diffusion features from depth images, improving accuracy without training data.
Details
Motivation: Leverage diffusion models' semantic capabilities to enhance point cloud registration without needing dedicated training datasets.
Method: Project point clouds into depth maps, extract diffusion features, and integrate them with geometric features for better correspondences.
Result: Improved registration accuracy and robust generalization across diverse datasets.
Conclusion: The method effectively enhances point cloud registration by combining diffusion and geometric features, demonstrating strong performance.
Abstract: Recent research leveraging large-scale pretrained diffusion models has demonstrated the potential of using diffusion features to establish semantic correspondences in images. Inspired by advancements in diffusion-based techniques, we propose a novel zero-shot method for refining point cloud registration algorithms. Our approach leverages correspondences derived from depth images to enhance point feature representations, eliminating the need for a dedicated training dataset. Specifically, we first project the point cloud into depth maps from multiple perspectives and extract implicit knowledge from a pretrained diffusion network as depth diffusion features. These features are then integrated with geometric features obtained from existing methods to establish more accurate correspondences between point clouds. By leveraging these refined correspondences, our approach achieves significantly improved registration accuracy. Extensive experiments demonstrate that our method not only enhances the performance of existing point cloud registration techniques but also exhibits robust generalization capabilities across diverse datasets. Codes are available at https://github.com/zhengcy-lambo/RARE.git.
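A minimal sketch of the projection step described above: rendering a z-buffered depth map from a point cloud under an assumed pinhole camera. The intrinsics and resolution are illustrative; the pretrained diffusion network would then extract features from such depth images.

```python
import numpy as np

def project_to_depth(points, fx=500.0, fy=500.0, cx=128.0, cy=128.0, hw=(256, 256)):
    """Render a z-buffered depth map from points in camera coordinates."""
    h, w = hw
    depth = np.zeros((h, w))
    z = points[:, 2]
    valid = z > 1e-6                              # points in front of the camera
    u = np.round(points[valid, 0] * fx / z[valid] + cx).astype(int)
    v = np.round(points[valid, 1] * fy / z[valid] + cy).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, zv = u[inside], v[inside], z[valid][inside]
    order = np.argsort(-zv)                       # write far points first, so
    depth[v[order], u[order]] = zv[order]         # the nearest point wins
    return depth

cloud = np.random.rand(2048, 3) + np.array([0.0, 0.0, 1.0])  # toy cloud
print(project_to_depth(cloud).max())
```

Projecting from several viewpoints, as the paper does, amounts to transforming the cloud by each camera pose before calling this function.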
[208] Predicting Brain Responses To Natural Movies With Multimodal LLMs
Cesar Kadir Torrico Villanueva, Jiaxin Cindy Tu, Mihir Tripathy, Connor Lane, Rishab Iyer, Paul S. Scotti
Main category: cs.CV
TL;DR: MedARC’s solution for Algonauts 2025 used multimodal pretrained models, linear projection, temporal alignment, and lightweight encoders to map features to fMRI data, achieving 4th place with a mean Pearson’s correlation of 0.2085.
Details
Motivation: To improve generalization of encoding models for novel movie stimuli by combining multimodal features and optimizing model selection.
Method: Leveraged pretrained models (V-JEPA2, Whisper, Llama 3.2, InternVL3, Qwen2.5-Omni), projected features linearly, aligned temporally, and mapped to cortical parcels using shared and subject-specific heads.
Result: Achieved mean Pearson’s correlation of 0.2085, ranking 4th; a last-minute optimization could have secured 2nd place.
Conclusion: Combining multimodal features with simple shared-subject architectures and thorough model selection enhances generalization in encoding models.
Abstract: We present MedARC’s team solution to the Algonauts 2025 challenge. Our pipeline leveraged rich multimodal representations from various state-of-the-art pretrained models across video (V-JEPA2), speech (Whisper), text (Llama 3.2), vision-text (InternVL3), and vision-text-audio (Qwen2.5-Omni). These features extracted from the models were linearly projected to a latent space, temporally aligned to the fMRI time series, and finally mapped to cortical parcels through a lightweight encoder comprising a shared group head plus subject-specific residual heads. We trained hundreds of model variants across hyperparameter settings, validated them on held-out movies and assembled ensembles targeted to each parcel in each subject. Our final submission achieved a mean Pearson’s correlation of 0.2085 on the test split of withheld out-of-distribution movies, placing our team in fourth place for the competition. We further discuss a last-minute optimization that would have raised us to second place. Our results highlight how combining features from models trained in different modalities, using a simple architecture consisting of shared-subject and single-subject components, and conducting comprehensive model selection and ensembling improves generalization of encoding models to novel movie stimuli. All code is available on GitHub.
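A minimal sketch of the encoder head described in the abstract, with illustrative dimensions: a linear projection of the temporally aligned multimodal features, a shared group head, and subject-specific residual heads.

```python
import torch
import torch.nn as nn

class ParcelEncoder(nn.Module):
    def __init__(self, feat_dim=2048, latent=512, n_parcels=1000, n_subjects=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, latent)       # linear projection
        self.shared = nn.Linear(latent, n_parcels)    # shared group head
        self.residual = nn.ModuleList(                # per-subject residuals
            [nn.Linear(latent, n_parcels) for _ in range(n_subjects)]
        )

    def forward(self, feats, subject):                # feats: (T, feat_dim)
        z = self.proj(feats)
        return self.shared(z) + self.residual[subject](z)

model = ParcelEncoder()
pred = model(torch.randn(100, 2048), subject=0)       # predicted parcel series
print(pred.shape)                                     # torch.Size([100, 1000])
```

The shared head lets all subjects pool statistical strength while the residual heads absorb per-subject idiosyncrasies, which is the design the team credits for improved generalization.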
[209] Pic2Diagnosis: A Method for Diagnosis of Cardiovascular Diseases from the Printed ECG Pictures
Oğuzhan Büyüksolak, İlkay Öksüz
Main category: cs.CV
TL;DR: A two-step curriculum learning framework improves CVD diagnosis from ECG images, achieving high accuracy and robustness without digitization.
Details
Motivation: Traditional ECG diagnosis relies on outdated datasets and stepwise algorithms with limited accuracy, necessitating a more reliable automated method.
Method: Uses a two-step curriculum learning framework: pre-training on segmentation masks and fine-tuning on grayscale, inverted ECG images, enhanced by an ensemble of models.
Result: Achieves an AUC of 0.9534 and F1 score of 0.7801 on the BHF ECG Challenge dataset, outperforming individual models.
Conclusion: The method simplifies CVD diagnosis, handles real-world artifacts, and is especially useful in resource-limited settings for rapid, accurate diagnosis.
Abstract: The electrocardiogram (ECG) is a vital tool for diagnosing heart diseases. However, many disease patterns are derived from outdated datasets and traditional stepwise algorithms with limited accuracy. This study presents a method for direct cardiovascular disease (CVD) diagnosis from ECG images, eliminating the need for digitization. The proposed approach utilizes a two-step curriculum learning framework, beginning with the pre-training of a classification model on segmentation masks, followed by fine-tuning on grayscale, inverted ECG images. Robustness is further enhanced through an ensemble of three models with averaged outputs, achieving an AUC of 0.9534 and an F1 score of 0.7801 on the BHF ECG Challenge dataset, outperforming individual models. By effectively handling real-world artifacts and simplifying the diagnostic process, this method offers a reliable solution for automated CVD diagnosis, particularly in resource-limited settings where printed or scanned ECG images are commonly used. Such an automated procedure enables rapid and accurate diagnosis, which is critical for timely intervention in CVD cases that often demand urgent care.
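A minimal sketch of the ensemble step, with stub networks standing in for the three curriculum-trained models; only the averaging of sigmoid outputs is shown, and the input shape and class count are illustrative.

```python
import torch
import torch.nn as nn

def ensemble_predict(models, image):
    """Average sigmoid probabilities over an ensemble of classifiers."""
    with torch.no_grad():
        probs = [torch.sigmoid(m(image)) for m in models]
    return torch.stack(probs).mean(dim=0)

# toy stand-ins for the three fine-tuned models (5 hypothetical CVD classes)
models = [nn.Sequential(nn.Flatten(), nn.Linear(256 * 256, 5)) for _ in range(3)]
image = torch.randn(1, 1, 256, 256)        # grayscale, inverted ECG image
print(ensemble_predict(models, image))     # averaged per-class probabilities
```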
[210] Transfer or Self-Supervised? Bridging the Performance Gap in Medical Imaging
Zehui Zhao, Laith Alzubaidi, Jinglan Zhang, Ye Duan, Usman Naseem, Yuantong Gu
Main category: cs.CV
TL;DR: The paper compares transfer learning and self-supervised learning in medical applications, evaluating their performance, robustness, and suitability for challenges like data imbalance and scarcity.
Details
Motivation: To address limited data availability and improve model generalization in medical research using transfer learning and self-supervised learning.
Method: Pre-trained two models with different methods on the same source domain datasets and evaluated them on small medical datasets, testing issues like data imbalance and domain mismatch.
Result: Identified factors influencing performance and robustness, providing insights into how each method handles medical data challenges.
Conclusion: Offers recommendations for applying transfer learning and self-supervised learning in medical fields to enhance efficiency and deployment strategies.
Abstract: Recently, transfer learning and self-supervised learning have gained significant attention within the medical field due to their ability to mitigate the challenges posed by limited data availability, improve model generalisation, and reduce computational expenses. Transfer learning and self-supervised learning hold immense potential for advancing medical research. However, it is crucial to recognise that transfer learning and self-supervised learning architectures exhibit distinct advantages and limitations, manifesting variations in accuracy, training speed, and robustness. This paper compares the performance and robustness of transfer learning and self-supervised learning in the medical field. Specifically, we pre-trained two models using the same source domain datasets with different pre-training methods and evaluated them on small-sized medical datasets to identify the factors influencing their final performance. We tested data with several common issues in medical domains, such as data imbalance, data scarcity, and domain mismatch, through comparison experiments to understand their impact on specific pre-trained models. Finally, we provide recommendations to help users apply transfer learning and self-supervised learning methods in medical areas, and build more convenient and efficient deployment strategies.
[211] FROSS: Faster-than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images
Hao-Yu Hou, Chun-Yi Lee, Motoharu Sonogashira, Yasutomo Kawanishi
Main category: cs.CV
TL;DR: FROSS is a faster-than-real-time method for generating 3D semantic scene graphs (SSGs) by lifting 2D scene graphs to 3D and using Gaussian distributions, outperforming existing methods in speed and performance.
Details
Motivation: Existing 3D SSG generation methods are computationally intensive and non-incremental, limiting real-time open-world applications.
Method: FROSS lifts 2D scene graphs to 3D space, representing objects as 3D Gaussian distributions, avoiding intensive point cloud processing.
Result: FROSS achieves superior performance and faster operation than prior methods, validated on ReplicaSSG and 3DSSG datasets.
Conclusion: FROSS is an efficient solution for real-time 3D SSG generation, with publicly available implementation and dataset.
Abstract: The ability to abstract complex 3D environments into simplified and structured representations is crucial across various domains. 3D semantic scene graphs (SSGs) achieve this by representing objects as nodes and their interrelationships as edges, facilitating high-level scene understanding. Existing methods for 3D SSG generation, however, face significant challenges, including high computational demands and non-incremental processing that hinder their suitability for real-time open-world applications. To address this issue, we propose FROSS (Faster-than-Real-Time Online 3D Semantic Scene Graph Generation), an innovative approach for online and faster-than-real-time 3D SSG generation that leverages the direct lifting of 2D scene graphs to 3D space and represents objects as 3D Gaussian distributions. This framework eliminates the dependency on precise and computationally-intensive point cloud processing. Furthermore, we extend the Replica dataset with inter-object relationship annotations, creating the ReplicaSSG dataset for comprehensive evaluation of FROSS. The experimental results from evaluations on ReplicaSSG and 3DSSG datasets show that FROSS can achieve superior performance while operating significantly faster than prior 3D SSG generation methods. Our implementation and dataset are publicly available at https://github.com/Howardkhh/FROSS.
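As a rough illustration of keeping each object node as a 3D Gaussian that is updated incrementally as new back-projected points arrive, here is a Welford-style running mean and covariance; this shows the representation only, not the paper's exact update rule.

```python
import numpy as np

class GaussianObject:
    """An object node held as a running 3D Gaussian (mean and covariance)."""
    def __init__(self):
        self.n = 0
        self.mean = np.zeros(3)
        self.M2 = np.zeros((3, 3))           # running scatter matrix

    def update(self, points):                # points: (k, 3) from one frame
        for p in points:                     # Welford-style online update
            self.n += 1
            delta = p - self.mean
            self.mean += delta / self.n
            self.M2 += np.outer(delta, p - self.mean)

    @property
    def cov(self):
        return self.M2 / max(self.n - 1, 1)

obj = GaussianObject()
obj.update(np.random.randn(50, 3) * 0.1 + np.array([1.0, 0.5, 2.0]))
print(obj.mean.round(2), np.diag(obj.cov).round(3))
```

Because the state per object is just a mean and covariance, no point cloud has to be stored or reprocessed, which is what makes incremental, faster-than-real-time operation plausible.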
[212] VAMPIRE: Uncovering Vessel Directional and Morphological Information from OCTA Images for Cardiovascular Disease Risk Factor Prediction
Lehan Wang, Hualiang Wang, Chubin Ou, Lushi Chen, Yunyi Liang, Xiaomeng Li
Main category: cs.CV
TL;DR: A novel multi-purpose CVD risk assessment method using OCTA images, combining risk and condition prediction, outperforms existing approaches.
Details
Motivation: Current CVD risk methods lack detailed vascular analysis and clinical utility, prompting the need for a more comprehensive approach.
Method: Introduces OCTA-CVD dataset and VAMPIRE model with Mamba-Based Directional and Information-Enhanced Morphological modules for detailed vascular feature extraction.
Result: Outperforms standard classification backbones, OCTA-based methods, and ophthalmologic models.
Conclusion: Proposed method enhances CVD risk assessment accuracy and clinical relevance, with open-source dataset and code.
Abstract: Cardiovascular disease (CVD) remains the leading cause of death worldwide, requiring urgent development of effective risk assessment methods for timely intervention. While current research has introduced non-invasive and efficient approaches to predict CVD risk from retinal imaging with deep learning models, the commonly used fundus photographs and Optical Coherence Tomography (OCT) fail to capture detailed vascular features critical for CVD assessment compared with OCT angiography (OCTA) images. Moreover, existing methods typically classify CVD risk only as high or low, without providing a deeper analysis on CVD-related blood factor conditions, thus limiting prediction accuracy and clinical utility. As a result, we propose a novel multi-purpose paradigm of CVD risk assessment that jointly performs CVD risk and CVD-related condition prediction, aligning with clinical experiences. Based on this core idea, we introduce OCTA-CVD, the first OCTA dataset for CVD risk assessment, and a Vessel-Aware Mamba-based Prediction model with Informative Enhancement (VAMPIRE) based on OCTA enface images. Our proposed model aims to extract crucial vascular characteristics through two key components: (1) a Mamba-Based Directional (MBD) Module that captures fine-grained vascular trajectory features and (2) an Information-Enhanced Morphological (IEM) Module that incorporates comprehensive vessel morphology knowledge. Experimental results demonstrate that our method can surpass standard classification backbones, OCTA-based detection methods, and ophthalmologic foundation models. Our codes and the collected OCTA-CVD dataset are available at https://github.com/xmed-lab/VAMPIRE.
[213] Region-based Cluster Discrimination for Visual Representation Learning
Yin Xie, Kaicheng Yang, Xiang An, Kun Wu, Yongle Zhao, Weimo Deng, Zimin Ran, Yumeng Wang, Ziyong Feng, Roy Miles, Ismail Elezi, Jiankang Deng
Main category: cs.CV
TL;DR: RICE enhances region-level visual and OCR capabilities by introducing a novel method with a billion-scale dataset and unified loss, outperforming prior methods in dense tasks.
Details
Motivation: Global representations in vision-language models like CLIP and SigLIP limit effectiveness for dense prediction tasks, prompting the need for region-level enhancements.
Method: Constructs a billion-scale region dataset, uses a Region Transformer for semantics, and employs a unified region cluster discrimination loss for joint object and OCR learning.
Result: RICE consistently outperforms previous methods in segmentation, dense detection, and visual perception for MLLMs.
Conclusion: RICE addresses the limitations of global representations, offering improved performance for dense tasks and scalable training.
Abstract: Learning visual representations is foundational for a broad spectrum of downstream tasks. Although recent vision-language contrastive models, such as CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale vision-language alignment, their reliance on global representations constrains their effectiveness for dense prediction tasks, such as grounding, OCR, and segmentation. To address this gap, we introduce Region-Aware Cluster Discrimination (RICE), a novel method that enhances region-level visual and OCR capabilities. We first construct a billion-scale candidate region dataset and propose a Region Transformer layer to extract rich regional semantics. We further design a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data. Extensive experiments show that RICE consistently outperforms previous methods on tasks, including segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs). The pre-trained models have been released at https://github.com/deepglint/MVT.
[214] TAPS : Frustratingly Simple Test Time Active Learning for VLMs
Dhruv Sarkar, Aprameyo Chakrabartty, Bibhudatta Bhanja
Main category: cs.CV
TL;DR: A novel Test-Time Active Learning (TTAL) framework is proposed for real-time adaptation in streaming data, using dynamic prompts and active querying to improve performance under latency and memory constraints.
Details
Motivation: To address the challenge of adapting models in real-time streaming scenarios where only one sample is available at a time, requiring immediate decisions while respecting constraints.
Method: Introduces a TTAL framework with dynamically adjusted entropy thresholds for querying, class-balanced memory replacement, and class-aware distribution alignment.
Result: Demonstrates consistent improvements over state-of-the-art methods across 10 cross-dataset benchmarks and 4 domain generalization datasets, with reasonable overhead.
Conclusion: The framework offers a practical solution for real-world deployment in safety-critical applications like autonomous systems and medical diagnostics.
Abstract: Test-Time Optimization enables models to adapt to new data during inference by updating parameters on-the-fly. Recent advances in Vision-Language Models (VLMs) have explored learning prompts at test time to improve performance in downstream tasks. In this work, we extend this idea by addressing a more general and practical challenge: Can we effectively utilize an oracle in a continuous data stream where only one sample is available at a time, requiring an immediate query decision while respecting latency and memory constraints? To tackle this, we propose a novel Test-Time Active Learning (TTAL) framework that adaptively queries uncertain samples and updates prompts dynamically. Unlike prior methods that assume batched data or multiple gradient updates, our approach operates in a real-time streaming scenario with a single test sample per step. We introduce a dynamically adjusted entropy threshold for active querying, a class-balanced replacement strategy for memory efficiency, and a class-aware distribution alignment technique to enhance adaptation. The design choices are justified using careful theoretical analysis. Extensive experiments across 10 cross-dataset transfer benchmarks and 4 domain generalization datasets demonstrate consistent improvements over state-of-the-art methods while maintaining reasonable latency and memory overhead. Our framework provides a practical and effective solution for real-world deployment in safety-critical applications such as autonomous systems and medical diagnostics.
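A minimal sketch of the dynamically adjusted entropy threshold for active querying, one streaming sample per step. The exponential-moving-average update is an assumption for illustration, not necessarily the paper's exact scheme.

```python
import torch

class ActiveQuerier:
    def __init__(self, init_threshold=1.0, momentum=0.9):
        self.threshold = init_threshold
        self.momentum = momentum

    def should_query(self, logits):
        """Query the oracle only when prediction entropy exceeds the threshold."""
        probs = logits.softmax(dim=-1)
        h = -(probs * probs.clamp_min(1e-12).log()).sum().item()
        query = h > self.threshold
        # let the threshold drift toward the entropies seen in the stream
        self.threshold = self.momentum * self.threshold + (1 - self.momentum) * h
        return query

querier = ActiveQuerier()
for _ in range(5):                         # one test sample arrives per step
    print(querier.should_query(torch.randn(1, 10)), round(querier.threshold, 3))
```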
[215] FaRMamba: Frequency-based learning and Reconstruction aided Mamba for Medical Segmentation
Ze Rong, ZiYue Zhao, Zhaoxin Wang, Lei Ma
Main category: cs.CV
TL;DR: FaRMamba improves medical image segmentation by addressing high-frequency detail loss and spatial degradation with multi-scale frequency transforms and self-supervised reconstruction.
Details
Motivation: Challenges like blurred boundaries, high-frequency detail loss, and long-range structure modeling hinder accurate medical image segmentation.
Method: FaRMamba uses a Multi-Scale Frequency Transform Module (MSFM) and a Self-Supervised Reconstruction Auxiliary Encoder (SSRAE) to restore high-frequency details and spatial correlations.
Result: FaRMamba outperforms CNN-Transformer hybrids and Mamba variants in boundary accuracy, detail preservation, and global coherence.
Conclusion: FaRMamba offers a flexible, frequency-aware framework for future medical image segmentation models.
Abstract: Accurate medical image segmentation remains challenging due to blurred lesion boundaries (LBA), loss of high-frequency details (LHD), and difficulty in modeling long-range anatomical structures (DC-LRSS). Vision Mamba employs one-dimensional causal state-space recurrence to efficiently model global dependencies, thereby substantially mitigating DC-LRSS. However, its patch tokenization and 1D serialization disrupt local pixel adjacency and impose a low-pass filtering effect, resulting in Local High-frequency Information Capture Deficiency (LHICD) and two-dimensional Spatial Structure Degradation (2D-SSD), which in turn exacerbate LBA and LHD. In this work, we propose FaRMamba, a novel extension that explicitly addresses LHICD and 2D-SSD through two complementary modules. A Multi-Scale Frequency Transform Module (MSFM) restores attenuated high-frequency cues by isolating and reconstructing multi-band spectra via wavelet, cosine, and Fourier transforms. A Self-Supervised Reconstruction Auxiliary Encoder (SSRAE) enforces pixel-level reconstruction on the shared Mamba encoder to recover full 2D spatial correlations, enhancing both fine textures and global context. Extensive evaluations on CAMUS echocardiography, MRI-based Mouse-cochlea, and Kvasir-Seg endoscopy demonstrate that FaRMamba consistently outperforms competitive CNN-Transformer hybrids and existing Mamba variants, delivering superior boundary accuracy, detail preservation, and global coherence without prohibitive computational overhead. This work provides a flexible frequency-aware framework for future segmentation models that directly mitigates core challenges in medical imaging.
[216] The Devil is in the EOS: Sequence Training for Detailed Image Captioning
Abdelrahman Mohamed, Yova Kementchedjhieva
Main category: cs.CV
TL;DR: The paper addresses the issue of generic, short captions in vision-language models (VLMs) by identifying and mitigating a bias towards the end-of-sequence (EOS) token during training. Their unsupervised method improves caption detail without complex rewards or supervision.
Details
Motivation: VLMs often produce short, generic captions despite strong vision-language capabilities. The authors identify a bias towards the EOS token as the root cause.
Method: Proposes an unsupervised method to debias the model’s premature prediction of the EOS token, encouraging longer, more detailed captions.
Result: Experiments on three VLMs and benchmarks show increased caption length and detail, though with more hallucinations.
Conclusion: The simple, effective method enhances caption detail without needing supervision or complex rewards, applicable to any pretrained model.
Abstract: Despite significant advances in vision-language models (VLMs), image captioning often suffers from a lack of detail, with base models producing short, generic captions. This limitation persists even though VLMs are equipped with strong vision and language backbones. While supervised data and complex reward functions have been proposed to improve detailed image captioning, we identify a simpler underlying issue: a bias towards the end-of-sequence (EOS) token, which is introduced during cross-entropy training. We propose an unsupervised method to debias the model’s tendency to predict the EOS token prematurely. By reducing this bias, we encourage the generation of longer, more detailed captions without the need for intricate reward functions or supervision. Our approach is straightforward, effective, and easily applicable to any pretrained model. We demonstrate its effectiveness through experiments with three VLMs and on three detailed captioning benchmarks. Our results show a substantial increase in caption length and relevant details, albeit with an expected increase in the rate of hallucinations.
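The paper's debiasing is applied during training; as a rough decode-time illustration of the same phenomenon, the sketch below damps the EOS logit until a minimum caption length is reached. The token id, penalty, and vocabulary size are hypothetical.

```python
import torch

def damp_eos(logits, step, eos_id, min_len=30, penalty=5.0):
    """Subtract a penalty from the EOS logit while the caption is still short."""
    if step < min_len:
        logits = logits.clone()
        logits[:, eos_id] -= penalty       # discourage premature termination
    return logits

logits = torch.randn(1, 32000)             # per-step vocabulary logits
print(damp_eos(logits, step=5, eos_id=2)[0, 2].item(), logits[0, 2].item())
```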
[217] KB-DMGen: Knowledge-Based Global Guidance and Dynamic Pose Masking for Human Image Generation
Shibang Liu, Xuemei Xie, Guangming Shi
Main category: cs.CV
TL;DR: KB-DMGen improves human image generation by combining Knowledge-Based Global Guidance and Dynamic Pose Masking to ensure both pose accuracy and overall image quality.
Details
Motivation: Existing methods prioritize pose accuracy but neglect global image quality, which is crucial for realistic portrait generation.
Method: Proposes KB-DMGen, using a Knowledge Base (KB) for pose accuracy and image quality, and Dynamic Masking (DM) to adjust pose region importance.
Result: Achieves state-of-the-art results in AP and CAP on the HumanArt dataset.
Conclusion: KB-DMGen effectively balances pose accuracy and image quality, setting a new benchmark in human image generation.
Abstract: Recent methods using diffusion models have made significant progress in human image generation with various control signals such as pose priors. In portrait generation, both the accuracy of human pose and the overall visual quality are crucial for realistic synthesis. Most existing methods focus on controlling the accuracy of generated poses, but ignore the quality assurance of the entire image. In order to ensure the global image quality and pose accuracy, we propose Knowledge-Based Global Guidance and Dynamic pose Masking for human image Generation (KB-DMGen). The Knowledge Base (KB) is designed not only to enhance pose accuracy but also to leverage image feature information to maintain overall image quality. Dynamic Masking (DM) dynamically adjusts the importance of pose-related regions. Experiments demonstrate the effectiveness of our model, achieving new state-of-the-art results in terms of AP and CAP on the HumanArt dataset. The code will be made publicly available.
[218] Hybrid-Domain Synergistic Transformer for Hyperspectral Image Denoising
Haoyue Li, Di Wu
Main category: cs.CV
TL;DR: The paper proposes HDST, a hybrid-domain transformer network for hyperspectral image denoising, integrating frequency domain enhancement and multiscale modeling to handle spatial-spectral noise coupling.
Details
Motivation: Existing deep learning methods struggle with the unique spatial-spectral characteristics and complex noise distributions of hyperspectral images (HSI).
Method: HDST combines FFT preprocessing, dynamic cross-domain attention, and hierarchical architecture for 3D collaborative processing of spatial, frequency, and channel domains.
Result: HDST outperforms existing methods on real and synthetic datasets, maintaining computational efficiency.
Conclusion: The research offers a universal framework for HSI denoising and insights for high-dimensional visual data noise issues.
Abstract: Hyperspectral image denoising faces the challenge of multi-dimensional coupling of spatially non-uniform noise and spectral correlation interference. Existing deep learning methods mostly focus on RGB images and struggle to effectively handle the unique spatial-spectral characteristics and complex noise distributions of hyperspectral images (HSI). This paper proposes an HSI denoising framework, Hybrid-Domain Synergistic Transformer Network (HDST), based on frequency domain enhancement and multiscale modeling, achieving three-dimensional collaborative processing of spatial, frequency and channel domains. The method innovatively integrates three key mechanisms: (1) introducing an FFT preprocessing module with multi-band convolution to extract cross-band correlations and decouple spectral noise components; (2) designing a dynamic cross-domain attention module that adaptively fuses spatial domain texture features and frequency domain noise priors through a learnable gating mechanism; (3) building a hierarchical architecture where shallow layers capture global noise statistics using multiscale atrous convolution, and deep layers achieve detail recovery through frequency domain postprocessing. Experiments on both real and synthetic datasets demonstrate that HDST significantly improves denoising performance while maintaining computational efficiency, validating the effectiveness of the proposed method. This research provides new insights and a universal framework for addressing complex noise coupling issues in HSI and other high-dimensional visual data. The code is available at https://github.com/lhy-cn/HDST-HSIDenoise.
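A minimal sketch in the spirit of mechanism (1): splitting each band of an HSI cube into low- and high-frequency spatial components with an FFT and a radial mask, so later stages can treat texture and noise priors separately. The single cutoff is an illustrative simplification of the paper's multi-band convolution.

```python
import torch

def fft_band_split(x, cutoff=0.25):
    """x: (B, C, H, W) hyperspectral cube -> (low, high) spatial components."""
    b, c, h, w = x.shape
    freq = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    yy, xx = torch.meshgrid(
        torch.linspace(-0.5, 0.5, h), torch.linspace(-0.5, 0.5, w), indexing="ij"
    )
    mask = ((yy ** 2 + xx ** 2).sqrt() <= cutoff).to(x.dtype)  # low-pass disk
    low = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1))).real
    high = x - low                       # residual carries high-frequency cues
    return low, high

low, high = fft_band_split(torch.randn(2, 31, 64, 64))
print(low.shape, high.shape)
```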
[219] Detection of Medial Epicondyle Avulsion in Elbow Ultrasound Images via Bone Structure Reconstruction
Shizuka Akahori, Shotaro Teruya, Pragyan Shrestha, Yuichi Yoshii, Satoshi Iizuka, Akira Ikumi, Hiromitsu Tsuge, Itaru Kitahara
Main category: cs.CV
TL;DR: A reconstruction-based framework using masked autoencoders detects medial epicondyle avulsion in elbow ultrasound images by learning normal bone structures, achieving high accuracy.
Details
Motivation: Medial epicondyle avulsion, common in baseball players, involves bone detachment. Detecting it requires understanding normal bone continuity, as abnormalities appear as discontinuities.
Method: A masked autoencoder-based framework learns normal bone structure continuity. It reconstructs normal structures, highlighting avulsion sites through large reconstruction errors.
Result: The method achieved pixel-wise AUC of 0.965 and image-wise AUC of 0.967, outperforming existing approaches.
Conclusion: The proposed framework effectively detects avulsion by leveraging normal bone structure learning, with a publicly available dataset for further research.
Abstract: This study proposes a reconstruction-based framework for detecting medial epicondyle avulsion in elbow ultrasound images, trained exclusively on normal cases. Medial epicondyle avulsion, commonly observed in baseball players, involves bone detachment and deformity, often appearing as discontinuities in bone contour. Therefore, learning the structure and continuity of normal bone is essential for detecting such abnormalities. To achieve this, we propose a masked autoencoder-based, structure-aware reconstruction framework that learns the continuity of normal bone structures. Even in the presence of avulsion, the model attempts to reconstruct the normal structure, resulting in large reconstruction errors at the avulsion site. For evaluation, we constructed a novel dataset comprising normal and avulsion ultrasound images from 16 baseball players, with pixel-level annotations under orthopedic supervision. Our method outperformed existing approaches, achieving a pixel-wise AUC of 0.965 and an image-wise AUC of 0.967. The dataset is publicly available at: https://github.com/Akahori000/Ultrasound-Medial-Epicondyle-Avulsion-Dataset.
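A minimal sketch of the scoring step: given a reconstruction from a model trained only on normal bone, large per-pixel errors flag the avulsion site. The top-k aggregation for the image-wise score is an assumption, and the autoencoder itself is stubbed out.

```python
import numpy as np

def anomaly_scores(recon, image, top_frac=0.01):
    """Pixel-wise squared error; image score is the mean of the worst pixels."""
    err = (recon - image) ** 2
    k = max(1, int(top_frac * err.size))
    image_score = np.sort(err.ravel())[-k:].mean()
    return err, image_score

rng = np.random.default_rng(0)
image = rng.random((256, 256))                       # a "normal" test image
recon = image + rng.normal(0, 0.01, image.shape)     # near-perfect reconstruction
err_map, score = anomaly_scores(recon, image)
print(err_map.shape, round(float(score), 5))         # low score: likely normal
```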
[220] NeuroVoxel-LM: Language-Aligned 3D Perception via Dynamic Voxelization and Meta-Embedding
Shiyu Liu, Lianlei Shan
Main category: cs.CV
TL;DR: NeuroVoxel-LM improves 3D scene perception by combining NeRF with dynamic voxelization and lightweight meta-embedding, addressing inefficiencies in existing models.
Details
Motivation: Existing 3D language models struggle with sparse, large-scale point clouds due to slow feature extraction and limited accuracy.
Method: Proposes NeuroVoxel-LM with Dynamic Resolution Multiscale Voxelization (DR-MSV) and Token-level Adaptive Pooling for Lightweight Meta-Embedding (TAP-LME).
Result: DR-MSV enhances efficiency and accuracy; TAP-LME outperforms max-pooling in semantic representation.
Conclusion: NeuroVoxel-LM effectively addresses challenges in 3D language models, improving performance and fidelity.
Abstract: Recent breakthroughs in Visual Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have significantly advanced 3D scene perception towards language-driven cognition. However, existing 3D language models struggle with sparse, large-scale point clouds due to slow feature extraction and limited representation accuracy. To address these challenges, we propose NeuroVoxel-LM, a novel framework that integrates Neural Radiance Fields (NeRF) with dynamic resolution voxelization and lightweight meta-embedding. Specifically, we introduce a Dynamic Resolution Multiscale Voxelization (DR-MSV) technique that adaptively adjusts voxel granularity based on geometric and structural complexity, reducing computational cost while preserving reconstruction fidelity. In addition, we propose the Token-level Adaptive Pooling for Lightweight Meta-Embedding (TAP-LME) mechanism, which enhances semantic representation through attention-based weighting and residual fusion. Experimental results demonstrate that DR-MSV significantly improves point cloud feature extraction efficiency and accuracy, while TAP-LME outperforms conventional max-pooling in capturing fine-grained semantics from NeRF weights.
[221] RESCUE: Crowd Evacuation Simulation via Controlling SDM-United Characters
Xiaolin Liu, Tianyi Zhou, Hongbo Kang, Jian Ma, Ziwen Wang, Jing Huang, Wenguo Weng, Yu-Kun Lai, Kun Li
Main category: cs.CV
TL;DR: A real-time 3D crowd evacuation simulation framework is proposed, integrating a 3D-adaptive SFM Decision Mechanism and Personalized Gait Control Motor to simulate complex human behaviors and terrain effects, enhancing realism and dynamic awareness.
Details
Motivation: Current evacuation models fail to simulate real-world human behaviors like collisions, interactions, and terrain/body shape influences, limiting accuracy.
Method: Proposes a framework aligned with the SDM flow, combining 3D-adaptive SFM for decision-making and Personalized Gait Control for movement, with Part-level Force Visualization for analysis.
Result: The framework supports dynamic trajectory planning, personalized agent behavior, and uneven terrain compatibility, producing realistic evacuation visuals.
Conclusion: The method improves realism in crowd evacuation simulations, offering better insights and compatibility with diverse scenarios.
Abstract: Crowd evacuation simulation is critical for enhancing public safety and in demand for realistic virtual environments. Current mainstream evacuation models overlook the complex human behaviors that occur during evacuation, such as pedestrian collisions, interpersonal interactions, and variations in behavior influenced by terrain types or individual body shapes. This results in a failure to accurately simulate how people escape in the real world. In this paper, aligned with the sensory-decision-motor (SDM) flow of the human brain, we propose a real-time 3D crowd evacuation simulation framework that integrates a 3D-adaptive SFM (Social Force Model) Decision Mechanism and a Personalized Gait Control Motor. This framework allows multiple agents to move in parallel and is suitable for various scenarios, with dynamic crowd awareness. Additionally, we introduce Part-level Force Visualization to assist in evacuation analysis. Experimental results demonstrate that our framework supports dynamic trajectory planning and personalized behavior for each agent throughout the evacuation process, and is compatible with uneven terrain. Visually, our method generates evacuation results that are more realistic and plausible, providing enhanced insights for crowd simulation. The code is available at http://cic.tju.edu.cn/faculty/likun/projects/RESCUE.
[222] Local2Global query Alignment for Video Instance Segmentation
Rajat Koner, Zhipeng Wang, Srinivas Parthasarathy, Chinghang Chen
Main category: cs.CV
TL;DR: Local2Global (L2G) is an online video instance segmentation framework that uses local and global queries with a novel L2G-aligner for temporal consistency, achieving state-of-the-art performance.
Details
Motivation: Addressing challenges like noise accumulation, drift, occlusions, and scene transitions in online video segmentation to improve temporal consistency.
Method: Uses DETR-based query propagation with local and global queries, and a lightweight L2G-aligner transformer decoder for alignment.
Result: Achieves 54.3 AP on Youtube-VIS-19, 49.4 AP on Youtube-VIS-21, and 37.0 AP on OVIS with ResNet-50.
Conclusion: L2G provides a simple yet effective online solution for video instance segmentation, outperforming benchmarks without complex mechanisms.
Abstract: Online video segmentation methods excel at handling long sequences and capturing gradual changes, making them ideal for real-world applications. However, achieving temporally consistent predictions remains a challenge, especially with the gradual accumulation of noise or drift in online propagation, abrupt occlusions, and scene transitions. This paper introduces Local2Global, an online framework for video instance segmentation, exhibiting state-of-the-art performance with a simple baseline and training purely in an online fashion. Leveraging the DETR-based query propagation framework, we introduce two novel sets of queries: (1) local queries that capture initial object-specific spatial features from each frame and (2) global queries containing past spatio-temporal representations. We propose the L2G-aligner, a novel lightweight transformer decoder, to facilitate an early alignment between local and global queries. This alignment allows our model to effectively utilize current frame information while maintaining temporal consistency, producing a smooth transition between frames. Furthermore, the L2G-aligner is integrated within the segmentation model, without relying on additional complex heuristics or memory mechanisms. Extensive experiments across various challenging VIS and VPS datasets showcase the superiority of our method with simple online training, surpassing current benchmarks without bells and whistles. For instance, we achieve 54.3 and 49.4 AP on the Youtube-VIS-19/-21 datasets and 37.0 AP on the OVIS dataset, respectively, with the ResNet-50 backbone.
[223] Multi-output Deep-Supervised Classifier Chains for Plant Pathology
Jianping Yao, Son N. Tran
Main category: cs.CV
TL;DR: The paper proposes Mo-DsCC, a model for plant leaf disease classification, integrating plant species and disease type predictions for improved accuracy.
Details
Motivation: Existing methods overlook the relationship between plant species and disease types, limiting performance.
Method: Mo-DsCC uses a modified VGG-16 backbone, deep supervision, and classification chains to link plant species and disease predictions.
Result: Mo-DsCC outperforms other methods in accuracy and F1-score on Plant Village and PlantDoc datasets.
Conclusion: Mo-DsCC is a promising tool for smart agriculture, offering practical benefits and novel insights.
Abstract: Plant leaf disease classification is an important task in smart agriculture which plays a critical role in sustainable production. Modern machine learning approaches have shown unprecedented potential in this classification task, offering an array of benefits including time saving and cost reduction. However, most recent approaches directly employ convolutional neural networks where the effect of the relationship between plant species and disease types on prediction performance is not properly studied. In this study, we propose a new model named Multi-output Deep Supervised Classifier Chains (Mo-DsCC), which weaves the prediction of plant species and disease by chaining the output layers for the two labels. Mo-DsCC consists of three components: a modified VGG-16 network as the backbone, deep supervision training, and a stack of classification chains. To evaluate the advantages of our model, we perform intensive experiments on two benchmark datasets, Plant Village and PlantDoc. Comparison to recent approaches, including multi-model, multi-label (Power-set), multi-output and multi-task, demonstrates that Mo-DsCC achieves better accuracy and F1-score. The empirical study in this paper shows that Mo-DsCC could be a useful tool for smart agriculture, benefiting farms and bringing new ideas to industry and academia.
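A minimal sketch of the classification-chain idea with illustrative class counts: the disease head consumes the backbone features concatenated with the species logits, so the species prediction conditions the disease prediction, and both heads can carry deep-supervision losses.

```python
import torch
import torch.nn as nn

class ClassifierChain(nn.Module):
    def __init__(self, feat_dim=512, n_species=14, n_diseases=21):
        super().__init__()
        self.species_head = nn.Linear(feat_dim, n_species)
        # the chain: the disease head also sees the species logits
        self.disease_head = nn.Linear(feat_dim + n_species, n_diseases)

    def forward(self, feats):
        species = self.species_head(feats)
        disease = self.disease_head(torch.cat([feats, species], dim=-1))
        return species, disease            # both outputs are supervised

model = ClassifierChain()
s, d = model(torch.randn(8, 512))          # features from a VGG-16 backbone
print(s.shape, d.shape)
```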
[224] An Automated Deep Segmentation and Spatial-Statistics Approach for Post-Blast Rock Fragmentation Assessment
Yukun Yang
Main category: cs.CV
TL;DR: An end-to-end pipeline using a fine-tuned YOLO12l-seg model for real-time instance segmentation and 3D spatial analysis of post-blast images, achieving high accuracy and robustness.
Details
Motivation: To automate and improve the accuracy of blast-effect assessment in field conditions, especially for small-object crowding.
Method: Fine-tuned YOLO12l-seg model trained on 500+ annotated images, converting high-fidelity masks into 3D coordinates and extracting multi-metric spatial descriptors.
Result: Achieved Box mAP@0.5 ~ 0.769, Mask mAP@0.5 ~ 0.800 at ~15 FPS, with demonstrated accuracy and robustness.
Conclusion: The framework is feasible for rapid, automated blast-effect assessment, showcasing key fragmentation patterns effectively.
Abstract: We introduce an end-to-end pipeline that leverages a fine-tuned YOLO12l-seg model – trained on over 500 annotated post-blast images – to deliver real-time instance segmentation (Box mAP@0.5 ~ 0.769, Mask mAP@0.5 ~ 0.800 at ~ 15 FPS). High-fidelity masks are converted into normalized 3D coordinates, from which we extract multi-metric spatial descriptors: principal component directions, kernel density hotspots, size-depth regression, and Delaunay edge statistics. We present four representative examples to illustrate key fragmentation patterns. Experimental results confirm the framework’s accuracy, robustness to small-object crowding, and feasibility for rapid, automated blast-effect assessment in field conditions.
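A minimal sketch of two of the spatial descriptors (principal component directions and Delaunay edge statistics), assuming fragment centroids have already been lifted to normalized 3D coordinates from the segmentation masks.

```python
import numpy as np
from scipy.spatial import Delaunay

def spatial_descriptors(centroids):
    """centroids: (n, 3) normalized fragment coordinates."""
    centered = centroids - centroids.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    principal_dirs = vt                     # rows are principal component axes
    tri = Delaunay(centroids[:, :2])        # triangulate in the image plane
    edges = set()
    for simplex in tri.simplices:           # collect unique triangle edges
        for i in range(3):
            edges.add(tuple(sorted((simplex[i], simplex[(i + 1) % 3]))))
    lengths = np.array([np.linalg.norm(centroids[a] - centroids[b])
                        for a, b in edges])
    return principal_dirs, lengths.mean(), lengths.std()

dirs, mu, sigma = spatial_descriptors(np.random.rand(40, 3))
print(dirs.shape, round(float(mu), 3), round(float(sigma), 3))
```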
[225] Wavelet-guided Misalignment-aware Network for Visible-Infrared Object Detection
Haote Zhang, Lipeng Gu, Wuzhou Quan, Fu Lee Wang, Honghui Fan, Jiali Tang, Dingkun Zhu, Haoran Xie, Xiaoping Zhang, Mingqiang Wei
Main category: cs.CV
TL;DR: WMNet improves visible-infrared object detection by addressing misalignments through wavelet-guided analysis and modality-aware fusion.
Details
Motivation: Performance in visible-infrared object detection is limited by misalignments due to resolution disparities, spatial displacements, and modality inconsistencies.
Method: Proposes WMNet, using wavelet-based multi-frequency analysis and modality-aware fusion to align and integrate cross-modal features.
Result: WMNet achieves state-of-the-art performance on misaligned cross-modal object detection tasks across multiple datasets.
Conclusion: WMNet effectively addresses misalignment issues, enhancing detection robustness and accuracy.
Abstract: Visible-infrared object detection aims to enhance the detection robustness by exploiting the complementary information of visible and infrared image pairs. However, its performance is often limited by frequent misalignments caused by resolution disparities, spatial displacements, and modality inconsistencies. To address this issue, we propose the Wavelet-guided Misalignment-aware Network (WMNet), a unified framework designed to adaptively address different cross-modal misalignment patterns. WMNet incorporates wavelet-based multi-frequency analysis and modality-aware fusion mechanisms to improve the alignment and integration of cross-modal features. By jointly exploiting low and high-frequency information and introducing adaptive guidance across modalities, WMNet alleviates the adverse effects of noise, illumination variation, and spatial misalignment. Furthermore, it enhances the representation of salient target features while suppressing spurious or misleading information, thereby promoting more accurate and robust detection. Extensive evaluations on the DVTOD, DroneVehicle, and M3FD datasets demonstrate that WMNet achieves state-of-the-art performance on misaligned cross-modal object detection tasks, confirming its effectiveness and practical applicability.
[226] GT-Mean Loss: A Simple Yet Effective Solution for Brightness Mismatch in Low-Light Image Enhancement
Jingxi Liao, Shijie Hao, Richang Hong, Meng Wang
Main category: cs.CV
TL;DR: The paper addresses brightness mismatch in supervised low-light image enhancement (LLIE) and proposes the GT-mean loss to improve model performance.
Details
Motivation: Brightness mismatch between enhanced images and ground truth misleads model training, yet is overlooked in current LLIE research.
Method: Introduces the GT-mean loss, a probabilistic approach to align image mean values, extending existing loss functions with minimal computational cost.
Result: Experiments show consistent performance improvements across methods and datasets when using the GT-mean loss.
Conclusion: The GT-mean loss effectively mitigates brightness mismatch, enhancing supervised LLIE model performance.
Abstract: Low-light image enhancement (LLIE) aims to improve the visual quality of images captured under poor lighting conditions. In supervised LLIE research, there exists a significant yet often overlooked inconsistency between the overall brightness of an enhanced image and its ground truth counterpart, referred to as brightness mismatch in this study. Brightness mismatch negatively impacts supervised LLIE models by misleading model training. However, this issue is largely neglected in current research. In this context, we propose the GT-mean loss, a simple yet effective loss function directly modeling the mean values of images from a probabilistic perspective. The GT-mean loss is flexible, as it extends existing supervised LLIE loss functions into the GT-mean form with minimal additional computational costs. Extensive experiments demonstrate that the incorporation of the GT-mean loss results in consistent performance improvements across various methods and datasets.
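A minimal sketch of the idea: extend a base L1 reconstruction loss with a term penalizing the gap between the global mean intensities of the prediction and the ground truth. The paper derives its form probabilistically, so the simple weighted sum below is only an approximation of that construction.

```python
import torch
import torch.nn.functional as F

def gt_mean_l1(pred, gt, weight=1.0):
    """L1 loss plus a penalty on the per-image mean-brightness gap."""
    base = F.l1_loss(pred, gt)
    mean_gap = (pred.mean(dim=(1, 2, 3)) - gt.mean(dim=(1, 2, 3))).abs().mean()
    return base + weight * mean_gap

pred = torch.rand(4, 3, 64, 64, requires_grad=True)   # enhanced images
gt = torch.rand(4, 3, 64, 64)                          # ground truth
loss = gt_mean_l1(pred, gt)
loss.backward()
print(round(loss.item(), 4))
```

Because it only adds a mean-matching term on top of whatever base loss a method already uses, the same wrapper pattern would apply to SSIM or perceptual losses as well.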
[227] Trust the Model: Compact VLMs as In-Context Judges for Image-Text Data Quality
Daulet Toibazar, Kesen Wang, Sherif Mohamed, Abdulaziz Al-Badawi, Abdulrahman Alfulayt, Pedro J. Moreno
Main category: cs.CV
TL;DR: A lightweight VLM-based framework filters noisy web data to improve training quality, matching or outperforming larger datasets.
Details
Motivation: Maintaining data quality in VLMs is challenging; curated data often outperforms larger, noisier datasets.
Method: Uses a compact VLM fine-tuned on high-quality data to filter training samples by quality and alignment.
Result: Filtered datasets perform as well as or better than larger, noisier datasets.
Conclusion: The method offers a lightweight, effective solution for high-quality vision-language training data.
Abstract: Vision-language models (VLMs) extend conventional large language models by integrating visual data, enabling richer multimodal reasoning and significantly broadening the practical applications of AI. However, including visual inputs also brings new challenges in maintaining data quality. Empirical evidence consistently shows that carefully curated and representative training examples often yield superior results compared to simply increasing the quantity of data. Inspired by this observation, we introduce a streamlined data filtration framework that employs a compact VLM, fine-tuned on a high-quality image-caption annotated dataset. This model effectively evaluates and filters potential training samples based on caption and image quality and alignment. Unlike previous approaches, which typically add auxiliary filtration modules on top of existing full-scale VLMs, our method exclusively utilizes the inherent evaluative capability of a purpose-built small VLM. This strategy eliminates the need for extra modules and reduces training overhead. Our lightweight model efficiently filters out inaccurate, noisy web data, improving image-text alignment and caption linguistic fluency. Experimental results show that datasets that underwent high-precision filtration using our compact VLM perform on par with, or even surpass, larger and noisier datasets gathered through high-volume web crawling. Thus, our method provides a lightweight yet robust solution for building high-quality vision-language training corpora. Availability and implementation: Our compact VLM filtration model, training data, utility scripts, and Supplementary data (Appendices) are freely available at https://github.com/daulettoibazar/Compact_VLM_Filter.
[228] AnimeColor: Reference-based Animation Colorization with Diffusion Transformers
Yuhong Zhang, Liyao Wang, Han Wang, Danni Wu, Zuzeng Lin, Feng Wang, Li Song
Main category: cs.CV
TL;DR: AnimeColor is a reference-based animation colorization framework using Diffusion Transformers (DiT) to improve color accuracy and temporal consistency. It includes High-level Color Extractor (HCE) and Low-level Color Guider (LCG) for semantic and fine-grained color guidance.
Details
Motivation: Existing methods for animation colorization lack color accuracy and temporal consistency, which are crucial for high-quality animation production.
Method: The framework integrates sketch sequences into a DiT-based video diffusion model, using HCE for semantic color and LCG for fine-grained details. A multi-stage training strategy optimizes reference image color usage.
Result: AnimeColor outperforms existing methods in color accuracy, sketch alignment, temporal consistency, and visual quality.
Conclusion: AnimeColor advances animation colorization and offers a practical solution for industrial use, with code available for public access.
Abstract: Animation colorization plays a vital role in animation production, yet existing methods struggle to achieve color accuracy and temporal consistency. To address these challenges, we propose AnimeColor, a novel reference-based animation colorization framework leveraging Diffusion Transformers (DiT). Our approach integrates sketch sequences into a DiT-based video diffusion model, enabling sketch-controlled animation generation. We introduce two key components: a High-level Color Extractor (HCE) to capture semantic color information and a Low-level Color Guider (LCG) to extract fine-grained color details from reference images. These components work synergistically to guide the video diffusion process. Additionally, we employ a multi-stage training strategy to maximize the utilization of reference image color information. Extensive experiments demonstrate that AnimeColor outperforms existing methods in color accuracy, sketch alignment, temporal consistency, and visual quality. Our framework not only advances the state of the art in animation colorization but also provides a practical solution for industrial applications. The code will be made publicly available at https://github.com/IamCreateAI/AnimeColor.
[229] Player-Centric Multimodal Prompt Generation for Large Language Model Based Identity-Aware Basketball Video Captioning
Zeyu Xi, Haoying Sun, Yaofei Wu, Junchi Yan, Haoran Zhang, Lifang Wu, Liang Wang, Changwen Chen
Main category: cs.CV
TL;DR: The paper proposes a player-centric multimodal prompt generation network (LLM-IAVC) for identity-aware sports video captioning, addressing limitations of existing methods by focusing on player identities from a visual perspective.
Details
Motivation: Existing methods overlook player identities or rely on incorrect extra information, limiting their applicability.Method: The model includes an identity-related information extraction module (IRIEM) with a player identification network (PIN) and bidirectional semantic interaction module (BSIM), plus a visual context learning module (VCLM), integrated as prompts for an LLM.
Result: The model achieves advanced performance on the NBA-Identity and VC-NBA-2022 benchmarks.
Conclusion: The proposed LLM-IAVC effectively generates identity-aware descriptions, supported by a new dataset (NBA-Identity).
Abstract: Existing sports video captioning methods often focus on the action yet overlook player identities, limiting their applicability. Although some methods integrate extra information to generate identity-aware descriptions, the player identities are sometimes incorrect because the extra information is independent of the video content. This paper proposes a player-centric multimodal prompt generation network for identity-aware sports video captioning (LLM-IAVC), which focuses on recognizing player identities from a visual perspective. Specifically, an identity-related information extraction module (IRIEM) is designed to extract player-related multimodal embeddings. IRIEM includes a player identification network (PIN) for extracting visual features and player names, and a bidirectional semantic interaction module (BSIM) to link player features with video content for mutual enhancement. Additionally, a visual context learning module (VCLM) is designed to capture the key video context information. Finally, by integrating the outputs of the above modules as the multimodal prompt for the large language model (LLM), it facilitates the generation of descriptions with player identities. To support this work, we construct a new benchmark called NBA-Identity, a large identity-aware basketball video captioning dataset with 9,726 videos covering 9 major event types. The experimental results on NBA-Identity and VC-NBA-2022 demonstrate that our proposed model achieves advanced performance. Code and dataset are publicly available at https://github.com/Zeyu1226-mt/LLM-IAVC.
[230] PUMPS: Skeleton-Agnostic Point-based Universal Motion Pre-Training for Synthesis in Human Motion Tasks
Clinton Ansun Mo, Kun Hu, Chengjiang Long, Dong Yuan, Wan-Chi Siu, Zhiyong Wang
Main category: cs.CV
TL;DR: PUMPS is an autoencoder for Temporal Point Clouds (TPCs) enabling motion synthesis tasks like prediction and interpolation, outperforming specialized methods.
Details
Motivation: Motion data transfer across skeletons is challenging due to structural differences. TPCs offer compatibility but lack direct learning capabilities.Method: PUMPS reduces TPCs into feature vectors, uses latent noise for sampling, and employs linear assignment for reconstruction, avoiding costly attention mechanisms.
Result: PUMPS matches state-of-the-art in pre-training tasks and outperforms specialized methods in fine-tuning for denoising or estimation.
Conclusion: PUMPS provides a generalist yet effective solution for TPC-based motion synthesis, excelling in versatility and performance.
Abstract: Motion skeletons drive 3D character animation by transforming bone hierarchies, but differences in proportions or structure make motion data hard to transfer across skeletons, posing challenges for data-driven motion synthesis. Temporal Point Clouds (TPCs) offer an unstructured, cross-compatible motion representation. Though reversible with skeletons, TPCs mainly serve for compatibility, not for direct motion task learning. Doing so would require data synthesis capabilities for the TPC format, which presents unexplored challenges regarding its unique temporal consistency and point identifiability. Therefore, we propose PUMPS, the primordial autoencoder architecture for TPC data. PUMPS independently reduces frame-wise point clouds into sampleable feature vectors, from which a decoder extracts distinct temporal points using latent Gaussian noise vectors as sampling identifiers. We introduce linear assignment-based point pairing to optimise the TPC reconstruction process, and negate the use of expensive point-wise attention mechanisms in the architecture. Using these latent features, we pre-train a motion synthesis model capable of performing motion prediction, transition generation, and keyframe interpolation. For these pre-training tasks, PUMPS performs remarkably well even without native dataset supervision, matching state-of-the-art performance. When fine-tuned for motion denoising or estimation, PUMPS outperforms many respective methods without deviating from its generalist architecture.
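Code sketch: The linear assignment-based point pairing can be made concrete with an off-the-shelf Hungarian solver: match predicted and target points one-to-one, then penalize the matched distances. One plausible formulation of such a loss (not the paper's exact objective):
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assignment_reconstruction_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Pair each predicted point with one target point via linear assignment,
    then average the matched distances. pred/target: (N, 3) arrays for one frame."""
    # Cost matrix of squared Euclidean distances between all point pairs.
    diff = pred[:, None, :] - target[None, :, :]
    cost = (diff ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one pairing
    return float(cost[rows, cols].mean())

rng = np.random.default_rng(0)
target = rng.normal(size=(64, 3))
pred = target[rng.permutation(64)] + 0.01 * rng.normal(size=(64, 3))
print(assignment_reconstruction_loss(pred, target))  # small: points match up
```
Because the assignment is order-invariant, the loss does not depend on which latent noise vector produced which point, which is what lets the decoder avoid expensive point-wise attention.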
[231] LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks
Fei Kong, Jinhao Duan, Kaidi Xu, Zhenhua Guo, Xiaofeng Zhu, Xiaoshuang Shi
Main category: cs.CV
TL;DR: The paper introduces a spatial evaluation pipeline and benchmark for Vision-Language Models (VLMs), highlighting their limitations in spatial understanding compared to humans.
Details
Motivation: Real-world applications like autonomous driving and robotics require precise spatial perception, but VLMs' spatial understanding is underexplored.Method: A synthetic dataset is created to test VLMs on absolute and 3D spatial understanding tasks.
Result: Humans perform near-perfectly, while VLMs only match humans on simple tasks and fail on others.
Conclusion: VLMs need significant improvement in spatial understanding, and the benchmark provides a tool for future research.
Abstract: Real-world applications, such as autonomous driving and humanoid robot manipulation, require precise spatial perception. However, it remains underexplored how Vision-Language Models (VLMs) recognize spatial relationships and perceive spatial movement. In this work, we introduce a spatial evaluation pipeline and construct a corresponding benchmark. Specifically, we categorize spatial understanding into two main types: absolute spatial understanding, which involves querying the absolute spatial position (e.g., left, right) of an object within an image, and 3D spatial understanding, which includes movement and rotation. Notably, our dataset is entirely synthetic, enabling the generation of test samples at a low cost while also preventing dataset contamination. We conduct experiments on multiple state-of-the-art VLMs and observe that there is significant room for improvement in their spatial understanding abilities. Explicitly, in our experiments, humans achieve near-perfect performance on all tasks, whereas current VLMs attain human-level performance only on the two simplest tasks. For the remaining tasks, the performance of VLMs is distinctly lower than that of humans. In fact, the best-performing Vision-Language Models even achieve near-zero scores on multiple tasks. The dataset and code are available on https://github.com/kong13661/LRR-Bench.
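Code sketch: Because the benchmark is synthetic, ground-truth spatial labels come for free from the generator's own geometry. A toy illustration of how such absolute-position samples might be produced (the object set and question phrasing are hypothetical, not taken from LRR-Bench):
```python
import random

OBJECTS = ["cube", "sphere", "cone"]  # hypothetical object set

def make_absolute_sample(rng: random.Random, width: int = 640) -> dict:
    """Place one object at a random x position and derive the ground-truth
    'left'/'right' answer from geometry, so labels are exact and free."""
    obj = rng.choice(OBJECTS)
    x = rng.randrange(width)
    answer = "left" if x < width / 2 else "right"
    question = f"Is the {obj} on the left or the right side of the image?"
    return {"object": obj, "x": x, "question": question, "answer": answer}

rng = random.Random(42)
for _ in range(3):
    print(make_absolute_sample(rng))
```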
[232] Towards Universal Modal Tracking with Online Dense Temporal Token Learning
Yaozong Zheng, Bineng Zhong, Qihua Liang, Shengping Zhang, Guorong Li, Xianxian Li, Rongrong Ji
Main category: cs.CV
TL;DR: A universal video-level tracking model (Modaltracker) with online dense temporal token learning supports multi-modal tasks (RGB, RGB+Thermal, etc.) using the same architecture and parameters. It focuses on video-level sampling, association, and modality scalability, achieving state-of-the-art performance.
Details
Motivation: To create a unified tracking model that handles various modalities (RGB, thermal, etc.) without requiring separate training for each, reducing complexity and improving efficiency.Method: Introduces video-level sampling and association, along with gated perceivers for adaptive cross-modal learning. Uses one-shot training to compress multi-modal representations into a single model.
Result: Achieves state-of-the-art performance on visible and multi-modal benchmarks.
Conclusion: Modaltracker offers a scalable, efficient solution for multi-modal tracking, leveraging temporal prompts and one-shot training to outperform existing methods.
Abstract: We propose a universal video-level modality-awareness tracking model with online dense temporal token learning (called Modaltracker). It is designed to support various tracking tasks, including RGB, RGB+Thermal, RGB+Depth, and RGB+Event, utilizing the same model architecture and parameters. Specifically, our model is designed with three core goals: \textbf{Video-level Sampling}. We expand the model's inputs to a video sequence level, aiming to see a richer video context from a near-global perspective. \textbf{Video-level Association}. Furthermore, we introduce two simple yet effective online dense temporal token association mechanisms to propagate the appearance and motion trajectory information of the target in a video-stream manner. \textbf{Modality Scalable}. We propose two novel gated perceivers that adaptively learn cross-modal representations via a gated attention mechanism, and subsequently compress them into the same set of model parameters via a one-shot training manner for multi-task inference. This new solution brings the following benefits: (i) The purified token sequences can serve as temporal prompts for the inference in the next video frames, whereby previous information is leveraged to guide future inference. (ii) Unlike multi-modal trackers that require independent training, our one-shot training scheme not only alleviates the training burden, but also improves model representation. Extensive experiments on visible and multi-modal benchmarks show that our Modaltracker achieves a new \textit{SOTA} performance. The code will be available at https://github.com/GXNU-ZhongLab/ODTrack.
[233] MoCTEFuse: Illumination-Gated Mixture of Chiral Transformer Experts for Multi-Level Infrared and Visible Image Fusion
Li Jinfu, Song Hong, Xia Jianghan, Lin Yucong, Wang Ting, Shao Long, Fan Jingfan, Yang Jian
Main category: cs.CV
TL;DR: MoCTEFuse is a dynamic multi-level image fusion network that addresses illumination changes in infrared and visible image fusion using an illumination-gated Mixture of Chiral Transformer Experts (MoCTE).
Details
Motivation: Illumination changes degrade fusion quality, and existing methods ignore this, causing modality bias. MoCTEFuse aims to balance texture details and object contrasts adaptively.Method: Uses MoCTE with high- and low-illumination expert subnetworks, Chiral Transformer Fusion Blocks (CTFB), and a competitive loss function integrating illumination distributions.
Result: Achieves superior fusion performance on datasets (DroneVehicle, MSRS, TNO, RoadScene) and best detection mAP (70.93% on MFNet, 45.14% on DroneVehicle).
Conclusion: MoCTEFuse effectively handles illumination changes, outperforming existing methods in fusion quality and detection accuracy.
Abstract: While illumination changes inevitably affect the quality of infrared and visible image fusion, many outstanding methods still ignore this factor and directly merge the information from source images, leading to modality bias in the fused results. To this end, we propose a dynamic multi-level image fusion network called MoCTEFuse, which applies an illumination-gated Mixture of Chiral Transformer Experts (MoCTE) to adaptively preserve texture details and object contrasts in balance. MoCTE consists of high- and low-illumination expert subnetworks, each built upon the Chiral Transformer Fusion Block (CTFB). Guided by the illumination gating signals, CTFB dynamically switches between the primary and auxiliary modalities as well as assigning them corresponding weights with its asymmetric cross-attention mechanism. Meanwhile, it is stacked at multiple stages to progressively aggregate and refine modality-specific and cross-modality information. To facilitate robust training, we propose a competitive loss function that integrates illumination distributions with three levels of sub-loss terms. Extensive experiments conducted on the DroneVehicle, MSRS, TNO and RoadScene datasets show MoCTEFuse’s superior fusion performance. Finally, it achieves the best detection mean Average Precision (mAP) of 70.93% on the MFNet dataset and 45.14% on the DroneVehicle dataset. The code and model are released at https://github.com/Bitlijinfu/MoCTEFuse.
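Code sketch: The gating idea is easy to state in code: a scalar illumination score, predicted from the visible image, softly routes between a high- and a low-illumination expert. A rough sketch under those assumptions (this is not the paper's CTFB, which uses asymmetric cross-attention rather than plain convolutions):
```python
import torch
import torch.nn as nn

class IlluminationGatedMoE(nn.Module):
    """Blend a high- and a low-illumination expert with a scalar gate
    predicted from the visible image; a rough sketch, not MoCTEFuse itself."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.high_expert = nn.Conv2d(channels * 2, channels, 3, padding=1)
        self.low_expert = nn.Conv2d(channels * 2, channels, 3, padding=1)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, vis_feat, ir_feat):
        g = self.gate(vis_feat)                    # per-image illumination score
        x = torch.cat([vis_feat, ir_feat], dim=1)  # fuse both modalities
        return g * self.high_expert(x) + (1 - g) * self.low_expert(x)

fused = IlluminationGatedMoE()(torch.randn(2, 32, 16, 16), torch.randn(2, 32, 16, 16))
print(fused.shape)  # torch.Size([2, 32, 16, 16])
```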
[234] SAViL-Det: Semantic-Aware Vision-Language Model for Multi-Script Text Detection
Mohammed-En-Nadhir Zighem, Abdenour Hadid
Main category: cs.CV
TL;DR: SAViL-Det improves multi-script text detection by integrating textual prompts with visual features using a CLIP model and AFPN, achieving state-of-the-art results.
Details
Motivation: Existing methods lack semantic context integration for diverse scripts and arbitrarily shaped text in natural scenes.Method: Combines pre-trained CLIP with AFPN, uses a language-vision decoder for semantic propagation, and employs text-to-pixel contrastive learning.
Result: Achieves F-scores of 84.8% on MLT-2019 and 90.2% on CTW1500.
Conclusion: SAViL-Det effectively enhances text detection by leveraging semantic context and cross-modal attention.
Abstract: Detecting text in natural scenes remains challenging, particularly for diverse scripts and arbitrarily shaped instances where visual cues alone are often insufficient. Existing methods do not fully leverage semantic context. This paper introduces SAViL-Det, a novel semantic-aware vision-language model that enhances multi-script text detection by effectively integrating textual prompts with visual features. SAViL-Det utilizes a pre-trained CLIP model combined with an Asymptotic Feature Pyramid Network (AFPN) for multi-scale visual feature fusion. The core of the proposed framework is a novel language-vision decoder that adaptively propagates fine-grained semantic information from text prompts to visual features via cross-modal attention. Furthermore, a text-to-pixel contrastive learning mechanism explicitly aligns textual and corresponding visual pixel features. Extensive experiments on challenging benchmarks demonstrate the effectiveness of the proposed approach, achieving state-of-the-art performance with F-scores of 84.8% on the benchmark multi-lingual MLT-2019 dataset and 90.2% on the curved-text CTW1500 dataset.
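Code sketch: One plausible reading of the text-to-pixel contrastive mechanism: normalize pixel and prompt embeddings, score every pixel against the prompt, and supervise with the text mask so text pixels are pulled toward the prompt and background pixels pushed away. A generic sketch (the loss form and temperature are assumptions, not the paper's exact objective):
```python
import torch
import torch.nn.functional as F

def text_to_pixel_contrastive(pixel_feats, text_emb, text_mask, tau=0.07):
    """pixel_feats: (B, C, H, W), text_emb: (B, C), text_mask: (B, H, W) in {0,1}."""
    pix = F.normalize(pixel_feats.flatten(2), dim=1)   # (B, C, HW)
    txt = F.normalize(text_emb, dim=1).unsqueeze(2)    # (B, C, 1)
    logits = (pix * txt).sum(1) / tau                  # (B, HW) pixel-prompt similarity
    labels = text_mask.flatten(1).float()              # 1 where text pixels lie
    return F.binary_cross_entropy_with_logits(logits, labels)

loss = text_to_pixel_contrastive(torch.randn(2, 64, 8, 8),
                                 torch.randn(2, 64),
                                 torch.randint(0, 2, (2, 8, 8)))
print(loss.item())
```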
[235] Color histogram equalization and fine-tuning to improve expression recognition of (partially occluded) faces on sign language datasets
Fabrizio Nunnari, Alakshendra Jyotsnaditya Ramkrishna Singh, Patrick Gebhard
Main category: cs.CV
TL;DR: The paper evaluates computer vision methods for classifying facial expressions in sign language, focusing on upper/lower face regions and color normalization. Results show 83.8% mean sensitivity overall, with better performance on the lower face (79.6%) than the upper (77.9%); upper-face accuracy surpasses human level.
Details
Motivation: To quantify computer vision's ability to classify facial expressions in sign language and compare emotion manifestation between hearing and deaf subjects by analyzing upper/lower face regions.Method: Introduces color normalization (histogram equalization and fine-tuning) and tests expression recognition on upper/lower face regions.
Result: Achieves 83.8% mean sensitivity with low variance (.042). Lower face recognition (79.6%) outperforms upper (77.9%), with upper face accuracy exceeding human level.
Conclusion: Computer vision methods effectively classify facial expressions in sign language, with notable performance differences between face regions and superior upper face accuracy compared to humans.
Abstract: The goal of this investigation is to quantify to what extent computer vision methods can correctly classify facial expressions on a sign language dataset. We extend our experiments by recognizing expressions using only the upper or lower part of the face, which is needed to further investigate the difference in emotion manifestation between hearing and deaf subjects. To take into account the peculiar color profile of a dataset, our method introduces a color normalization stage based on histogram equalization and fine-tuning. The results show the ability to correctly recognize facial expressions with 83.8% mean sensitivity and very little variance (.042) among classes. As with humans, recognition of expressions from the lower half of the face (79.6%) is higher than that from the upper half (77.9%). Notably, the classification accuracy from the upper half of the face is higher than human level.
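Code sketch: The color normalization stage builds on classic histogram equalization; below is a self-contained NumPy version, applied per channel as one simple assumption about how the color variant could work (the paper's stage also involves fine-tuning, which is omitted here):
```python
import numpy as np

def equalize_channel(channel: np.ndarray) -> np.ndarray:
    """Classic histogram equalization on a single uint8 channel."""
    hist = np.bincount(channel.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # normalize CDF to [0, 1]
    return (cdf[channel] * 255).astype(np.uint8)

def equalize_color(image: np.ndarray) -> np.ndarray:
    """Apply per-channel equalization to an (H, W, 3) uint8 image."""
    return np.stack([equalize_channel(image[..., c]) for c in range(3)], axis=-1)

img = np.random.randint(0, 128, (64, 64, 3), dtype=np.uint8)  # dim, low-contrast
print(equalize_color(img).max())  # intensities stretched toward 255
```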
[236] When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang
Main category: cs.CV
TL;DR: A survey on multimodal long context token compression, categorizing methods by modality (image, video, audio) and mechanism, aiming to reduce computational challenges in MLLMs.
Details
Motivation: Address computational bottlenecks in MLLMs caused by quadratic complexity of self-attention with long multimodal inputs.Method: Systematic categorization of token compression approaches by modality (image, video, audio) and underlying mechanisms (transformation, similarity, attention, query-based).
Result: Provides a structured overview of existing methods, identifies challenges, and suggests future research directions.
Conclusion: The survey consolidates progress in token compression, highlights key challenges, and aims to inspire further advancements in the field.
Abstract: Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio input. While this ability significantly enhances MLLM capabilities, it introduces substantial computational challenges, primarily due to the quadratic complexity of self-attention mechanisms with numerous input tokens. To mitigate these bottlenecks, token compression has emerged as an auspicious and critical approach, efficiently reducing the number of tokens during both training and inference. In this paper, we present the first systematic survey and synthesis of the burgeoning field of multimodal long context token compression. Recognizing that effective compression strategies are deeply tied to the unique characteristics and redundancies of each modality, we categorize existing approaches by their primary data focus, enabling researchers to quickly access and learn methods tailored to their specific area of interest: (1) image-centric compression, which addresses spatial redundancy in visual data; (2) video-centric compression, which tackles spatio-temporal redundancy in dynamic sequences; and (3) audio-centric compression, which handles temporal and spectral redundancy in acoustic signals. Beyond this modality-driven categorization, we further dissect methods based on their underlying mechanisms, including transformation-based, similarity-based, attention-based, and query-based approaches. By providing a comprehensive and structured overview, this survey aims to consolidate current progress, identify key challenges, and inspire future research directions in this rapidly evolving domain. We also maintain a public repository to continuously track and update the latest advances in this promising area.
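Code sketch: As a flavor of the similarity-based family the survey covers, here is a toy token-merging routine that folds the r most redundant tokens into their nearest neighbours. This is a simplified illustration of the general idea, not any specific surveyed method:
```python
import torch

def merge_similar_tokens(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """Drop r tokens by averaging each into its most similar kept neighbour.
    tokens: (N, C) for one sequence; returns (N - r, C)."""
    x = torch.nn.functional.normalize(tokens, dim=-1)
    sim = x @ x.T
    sim.fill_diagonal_(-float("inf"))
    # Score each token by its best similarity to any other token;
    # the r most redundant tokens are merged away.
    best_sim, best_idx = sim.max(dim=-1)
    merge_src = best_sim.topk(r).indices
    keep = torch.ones(len(tokens), dtype=torch.bool)
    keep[merge_src] = False
    out = tokens.clone()
    for s in merge_src.tolist():           # fold each merged token into its partner
        t = best_idx[s].item()
        out[t] = (out[t] + tokens[s]) / 2
    return out[keep]

print(merge_similar_tokens(torch.randn(16, 8), r=4).shape)  # torch.Size([12, 8])
```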
[237] Dual-Stream Global-Local Feature Collaborative Representation Network for Scene Classification of Mining Area
Shuqi Fan, Haoyi Wang, Xianju Li
Main category: cs.CV
TL;DR: A dual-branch fusion model for mining area scene classification, combining global and local features, achieves 83.63% accuracy, outperforming other models.
Details
Motivation: Accurate scene classification in mining areas aids geological monitoring and resource planning, but challenges include complex spatial layouts and multi-scale characteristics.Method: Proposes a dual-branch fusion model with a global transformer branch, local enhancement branch, and deep feature fusion, using multi-loss computation.
Result: Achieves 83.63% accuracy, surpassing other models, and excels in all evaluation metrics.
Conclusion: The model effectively integrates multi-scale and local features, enhancing classification accuracy for complex mining landscapes.
Abstract: Scene classification of mining areas provides accurate foundational data for geological environment monitoring and resource development planning. This study fuses multi-source data to construct a multi-modal mine land cover scene classification dataset. A significant challenge in mining area classification lies in the complex spatial layout and multi-scale characteristics. By extracting global and local features, it becomes possible to comprehensively reflect the spatial distribution, thereby enabling a more accurate capture of the holistic characteristics of mining scenes. We propose a dual-branch fusion model utilizing collaborative representation to decompose global features into a set of key semantic vectors. This model comprises three key components: (1) Multi-scale Global Transformer Branch: It leverages adjacent large-scale features to generate global channel attention features for small-scale features, effectively capturing the multi-scale feature relationships. (2) Local Enhancement Collaborative Representation Branch: It refines the attention weights by leveraging local features and reconstructed key semantic sets, ensuring that the local context and detailed characteristics of the mining area are effectively integrated. This enhances the model's sensitivity to fine-grained spatial variations. (3) Dual-Branch Deep Feature Fusion Module: It fuses the complementary features of the two branches to incorporate more scene information. This fusion strengthens the model's ability to distinguish and classify complex mining landscapes. Finally, this study employs multi-loss computation to ensure a balanced integration of the modules. The overall accuracy of this model is 83.63%, which outperforms other comparative models. Additionally, it achieves the best performance across all other evaluation metrics.
[238] Motion-example-controlled Co-speech Gesture Generation Leveraging Large Language Models
Bohong Chen, Yumeng Li, Youyi Zheng, Yao-Xiang Ding, Kun Zhou
Main category: cs.CV
TL;DR: MECo is a framework for co-speech gesture generation using LLMs, preserving motion-example details while aligning with speech. It outperforms existing methods in FGD, diversity, and similarity metrics.
Details
Motivation: Existing gesture generation systems lose rich motion details due to reliance on predefined labels or pseudo-labels. MECo aims to retain these details by leveraging LLMs.Method: MECo fine-tunes LLMs to interpret speech audio and motion examples, using motion examples as explicit query contexts for gesture synthesis.
Result: State-of-the-art performance in FGD, motion diversity, and example-gesture similarity. Supports granular control and diverse input modalities.
Conclusion: MECo advances gesture generation by preserving motion details and enabling flexible control, validated by superior metrics and diverse applications.
Abstract: The automatic generation of controllable co-speech gestures has recently gained growing attention. While existing systems typically achieve gesture control through predefined categorical labels or implicit pseudo-labels derived from motion examples, these approaches often compromise the rich details present in the original motion examples. We present MECo, a framework for motion-example-controlled co-speech gesture generation by leveraging large language models (LLMs). Our method capitalizes on LLMs' comprehension capabilities through fine-tuning to simultaneously interpret speech audio and motion examples, enabling the synthesis of gestures that preserve example-specific characteristics while maintaining speech congruence. Departing from conventional pseudo-labeling paradigms, we position motion examples as explicit query contexts within the prompt structure to guide gesture generation. Experimental results demonstrate state-of-the-art performance across three metrics: Fréchet Gesture Distance (FGD), motion diversity, and example-gesture similarity. Furthermore, our framework enables granular control of individual body parts and accommodates diverse input modalities including motion clips, static poses, human video sequences, and textual descriptions. Our code, pre-trained models, and videos are available at https://robinwitch.github.io/MECo-Page.
[239] MambaMap: Online Vectorized HD Map Construction using State Space Model
Ruizi Yang, Xiaolu Liu, Junbo Chen, Jianke Zhu
Main category: cs.CV
TL;DR: MambaMap is a novel framework for efficient temporal modeling in HD map construction, addressing occlusions and computational overhead with a memory bank and gating mechanism.
Details
Motivation: HD maps are crucial for autonomous driving, but existing methods struggle with fully utilizing temporal data or handling long sequences efficiently.Method: MambaMap uses a memory bank for historical frame data, a gating mechanism for selective feature integration, and multi-directional scanning for enhanced feature extraction.
Result: Outperforms state-of-the-art methods on nuScenes and Argoverse2 datasets in accuracy and temporal consistency.
Conclusion: MambaMap effectively improves HD map construction by efficiently leveraging temporal information with low computational cost.
Abstract: High-definition (HD) maps are essential for autonomous driving, as they provide precise road information for downstream tasks. Recent advances highlight the potential of temporal modeling in addressing challenges like occlusions and extended perception range. However, existing methods either fail to fully exploit temporal information or incur substantial computational overhead in handling extended sequences. To tackle these challenges, we propose MambaMap, a novel framework that efficiently fuses long-range temporal features in the state space to construct online vectorized HD maps. Specifically, MambaMap incorporates a memory bank to store and utilize information from historical frames, dynamically updating BEV features and instance queries to improve robustness against noise and occlusions. Moreover, we introduce a gating mechanism in the state space, selectively integrating dependencies of map elements in high computational efficiency. In addition, we design innovative multi-directional and spatial-temporal scanning strategies to enhance feature extraction at both BEV and instance levels. These strategies significantly boost the prediction accuracy of our approach while ensuring robust temporal consistency. Extensive experiments on the nuScenes and Argoverse2 datasets demonstrate that our proposed MambaMap approach outperforms state-of-the-art methods across various splits and perception ranges. Source code will be available at https://github.com/ZiziAmy/MambaMap.
[240] Decomposing Densification in Gaussian Splatting for Faster 3D Scene Reconstruction
Binxiao Huang, Zhengwu Liu, Ngai Wong
Main category: cs.CV
TL;DR: The paper introduces a global-to-local densification strategy and an energy-guided training framework to improve the efficiency and performance of 3D Gaussian Splatting (GS) for scene reconstruction.
Details
Motivation: The training process of GS suffers from slow convergence due to inefficient densification and suboptimal spatial distribution of Gaussian primitives.Method: Proposes a global-to-local densification strategy and an energy-guided coarse-to-fine multi-resolution training framework, along with dynamic pruning of unnecessary primitives.
Result: Achieves over 2x training speedup with fewer Gaussian primitives and superior reconstruction performance on multiple datasets.
Conclusion: The proposed methods significantly enhance the efficiency and quality of GS-based scene reconstruction.
Abstract: 3D Gaussian Splatting (GS) has emerged as a powerful representation for high-quality scene reconstruction, offering compelling rendering quality. However, the training process of GS often suffers from slow convergence due to inefficient densification and suboptimal spatial distribution of Gaussian primitives. In this work, we present a comprehensive analysis of the split and clone operations during the densification phase, revealing their distinct roles in balancing detail preservation and computational efficiency. Building upon this analysis, we propose a global-to-local densification strategy, which facilitates more efficient growth of Gaussians across the scene space, promoting both global coverage and local refinement. To cooperate with the proposed densification strategy and promote sufficient diffusion of Gaussian primitives in space, we introduce an energy-guided coarse-to-fine multi-resolution training framework, which gradually increases resolution based on energy density in 2D images. Additionally, we dynamically prune unnecessary Gaussian primitives to speed up the training. Extensive experiments on MipNeRF-360, Deep Blending, and Tanks & Temples datasets demonstrate that our approach significantly accelerates training, achieving over 2x speedup with fewer Gaussian primitives and superior reconstruction performance.
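Code sketch: The energy-guided schedule can be pictured as: measure how much high-frequency content an image has, and train at a coarse resolution until that energy justifies stepping up. A sketch using mean gradient magnitude as a stand-in energy measure (the paper's actual energy definition and thresholds may differ):
```python
import numpy as np

def energy_density(image: np.ndarray) -> float:
    """Mean gradient magnitude of a grayscale image in [0, 1] -- one plausible
    proxy for the 'energy' that drives the coarse-to-fine schedule."""
    gy, gx = np.gradient(image.astype(np.float64))
    return float(np.hypot(gx, gy).mean())

def pick_resolution(image: np.ndarray, levels=(1/8, 1/4, 1/2, 1.0),
                    thresholds=(0.02, 0.05, 0.08)) -> float:
    """Start coarse; step up the training resolution as measured energy grows."""
    e = energy_density(image)
    for scale, t in zip(levels, thresholds):
        if e < t:
            return scale
    return levels[-1]

flat = np.full((64, 64), 0.5)
noisy = np.random.default_rng(0).random((64, 64))
print(pick_resolution(flat), pick_resolution(noisy))  # coarse vs. fine
```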
[241] AnimalClue: Recognizing Animals by their Traces
Risa Shinoda, Nakamasa Inoue, Iro Laina, Christian Rupprecht, Hirokatsu Kataoka
Main category: cs.CV
TL;DR: AnimalClue is a large-scale dataset for species identification from indirect evidence like footprints and feces, addressing gaps in wildlife monitoring.
Details
Motivation: Current wildlife monitoring lacks robust methods for identifying species from indirect evidence, a critical aspect of biodiversity conservation.Method: The paper introduces AnimalClue, a dataset with 159,605 bounding boxes of indirect evidence (footprints, feces, etc.) across 968 species, annotated with species-level labels and fine-grained traits.
Result: Experiments evaluate vision models on AnimalClue, highlighting challenges in recognizing subtle visual features for species identification.
Conclusion: AnimalClue bridges a gap in wildlife monitoring by providing a dataset for indirect evidence, enabling advancements in automated species identification.
Abstract: Wildlife observation plays an important role in biodiversity conservation, necessitating robust methodologies for monitoring wildlife populations and interspecies interactions. Recent advances in computer vision have significantly contributed to automating fundamental wildlife observation tasks, such as animal detection and species identification. However, accurately identifying species from indirect evidence like footprints and feces remains relatively underexplored, despite its importance in contributing to wildlife monitoring. To bridge this gap, we introduce AnimalClue, the first large-scale dataset for species identification from images of indirect evidence. Our dataset consists of 159,605 bounding boxes encompassing five categories of indirect clues: footprints, feces, eggs, bones, and feathers. It covers 968 species, 200 families, and 65 orders. Each image is annotated with species-level labels, bounding boxes or segmentation masks, and fine-grained trait information, including activity patterns and habitat preferences. Unlike existing datasets primarily focused on direct visual features (e.g., animal appearances), AnimalClue presents unique challenges for classification, detection, and instance segmentation tasks due to the need for recognizing more detailed and subtle visual features. In our experiments, we extensively evaluate representative vision models and identify key challenges in animal identification from their traces. Our dataset and code are available at https://dahlian00.github.io/AnimalCluePage/
[242] MIRepNet: A Pipeline and Foundation Model for EEG-Based Motor Imagery Classification
Dingkun Liu, Zhu Chen, Jingwei Luo, Shijie Lian, Dongrui Wu
Main category: cs.CV
TL;DR: MIRepNet is a specialized EEG foundation model for motor imagery (MI) paradigms, combining neurophysiologically-informed preprocessing and hybrid pretraining for superior performance.
Details
Motivation: Existing EEG foundation models lack paradigm-specific adaptations, limiting generalization. MIRepNet addresses this for MI, a common BCI paradigm.Method: MIRepNet uses a neurophysiologically-informed preprocessing pipeline and hybrid pretraining (self-supervised and supervised) for rapid adaptation.
Result: Outperforms state-of-the-art models on five public MI datasets, even with fewer than 30 trials per class.
Conclusion: MIRepNet is a highly effective, paradigm-specific EEG foundation model for MI, with potential for practical BCI applications.
Abstract: Brain-computer interfaces (BCIs) enable direct communication between the brain and external devices. Recent EEG foundation models aim to learn generalized representations across diverse BCI paradigms. However, these approaches overlook fundamental paradigm-specific neurophysiological distinctions, limiting their generalization ability. Importantly, in practical BCI deployments, the specific paradigm such as motor imagery (MI) for stroke rehabilitation or assistive robotics, is generally determined prior to data acquisition. This paper proposes MIRepNet, the first EEG foundation model tailored for the MI paradigm. MIRepNet comprises a high-quality EEG preprocessing pipeline incorporating a neurophysiologically-informed channel template, adaptable to EEG headsets with arbitrary electrode configurations. Furthermore, we introduce a hybrid pretraining strategy that combines self-supervised masked token reconstruction and supervised MI classification, facilitating rapid adaptation and accurate decoding on novel downstream MI tasks with fewer than 30 trials per class. Extensive evaluations across five public MI datasets demonstrated that MIRepNet consistently achieved state-of-the-art performance, significantly outperforming both specialized and generalized EEG models. Our code will be available on GitHub: https://github.com/staraink/MIRepNet.
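Code sketch: The hybrid pretraining objective plausibly combines a masked-token reconstruction term with a supervised MI classification term. A sketch of such a weighted sum (the mixing weight alpha and the MSE reconstruction choice are assumptions, not details from the paper):
```python
import torch
import torch.nn.functional as F

def hybrid_pretraining_loss(recon, target, masked_idx, logits, labels, alpha=0.5):
    """recon/target: (B, T, C) token features, masked_idx: (B, T) bool mask,
    logits: (B, n_classes), labels: (B,)."""
    rec = F.mse_loss(recon[masked_idx], target[masked_idx])  # only masked tokens
    cls = F.cross_entropy(logits, labels)                    # supervised MI term
    return alpha * rec + (1 - alpha) * cls

B, T, C = 4, 16, 32
mask = torch.rand(B, T) < 0.4
loss = hybrid_pretraining_loss(torch.randn(B, T, C), torch.randn(B, T, C),
                               mask, torch.randn(B, 4), torch.randint(0, 4, (B,)))
print(loss.item())
```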
[243] L-MCAT: Unpaired Multimodal Transformer with Contrastive Attention for Label-Efficient Satellite Image Classification
Mitul Goswami, Mrinal Goswami
Main category: cs.CV
TL;DR: L-MCAT is a transformer-based framework for efficient remote sensing image classification using unpaired multimodal data, achieving high accuracy with minimal labels and computational resources.
Details
Motivation: To address the challenge of label-efficient classification in remote sensing, especially with unpaired multimodal satellite data, where traditional methods require extensive labeled data and computational power.Method: Introduces Modality-Spectral Adapters (MSA) to compress sensor inputs and Unpaired Multimodal Attention Alignment (U-MAA) for aligning heterogeneous modalities without labels.
Result: Achieves 95.4% accuracy on SEN12MS with only 20 labels per class, outperforming baselines with 47x fewer parameters and 23x fewer FLOPs. Robust to 50% spatial misalignment.
Conclusion: L-MCAT is a highly efficient and robust solution for remote sensing classification, suitable for real-world deployment with minimal resource requirements.
Abstract: We propose the Lightweight Multimodal Contrastive Attention Transformer (L-MCAT), a novel transformer-based framework for label-efficient remote sensing image classification using unpaired multimodal satellite data. L-MCAT introduces two core innovations: (1) Modality-Spectral Adapters (MSA) that compress high-dimensional sensor inputs into a unified embedding space, and (2) Unpaired Multimodal Attention Alignment (U-MAA), a contrastive self-supervised mechanism integrated into the attention layers to align heterogeneous modalities without pixel-level correspondence or labels. L-MCAT achieves 95.4% overall accuracy on the SEN12MS dataset using only 20 labels per class, outperforming state-of-the-art baselines while using 47x fewer parameters and 23x fewer FLOPs than MCTrans. It maintains over 92% accuracy even under 50% spatial misalignment, demonstrating robustness for real-world deployment. The model trains end-to-end in under 5 hours on a single consumer GPU.
[244] Controllable Feature Whitening for Hyperparameter-Free Bias Mitigation
Yooshin Cho, Hanbyel Cho, Janghyeon Lee, HyeongGwon Hong, Jaesung Ahn, Junmo Kim
Main category: cs.CV
TL;DR: Proposes a framework called controllable feature whitening to mitigate bias in deep neural networks by eliminating linear correlations between features, improving reliability without complex methods.
Details
Motivation: Addressing the susceptibility of deep neural networks to spurious correlations in datasets, aiming to enhance trustworthiness in AI.Method: Quantifies linear correlations via covariance matrix and removes them using a whitening module, avoiding higher-order dependencies. No regularization or adversarial learning is needed.
Result: Effectively mitigates bias, handles fairness criteria (demographic parity and equalized odds), and outperforms existing methods on benchmark datasets.
Conclusion: The method improves reliability and fairness in AI by controlling feature correlations, offering a simpler and more stable alternative to existing approaches.
Abstract: As the use of artificial intelligence rapidly increases, the development of trustworthy artificial intelligence has become important. However, recent studies have shown that deep neural networks are susceptible to learning spurious correlations present in datasets. To improve the reliability, we propose a simple yet effective framework called controllable feature whitening. We quantify the linear correlation between the target and bias features by the covariance matrix, and eliminate it through the whitening module. Our results systematically demonstrate that removing the linear correlations between features fed into the last linear classifier significantly mitigates the bias, while avoiding the need to model intractable higher-order dependencies. A particular advantage of the proposed method is that it does not require regularization terms or adversarial learning, which often leads to unstable optimization in practice. Furthermore, we show that two fairness criteria, demographic parity and equalized odds, can be effectively handled by whitening with the re-weighted covariance matrix. Consequently, our method controls the trade-off between the utility and fairness of algorithms by adjusting the weighting coefficient. Finally, we validate that our method outperforms existing approaches on four benchmark datasets: Corrupted CIFAR-10, Biased FFHQ, WaterBirds, and Celeb-A.
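Code sketch: The core operation is standard: estimate the feature covariance on a batch and multiply by its inverse square root, which drives all linear cross-correlations, including those between target and bias features, to zero. A ZCA-style sketch (the re-weighted covariance used for the fairness criteria is omitted):
```python
import torch

def whiten(features: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """ZCA-whiten a (N, D) feature batch so its covariance becomes identity,
    removing linear correlations between feature dimensions (and hence
    between target and bias features stacked along D)."""
    x = features - features.mean(0, keepdim=True)
    cov = x.T @ x / (x.shape[0] - 1)
    eigval, eigvec = torch.linalg.eigh(cov)
    w = eigvec @ torch.diag((eigval + eps).rsqrt()) @ eigvec.T  # cov^{-1/2}
    return x @ w

target_feat = torch.randn(512, 8)
bias_feat = 0.9 * target_feat[:, :4] + 0.1 * torch.randn(512, 4)  # correlated bias
z = whiten(torch.cat([target_feat, bias_feat], dim=1))
print(torch.cov(z.T)[:8, 8:].abs().max())  # cross-covariance ~ 0
```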
[245] Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training
Qiaosi Yi, Shuai Li, Rongyuan Wu, Lingchen Sun, Yuhui Wu, Lei Zhang
Main category: cs.CV
TL;DR: The paper proposes a Transfer VAE Training (TVT) strategy to improve fine-structure preservation in real-world image super-resolution (Real-ISR) using stable diffusion models, while reducing computational costs.
Details
Motivation: Existing methods using pre-trained stable diffusion models struggle with fine-structure reconstruction due to aggressive downsampling in the VAE. Adapting VAEs with lower downsampling rates while maintaining compatibility with pre-trained UNets is challenging.Method: The TVT strategy transfers an 8× downsampled VAE to a 4× one by training a new decoder and encoder sequentially. It also introduces a compact VAE and compute-efficient UNet to reduce costs.
Result: TVT significantly improves fine-structure preservation (e.g., small characters and textures) and requires fewer FLOPs than state-of-the-art methods.
Conclusion: The proposed TVT method effectively addresses fine-structure reconstruction issues in Real-ISR while optimizing computational efficiency.
Abstract: Impressive results on real-world image super-resolution (Real-ISR) have been achieved by employing pre-trained stable diffusion (SD) models. However, one critical issue of such methods lies in their poor reconstruction of image fine structures, such as small characters and textures, due to the aggressive resolution reduction of the VAE (e.g., 8$\times$ downsampling) in the SD model. One solution is to employ a VAE with a lower downsampling rate for diffusion; however, adapting its latent features with the pre-trained UNet while mitigating the increased computational cost poses new challenges. To address these issues, we propose a Transfer VAE Training (TVT) strategy to transfer the 8$\times$ downsampled VAE into a 4$\times$ one while adapting to the pre-trained UNet. Specifically, we first train a 4$\times$ decoder based on the output features of the original VAE encoder, then train a 4$\times$ encoder while keeping the newly trained decoder fixed. Such a TVT strategy aligns the new encoder-decoder pair with the original VAE latent space while enhancing image fine details. Additionally, we introduce a compact VAE and compute-efficient UNet by optimizing their network architectures, reducing the computational cost while capturing high-resolution fine-scale features. Experimental results demonstrate that our TVT method significantly improves fine-structure preservation, which is often compromised by other SD-based methods, while requiring fewer FLOPs than state-of-the-art one-step diffusion models. The official code can be found at https://github.com/Joyies/TVT.
[246] $A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement
Zhecheng Li, Guoxian Song, Yiwei Wang, Zhen Xiong, Junsong Yuan, Yujun Cai
Main category: cs.CV
TL;DR: The paper proposes $A^2R^2$, a framework for improving Img2LaTeX conversion using attention-guided refinement and visual reasoning, outperforming baselines and validated by a new dataset and experiments.
Details
Motivation: VLMs struggle with fine-grained visual elements in Img2LaTeX tasks, leading to inaccurate predictions. The paper aims to enhance performance via attention and iterative refinement.Method: $A^2R^2$ integrates attention localization and iterative refinement within a visual reasoning framework, enabling self-correction and improved predictions.
Result: $A^2R^2$ outperforms baselines across six metrics, benefits from more inference rounds, and shows strong synergy in ablation studies.
Conclusion: The proposed framework effectively improves Img2LaTeX conversion, validated by experiments and a new dataset, demonstrating its practical utility.
Abstract: Img2LaTeX is a practically significant task that involves converting mathematical expressions or tabular data from images into LaTeX code. In recent years, vision-language models (VLMs) have demonstrated strong performance across a variety of visual understanding tasks, owing to their generalization capabilities. While some studies have explored the use of VLMs for the Img2LaTeX task, their performance often falls short of expectations. Empirically, VLMs sometimes struggle with fine-grained visual elements, leading to inaccurate LaTeX predictions. To address this challenge, we propose $A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement, a framework that effectively integrates attention localization and iterative refinement within a visual reasoning framework, enabling VLMs to perform self-correction and progressively improve prediction quality. For effective evaluation, we introduce a new dataset, Img2LaTex-Hard-1K, consisting of 1,100 carefully curated and challenging examples designed to rigorously evaluate the capabilities of VLMs within this task domain. Extensive experimental results demonstrate that: (1) $A^2R^2$ significantly improves model performance across six evaluation metrics spanning both textual and visual levels, consistently outperforming other baseline methods; (2) Increasing the number of inference rounds yields notable performance gains, underscoring the potential of $A^2R^2$ in test-time scaling scenarios; (3) Ablation studies and human evaluations validate the practical effectiveness of our approach, as well as the strong synergy among its core components during inference.
[247] SWIFT: A General Sensitive Weight Identification Framework for Fast Sensor-Transfer Pansharpening
Zeyu Xia, Chenxi Sun, Tianyu Xin, Yubo Zeng, Haoyu Chen, Liang-Jian Deng
Main category: cs.CV
TL;DR: SWIFT is a fast, general-purpose framework for cross-sensor adaptation in pansharpening, using unsupervised sampling and gradient analysis to update only sensitive weights, reducing adaptation time significantly.
Details
Motivation: Deep learning-based pansharpening methods degrade on unseen sensor data, and full retraining or complex architectures are impractical. SWIFT addresses this by enabling efficient adaptation.Method: SWIFT uses unsupervised sampling to select 3% of target domain samples, analyzes gradient behavior to identify sensitive weights, and updates only those weights for adaptation.
Result: SWIFT reduces adaptation time to ~1 minute on a GPU, outperforms direct-transfer baselines, and matches or exceeds full retraining performance on WorldView-2 and QuickBird datasets.
Conclusion: SWIFT offers a practical, efficient solution for cross-sensor pansharpening adaptation, setting a new state-of-the-art with minimal computational cost.
Abstract: Pansharpening aims to fuse high-resolution panchromatic (PAN) images with low-resolution multispectral (LRMS) images to generate high-resolution multispectral (HRMS) images. Although deep learning-based methods have achieved promising performance, they generally suffer from severe performance degradation when applied to data from unseen sensors. Adapting these models through full-scale retraining or designing more complex architectures is often prohibitively expensive and impractical for real-world deployment. To address this critical challenge, we propose a fast and general-purpose framework for cross-sensor adaptation, SWIFT (Sensitive Weight Identification for Fast Transfer). Specifically, SWIFT employs an unsupervised sampling strategy based on data manifold structures to balance sample selection while mitigating the bias of traditional Farthest Point Sampling, efficiently selecting only 3% of the most informative samples from the target domain. This subset is then used to probe a source-domain pre-trained model by analyzing the gradient behavior of its parameters, allowing for the quick identification and subsequent update of only the weight subset most sensitive to the domain shift. As a plug-and-play framework, SWIFT can be applied to various existing pansharpening models. Extensive experiments demonstrate that SWIFT reduces the adaptation time from hours to approximately one minute on a single NVIDIA RTX 4090 GPU. The adapted models not only substantially outperform direct-transfer baselines but also achieve performance competitive with, and in some cases superior to, full retraining, establishing a new state-of-the-art on cross-sensor pansharpening tasks for the WorldView-2 and QuickBird datasets.
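Code sketch: The gradient-probing step can be approximated in a few lines: accumulate per-parameter gradient magnitudes on the small probe set, then unfreeze only the top-scoring fraction. A rough sketch at the granularity of whole parameter tensors (SWIFT's actual sampling and selection are likely finer-grained):
```python
import torch
import torch.nn as nn

def select_sensitive_params(model: nn.Module, probe_loader, loss_fn, top_frac=0.1):
    """Score each parameter tensor by its mean gradient magnitude on a few
    target-domain probe batches, then freeze everything but the most
    sensitive fraction."""
    scores = {n: 0.0 for n, _ in model.named_parameters()}
    for x, y in probe_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += p.grad.abs().mean().item()
    k = max(1, int(top_frac * len(scores)))
    sensitive = set(sorted(scores, key=scores.get, reverse=True)[:k])
    for n, p in model.named_parameters():
        p.requires_grad = n in sensitive   # adapt only the sensitive subset
    return sensitive

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
probe = [(torch.randn(4, 8), torch.randn(4, 1))]
print(select_sensitive_params(model, probe, nn.functional.mse_loss, top_frac=0.25))
```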
[248] From Gallery to Wrist: Realistic 3D Bracelet Insertion in Videos
Chenjian Gao, Lihe Ding, Rui Han, Zhanpeng Huang, Zibin Wang, Tianfan Xue
Main category: cs.CV
TL;DR: A hybrid pipeline combining 3D Gaussian Splatting (3DGS) and 2D diffusion models for realistic and temporally consistent 3D object insertion in videos, focusing on bracelets in dynamic wrist scenes.
Details
Motivation: Addressing the challenge of maintaining temporal consistency and realistic lighting when inserting 3D objects into videos, especially in dynamic scenarios.Method: Uses 3DGS for initial rendering (ensuring temporal coherence) and refines results with a 2D diffusion model (for photorealistic lighting). Introduces a shading-driven pipeline to separate and refine intrinsic object properties.
Result: Achieves realistic lighting and temporal coherence in video object insertion, demonstrated with bracelet insertion in dynamic wrist scenes.
Conclusion: The hybrid approach synergizes 3D rendering and 2D diffusion, providing a robust solution for realistic and consistent video editing.
Abstract: Inserting 3D objects into videos is a longstanding challenge in computer graphics with applications in augmented reality, virtual try-on, and video composition. Achieving both temporal consistency and realistic lighting remains difficult, particularly in dynamic scenarios with complex object motion, perspective changes, and varying illumination. While 2D diffusion models have shown promise for producing photorealistic edits, they often struggle with maintaining temporal coherence across frames. Conversely, traditional 3D rendering methods excel in spatial and temporal consistency but fall short in achieving photorealistic lighting. In this work, we propose a hybrid object insertion pipeline that combines the strengths of both paradigms. Specifically, we focus on inserting bracelets into dynamic wrist scenes, leveraging the high temporal consistency of 3D Gaussian Splatting (3DGS) for initial rendering and refining the results using a 2D diffusion-based enhancement model to ensure realistic lighting interactions. Our method introduces a shading-driven pipeline that separates intrinsic object properties (albedo, shading, reflectance) and refines both shading and sRGB images for photorealism. To maintain temporal coherence, we optimize the 3DGS model with multi-frame weighted adjustments. This is the first approach to synergize 3D rendering and 2D diffusion for video object insertion, offering a robust solution for realistic and consistent video editing. Project Page: https://cjeen.github.io/BraceletPaper/
[249] PIVOTS: Aligning unseen Structures using Preoperative to Intraoperative Volume-To-Surface Registration for Liver Navigation
Peng Liu, Bianca Güttner, Yutong Su, Chenyang Li, Jinjing Xu, Mingyang Liu, Zhe Min, Andrey Zhylka, Jasper Smit, Karin Olthof, Matteo Fusaglia, Rudi Apolle, Matthias Miederer, Laura Frohneberger, Carina Riediger, Jürgen Weitz, Fiona Kolbinger, Stefanie Speidel, Micha Pfeiffer
Main category: cs.CV
TL;DR: PIVOTS is a neural network for non-rigid registration in liver surgery, using point clouds to predict deformation, outperforming baselines in accuracy and robustness.
Details
Motivation: Accurate intraoperative liver deformation prediction is challenging due to factors like large deformation, noise, and limited visibility. PIVOTS aims to address these issues.Method: PIVOTS uses a neural network with a geometric feature extraction encoder and a decoder with deformation-aware cross-attention modules, trained on synthetic biomechanical data.
Result: PIVOTS shows superior registration performance, handling noise, large deformation, and visibility constraints better than baseline methods.
Conclusion: PIVOTS advances liver registration, providing a benchmark for future comparisons, with code and datasets publicly available.
Abstract: Non-rigid registration is essential for Augmented Reality guided laparoscopic liver surgery by fusing preoperative information, such as tumor location and vascular structures, into the limited intraoperative view, thereby enhancing surgical navigation. A prerequisite is the accurate prediction of intraoperative liver deformation which remains highly challenging due to factors such as large deformation caused by pneumoperitoneum, respiration and tool interaction as well as noisy intraoperative data, and limited field of view due to occlusion and constrained camera movement. To address these challenges, we introduce PIVOTS, a Preoperative to Intraoperative VOlume-To-Surface registration neural network that directly takes point clouds as input for deformation prediction. The geometric feature extraction encoder allows multi-resolution feature extraction, and the decoder, comprising novel deformation aware cross attention modules, enables pre- and intraoperative information interaction and accurate multi-level displacement prediction. We train the neural network on synthetic data simulated from a biomechanical simulation pipeline and validate its performance on both synthetic and real datasets. Results demonstrate superior registration performance of our method compared to baseline methods, exhibiting strong robustness against high amounts of noise, large deformation, and various levels of intraoperative visibility. We publish the training and test sets as evaluation benchmarks and call for a fair comparison of liver registration methods with volume-to-surface data. Code and datasets are available here https://github.com/pengliu-nct/PIVOTS.
[250] Detecting Visual Information Manipulation Attacks in Augmented Reality: A Multimodal Semantic Reasoning Approach
Yanming Xiu, Maria Gorlatova
Main category: cs.CV
TL;DR: The paper addresses visual information manipulation (VIM) attacks in AR, categorizing them into formats and purposes, and introduces a dataset (AR-VIM) and a detection framework (VIM-Sense) achieving high accuracy and low latency.
Details
Motivation: AR's virtual content can mislead users through subtle manipulations, necessitating detection methods to prevent semantic misunderstandings or errors.Method: A taxonomy for VIM attacks is proposed, and a dataset (AR-VIM) is created. The detection framework, VIM-Sense, combines vision-language models and OCR-based analysis.
Result: VIM-Sense achieves 88.94% accuracy on AR-VIM and detects attacks in ~7 seconds in both simulated and real-world evaluations.
Conclusion: The work successfully identifies and mitigates VIM attacks in AR, demonstrating the effectiveness of multimodal semantic reasoning.
Abstract: The virtual content in augmented reality (AR) can introduce misleading or harmful information, leading to semantic misunderstandings or user errors. In this work, we focus on visual information manipulation (VIM) attacks in AR where virtual content changes the meaning of real-world scenes in subtle but impactful ways. We introduce a taxonomy that categorizes these attacks into three formats: character, phrase, and pattern manipulation, and three purposes: information replacement, information obfuscation, and extra wrong information. Based on the taxonomy, we construct a dataset, AR-VIM. It consists of 452 raw-AR video pairs spanning 202 different scenes, each simulating a real-world AR scenario. To detect such attacks, we propose a multimodal semantic reasoning framework, VIM-Sense. It combines the language and visual understanding capabilities of vision-language models (VLMs) with optical character recognition (OCR)-based textual analysis. VIM-Sense achieves an attack detection accuracy of 88.94% on AR-VIM, consistently outperforming vision-only and text-only baselines. The system reaches an average attack detection latency of 7.07 seconds in a simulated video processing framework and 7.17 seconds in a real-world evaluation conducted on a mobile Android AR application.
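Code sketch: The OCR branch of such a detector reduces to comparing the text recognized in the raw frame against the AR frame and flagging divergence. A crude stand-in using difflib (the threshold and word-level diff are illustrative assumptions, not VIM-Sense's actual analysis):
```python
import difflib

def detect_text_manipulation(raw_text: str, ar_text: str, threshold: float = 0.95):
    """Flag an AR frame whose OCR'd text diverges from the raw frame's,
    and report which words changed."""
    ratio = difflib.SequenceMatcher(None, raw_text.lower(), ar_text.lower()).ratio()
    changed = [d for d in difflib.ndiff(raw_text.split(), ar_text.split())
               if d.startswith(("+", "-"))]
    return ratio < threshold, changed

suspicious, diff = detect_text_manipulation("SPEED LIMIT 30", "SPEED LIMIT 80")
print(suspicious, diff)  # True, ['- 30', '+ 80']
```
In the full system, a vision-language model would then reason about whether such a character-level change actually alters the scene's meaning.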
[251] Generative Pre-training for Subjective Tasks: A Diffusion Transformer-Based Framework for Facial Beauty Prediction
Djamel Eddine Boukhari, Ali Chemsa
Main category: cs.CV
TL;DR: A novel two-stage framework (Diff-FBP) uses generative pre-training (Diffusion Transformer) for facial beauty prediction, outperforming prior methods with a PCC of 0.932.
Details
Motivation: Existing methods struggle with subjective aesthetic assessment due to reliance on generic pre-training.Method: Two-stage approach: 1) Self-supervised pre-training on FFHQ using a Diffusion Transformer, 2) Fine-tuning a lightweight head on FBP5500.
Result: Achieves state-of-the-art PCC of 0.932 on FBP5500.
Conclusion: Generative pre-training enhances feature alignment for subjective tasks like FBP.
Abstract: Facial Beauty Prediction (FBP) is a challenging computer vision task due to its subjective nature and the subtle, holistic features that influence human perception. Prevailing methods, often based on deep convolutional networks or standard Vision Transformers pre-trained on generic object classification (e.g., ImageNet), struggle to learn feature representations that are truly aligned with high-level aesthetic assessment. In this paper, we propose a novel two-stage framework that leverages the power of generative models to create a superior, domain-specific feature extractor. In the first stage, we pre-train a Diffusion Transformer on a large-scale, unlabeled facial dataset (FFHQ) through a self-supervised denoising task. This process forces the model to learn the fundamental data distribution of human faces, capturing nuanced details and structural priors essential for aesthetic evaluation. In the second stage, the pre-trained and frozen encoder of our Diffusion Transformer is used as a backbone feature extractor, with only a lightweight regression head being fine-tuned on the target FBP dataset (FBP5500). Our method, termed Diff-FBP, sets a new state-of-the-art on the FBP5500 benchmark, achieving a Pearson Correlation Coefficient (PCC) of 0.932, significantly outperforming prior art based on general-purpose pre-training. Extensive ablation studies validate that our generative pre-training strategy is the key contributor to this performance leap, creating feature representations that are more semantically potent for subjective visual tasks.
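Code sketch: Stage two is a conventional linear-probe-style setup: freeze the pre-trained encoder and train only a small regression head. A sketch with a stand-in encoder module (the real backbone is the pre-trained Diffusion Transformer encoder; the head architecture here is assumed):
```python
import torch
import torch.nn as nn

class FrozenBackboneRegressor(nn.Module):
    """A frozen pre-trained encoder with a lightweight trainable head
    regressing a scalar beauty score."""
    def __init__(self, encoder: nn.Module, feat_dim: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False       # only the head is fine-tuned
        self.head = nn.Sequential(nn.Linear(feat_dim, 64), nn.GELU(),
                                  nn.Linear(64, 1))

    def forward(self, x):
        with torch.no_grad():
            feats = self.encoder(x)
        return self.head(feats).squeeze(-1)

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))  # stand-in
model = FrozenBackboneRegressor(encoder, feat_dim=128)
print(model(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2])
```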
[252] ModalFormer: Multimodal Transformer for Low-Light Image Enhancement
Alexandru Brateanu, Raul Balmez, Ciprian Orhei, Codruta Ancuti, Cosmin Ancuti
Main category: cs.CV
TL;DR: ModalFormer is a multimodal framework for low-light image enhancement (LLIE) that leverages nine auxiliary modalities and a novel Cross-modal Transformer (CM-T) with CM-MSA for superior performance.
Details
Motivation: Existing LLIE methods focus on RGB pixel-level transformations, missing contextual information from other modalities. ModalFormer addresses this gap by integrating multiple visual modalities.Method: ModalFormer uses a CM-T with CM-MSA to fuse RGB data with auxiliary modalities (e.g., deep features, segmentation, geometry, color). It includes subnetworks for multimodal feature reconstruction.
Result: ModalFormer achieves state-of-the-art performance on multiple LLIE benchmarks.
Conclusion: ModalFormer demonstrates the effectiveness of multimodal integration in LLIE, setting a new benchmark for the task.
Abstract: Low-light image enhancement (LLIE) is a fundamental yet challenging task due to the presence of noise, loss of detail, and poor contrast in images captured under insufficient lighting conditions. Recent methods often rely solely on pixel-level transformations of RGB images, neglecting the rich contextual information available from multiple visual modalities. In this paper, we present ModalFormer, the first large-scale multimodal framework for LLIE that fully exploits nine auxiliary modalities to achieve state-of-the-art performance. Our model comprises two main components: a Cross-modal Transformer (CM-T) designed to restore corrupted images while seamlessly integrating multimodal information, and multiple auxiliary subnetworks dedicated to multimodal feature reconstruction. Central to the CM-T is our novel Cross-modal Multi-headed Self-Attention mechanism (CM-MSA), which effectively fuses RGB data with modality-specific features–including deep feature embeddings, segmentation information, geometric cues, and color information–to generate information-rich hybrid attention maps. Extensive experiments on multiple benchmark datasets demonstrate ModalFormer’s state-of-the-art performance in LLIE. Pre-trained models and results are made available at https://github.com/albrateanu/ModalFormer.
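A hedged sketch of the cross-modal attention idea: RGB tokens act as queries while tokens from the auxiliary modalities supply keys and values. This uses the stock `nn.MultiheadAttention` as a stand-in for the paper's CM-MSA, whose exact formulation differs.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, modality_tokens):
        # queries come from RGB; keys/values from auxiliary modality features
        fused, _ = self.attn(rgb_tokens, modality_tokens, modality_tokens)
        return self.norm(rgb_tokens + fused)   # residual + norm

rgb = torch.randn(2, 196, 64)                      # RGB feature tokens
aux = torch.cat([torch.randn(2, 196, 64)] * 3, 1)  # e.g. depth/seg/color tokens
out = CrossModalFusion(64)(rgb, aux)               # (2, 196, 64)
```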
[253] Solving Scene Understanding for Autonomous Navigation in Unstructured Environments
Naveen Mathews Renji, Kruthika K, Manasa Keshavamurthy, Pooja Kumari, S. Rajarajeswari
Main category: cs.CV
TL;DR: The paper explores semantic segmentation for autonomous vehicles using the Indian Driving Dataset, comparing five deep learning models (UNET, UNET+RESNET50, DeepLabsV3, PSPNet, SegNet) with a top MIOU of 0.6496.
Details
Motivation: Understanding unstructured driving environments in India is critical for autonomous vehicle development, requiring robust scene comprehension via semantic segmentation.Method: Five deep learning models were trained and evaluated on the Indian Driving Dataset using Mean Intersection over Union (MIOU) for performance comparison.
Result: The highest MIOU of 0.6496 was achieved, demonstrating the effectiveness of semantic segmentation in challenging unstructured environments.
Conclusion: The study highlights the potential of semantic segmentation for autonomous driving in unstructured settings, with UNET+RESNET50 showing promising results.
Abstract: Autonomous vehicles are the next revolution in the automobile industry and they are expected to revolutionize the future of transportation. Understanding the scenario in which the autonomous vehicle will operate is critical for its competent functioning. Deep Learning has played a massive role in the progress that has been made till date. Semantic Segmentation, the process of annotating every pixel of an image with an object class, is one crucial part of this scene comprehension using Deep Learning. It is especially useful in Autonomous Driving Research as it requires comprehension of drivable and non-drivable areas, roadside objects and the like. In this paper semantic segmentation has been performed on the Indian Driving Dataset which has been recently compiled on the urban and rural roads of Bengaluru and Hyderabad. This dataset is more challenging compared to other datasets like Cityscapes, since it is based on unstructured driving environments. It has a four level hierarchy and in this paper segmentation has been performed on the first level. Five different models have been trained and their performance has been compared using the Mean Intersection over Union. These are UNET, UNET+RESNET50, DeepLabsV3, PSPNet and SegNet. The highest MIOU of 0.6496 has been achieved. The paper discusses the dataset, exploratory data analysis, preparation, implementation of the five models and studies the performance and compares the results achieved in the process.
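For reference, the MIOU metric used for the model comparison can be computed from a per-class confusion matrix; a compact NumPy version:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean Intersection over Union computed via a confusion matrix."""
    cm = np.bincount(num_classes * gt.ravel() + pred.ravel(),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(cm).astype(float)
    union = cm.sum(0) + cm.sum(1) - inter
    ious = inter[union > 0] / union[union > 0]   # skip classes absent from both
    return float(ious.mean())

pred = np.random.randint(0, 4, (2, 64, 64))   # toy predictions, 4 classes
gt = np.random.randint(0, 4, (2, 64, 64))     # toy ground truth
print(mean_iou(pred, gt, num_classes=4))
```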
[254] VESPA: Towards un(Human)supervised Open-World Pointcloud Labeling for Autonomous Driving
Levente Tempfli, Esteban Rivera, Markus Lienkamp
Main category: cs.CV
TL;DR: VESPA is a multimodal autolabeling pipeline combining LiDAR and camera data for scalable 3D scene understanding, achieving high performance without ground-truth annotations.
Details
Motivation: Manual 3D annotation is costly and labor-intensive; LiDAR-based autolabeling lacks semantic granularity and struggles with data limitations.Method: VESPA fuses LiDAR’s geometric precision with camera images’ semantic richness using vision-language models for open-vocabulary labeling.
Result: Achieves 52.95% AP for object discovery and 46.54% AP for multiclass detection on Nuscenes.
Conclusion: VESPA offers a scalable, high-quality solution for 3D autolabeling without needing manual annotations or HD maps.
Abstract: Data collection for autonomous driving is rapidly accelerating, but manual annotation, especially for 3D labels, remains a major bottleneck due to its high cost and labor intensity. Autolabeling has emerged as a scalable alternative, allowing the generation of labels for point clouds with minimal human intervention. While LiDAR-based autolabeling methods leverage geometric information, they struggle with inherent limitations of lidar data, such as sparsity, occlusions, and incomplete object observations. Furthermore, these methods typically operate in a class-agnostic manner, offering limited semantic granularity. To address these challenges, we introduce VESPA, a multimodal autolabeling pipeline that fuses the geometric precision of LiDAR with the semantic richness of camera images. Our approach leverages vision-language models (VLMs) to enable open-vocabulary object labeling and to refine detection quality directly in the point cloud domain. VESPA supports the discovery of novel categories and produces high-quality 3D pseudolabels without requiring ground-truth annotations or HD maps. On Nuscenes dataset, VESPA achieves an AP of 52.95% for object discovery and up to 46.54% for multiclass object detection, demonstrating strong performance in scalable 3D scene understanding. Code will be available upon acceptance.
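One building block such a pipeline needs is LiDAR-to-camera projection, so that open-vocabulary labels predicted in the image can be transferred to point clusters. A generic sketch with toy calibration values, not VESPA's actual code:

```python
import numpy as np

def project_points(points_lidar, T_cam_lidar, K):
    """Project LiDAR points (N, 3) into the image plane.
    T_cam_lidar: 4x4 extrinsics, K: 3x3 camera intrinsics."""
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 0.1               # keep points ahead of camera
    uv = (K @ pts_cam[in_front].T).T
    uv = uv[:, :2] / uv[:, 2:3]                  # perspective divide
    return uv, in_front

# toy usage: each point inherits the open-vocabulary label of its pixel
points = np.random.randn(100, 3) * 5 + np.array([0, 0, 10])
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
uv, mask = project_points(points, np.eye(4), K)
```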
[255] Second Competition on Presentation Attack Detection on ID Card
Juan E. Tapia, Mario Nieto, Juan M. Espin, Alvaro S. Rocamora, Javier Barrachina, Naser Damer, Christoph Busch, Marija Ivanovska, Leon Todorov, Renat Khizbullin, Lazar Lazarevich, Aleksei Grishin, Daniel Schulz, Sebastian Gonzalez, Amir Mohammadi, Ketan Kotwal, Sebastien Marcel, Raghavendra Mudgalgundurao, Kiran Raja, Patrick Schuch, Sushrut Patwardhan, Raghavendra Ramachandra, Pedro Couto Pereira, Joao Ribeiro Pinto, Mariana Xavier, Andrés Valenzuela, Rodrigo Lara, Borut Batagelj, Marko Peterlin, Peter Peer, Ajnas Muhammed, Diogo Nunes, Nuno Gonçalves
Main category: cs.CV
TL;DR: The paper reports results from the second Presentation Attack Detection (PAD) competition on ID cards, highlighting improvements in evaluation methods and dataset quality, with notable performance gains over the first edition.
Details
Motivation: To advance PAD technology for ID cards by benchmarking algorithms and datasets, addressing challenges like limited bona fide images.Method: The competition featured an automatic evaluation platform, two tracks (algorithm and dataset evaluation), and a new ID card dataset for training.
Result: Top teams achieved significant improvements: ‘Dragons’ (Track 1) with 40.48% AV-Rank and 11.44% EER, and ‘Incode’ (Track 2) with 14.76% AV-Rank and 6.36% EER.
Conclusion: PAD for ID cards is improving but remains challenging due to limited bona fide images, suggesting further research is needed.
Abstract: This work summarises and reports the results of the second Presentation Attack Detection competition on ID cards. This new version includes new elements compared to the previous one. (1) An automatic evaluation platform was enabled for automatic benchmarking; (2) Two tracks were proposed in order to evaluate algorithms and datasets, respectively; and (3) A new ID card dataset was shared with Track 1 teams to serve as the baseline dataset for the training and optimisation. The Hochschule Darmstadt, Fraunhofer-IGD, and Facephi company jointly organised this challenge. 20 teams were registered, and 74 submitted models were evaluated. For Track 1, the “Dragons” team reached first place with an Average Rank (AV-Rank) of 40.48% and an Equal Error Rate (EER) of 11.44%. For the more challenging Track 2, the “Incode” team reached the best results with an AV-Rank of 14.76% and an EER of 6.36%, improving on the first edition’s results of 74.30% and 21.87% EER, respectively. These results suggest that PAD on ID cards is improving, but it is still a challenging problem related to the number of images, especially of bona fide images.
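For readers unfamiliar with the metric, the EER is the error rate at the threshold where the false accept rate (attacks accepted) equals the false reject rate (bona fide rejected). A simple NumPy sketch on synthetic scores:

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: point where FAR == FRR. labels: 1 = bona fide, 0 = attack;
    higher score = more bona-fide-like."""
    best = (1.0, 0.0)
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)   # attacks accepted
        frr = np.mean(scores[labels == 1] < t)    # bona fide rejected
        if abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return best[1]

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.7, 0.15, 100), rng.normal(0.3, 0.15, 100)])
labels = np.concatenate([np.ones(100), np.zeros(100)])
print(equal_error_rate(scores, labels))
```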
[256] Indian Sign Language Detection for Real-Time Translation using Machine Learning
Rajat Singhal, Jatin Gupta, Akhil Sharma, Anushka Gupta, Navya Sharma
Main category: cs.CV
TL;DR: The paper proposes a real-time Indian Sign Language (ISL) detection and translation system using a CNN, achieving 99.95% accuracy to bridge communication gaps for the deaf and mute in India.
Details
Motivation: To address the lack of skilled interpreters and accessible translation technologies for ISL, hindering effective communication for deaf and mute communities in India.Method: A Convolutional Neural Network (CNN) trained on an ISL dataset, integrated with MediaPipe for hand tracking and motion detection.
Result: The model achieves 99.95% classification accuracy, demonstrating high precision in discerning nuanced visual features of signs.
Conclusion: The system offers a reliable, real-time solution for ISL translation, enhancing communication accessibility for deaf and mute individuals in India.
Abstract: Sign languages, which rely on visual-spatial patterns of hand gestures & body movements, are the primary mode of communication for deaf & mute communities worldwide. Effective communication is fundamental to human interaction, yet individuals in these communities often face significant barriers due to a scarcity of skilled interpreters & accessible translation technologies. This research specifically addresses these challenges within the Indian context by focusing on Indian Sign Language (ISL). By leveraging machine learning, this study aims to bridge the critical communication gap for the deaf & hard-of-hearing population in India, where technological solutions for ISL are less developed compared to other global sign languages. We propose a robust, real-time ISL detection & translation system built upon a Convolutional Neural Network (CNN). Our model is trained on a comprehensive ISL dataset & demonstrates exceptional performance, achieving a classification accuracy of 99.95%. This high precision underscores the model’s capability to discern the nuanced visual features of different signs. The system’s effectiveness is rigorously evaluated using key performance metrics, including accuracy, F1 score, precision & recall, ensuring its reliability for real-world applications. For real-time implementation, the framework integrates MediaPipe for precise hand tracking & motion detection, enabling seamless translation of dynamic gestures. This paper provides a detailed account of the model’s architecture, the data preprocessing pipeline & the classification methodology.
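A minimal sketch of the real-time loop described above: MediaPipe supplies 21 hand landmarks per hand, and a placeholder classifier stands in for the paper's trained CNN; the webcam index and label set are assumptions.

```python
import cv2
import mediapipe as mp
import numpy as np

hands = mp.solutions.hands.Hands(max_num_hands=2, min_detection_confidence=0.5)

def classify(landmark_vec: np.ndarray) -> str:
    return "sign_placeholder"   # stand-in for the paper's trained CNN

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.multi_hand_landmarks:
        lm = result.multi_hand_landmarks[0].landmark
        vec = np.array([[p.x, p.y, p.z] for p in lm]).flatten()  # 21x3 landmarks
        print(classify(vec))
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
```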
[257] Can Foundation Models Predict Fitness for Duty?
Juan E. Tapia, Christoph Busch
Main category: cs.CV
TL;DR: The paper explores using deep learning and foundational models to predict fitness for duty (alertness) from near-infrared iris images, addressing challenges in dataset creation for training.
Details
Motivation: To expand the use of biometric devices beyond recognition by estimating alertness, despite the difficulty in gathering large datasets for training.Method: Utilizes deep learning and foundational models, leveraging self-supervised learning for generalization with limited data.
Result: Demonstrates the potential of foundational models to enhance fitness-for-duty prediction using iris images.
Conclusion: Foundational models offer a promising solution for improving alertness prediction in scenarios with limited training data.
Abstract: Biometric capture devices have been utilised to estimate a person’s alertness through near-infrared iris images, expanding their use beyond just biometric recognition. However, capturing a substantial number of corresponding images related to alcohol consumption, drug use, and sleep deprivation to create a dataset for training an AI model presents a significant challenge. Typically, a large quantity of images is required to effectively implement a deep learning approach. Currently, training downstream models on top of foundational models offers a real opportunity to advance this area, thanks to the generalisation capabilities of self-supervised models. This work examines the application of deep learning and foundational models in predicting fitness for duty, defined as the subject’s condition with respect to alertness for work.
[258] JOLT3D: Joint Learning of Talking Heads and 3DMM Parameters with Application to Lip-Sync
Sungjoon Park, Minsik Park, Haneol Lee, Jaesub Yun, Donggeon Lee
Main category: cs.CV
TL;DR: The paper proposes a joint learning approach for 3D face reconstruction and talking head synthesis, improving facial expression representation and lip-sync quality.
Details
Motivation: To enhance talking head synthesis by optimizing a FACS-based blendshape representation for facial expressions, addressing limitations of prior methods.Method: Jointly learns 3D face reconstruction and talking head synthesis models, decoupling chin contours and reducing flickering in lip-sync.
Result: Improved facial expression quality and lip-sync accuracy with reduced flickering near the mouth.
Conclusion: The joint learning approach outperforms previous methods, offering better control and quality in talking head synthesis.
Abstract: In this work, we revisit the effectiveness of 3DMM for talking head synthesis by jointly learning a 3D face reconstruction model and a talking head synthesis model. This enables us to obtain a FACS-based blendshape representation of facial expressions that is optimized for talking head synthesis. This contrasts with previous methods that either fit 3DMM parameters to 2D landmarks or rely on pretrained face reconstruction models. Not only does our approach increase the quality of the generated face, but it also allows us to take advantage of the blendshape representation to modify just the mouth region for the purpose of audio-based lip-sync. To this end, we propose a novel lip-sync pipeline that, unlike previous methods, decouples the original chin contour from the lip-synced chin contour, and reduces flickering near the mouth.
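The FACS-based blendshape representation underlying this pipeline is linear: a face is the neutral mesh plus a weighted sum of per-blendshape vertex offsets, which is why lip-sync can edit only mouth-related weights. A toy NumPy sketch with illustrative dimensions and indices:

```python
import numpy as np

def apply_blendshapes(neutral, blendshapes, weights):
    """vertices = neutral + sum_i w_i * B_i  (FACS-style blendshape model)."""
    return neutral + np.tensordot(weights, blendshapes, axes=1)

V, K = 5023, 52                       # toy vertex / blendshape counts
neutral = np.zeros((V, 3))
B = np.random.randn(K, V, 3) * 0.01   # per-blendshape vertex offsets
w = np.random.rand(K)

# Lip-sync in this representation can edit only mouth-related weights
# (these indices are illustrative) while the rest of the face is untouched.
mouth_idx = np.arange(20, 35)
w_synced = w.copy()
w_synced[mouth_idx] = 0.5             # replace with audio-predicted values
face = apply_blendshapes(neutral, B, w_synced)
```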
[259] Frequency-Aware Autoregressive Modeling for Efficient High-Resolution Image Synthesis
Zhuokun Chen, Jugang Fan, Zhuowei Yu, Bohan Zhuang, Mingkui Tan
Main category: cs.CV
TL;DR: SparseVAR is a plug-and-play framework for accelerating next-scale prediction in visual autoregressive modeling by dynamically excluding low-frequency tokens during inference, reducing computational overhead without additional training.
Details
Motivation: The computational overhead in high-resolution stages of next-scale prediction models is a challenge due to the large number of tokens. Low-frequency tokens have negligible impact on image quality and exhibit strong similarity with neighbors.Method: SparseVAR uses lightweight MSE-based metrics to identify and exclude low-frequency tokens while preserving fidelity with uniformly sampled anchor tokens.
Result: SparseVAR achieves up to 2x speedup with minimal quality degradation in models like Infinity-2B.
Conclusion: SparseVAR effectively reduces computational costs while maintaining high image generation quality, making it a practical solution for accelerating next-scale prediction models.
Abstract: Visual autoregressive modeling, based on the next-scale prediction paradigm, exhibits notable advantages in image quality and model scalability over traditional autoregressive and diffusion models. It generates images by progressively refining resolution across multiple stages. However, the computational overhead in high-resolution stages remains a critical challenge due to the substantial number of tokens involved. In this paper, we introduce SparseVAR, a plug-and-play acceleration framework for next-scale prediction that dynamically excludes low-frequency tokens during inference without requiring additional training. Our approach is motivated by the observation that tokens in low-frequency regions have a negligible impact on image quality in high-resolution stages and exhibit strong similarity with neighboring tokens. Additionally, we observe that different blocks in the next-scale prediction model focus on distinct regions, with some concentrating on high-frequency areas. SparseVAR leverages these insights by employing lightweight MSE-based metrics to identify low-frequency tokens while preserving the fidelity of excluded regions through a small set of uniformly sampled anchor tokens. By significantly reducing the computational cost while maintaining high image generation quality, SparseVAR achieves notable acceleration in both HART and Infinity. Specifically, SparseVAR achieves up to a 2 times speedup with minimal quality degradation in Infinity-2B.
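An illustrative sketch of the MSE-based selection idea: tokens whose features are close to their local neighborhood mean (low frequency) are marked prunable, while a uniform grid of anchor tokens is always kept. The neighborhood definition and thresholding here are assumptions, not the paper's exact criterion.

```python
import torch
import torch.nn.functional as F

def low_frequency_mask(tokens: torch.Tensor, keep_ratio: float, anchor_stride: int = 4):
    """tokens: (H, W, C) feature map at one scale. Marks tokens whose MSE to
    the local neighborhood mean is small as prunable, but always keeps a
    uniform grid of anchor tokens to preserve fidelity of excluded regions."""
    H, W, C = tokens.shape
    x = tokens.permute(2, 0, 1).unsqueeze(0)            # (1, C, H, W)
    neigh = F.avg_pool2d(x, 3, stride=1, padding=1)     # local mean
    mse = ((x - neigh) ** 2).mean(1).squeeze(0)         # (H, W)
    k = int(keep_ratio * H * W)
    thresh = mse.flatten().kthvalue(H * W - k).values
    keep = mse > thresh                                 # high-frequency tokens
    keep[::anchor_stride, ::anchor_stride] = True       # uniform anchors
    return keep

tokens = torch.randn(32, 32, 64)
mask = low_frequency_mask(tokens, keep_ratio=0.5)
print(mask.float().mean())   # fraction of tokens kept
```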
[260] Priority-Aware Pathological Hierarchy Training for Multiple Instance Learning
Sungrae Hong, Kyungeun Kim, Juhyeon Kim, Sol Lee, Jisu Shin, Chanjae Song, Mun Yong Yi
Main category: cs.CV
TL;DR: The paper proposes a new MIL method addressing priority issues in clinical diagnosis by using vertical and horizontal hierarchies, reducing misdiagnosis and prioritizing critical symptoms.
Details
Motivation: Existing MIL approaches in clinical settings fail to address priority among pathological symptoms and diagnostic classes, leading to ignored priorities.Method: The method uses two hierarchies (vertical inter-hierarchy and horizontal intra-hierarchy) to align MIL predictions and employs implicit feature re-usability to prioritize clinically serious classes.
Result: Experiments show the method reduces misdiagnosis and prioritizes important symptoms in multiclass scenarios.
Conclusion: The proposed method effectively addresses clinical priority issues in MIL, validated by real-world patient data and challenging cases.
Abstract: Multiple Instance Learning (MIL) is increasingly being used as a support tool within clinical settings for pathological diagnosis decisions, achieving high performance and removing the annotation burden. However, existing approaches for clinical MIL tasks have not adequately addressed the priority issues that exist in relation to pathological symptoms and diagnostic classes, causing MIL models to ignore priority among classes. To overcome this clinical limitation of MIL, we propose a new method that addresses priority issues using two hierarchies: vertical inter-hierarchy and horizontal intra-hierarchy. The proposed method aligns MIL predictions across each hierarchical level and employs an implicit feature re-usability during training to facilitate clinically more serious classes within the same level. Experiments with real-world patient data show that the proposed method effectively reduces misdiagnosis and prioritizes more important symptoms in multiclass scenarios. Further analysis verifies the efficacy of the proposed components and qualitatively confirms the MIL predictions against challenging cases with multiple symptoms.
[261] Automated 3D-GS Registration and Fusion via Skeleton Alignment and Gaussian-Adaptive Features
Shiyang Liu, Dianyi Yang, Yu Gao, Bohan Ren, Yi Yang, Mengyin Fu
Main category: cs.CV
TL;DR: A novel automated method for 3D Gaussian Splatting (3D-GS) sub-map alignment and fusion improves registration accuracy and fusion quality without manual intervention, enhancing 3D scene representation for robotics and navigation.
Details
Motivation: Existing methods for 3D-GS sub-map registration and fusion rely on manual intervention and degrade rendering quality, limiting their practicality for real-world applications like robotics and autonomous navigation.Method: The approach involves geometric skeleton extraction and ellipsoid-aware convolution for robust registration, along with a multi-factor Gaussian fusion strategy to reduce scene element loss.
Result: The method reduces RRE by 41.9% for registration and improves PSNR by 10.11 dB for fusion, demonstrating superior accuracy and structural preservation.
Conclusion: The proposed method effectively automates and enhances 3D-GS sub-map alignment and fusion, offering improved consistency and accuracy for 3D scene representation in practical applications.
Abstract: In recent years, 3D Gaussian Splatting (3D-GS)-based scene representation demonstrates significant potential in real-time rendering and training efficiency. However, most existing methods primarily focus on single-map reconstruction, while the registration and fusion of multiple 3D-GS sub-maps remain underexplored. Existing methods typically rely on manual intervention to select a reference sub-map as a template and use point cloud matching for registration. Moreover, hard-threshold filtering of 3D-GS primitives often degrades rendering quality after fusion. In this paper, we present a novel approach for automated 3D-GS sub-map alignment and fusion, eliminating the need for manual intervention while enhancing registration accuracy and fusion quality. First, we extract geometric skeletons across multiple scenes and leverage ellipsoid-aware convolution to capture 3D-GS attributes, facilitating robust scene registration. Second, we introduce a multi-factor Gaussian fusion strategy to mitigate the scene element loss caused by rigid thresholding. Experiments on the ScanNet-GSReg and our Coord datasets demonstrate the effectiveness of the proposed method in registration and fusion. For registration, it achieves a 41.9% reduction in RRE on complex scenes, ensuring more precise pose estimation. For fusion, it improves PSNR by 10.11 dB, highlighting superior structural preservation. These results confirm its ability to enhance scene alignment and reconstruction fidelity, ensuring more consistent and accurate 3D scene representation for robotic perception and autonomous navigation.
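For context, the two headline metrics are standard: RRE is the geodesic angle between the estimated and ground-truth rotations, and PSNR measures rendering fidelity after fusion. Compact NumPy definitions:

```python
import numpy as np

def psnr(img, ref, peak=1.0):
    """Peak Signal-to-Noise Ratio (dB) between rendered and reference images."""
    mse = np.mean((img - ref) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def rotation_error_deg(R_est, R_gt):
    """Relative Rotation Error (RRE): geodesic angle between two rotations."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

ref = np.random.rand(64, 64, 3)
print(psnr(ref + np.random.normal(0, 0.01, ref.shape), ref))
print(rotation_error_deg(np.eye(3), np.eye(3)))
```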
[262] An Improved YOLOv8 Approach for Small Target Detection of Rice Spikelet Flowering in Field Environments
Beizhang Chen, Jinming Liang, Zheng Xiong, Ming Pan, Xiangbao Meng, Qingshan Lin, Qun Ma, Yingping Zhao
Main category: cs.CV
TL;DR: The paper proposes an improved YOLOv8 model for detecting rice spikelet flowering, enhancing accuracy and speed for hybrid rice seed production.
Details
Motivation: Accurate detection of rice flowering time is vital for efficient pollination and higher yields, but automated recognition is challenging due to small spikelet size and field complexity.Method: The study improves YOLOv8 by replacing PANet with BiFPN for better feature fusion and adding a p2 small-object detection head. A dedicated dataset is created using high-resolution RGB cameras and augmentation.
Result: The improved YOLOv8s-p2 achieves 65.9% mAP@0.5, 67.6% precision, 61.5% recall, and 64.41% F1-score, with a speed of 69 f/s, outperforming the baseline.
Conclusion: The enhanced YOLOv8s-p2 provides a practical, high-accuracy solution for automated rice spikelet flowering monitoring in hybrid seed production.
Abstract: Accurately detecting rice flowering time is crucial for timely pollination in hybrid rice seed production. This not only enhances pollination efficiency but also ensures higher yields. However, due to the complexity of field environments and the characteristics of rice spikelets, such as their small size and short flowering period, automated and precise recognition remains challenging. To address this, this study proposes a rice spikelet flowering recognition method based on an improved YOLOv8 object detection model. First, a Bidirectional Feature Pyramid Network (BiFPN) replaces the original PANet structure to enhance feature fusion and improve multi-scale feature utilization. Second, to boost small object detection, a p2 small-object detection head is added, using finer feature mapping to reduce feature loss commonly seen in detecting small targets. Given the lack of publicly available datasets for rice spikelet flowering in field conditions, a high-resolution RGB camera and data augmentation techniques are used to construct a dedicated dataset, providing reliable support for model training and testing. Experimental results show that the improved YOLOv8s-p2 model achieves an mAP@0.5 of 65.9%, precision of 67.6%, recall of 61.5%, and F1-score of 64.41%, representing improvements of 3.10%, 8.40%, 10.80%, and 9.79%, respectively, over the baseline YOLOv8. The model also runs at 69 f/s on the test set, meeting practical application requirements. Overall, the improved YOLOv8s-p2 offers high accuracy and speed, providing an effective solution for automated monitoring in hybrid rice seed production.
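As a rough starting point, the Ultralytics API already ships a P2-head variant of YOLOv8; the BiFPN neck swap described in the paper would require a custom model YAML and is not shown. The dataset config name below is hypothetical.

```python
from ultralytics import YOLO

model = YOLO("yolov8s-p2.yaml")          # YOLOv8s with the extra P2 small-object head
model.train(data="rice_spikelet.yaml",   # hypothetical dataset config
            imgsz=1280, epochs=100)
metrics = model.val()                    # reports mAP@0.5, precision, recall
```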
[263] Enhancing Spatial Reasoning through Visual and Textual Thinking
Xun Liang, Xin Guo, Zhongming Jin, Weihang Pan, Penghui Shang, Deng Cai, Binbin Lin, Jieping Ye
Main category: cs.CV
TL;DR: The paper introduces SpatialVTS, a method to enhance spatial reasoning in VLMs by combining visual and textual thinking, improving performance without extra data.
Details
Motivation: VLMs struggle with spatial reasoning despite rapid advancements, prompting the need for a method like SpatialVTS to address this gap.Method: SpatialVTS involves visual thinking (generating location-specific tokens) and textual thinking (long-term inference from visual cues and dialogues), supported by dataset corrections and logical reasoning details.
Result: The model significantly improves spatial understanding tasks without additional inputs like masks or depth.
Conclusion: SpatialVTS effectively enhances spatial reasoning in VLMs, demonstrating superior performance in spatial tasks.
Abstract: The spatial reasoning task aims to reason about the spatial relationships in 2D and 3D space, which is a fundamental capability for Visual Question Answering (VQA) and robotics. Although vision language models (VLMs) have developed rapidly in recent years, they are still struggling with the spatial reasoning task. In this paper, we introduce a method that can enhance Spatial reasoning through Visual and Textual thinking Simultaneously (SpatialVTS). In the spatial visual thinking phase, our model is trained to generate location-related specific tokens of essential targets automatically. Not only are the objects mentioned in the problem addressed, but also the potential objects related to the reasoning are considered. During the spatial textual thinking phase, our model conducts long-term thinking based on visual cues and dialogues, gradually inferring the answers to spatial reasoning problems. To effectively support the model’s training, we perform manual corrections to the existing spatial reasoning dataset, eliminating numerous incorrect labels resulting from automatic annotation, restructuring the data input format to enhance generalization ability, and developing thinking processes with logical reasoning details. Without introducing additional information (such as masks or depth), our model’s overall average level in several spatial understanding tasks has significantly improved compared with other models.
[264] Beyond Class Tokens: LLM-guided Dominant Property Mining for Few-shot Classification
Wei Zhuo, Runjie Luo, Wufeng Xue, Linlin Shen
Main category: cs.CV
TL;DR: The paper proposes BCT-CLIP, a novel Few-Shot Learning (FSL) method that leverages dominating properties via contrastive learning and LLM-based prior knowledge to improve discriminative representation learning.
Details
Motivation: Addressing the challenge of data scarcity in FSL by moving beyond simple class name alignment to capture visual diversity and discriminative properties.Method: Introduces a multi-property generator (MPG) with patch-aware cross-attentions, an LLM-assisted retrieval procedure, and a contrastive learning strategy for property-token learning.
Result: Demonstrates superior performance on 11 widely used datasets, advancing discriminative class-specific representation learning.
Conclusion: Exploring dominating properties enhances FSL by providing comprehensive structural image representations, improving few-shot classification.
Abstract: Few-shot Learning (FSL), which endeavors to develop the generalization ability for recognizing novel classes using only a few images, faces significant challenges due to data scarcity. Recent CLIP-like methods based on contrastive language-image pretraining mitigate the issue by leveraging textual representation of the class name for unseen image discovery. Despite the achieved success, simply aligning visual representations to class name embeddings would compromise the visual diversity for novel class discrimination. To this end, we proposed a novel Few-Shot Learning (FSL) method (BCT-CLIP) that explores dominating properties via contrastive learning beyond simply using class tokens. Through leveraging LLM-based prior knowledge, our method pushes forward FSL with comprehensive structural image representations, including both global category representation and the patch-aware property embeddings. In particular, we presented a novel multi-property generator (MPG) with patch-aware cross-attentions to generate multiple visual property tokens, a Large Language Model (LLM)-assisted retrieval procedure with clustering-based pruning to obtain dominating property descriptions, and a new contrastive learning strategy for property-token learning. The superior performances on the 11 widely used datasets demonstrate that our investigation of dominating properties advances discriminative class-specific representation learning and few-shot classification.
[265] T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation
Chieh-Yun Chen, Min Shi, Gong Zhang, Humphrey Shi
Main category: cs.CV
TL;DR: T2I-Copilot is a training-free multi-agent system that automates prompt phrasing, model selection, and iterative refinement for text-to-image generation, improving quality and alignment.
Details
Motivation: Existing T2I models are sensitive to prompt phrasing and lack controllability, requiring repeated refinements without clear feedback.Method: T2I-Copilot uses three agents: Input Interpreter, Generation Engine, and Quality Evaluator, to automate and refine the generation process.
Result: It achieves competitive VQA scores, outperforms other models in cost-efficiency and quality, and supports human intervention.
Conclusion: T2I-Copilot simplifies prompt engineering, enhances generation quality, and offers a flexible, training-free solution for T2I tasks.
Abstract: Text-to-Image (T2I) generative models have revolutionized content creation but remain highly sensitive to prompt phrasing, often requiring users to repeatedly refine prompts multiple times without clear feedback. While techniques such as automatic prompt engineering, controlled text embeddings, denoising, and multi-turn generation mitigate these issues, they offer limited controllability, or often necessitate additional training, restricting the generalization abilities. Thus, we introduce T2I-Copilot, a training-free multi-agent system that leverages collaboration between (Multimodal) Large Language Models to automate prompt phrasing, model selection, and iterative refinement. This approach significantly simplifies prompt engineering while enhancing generation quality and text-image alignment compared to direct generation. Specifically, T2I-Copilot consists of three agents: (1) Input Interpreter, which parses the input prompt, resolves ambiguities, and generates a standardized report; (2) Generation Engine, which selects the appropriate model from different types of T2I models and organizes visual and textual prompts to initiate generation; and (3) Quality Evaluator, which assesses aesthetic quality and text-image alignment, providing scores and feedback for potential regeneration. T2I-Copilot can operate fully autonomously while also supporting human-in-the-loop intervention for fine-grained control. On GenAI-Bench, using open-source generation models, T2I-Copilot achieves a VQA score comparable to commercial models RecraftV3 and Imagen 3, surpasses FLUX1.1-pro by 6.17% at only 16.59% of its cost, and outperforms FLUX.1-dev and SD 3.5 Large by 9.11% and 6.36%. Code will be released at: https://github.com/SHI-Labs/T2I-Copilot.
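A control-flow-only sketch of the three-agent loop: the interpret/generate/evaluate functions are placeholders standing in for the (M)LLM and T2I model calls, with only the iterative refine-until-threshold structure mirroring the system description.

```python
def interpret(prompt: str) -> dict:
    return {"prompt": prompt, "style": "photo"}        # standardized report

def generate(report: dict) -> bytes:
    return b"image-bytes"                              # selected T2I model call

def evaluate(image: bytes, report: dict) -> tuple[float, str]:
    return 0.9, "looks aligned"                        # score + feedback

def t2i_copilot(prompt: str, threshold: float = 0.8, max_rounds: int = 3):
    report = interpret(prompt)                         # Input Interpreter
    for _ in range(max_rounds):
        image = generate(report)                       # Generation Engine
        score, feedback = evaluate(image, report)      # Quality Evaluator
        if score >= threshold:
            return image
        report["feedback"] = feedback                  # refine and retry
    return image

t2i_copilot("a cat reading a newspaper in the rain")
```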
[266] GaRe: Relightable 3D Gaussian Splatting for Outdoor Scenes from Unconstrained Photo Collections
Haiyang Bai, Jiaqi Zhu, Songru Jiang, Wei Huang, Tao Lu, Yuanqi Li, Jie Guo, Runze Fu, Yanwen Guo, Lijun Chen
Main category: cs.CV
TL;DR: A 3D Gaussian splatting framework for outdoor relighting using intrinsic image decomposition to handle sunlight, sky radiance, and indirect lighting, enabling diverse shading and dynamic shadows.
Details
Motivation: Prior methods compress global illumination into a single latent vector, limiting shading manipulation and shadow effects. This work aims to overcome these limitations.Method: Uses residual-based sun visibility extraction, region-based supervision with structural consistency loss, and ray-tracing for shadow simulation.
Result: Synthesizes novel views with high fidelity and produces natural, multifaceted illumination and shadow effects.
Conclusion: The framework outperforms state-of-the-art relighting solutions in generating realistic illumination and shadows.
Abstract: We propose a 3D Gaussian splatting-based framework for outdoor relighting that leverages intrinsic image decomposition to precisely integrate sunlight, sky radiance, and indirect lighting from unconstrained photo collections. Unlike prior methods that compress the per-image global illumination into a single latent vector, our approach enables simultaneously diverse shading manipulation and the generation of dynamic shadow effects. This is achieved through three key innovations: (1) a residual-based sun visibility extraction method to accurately separate direct sunlight effects, (2) a region-based supervision framework with a structural consistency loss for physically interpretable and coherent illumination decomposition, and (3) a ray-tracing-based technique for realistic shadow simulation. Extensive experiments demonstrate that our framework synthesizes novel views with competitive fidelity against state-of-the-art relighting solutions and produces more natural and multifaceted illumination and shadow effects.
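The intrinsic decomposition can be pictured as a per-pixel composition of the separated terms, roughly I = albedo * (V_sun * L_sun + L_sky + L_indirect); a toy NumPy sketch (the paper's exact shading model may differ):

```python
import numpy as np

def shade(albedo, sun_visibility, sun_radiance, sky_radiance, indirect):
    """Per-pixel composition of decomposed lighting terms:
    I = albedo * (V_sun * L_sun + L_sky + L_indirect)."""
    return albedo * (sun_visibility[..., None] * sun_radiance
                     + sky_radiance + indirect)

H, W = 4, 4
albedo = np.random.rand(H, W, 3)
vis = np.random.rand(H, W)             # residual-based sun visibility map
img = shade(albedo, vis,
            sun_radiance=np.array([1.0, 0.95, 0.9]),
            sky_radiance=np.array([0.2, 0.25, 0.35]),
            indirect=np.full(3, 0.05))
```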
[267] MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization
Hyung Kyu Kim, Sangmin Lee, Hak Gu Kim
Main category: cs.CV
TL;DR: MemoryTalker synthesizes realistic 3D facial animation from audio alone, eliminating the need for priors like labels or meshes, and focuses on capturing speaking styles.
Details
Motivation: Previous methods require additional priors (e.g., labels or meshes) and fail to reflect speaking styles, limiting practical use.Method: MemoryTalker uses a two-stage framework: 1) Memorizing general motion, and 2) Animating with audio-driven style features to emphasize personalized facial motion.
Result: The model generates reliable personalized facial animation without extra priors, outperforming state-of-the-art methods in evaluations.
Conclusion: MemoryTalker enhances personalized facial animation by focusing on speaking styles, improving usability and performance.
Abstract: Speech-driven 3D facial animation aims to synthesize realistic facial motion sequences from given audio, matching the speaker’s speaking style. However, previous works often require priors such as class labels of a speaker or additional 3D facial meshes at inference, which makes them fail to reflect the speaking style and limits their practical use. To address these issues, we propose MemoryTalker which enables realistic and accurate 3D facial motion synthesis by reflecting speaking style only with audio input to maximize usability in applications. Our framework consists of two training stages: the first stage stores and retrieves general motion (i.e., Memorizing), and the second stage performs personalized facial motion synthesis (i.e., Animating) with the motion memory stylized by the audio-driven speaking style feature. In this second stage, our model learns which facial motion types should be emphasized for a particular piece of audio. As a result, our MemoryTalker can generate a reliable personalized facial animation without additional prior information. With quantitative and qualitative evaluations, as well as a user study, we show the effectiveness of our model and its performance enhancement for personalized facial animation over state-of-the-art methods.
[268] AgroBench: Vision-Language Model Benchmark in Agriculture
Risa Shinoda, Nakamasa Inoue, Hirokatsu Kataoka, Masaki Onishi, Yoshitaka Ushiku
Main category: cs.CV
TL;DR: AgroBench is a benchmark for evaluating vision-language models (VLMs) in agriculture, covering 203 crops and 682 diseases. It reveals VLMs struggle with fine-grained tasks like weed identification.
Details
Motivation: Automated understanding of agricultural tasks like disease identification is crucial for sustainable farming. VLMs can enhance human-model interaction via text-based communication.Method: AgroBench, annotated by agronomists, evaluates VLMs across seven agricultural topics with 203 crop and 682 disease categories.
Result: VLMs show room for improvement, especially in fine-grained tasks like weed identification, where performance is near random.
Conclusion: AgroBench highlights VLM limitations and suggests future development pathways. The dataset and code are publicly available.
Abstract: Precise automated understanding of agricultural tasks such as disease identification is essential for sustainable crop production. Recent advances in vision-language models (VLMs) are expected to further expand the range of agricultural tasks by facilitating human-model interaction through easy, text-based communication. Here, we introduce AgroBench (Agronomist AI Benchmark), a benchmark for evaluating VLM models across seven agricultural topics, covering key areas in agricultural engineering and relevant to real-world farming. Unlike recent agricultural VLM benchmarks, AgroBench is annotated by expert agronomists. Our AgroBench covers a state-of-the-art range of categories, including 203 crop categories and 682 disease categories, to thoroughly evaluate VLM capabilities. In our evaluation on AgroBench, we reveal that VLMs have room for improvement in fine-grained identification tasks. Notably, in weed identification, most open-source VLMs perform close to random. With our wide range of topics and expert-annotated categories, we analyze the types of errors made by VLMs and suggest potential pathways for future VLM development. Our dataset and code are available at https://dahlian00.github.io/AgroBenchPage/ .
[269] Learning Phonetic Context-Dependent Viseme for Enhancing Speech-Driven 3D Facial Animation
Hyung Kyu Kim, Hak Gu Kim
Main category: cs.CV
TL;DR: The paper introduces a phonetic context-aware loss to improve speech-driven 3D facial animation by addressing coarticulation issues, resulting in smoother and more natural animations.
Details
Motivation: Traditional frame-wise methods fail to capture facial motion continuity, causing jittery and unnatural outputs due to coarticulation.Method: Proposes a phonetic context-aware loss with viseme coarticulation weights to adaptively prioritize facial movements based on dynamic changes over time.
Result: Experiments show improved quantitative metrics and visual quality compared to conventional reconstruction loss.
Conclusion: Explicitly modeling phonetic context-dependent visemes is crucial for natural speech-driven 3D facial animation.
Abstract: Speech-driven 3D facial animation aims to generate realistic facial movements synchronized with audio. Traditional methods primarily minimize reconstruction loss by aligning each frame with ground-truth. However, this frame-wise approach often fails to capture the continuity of facial motion, leading to jittery and unnatural outputs due to coarticulation. To address this, we propose a novel phonetic context-aware loss, which explicitly models the influence of phonetic context on viseme transitions. By incorporating a viseme coarticulation weight, we assign adaptive importance to facial movements based on their dynamic changes over time, ensuring smoother and perceptually consistent animations. Extensive experiments demonstrate that replacing the conventional reconstruction loss with ours improves both quantitative metrics and visual quality. It highlights the importance of explicitly modeling phonetic context-dependent visemes in synthesizing natural speech-driven 3D facial animation. Project page: https://cau-irislab.github.io/interspeech25/
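A hedged sketch of a context-weighted reconstruction loss in this spirit: frames with larger ground-truth motion change, a rough proxy for viseme transitions, receive larger weights. The weighting scheme is illustrative, not the paper's exact coarticulation weight.

```python
import torch

def phonetic_context_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Weighted reconstruction loss over vertex sequences of shape (T, V*3):
    frames with larger ground-truth motion change get larger weights."""
    motion = (gt[1:] - gt[:-1]).norm(dim=-1)            # frame-to-frame change
    w = 1.0 + motion / (motion.mean() + 1e-8)           # adaptive weight per frame
    per_frame = ((pred[1:] - gt[1:]) ** 2).mean(dim=-1)
    return (w * per_frame).mean()

pred, gt = torch.randn(30, 15069), torch.randn(30, 15069)
print(phonetic_context_loss(pred, gt))
```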
[270] Low-Cost Machine Vision System for Sorting Green Lentils (Lens Culinaris) Based on Pneumatic Ejection and Deep Learning
Davy Rojas Yana, Edwin Salcedo
Main category: cs.CV
TL;DR: A dynamic grain classification system for green lentils using YOLOv8 and pneumatic ejection achieves 87.2% accuracy at 59 mm/s conveyor speed.
Details
Motivation: To develop a real-time, accurate, and low-cost system for classifying and sorting green lentils using computer vision and mechanical components.Method: Two-stage YOLOv8 pipeline: detection and classification into six categories, combined with pneumatic ejection and Arduino-based control.
Result: 87.2% grain separation accuracy at 59 mm/s, though processing rate is limited to 8 grams per minute.
Conclusion: The system shows promise for grain sorting and offers a modular foundation for future improvements.
Abstract: This paper presents the design, development, and evaluation of a dynamic grain classification system for green lentils (Lens Culinaris), which leverages computer vision and pneumatic ejection. The system integrates a YOLOv8-based detection model that identifies and locates grains on a conveyor belt, together with a second YOLOv8-based classification model that categorises grains into six classes: Good, Yellow, Broken, Peeled, Dotted, and Reject. This two-stage YOLOv8 pipeline enables accurate, real-time, multi-class categorisation of lentils, implemented on a low-cost, modular hardware platform. The pneumatic ejection mechanism separates defective grains, while an Arduino-based control system coordinates real-time interaction between the vision system and mechanical components. The system operates effectively at a conveyor speed of 59 mm/s, achieving a grain separation accuracy of 87.2%. Despite a limited processing rate of 8 grams per minute, the prototype demonstrates the potential of machine vision for grain sorting and provides a modular foundation for future enhancements.
[271] Annotation-Free Human Sketch Quality Assessment
Lan Yang, Kaiyue Pang, Honggang Zhang, Yi-Zhe Song
Main category: cs.CV
TL;DR: The paper introduces a method (GACL) for assessing sketch quality using feature magnitude as a metric, validated by human studies, and extends to natural images and noisy label cleansing.
Details
Motivation: To address the lack of quantitative metrics for sketch quality assessment and enable practical applications.Method: Proposes Geometry-Aware Classification Layer (GACL), which uses feature magnitude ($L_2$ norm) as a quality metric without human annotations, optimizing recognition and quality simultaneously.
Result: GACL aligns with human perception in quality assessment and enables three sketch applications; it also works for natural image quality assessment and noisy label cleansing.
Conclusion: GACL is a versatile, annotation-free method for quality assessment, applicable beyond sketches to broader domains like image quality and data re-weighting.
Abstract: As lovely as bunnies are, your sketched version would probably not do them justice. This paper recognises this very problem and studies sketch quality assessment for the first time – letting you find these badly drawn ones. Our key discovery lies in exploiting the magnitude ($L_2$ norm) of a sketch feature as a quantitative quality metric. We propose Geometry-Aware Classification Layer (GACL), a generic method that makes feature-magnitude-as-quality-metric possible and importantly does it without the need for specific quality annotations from humans. GACL sees feature magnitude and recognisability learning as a dual task, which can be simultaneously optimised under a neat cross-entropy classification loss with theoretic guarantee. This gives GACL a nice geometric interpretation (the better the quality, the easier the recognition), and makes it agnostic to both network architecture changes and the underlying sketch representation. Through a large scale human study of 160,000 trials, we confirm the agreement between our GACL-induced metric and human quality perception. We further demonstrate how such a quality assessment capability can for the first time enable three practical sketch applications. Interestingly, we show GACL not only works on abstract visual representations such as sketch but also extends well to natural images on the problem of image quality assessment (IQA). Last but not least, we spell out the general properties of GACL as a general-purpose data re-weighting strategy and demonstrate its applications in vertical problems such as noisy label cleansing. Code will be made publicly available at github.com/yanglan0225/SketchX-Quantifying-Sketch-Quality.
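A minimal sketch of the feature-magnitude-as-quality idea: with unit-normalized class directions, the feature's L2 norm scales logit confidence under plain cross-entropy, so the norm can be read off as a quality score. Margins, normalization details, and dimensions here are assumptions, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class GACLStyleHead(nn.Module):
    """Classification layer where the feature L2 norm doubles as a quality
    score: easier-to-recognise inputs get larger-magnitude features."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats: torch.Tensor):
        w = nn.functional.normalize(self.weight, dim=1)  # unit class directions
        logits = feats @ w.t()           # magnitude scales logit confidence
        quality = feats.norm(dim=1)      # feature-magnitude-as-quality-metric
        return logits, quality

head = GACLStyleHead(128, 100)
feats = torch.randn(4, 128)
logits, quality = head(feats)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 100, (4,)))
```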
[272] Lightweight Remote Sensing Scene Classification on Edge Devices via Knowledge Distillation and Early-exit
Yang Zhao, Shusheng Li, Xueshang Feng
Main category: cs.CV
TL;DR: A lightweight RSSC framework with distilled GFNet and early-exit mechanism improves efficiency on edge devices, achieving faster inference and better energy efficiency without sacrificing accuracy.
Details
Motivation: Existing RSSC models struggle to balance accuracy, latency, and energy consumption on resource-constrained edge devices.Method: Proposes a distilled GFNet model with frequency domain distillation and a dynamic early-exit mechanism for edge devices.
Result: Achieves 1.3x speedup in inference, over 40% energy efficiency improvement, and maintains high accuracy across four datasets.
Conclusion: The framework effectively optimizes performance for RSSC on edge devices, addressing key challenges in efficiency and resource constraints.
Abstract: With the development of lightweight deep learning algorithms, various deep neural network (DNN) models have been proposed for the remote sensing scene classification (RSSC) application. However, it is still challenging for these RSSC models to achieve optimal performance among model accuracy, inference latency, and energy consumption on resource-constrained edge devices. In this paper, we propose a lightweight RSSC framework, which includes a distilled global filter network (GFNet) model and an early-exit mechanism designed for edge devices to achieve state-of-the-art performance. Specifically, we first apply frequency domain distillation on the GFNet model to reduce model size. Then we design a dynamic early-exit model tailored for DNN models on edge devices to further improve model inference efficiency. We evaluate our E3C model on three edge devices across four datasets. Extensive experimental results show that it achieves an average of 1.3x speedup on model inference and over 40% improvement on energy efficiency, while maintaining high classification accuracy.
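The early-exit mechanism in its generic form: intermediate classifier heads let confident samples leave the network early, trading compute for latency on easy inputs. A generic PyTorch sketch, not the paper's E3C model:

```python
import torch
import torch.nn as nn

def early_exit_inference(blocks, exit_heads, x, threshold: float = 0.9):
    """Run backbone blocks in sequence; return as soon as an intermediate
    classifier is confident enough (batch size 1 for simplicity)."""
    for block, head in zip(blocks, exit_heads):
        x = block(x)
        probs = torch.softmax(head(x), dim=-1)
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:       # confident: exit early
            return pred, conf
    return pred, conf                      # fall through to the final exit

blocks = nn.ModuleList([nn.Linear(32, 32) for _ in range(3)])
heads = nn.ModuleList([nn.Linear(32, 10) for _ in range(3)])
with torch.no_grad():
    pred, conf = early_exit_inference(blocks, heads, torch.randn(1, 32))
```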
[273] FED-PsyAU: Privacy-Preserving Micro-Expression Recognition via Psychological AU Coordination and Dynamic Facial Motion Modeling
Jingting Li, Yu Qian, Lin Zhao, Su-Jing Wang
Main category: cs.CV
TL;DR: The paper proposes FED-PsyAU, a framework combining psychological studies and federated learning to improve micro-expression recognition (MER) while addressing privacy and data limitations.
Details
Motivation: Micro-expressions (MEs) reveal concealed emotions but face challenges like small datasets, subtle features, and privacy concerns in real-world applications.Method: The FED-PsyAU framework integrates psychological insights on facial action units (AUs) with a DPK-GAT network for hierarchical feature learning. Federated learning is used to enhance MER across clients without data sharing.
Result: Experiments on standard ME databases confirm the framework’s effectiveness in improving MER performance.
Conclusion: The approach successfully addresses MER challenges by leveraging psychological priors and federated learning, ensuring privacy and scalability.
Abstract: Micro-expressions (MEs) are brief, low-intensity, often localized facial expressions. They could reveal genuine emotions individuals may attempt to conceal, valuable in contexts like criminal interrogation and psychological counseling. However, ME recognition (MER) faces challenges, such as small sample sizes and subtle features, which hinder efficient modeling. Additionally, real-world applications encounter ME data privacy issues, leaving the task of enhancing recognition across settings under privacy constraints largely unexplored. To address these issues, we propose a FED-PsyAU research framework. We begin with a psychological study on the coordination of upper and lower facial action units (AUs) to provide structured prior knowledge of facial muscle dynamics. We then develop a DPK-GAT network that combines these psychological priors with statistical AU patterns, enabling hierarchical learning of facial motion features from regional to global levels, effectively enhancing MER performance. Additionally, our federated learning framework advances MER capabilities across multiple clients without data sharing, preserving privacy and alleviating the limited-sample issue for each client. Extensive experiments on commonly-used ME databases demonstrate the effectiveness of our approach.
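The privacy-preserving training loop can be pictured as plain FedAvg: clients train locally on private ME data and only model weights reach the server for averaging. A generic sketch; the paper's federated protocol may differ in its aggregation details.

```python
import copy
import torch
import torch.nn as nn

def fed_avg(global_model: nn.Module, client_loaders, rounds: int = 5):
    """FedAvg: local training on private data, weight averaging on the server.
    No raw data ever leaves a client."""
    for _ in range(rounds):
        client_states = []
        for loader in client_loaders:
            local = copy.deepcopy(global_model)
            opt = torch.optim.SGD(local.parameters(), lr=0.01)
            for x, y in loader:                      # one local epoch
                opt.zero_grad()
                nn.functional.cross_entropy(local(x), y).backward()
                opt.step()
            client_states.append(local.state_dict())
        avg = {k: torch.stack([s[k] for s in client_states]).mean(0)
               for k in client_states[0]}
        global_model.load_state_dict(avg)
    return global_model

model = nn.Linear(10, 3)                             # toy ME classifier
loader = [(torch.randn(8, 10), torch.randint(0, 3, (8,)))]
fed_avg(model, [loader, loader])
```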
[274] TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model
Ao Li, Yuxiang Duan, Jinghui Zhang, Congbo Ma, Yutong Xie, Gustavo Carneiro, Mohammad Yaqub, Hu Wang
Main category: cs.CV
TL;DR: TransPrune is a training-free token pruning method for LVLMs, using Token Transition Variation and Instruction-Guided Attention to reduce computational costs while maintaining performance.
Details
Motivation: High computational costs in LVLMs due to visual tokens motivate the need for efficient token pruning without relying on attention-based criteria, which have limitations like positional bias.Method: TransPrune assesses token importance via Token Transition Variation (TTV) and Instruction-Guided Attention (IGA), progressively pruning tokens to improve efficiency.
Result: TransPrune reduces inference TFLOPs by over half while matching original LVLM performance across eight benchmarks. TTV alone performs comparably to attention-based methods.
Conclusion: TransPrune offers an efficient, training-free solution for token pruning in LVLMs, leveraging token transitions and instruction-guided attention to maintain performance with reduced computational costs.
Abstract: Large Vision-Language Models (LVLMs) have advanced multimodal learning but face high computational costs due to the large number of visual tokens, motivating token pruning to improve inference efficiency. The key challenge lies in identifying which tokens are truly important. Most existing approaches rely on attention-based criteria to estimate token importance. However, they inherently suffer from certain limitations, such as positional bias. In this work, we explore a new perspective on token importance based on token transitions in LVLMs. We observe that the transition of token representations provides a meaningful signal of semantic information. Based on this insight, we propose TransPrune, a training-free and efficient token pruning method. Specifically, TransPrune progressively prunes tokens by assessing their importance through a combination of Token Transition Variation (TTV)-which measures changes in both the magnitude and direction of token representations-and Instruction-Guided Attention (IGA), which measures how strongly the instruction attends to image tokens via attention. Extensive experiments demonstrate that TransPrune achieves comparable multimodal performance to original LVLMs, such as LLaVA-v1.5 and LLaVA-Next, across eight benchmarks, while reducing inference TFLOPs by more than half. Moreover, TTV alone can serve as an effective criterion without relying on attention, achieving performance comparable to attention-based methods. The code will be made publicly available upon acceptance of the paper at https://github.com/liaolea/TransPrune.
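A sketch of a TTV-style score: for each token, combine the absolute change in representation magnitude with the change in direction (1 - cosine similarity) across layers, then keep the highest-scoring tokens. The equal weighting of the two terms is an assumption.

```python
import torch

def token_transition_variation(prev: torch.Tensor, curr: torch.Tensor) -> torch.Tensor:
    """Score token importance from its transition between two layers:
    magnitude change plus direction change."""
    mag_change = (curr.norm(dim=-1) - prev.norm(dim=-1)).abs()
    cos = torch.nn.functional.cosine_similarity(prev, curr, dim=-1)
    return mag_change + (1.0 - cos)        # higher = more semantically active

prev, curr = torch.randn(1, 196, 768), torch.randn(1, 196, 768)
scores = token_transition_variation(prev, curr)   # (1, 196)
keep = scores.topk(98, dim=-1).indices            # prune the bottom 50%
```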
[275] LSFDNet: A Single-Stage Fusion and Detection Network for Ships Using SWIR and LWIR
Yanyin Guo, Runxuan An, Junwei Li, Zhiyuan Zhang
Main category: cs.CV
TL;DR: The paper proposes LSFDNet, a single-stage image fusion detection algorithm combining SWIR and LWIR for ship detection, addressing limitations of single-modal methods. It introduces MLCF and OE loss for improved performance and a new dataset, NSLSR.
Details
Motivation: Traditional ship detection methods using single-modal images (visible or infrared) struggle in complex conditions like varying lighting or fog. This work explores SWIR and LWIR fusion for better performance.Method: LSFDNet integrates feature interaction between fusion and detection tasks, uses MLCF for multi-level feature fusion, and employs OE loss for object semantics retention. A new dataset, NSLSR, is introduced for training.
Result: The algorithm achieves superior detection performance and generates visually impressive fused images, validated on two datasets.
Conclusion: LSFDNet effectively addresses limitations of single-modal methods, demonstrating the benefits of multi-modal fusion in ship detection.
Abstract: Traditional ship detection methods primarily rely on single-modal approaches, such as visible or infrared images, which limit their application in complex scenarios involving varying lighting conditions and heavy fog. To address this issue, we explore the advantages of short-wave infrared (SWIR) and long-wave infrared (LWIR) in ship detection and propose a novel single-stage image fusion detection algorithm called LSFDNet. This algorithm leverages feature interaction between the image fusion and object detection subtask networks, achieving remarkable detection performance and generating visually impressive fused images. To further improve the saliency of objects in the fused images and improve the performance of the downstream detection task, we introduce the Multi-Level Cross-Fusion (MLCF) module. This module combines object-sensitive fused features from the detection task and aggregates features across multiple modalities, scales, and tasks to obtain more semantically rich fused features. Moreover, we utilize the position prior from the detection task in the Object Enhancement (OE) loss function, further increasing the retention of object semantics in the fused images. The detection task also utilizes preliminary fused features from the fusion task to complement SWIR and LWIR features, thereby enhancing detection performance. Additionally, we have established a Nearshore Ship Long-Short Wave Registration (NSLSR) dataset to train effective SWIR and LWIR image fusion and detection networks, bridging a gap in this field. We validated the superiority of our proposed single-stage fusion detection algorithm on two datasets. The source code and dataset are available at https://github.com/Yanyin-Guo/LSFDNet
[276] AV-Deepfake1M++: A Large-Scale Audio-Visual Deepfake Benchmark with Real-World Perturbations
Zhixi Cai, Kartik Kuckreja, Shreya Ghosh, Akanksha Chuchra, Muhammad Haris Khan, Usman Tariq, Tom Gedeon, Abhinav Dhall
Main category: cs.CV
TL;DR: The paper introduces AV-Deepfake1M++, a dataset with 2M video clips featuring diverse manipulation and perturbation strategies, to combat realistic video fabrication. It benchmarks state-of-the-art methods and hosts the 2025 1M-Deepfakes Detection Challenge.
Details
Motivation: Address the challenge of highly realistic video fabrication by providing a rich dataset for Deepfake detection research.
Method: Extends AV-Deepfake1M to include 2M video clips with varied manipulation and audio-visual perturbation strategies, followed by benchmarking using state-of-the-art methods.
Result: AV-Deepfake1M++ is created, and its effectiveness is demonstrated through benchmarking. The dataset supports the 2025 1M-Deepfakes Detection Challenge.
Conclusion: AV-Deepfake1M++ is a valuable resource for Deepfake research, and the hosted challenge aims to advance detection methods.
Abstract: The rapid surge of text-to-speech and face-voice reenactment models makes video fabrication easier and highly realistic. To counter this problem, we require datasets that are rich in generation methods and in the perturbation strategies common to online videos. To this end, we propose AV-Deepfake1M++, an extension of AV-Deepfake1M comprising 2 million video clips with diversified manipulation strategies and audio-visual perturbations. This paper includes the description of the data generation strategies along with benchmarking of AV-Deepfake1M++ using state-of-the-art methods. We believe that this dataset will play a pivotal role in facilitating research in the Deepfake domain. Based on this dataset, we host the 2025 1M-Deepfakes Detection Challenge. The challenge details, dataset and evaluation scripts are available online under a research-only license at https://deepfakes1m.github.io/2025.
[277] A Multimodal Architecture for Endpoint Position Prediction in Team-based Multiplayer Games
Jonas Peche, Aliaksei Tsishurou, Alexander Zap, Guenter Wallner
Main category: cs.CV
TL;DR: A multimodal U-Net-based architecture predicts future player locations in games using heterogeneous data and attention mechanisms for agent communication.
Details
Motivation: Understanding and predicting player movement is essential for applications like bot navigation, strategy recommendation, and behavior analytics in complex game environments.
Method: Uses a U-Net-based approach with multimodal feature encoding and multi-head attention for feature groups to predict player locations via heatmaps.
Result: The architecture effectively leverages multimodal game state data, enabling accurate future player position prediction.
Conclusion: This method supports downstream tasks like predictive bot behavior and anomaly detection by reliably forecasting player movements.
Abstract: Understanding and predicting player movement in multiplayer games is crucial for achieving use cases such as player-mimicking bot navigation, preemptive bot control, strategy recommendation, and real-time player behavior analytics. However, the complex environments allow for a high degree of navigational freedom, and the interactions and team-play between players require models that make effective use of the available heterogeneous input data. This paper presents a multimodal architecture for predicting future player locations on a dynamic time horizon, using a U-Net-based approach for calculating endpoint location probability heatmaps, conditioned using a multimodal feature encoder. The application of a multi-head attention mechanism for different groups of features allows for communication between agents. In doing so, the architecture makes efficient use of the multimodal game state including image inputs, numerical and categorical features, as well as dynamic game data. Consequently, the presented technique lays the foundation for various downstream tasks that rely on future player positions such as the creation of player-predictive bot behavior or player anomaly detection.
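The agent-communication piece reduces to attention over per-agent feature groups before conditioning the U-Net decoder; a minimal sketch with illustrative dimensions (not the paper's exact module layout):

```python
import torch
import torch.nn as nn

class AgentFeatureAttention(nn.Module):
    """Sketch: encoded per-agent feature groups exchange information via
    multi-head self-attention, so teammates' states can inform each
    agent's endpoint heatmap. Dimensions are illustrative."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, agent_feats: torch.Tensor) -> torch.Tensor:
        # agent_feats: (B, n_agents, dim) fused numeric/categorical/dynamic
        # features per player; the attended output conditions the decoder.
        out, _ = self.attn(agent_feats, agent_feats, agent_feats)
        return out
```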
[278] M-Net: MRI Brain Tumor Sequential Segmentation Network via Mesh-Cast
Jiacheng Lu, Hui Ding, Shiyu Zhang, Guoping Huo
Main category: cs.CV
TL;DR: M-Net proposes a novel Mesh-Cast mechanism and TPS training strategy for MRI tumor segmentation, leveraging spatial correlations between slices as ’temporal-like’ data to outperform existing methods.
Details
Motivation: Existing MRI segmentation models underutilize spatial correlations between adjacent slices, which can enhance continuity and accuracy.
Method: M-Net integrates Mesh-Cast for channel-temporal processing and uses TPS training to learn common patterns before refining slice-specific features.
Result: M-Net outperforms existing methods on BraTS2019 and BraTS2023 datasets across all key metrics.
Conclusion: M-Net is a robust, temporally-aware solution for MRI tumor segmentation, balancing accuracy and computational efficiency.
Abstract: MRI tumor segmentation remains a critical challenge in medical imaging, where volumetric analysis faces unique computational demands due to the complexity of 3D data. The spatially sequential arrangement of adjacent MRI slices provides valuable information that enhances segmentation continuity and accuracy, yet this characteristic remains underutilized in many existing models. The spatial correlations between adjacent MRI slices can be regarded as “temporal-like” data, similar to frame sequences in video segmentation tasks. To bridge this gap, we propose M-Net, a flexible framework specifically designed for sequential image segmentation. M-Net introduces the novel Mesh-Cast mechanism, which seamlessly integrates arbitrary sequential models into the processing of both channel and temporal information, thereby systematically capturing the inherent “temporal-like” spatial correlations between MRI slices. Additionally, we define an MRI sequential input pattern and design a Two-Phase Sequential (TPS) training strategy, which first focuses on learning common patterns across sequences before refining slice-specific feature extraction. This approach leverages temporal modeling techniques to preserve volumetric contextual information while avoiding the high computational cost of full 3D convolutions, thereby enhancing the generalizability and robustness of M-Net in sequential segmentation tasks. Experiments on the BraTS2019 and BraTS2023 datasets demonstrate that M-Net outperforms existing methods across all key metrics, establishing itself as a robust solution for temporally-aware MRI tumor segmentation.
[279] Multi-Masked Querying Network for Robust Emotion Recognition from Incomplete Multi-Modal Physiological Signals
Geng-Xin Xu, Xiang Zuo, Ye Li
Main category: cs.CV
TL;DR: MMQ-Net improves emotion recognition from physiological data by addressing incomplete signals and movement interference using multi-query mechanisms.
Details
Motivation: Challenges in emotion recognition include incomplete multi-modal signals and interference from body movements/artifacts.
Method: MMQ-Net integrates modality, category, and interference queries to reconstruct missing data, focus on emotional features, and separate noise.
Result: MMQ-Net outperforms existing methods, especially with highly incomplete data.
Conclusion: MMQ-Net effectively tackles key challenges in emotion recognition, offering superior performance.
Abstract: Emotion recognition from physiological data is crucial for mental health assessment, yet it faces two significant challenges: incomplete multi-modal signals and interference from body movements and artifacts. This paper presents a novel Multi-Masked Querying Network (MMQ-Net) to address these issues by integrating multiple querying mechanisms into a unified framework. Specifically, it uses modality queries to reconstruct missing data from incomplete signals, category queries to focus on emotional state features, and interference queries to separate relevant information from noise. Extensive experiment results demonstrate the superior emotion recognition performance of MMQ-Net compared to existing approaches, particularly under high levels of data incompleteness.
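A compact sketch of the multi-query mechanism: three learnable query banks cross-attend to the signal tokens. Bank sizes and the downstream decoding heads for each bank are assumptions, not the authors' exact design:

```python
import torch
import torch.nn as nn

class MultiMaskedQuerying(nn.Module):
    """Sketch of MMQ-Net's query idea: modality queries reconstruct missing
    data, category queries gather emotion evidence, interference queries
    soak up movement artifacts."""

    def __init__(self, dim=256, n_mod=8, n_cat=4, n_int=4, heads=4):
        super().__init__()
        self.mod_q = nn.Parameter(torch.randn(n_mod, dim))
        self.cat_q = nn.Parameter(torch.randn(n_cat, dim))
        self.int_q = nn.Parameter(torch.randn(n_int, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, pad_mask=None):
        # tokens: (B, T, dim) embedded physiological signal;
        # pad_mask: (B, T) bool, True where samples are missing.
        B = tokens.size(0)
        queries = torch.cat([self.mod_q, self.cat_q, self.int_q], dim=0)
        queries = queries.unsqueeze(0).expand(B, -1, -1)
        out, _ = self.attn(queries, tokens, tokens, key_padding_mask=pad_mask)
        sizes = [self.mod_q.size(0), self.cat_q.size(0), self.int_q.size(0)]
        return out.split(sizes, dim=1)  # (modality, category, interference)
```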
[280] Harnessing Diffusion-Yielded Score Priors for Image Restoration
Xinqi Lin, Fanghua Yu, Jinfan Hu, Zhiyuan You, Wu Shi, Jimmy S. Ren, Jinjin Gu, Chao Dong
Main category: cs.CV
TL;DR: HYPIR is a novel method for deep image restoration that combines pre-trained diffusion models with adversarial training, achieving high-quality results efficiently.
Details
Motivation: Existing methods (MSE-based, GAN-based, diffusion-based) struggle to balance restoration quality, fidelity, and speed. HYPIR aims to address these challenges.
Method: HYPIR initializes the restoration model with a pre-trained diffusion model and fine-tunes it via adversarial training, avoiding diffusion loss and iterative sampling.
Result: HYPIR improves stability, avoids mode collapse, accelerates convergence, and outperforms state-of-the-art methods in efficiency and quality.
Conclusion: HYPIR offers a fast, high-quality solution for image restoration with additional user control features like text-guided restoration.
Abstract: Deep image restoration models aim to learn a mapping from degraded image space to natural image space. However, they face several critical challenges: removing degradation, generating realistic details, and ensuring pixel-level consistency. Over time, three major classes of methods have emerged, including MSE-based, GAN-based, and diffusion-based methods. However, they fail to achieve a good balance between restoration quality, fidelity, and speed. We propose a novel method, HYPIR, to address these challenges. Our solution pipeline is straightforward: it involves initializing the image restoration model with a pre-trained diffusion model and then fine-tuning it with adversarial training. This approach does not rely on diffusion loss, iterative sampling, or additional adapters. We theoretically demonstrate that initializing adversarial training from a pre-trained diffusion model positions the initial restoration model very close to the natural image distribution. Consequently, this initialization improves numerical stability, avoids mode collapse, and substantially accelerates the convergence of adversarial training. Moreover, HYPIR inherits the capabilities of diffusion models with rich user control, enabling text-guided restoration and adjustable texture richness. Requiring only a single forward pass, it achieves faster convergence and inference speed than diffusion-based methods. Extensive experiments show that HYPIR outperforms previous state-of-the-art methods, achieving efficient and high-quality image restoration.
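The training recipe is compact enough to sketch: initialize the generator from pretrained diffusion weights, then run plain adversarial fine-tuning. The abstract confirms there is no diffusion loss or iterative sampling; the non-saturating GAN loss and the L1 fidelity term below are my assumptions, and `G`, `D`, and the optimizers are placeholders:

```python
import torch.nn.functional as F

def train_step(G, D, opt_g, opt_d, degraded, clean, lam=0.1):
    """One adversarial fine-tuning step; G was loaded from diffusion weights."""
    restored = G(degraded)  # single forward pass, no sampling loop

    # Discriminator step: real images get positive logits, fakes negative.
    opt_d.zero_grad()
    d_loss = (F.softplus(-D(clean)).mean()
              + F.softplus(D(restored.detach())).mean())
    d_loss.backward()
    opt_d.step()

    # Generator step: fool D while staying close to the target.
    opt_g.zero_grad()
    g_loss = F.softplus(-D(restored)).mean() + lam * F.l1_loss(restored, clean)
    g_loss.backward()
    opt_g.step()
    return g_loss.item(), d_loss.item()
```

The paper's theoretical point is that the diffusion initialization already places G near the natural image distribution, which is what makes this otherwise fragile GAN loop stable.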
[281] Enhanced Deep Learning DeepFake Detection Integrating Handcrafted Features
Alejandro Hinke-Navarro, Mario Nieto-Hidalgo, Juan M. Espin, Juan E. Tapia
Main category: cs.CV
TL;DR: A hybrid deep-learning framework combining frequency-domain features with RGB inputs improves detection of deepfake and face swap manipulations.
Details
Motivation: Addressing the limitations of conventional methods in detecting sophisticated facial manipulations in digital security.
Method: Proposes a hybrid approach using handcrafted frequency-domain features (e.g., SRM, DCT, ELA, SVD, DFT) alongside RGB inputs to exploit manipulation artifacts.
Result: The framework provides richer, more discriminative information for detecting manipulated images.
Conclusion: The hybrid approach enhances detection accuracy by leveraging both frequency and spatial domain artifacts.
Abstract: The rapid advancement of deepfake and face swap technologies has raised significant concerns in digital security, particularly in identity verification and onboarding processes. Conventional detection methods often struggle to generalize against sophisticated facial manipulations. This study proposes an enhanced deep-learning detection framework that combines handcrafted frequency-domain features with conventional RGB inputs. This hybrid approach exploits frequency and spatial domain artifacts introduced during image manipulation, providing richer and more discriminative information to the classifier. Several handcrafted frequency features were evaluated, including the Steganalysis Rich Model, Discrete Cosine Transform, Error Level Analysis, Singular Value Decomposition, and Discrete Fourier Transform.
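Two of these handcrafted channels are simple enough to sketch; the parameter choices below are mine, not the paper's:

```python
import io
import numpy as np
from PIL import Image
from scipy.fft import dctn

def dct_feature(gray: np.ndarray) -> np.ndarray:
    """Log-magnitude 2D DCT map (one frequency channel; the SRM, SVD,
    and DFT channels would be built analogously)."""
    return np.log1p(np.abs(dctn(gray.astype(np.float64), norm="ortho")))

def ela_feature(img: Image.Image, quality: int = 90) -> np.ndarray:
    """Error Level Analysis: recompress as JPEG and take the absolute
    difference; manipulated regions tend to leave a distinct error level."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    recompressed = np.asarray(Image.open(buf), dtype=np.int16)
    original = np.asarray(img.convert("RGB"), dtype=np.int16)
    return np.abs(original - recompressed).astype(np.uint8)

# The hybrid input then stacks RGB with such maps channel-wise, e.g.
# np.concatenate([rgb, ela, dct_map[..., None]], axis=-1), before the classifier.
```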
[282] DAMS:Dual-Branch Adaptive Multiscale Spatiotemporal Framework for Video Anomaly Detection
Dezhi An, Wenqiang Liu, Kefan Wang, Zening Chen, Jun Lu, Shengcai Zhang
Main category: cs.CV
TL;DR: The paper introduces DAMS, a dual-path framework for video anomaly detection, combining multiscale spatiotemporal modeling and cross-modal semantic alignment to address challenges like multiscale dependencies and data scarcity.
Details
Motivation: Video anomaly detection is challenging due to multiscale temporal dependencies, visual-semantic heterogeneity, and limited labeled data. The study aims to address these issues by integrating hierarchical feature learning and complementary information.
Method: The DAMS framework uses a dual-path architecture: one path integrates AMTPN (for multiscale temporal features) and CBAM (for attention-based feature enhancement), while the other employs CLIP for cross-modal semantic alignment.
Result: Experiments on UCF-Crime and XD-Violence benchmarks demonstrate the effectiveness of DAMS in detecting and localizing anomalous events.
Conclusion: DAMS successfully combines spatiotemporal and semantic features, providing a robust solution for video anomaly detection.
Abstract: The goal of video anomaly detection is tantamount to performing spatio-temporal localization of abnormal events in the video. The multiscale temporal dependencies, visual-semantic heterogeneity, and the scarcity of labeled data exhibited by video anomalies collectively present a challenging research problem in computer vision. This study offers a dual-path architecture called the Dual-Branch Adaptive Multiscale Spatiotemporal Framework (DAMS), which is based on multilevel feature decoupling and fusion, enabling efficient anomaly detection modeling by integrating hierarchical feature learning and complementary information. The main processing path of this framework integrates the Adaptive Multiscale Time Pyramid Network (AMTPN) with the Convolutional Block Attention Mechanism (CBAM). AMTPN enables multigrained representation and dynamically weighted reconstruction of temporal features through a three-level cascade structure (time pyramid pooling, adaptive feature fusion, and temporal context enhancement). CBAM maximizes the entropy distribution of feature channels and spatial dimensions through dual attention mapping. Simultaneously, the parallel path driven by CLIP introduces a contrastive language-visual pre-training paradigm. Cross-modal semantic alignment and a multiscale instance selection mechanism provide high-order semantic guidance for spatio-temporal features. This creates a complete inference chain from the underlying spatio-temporal features to high-level semantic concepts. The orthogonal complementarity of the two paths and the information fusion mechanism jointly construct a comprehensive representation and identification capability for anomalous events. Extensive experimental results on the UCF-Crime and XD-Violence benchmarks establish the effectiveness of the DAMS framework.
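As a rough sketch of the time-pyramid component inside AMTPN (the scale set and the learned fusion weights are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimePyramidPooling(nn.Module):
    """Pool clip features at several temporal scales, upsample back,
    and fuse with softmax-normalized learned weights."""

    def __init__(self, dim: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.w = nn.Parameter(torch.ones(len(scales)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) per-snippet video features
        B, T, D = x.shape
        levels = []
        for s in self.scales:
            p = F.adaptive_avg_pool1d(x.transpose(1, 2), s)     # (B, D, s)
            up = F.interpolate(p, size=T, mode="linear",
                               align_corners=False)              # (B, D, T)
            levels.append(up.transpose(1, 2))
        w = torch.softmax(self.w, dim=0)
        return sum(wi * li for wi, li in zip(w, levels))
```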
[283] Learning to See Inside Opaque Liquid Containers using Speckle Vibrometry
Matan Kichler, Shai Bagon, Mark Sheinin
Main category: cs.CV
TL;DR: The paper introduces a novel method to infer hidden liquid levels in opaque containers by analyzing surface vibrations, using a speckle-based sensing system and a transformer-based model.
Details
Motivation: Conventional computer vision is limited to visible surfaces, unable to determine hidden attributes like liquid levels in containers. This work aims to overcome this limitation by leveraging surface vibrations.
Method: A speckle-based vibration sensing system captures 2D grid vibrations remotely. A transformer-based model analyzes these vibrations to classify container type and liquid level, invariant to vibration sources.
Result: The method successfully classifies liquid levels in various containers, generalizing to unseen instances and ambient sound sources.
Conclusion: This approach expands computer vision capabilities to infer hidden attributes, offering a non-invasive, scalable solution for liquid level detection.
Abstract: Computer vision seeks to infer a wide range of information about objects and events. However, vision systems based on conventional imaging are limited to extracting information only from the visible surfaces of scene objects. For instance, a vision system can detect and identify a Coke can in the scene, but it cannot determine whether the can is full or empty. In this paper, we aim to expand the scope of computer vision to include the novel task of inferring the hidden liquid levels of opaque containers by sensing the tiny vibrations on their surfaces. Our method provides a first-of-a-kind way to inspect the fill level of multiple sealed containers remotely, at once, without needing physical manipulation and manual weighing. First, we propose a novel speckle-based vibration sensing system for simultaneously capturing scene vibrations on a 2D grid of points. We use our system to efficiently and remotely capture a dataset of vibration responses for a variety of everyday liquid containers. Then, we develop a transformer-based approach for analyzing the captured vibrations and classifying the container type and its hidden liquid level at the time of measurement. Our architecture is invariant to the vibration source, yielding correct liquid level estimates for controlled and ambient scene sound sources. Moreover, our model generalizes to unseen container instances within known classes (e.g., training on five Coke cans of a six-pack, testing on a sixth) and fluid levels. We demonstrate our method by recovering liquid levels from various everyday containers.
[284] Self-Supervised Continuous Colormap Recovery from a 2D Scalar Field Visualization without a Legend
Hongxu Liu, Xinyu Chen, Haoyang Zheng, Manyi Li, Zhenfan Liu, Fumeng Yang, Yunhai Wang, Changhe Tu, Qiong Zeng
Main category: cs.CV
TL;DR: A novel method for recovering colormaps from 2D scalar field visualizations without legends, using decoupling-and-reconstruction with self-supervised optimization.
Details
Motivation: Challenges in recovering colormaps from visualizations without legends, requiring accurate extraction of both colormap and underlying data.
Method: Decoupling module separates colormap and data, followed by differentiable color-mapping reconstruction. Uses reconstruction loss, cubic B-spline curves, and color order loss for smoothness and correctness.
Result: Evaluated on synthetic and real-world datasets (VIS30K), showing effectiveness in colormap recovery. Demonstrated in applications like colormap adjustment and transfer.
Conclusion: Proposed method successfully recovers colormaps, generalizes to visualizations with legends or discrete palettes, and supports practical applications.
Abstract: Recovering a continuous colormap from a single 2D scalar field visualization can be quite challenging, especially in the absence of a corresponding color legend. In this paper, we propose a novel colormap recovery approach that extracts the colormap from a color-encoded 2D scalar field visualization by simultaneously predicting the colormap and underlying data using a decoupling-and-reconstruction strategy. Our approach first separates the input visualization into colormap and data using a decoupling module, then reconstructs the visualization with a differentiable color-mapping module. To guide this process, we design a reconstruction loss between the input and reconstructed visualizations, which serves both as a constraint to ensure strong correlation between colormap and data during training, and as a self-supervised optimizer for fine-tuning the predicted colormap of unseen visualizations during inference. To ensure smoothness and correct color ordering in the extracted colormap, we introduce a compact colormap representation using cubic B-spline curves and an associated color order loss. We evaluate our method quantitatively and qualitatively on a synthetic dataset and a collection of real-world visualizations from the VIS30K dataset. Additionally, we demonstrate its utility in two prototype applications – colormap adjustment and colormap transfer – and explore its generalization to visualizations with color legends and ones encoded using discrete color palettes.
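As a sketch of what a color-order regularizer on the spline control points might look like; using luminance monotonicity here is my simplification of the paper's ordering criterion:

```python
import torch

def color_order_loss(control_points: torch.Tensor) -> torch.Tensor:
    """Encourage the B-spline colormap to progress monotonically
    (here in Rec.601 luminance) so the recovered colormap keeps a
    consistent ordering.

    control_points: (n_ctrl, 3) RGB control points of the cubic B-spline.
    """
    weights = torch.tensor([0.299, 0.587, 0.114])  # Rec.601 luma coefficients
    luma = control_points @ weights                # (n_ctrl,)
    steps = luma[1:] - luma[:-1]
    return torch.relu(-steps).sum()                # penalize luminance decreases
```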
[285] Investigation of Accuracy and Bias in Face Recognition Trained with Synthetic Data
Pavel Korshunov, Ketan Kotwal, Christophe Ecabert, Vidit Vidit, Amir Mohammadi, Sebastien Marcel
Main category: cs.CV
TL;DR: Synthetic data, like FairFaceGen, shows promise for training fairer face recognition models, though it lags in generalization compared to real data.
Details
Motivation: To evaluate if synthetic data can achieve both high accuracy and fairness in face recognition systems.
Method: Generated balanced synthetic datasets (FairFaceGen) using Flux.1-dev and Stable Diffusion v3.5, combined with identity augmentation methods, and compared performance and bias against real datasets.
Result: Synthetic data, especially from SD35, aids bias mitigation but underperforms in generalization on challenging benchmarks. Intra-class augmentation quality impacts accuracy and fairness.
Conclusion: Synthetic data offers practical guidelines for building fairer face recognition systems, though real data still outperforms in generalization.
Abstract: Synthetic data has emerged as a promising alternative for training face recognition (FR) models, offering advantages in scalability, privacy compliance, and potential for bias mitigation. However, critical questions remain on whether both high accuracy and fairness can be achieved with synthetic data. In this work, we evaluate the impact of synthetic data on bias and performance of FR systems. We generate a balanced face dataset, FairFaceGen, using two state-of-the-art text-to-image generators, Flux.1-dev and Stable Diffusion v3.5 (SD35), and combine them with several identity augmentation methods, including Arc2Face and four IP-Adapters. By maintaining equal identity count across synthetic and real datasets, we ensure fair comparisons when evaluating FR performance on standard (LFW, AgeDB-30, etc.) and challenging IJB-B/C benchmarks and FR bias on the Racial Faces in-the-Wild (RFW) dataset. Our results demonstrate that although synthetic data still lags behind the real datasets in the generalization on IJB-B/C, demographically balanced synthetic datasets, especially those generated with SD35, show potential for bias mitigation. We also observe that the number and quality of intra-class augmentations significantly affect FR accuracy and fairness. These findings provide practical guidelines for constructing fairer FR systems using synthetic data.
[286] Lightweight Transformer-Driven Segmentation of Hotspots and Snail Trails in Solar PV Thermal Imagery
Deepak Joshi, Mayukha Pal
Main category: cs.CV
TL;DR: A supervised deep learning framework using SegFormer for segmenting thermal infrared images of PV panels outperforms baseline models in accuracy and efficiency, enabling real-time defect detection.
Details
Motivation: Accurate defect detection in photovoltaic modules is crucial for energy efficiency and system reliability.
Method: The method involves preprocessing (resizing, CLAHE, denoising, normalization) and a lightweight SegFormer model with a custom encoder-decoder, fine-tuned on annotated images.
Result: The SegFormer-based model outperforms U-Net, DeepLabV3, PSPNet, and Mask2Former in segmenting small and irregular defects.
Conclusion: The model is efficient for real-time deployment on edge devices and integrates well with drone-based systems for automated solar farm inspections.
Abstract: Accurate detection of defects such as hotspots and snail trails in photovoltaic modules is essential for maintaining energy efficiency and system reliability. This work presents a supervised deep learning framework for segmenting thermal infrared images of PV panels, using a dataset of 277 aerial thermographic images captured by a Zenmuse XT infrared camera mounted on a DJI Matrice 100 drone. The preprocessing pipeline includes image resizing, CLAHE-based contrast enhancement, denoising, and normalisation. A lightweight semantic segmentation model based on SegFormer is developed, featuring a customised Transformer encoder and streamlined decoder, and fine-tuned on annotated images with manually labeled defect regions. To evaluate performance, we benchmark our model against U-Net, DeepLabV3, PSPNet, and Mask2Former using consistent preprocessing and augmentation. Evaluation metrics include per-class Dice score, F1-score, Cohen's kappa, mean IoU, and pixel accuracy. The SegFormer-based model outperforms baselines in accuracy and efficiency, particularly for segmenting small and irregular defects. Its lightweight design enables real-time deployment on edge devices and seamless integration with drone-based systems for automated inspection of large-scale solar farms.
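The preprocessing pipeline is standard OpenCV; a sketch with illustrative parameter values (the paper's exact settings are not stated here):

```python
import cv2
import numpy as np

def preprocess_thermal(path: str, size=(512, 512)) -> np.ndarray:
    """Resize, CLAHE contrast enhancement, denoising, normalisation,
    as described in the pipeline."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, size, interpolation=cv2.INTER_LINEAR)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    img = clahe.apply(img)                       # local contrast enhancement
    img = cv2.fastNlMeansDenoising(img, h=10)    # non-local means denoising
    return img.astype(np.float32) / 255.0        # normalise to [0, 1]
```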
[287] Automatic camera orientation estimation for a partially calibrated camera above a plane with a line at known planar distance
Gergely Dinya, Anna Gelencsér-Horváth
Main category: cs.CV
TL;DR: Estimates camera roll and pitch using minimal scene info (known intrinsics, fixed height, and a reference line).
Details
Motivation: Addresses scenarios where full camera calibration is impractical, offering a lightweight solution for constrained environments.
Method: Uses inverse projection geometry with a single straight reference line and known planar distance, incorporating lens distortion correction.
Result: Provides roll and pitch angle estimates without full calibration.
Conclusion: A practical, lightweight method for orientation estimation in constrained multi-camera setups.
Abstract: We present a derivation for estimating the roll and pitch orientation of a partially calibrated camera mounted above a planar surface, using minimal scene information. Specifically, we assume known intrinsic parameters and a fixed height between the camera and the observed plane. By detecting a single straight reference line at a known planar distance – such as the edge between a floor and a wall – we estimate the roll and pitch angles via inverse projection geometry. The method leverages geometric constraints and the camera model, including lens distortion correction. This approach is suitable for scenarios where full calibration is impractical and offers a lightweight alternative for multi-camera systems operating in constrained environments.
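My reading of the geometry, as a hedged sketch: the depression angle to a floor line at planar distance d from a camera at height h is atan(h/d), and the pixel row of the line fixes its angle below the optical axis, so the difference gives pitch; roll follows from the line's residual slope after undistortion. Sign conventions are simplified and the roll/pitch coupling is ignored here; the paper derives them jointly:

```python
import numpy as np
import cv2

def estimate_roll_pitch(line_pts, K, dist_coeffs, cam_height, line_dist):
    """line_pts: (N, 2) pixel coordinates along the detected reference line.
    K, dist_coeffs: known intrinsics and lens distortion.
    cam_height: fixed camera height above the plane.
    line_dist: known planar distance to the line."""
    pts = cv2.undistortPoints(
        line_pts.reshape(-1, 1, 2).astype(np.float64), K, dist_coeffs
    ).reshape(-1, 2)  # normalized, distortion-free image coordinates

    # Roll: the line is horizontal in the world, so its residual image
    # slope approximates the roll angle.
    slope = np.polyfit(pts[:, 0], pts[:, 1], 1)[0]
    roll = np.arctan(slope)

    # Pitch: ray to the line sits atan(h/d) below the horizon; its pixel row
    # places it atan(y) below the optical axis; the difference is the tilt.
    y_mid = float(np.median(pts[:, 1]))
    pitch = np.arctan2(cam_height, line_dist) - np.arctan(y_mid)
    return roll, pitch
```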
[288] AIComposer: Any Style and Content Image Composition via Feature Integration
Haowen Li, Zhenfeng Fan, Zhang Wen, Zhengzhou Zhu, Yunjin Li
Main category: cs.CV
TL;DR: A novel cross-domain image composition method without text prompts, using CLIP features and local cross-attention for seamless stylization and improved metrics.
Details
Motivation: Cross-domain image composition is under-explored due to challenges like stochastic diffusion models and style gaps, with heavy reliance on text prompts limiting practicality.
Method: Uses backward inversion and forward denoising without training, integrates CLIP features via MLP, and employs local cross-attention for stable stylization.
Result: Outperforms SOTA with 30.5% LPIPS and 18.1% CSD improvements, validated on a new benchmark dataset.
Conclusion: The method advances cross-domain composition, offering efficiency and robustness without text prompts, with potential for future research and applications.
Abstract: Image composition has advanced significantly with large-scale pre-trained T2I diffusion models. Despite progress in same-domain composition, cross-domain composition remains under-explored. The main challenges are the stochastic nature of diffusion models and the style gap between input images, leading to failures and artifacts. Additionally, heavy reliance on text prompts limits practical applications. This paper presents the first cross-domain image composition method that does not require text prompts, allowing natural stylization and seamless compositions. Our method is efficient and robust, preserving the diffusion prior, as it involves minor steps for backward inversion and forward denoising without training the diffuser. Our method also uses a simple multilayer perceptron network to integrate CLIP features from foreground and background, manipulating diffusion with a local cross-attention strategy. It effectively preserves foreground content while enabling stable stylization without a pre-stylization network. Finally, we create a benchmark dataset with diverse contents and styles for fair evaluation, addressing the lack of testing datasets for cross-domain image composition. Our method outperforms state-of-the-art techniques in both qualitative and quantitative evaluations, significantly improving the LPIPS score by 30.5% and the CSD metric by 18.1%. We believe our method will advance future research and applications. Code and benchmark at https://github.com/sherlhw/AIComposer.
[289] Style-Aware Blending and Prototype-Based Cross-Contrast Consistency for Semi-Supervised Medical Image Segmentation
Chaowei Chen, Xiang Zhang, Honglie Guo, Shunfang Wang
Main category: cs.CV
TL;DR: The paper proposes a style-aware blending and prototype-based cross-contrast consistency learning framework to address deficiencies in weak-strong consistency learning for semi-supervised medical image segmentation.
Details
Motivation: Existing methods overlook inherent limitations like separated training data streams and incomplete supervisory information utilization, leading to confirmation bias and limited exploration of strong-to-weak consistency.
Method: The framework includes a style-guided distribution blending module to unify labeled and unlabeled data streams and a prototype-based cross-contrast strategy to leverage both weak-to-strong and strong-to-weak predictions while reducing noise impact.
Result: The proposed framework outperforms existing methods across multiple medical segmentation benchmarks in semi-supervised settings.
Conclusion: The framework effectively addresses the identified deficiencies, improving performance and robustness in semi-supervised medical image segmentation.
Abstract: Weak-strong consistency learning strategies are widely employed in semi-supervised medical image segmentation to train models by leveraging limited labeled data and enforcing weak-to-strong consistency. However, existing methods primarily focus on designing and combining various perturbation schemes, overlooking the inherent potential and limitations within the framework itself. In this paper, we first identify two critical deficiencies: (1) separated training data streams, which lead to confirmation bias dominated by the labeled stream; and (2) incomplete utilization of supervisory information, which limits exploration of strong-to-weak consistency. To tackle these challenges, we propose a style-aware blending and prototype-based cross-contrast consistency learning framework. Specifically, inspired by the empirical observation that the distribution mismatch between labeled and unlabeled data can be characterized by statistical moments, we design a style-guided distribution blending module to break the independent training data streams. Meanwhile, considering the potential noise in strong pseudo-labels, we introduce a prototype-based cross-contrast strategy to encourage the model to learn informative supervisory signals from both weak-to-strong and strong-to-weak predictions, while mitigating the adverse effects of noise. Experimental results demonstrate the effectiveness and superiority of our framework across multiple medical segmentation benchmarks under various semi-supervised settings.
[290] Implicit Counterfactual Learning for Audio-Visual Segmentation
Mingfeng Zha, Tianyu Li, Guoqing Wang, Peng Wang, Yangyang Wu, Yang Yang, Heng Tao Shen
Main category: cs.CV
TL;DR: The paper proposes an implicit counterfactual framework (ICF) for unbiased audio-visual segmentation, addressing modality discrepancies and imbalances through multi-granularity implicit text (MIT) and semantic counterfactual (SC) learning.
Details
Motivation: Existing AVS methods focus on interaction efficiency but neglect modality representation discrepancies and imbalances, leading to biased cross-modal understanding.
Method: The ICF uses MIT to bridge modality gaps and SC to learn orthogonal representations. CDCL aligns representations through contrastive learning.
Result: The method achieves state-of-the-art performance on three public datasets.
Conclusion: The proposed framework effectively mitigates modality biases and enhances segmentation accuracy in complex scenes.
Abstract: Audio-visual segmentation (AVS) aims to segment objects in videos based on audio cues. Existing AVS methods are primarily designed to enhance interaction efficiency but pay limited attention to modality representation discrepancies and imbalances. To overcome this, we propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding. Due to the lack of semantics, heterogeneous representations may lead to erroneous matches, especially in complex scenes with ambiguous visual content or interference from multiple audio sources. We introduce the multi-granularity implicit text (MIT) involving video-, segment- and frame-level as the bridge to establish the modality-shared space, reducing modality gaps and providing prior guidance. Visual content carries more information and typically dominates, thereby marginalizing audio features in the decision-making. To mitigate knowledge preference, we propose the semantic counterfactual (SC) to learn orthogonal representations in the latent space, generating diverse counterfactual samples, thus avoiding biases introduced by complex functional designs and explicit modifications of text structures or attributes. We further formulate the collaborative distribution-aware contrastive learning (CDCL), incorporating factual-counterfactual and inter-modality contrasts to align representations, promoting cohesion and decoupling. Extensive experiments on three public datasets validate that the proposed method achieves state-of-the-art performance.
[291] Not Only Grey Matter: OmniBrain for Robust Multimodal Classification of Alzheimer’s Disease
Ahmed Sharshar, Yasser Ashraf, Tameem Bakr, Salma Hassan, Hosam Elgendy, Mohammad Yaqub, Mohsen Guizani
Main category: cs.CV
TL;DR: OmniBrain, a multimodal framework, integrates brain MRI, radiomics, gene expression, and clinical data to improve Alzheimer’s diagnosis, achieving high accuracy and explainability.
Details
Motivation: Existing Alzheimer's diagnostic methods lack accuracy, generalization, robustness, and explainability simultaneously, limiting clinical reliability.
Method: OmniBrain uses a unified model with cross-attention and modality dropout to integrate multiple data types.
Result: Achieves 92.2% accuracy on ANMerge and 70.4% on ADNI, outperforming prior methods.
Conclusion: OmniBrain provides a robust, interpretable solution for real-world Alzheimer’s diagnosis.
Abstract: Alzheimer’s disease affects over 55 million people worldwide and is projected to more than double by 2050, necessitating rapid, accurate, and scalable diagnostics. However, existing approaches are limited because they cannot achieve clinically acceptable accuracy, generalization across datasets, robustness to missing modalities, and explainability all at the same time. This inability to satisfy all these requirements simultaneously undermines their reliability in clinical settings. We propose OmniBrain, a multimodal framework that integrates brain MRI, radiomics, gene expression, and clinical data using a unified model with cross-attention and modality dropout. OmniBrain achieves 92.2 ± 2.4% accuracy on the ANMerge dataset and generalizes to the MRI-only ADNI dataset with 70.4 ± 2.7% accuracy, outperforming unimodal and prior multimodal approaches. Explainability analyses highlight neuropathologically relevant brain regions and genes, enhancing clinical trust. OmniBrain offers a robust, interpretable, and practical solution for real-world Alzheimer’s diagnosis.
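The modality-dropout trick is what buys robustness to missing modalities; a minimal sketch, with the drop probability and masking scheme as illustrative assumptions:

```python
import torch
import torch.nn as nn

class ModalityDropout(nn.Module):
    """During training, randomly zero whole modality embeddings
    (e.g. MRI / radiomics / gene expression / clinical) so the fused
    model tolerates missing modalities at test time."""

    def __init__(self, p: float = 0.2):
        super().__init__()
        self.p = p

    def forward(self, modality_embs: list) -> list:
        if not self.training:
            return modality_embs
        keep = torch.rand(len(modality_embs)) > self.p
        if not keep.any():  # never drop every modality at once
            keep[torch.randint(len(modality_embs), (1,))] = True
        return [e if k else torch.zeros_like(e)
                for e, k in zip(modality_embs, keep)]
```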
[292] KASportsFormer: Kinematic Anatomy Enhanced Transformer for 3D Human Pose Estimation on Short Sports Scene Video
Zhuoer Yin, Calvin Yeung, Tomohiro Suzuki, Ryota Tanaka, Keisuke Fujii
Main category: cs.CV
TL;DR: KASportsFormer, a transformer-based 3D pose estimation framework for sports, improves performance in complex sports scenarios by incorporating kinematic anatomy-informed features.
Details
Motivation: Current transformer-based methods struggle with sports scenarios due to complex movements, motion blur, occlusions, and domain shifts, especially in momentary actions like shooting.
Method: Introduces BoneExt and LimbFus modules to extract and fuse kinematic motion information, enhancing pose comprehension in short videos.
Result: Achieves state-of-the-art MPJPE errors of 58.0mm and 34.3mm on SportsPose and WorldPose datasets.
Conclusion: KASportsFormer effectively addresses challenges in sports pose estimation, outperforming existing methods.
Abstract: Recent transformer based approaches have demonstrated impressive performance in solving real-world 3D human pose estimation problems. Albeit these approaches achieve fruitful results on benchmark datasets, they tend to fall short of sports scenarios where human movements are more complicated than daily life actions, as being hindered by motion blur, occlusions, and domain shifts. Moreover, due to the fact that critical motions in a sports game often finish in moments of time (e.g., shooting), the ability to focus on momentary actions is becoming a crucial factor in sports analysis, where current methods appear to struggle with instantaneous scenarios. To overcome these limitations, we introduce KASportsFormer, a novel transformer based 3D pose estimation framework for sports that incorporates a kinematic anatomy-informed feature representation and integration module. In which the inherent kinematic motion information is extracted with the Bone Extractor (BoneExt) and Limb Fuser (LimbFus) modules and encoded in a multimodal manner. This improved the capability of comprehending sports poses in short videos. We evaluate our method through two representative sports scene datasets: SportsPose and WorldPose. Experimental results show that our proposed method achieves state-of-the-art results with MPJPE errors of 58.0mm and 34.3mm, respectively. Our code and models are available at: https://github.com/jw0r1n/KASportsFormer
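The BoneExt idea reduces to differencing joints along the kinematic tree; a minimal sketch (the parent indices below follow a common 17-joint convention and are only illustrative of the paper's topology):

```python
import torch

# Illustrative parent index per joint; joint 0 is the root (pelvis).
PARENTS = [0, 0, 1, 2, 0, 4, 5, 0, 7, 8, 8, 10, 11, 8, 13, 14, 9]

def bone_vectors(joints: torch.Tensor) -> torch.Tensor:
    """joints: (..., num_joints, 2 or 3) joint positions.
    Returns bone vectors, child minus parent (the root bone is zero),
    an anatomy-informed feature the transformer consumes alongside joints."""
    parents = torch.as_tensor(PARENTS, device=joints.device)
    return joints - joints[..., parents, :]
```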
[293] SCORPION: Addressing Scanner-Induced Variability in Histopathology
Jeongun Ryu, Heon Song, Seungeun Lee, Soo Ick Cho, Jiwon Shin, Kyunghyun Paeng, Sérgio Pereira
Main category: cs.CV
TL;DR: The paper introduces SCORPION, a dataset for evaluating model reliability under scanner variability, and SimCons, a framework to improve scanner generalization in computational pathology.
Details
Motivation: Ensuring reliable model performance across diverse scanners is critical for real-world adoption of computational pathology, as scanner differences can affect diagnosis and treatment planning.
Method: The SCORPION dataset includes 480 tissue samples scanned with 5 scanners, providing 2,400 aligned patches. SimCons combines augmentation-based domain generalization with a consistency loss to address scanner variability.
Result: SimCons improves model consistency across scanners without compromising task-specific performance.
Conclusion: The SCORPION dataset and SimCons framework provide resources for evaluating and improving model consistency, setting a new standard for reliability testing in computational pathology.
Abstract: Ensuring reliable model performance across diverse domains is a critical challenge in computational pathology. A particular source of variability in Whole-Slide Images is introduced by differences in digital scanners, thus calling for better scanner generalization. This is critical for the real-world adoption of computational pathology, where the scanning devices may differ per institution or hospital, and the model should not be dependent on scanner-induced details, which can ultimately affect the patient’s diagnosis and treatment planning. However, past efforts have primarily focused on standard domain generalization settings, evaluating on unseen scanners during training, without directly evaluating consistency across scanners for the same tissue. To overcome this limitation, we introduce SCORPION, a new dataset explicitly designed to evaluate model reliability under scanner variability. SCORPION includes 480 tissue samples, each scanned with 5 scanners, yielding 2,400 spatially aligned patches. This scanner-paired design allows for the isolation of scanner-induced variability, enabling a rigorous evaluation of model consistency while controlling for differences in tissue composition. Furthermore, we propose SimCons, a flexible framework that combines augmentation-based domain generalization techniques with a consistency loss to explicitly address scanner generalization. We empirically show that SimCons improves model consistency on varying scanners without compromising task-specific performance. By releasing the SCORPION dataset and proposing SimCons, we provide the research community with a crucial resource for evaluating and improving model consistency across diverse scanners, setting a new standard for reliability testing.
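A hedged sketch of how the SimCons objective could be wired up, assuming `style_aug` is any scanner-style augmentation (stain/color perturbation); the paper's exact consistency distance may differ:

```python
import torch.nn.functional as F

def simcons_loss(model, image, target, style_aug, task_loss_fn, lam=1.0):
    """Supervised task loss plus a consistency term tying predictions on
    two scanner-style views of the same tissue together."""
    view1, view2 = style_aug(image), style_aug(image)  # two random styles
    pred1, pred2 = model(view1), model(view2)
    task = task_loss_fn(pred1, target)
    consistency = F.mse_loss(pred1.softmax(dim=1), pred2.softmax(dim=1))
    return task + lam * consistency
```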
[294] ATR-UMMIM: A Benchmark Dataset for UAV-Based Multimodal Image Registration under Complex Imaging Conditions
Kangcheng Bin, Chen Chen, Ting Hu, Jiahao Qi, Ping Zhong
Main category: cs.CV
TL;DR: The paper introduces ATR-UMMIM, the first benchmark dataset for multimodal image registration in UAV-based applications, addressing the lack of public benchmarks in this area.
Details
Motivation: The absence of a publicly available benchmark for multimodal registration in UAV scenarios limits the development and evaluation of advanced methods.
Method: The dataset includes 7,969 triplets of visible, infrared, and registered visible images with semi-automated annotation for pixel-level ground truth and imaging condition attributes.
Result: ATR-UMMIM provides diverse scenarios, high-quality registration, and object-level annotations for 11 categories, supporting downstream tasks.
Conclusion: ATR-UMMIM is a foundational benchmark for advancing multimodal registration, fusion, and perception in UAV applications.
Abstract: Multimodal fusion has become a key enabler for UAV-based object detection, as each modality provides complementary cues for robust feature extraction. However, due to significant differences in resolution, field of view, and sensing characteristics across modalities, accurate registration is a prerequisite before fusion. Despite its importance, there is currently no publicly available benchmark specifically designed for multimodal registration in UAV-based aerial scenarios, which severely limits the development and evaluation of advanced registration methods under real-world conditions. To bridge this gap, we present ATR-UMMIM, the first benchmark dataset specifically tailored for multimodal image registration in UAV-based applications. This dataset includes 7,969 triplets of raw visible, infrared, and precisely registered visible images, captured across diverse scenarios covering flight altitudes from 80m to 300m, camera angles from 0° to 75°, and all-day, all-year temporal variations under rich weather and illumination conditions. To ensure high registration quality, we design a semi-automated annotation pipeline to introduce reliable pixel-level ground truth to each triplet. In addition, each triplet is annotated with six imaging condition attributes, enabling benchmarking of registration robustness under real-world deployment settings. To further support downstream tasks, we provide object-level annotations on all registered images, covering 11 object categories with 77,753 visible and 78,409 infrared bounding boxes. We believe ATR-UMMIM will serve as a foundational benchmark for advancing multimodal registration, fusion, and perception in real-world UAV scenarios. The dataset can be downloaded from https://github.com/supercpy/ATR-UMMIM
[295] HAMLET-FFD: Hierarchical Adaptive Multi-modal Learning Embeddings Transformation for Face Forgery Detection
Jialei Cui, Jianwei Du, Yanzhe Li, Lei Gao, Hui Jiang, Chenfu Bao
Main category: cs.CV
TL;DR: HAMLET-FFD is a hierarchical adaptive multi-modal learning framework for face forgery detection, leveraging bidirectional cross-modal reasoning to improve cross-domain generalization.
Details
Motivation: Addressing the challenge of cross-domain generalization in face forgery detection, as conventional methods fail to learn domain-invariant representations.
Method: Uses a knowledge refinement loop integrating visual and conceptual cues, inspired by expert forensic analysis. It employs bidirectional fusion of textual authenticity embeddings and hierarchical visual features, while freezing pretrained CLIP parameters.
Result: Demonstrates superior generalization to unseen manipulations across benchmarks, with distinct embeddings specializing in artifact recognition.
Conclusion: HAMLET-FFD effectively enhances authenticity assessment by aligning visual observations with semantic priors, serving as a plug-in without altering CLIP’s original capabilities.
Abstract: The rapid evolution of face manipulation techniques poses a critical challenge for face forgery detection: cross-domain generalization. Conventional methods, which rely on simple classification objectives, often fail to learn domain-invariant representations. We propose HAMLET-FFD, a cognitively inspired Hierarchical Adaptive Multi-modal Learning framework that tackles this challenge via bidirectional cross-modal reasoning. Building on contrastive vision-language models such as CLIP, HAMLET-FFD introduces a knowledge refinement loop that iteratively assesses authenticity by integrating visual evidence with conceptual cues, emulating expert forensic analysis. A key innovation is a bidirectional fusion mechanism in which textual authenticity embeddings guide the aggregation of hierarchical visual features, while modulated visual features refine text embeddings to generate image-adaptive prompts. This closed-loop process progressively aligns visual observations with semantic priors to enhance authenticity assessment. By design, HAMLET-FFD freezes all pretrained parameters, serving as an external plugin that preserves CLIP’s original capabilities. Extensive experiments demonstrate its superior generalization to unseen manipulations across multiple benchmarks, and visual analyses reveal a division of labor among embeddings, with distinct representations specializing in fine-grained artifact recognition.
[296] Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback
Yang Chen, Yufan Shen, Wenxuan Huang, Shen Zhou, Qunshu Lin, Xinyu Cai, Zhi Yu, Botian Shi, Yu Qiao
Main category: cs.CV
TL;DR: RRVF framework reduces reliance on curated image-text supervision for MLLMs by using visual feedback and RL, outperforming existing methods in visual reasoning tasks.
Details
Motivation: Address the bottleneck of MLLMs' heavy reliance on curated image-text supervision for deep visual reasoning.
Method: Introduces RRVF, a framework using the ‘Asymmetry of Verification’ principle, enabling self-correction via reasoning, rendering, and visual feedback, optimized with RL.
Result: RRVF outperforms open-source MLLMs and supervised fine-tuning baselines in image-to-code generation tasks.
Conclusion: Visual feedback-driven systems offer a robust, generalizable path for MLLMs without explicit supervision.
Abstract: Multimodal Large Language Models (MLLMs) have exhibited impressive performance across various visual tasks. Subsequent investigations into enhancing their visual reasoning abilities have significantly expanded their performance envelope. However, a critical bottleneck in the advancement of MLLMs toward deep visual reasoning is their heavy reliance on curated image-text supervision. To solve this problem, we introduce a novel framework termed “Reasoning-Rendering-Visual-Feedback” (RRVF), which enables MLLMs to learn complex visual reasoning from only raw images. This framework builds on the “Asymmetry of Verification” principle to train MLLMs, i.e., verifying the rendered output against a source image is easier than generating it. We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning (RL) training, reducing the reliance on the image-text supervision. Guided by the above principle, RRVF implements a closed-loop iterative process encompassing reasoning, rendering, and visual feedback components, enabling the model to perform self-correction through multi-turn interactions and tool invocation, while this pipeline can be optimized by the GRPO algorithm in an end-to-end manner. Extensive experiments on image-to-code generation for data charts and web interfaces show that RRVF substantially outperforms existing open-source MLLMs and surpasses supervised fine-tuning baselines. Our findings demonstrate that systems driven by purely visual feedback present a viable path toward more robust and generalizable reasoning models without requiring explicit supervision. Code will be available at https://github.com/L-O-I/RRVF.
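The reward design follows directly from the asymmetry argument: rendering the model's output and comparing it to the source image is cheap, so the similarity itself can serve as the RL reward. A sketch, with `render_fn` (e.g. a headless chart/web renderer) and `feature_extractor` as placeholders rather than the paper's components:

```python
import torch
import torch.nn.functional as F

def rrvf_reward(render_fn, generated_code: str, source_image: torch.Tensor,
                feature_extractor) -> float:
    """Render the model's code output and score it against the source image."""
    try:
        rendered = render_fn(generated_code)  # (C, H, W) image tensor
    except Exception:
        return -1.0                           # unrenderable output is penalized
    with torch.no_grad():
        f_src = feature_extractor(source_image.unsqueeze(0))
        f_ren = feature_extractor(rendered.unsqueeze(0))
    return F.cosine_similarity(f_src.flatten(1), f_ren.flatten(1)).item()
```

Per the abstract, a scalar of this kind is what the GRPO policy update consumes across the multi-turn reasoning/rendering loop.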
[297] RingMo-Agent: A Unified Remote Sensing Foundation Model for Multi-Platform and Multi-Modal Reasoning
Huiyang Hu, Peijin Wang, Yingchao Feng, Kaiwen Wei, Wenxin Yin, Wenhui Diao, Mengyu Wang, Hanbo Bi, Kaiyue Kang, Tong Ling, Kun Fu, Xian Sun
Main category: cs.CV
TL;DR: RingMo-Agent is a unified model for multi-modal, multi-platform remote sensing imagery, excelling in perception and reasoning tasks with strong generalizability.
Details
Motivation: Existing methods lack versatility for diverse RS data sources and are limited to basic tasks, failing real-world applicability.
Method: RingMo-Agent uses a large-scale dataset (RS-VL3M), modality-adaptive embeddings, and task-specific tokens for unified task modeling.
Result: The model performs well in visual understanding and analytical tasks, showing cross-platform and cross-modal generalizability.
Conclusion: RingMo-Agent addresses limitations of current RS vision-language models, offering a robust, adaptable framework.
Abstract: Remote sensing (RS) images from multiple modalities and platforms exhibit diverse details due to differences in sensor characteristics and imaging perspectives. Existing vision-language research in RS largely relies on relatively homogeneous data sources. Moreover, they still remain limited to conventional visual perception tasks such as classification or captioning. As a result, these methods fail to serve as a unified and standalone framework capable of effectively handling RS imagery from diverse sources in real-world applications. To address these issues, we propose RingMo-Agent, a model designed to handle multi-modal and multi-platform data that performs perception and reasoning tasks based on user textual instructions. Compared with existing models, RingMo-Agent 1) is supported by a large-scale vision-language dataset named RS-VL3M, comprising over 3 million image-text pairs, spanning optical, SAR, and infrared (IR) modalities collected from both satellite and UAV platforms, covering perception and challenging reasoning tasks; 2) learns modality adaptive representations by incorporating separated embedding layers to construct isolated features for heterogeneous modalities and reduce cross-modal interference; 3) unifies task modeling by introducing task-specific tokens and employing a token-based high-dimensional hidden state decoding mechanism designed for long-horizon spatial tasks. Extensive experiments on various RS vision-language tasks demonstrate that RingMo-Agent not only proves effective in both visual understanding and sophisticated analytical tasks, but also exhibits strong generalizability across different platforms and sensing modalities.
[298] An Efficient Machine Learning Framework for Forest Height Estimation from Multi-Polarimetric Multi-Baseline SAR data
Francesca Razzano, Wenyu Yang, Sergio Vitale, Giampaolo Ferraioli, Silvia Liberata Ullo, Gilda Schirinzi
Main category: cs.CV
TL;DR: FGump, a gradient boosting framework for forest height estimation using multi-channel SAR and LiDAR, balances accuracy and efficiency, outperforming SOTA methods.
Details
Motivation: Accurate forest height estimation is vital for climate and carbon cycle monitoring, with SAR and LiDAR offering key data. Traditional ML/DL methods are resource-heavy, prompting a need for efficient alternatives.
Method: FGump uses gradient boosting with multi-channel SAR and LiDAR GT, avoiding complex preprocessing and large datasets. It employs hand-designed features for efficiency.
Result: FGump achieves higher accuracy and lower computational costs than SOTA AI and classical methods, excelling in regression for fine-grained estimates.
Conclusion: FGump is a robust, efficient solution for forest height estimation, combining SAR and LiDAR with gradient boosting for superior performance.
Abstract: Accurate forest height estimation is crucial for climate change monitoring and carbon cycle assessment. Synthetic Aperture Radar (SAR), particularly in multi-channel configurations, has long supported 3D forest structure reconstruction through model-based techniques. More recently, data-driven approaches using Machine Learning (ML) and Deep Learning (DL) have enabled new opportunities for forest parameter retrieval. This paper introduces FGump, a forest height estimation framework by gradient boosting using multi-channel SAR processing with LiDAR profiles as Ground Truth (GT). Unlike typical ML and DL approaches that require large datasets and complex architectures, FGump ensures a strong balance between accuracy and computational efficiency, using a limited set of hand-designed features and avoiding heavy preprocessing (e.g., calibration and/or quantization). Evaluated under both classification and regression paradigms, the proposed framework demonstrates that the regression formulation enables fine-grained, continuous estimations and avoids quantization artifacts, yielding more precise measurements without rounding. Experimental results confirm that FGump outperforms State-of-the-Art (SOTA) AI-based and classical methods, achieving higher accuracy and significantly lower training and inference times, as demonstrated in our results.
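Since FGump is gradient boosting over hand-designed features, the core fits in a few lines; a sketch with an illustrative scikit-learn configuration (the paper's feature list and hyperparameters are not specified here):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_fgump(sar_features: np.ndarray, lidar_height: np.ndarray):
    """sar_features: (n_pixels, n_features) hand-designed per-pixel features
    from the multi-channel SAR stack (e.g. coherences, phases, intensities);
    lidar_height: (n_pixels,) LiDAR canopy heights used as ground truth."""
    model = GradientBoostingRegressor(n_estimators=300, max_depth=4,
                                      learning_rate=0.05)
    model.fit(sar_features, lidar_height)
    return model

# The regression formulation yields continuous height estimates, avoiding
# the quantization artifacts of a classification setup:
# heights = fit_fgump(X_train, y_train).predict(X_test)
```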
[299] FantasyID: A dataset for detecting digital manipulations of ID-documents
Pavel Korshunov, Amir Mohammadi, Vidit Vidit, Christophe Ecabert, Sébastien Marcel
Main category: cs.CV
TL;DR: A novel dataset, FantasyID, is introduced to aid in detecting forged IDs in KYC applications, challenging current state-of-the-art detection algorithms.
Details
Motivation: The rise of image generation tools enables malicious actors to forge IDs, threatening KYC systems, necessitating robust detection methods.
Method: FantasyID mimics real-world IDs with diverse designs, languages, and real faces, printed and captured for bonafide samples, and includes digitally forged samples.
Result: Current detection algorithms perform poorly on FantasyID, with high false negative rates (close to 50%) at a 10% false positive rate.
Conclusion: FantasyID serves as a complex benchmark for evaluating forgery detection algorithms in realistic KYC scenarios.
Abstract: Advancements in image generation led to the availability of easy-to-use tools for malicious actors to create forged images. These tools pose a serious threat to the widespread Know Your Customer (KYC) applications, requiring robust systems for detection of the forged Identity Documents (IDs). To facilitate the development of the detection algorithms, in this paper, we propose a novel publicly available (including commercial use) dataset, FantasyID, which mimics real-world IDs but without tampering with legal documents and, compared to previous public datasets, it does not contain generated faces or specimen watermarks. FantasyID contains ID cards with diverse design styles, languages, and faces of real people. To simulate a realistic KYC scenario, the cards from FantasyID were printed and captured with three different devices, constituting the bonafide class. We have emulated digital forgery/injection attacks that could be performed by a malicious actor to tamper the IDs using the existing generative tools. The current state-of-the-art forgery detection algorithms, such as TruFor, MMFusion, UniFD, and FatFormer, are challenged by the FantasyID dataset. This is especially evident under evaluation conditions close to practice: with the operational threshold set on the validation set so that the false positive rate is 10%, false negative rates are close to 50% across the board on the test set. The evaluation experiments demonstrate that the FantasyID dataset is complex enough to be used as an evaluation benchmark for detection algorithms.
[300] JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1
Xinhan Di, Kristin Qi, Pengqian Yu
Main category: cs.CV
TL;DR: The paper introduces JWB-DH-V1, a dataset and evaluation framework for joint whole-body motion and speech generation, highlighting gaps in current methods and benchmarks.
Details
Motivation: Current diffusion-based video generation lacks multi-modal consistency and comprehensive evaluation for joint audio-video generation, especially for whole-body avatars.Method: The authors propose JWB-DH-V1, a large-scale dataset with 10,000 identities and 2 million samples, along with an evaluation protocol for joint audio-video generation.
Result: Evaluation of SOTA models shows performance disparities between face/hand-centric and whole-body generation, identifying key research areas.
Conclusion: The dataset and tools address current gaps and provide a benchmark for future research in joint whole-body and speech generation.
Abstract: Recent advances in diffusion-based video generation have enabled photo-realistic short clips, but current methods still struggle to achieve multi-modal consistency when jointly generating whole-body motion and natural speech. Current approaches lack comprehensive evaluation frameworks that assess both visual and audio quality, and there are insufficient benchmarks for region-specific performance analysis. To address these gaps, we introduce the Joint Whole-Body Talking Avatar and Speech Generation Version I (JWB-DH-V1) benchmark, comprising a large-scale multi-modal dataset with 10,000 unique identities across 2 million video samples, and an evaluation protocol for assessing joint audio-video generation of whole-body animatable avatars. Our evaluation of SOTA models reveals consistent performance disparities between face/hand-centric and whole-body generation, which indicates essential areas for future research. The dataset and evaluation tools are publicly available at https://github.com/deepreasonings/WholeBodyBenchmark.
[301] SCANet: Split Coordinate Attention Network for Building Footprint Extraction
Chunshi Wang, Bin Zhao, Shuxue Ding
Main category: cs.CV
TL;DR: The paper introduces Split Coordinate Attention (SCA), a plug-and-play module for building footprint extraction, enhancing feature extraction in CNNs. SCANet, incorporating SCA, outperforms SOTA methods on benchmark datasets.
Details
Motivation: Building footprint extraction is crucial for urban planning and environmental protection, but existing methods face challenges in feature extraction.Method: Proposes SCA, a module using dual pooling kernels and split operations for efficient feature extraction, integrated into a CNN (SCANet).
Result: SCANet achieves top IoU scores (91.61% and 75.49%) on WHU and Massachusetts datasets, surpassing SOTA methods.
Conclusion: SCA and SCANet significantly improve building footprint extraction, offering a robust solution for remote sensing applications.
Abstract: Building footprint extraction holds immense significance in remote sensing image analysis and has great value in urban planning, land use, environmental protection, and disaster assessment. Despite the progress made by conventional and deep learning approaches in this field, they continue to encounter significant challenges. This paper introduces a novel plug-and-play attention module, Split Coordinate Attention (SCA), which captures spatially remote interactions by employing two spatial ranges of pooling kernels, strategically encoding each channel along the x and y planes, and separately performing a series of split operations for each feature group, thus enabling more efficient semantic feature extraction. Inserted into a 2D CNN to form the effective SCANet, SCA allows SCANet to outperform recent SOTA methods on the public Wuhan University (WHU) Building Dataset and the Massachusetts Building Dataset across various metrics. In particular, SCANet achieves the best IoU scores, 91.61% and 75.49%, on the two datasets. Our code is available at https://github.com/AiEson/SCANet.
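A rough PyTorch sketch, in the spirit of coordinate attention, of what such a block might look like: strip pooling along the x and y axes separately, a shared squeeze, then a split back into per-direction attention maps. The exact split/group layout and reduction ratio are our assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class SCABlock(nn.Module):
    """Coordinate-attention-style block with split directional pooling (sketch)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 8)
        self.squeeze = nn.Conv2d(channels, hidden, kernel_size=1)
        self.act = nn.ReLU(inplace=True)
        self.excite_h = nn.Conv2d(hidden, channels, kernel_size=1)
        self.excite_w = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Encode each channel along the y plane (pool over width) and the
        # x plane (pool over height) with strip pooling.
        pool_h = x.mean(dim=3, keepdim=True)                            # (b, c, h, 1)
        pool_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)        # (b, c, w, 1)
        y = self.act(self.squeeze(torch.cat([pool_h, pool_w], dim=2)))
        y_h, y_w = torch.split(y, [h, w], dim=2)                        # split back per direction
        attn_h = torch.sigmoid(self.excite_h(y_h))                      # (b, c, h, 1)
        attn_w = torch.sigmoid(self.excite_w(y_w)).permute(0, 1, 3, 2)  # (b, c, 1, w)
        return x * attn_h * attn_w                                      # plug-and-play reweighting

out = SCABlock(64)(torch.randn(2, 64, 32, 32))  # shape preserved: (2, 64, 32, 32)
```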
[302] Security Tensors as a Cross-Modal Bridge: Extending Text-Aligned Safety to Vision in LVLM
Shen Li, Liuyi Yao, Wujia Niu, Lan Zhang, Yaliang Li
Main category: cs.CV
TL;DR: Security tensors enhance LVLMs’ safety by transferring textual alignment to visual inputs without altering model parameters.
Details
Motivation: Existing safety mechanisms for text-based LLMs don't apply to visual inputs, leaving LVLMs vulnerable to harmful images.Method: Introduce security tensors—trainable input vectors applied during inference—optimized using a curated dataset of malicious, contrastive benign, and general benign samples.
Result: Security tensors significantly improve LVLMs’ ability to reject harmful visual inputs while maintaining performance on benign tasks.
Conclusion: Security tensors successfully extend text-based safety to visual modalities by activating the language module’s safety layers.
Abstract: Large visual-language models (LVLMs) integrate aligned large language models (LLMs) with visual modules to process multimodal inputs. However, the safety mechanisms developed for text-based LLMs do not naturally extend to visual modalities, leaving LVLMs vulnerable to harmful image inputs. To address this cross-modal safety gap, we introduce security tensors: trainable input vectors applied during inference through either the textual or visual modality. These tensors transfer textual safety alignment to visual processing without modifying the model’s parameters. They are optimized using a curated dataset containing (i) malicious image-text pairs requiring rejection, (ii) contrastive benign pairs whose text is structurally similar to malicious queries, included to guide the model’s reliance on visual content, and (iii) general benign samples preserving model functionality. Experimental results demonstrate that both textual and visual security tensors significantly enhance LVLMs’ ability to reject diverse harmful visual inputs while maintaining near-identical performance on benign tasks. Further internal analysis of hidden-layer representations reveals that security tensors successfully activate the language module’s textual “safety layers” on visual inputs, thereby effectively extending text-based safety to the visual modality.
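A minimal sketch of the security-tensor idea under our own assumptions: a small set of trainable vectors is prepended to a frozen model's input embeddings and only those vectors are optimized. The toy model, token count, and initialization scale are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SecurityTensorWrapper(nn.Module):
    def __init__(self, model: nn.Module, embed_dim: int, n_tokens: int = 8):
        super().__init__()
        self.model = model
        for p in self.model.parameters():
            p.requires_grad_(False)  # the model itself stays frozen
        self.security = nn.Parameter(torch.randn(n_tokens, embed_dim) * 0.02)

    def forward(self, embeds):  # embeds: (batch, seq, dim)
        sec = self.security.unsqueeze(0).expand(embeds.size(0), -1, -1)
        return self.model(torch.cat([sec, embeds], dim=1))

# Toy frozen "model" and one optimization step on the security tensors alone.
toy = nn.Sequential(nn.Flatten(1), nn.Linear((8 + 16) * 32, 2))
wrapped = SecurityTensorWrapper(toy, embed_dim=32)
opt = torch.optim.AdamW([wrapped.security], lr=1e-3)
logits = wrapped(torch.randn(4, 16, 32))  # 16 input tokens of dim 32
loss = nn.functional.cross_entropy(logits, torch.tensor([1, 0, 1, 1]))
loss.backward()  # gradients reach only the security tensors
opt.step()
```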
[303] Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting
Alexey Kravets, Da Chen, Vinay P. Namboodiri
Main category: cs.CV
TL;DR: The paper critiques current few-shot evaluation methods for CLIP, introduces an unlearning technique for true inductive baselines, and proposes an improved few-shot classification method with state-of-the-art results.
Details
Motivation: Current few-shot evaluation of CLIP is flawed due to partial transductivity, as datasets are often pre-seen by CLIP. A true inductive evaluation is needed.Method: Proposes a pipeline using unlearning to create true inductive baselines and introduces an improved few-shot classification technique.
Result: Significant performance drop (-55% on average) in the new inductive setting. The proposed method outperforms 13 baselines in 5880 experiments.
Conclusion: The work identifies evaluation flaws in CLIP-based few-shot classification, provides a solution via unlearning, sets new benchmarks, and offers an improved method.
Abstract: CLIP is a foundational model with transferable classification performance in the few-shot setting. Several methods have shown improved performance of CLIP using few-shot examples. However, so far, all these techniques have been benchmarked on standard few-shot datasets. We argue that this mode of evaluation does not provide a true indication of inductive generalization from few-shot examples. As most datasets have been seen by the CLIP model, the resulting setting can be termed partially transductive. To solve this, we propose a pipeline that uses an unlearning technique to obtain true inductive baselines. In this new inductive setting, the methods show a significant drop in performance (-55% on average among 13 baselines with multiple datasets). We validate the unlearning technique using oracle baselines. We also propose an improved few-shot classification technique that consistently obtains state-of-the-art performance over 13 other recent baseline methods in a comprehensive analysis of 5880 experiments, varying the datasets, the number of few-shot examples, the unlearning setting, and the random seeds. Thus, we identify the issue with the evaluation of CLIP-based few-shot classification, provide a solution using unlearning, propose new benchmarks, and present an improved method.
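For context, here is the standard CLIP few-shot protocol that the paper argues is only partially transductive: a linear probe over frozen image features. This is our illustration, not the paper's method; random vectors stand in for real CLIP embeddings.

```python
# Few-shot linear probe over frozen (stand-in) CLIP features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_classes, shots, dim = 10, 16, 512
# Pretend these came from a frozen CLIP image encoder.
support_feats = rng.normal(size=(n_classes * shots, dim)).astype(np.float32)
support_labels = np.repeat(np.arange(n_classes), shots)

probe = LogisticRegression(max_iter=1000).fit(support_feats, support_labels)
query_feats = rng.normal(size=(100, dim)).astype(np.float32)
preds = probe.predict(query_feats)
# The paper's point: if CLIP pretraining already saw the test dataset, this
# accuracy overstates true inductive generalization; their unlearning pipeline
# removes that advantage before evaluating.
```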
[304] METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models
Yuchen Liu, Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian
Main category: cs.CV
TL;DR: METEOR introduces a progressive pruning framework for multi-encoder vision-language models, reducing redundant visual tokens across encoding, fusion, and decoding stages while maintaining performance.
Details
Motivation: Single-encoder architectures like CLIP struggle with generalization across diverse tasks, while multi-encoder methods are computationally expensive. METEOR aims to balance efficiency and performance.Method: METEOR employs a rank-guided collaborative token assignment for encoding, cooperative pruning for fusion, and adaptive token pruning for decoding.
Result: METEOR reduces 76% of visual tokens with only a 0.3% performance drop compared to EAGLE, validated on 11 benchmarks.
Conclusion: METEOR successfully achieves an efficient multi-encoder vision-language model with multi-stage pruning, offering a practical solution for multimodal tasks.
Abstract: Vision encoders serve as the cornerstone of multimodal understanding. Single-encoder architectures like CLIP exhibit inherent constraints in generalizing across diverse multimodal tasks, while recent multi-encoder fusion methods introduce prohibitive computational overhead to achieve superior performance using complementary visual representations from multiple vision encoders. To address this, we propose a progressive pruning framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR), that eliminates redundant visual tokens across the encoding, fusion, and decoding stages for multi-encoder MLLMs. For multi-vision encoding, we discard redundant tokens within each encoder via a rank-guided collaborative token assignment strategy. Subsequently, for multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning. Finally, we propose an adaptive token pruning method in the LLM decoding stage that further discards irrelevant tokens based on the text prompts, dynamically adjusting pruning ratios to specific task demands. To the best of our knowledge, this is the first successful attempt at an efficient multi-encoder vision-language model with multi-stage pruning strategies. Extensive experiments on 11 benchmarks demonstrate the effectiveness of our proposed approach. Compared with EAGLE, a typical multi-encoder MLLM, METEOR reduces visual tokens by 76% with only a 0.3% average performance drop. The code is available at https://github.com/YuchenLiu98/METEOR.
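A hedged sketch of one ingredient: score-based visual-token pruning that keeps the top-k tokens per image. METEOR's rank-guided assignment across encoders is more involved; this only illustrates the basic mechanism, and the token-norm saliency proxy is our choice.

```python
import torch

def prune_tokens(tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """tokens: (batch, n_tokens, dim) -> (batch, k, dim)."""
    scores = tokens.norm(dim=-1)                 # proxy saliency score (assumption)
    k = max(1, int(tokens.size(1) * keep_ratio))
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values  # keep original order
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    return tokens.gather(1, idx)

vis = torch.randn(2, 576, 1024)                  # e.g. ViT patch tokens
pruned = prune_tokens(vis, keep_ratio=0.24)      # ~76% of tokens removed
print(pruned.shape)                              # torch.Size([2, 138, 1024])
```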
[305] $S^3$LAM: Surfel Splatting SLAM for Geometrically Accurate Tracking and Mapping
Ruoyu Fan, Yuhui Wen, Jiajia Dai, Tao Zhang, Long Zeng, Yong-jin Liu
Main category: cs.CV
TL;DR: $S^3$LAM is an RGB-D SLAM system using 2D surfel splatting for efficient and accurate tracking and mapping, outperforming 3D Gaussian-based methods.
Details
Motivation: Existing 3D Gaussian-based SLAM systems are inefficient for scene representation. $S^3$LAM aims to improve accuracy and efficiency by focusing on 2D surfel splatting.Method: The system uses 2D Gaussian surfels for scene representation, introduces adaptive surface rendering for real-time optimization, and derives camera pose Jacobians from 2D splatting.
Result: $S^3$LAM achieves state-of-the-art performance on synthetic and real-world datasets, with high-quality geometry reconstruction.
Conclusion: The proposed method demonstrates superior accuracy and efficiency, with plans to release the code publicly.
Abstract: We propose $S^3$LAM, a novel RGB-D SLAM system that leverages 2D surfel splatting to achieve highly accurate geometric representations for simultaneous tracking and mapping. Unlike existing 3DGS-based SLAM approaches that rely on 3D Gaussian ellipsoids, we utilize 2D Gaussian surfels as primitives for more efficient scene representation. By focusing on the surfaces of objects in the scene, this design enables $S^3$LAM to reconstruct high-quality geometry, benefiting both mapping and tracking. To address inherent SLAM challenges including real-time optimization under limited viewpoints, we introduce a novel adaptive surface rendering strategy that improves mapping accuracy while maintaining computational efficiency. We further derive camera pose Jacobians directly from the 2D surfel splatting formulation, highlighting the importance of our geometrically accurate representation for improving tracking convergence. Extensive experiments on both synthetic and real-world datasets validate that $S^3$LAM achieves state-of-the-art performance. Code will be made publicly available.
[306] Compositional Video Synthesis by Temporal Object-Centric Learning
Adil Kaan Akan, Yucel Yemez
Main category: cs.CV
TL;DR: A novel framework for compositional video synthesis using temporally consistent object-centric representations, outperforming existing methods in quality and coherence.
Details
Motivation: Existing object-centric approaches lack generative capabilities or ignore explicit object-level structure in videos. This work aims to bridge this gap.Method: Leverages pose-invariant object-centric slots and conditions them on pretrained diffusion models for high-quality, coherent video synthesis.
Result: Sets new benchmarks in video generation quality and temporal consistency, with intuitive editing capabilities like object insertion or replacement.
Conclusion: Advances interactive and controllable video generation, enabling new possibilities in content creation and dynamic scene understanding.
Abstract: We present a novel framework for compositional video synthesis that leverages temporally consistent object-centric representations, extending our previous work, SlotAdapt, from images to video. While existing object-centric approaches either lack generative capabilities entirely or treat video sequences holistically, thus neglecting explicit object-level structure, our approach explicitly captures temporal dynamics by learning pose-invariant object-centric slots and conditioning them on pretrained diffusion models. This design enables high-quality, pixel-level video synthesis with superior temporal coherence, and offers intuitive compositional editing capabilities such as object insertion, deletion, or replacement, maintaining consistent object identities across frames. Extensive experiments demonstrate that our method sets new benchmarks in video generation quality and temporal consistency, outperforming previous object-centric generative methods. Although our segmentation performance closely matches that of state-of-the-art methods, our approach uniquely integrates this capability with robust generative performance, significantly advancing interactive and controllable video generation and opening new possibilities for advanced content creation, semantic editing, and dynamic scene understanding.
[307] Ensemble Foreground Management for Unsupervised Object Discovery
Ziling Wu, Armaghan Moemeni, Praminda Caleb-Solly
Main category: cs.CV
TL;DR: UnionCut introduces a robust foreground prior for unsupervised object discovery, addressing challenges in distinguishing foreground/background and determining undiscovered objects. UnionSeg, a distilled version, enhances efficiency and accuracy, improving UOD performance.
Details
Motivation: Existing UOD methods struggle with heuristic foreground priors and fixed discovery iterations, leading to under/over-segmentation. UnionCut and UnionSeg aim to provide a more reliable solution.Method: UnionCut uses min-cut and ensemble methods to detect foreground unions. UnionSeg is a distilled transformer for efficient foreground union detection.
Result: Combining UnionCut/UnionSeg with UOD methods improves performance in single object discovery, saliency detection, and self-supervised instance segmentation.
Conclusion: UnionCut and UnionSeg offer a robust and efficient solution for UOD challenges, enhancing existing methods’ performance.
Abstract: Unsupervised object discovery (UOD) aims to detect and segment objects in 2D images without handcrafted annotations. Recent progress in self-supervised representation learning has led to some success in UOD algorithms. However, the absence of ground truth presents existing UOD methods with two challenges: 1) determining whether a discovered region is foreground or background, and 2) knowing how many objects remain undiscovered. To address these two problems, previous solutions rely on foreground priors to decide whether the discovered region is foreground, and conduct one or a fixed number of discovery iterations. However, existing foreground priors are heuristic and not always robust, and a fixed number of discoveries leads to under- or over-segmentation, since the number of objects in images varies. This paper introduces UnionCut, a robust and well-grounded foreground prior based on min-cut and ensemble methods that detects the union of an image’s foreground areas, allowing UOD algorithms to identify foreground objects and stop discovery once the majority of the foreground union in the image is segmented. In addition, we propose UnionSeg, a transformer distilled from UnionCut that outputs the foreground union more efficiently and accurately. Our experiments show that, combined with UnionCut or UnionSeg, previous state-of-the-art UOD methods see improved performance on single object discovery, saliency detection, and self-supervised instance segmentation across various benchmarks. The code is available at https://github.com/YFaris/UnionCut.
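A sketch of the stopping rule such a foreground-union prior enables, in our own paraphrase: keep discovering objects until most of the predicted foreground union is covered by the discovered masks. The coverage threshold and callback interface are our assumptions.

```python
import numpy as np

def discover_until_covered(union_mask, discover_next, coverage_thresh=0.9):
    """union_mask: (H, W) bool foreground union (e.g. from UnionCut/UnionSeg).
    discover_next: callable mapping the already-covered mask to the next
    (H, W) bool object mask, or None when nothing new is found."""
    covered = np.zeros_like(union_mask)
    masks, total = [], union_mask.sum()
    while total > 0 and (covered & union_mask).sum() / total < coverage_thresh:
        mask = discover_next(covered)
        if mask is None:  # the discovery method found nothing new
            break
        masks.append(mask)
        covered |= mask
    return masks

# Toy run: two "objects" inside a known foreground union.
union = np.zeros((8, 8), bool)
union[1:4, 1:4] = union[5:7, 5:7] = True
objs = iter([union & (np.arange(8)[:, None] < 4), union & (np.arange(8)[:, None] >= 4)])
print(len(discover_until_covered(union, lambda cov: next(objs, None))))  # 2
```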
[308] DriveAgent-R1: Advancing VLM-based Autonomous Driving with Hybrid Thinking and Active Perception
Weicheng Zheng, Xiaofei Mao, Nanfei Ye, Pengxiang Li, Kun Zhan, Xianpeng Lang, Hang Zhao
Main category: cs.CV
TL;DR: DriveAgent-R1 improves autonomous driving by combining Hybrid-Thinking and Active Perception, outperforming leading models like Claude Sonnet 4.
Details
Motivation: Address limitations of VLMs in autonomous driving, such as myopic decision-making and passive perception, to enhance reliability in complex environments.Method: Introduces Hybrid-Thinking (text-based and tool-based reasoning) and Active Perception with a vision toolkit, trained via three-stage progressive reinforcement learning.
Result: Achieves state-of-the-art performance, surpassing proprietary models, with decisions grounded in visual evidence.
Conclusion: DriveAgent-R1 advances safer, more intelligent autonomous systems by balancing efficiency and reliability.
Abstract: Vision-Language Models (VLMs) are advancing autonomous driving, yet their potential is constrained by myopic decision-making and passive perception, limiting reliability in complex environments. We introduce DriveAgent-R1 to tackle these challenges in long-horizon, high-level behavioral decision-making. DriveAgent-R1 features two core innovations: a Hybrid-Thinking framework that adaptively switches between efficient text-based and in-depth tool-based reasoning, and an Active Perception mechanism with a vision toolkit to proactively resolve uncertainties, thereby balancing decision-making efficiency and reliability. The agent is trained using a novel, three-stage progressive reinforcement learning strategy designed to master these hybrid capabilities. Extensive experiments demonstrate that DriveAgent-R1 achieves state-of-the-art performance, outperforming even leading proprietary large multimodal models, such as Claude Sonnet 4. Ablation studies validate our approach and confirm that the agent’s decisions are robustly grounded in actively perceived visual evidence, paving a path toward safer and more intelligent autonomous systems.
[309] Endoscopic Depth Estimation Based on Deep Learning: A Survey
Ke Niu, Zeyun Liu, Xue Feng, Heng Li, Kaize Shi
Main category: cs.CV
TL;DR: A comprehensive survey of deep learning-based endoscopic depth estimation methods, covering data, techniques, applications, and future research directions.
Details
Motivation: To address the lack of a thorough overview of recent deep learning advancements in endoscopic depth estimation, which is crucial for minimally invasive surgery.Method: Systematic review of state-of-the-art literature, categorizing methods by supervision strategies and network architectures, and analyzing datasets and evaluation metrics.
Result: Identifies key challenges, summarizes datasets, and reviews applications in robot-assisted surgery.
Conclusion: Highlights future research directions like domain adaptation and real-time implementation, offering a foundation for further advancements.
Abstract: Endoscopic depth estimation is a critical technology for improving the safety and precision of minimally invasive surgery. It has attracted considerable attention from researchers in medical imaging, computer vision, and robotics. Over the past decade, a large number of methods have been developed. Despite the existence of several related surveys, a comprehensive overview focusing on recent deep learning-based techniques is still limited. This paper endeavors to bridge this gap by systematically reviewing the state-of-the-art literature. Specifically, we provide a thorough survey of the field from three key perspectives: data, methods, and applications, covering a range of methods including both monocular and stereo approaches. We describe common performance evaluation metrics and summarize publicly available datasets. Furthermore, this review analyzes the specific challenges of endoscopic scenes and categorizes representative techniques based on their supervision strategies and network architectures. The application of endoscopic depth estimation in the important area of robot-assisted surgery is also reviewed. Finally, we outline potential directions for future research, such as domain adaptation, real-time implementation, and enhanced model generalization, thereby providing a valuable starting point for researchers to engage with and advance the field.
[310] Event-Based De-Snowing for Autonomous Driving
Manasi Muglikar, Nico Messikommer, Marco Cannici, Davide Scaramuzza
Main category: cs.CV
TL;DR: The paper proposes an event camera-based method for de-snowing images, outperforming traditional methods by 3 dB in PSNR and improving downstream tasks like depth estimation by 20%.
Details
Motivation: Adverse weather, especially heavy snowfall, challenges vision systems. Traditional methods introduce artifacts or require high frame rates, while event cameras offer low-latency, compressed data ideal for de-snowing.Method: Uses event cameras to capture snowflake streaks, designs an attention module to identify occlusions, and recovers original image intensity. Benchmarked on DSEC-Snow dataset.
Result: Outperforms state-of-the-art by 3 dB in PSNR and improves depth estimation and optical flow by 20%.
Conclusion: The method enhances vision system reliability in winter, advancing robust all-weather applications.
Abstract: Adverse weather conditions, particularly heavy snowfall, pose significant challenges to both human drivers and autonomous vehicles. Traditional image-based de-snowing methods often introduce hallucination artifacts as they rely solely on spatial information, while video-based approaches require high frame rates and suffer from alignment artifacts at lower frame rates. Camera parameters, such as exposure time, also influence the appearance of snowflakes, making the problem difficult to solve and heavily dependent on network generalization. In this paper, we propose to address the challenge of de-snowing by using event cameras, which offer compressed visual information with submillisecond latency, making them ideal for de-snowing images, even in the presence of ego-motion. Our method leverages the fact that snowflake occlusions appear with a very distinctive streak signature in the spatio-temporal representation of event data. We design an attention-based module that focuses on events along these streaks to determine when a background point was occluded and use this information to recover its original intensity. We benchmark our method on DSEC-Snow, a new dataset created using a green-screen technique that overlays pre-recorded snowfall data onto the existing DSEC driving dataset, resulting in precise ground truth and synchronized image and event streams. Our approach outperforms state-of-the-art de-snowing methods by 3 dB in PSNR for image reconstruction. Moreover, we show that off-the-shelf computer vision algorithms can be applied to our reconstructions for tasks such as depth estimation and optical flow, achieving a 20% performance improvement over other de-snowing methods. Our work represents a crucial step towards enhancing the reliability and safety of vision systems in challenging winter conditions, paving the way for more robust, all-weather-capable applications.
[311] RIS-LAD: A Benchmark and Model for Referring Low-Altitude Drone Image Segmentation
Kai Ye, YingShi Luan, Zhudi Chen, Guangyue Meng, Pingyang Dai, Liujuan Cao
Main category: cs.CV
TL;DR: The paper introduces RIS-LAD, the first benchmark for Referring Image Segmentation in Low-Altitude Drone scenarios, and proposes SAARN, a Semantic-Aware Adaptive Reasoning Network, to address unique challenges like diverse viewpoints and high object density.
Details
Motivation: Existing RIS datasets and methods are designed for high-altitude, static-view imagery and fail to address the challenges of Low-Altitude Drone (LAD) scenarios, such as diverse viewpoints and high object density.Method: The authors propose SAARN, which decomposes and routes semantic information to different network stages. It includes Category-Dominated Linguistic Enhancement (CDLE) for early encoding and Adaptive Reasoning Fusion Module (ARFM) for dynamic semantic cue selection.
Result: RIS-LAD presents significant challenges to state-of-the-art RIS algorithms, and SAARN demonstrates effectiveness in addressing these challenges.
Conclusion: The paper fills a gap in RIS for LAD scenarios with RIS-LAD and introduces SAARN as a robust solution, with plans to release the dataset and code publicly.
Abstract: Referring Image Segmentation (RIS), which aims to segment specific objects based on natural language descriptions, plays an essential role in vision-language understanding. Despite its progress in remote sensing applications, RIS in Low-Altitude Drone (LAD) scenarios remains underexplored. Existing datasets and methods are typically designed for high-altitude and static-view imagery. They struggle to handle the unique characteristics of LAD views, such as diverse viewpoints and high object density. To fill this gap, we present RIS-LAD, the first fine-grained RIS benchmark tailored for LAD scenarios. This dataset comprises 13,871 carefully annotated image-text-mask triplets collected from realistic drone footage, with a focus on small, cluttered, and multi-viewpoint scenes. It highlights new challenges absent in previous benchmarks, such as category drift caused by tiny objects and object drift under crowded same-class objects. To tackle these issues, we propose the Semantic-Aware Adaptive Reasoning Network (SAARN). Rather than uniformly injecting all linguistic features, SAARN decomposes and routes semantic information to different stages of the network. Specifically, the Category-Dominated Linguistic Enhancement (CDLE) aligns visual features with object categories during early encoding, while the Adaptive Reasoning Fusion Module (ARFM) dynamically selects semantic cues across scales to improve reasoning in complex scenes. The experimental evaluation reveals that RIS-LAD presents substantial challenges to state-of-the-art RIS algorithms, and also demonstrates the effectiveness of our proposed model in addressing these challenges. The dataset and code will be publicly released soon at: https://github.com/AHideoKuzeA/RIS-LAD/.
[312] Exploring text-to-image generation for historical document image retrieval
Melissa Cote, Alexandra Branzan Albu
Main category: cs.CV
TL;DR: The paper introduces T2I-QBE, a method using generative AI to bridge QBE and ABDIR for document image retrieval, validated on historical documents.
Details
Motivation: Overcome the limitation of QBE requiring sample query documents by leveraging ABDIR's attribute-based approach with generative AI.Method: Proposes T2I-QBE, using text-to-image generation (Leonardo.Ai) to create query images from ABDIR-like attributes, then applying QBE for retrieval.
Result: Experiments on HisIR19 dataset confirm T2I-QBE’s viability for historical document image retrieval.
Conclusion: T2I-QBE is a novel and effective approach for DIR, especially for historical documents, marking the first use of T2I generation in this context.
Abstract: Attribute-based document image retrieval (ABDIR) was recently proposed as an alternative to query-by-example (QBE) searches, the dominant document image retrieval (DIR) paradigm. One drawback of QBE searches is that they require sample query documents on hand that may not be available. ABDIR aims to offer users a flexible way to retrieve document images based on memorable visual features of document contents, describing document images with combinations of visual attributes determined via convolutional neural network (CNN)-based binary classifiers. We present an exploratory study of the use of generative AI to bridge the gap between QBE and ABDIR, focusing on historical documents as a use case for their diversity and uniqueness in visual features. We hypothesize that text-to-image (T2I) generation can be leveraged to create query document images using text prompts based on ABDIR-like attributes. We propose T2I-QBE, which uses Leonardo.Ai as the T2I generator with prompts that include a rough description of the desired document type and a list of the desired ABDIR-style attributes. This creates query images that are then used within the traditional QBE paradigm, which compares CNN-extracted query features to those of the document images in the dataset to retrieve the most relevant documents. Experiments on the HisIR19 dataset of historical documents confirm our hypothesis and suggest that T2I-QBE is a viable option for historical document image retrieval. To the authors’ knowledge, this is the first attempt at utilizing T2I generation for DIR.
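A sketch of the QBE retrieval step described above, with our own choice of feature extractor (a torchvision ResNet-18); the paper uses CNN features but Leonardo.Ai-generated query images, which random tensors stand in for here.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# weights=None keeps this offline-runnable; pass ResNet18_Weights.DEFAULT for
# pretrained features in a real retrieval setup.
backbone = models.resnet18(weights=None)
backbone.fc = torch.nn.Identity()  # expose 512-d global features
backbone.eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:  # (n, 3, 224, 224)
    return F.normalize(backbone(images), dim=-1)

query = torch.randn(1, 3, 224, 224)    # stand-in for a T2I-generated query image
corpus = torch.randn(50, 3, 224, 224)  # stand-in historical document images
sims = embed(query) @ embed(corpus).T  # cosine similarities
top5 = sims.topk(5).indices.squeeze(0) # indices of the most relevant documents
```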
[313] ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
Yuying Ge, Yixiao Ge, Chen Li, Teng Wang, Junfu Pu, Yizhuo Li, Lu Qiu, Jin Ma, Lisheng Duan, Xinyu Zuo, Jinwen Luo, Weibo Gu, Zexuan Li, Xiaojing Zhang, Yangyu Tao, Han Hu, Di Wang, Ying Shan
Main category: cs.CV
TL;DR: ARC-Hunyuan-Video is a 7B-parameter multimodal model for structured video comprehension, excelling in tasks like captioning, summarization, and reasoning, with proven real-world impact.
Details
Motivation: Current models lack detailed video comprehension for real-world shorts, which are complex and fast-paced, requiring advanced multimodal reasoning.Method: The model processes visual, audio, and textual inputs end-to-end, trained via pre-training, fine-tuning, RL, and stress-tested for efficiency.
Result: Demonstrates strong performance on ShortVid-Bench and improves user engagement in production, with fast inference times.
Conclusion: ARC-Hunyuan-Video effectively addresses real-world video comprehension challenges and supports diverse applications efficiently.
Abstract: Real-world user-generated short videos, especially those distributed on platforms such as WeChat Channel and TikTok, dominate the mobile internet. However, current large multimodal models lack essential temporally-structured, detailed, and in-depth video comprehension capabilities, which are the cornerstone of effective video search and recommendation, as well as emerging video applications. Understanding real-world shorts is actually challenging due to their complex visual elements, high information density in both visuals and audio, and fast pacing that focuses on emotional expression and viewpoint delivery. This requires advanced reasoning to effectively integrate multimodal information, including visual, audio, and text. In this work, we introduce ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual signals from raw video inputs end-to-end for structured comprehension. The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning. Leveraging high-quality data from an automated annotation pipeline, our compact 7B-parameter model is trained through a comprehensive regimen: pre-training, instruction fine-tuning, cold start, reinforcement learning (RL) post-training, and final instruction fine-tuning. Quantitative evaluations on our introduced benchmark ShortVid-Bench and qualitative comparisons demonstrate its strong performance in real-world video comprehension, and it supports zero-shot or fine-tuning with a few samples for diverse downstream applications. The real-world production deployment of our model has yielded tangible and measurable improvements in user engagement and satisfaction, a success supported by its remarkable efficiency, with stress tests indicating an inference time of just 10 seconds for a one-minute video on H20 GPU.
[314] Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation
Dogucan Yaman, Fevziye Irem Eyiokur, Leonard Bärmann, Hazım Kemal Ekenel, Alexander Waibel
Main category: cs.CV
TL;DR: A mask-free approach for audio-driven talking face generation avoids information loss and identity reference issues by transforming input images to closed mouths instead of masking.
Details
Motivation: Current masking strategies in audio-driven talking face generation cause information loss, identity reference variation, and unintended copying, degrading performance.Method: Proposes a two-step landmark-based approach to transform input images to closed mouths without masking, then uses a lip adaptation model with audio for lip movements.
Result: Validated on LRS2 and HDTF datasets, the method avoids masked inputs and identity references while maintaining quality.
Conclusion: The mask-free approach effectively addresses limitations of masking strategies, improving audio-lip synchronization and identity preservation.
Abstract: Audio-Driven Talking Face Generation aims at generating realistic videos of talking faces, focusing on accurate audio-lip synchronization without deteriorating any identity-related visual details. Recent state-of-the-art methods are based on inpainting, meaning that the lower half of the input face is masked, and the model fills the masked region by generating lips aligned with the given audio. Hence, to preserve identity-related visual details from the lower half, these approaches additionally require an unmasked identity reference image randomly selected from the same video. However, this common masking strategy suffers from (1) information loss in the input faces, significantly affecting the networks’ ability to preserve visual quality and identity details, (2) variation between identity reference and input image degrading reconstruction performance, and (3) the identity reference negatively impacting the model, causing unintended copying of elements unaligned with the audio. To address these issues, we propose a mask-free talking face generation approach while maintaining the 2D-based face editing task. Instead of masking the lower half, we transform the input images to have closed mouths, using a two-step landmark-based approach trained in an unpaired manner. Subsequently, we provide these edited but unmasked faces to a lip adaptation model alongside the audio to generate appropriate lip movements. Thus, our approach needs neither masked input images nor identity reference images. We conduct experiments on the benchmark LRS2 and HDTF datasets and perform various ablation studies to validate our contributions.
[315] GTAD: Global Temporal Aggregation Denoising Learning for 3D Semantic Occupancy Prediction
Tianhao Li, Yang Li, Mengtian Li, Yisheng Deng, Weifeng Ge
Main category: cs.CV
TL;DR: The paper proposes GTAD, a global temporal aggregation denoising network, to improve 3D scene understanding by effectively leveraging both local and global temporal information from sequences.
Details
Motivation: Existing methods for dynamic environment perception in autonomous systems inadequately use temporal information, focusing only on local interactions between adjacent frames and missing global sequence insights.Method: GTAD introduces a framework that aggregates local temporal features from the current moment and global temporal features from historical sequences using an in-model latent denoising network.
Result: Experiments on nuScenes and Occ3D-nuScenes benchmarks show GTAD’s superiority, providing coherent and comprehensive environment understanding.
Conclusion: GTAD effectively addresses the limitation of existing methods by integrating global temporal information, enhancing 3D scene perception for autonomous systems.
Abstract: Accurately perceiving dynamic environments is a fundamental task for autonomous driving and robotic systems. Existing methods inadequately utilize temporal information, relying mainly on local temporal interactions between adjacent frames and failing to leverage global sequence information effectively. To address this limitation, we investigate how to effectively aggregate global temporal features from temporal sequences, aiming to achieve occupancy representations that efficiently utilize global temporal information from historical observations. For this purpose, we propose a global temporal aggregation denoising network named GTAD, introducing a global temporal information aggregation framework as a new paradigm for holistic 3D scene understanding. Our method employs an in-model latent denoising network to aggregate local temporal features from the current moment and global temporal features from historical sequences. This approach enables the effective perception of both fine-grained temporal information from adjacent frames and global temporal patterns from historical observations. As a result, it provides a more coherent and comprehensive understanding of the environment. Extensive experiments on the nuScenes and Occ3D-nuScenes benchmarks and ablation studies demonstrate the superiority of our method.
[316] Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision
Xiao Fang, Minhyek Jeon, Zheyang Qin, Stanislav Panev, Celso de Melo, Shuowen Hu, Shayok Chakraborty, Fernando De la Torre
Main category: cs.CV
TL;DR: Proposes a generative AI method using latent diffusion models (LDMs) to synthesize aerial images and labels, improving vehicle detection across domains.
Details
Motivation: Addresses the challenge of domain shifts in vehicle detection from aerial imagery due to geographic variability.Method: Multi-stage, multi-modal knowledge transfer framework with fine-tuned LDMs for data augmentation.
Result: Achieves 4-23% AP50 improvement over baselines, introduces two new datasets.
Conclusion: Demonstrates effectiveness of generative AI for domain adaptation in aerial imagery.
Abstract: Detecting vehicles in aerial imagery is a critical task with applications in traffic monitoring, urban planning, and defense intelligence. Deep learning methods have provided state-of-the-art (SOTA) results for this application. However, a significant challenge arises when models trained on data from one geographic region fail to generalize effectively to other areas. Variability in factors such as environmental conditions, urban layouts, road networks, vehicle types, and image acquisition parameters (e.g., resolution, lighting, and angle) leads to domain shifts that degrade model performance. This paper proposes a novel method that uses generative AI to synthesize high-quality aerial images and their labels, improving detector training through data augmentation. Our key contribution is the development of a multi-stage, multi-modal knowledge transfer framework utilizing fine-tuned latent diffusion models (LDMs) to mitigate the distribution gap between the source and target environments. Extensive experiments across diverse aerial imagery domains show consistent performance improvements in AP50 over supervised learning on source domain data, weakly supervised adaptation methods, unsupervised domain adaptation methods, and open-set object detectors by 4-23%, 6-10%, 7-40%, and more than 50%, respectively. Furthermore, we introduce two newly annotated aerial datasets from New Zealand and Utah to support further research in this field. Project page is available at: https://humansensinglab.github.io/AGenDA
[317] LargeMvC-Net: Anchor-based Deep Unfolding Network for Large-scale Multi-view Clustering
Shide Du, Chunming Wu, Zihan Fang, Wendi Zhao, Yilin Wu, Changwei Wang, Shiping Wang
Main category: cs.CV
TL;DR: LargeMvC-Net improves anchor-based multi-view clustering by integrating optimization principles into a deep network, outperforming existing methods.
Details
Motivation: Existing anchor-based clustering methods lack optimization-aware designs, limiting their effectiveness.Method: LargeMvC-Net unfolds the optimization problem into three modules (RepresentModule, NoiseModule, AnchorModule) and uses an unsupervised reconstruction loss.
Result: Outperforms state-of-the-art methods in effectiveness and scalability on large-scale benchmarks.
Conclusion: LargeMvC-Net provides a principled, optimization-aware framework for scalable multi-view clustering.
Abstract: Deep anchor-based multi-view clustering methods enhance the scalability of neural networks by utilizing representative anchors to reduce the computational complexity of large-scale clustering. Despite their scalability advantages, existing approaches often incorporate anchor structures in a heuristic or task-agnostic manner, either through post-hoc graph construction or as auxiliary components for message passing. Such designs overlook the core structural demands of anchor-based clustering, neglecting key optimization principles. To bridge this gap, we revisit the underlying optimization problem of large-scale anchor-based multi-view clustering and unfold its iterative solution into a novel deep network architecture, termed LargeMvC-Net. The proposed model decomposes the anchor-based clustering process into three modules: RepresentModule, NoiseModule, and AnchorModule, corresponding to representation learning, noise suppression, and anchor indicator estimation. Each module is derived by unfolding a step of the original optimization procedure into a dedicated network component, providing structural clarity and optimization traceability. In addition, an unsupervised reconstruction loss aligns each view with the anchor-induced latent space, encouraging consistent clustering structures across views. Extensive experiments on several large-scale multi-view benchmarks show that LargeMvC-Net consistently outperforms state-of-the-art methods in terms of both effectiveness and scalability.
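A generic deep-unfolding sketch illustrating the design principle, not the paper's exact modules: each iteration of an optimization procedure becomes one network stage with learnable parameters. The three conceptual sub-steps below loosely mirror representation, noise suppression, and anchor-indicator estimation; all dimensions and operators are our assumptions.

```python
import torch
import torch.nn as nn

class UnfoldedStage(nn.Module):
    def __init__(self, dim: int, n_anchors: int):
        super().__init__()
        self.represent = nn.Linear(dim, dim)               # RepresentModule stand-in
        self.threshold = nn.Parameter(torch.tensor(0.05))  # NoiseModule stand-in
        self.anchor = nn.Linear(dim, n_anchors)            # AnchorModule stand-in

    def forward(self, z):
        z = torch.relu(self.represent(z))                  # representation update
        z = torch.sign(z) * torch.clamp(z.abs() - self.threshold, min=0)  # soft-threshold denoise
        return z, torch.softmax(self.anchor(z), dim=-1)    # anchor indicators

stages = nn.ModuleList([UnfoldedStage(64, 10) for _ in range(3)])  # 3 unfolded iterations
z = torch.randn(32, 64)
for stage in stages:
    z, indicators = stage(z)  # indicators: (32, 10) soft assignments to anchors
```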
[318] Improving Adversarial Robustness Through Adaptive Learning-Driven Multi-Teacher Knowledge Distillation
Hayat Ullah, Syed Muhammad Talha Zaidi, Arslan Munir
Main category: cs.CV
TL;DR: The paper proposes a multi-teacher adversarial robustness distillation method with adaptive learning to enhance CNN robustness against adversarial attacks without adversarial data exposure.
Details
Motivation: CNNs are vulnerable to adversarial attacks, and existing adversarial training methods still leave a gap between accuracy and robustness.Method: Train multiple adversarially robust teacher models, then distill their knowledge into a student model using adaptive learning weights based on prediction precision.
Result: The method improves adversarial robustness on MNIST-Digits and Fashion-MNIST datasets across various attacks.
Conclusion: Multi-teacher adversarial distillation with adaptive learning effectively enhances CNN robustness against adversarial attacks.
Abstract: Convolutional neural networks (CNNs) excel in computer vision but are susceptible to adversarial attacks, crafted perturbations designed to mislead predictions. Despite advances in adversarial training, a gap persists between model accuracy and robustness. To mitigate this issue, in this paper, we present a multi-teacher adversarial robustness distillation using an adaptive learning strategy. Specifically, our proposed method first trains multiple clones of a baseline CNN model using an adversarial training strategy on a pool of perturbed data acquired through different adversarial attacks. Once trained, these adversarially trained models are used as teacher models to supervise the learning of a student model on clean data using multi-teacher knowledge distillation. To ensure effective robustness distillation, we design an adaptive learning strategy that controls the knowledge contribution of each model by assigning weights according to their prediction precision. Distilling knowledge from adversarially pre-trained teacher models not only enhances the learning capabilities of the student model but also empowers it with the capacity to withstand different adversarial attacks, despite having no exposure to adversarial data. To verify our claims, we extensively evaluated our proposed method on the MNIST-Digits and Fashion-MNIST datasets across diverse experimental settings. The obtained results exhibit the efficacy of our multi-teacher adversarial distillation and adaptive learning strategy, enhancing CNNs’ adversarial robustness against various adversarial attacks.
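A sketch of the distillation objective as we read it: each adversarially trained teacher contributes a KL term weighted by its prediction precision. The weighting scheme here (a softmax over per-batch teacher accuracy) and the temperature are our guesses, not the paper's formula.

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels, T=4.0):
    # Weight each teacher by how accurate it is on the current batch.
    accs = torch.stack([(t.argmax(-1) == labels).float().mean()
                        for t in teacher_logits_list])
    weights = torch.softmax(accs / 0.1, dim=0)      # sharp softmax favors precise teachers
    loss = F.cross_entropy(student_logits, labels)  # supervised term on clean data
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    for w, t_logits in zip(weights, teacher_logits_list):
        p_t = F.softmax(t_logits / T, dim=-1)
        loss = loss + w * F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T
    return loss

labels = torch.randint(0, 10, (8,))
student = torch.randn(8, 10, requires_grad=True)
teachers = [torch.randn(8, 10) for _ in range(3)]  # stand-ins for robust teachers
multi_teacher_kd_loss(student, teachers, labels).backward()
```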
[319] Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions
Licai Sun, Xingxun Jiang, Haoyu Chen, Yante Li, Zheng Lian, Biu Liu, Yuan Zong, Wenming Zheng, Jukka M. Leppänen, Guoying Zhao
Main category: cs.CV
TL;DR: The paper introduces EmoCap100K, a large-scale facial emotion caption dataset, and EmoCapCLIP, a framework for learning facial emotion representations from rich natural language captions, outperforming traditional methods.
Details
Motivation: Current facial emotion recognition systems oversimplify emotions into predefined categories or scales, limiting generalization. Natural language offers richer supervision but lacks large-scale datasets and effective frameworks.Method: The authors propose EmoCapCLIP, a joint global-local contrastive learning framework with cross-modal guided positive mining, leveraging the EmoCap100K dataset.
Result: Extensive evaluations on 20 benchmarks across five tasks show superior performance, demonstrating the effectiveness of learning from rich captions.
Conclusion: The work highlights the potential of using large-scale semantically rich captions for facial emotion representation learning, with code and data made publicly available.
Abstract: Current facial emotion recognition systems are predominantly trained to predict a fixed set of predefined categories or abstract dimensional values. This constrained form of supervision hinders generalization and applicability, as it reduces the rich and nuanced spectrum of emotions into oversimplified labels or scales. In contrast, natural language provides a more flexible, expressive, and interpretable way to represent emotions, offering a much broader source of supervision. Yet, leveraging semantically rich natural language captions as supervisory signals for facial emotion representation learning remains relatively underexplored, primarily due to two key challenges: (1) the lack of large-scale caption datasets with rich emotional semantics, and (2) the absence of effective frameworks tailored to harness such rich supervision. To this end, we introduce EmoCap100K, a large-scale facial emotion caption dataset comprising over 100,000 samples, featuring rich and structured semantic descriptions that capture both global affective states and fine-grained local facial behaviors. Building upon this dataset, we further propose EmoCapCLIP, which incorporates a joint global-local contrastive learning framework enhanced by a cross-modal guided positive mining module. This design facilitates the comprehensive exploitation of multi-level caption information while accommodating semantic similarities between closely related expressions. Extensive evaluations on over 20 benchmarks covering five tasks demonstrate the superior performance of our method, highlighting the promise of learning facial emotion representations from large-scale semantically rich captions. The code and data will be available at https://github.com/sunlicai/EmoCapCLIP.
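A bare-bones sketch of the contrastive objective family this builds on: a symmetric InfoNCE between face and caption embeddings. The paper's global-local structure and positive-mining module are not reproduced here; the temperature and dimensions are placeholders.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_feats, txt_feats, temperature=0.07):
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(img.size(0))  # matched pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = clip_style_loss(torch.randn(16, 256), torch.randn(16, 256))
```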
[320] Deep Learning for Skeleton Based Human Motion Rehabilitation Assessment: A Benchmark
Ali Ismail-Fawaz, Maxime Devanne, Stefano Berretti, Jonathan Weber, Germain Forestier
Main category: cs.CV
TL;DR: The paper introduces Rehab-Pile, a unified archive of rehabilitation datasets, and a benchmarking framework for deep learning in rehabilitation motion assessment, aiming to standardize research and improve comparability.
Details
Motivation: To address the lack of standardized benchmarks and reproducible methodologies in automated rehabilitation motion assessment, which hinders progress and comparability.Method: Aggregates datasets into Rehab-Pile, proposes a benchmarking framework, and evaluates multiple deep learning architectures for classification and regression tasks.
Result: Extensive benchmarking of architectures is conducted, with datasets and implementations released publicly to support transparency.
Conclusion: The work establishes a foundation for future research in automated rehabilitation assessment, promoting reliable and accessible solutions.
Abstract: Automated assessment of human motion plays a vital role in rehabilitation, enabling objective evaluation of patient performance and progress. Unlike general human activity recognition, rehabilitation motion assessment focuses on analyzing the quality of movement within the same action class, requiring the detection of subtle deviations from ideal motion. Recent advances in deep learning and video-based skeleton extraction have opened new possibilities for accessible, scalable motion assessment using affordable devices such as smartphones or webcams. However, the field lacks standardized benchmarks, consistent evaluation protocols, and reproducible methodologies, limiting progress and comparability across studies. In this work, we address these gaps by (i) aggregating existing rehabilitation datasets into a unified archive called Rehab-Pile, (ii) proposing a general benchmarking framework for evaluating deep learning methods in this domain, and (iii) conducting extensive benchmarking of multiple architectures across classification and regression tasks. All datasets and implementations are released to the community to support transparency and reproducibility. This paper aims to establish a solid foundation for future research in automated rehabilitation assessment and foster the development of reliable, accessible, and personalized rehabilitation solutions. The datasets, source-code and results of this article are all publicly available.
[321] GPT-IMAGE-EDIT-1.5M: A Million-Scale, GPT-Generated Image Dataset
Yuhan Wang, Siwei Yang, Bingchen Zhao, Letian Zhang, Qing Liu, Yuyin Zhou, Cihang Xie
Main category: cs.CV
TL;DR: The paper introduces GPT-IMAGE-EDIT-1.5M, a large-scale open-source dataset for instruction-guided image editing, addressing the limitations of proprietary models like GPT-4o.
Details
Motivation: Proprietary models like GPT-4o hinder open-source research due to their closed nature. The authors aim to bridge this gap by providing a publicly available dataset.Method: The dataset is constructed by unifying and refining three existing datasets (OmniEdit, HQ-Edit, UltraEdit) using GPT-4o, enhancing visual quality and semantic clarity. Advanced open-source models are fine-tuned on this dataset for validation.
Result: Fine-tuned models, such as FluxKontext, achieve competitive performance on benchmarks (e.g., 7.24 on GEdit-EN), surpassing previous open-source methods and narrowing the gap to proprietary models.
Conclusion: The release of GPT-IMAGE-EDIT-1.5M aims to foster open research in instruction-guided image editing by providing a high-quality, accessible resource.
Abstract: Recent advancements in large multimodal models like GPT-4o have set a new standard for high-fidelity, instruction-guided image editing. However, the proprietary nature of these models and their training data creates a significant barrier for open-source research. To bridge this gap, we introduce GPT-IMAGE-EDIT-1.5M, a publicly available, large-scale image-editing corpus containing more than 1.5 million high-quality triplets (instruction, source image, edited image). We systematically construct this dataset by leveraging the versatile capabilities of GPT-4o to unify and refine three popular image-editing datasets: OmniEdit, HQ-Edit, and UltraEdit. Specifically, our methodology involves 1) regenerating output images to enhance visual quality and instruction alignment, and 2) selectively rewriting prompts to improve semantic clarity. To validate the efficacy of our dataset, we fine-tune advanced open-source models on GPT-IMAGE-EDIT-1.5M. The empirical results are exciting, e.g., the fine-tuned FluxKontext achieves highly competitive performance across a comprehensive suite of benchmarks, including 7.24 on GEdit-EN, 3.80 on ImgEdit-Full, and 8.78 on Complex-Edit, showing stronger instruction following and higher perceptual quality while maintaining identity. These scores markedly exceed all previously published open-source methods and substantially narrow the gap to leading proprietary models. We hope the full release of GPT-IMAGE-EDIT-1.5M can help to catalyze further open research in instruction-guided image editing.
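A hypothetical loader sketch for an (instruction, source image, edited image) triplet corpus like the one described; the field names and file layout below are our assumptions, not the released dataset's actual schema.

```python
import json
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class EditTripletDataset(Dataset):
    """Loads (instruction, source, edited) triplets from a JSONL manifest."""
    def __init__(self, root: str, manifest: str = "triplets.jsonl"):  # hypothetical layout
        self.root = Path(root)
        with open(self.root / manifest) as f:
            self.records = [json.loads(line) for line in f]

    def __len__(self):
        return len(self.records)

    def __getitem__(self, i):
        r = self.records[i]
        return {
            "instruction": r["instruction"],
            "source": Image.open(self.root / r["source_image"]).convert("RGB"),
            "edited": Image.open(self.root / r["edited_image"]).convert("RGB"),
        }
```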
[322] Reconstructing 4D Spatial Intelligence: A Survey
Yukang Cao, Jiahao Lu, Zhisheng Huang, Zhuowei Shen, Chengfeng Zhao, Fangzhou Hong, Zhaoxi Chen, Xin Li, Wenping Wang, Yuan Liu, Ziwei Liu
Main category: cs.CV
TL;DR: A survey organizing 4D spatial intelligence reconstruction into five hierarchical levels, addressing gaps in existing literature and highlighting future challenges.
Details
Motivation: The rapid evolution of 3D representations and deep learning has outpaced previous surveys, leaving a gap in analyzing the hierarchical structure of 4D scene reconstruction.Method: Organizes methods into five levels: low-level 3D attributes, 3D scene components, 4D dynamic scenes, interaction modeling, and physical constraints.
Result: A structured framework for understanding 4D spatial intelligence, with identified challenges and future directions.
Conclusion: The survey provides a comprehensive perspective on 4D reconstruction, emphasizing hierarchical levels and future advancements.
Abstract: Reconstructing 4D spatial intelligence from visual observations has long been a central yet challenging task in computer vision, with broad real-world applications. These range from entertainment domains like movies, where the focus is often on reconstructing fundamental visual elements, to embodied AI, which emphasizes interaction modeling and physical realism. Fueled by rapid advances in 3D representations and deep learning architectures, the field has evolved quickly, outpacing the scope of previous surveys. Additionally, existing surveys rarely offer a comprehensive analysis of the hierarchical structure of 4D scene reconstruction. To address this gap, we present a new perspective that organizes existing methods into five progressive levels of 4D spatial intelligence: (1) Level 1 – reconstruction of low-level 3D attributes (e.g., depth, pose, and point maps); (2) Level 2 – reconstruction of 3D scene components (e.g., objects, humans, structures); (3) Level 3 – reconstruction of 4D dynamic scenes; (4) Level 4 – modeling of interactions among scene components; and (5) Level 5 – incorporation of physical laws and constraints. We conclude the survey by discussing the key challenges at each level and highlighting promising directions for advancing toward even richer levels of 4D spatial intelligence. To track ongoing developments, we maintain an up-to-date project page: https://github.com/yukangcao/Awesome-4D-Spatial-Intelligence.
[323] Otter: A Multi-Modal Model with In-Context Instruction Tuning
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Joshua Adrian Cahyono, Jingkang Yang, Ziwei Liu
Main category: cs.CV
TL;DR: The paper introduces Otter, a model leveraging textual and visual in-context examples for instruction tuning, and the MIMIC-IT dataset to enhance multimodal instruction following.
Details
Motivation: Addressing the gap in using both images and text as in-context examples to improve instruction-following capabilities in Large Multimodal Models (LMMs).
Method: Otter, built on Flamingo with Perceiver architecture, is instruction-tuned using the MIMIC-IT dataset, which includes 3M multimodal instruction-response pairs.
Result: Instruction tuning with in-context examples improves model convergence and generalization, excelling in complex video and multi-image tasks.
Conclusion: Otter and MIMIC-IT advance multimodal instruction tuning, demonstrating enhanced capabilities in handling diverse multimodal inputs.
Abstract: Recent advances in Large Multimodal Models (LMMs) have unveiled great potential as visual assistants. However, most existing works focus on responding to individual instructions or using previous dialogues for contextual understanding. There is little discussion on employing both images and text as in-context examples to enhance the instruction following capability. To bridge this gap, we introduce the Otter model to leverage both textual and visual in-context examples for instruction tuning. Specifically, Otter builds upon Flamingo with the Perceiver architecture and has been instruction-tuned as a general-purpose multi-modal assistant. Otter seamlessly processes multi-modal inputs, supporting modalities including text, multiple images, and dynamic video content. To support the training of Otter, we present the MIMIC-IT (MultI-Modal In-Context Instruction Tuning) dataset, which encompasses over 3 million multi-modal instruction-response pairs, including approximately 2.2 million unique instructions across a broad spectrum of images and videos. MIMIC-IT has been carefully curated to feature a diverse array of in-context examples for each entry. Comprehensive evaluations suggest that instruction tuning with these in-context examples substantially enhances model convergence and generalization capabilities. Notably, the extensive scenario coverage provided by the MIMIC-IT dataset empowers the Otter model to excel in tasks involving complex video and multi-image understanding.
[324] Everything is a Video: Unifying Modalities through Next-Frame Prediction
G. Thomas Hudson, Dean Slack, Thomas Winterbottom, Jamie Sterling, Chenghao Xiao, Junjie Shentu, Noura Al Moubayed
Main category: cs.CV
TL;DR: A novel framework reformulates diverse multimodal tasks into a unified next-frame prediction problem, enabling a single model to handle multiple modalities without modality-specific components.
Details
Motivation: Traditional multimodal approaches rely on modality-specific encoders and late fusion, limiting scalability and flexibility for new tasks or modalities.
Method: The proposed method reformulates tasks into a next-frame prediction problem, treating all inputs and outputs as sequential video frames for seamless modality integration.
Result: The framework demonstrates generalization across text, image, video, and audio tasks with minimal adaptation, simplifying multimodal model design.
Conclusion: Task reformulation simplifies multimodal learning and paves the way for generalized foundation models.
Abstract: Multimodal learning, which involves integrating information from various modalities such as text, images, audio, and video, is pivotal for numerous complex tasks like visual question answering, cross-modal retrieval, and caption generation. Traditional approaches rely on modality-specific encoders and late fusion techniques, which can hinder scalability and flexibility when adapting to new tasks or modalities. To address these limitations, we introduce a novel framework that extends the concept of task reformulation beyond natural language processing (NLP) to multimodal learning. We propose to reformulate diverse multimodal tasks into a unified next-frame prediction problem, allowing a single model to handle different modalities without modality-specific components. This method treats all inputs and outputs as sequential frames in a video, enabling seamless integration of modalities and effective knowledge transfer across tasks. Our approach is evaluated on a range of tasks, including text-to-text, image-to-text, video-to-video, video-to-text, and audio-to-text, demonstrating the model’s ability to generalize across modalities with minimal adaptation. We show that task reformulation can significantly simplify multimodal model design across various tasks, laying the groundwork for more generalized multimodal foundation models.
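To make the reformulation concrete, here is a minimal sketch (an illustration only, not the paper's pipeline) of how task inputs and targets could be serialized into a single frame sequence for next-frame supervision; rendering text or audio into image frames is assumed to happen upstream.

```python
# Toy sketch of the next-frame reformulation: task inputs and targets are
# concatenated into one frame sequence, and each training example is
# (all frames so far, the next frame). Frame rendering is assumed upstream.
import numpy as np

def next_frame_pairs(input_frames: list, target_frames: list):
    frames = list(input_frames) + list(target_frames)
    return [(frames[:t], frames[t]) for t in range(1, len(frames))]

# Example: two "input" frames followed by one "target" frame (8x8 grayscale)
inp = [np.zeros((8, 8)), np.ones((8, 8))]
tgt = [np.full((8, 8), 0.5)]
print(len(next_frame_pairs(inp, tgt)))  # 2 next-frame prediction steps
```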
[325] Explainable Synthetic Image Detection through Diffusion Timestep Ensembling
Yixin Wu, Feiran Zhang, Tianyuan Shi, Ruicheng Yin, Zhenghua Wang, Zhenliang Gan, Xiaohua Wang, Changze Lv, Xiaoqing Zheng, Xuanjing Huang
Main category: cs.CV
TL;DR: A novel method detects synthetic images by analyzing DDIM inversion timesteps, achieving high accuracy and introducing benchmarks for evaluation.
Details
Motivation: Address security risks from realistic synthetic images by identifying subtle distinctions between real and fake images.
Method: Utilizes features of intermediately noised images via an ensemble trained on multiple timesteps, avoiding reconstruction-based approaches. Includes explanation generation for human understanding.
Result: Achieves 98.91% and 95.89% accuracy on regular and challenging samples, with benchmarks GenHard and GenExplain introduced.
Conclusion: The method is robust, generalizable, and state-of-the-art, with code and datasets publicly available.
Abstract: Recent advances in diffusion models have enabled the creation of deceptively real images, posing significant security risks when misused. In this study, we empirically show that different timesteps of DDIM inversion reveal varying subtle distinctions between synthetic and real images that are extractable for detection, in forms such as Fourier power spectrum high-frequency discrepancies and inter-pixel variance distributions. Based on these observations, we propose a novel synthetic image detection method that directly utilizes features of intermediately noised images by training an ensemble on multiple noised timesteps, circumventing conventional reconstruction-based strategies. To enhance human comprehension, we introduce a metric-grounded explanation generation and refinement module to identify and explain AI-generated flaws. Additionally, we construct the GenHard and GenExplain benchmarks to provide detection samples of greater difficulty and high-quality rationales for fake images. Extensive experiments show that our method achieves state-of-the-art performance with 98.91% and 95.89% detection accuracy on regular and challenging samples respectively, and demonstrates generalizability and robustness. Our code and datasets are available at https://github.com/Shadowlized/ESIDE.
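As a rough illustration of the timestep-ensemble idea, the sketch below noises an image to several diffusion timesteps and averages per-timestep classifier scores; the classifier architectures, the feature extraction, and the mean fusion rule are assumptions, not the paper's exact design.

```python
# Minimal sketch of timestep-ensemble detection, assuming per-timestep binary
# classifiers and a standard DDPM/DDIM forward-noising marginal; the paper's
# actual features and fusion rule are not reproduced here.
import torch

def noise_image(x0: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Forward-noise an image to timestep t using the cumulative alpha schedule."""
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps

def ensemble_score(x0, classifiers, timesteps, alphas_cumprod):
    """Average real/fake logits of classifiers applied at several timesteps."""
    scores = []
    for t, clf in zip(timesteps, classifiers):
        xt = noise_image(x0, t, alphas_cumprod)
        scores.append(clf(xt))          # each clf returns a fake-probability logit
    return torch.stack(scores).mean(0)  # simple mean fusion (an assumption)
```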
[326] Benchmarking and Analyzing Generative Data for Visual Recognition
Bo Li, Haotian Liu, Liangyu Chen, Yong Jae Lee, Chunyuan Li, Ziwei Liu
Main category: cs.CV
TL;DR: The paper introduces GenBench, a benchmark for evaluating generative data in visual recognition, proposes the CLER metric for assessing generative data efficiency, compares generative vs. retrieved data, and explores external knowledge injection via Textual Inversion.
Details
Motivation: To assess the impact of generative images in visual recognition and address the lack of metrics correlating with downstream performance.
Method: Constructs GenBench (22 datasets, 2548 categories), proposes CLER metric, compares generative vs. retrieved data, and fine-tunes embeddings via Textual Inversion.
Result: Generative data shows promise, with performance improvements in 17 datasets, though challenges remain with low-resolution images.
Conclusion: Generative data holds potential for visual recognition, but further research is needed to address identified challenges.
Abstract: Advancements in large pre-trained generative models have expanded their potential as effective data generators in visual recognition. This work delves into the impact of generative images, primarily comparing paradigms that harness external data (i.e., generative vs. retrieval vs. original). Our key contributions are: 1) GenBench Construction: We devise GenBench, a broad benchmark comprising 22 datasets with 2548 categories, to appraise generative data across various visual recognition tasks. 2) CLER Score: To address the insufficient correlation of existing metrics (e.g., FID, CLIP score) with downstream recognition performance, we propose CLER, a training-free metric indicating generative data’s efficiency for recognition tasks prior to training. 3) New Baselines: Comparisons of generative data with retrieved data from the same external pool help to elucidate the unique traits of generative data. 4) External Knowledge Injection: By fine-tuning special token embeddings for each category via Textual Inversion, performance improves across 17 datasets, except when dealing with low-resolution reference images. Our exhaustive benchmark and analysis spotlight generative data’s promise in visual recognition, while identifying key challenges for future investigation.
[327] Point Cloud Self-supervised Learning via 3D to Multi-view Masked Learner
Zhimin Chen, Xuewei Chen, Xiao Guo, Yingwei Li, Longlong Jing, Liang Yang, Bing Li
Main category: cs.CV
TL;DR: A 3D to Multi-View Learner (Multi-View ML) is proposed to address inefficiencies in multi-modal masked autoencoders by using only 3D inputs, enhancing 3D geometric representation learning.
Details
Motivation: Existing multi-modal MAEs inefficiently require both 2D and 3D inputs and hinder 3D geometric learning by relying on visible 2D information.
Method: The method projects 3D point clouds to multi-view 2D images, uses a 3D to multi-view autoencoder, and incorporates a multi-scale multi-head attention mechanism. A two-stage self-training strategy aligns 2D and 3D representations.
Result: The approach outperforms state-of-the-art methods in 3D classification, part segmentation, and object detection.
Conclusion: The proposed Multi-View ML effectively captures rich spatial information in 3D point clouds without relying on 2D inputs, improving performance in downstream tasks.
Abstract: Recently, multi-modal masked autoencoders (MAE) have been introduced in 3D self-supervised learning, offering enhanced feature learning by leveraging both 2D and 3D data to capture richer cross-modal representations. However, these approaches have two limitations: (1) they inefficiently require both 2D and 3D modalities as inputs, even though the inherent multi-view properties of 3D point clouds already contain the 2D modality; (2) the input 2D modality causes reconstruction learning to rely unnecessarily on visible 2D information, hindering 3D geometric representation learning. To address these challenges, we propose a 3D to Multi-View Learner (Multi-View ML) that only utilizes 3D modalities as inputs and effectively captures rich spatial information in 3D point clouds. Specifically, we first project 3D point clouds to multi-view 2D images at the feature level based on 3D-based pose. Then, we introduce two components: (1) a 3D to multi-view autoencoder that reconstructs point clouds and multi-view images from 3D and projected 2D features; (2) a multi-scale multi-head (MSMH) attention mechanism that facilitates local-global information interactions in each decoder transformer block through attention heads at various scales. Additionally, a novel two-stage self-training strategy is proposed to align 2D and 3D representations. Our method outperforms state-of-the-art counterparts across various downstream tasks, including 3D classification, part segmentation, and object detection.
[328] Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs
Miguel Lopez-Duran, Julian Fierrez, Aythami Morales, Ruben Tolosana, Oscar Delgado-Mohatar, Alvaro Ortigosa
Main category: cs.CV
TL;DR: Benchmarking GNNs for layout classification in digital-born PDFs, introducing two graph structures and multimodal features, with GraphSAGE on k-closest-neighbor graphs achieving top accuracy.
Details
Motivation: Challenges in analyzing document layouts due to heterogeneous elements and imprecise metadata in PDFs.
Method: Evaluated GNN architectures with k-closest-neighbor and fully connected graphs, using text and vision features, tested in single-modality, concatenated, and dual-branch frameworks.
Result: GraphSAGE on k-closest-neighbor graphs in dual-branch configuration outperformed baselines, highlighting local layout and multimodal fusion.
Conclusion: GNNs, especially GraphSAGE, are effective for document layout analysis, leveraging local relationships and multimodal data.
Abstract: The automatic analysis of document layouts in digital-born PDF documents remains a challenging problem due to the heterogeneous arrangement of textual and nontextual elements and the imprecision of the textual metadata in the Portable Document Format. In this work, we benchmark Graph Neural Network (GNN) architectures for the task of fine-grained layout classification of text blocks from digital native documents. We introduce two graph construction structures: a k-closest-neighbor graph and a fully connected graph, and generate node features via pre-trained text and vision models, thus avoiding manual feature engineering. Three experimental frameworks are evaluated: single-modality (text or visual), concatenated multimodal, and dual-branch multimodal. We evaluated four foundational GNN models and compared them with the baseline. Our experiments are specifically conducted on a rich dataset of public affairs documents that includes more than 20 sources (e.g., regional and national-level official gazettes), 37K PDF documents, with 441K pages in total. Our results demonstrate that GraphSAGE operating on the k-closest-neighbor graph in a dual-branch configuration achieves the highest per-class and overall accuracy, outperforming the baseline in some sources. These findings confirm the importance of local layout relationships and multimodal fusion exploited through GNNs for the analysis of native digital document layouts.
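A minimal sketch of the k-closest-neighbor graph construction follows, assuming each text block is reduced to its bounding-box centroid; node feature extraction with pre-trained text/vision models and the GraphSAGE branches are omitted.

```python
# Minimal sketch of k-closest-neighbor graph construction over text blocks,
# assuming each block is represented by its bounding-box centroid.
import numpy as np

def knn_edges(centroids: np.ndarray, k: int = 4) -> list[tuple[int, int]]:
    """Connect each block to its k spatially closest neighbors."""
    dists = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # no self-loops
    edges = []
    for i in range(len(centroids)):
        for j in np.argsort(dists[i])[:k]:
            edges.append((i, int(j)))
    return edges

# Example: centroids of four text blocks on a page (x, y in page units)
blocks = np.array([[50, 100], [50, 160], [300, 100], [300, 400]], dtype=float)
print(knn_edges(blocks, k=2))
```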
[329] Uncertainty-Aware Testing-Time Optimization for 3D Human Pose Estimation
Ti Wang, Mengyuan Liu, Hong Liu, Bin Ren, Yingxuan You, Wenhao Li, Nicu Sebe, Xia Li
Main category: cs.CV
TL;DR: The paper proposes an Uncertainty-Aware testing-time Optimization (UAO) framework to improve 3D human pose estimation by addressing domain gaps and overfitting issues. It combines pre-trained model priors with joint uncertainty to enhance optimization.
Details
Motivation: Data-driven methods for 3D human pose estimation face domain gaps and limited generalization, while optimization-based methods struggle with overall performance and overfitting due to reliance on 2D projection constraints.
Method: The UAO framework uses a 2D-to-3D network to estimate poses and quantify joint uncertainty during training. During testing, it optimizes a latent state while freezing the pre-trained model, using projection loss and joint uncertainty to guide optimization.
Result: The framework achieves superior performance, outperforming the previous best result by 5.5% on Human3.6M and demonstrating effectiveness on MPI-INF-3DHP and 3DPW datasets.
Conclusion: The UAO framework effectively addresses overfitting and domain gaps, leveraging uncertainty to improve 3D pose estimation, with significant performance gains on benchmark datasets.
Abstract: Although data-driven methods have achieved success in 3D human pose estimation, they often suffer from domain gaps and exhibit limited generalization. In contrast, optimization-based methods excel in fine-tuning for specific cases but are generally inferior to data-driven methods in overall performance. We observe that previous optimization-based methods commonly rely on a projection constraint, which only ensures alignment in 2D space, potentially leading to the overfitting problem. To address this, we propose an Uncertainty-Aware testing-time Optimization (UAO) framework, which keeps the prior information of the pre-trained model and alleviates the overfitting problem using the uncertainty of joints. Specifically, during the training phase, we design an effective 2D-to-3D network for estimating the corresponding 3D pose while quantifying the uncertainty of each 3D joint. For optimization during testing, the proposed optimization framework freezes the pre-trained model and optimizes only a latent state. Projection loss is then employed to ensure the generated poses are well aligned in 2D space for high-quality optimization. Furthermore, we utilize the uncertainty of each joint to determine how much each joint is allowed to move during optimization. The effectiveness and superiority of the proposed framework are validated through extensive experiments on challenging datasets: Human3.6M, MPI-INF-3DHP, and 3DPW. Notably, our approach outperforms the previous best result by a large margin of 5.5% on Human3.6M. Code is available at https://github.com/xiu-cs/UAO-Pose3D.
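The test-time loop can be pictured as the hedged sketch below, where `lifter` and `project` are placeholder callables for the frozen 2D-to-3D network and the camera projection, and per-joint uncertainties down-weight unreliable joints in the projection loss.

```python
# Compact sketch of uncertainty-aware test-time optimization: the pre-trained
# lifter is frozen and only a latent state is optimized under a 2D projection
# loss, down-weighted for high-uncertainty joints. Names are placeholders,
# not the paper's API.
import torch

def test_time_optimize(lifter, project, pose2d, z_init, sigma, steps=50, lr=1e-2):
    """pose2d: (J,2) detections; sigma: (J,) per-joint uncertainty from training."""
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    w = 1.0 / (sigma + 1e-6)                 # confident joints move more
    for _ in range(steps):
        opt.zero_grad()
        pose3d = lifter(z)                   # frozen network, trainable latent
        reproj = project(pose3d)             # (J,2) camera projection
        loss = (w[:, None] * (reproj - pose2d) ** 2).mean()
        loss.backward()
        opt.step()
    return lifter(z).detach()
```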
[330] Visual Enumeration Remains Challenging for Multimodal Generative AI
Alberto Testolin, Kuinan Hou, Marco Zorzi
Main category: cs.CV
TL;DR: The paper evaluates AI models’ visual enumeration skills, revealing their limitations compared to humans, even with advanced systems like GPT and DALL-E.
Details
Motivation: To assess and benchmark the number sense and counting abilities of AI models, inspired by cognitive science, as current systems perform poorly in enumeration tasks.
Method: Proposes two benchmark tasks to evaluate visual enumeration in multimodal foundation models, testing popular VQA and image/text generation models.
Result: Advanced AI models fail to accurately enumerate objects or generate images with target numbers, especially outside the subitizing range, with errors varying by object category.
Conclusion: AI lacks intuitive number understanding, and scaling model size alone won’t solve this; the benchmark is released for future evaluation.
Abstract: Many animal species can approximately judge the number of objects in a visual scene at a single glance, and humans can further determine the exact cardinality of a set by deploying systematic counting procedures. In contrast, it has been observed that even state-of-the-art AI systems have very limited enumeration skills. In this work, we propose two benchmark tasks inspired by cognitive science that allow us to precisely evaluate the visual enumeration capabilities of multimodal foundation models, thereby providing an objective measure of their number sense and counting level. We consider popular visual question answering models (BLIP, LLaVA and ViLT) as well as advanced image-to-text (Gemini, GPT and Qwen) and text-to-image (DALL-E, FLUX and Stable Diffusion) AI systems. Our analyses show that even the most advanced models cannot reliably name the number of objects in simple visual stimuli or generate images containing a target number of items, as indexed by their low accuracy in both types of tasks. Especially for numbers outside the subitizing range, their responses are often far from the target numerosity, and, in stark contrast with human behavior, in many cases the distribution of errors depends on the object category. We also observe some striking mistakes with small numbers. Our findings demonstrate that developing an intuitive visual understanding of number remains challenging for AI models and that merely increasing model size might not be a viable strategy to promote the emergence of systematic counting skills. We release the full code of our benchmark to facilitate the evaluation of enumeration skills in future AI systems.
[331] Edge-guided Low-light Image Enhancement with Inertial Bregman Alternating Linearized Minimization
Chaoyan Huang, Zhongming Wu, Tieyong Zeng
Main category: cs.CV
TL;DR: A novel edge-guided Retinex model and inertial Bregman algorithm for low-light image enhancement, proven effective through theory and experiments.
Details
Motivation: Overcoming limitations in prior-based methods for extracting useful information from dim images.
Method: Edge extraction network for fine features, edge-guided Retinex model decomposition, and inertial Bregman algorithm for optimization.
Result: Proven convergence to stationary points; superior performance on real-world datasets.
Conclusion: The proposed method effectively enhances low-light images with theoretical and empirical validation.
Abstract: Prior-based methods for low-light image enhancement often face challenges in extracting available prior information from dim images. To overcome this limitation, we introduce a simple yet effective Retinex model with the proposed edge extraction prior. More specifically, we design an edge extraction network to capture the fine edge features from the low-light image directly. Building upon the Retinex theory, we decompose the low-light image into its illumination and reflectance components and introduce an edge-guided Retinex model for enhancing low-light images. To solve the proposed model, we propose a novel inertial Bregman alternating linearized minimization algorithm. This algorithm addresses the optimization problem associated with the edge-guided Retinex model, enabling effective enhancement of low-light images. Through rigorous theoretical analysis grounded in nonconvex optimization theory, we establish the convergence properties of the algorithm and prove that it converges to a stationary point of the problem. Furthermore, extensive experiments are conducted on multiple real-world low-light image datasets to demonstrate the efficiency and superiority of the proposed scheme.
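For intuition, an edge-guided Retinex objective of this general shape can be written down; the data term follows the standard decomposition I = R ⊙ L, while the specific regularizers and weights below are illustrative assumptions rather than the paper's exact model.

```latex
% Schematic edge-guided Retinex objective (an assumption about its general
% shape): E is the edge map from the edge extraction network, \odot denotes
% elementwise multiplication, and \alpha, \beta are illustrative weights.
\min_{R,\,L}\;\; \tfrac{1}{2}\,\lVert I - R \odot L \rVert_2^2
\;+\; \alpha\,\lVert \nabla L \rVert_2^2
\;+\; \beta\,\lVert \nabla R - E \rVert_1
```

An inertial Bregman scheme would then alternate linearized updates of R and L, each with a momentum-like extrapolation step, which is the setting the convergence analysis covers.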
[332] GLC++: Source-Free Universal Domain Adaptation through Global-Local Clustering and Contrastive Affinity Learning
Sanqing Qu, Tianpei Zou, Florian Röhrbein, Cewu Lu, Guang Chen, Dacheng Tao, Changjun Jiang
Main category: cs.CV
TL;DR: The paper introduces GLC and GLC++ for SF-UniDA, improving classification of ‘known’ data and segregation of ‘unknown’ data, outperforming existing methods in benchmarks.
Details
Motivation: Address sub-optimal performance of deep neural networks under covariate and category shifts, focusing on SF-UniDA to handle both 'known' and 'unknown' data.
Method: Proposes GLC (global and local clustering) and GLC++ (with contrastive affinity learning) to classify 'known' data and segregate 'unknown' data.
Result: GLC and GLC++ outperform GATE by 16.8% and 18.9% in H-score on VisDA, and GLC++ improves novel category clustering accuracy by 4.1% on Office-Home.
Conclusion: GLC++ enhances performance and existing methods, demonstrating effectiveness in SF-UniDA scenarios.
Abstract: Deep neural networks often exhibit sub-optimal performance under covariate and category shifts. Source-Free Domain Adaptation (SFDA) presents a promising solution to this dilemma, yet most SFDA approaches are restricted to closed-set scenarios. In this paper, we explore Source-Free Universal Domain Adaptation (SF-UniDA) aiming to accurately classify “known” data belonging to common categories and segregate them from target-private “unknown” data. We propose a novel Global and Local Clustering (GLC) technique, which comprises an adaptive one-vs-all global clustering algorithm to discern between target classes, complemented by a local k-NN clustering strategy to mitigate negative transfer. Despite the effectiveness, the inherent closed-set source architecture leads to uniform treatment of “unknown” data, impeding the identification of distinct “unknown” categories. To address this, we evolve GLC to GLC++, integrating a contrastive affinity learning strategy. We examine the superiority of GLC and GLC++ across multiple benchmarks and category shift scenarios. Remarkably, in the most challenging open-partial-set scenarios, GLC and GLC++ surpass GATE by 16.8% and 18.9% in H-score on VisDA, respectively. GLC++ enhances the novel category clustering accuracy of GLC by 4.1% in open-set scenarios on Office-Home. Furthermore, the introduced contrastive learning strategy not only enhances GLC but also significantly facilitates existing methodologies. The code is available at https://github.com/ispc-lab/GLC-plus.
[333] VLM-CPL: Consensus Pseudo Labels from Vision-Language Models for Human Annotation-Free Pathological Image Classification
Lanfeng Zhong, Zongyao Huang, Yang Liu, Wenjun Liao, Shichuan Zhang, Guotai Wang, Shaoting Zhang
Main category: cs.CV
TL;DR: A novel method, VLM-CPL, leverages pre-trained Vision-Language Models (VLMs) for pathological image classification without human annotation, using noisy label filtering and semi-supervised learning to outperform existing methods.
Details
Motivation: To reduce reliance on labeled data and human annotation in pathological image classification by utilizing VLMs.
Method: VLM-CPL combines prompt-based and feature-based pseudo-labels with consensus filtering and semi-supervised learning, including an open-set prompting strategy.
Result: Outperforms zero-shot VLM classification and existing noisy label learning methods on five public datasets.
Conclusion: VLM-CPL is effective for human annotation-free pathological image classification, demonstrating superior performance and robustness.
Abstract: Classification of pathological images is the basis for automatic cancer diagnosis. Although deep learning methods have achieved remarkable performance, they heavily rely on labeled data, demanding extensive human annotation efforts. In this study, we present a novel human annotation-free method by leveraging pre-trained Vision-Language Models (VLMs). Without human annotation, pseudo-labels of the training set are obtained by utilizing the zero-shot inference capabilities of VLM, which may contain a lot of noise due to the domain gap between the pre-training and target datasets. To address this issue, we introduce VLM-CPL, a novel approach that contains two noisy label filtering techniques with a semi-supervised learning strategy. Specifically, we first obtain prompt-based pseudo-labels with uncertainty estimation by zero-shot inference with the VLM using multiple augmented views of an input. Then, by leveraging the feature representation ability of VLM, we obtain feature-based pseudo-labels via sample clustering in the feature space. Prompt-feature consensus is introduced to select reliable samples based on the consensus between the two types of pseudo-labels. We further propose High-confidence Cross Supervision to learn from samples with reliable pseudo-labels and the remaining unlabeled samples. Additionally, we present an innovative open-set prompting strategy that filters irrelevant patches from whole slides to enhance the quality of selected patches. Experimental results on five public pathological image datasets for patch-level and slide-level classification showed that our method substantially outperformed zero-shot classification by VLMs, and was superior to existing noisy label learning methods. The code is publicly available at https://github.com/HiLab-git/VLM-CPL.
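The prompt-feature consensus step reduces to a simple agreement test between the two pseudo-label sources; the sketch below (with cluster-to-class mapping assumed already resolved) illustrates the selection logic only.

```python
# Minimal sketch of prompt-feature consensus filtering, assuming
# `prompt_labels` come from zero-shot VLM inference and `feature_labels`
# from clustering VLM features (cluster-to-class mapping already applied).
import numpy as np

def consensus_select(prompt_labels: np.ndarray, feature_labels: np.ndarray):
    """Keep indices where both pseudo-labeling routes agree."""
    agree = prompt_labels == feature_labels
    reliable = np.flatnonzero(agree)     # trained with their agreed labels
    unlabeled = np.flatnonzero(~agree)   # treated as unlabeled downstream
    return reliable, unlabeled
```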
[334] Histogram Layers for Neural Engineered Features
Joshua Peeples, Salim Al Kharsa, Luke Saleh, Alina Zare
Main category: cs.CV
TL;DR: The paper explores learning histogram-based features (like local binary patterns and edge histogram descriptors) via neural network layers for improved deep learning performance in computer vision tasks.
Details
Motivation: To leverage engineered histogram features within deep learning frameworks by embedding them as learnable layers in neural networks.
Method: Develops neural versions of local binary pattern and edge histogram descriptors to enhance feature representation and classification.
Result: Experiments on benchmark and real-world datasets demonstrate improved feature representation and classification.
Conclusion: Histogram layers in neural networks can effectively learn and enhance traditional histogram-based features for computer vision tasks.
Abstract: In the computer vision literature, many effective histogram-based features have been developed. These engineered features include local binary patterns and edge histogram descriptors among others and they have been shown to be informative features for a variety of computer vision tasks. In this paper, we explore whether these features can be learned through histogram layers embedded in a neural network and, therefore, be leveraged within deep learning frameworks. By using histogram features, local statistics of the feature maps from convolutional neural networks can be used to better represent the data. We present neural versions of local binary pattern and edge histogram descriptors that jointly improve the feature representation and perform image classification. Experiments are presented on benchmark and real-world datasets.
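A minimal soft-binning histogram layer, in the spirit of the learnable histogram features described here, can be written as follows; the RBF binning with trainable centers and widths is a common formulation, and the specific hyperparameters are assumptions.

```python
# Minimal learnable histogram layer with soft (RBF) binning: bin centers and
# widths are trainable, so the network can adapt where it counts feature values.
import torch
import torch.nn as nn

class SoftHistogram(nn.Module):
    def __init__(self, bins: int = 16):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(0, 1, bins))
        self.widths = nn.Parameter(torch.full((bins,), 0.1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature maps; returns (B, C, bins) soft counts
        flat = x.flatten(2).unsqueeze(-1)                  # (B, C, HW, 1)
        weight = torch.exp(-((flat - self.centers) / self.widths) ** 2)
        return weight.mean(dim=2)                          # normalized soft histogram

feats = torch.rand(2, 8, 32, 32)
print(SoftHistogram(16)(feats).shape)  # torch.Size([2, 8, 16])
```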
[335] MTMamba++: Enhancing Multi-Task Dense Scene Understanding via Mamba-Based Decoders
Baijiong Lin, Weisen Jiang, Pengguang Chen, Shu Liu, Ying-Cong Chen
Main category: cs.CV
TL;DR: MTMamba++ is a novel architecture for multi-task dense scene understanding, using Mamba-based decoders with self-task and cross-task blocks to enhance long-range dependency and task interactions.
Details
Motivation: Improving multi-task dense prediction by capturing long-range dependencies and enhancing cross-task interactions.
Method: Proposes MTMamba++ with self-task Mamba (STM) blocks for long-range dependency and cross-task Mamba (CTM) blocks (F-CTM and S-CTM) for task interaction.
Result: Outperforms CNN-based, Transformer-based, and diffusion-based methods on NYUDv2, PASCAL-Context, and Cityscapes datasets with high efficiency.
Conclusion: MTMamba++ is effective for multi-task scene understanding, offering superior performance and computational efficiency.
Abstract: Multi-task dense scene understanding, which trains a model for multiple dense prediction tasks, has a wide range of application scenarios. Capturing long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba++, a novel architecture for multi-task scene understanding featuring a Mamba-based decoder. It contains two types of core blocks: self-task Mamba (STM) block and cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging state-space models, while CTM explicitly models task interactions to facilitate information exchange across tasks. We design two types of CTM block, namely F-CTM and S-CTM, to enhance cross-task interaction from feature and semantic perspectives, respectively. Extensive experiments on NYUDv2, PASCAL-Context, and Cityscapes datasets demonstrate the superior performance of MTMamba++ over CNN-based, Transformer-based, and diffusion-based methods while maintaining high computational efficiency. The code is available at https://github.com/EnVision-Research/MTMamba.
[336] iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval
Lorenzo Agnolucci, Alberto Baldrati, Alberto Del Bimbo, Marco Bertini
Main category: cs.CV
TL;DR: The paper introduces Zero-Shot Composed Image Retrieval (ZS-CIR) to avoid labeled datasets, proposes iSEARLE for mapping images into CLIP space, and releases CIRCO dataset for benchmarking.
Details
Motivation: Supervised CIR methods rely on labeled datasets, which are labor-intensive. The goal is to enable CIR without labeled data.
Method: iSEARLE maps reference images into CLIP token space as pseudo-words and combines them with relative captions.
Result: iSEARLE achieves state-of-the-art performance on FashionIQ, CIRR, and CIRCO datasets, including domain conversion and object composition tasks.
Conclusion: iSEARLE and CIRCO advance ZS-CIR research, offering a scalable solution without labeled data.
Abstract: Given a query consisting of a reference image and a relative caption, Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption. The reliance of supervised methods on labor-intensive manually labeled datasets hinders their broad applicability. In this work, we introduce a new task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled training dataset. We propose an approach named iSEARLE (improved zero-Shot composEd imAge Retrieval with textuaL invErsion) that involves mapping the visual information of the reference image into a pseudo-word token in CLIP token embedding space and combining it with the relative caption. To foster research on ZS-CIR, we present an open-domain benchmarking dataset named CIRCO (Composed Image Retrieval on Common Objects in context), the first CIR dataset where each query is labeled with multiple ground truths and a semantic categorization. The experimental results illustrate that iSEARLE obtains state-of-the-art performance on three different CIR datasets – FashionIQ, CIRR, and the proposed CIRCO – and two additional evaluation settings, namely domain conversion and object composition. The dataset, the code, and the model are publicly available at https://github.com/miccunifi/SEARLE.
[337] Motion Keyframe Interpolation for Any Human Skeleton via Temporally Consistent Point Cloud Sampling and Reconstruction
Clinton Mo, Kun Hu, Chengjiang Long, Dong Yuan, Zhiyong Wang
Main category: cs.CV
TL;DR: PC-MRL is an unsupervised method for cross-compatible motion interpolation between skeletons, avoiding reliance on large supervised datasets.
Details
Motivation: Supervised models require large datasets for specific skeletons, limiting practical use. PC-MRL aims to enable motion interpolation for any skeleton without dataset constraints.
Method: Uses skeleton obfuscation via temporal point cloud sampling and unsupervised reconstruction. Introduces temporal point-wise K-nearest neighbors loss, FOQ, and RPA for robustness.
Result: PC-MRL effectively interpolates motion for desired skeletons without supervision from native datasets.
Conclusion: PC-MRL provides a viable unsupervised solution for cross-skeleton motion interpolation, overcoming dataset dependency.
Abstract: In the character animation field, modern supervised keyframe interpolation models have demonstrated exceptional performance in constructing natural human motions from sparse pose definitions. As supervised models, large motion datasets are necessary to facilitate the learning process; however, since motion is represented with fixed hierarchical skeletons, such datasets are incompatible with skeletons outside the datasets’ native configurations. Consequently, a motion dataset for the desired skeleton cannot be expected to exist, which severely hinders the feasibility of learned interpolation in practice. To combat this limitation, we propose Point Cloud-based Motion Representation Learning (PC-MRL), an unsupervised approach to enabling cross-compatibility between skeletons for motion interpolation learning. PC-MRL consists of a skeleton obfuscation strategy using temporal point cloud sampling, and an unsupervised skeleton reconstruction method from point clouds. We devise a temporal point-wise K-nearest neighbors loss for unsupervised learning. Moreover, we propose First-frame Offset Quaternion (FOQ) and Rest Pose Augmentation (RPA) strategies to overcome necessary limitations of our unsupervised point cloud-to-skeletal motion process. Comprehensive experiments demonstrate the effectiveness of PC-MRL in motion interpolation for desired skeletons without supervision from native datasets.
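A temporal point-wise K-nearest-neighbor loss of the general kind described can be sketched as below; the value of K and the exact matching form are assumptions about the shape of such a loss, not PC-MRL's precise definition.

```python
# Sketch of a point-wise K-nearest-neighbor loss between predicted points and
# a sampled reference cloud, computed per frame; K and the one-sided form are
# illustrative assumptions.
import torch

def knn_loss(pred: torch.Tensor, ref: torch.Tensor, k: int = 4) -> torch.Tensor:
    """pred: (T, N, 3), ref: (T, M, 3). Mean distance to the k nearest refs."""
    d = torch.cdist(pred, ref)                        # (T, N, M) pairwise distances
    knn = d.topk(k, dim=-1, largest=False).values     # k smallest per point
    return knn.mean()
```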
[338] Activator: GLU Activation Function as the Core Component of a Vision Transformer
Abdullah Nazhat Abdullah, Tarkan Aydin
Main category: cs.CV
TL;DR: The paper proposes replacing the MLP and attention mechanism in transformers with a GLU-based architecture to reduce computational costs while maintaining competitive performance.
Details
Motivation: Transformers are computationally expensive due to their reliance on scaled dot product attention and softmax. This work aims to reduce this cost by using GLU-based MLPs.
Method: The study substitutes traditional MLP and attention mechanisms with a GLU activation function structure, evaluating its computational efficiency and performance.
Result: The proposed GLU-based architecture reduces computational complexity while offering competitive performance compared to baseline transformer architectures.
Conclusion: GLU-based MLPs provide a more efficient yet capable alternative to traditional transformer components, supporting the goal of reducing computational costs.
Abstract: The transformer architecture has driven many successes in a variety of tasks within the field of deep learning, in particular the recent advances in natural language processing (NLP) culminating with large language models (LLM). Adding to that success, transformer architecture has found widespread interest from computer vision (CV) researchers and practitioners, allowing for many advancements in vision-related tasks and opening the door for multitask and multi-modal deep learning architectures that share the same principle of operation. One drawback to these architectures is their reliance on the scaled dot product attention mechanism with the softmax activation function, which is computationally expensive and requires large compute capabilities for both training and inference. This paper investigates substituting the MLP and attention mechanism usually adopted for transformer architecture with an architecture based on incorporating a gated linear unit (GLU) activation function structure, with the aim of reducing the computational cost. The equalized experimental assessments conducted in this work show that the proposed modification, with its targeted reductions in computational complexity, offers competitive performance compared to the selected baseline architectures. The results strongly support the aims of this work: to extensively utilize GLU-based MLPs and establish a more efficient yet capable alternative to the traditional MLP and the attention mechanism as the core components in the design of transformer architectures.
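A minimal GLU-style block of the kind investigated here looks as follows; the gated formulation (a value branch modulated by a sigmoid gate) is standard, while the exact block layout, norms, and residuals of the proposed architecture are not reproduced.

```python
# Minimal GLU-style MLP block: out = W_o((W_v x) * sigmoid(W_g x)).
# The surrounding norms, residuals, and patch mixing of the full
# architecture are omitted.
import torch
import torch.nn as nn

class GLUBlock(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.value = nn.Linear(dim, hidden)
        self.gate = nn.Linear(dim, hidden)
        self.out = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(self.value(x) * torch.sigmoid(self.gate(x)))

tokens = torch.randn(1, 196, 256)          # (batch, patches, dim)
print(GLUBlock(256, 512)(tokens).shape)    # torch.Size([1, 196, 256])
```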
[339] The DeepSpeak Dataset
Sarah Barrington, Matyas Bohacek, Hany Farid
Main category: cs.CV
TL;DR: The paper introduces DeepSpeak, a diverse and multimodal dataset to address limitations in current deepfake detection training datasets, ensuring high-quality and realistic deepfakes.
Details
Motivation: Current deepfake detection datasets are inadequate due to low-quality generators, lack of consent, and poor multimodal coverage, limiting the effectiveness of detection classifiers.
Method: The authors created DeepSpeak, a dataset with 100+ hours of real and deepfake audiovisual content, using self-recorded data, advanced synthesis engines, and identity-matching protocols.
Result: State-of-the-art deepfake detectors failed to generalize to DeepSpeak without retraining, emphasizing the need for diverse and up-to-date datasets.
Conclusion: DeepSpeak addresses critical gaps in deepfake detection research, highlighting the importance of high-quality, diverse datasets for robust detector performance.
Abstract: Deepfakes represent a growing concern across domains such as impostor hiring, fraud, and disinformation. Despite significant efforts to develop robust detection classifiers to distinguish the real from the fake, commonly used training datasets remain inadequate: relying on low-quality and outdated deepfake generators, consisting of content scraped from online repositories without participant consent, lacking in multimodal coverage, and rarely employing identity-matching protocols to ensure realistic fakes. To overcome these limitations, we present the DeepSpeak dataset, a diverse and multimodal dataset comprising over 100 hours of authentic and deepfake audiovisual content. We contribute: i) more than 50 hours of real, self-recorded data collected from 500 diverse and consenting participants using a custom-built data collection tool, ii) more than 50 hours of state-of-the-art audio and visual deepfakes generated using 14 video synthesis engines and three voice cloning engines, and iii) an embedding-based, identity-matching approach to ensure the creation of convincing, high-quality identity swaps that realistically simulate adversarial deepfake attacks. We also perform large-scale evaluations of state-of-the-art deepfake detectors and show that, without retraining, these detectors fail to generalize to the DeepSpeak dataset. These evaluations highlight the importance of a large and diverse dataset containing deepfakes from the latest generative-AI tools.
[340] Knowledge Distillation with Refined Logits
Wujie Sun, Defang Chen, Siwei Lyu, Genlang Chen, Chun Chen, Can Wang
Main category: cs.CV
TL;DR: The paper introduces Refined Logit Distillation (RLD), a method to improve logit distillation by dynamically refining teacher logits using labeling information, addressing limitations of current methods.
Details
Motivation: Current logit distillation methods suffer from inconsistencies due to incorrect teacher predictions, which can mislead student models.
Method: RLD uses labeling information to dynamically refine teacher logits, eliminating misleading information while preserving class correlations.
Result: Experiments on CIFAR-100 and ImageNet show RLD outperforms existing methods.
Conclusion: RLD enhances the efficiency and value of distilled knowledge by refining teacher logits dynamically.
Abstract: Recent research on knowledge distillation has increasingly focused on logit distillation because of its simplicity, effectiveness, and versatility in model compression. In this paper, we introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods. Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions, creating an exacerbated divergence between the standard distillation loss and the cross-entropy loss, which can undermine the consistency of the student model’s learning objectives. Previous attempts to use labels to empirically correct teacher predictions may undermine the class correlations. In contrast, our RLD employs labeling information to dynamically refine teacher logits. In this way, our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations, thus enhancing the value and efficiency of distilled knowledge. Experimental results on CIFAR-100 and ImageNet demonstrate its superiority over existing methods. Our code is available at https://github.com/zju-SWJ/RLD.
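One way to picture label-guided logit refinement is the hedged sketch below: when the teacher misranks the true class, its logit is lifted just above the current top class while all other logits, and hence class correlations, are left untouched. This rule is illustrative, not RLD's exact mechanism.

```python
# Hedged sketch of label-guided logit refinement: for samples the teacher
# gets wrong, raise the true-class logit just above the teacher's current
# top class, leaving the remaining logits (class correlations) unchanged.
# This rule is an illustrative assumption.
import torch

def refine_logits(teacher_logits: torch.Tensor, labels: torch.Tensor, margin=0.1):
    refined = teacher_logits.clone()
    top_vals, top_idx = refined.max(dim=1)
    wrong = top_idx != labels                       # teacher misranked these samples
    rows = wrong.nonzero(as_tuple=True)[0]
    refined[rows, labels[rows]] = top_vals[rows] + margin
    return refined
```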
[341] Text-to-Image Generation Via Energy-Based CLIP
Roy Ganz, Michael Elad
Main category: cs.CV
TL;DR: CLIP-JEM extends Joint Energy Models (JEMs) to multimodal vision-language tasks using CLIP, combining generative and discriminative objectives for improved performance.
Details
Motivation: JEMs have not scaled well to high-resolution datasets. CLIP-JEM aims to bridge this gap by leveraging CLIP for multimodal tasks.
Method: Uses a joint-energy function based on cosine similarity in CLIP space for generative tasks and contrastive adversarial loss for discriminative tasks.
Result: CLIP-JEM generates realistic images from text, outperforms competitors on benchmarks, and enhances CLIP-based generative frameworks.
Conclusion: CLIP-JEM is a scalable, robust model for multimodal tasks, offering superior performance and evaluation capabilities.
Abstract: Joint Energy Models (JEMs), while drawing significant research attention, have not been successfully scaled to real-world, high-resolution datasets. We present CLIP-JEM, a novel approach extending JEMs to the multimodal vision-language domain using CLIP, integrating both generative and discriminative objectives. For the generative one, we introduce an image-text joint-energy function based on cosine similarity in the CLIP space, training CLIP to assign low energy to real image-caption pairs and high energy otherwise. For the discriminative one, we employ contrastive adversarial loss, extending the adversarial training objective to the multimodal domain. CLIP-JEM not only generates realistic images from text but also achieves competitive results on the compositionality benchmark, outperforming leading methods with fewer parameters. Additionally, we demonstrate the superior guidance capability of CLIP-JEM by enhancing CLIP-based generative frameworks and converting unconditional diffusion models to text-based ones. Lastly, we show that our model can serve as a more robust evaluation metric for text-to-image generative tasks than CLIP.
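The generative energy reduces to a sign flip on CLIP-space cosine similarity; the sketch below assumes pre-computed, normalized CLIP image and text embeddings and shows the energy evaluation only.

```python
# Small sketch of the image-text joint energy described above: low energy for
# matching pairs, high otherwise. Random vectors stand in for CLIP embeddings.
import torch
import torch.nn.functional as F

def joint_energy(image_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
    """Energy = negative cosine similarity in the shared CLIP space."""
    return -F.cosine_similarity(image_feat, text_feat, dim=-1)

img = F.normalize(torch.randn(4, 512), dim=-1)
txt = F.normalize(torch.randn(4, 512), dim=-1)
print(joint_energy(img, txt))  # one energy value per image-caption pair
```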
[342] LM-Gaussian: Boost Sparse-view 3D Gaussian Splatting with Large Model Priors
Hanyang Yu, Xiaoxiao Long, Ping Tan
Main category: cs.CV
TL;DR: LM-Gaussian improves sparse-view 3D reconstruction using priors from large-scale vision models, reducing input image requirements while maintaining quality.
Details
Motivation: Sparse-view reconstruction is ill-posed and under-constrained, leading to poor results. Current methods like 3DGS need dense inputs, which are impractical.
Method: Introduces LM-Gaussian with robust initialization using stereo priors, diffusion-based refinement for detail preservation, and video diffusion priors for realism.
Result: Achieves high-quality reconstructions from fewer images, validated on public datasets.
Conclusion: LM-Gaussian reduces data needs and enhances sparse-view 3D reconstruction, showing promise for practical applications.
Abstract: We aim to address sparse-view reconstruction of a 3D scene by leveraging priors from large-scale vision models. While recent advancements such as 3D Gaussian Splatting (3DGS) have demonstrated remarkable successes in 3D reconstruction, these methods typically necessitate hundreds of input images that densely capture the underlying scene, making them time-consuming and impractical for real-world applications. However, sparse-view reconstruction is inherently ill-posed and under-constrained, often resulting in inferior and incomplete outcomes. This is due to issues such as failed initialization, overfitting on input images, and a lack of details. To mitigate these challenges, we introduce LM-Gaussian, a method capable of generating high-quality reconstructions from a limited number of images. Specifically, we propose a robust initialization module that leverages stereo priors to aid in the recovery of camera poses and reliable point clouds. Additionally, a diffusion-based refinement is iteratively applied to incorporate image diffusion priors into the Gaussian optimization process to preserve intricate scene details. Finally, we utilize video diffusion priors to further enhance the rendered images for realistic visual effects. Overall, our approach significantly reduces the data acquisition requirements compared to previous 3DGS methods. We validate the effectiveness of our framework through experiments on various public datasets, demonstrating its potential for high-quality 360-degree scene reconstruction. Visual results are on our website.
[343] An Effective UNet Using Feature Interaction and Fusion for Organ Segmentation in Medical Image
Xiaolin Gou, Chuanlin Liao, Jizhe Zhou, Fengshuo Ye, Yi Lin
Main category: cs.CV
TL;DR: A novel U-shaped model with three plug-and-play modules improves medical image segmentation by enhancing feature representation and multi-scale fusion, outperforming state-of-the-art methods.
Details
Motivation: Existing methods underutilize pre-trained encoder features, limiting segmentation performance in medical imaging.
Method: Proposes a U-shaped model with three modules: channel spatial interaction, channel attention-based decoder, and multi-level fusion for feature enhancement.
Result: Achieves highest Dice scores (86.05% and 92.58%) on two datasets, with improved accuracy and computational efficiency (86.91M parameters, 23.26 GFLOPs).
Conclusion: The model effectively leverages encoder features, balancing accuracy and complexity, and outperforms current methods.
Abstract: Nowadays, pre-trained encoders are widely used in medical image segmentation due to their strong capability in extracting rich and generalized feature representations. However, existing methods often fail to fully leverage these features, limiting segmentation performance. In this work, a novel U-shaped model is proposed to address the above issue, including three plug-and-play modules. A channel spatial interaction module is introduced to improve the quality of skip connection features by modeling inter-stage interactions between the encoder and decoder. A channel attention-based module integrating squeeze-and-excitation mechanisms with convolutional layers is employed in the decoder blocks to strengthen the representation of critical features while suppressing irrelevant ones. A multi-level fusion module is designed to aggregate multi-scale decoder features, improving spatial detail and consistency in the final prediction. Comprehensive experiments on the synapse multi-organ segmentation dataset and automated cardiac diagnosis challenge dataset demonstrate that the proposed model outperforms existing state-of-the-art methods, achieving the highest average Dice scores of 86.05% and 92.58%, yielding improvements of 1.15% and 0.26%, respectively. In addition, the proposed model provides a balance between accuracy and computational complexity, with only 86.91 million parameters and 23.26 giga floating-point operations.
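The channel attention-based decoder module builds on the standard squeeze-and-excitation pattern; a minimal SE unit is sketched below, with the reduction ratio set to a typical default rather than the paper's reported configuration.

```python
# Minimal squeeze-and-excitation channel-attention unit of the kind the
# decoder blocks integrate with convolutions; reduction=16 is a common
# default, not the paper's setting.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(2, 3)))      # squeeze: global average pool
        return x * weights.view(b, c, 1, 1)        # excite: reweight channels

print(SEBlock(64)(torch.randn(1, 64, 56, 56)).shape)
```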
[344] Parameter-Efficient Fine-Tuning in Spectral Domain for Point Cloud Learning
Dingkang Liang, Tianrui Feng, Xin Zhou, Yumeng Zhang, Zhikang Zou, Xiang Bai
Main category: cs.CV
TL;DR: PointGST introduces a parameter-efficient fine-tuning method for point cloud models by freezing pre-trained models and using a lightweight spectral adapter, reducing computational costs while improving performance.
Details
Motivation: Existing pre-training methods for point cloud models require full fine-tuning, which is storage-intensive and computationally demanding. PointGST aims to address this inefficiency.
Method: PointGST freezes the pre-trained model and introduces a Point Cloud Spectral Adapter (PCSA) for fine-tuning in the spectral domain, leveraging orthogonal components to de-correlate token confusion.
Result: PointGST outperforms full fine-tuning methods and significantly reduces trainable parameters, as demonstrated by experiments on various point cloud datasets.
Conclusion: PointGST is an efficient solution for transferring general knowledge to downstream tasks in point cloud learning, reducing training costs while maintaining performance.
Abstract: Recently, leveraging pre-training techniques to enhance point cloud models has become a prominent research topic. However, existing approaches typically require full fine-tuning of pre-trained models to achieve satisfactory performance on downstream tasks, which is storage-intensive and computationally demanding. To address this issue, we propose a novel Parameter-Efficient Fine-Tuning (PEFT) method for point cloud, called PointGST (Point cloud Graph Spectral Tuning). PointGST freezes the pre-trained model and introduces a lightweight, trainable Point Cloud Spectral Adapter (PCSA) for fine-tuning parameters in the spectral domain. The core idea is built on two observations: 1) The inner tokens from frozen models might present confusion in the spatial domain; 2) Task-specific intrinsic information is important for transferring the general knowledge to the downstream task. Specifically, PointGST transfers the point tokens from the spatial domain to the spectral domain, effectively de-correlating confusion among tokens by using orthogonal components for separation. Moreover, the generated spectral basis involves intrinsic information about the downstream point clouds, enabling more targeted tuning. As a result, PointGST facilitates the efficient transfer of general knowledge to downstream tasks while significantly reducing training costs. Extensive experiments on challenging point cloud datasets across various tasks demonstrate that PointGST not only outperforms its fully fine-tuning counterpart but also significantly reduces trainable parameters, making it a promising solution for efficient point cloud learning. The code will be made available at https://github.com/jerryfeng2003/PointGST
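A hedged sketch of spectral-domain tuning: tokens are rotated into the eigenbasis of a graph Laplacian built from point coordinates, passed through a small trainable layer, and rotated back with a residual connection. The adapter below is illustrative, not the exact PCSA.

```python
# Illustrative spectral adapter: project tokens onto an orthogonal graph
# Laplacian eigenbasis, apply a lightweight trainable mix, project back.
import torch
import torch.nn as nn

def laplacian_eigenbasis(points: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """points: (N, 3). Returns (N, N) eigenvectors of a Gaussian-affinity Laplacian."""
    d2 = torch.cdist(points, points) ** 2
    W = torch.exp(-d2 / (2 * sigma**2))
    L = torch.diag(W.sum(1)) - W
    _, U = torch.linalg.eigh(L)            # orthogonal spectral basis
    return U

class SpectralAdapter(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.mix = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor, U: torch.Tensor) -> torch.Tensor:
        spec = U.T @ tokens            # to spectral domain
        spec = self.mix(spec)          # lightweight trainable tuning
        return tokens + U @ spec       # back to spatial domain, residual add
```

In practice one would truncate U to a few low-frequency eigenvectors to keep the adapter lightweight.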
[345] KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities
Hsin-Ping Huang, Xinyi Wang, Yonatan Bitton, Hagai Taitelbaum, Gaurav Singh Tomar, Ming-Wei Chang, Xuhui Jia, Kelvin C. K. Chan, Hexiang Hu, Yu-Chuan Su, Ming-Hsuan Yang
Main category: cs.CV
TL;DR: The paper introduces KITTEN, a benchmark to evaluate text-to-image models’ ability to generate realistic visual entities, revealing their shortcomings in accuracy and creativity.
Details
Motivation: To assess whether text-to-image models can accurately represent real-world visual entities, beyond just aesthetics or text alignment.
Method: Proposes KITTEN benchmark, evaluates models using human and automatic metrics, and compares text-to-image and retrieval-augmented models.
Result: Advanced models fail in accurate entity representation; retrieval-augmented models improve fidelity but lack creativity in novel configurations.
Conclusion: Current models struggle with realistic entity generation, highlighting a need for better balance between fidelity and creativity.
Abstract: Recent advances in text-to-image generation have improved the quality of synthesized images, but evaluations mainly focus on aesthetics or alignment with text prompts. Thus, it remains unclear whether these models can accurately represent a wide variety of realistic visual entities. To bridge this gap, we propose KITTEN, a benchmark for Knowledge-InTensive image generaTion on real-world ENtities. Using KITTEN, we conduct a systematic study of the latest text-to-image models and retrieval-augmented models, focusing on their ability to generate real-world visual entities, such as landmarks and animals. Analysis using carefully designed human evaluations, automatic metrics, and MLLM evaluations shows that even advanced text-to-image models fail to generate accurate visual details of entities. While retrieval-augmented models improve entity fidelity by incorporating reference images, they tend to over-rely on them and struggle to create novel configurations of the entity in creative text prompts.
[346] LUDVIG: Learning-Free Uplifting of 2D Visual Features to Gaussian Splatting Scenes
Juliette Marrie, Romain Menegaux, Michael Arbel, Diane Larlus, Julien Mairal
Main category: cs.CV
TL;DR: A novel method uplifts 2D image features (e.g., from DINO, SAM, CLIP) into 3D Gaussian Splatting representations using feature aggregation and graph diffusion, achieving competitive performance with speed-ups.
Details
Motivation: Extend vision foundation models (DINO, SAM, CLIP) to 3D tasks efficiently, avoiding reliance on reconstruction losses.
Method: Feature aggregation augmented by graph diffusion, leveraging 3D geometry and DINOv2 similarities for refinement.
Result: Comparable to state-of-the-art on downstream tasks, with speed-ups; strong segmentation and open-vocabulary performance.
Conclusion: The method is versatile, efficient, and effective for 3D tasks using 2D foundation models.
Abstract: We address the problem of extending the capabilities of vision foundation models such as DINO, SAM, and CLIP, to 3D tasks. Specifically, we introduce a novel method to uplift 2D image features into Gaussian Splatting representations of 3D scenes. Unlike traditional approaches that rely on minimizing a reconstruction loss, our method employs a simpler and more efficient feature aggregation technique, augmented by a graph diffusion mechanism. Graph diffusion refines 3D features, such as coarse segmentation masks, by leveraging 3D geometry and pairwise similarities induced by DINOv2. Our approach achieves performance comparable to the state of the art on multiple downstream tasks while delivering significant speed-ups. Notably, we obtain competitive segmentation results using only generic DINOv2 features, despite DINOv2 not being trained on millions of annotated segmentation masks like SAM. When applied to CLIP features, our method demonstrates strong performance in open-vocabulary object segmentation tasks, highlighting the versatility of our approach.
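The learning-free aggregation can be pictured as a weighted average: each Gaussian collects the 2D features of the pixels it renders to, weighted by its rasterization contributions. In the sketch below, the pixel-to-Gaussian weight matrix is assumed to come from the splatting renderer, and the graph diffusion refinement is omitted.

```python
# Minimal sketch of learning-free uplifting: per-Gaussian features are the
# rendering-weighted average of the 2D features of the pixels each Gaussian
# contributes to. `weights` is assumed given by the splatting rasterizer.
import torch

def uplift_features(feat2d: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """feat2d: (P, D) pixel features; weights: (P, G) contribution of each
    Gaussian to each pixel. Returns (G, D) per-Gaussian features."""
    num = weights.T @ feat2d                     # weighted feature sum per Gaussian
    den = weights.sum(dim=0, keepdim=True).T     # total weight per Gaussian
    return num / den.clamp_min(1e-8)
```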
[347] Evaluating Self-Supervised Learning in Medical Imaging: A Benchmark for Robustness, Generalizability, and Multi-Domain Impact
Valay Bundele, Karahan Sarıtaş, Bora Kargi, Oğuz Ata Çal, Kıvanç Tezören, Zohreh Ghaderi, Hendrik Lensch
Main category: cs.CV
TL;DR: A comprehensive evaluation of self-supervised learning (SSL) methods in medical imaging, focusing on robustness and generalizability across diverse datasets and conditions.
Details
Motivation: Addressing the fragmented evaluation of SSL in medical imaging due to limited labeled data and the need for robust, generalizable models in critical healthcare settings.
Method: Evaluated 8 major SSL methods across 11 medical datasets using MedMNIST, analyzing in-domain and OOD performance, initialization strategies, architectures, and multi-domain pre-training.
Result: Provided insights into SSL performance under varying label proportions (1%, 10%, 100%) and cross-dataset generalizability, simulating real-world scenarios.
Conclusion: The study offers a standardized benchmark to guide practitioners in applying SSL methods effectively in medical imaging.
Abstract: Self-supervised learning (SSL) has emerged as a promising paradigm in medical imaging, addressing the chronic challenge of limited labeled data in healthcare settings. While SSL has shown impressive results, existing studies in the medical domain are often limited in scope, focusing on specific datasets or modalities, or evaluating only isolated aspects of model performance. This fragmented evaluation approach poses a significant challenge, as models deployed in critical medical settings must not only achieve high accuracy but also demonstrate robust performance and generalizability across diverse datasets and varying conditions. To address this gap, we present a comprehensive evaluation of SSL methods within the medical domain, with a particular focus on robustness and generalizability. Using the MedMNIST dataset collection as a standardized benchmark, we evaluate 8 major SSL methods across 11 different medical datasets. Our study provides an in-depth analysis of model performance in both in-domain scenarios and the detection of out-of-distribution (OOD) samples, while exploring the effect of various initialization strategies, model architectures, and multi-domain pre-training. We further assess the generalizability of SSL methods through cross-dataset evaluations and the in-domain performance with varying label proportions (1%, 10%, and 100%) to simulate real-world scenarios with limited supervision. We hope this comprehensive benchmark helps practitioners and researchers make more informed decisions when applying SSL methods to medical applications.
[348] Adaptive Real-Time Multi-Loss Function Optimization Using Dynamic Memory Fusion Framework: A Case Study on Breast Cancer Segmentation
Amin Golnari, Mostafa Diba
Main category: cs.CV
TL;DR: A novel framework, dynamic memory fusion, adaptively adjusts multi-loss function weights in real-time using historical loss data, improving deep learning model performance, especially in class-imbalanced tasks like breast ultrasound segmentation.
Details
Motivation: Manual tuning of multi-loss functions in deep learning is inefficient and inflexible, impacting model performance.
Method: Proposes dynamic memory fusion for real-time adaptive loss weighting and introduces class-balanced dice loss for class imbalance.
Result: Experiments on breast ultrasound datasets show improved segmentation performance across metrics.
Conclusion: The framework dynamically prioritizes relevant criteria, enhancing performance in evolving environments; code is publicly available.
Abstract: Deep learning has proven to be a highly effective tool for a wide range of applications, significantly when leveraging the power of multi-loss functions to optimize performance on multiple criteria simultaneously. However, optimal selection and weighting loss functions in deep learning tasks can significantly influence model performance, yet manual tuning of these functions is often inefficient and inflexible. We propose a novel framework called dynamic memory fusion for adaptive multi-loss function penalizing in real-time to address this. This framework leverages historical loss values data to dynamically adjust the weighting of multiple loss functions throughout the training process. Additionally, this framework integrates an auxiliary loss function to enhance model performance in the early stages. To further research horizons, we introduce the class-balanced dice loss function, designed to address class imbalance by prioritizing underrepresented classes. Experiments on breast ultrasound datasets demonstrate that the framework improves segmentation performance across various metrics. These results demonstrate the effectiveness of our proposed framework in ensuring that the model dynamically adjusts its focus to prioritize the most relevant criteria, leading to improved performance in evolving environments. The source code for our proposed methodology is publicly available on GitHub.
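As a rough illustration of weighting multiple losses from their history, the sketch below up-weights losses that are descending slowly; the actual weighting rule, memory length, and auxiliary-loss schedule in the paper may differ.

from collections import deque
import torch

class DynamicLossWeighter:
    def __init__(self, n_losses, memory=20):
        # One bounded history of recent values per loss term.
        self.history = [deque(maxlen=memory) for _ in range(n_losses)]

    def combine(self, losses):
        for h, loss in zip(self.history, losses):
            h.append(float(loss.detach()))
        # A loss whose latest value is still close to its oldest recorded
        # value is descending slowly and receives more weight.
        rates = [h[-1] / (h[0] + 1e-8) if len(h) > 1 else 1.0 for h in self.history]
        w = torch.tensor(rates)
        w = w / w.sum()
        return sum(wi * li for wi, li in zip(w, losses))

# usage: total = weighter.combine([class_balanced_dice, bce]); total.backward()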
[349] Towards End-to-End Neuromorphic Event-based 3D Object Reconstruction Without Physical Priors
Chuanzhi Xu, Langyi Chen, Haodong Chen, Vera Chung, Qiang Qu
Main category: cs.CV
TL;DR: Proposes an end-to-end method for dense voxel 3D reconstruction using neuromorphic cameras, eliminating physical priors and improving accuracy by 54.6%.
Details
Motivation: Existing methods for 3D reconstruction with monocular neuromorphic cameras are limited, rely on physical priors, and use complex pipelines.
Method: Introduces a novel event representation for edge feature enhancement and an Optimal Binarization Threshold Selection Principle.
Result: Achieves a 54.6% improvement in reconstruction accuracy over baseline methods.
Conclusion: The method simplifies 3D reconstruction with neuromorphic cameras and sets a benchmark for future work.
Abstract: Neuromorphic cameras, also known as event cameras, are asynchronous brightness-change sensors that can capture extremely fast motion without suffering from motion blur, making them particularly promising for 3D reconstruction in extreme environments. However, existing research on 3D reconstruction using monocular neuromorphic cameras is limited, and most of the methods rely on estimating physical priors and employ complex multi-step pipelines. In this work, we propose an end-to-end method for dense voxel 3D reconstruction using neuromorphic cameras that eliminates the need to estimate physical priors. Our method incorporates a novel event representation to enhance edge features, enabling the proposed feature-enhancement model to learn more effectively. Additionally, we introduce the Optimal Binarization Threshold Selection Principle as a guideline for future related work, using the optimal reconstruction results achieved with threshold optimization as the benchmark. Our method achieves a 54.6% improvement in reconstruction accuracy compared to the baseline method.
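The threshold-selection idea can be pictured as a simple validation sweep; the IoU criterion and the grid below are assumptions for illustration, not the paper's exact principle.

import numpy as np

def best_binarization_threshold(pred_prob, gt_occ, grid=np.linspace(0.1, 0.9, 17)):
    # pred_prob: (D, H, W) predicted occupancy probabilities; gt_occ in {0, 1}.
    best_t, best_iou = None, -1.0
    for t in grid:
        pred = pred_prob > t
        inter = np.logical_and(pred, gt_occ).sum()
        union = np.logical_or(pred, gt_occ).sum()
        iou = inter / max(union, 1)
        if iou > best_iou:
            best_t, best_iou = t, iou
    return best_t, best_iou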
[350] DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition and Surface Reconstruction for Urban Driving Scenes
Chensheng Peng, Chengwei Zhang, Yixiao Wang, Chenfeng Xu, Yichen Xie, Wenzhao Zheng, Kurt Keutzer, Masayoshi Tomizuka, Wei Zhan
Main category: cs.CV
TL;DR: DeSiRe-GS is a self-supervised Gaussian splatting method for static-dynamic decomposition and high-fidelity surface reconstruction in driving scenarios, outperforming prior methods without needing external annotations.
Details
Motivation: Addressing the challenge of reconstructing surfaces in dynamic driving environments with sparse data, while avoiding overfitting and ensuring physical plausibility.
Method: Uses a two-stage optimization pipeline: first extracts 2D motion masks from static regions, then maps these into Gaussian space with geometric regularizations and temporal cross-view consistency.
Result: Achieves high-quality surface reconstruction, surpassing self-supervised methods and matching accuracy of annotation-dependent approaches.
Conclusion: DeSiRe-GS is efficient, effective, and avoids reliance on external annotations, making it suitable for complex driving scenarios.
Abstract: We present DeSiRe-GS, a self-supervised Gaussian splatting representation, enabling effective static-dynamic decomposition and high-fidelity surface reconstruction in complex driving scenarios. Our approach employs a two-stage optimization pipeline of dynamic street Gaussians. In the first stage, we extract 2D motion masks based on the observation that 3D Gaussian Splatting inherently can reconstruct only the static regions in dynamic environments. These extracted 2D motion priors are then mapped into the Gaussian space in a differentiable manner, leveraging an efficient formulation of dynamic Gaussians in the second stage. Combined with the introduced geometric regularizations, our method is able to address the overfitting issues caused by data sparsity in autonomous driving, reconstructing physically plausible Gaussians that align with object surfaces rather than floating in air. Furthermore, we introduce temporal cross-view consistency to ensure coherence across time and viewpoints, resulting in high-quality surface reconstruction. Comprehensive experiments demonstrate the efficiency and effectiveness of DeSiRe-GS, surpassing prior self-supervised arts and achieving accuracy comparable to methods relying on external 3D bounding box annotations. Code is available at https://github.com/chengweialan/DeSiRe-GS
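The stage-one observation translates into very little code: where a static-only 3DGS fit cannot explain the frame, the photometric residual is high and the pixel is likely dynamic. The threshold tau below is an illustrative assumption.

import torch

def extract_motion_mask(rendered, observed, tau=0.1):
    # rendered, observed: (3, H, W); rendered comes from a static-scene 3DGS fit.
    err = (rendered - observed).abs().mean(dim=0)  # per-pixel photometric residual
    return (err > tau).float()                     # 1 = likely dynamic region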
[351] Back Home: A Computer Vision Solution to Seashell Identification for Ecological Restoration
Alexander Valverde, Luis Solano, André Montoya
Main category: cs.CV
TL;DR: A lightweight pipeline using BackHome19K dataset (19,058 images, 516 species) identifies seashell origins in real-time, aiding wildlife officers in repatriating confiscated shells.
Details
Motivation: Illegal souvenir collection removes seashells from Costa Rican beaches, but their origin is hard to verify, preventing repatriation.
Method: Uses a large-scale annotated image corpus (BackHome19K) and a real-time inference pipeline with an anomaly filter for robustness.
Result: Achieves 86.3% balanced accuracy, rejects 93% of out-of-domain objects, and has processed 70,000 shells at under 3 seconds per image.
Conclusion: The system enables efficient repatriation of confiscated seashells, with the dataset publicly available.
Abstract: Illegal souvenir collection strips an estimated five tonnes of seashells from Costa Rica’s beaches each year. Yet, once these specimens are seized, their coastal origin – Pacific or Caribbean – cannot be verified easily due to the lack of information, preventing their return when confiscated by local authorities. To solve this issue, we introduce BackHome19K, the first large-scale image corpus (19,058 photographs, 516 species) annotated with coast-level labels, and propose a lightweight pipeline that infers provenance in real time on a mobile-grade CPU. A trained anomaly filter pre-screens uploads, increasing robustness to user-generated noise. On a held-out test set, the classifier attains 86.3% balanced accuracy, while the filter rejects 93% of 180 out-of-domain objects with zero false negatives. Deployed as a web application, the system has already processed 70,000 shells for wildlife officers in under three seconds per image, enabling confiscated specimens to be safely repatriated to their native ecosystems. The dataset is available at https://huggingface.co/datasets/FIFCO/BackHome19K
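Operationally, the described system is a filter-then-classify pipeline; the sketch below shows that control flow under assumed model interfaces and an assumed rejection threshold.

import torch

@torch.no_grad()
def classify_shell(image, anomaly_filter, coast_classifier, reject_thresh=0.5):
    # Pre-screen user uploads: a high anomaly score means out-of-domain input.
    if anomaly_filter(image).item() > reject_thresh:
        return {"status": "rejected", "reason": "out-of-domain upload"}
    logits = coast_classifier(image)
    coast = ["Pacific", "Caribbean"][logits.argmax().item()]
    return {"status": "ok", "coast": coast,
            "confidence": logits.softmax(-1).max().item()}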
[352] Efficient Physics Simulation for 3D Scenes via MLLM-Guided Gaussian Splatting
Haoyu Zhao, Hao Wang, Xingyue Zhao, Hao Fei, Hongqiu Wang, Chengjiang Long, Hua Zou
Main category: cs.CV
TL;DR: Sim Anything is a physics-based method using MLLM to predict object properties and simulate dynamic 3D object movements efficiently.
Details
Motivation: Current methods for simulating dynamic 3D objects are either manual or computationally intensive, prompting a need for an efficient, automated solution.
Method: The approach involves scene reconstruction, MLLM-based physical property prediction, material distribution estimation, and adaptive sampling for simulation.
Result: Sim Anything achieves realistic motion faster than state-of-the-art methods, completing simulations in under 2 minutes on a single GPU.
Conclusion: The method successfully combines MLLM and physics-based simulation for efficient and realistic 3D object dynamics.
Abstract: Recent advancements in 3D generation models have opened new possibilities for simulating dynamic 3D object movements and customizing behaviors, yet creating this content remains challenging. Current methods often require manual assignment of precise physical properties for simulations or rely on video generation models to predict them, which is computationally intensive. In this paper, we rethink the usage of multi-modal large language models (MLLMs) in physics-based simulation, and present Sim Anything, a physics-based approach that endows static 3D objects with interactive dynamics. We begin with detailed scene reconstruction and object-level 3D open-vocabulary segmentation, progressing to multi-view image in-painting. Inspired by human visual reasoning, we propose MLLM-based Physical Property Perception (MLLM-P3) to predict mean physical properties of objects in a zero-shot manner. Based on the mean values and the object’s geometry, the Material Property Distribution Prediction (MPDP) model then estimates the full distribution, reformulating the problem as probability distribution estimation to reduce computational costs. Finally, we simulate objects in an open-world scene with particles sampled via the Physical-Geometric Adaptive Sampling (PGAS) strategy, efficiently capturing complex deformations and significantly reducing computational costs. Extensive experiments and user studies demonstrate that our Sim Anything achieves more realistic motion than state-of-the-art methods within 2 minutes on a single GPU.
[353] Generative AI for Cel-Animation: A Survey
Yunlong Tang, Junjia Guo, Pinxin Liu, Zhiyuan Wang, Hang Hua, Jia-Xing Zhong, Yunzhong Xiao, Chao Huang, Luchuan Song, Susan Liang, Yizhi Song, Liu He, Jing Bi, Mingqian Feng, Xinyang Li, Zeliang Zhang, Chenliang Xu
Main category: cs.CV
TL;DR: GenAI is transforming traditional Cel-Animation by automating tasks like inbetweening and colorization, reducing manual effort, and enhancing accessibility, though challenges like visual consistency and ethics remain.
Details
Motivation: The manual and time-intensive nature of traditional Cel-Animation production limits efficiency and scalability, prompting the need for automation through GenAI.
Method: The paper surveys the integration of GenAI tools (e.g., AniDoc, ToonCrafter, AniSora) into animation workflows to automate tasks such as inbetweening, colorization, and storyboarding.
Result: GenAI lowers technical barriers, broadens accessibility, and allows artists to focus on creativity, but issues like visual consistency and ethics persist.
Conclusion: GenAI holds promise for revolutionizing animation, but further advancements are needed to address challenges and fully realize its potential.
Abstract: The traditional Celluloid (Cel) Animation production pipeline encompasses multiple essential steps, including storyboarding, layout design, keyframe animation, inbetweening, and colorization, which demand substantial manual effort, technical expertise, and significant time investment. These challenges have historically impeded the efficiency and scalability of Cel-Animation production. The rise of generative artificial intelligence (GenAI), encompassing large language models, multimodal models, and diffusion models, offers innovative solutions by automating tasks such as inbetween frame generation, colorization, and storyboard creation. This survey explores how GenAI integration is revolutionizing traditional animation workflows by lowering technical barriers, broadening accessibility for a wider range of creators through tools like AniDoc, ToonCrafter, and AniSora, and enabling artists to focus more on creative expression and artistic innovation. Despite its potential, challenges like visual consistency, stylistic coherence, and ethical considerations persist. Additionally, this paper explores future directions and advancements in AI-assisted animation.
[354] Video2BEV: Transforming Drone Videos to BEVs for Video-based Geo-localization
Hao Ju, Shaofei Huang, Si Liu, Zhedong Zheng
Main category: cs.CV
TL;DR: The paper introduces Video2BEV, a video-based drone geo-localization paradigm that transforms drone videos into Bird’s Eye View (BEV) for improved matching, using Gaussian Splatting for 3D reconstruction and a diffusion module for hard negative samples. It outperforms existing methods on the new UniV dataset.
Details
Motivation: Existing image-based drone geo-localization methods underutilize video data and struggle with occlusions and viewpoint disparities.
Method: Proposes Video2BEV, which transforms drone videos into BEV using Gaussian Splatting for 3D reconstruction and includes a diffusion module for generating hard negative samples.
Result: Video2BEV achieves competitive recall rates, outperforms conventional video-based methods, and shows robustness at lower elevations with occlusions.
Conclusion: The Video2BEV paradigm effectively addresses limitations of image-based methods and demonstrates superior performance in video-based drone geo-localization.
Abstract: Existing approaches to drone visual geo-localization predominantly adopt the image-based setting, where a single drone-view snapshot is matched with images from other platforms. Such task formulation, however, underutilizes the inherent video output of the drone and is sensitive to occlusions and viewpoint disparity. To address these limitations, we formulate a new video-based drone geo-localization task and propose the Video2BEV paradigm. This paradigm transforms the video into a Bird’s Eye View (BEV), simplifying the subsequent inter-platform matching process. In particular, we employ Gaussian Splatting to reconstruct a 3D scene and obtain the BEV projection. Different from existing transform methods, e.g., the polar transform, our BEVs preserve more fine-grained details without significant distortion. To facilitate discriminative intra-platform representation learning, our Video2BEV paradigm also incorporates a diffusion-based module for generating hard negative samples. To validate our approach, we introduce UniV, a new video-based geo-localization dataset that extends the image-based University-1652 dataset. UniV features flight paths at $30^\circ$ and $45^\circ$ elevation angles with increased frame rates of up to 10 frames per second (FPS). Extensive experiments on the UniV dataset show that our Video2BEV paradigm achieves competitive recall rates and outperforms conventional video-based methods. Compared to other competitive methods, our proposed approach exhibits robustness at lower elevations with more occlusions.
[355] Survey on Hand Gesture Recognition from Visual Input
Manousos Linardakis, Iraklis Varlamis, Georgios Th. Papadopoulos
Main category: cs.CV
TL;DR: A survey on hand gesture recognition covering advancements, datasets, and open challenges in the field.
Details
Motivation: The growing demand for human-computer interaction in sign language, VR/AR, and robotics drives the need for a comprehensive survey on hand gesture recognition.
Method: Examines advancements in gesture and 3D hand pose recognition using RGB, depth images, and videos from monocular/multiview cameras, and reviews datasets.
Result: Provides insights into current trends, methodologies, and applications, while identifying open challenges like robustness, occlusions, generalization, and real-time efficiency.
Conclusion: The survey synthesizes recent research to guide future directions in hand gesture recognition, highlighting opportunities and challenges.
Abstract: Hand gesture recognition has become an important research area, driven by the growing demand for human-computer interaction in fields such as sign language recognition, virtual and augmented reality, and robotics. Despite the rapid growth of the field, there are few surveys that comprehensively cover recent research developments, available solutions, and benchmark datasets. This survey addresses this gap by examining the latest advancements in hand gesture and 3D hand pose recognition from various types of camera input data including RGB images, depth images, and videos from monocular or multiview cameras, examining the differing methodological requirements of each approach. Furthermore, an overview of widely used datasets is provided, detailing their main characteristics and application domains. Finally, open challenges such as achieving robust recognition in real-world environments, handling occlusions, ensuring generalization across diverse users, and addressing computational efficiency for real-time applications are highlighted to guide future research directions. By synthesizing the objectives, methodologies, and applications of recent studies, this survey offers valuable insights into current trends, challenges, and opportunities for future research in human hand gesture recognition.
[356] Quadratic Gaussian Splatting: High Quality Surface Reconstruction with Second-order Geometric Primitives
Ziyu Zhang, Binbin Huang, Hanqing Jiang, Liyang Zhou, Xiaojun Xiang, Shunhan Shen
Main category: cs.CV
TL;DR: QGS introduces deformable quadric surfaces with geodesic distance-based density for better geometry capture, reducing errors and memory usage while maintaining rendering efficiency.
Details
Motivation: Prior methods use Euclidean distance, misaligned with surface geometry under deformation, leading to inconsistencies. QGS aims to improve this by adapting density weights to curvature.
Method: QGS replaces static primitives with deformable quadric surfaces, using geodesic distance for density distribution. It solves geodesic distances in closed form for surface-aware splatting.
Result: QGS reduces geometric error by 33% over 2DGS and 27% over GOF on DTU, with competitive appearance quality.
Conclusion: QGS bridges geometric precision and visual fidelity, making it suitable for robotics and immersive reality.
Abstract: We propose Quadratic Gaussian Splatting (QGS), a novel representation that replaces static primitives with deformable quadric surfaces (e.g., ellipses, paraboloids) to capture intricate geometry. Unlike prior works that rely on Euclidean distance for primitive density modeling, a metric misaligned with surface geometry under deformation, QGS introduces geodesic distance-based density distributions. This innovation ensures that density weights adapt intrinsically to the primitive curvature, preserving consistency during shape changes (e.g., from planar disks to curved paraboloids). By solving geodesic distances in closed form on quadric surfaces, QGS enables surface-aware splatting, where a single primitive can represent complex curvature that previously required dozens of planar surfels, potentially reducing memory usage while maintaining efficient rendering via fast ray-quadric intersection. Experiments on DTU, Tanks and Temples, and MipNeRF360 datasets demonstrate state-of-the-art surface reconstruction, with QGS reducing geometric error (chamfer distance) by 33% over 2DGS and 27% over GOF on the DTU dataset. Crucially, QGS retains competitive appearance quality, bridging the gap between geometric precision and visual fidelity for applications like robotics and immersive reality.
[357] PaRCE: Probabilistic and Reconstruction-based Competency Estimation for CNN-based Image Classification
Sara Pohland, Claire Tomlin
Main category: cs.CV
TL;DR: The paper introduces PaRCE, a probabilistic and reconstruction-based method for holistic confidence estimation in CNNs, outperforming existing approaches in distinguishing correct, misclassified, and OOD samples, as well as localizing anomalies.
Details
Motivation: CNNs are overly confident in predictions, and existing methods lack a holistic approach to quantify uncertainty across various sources.
Method: Develops PaRCE, combining probabilistic and reconstruction-based techniques for competency estimation.
Result: PaRCE excels in distinguishing sample types and localizing anomalies, providing interpretable confidence scores.
Conclusion: PaRCE offers a reliable, holistic solution for perception model confidence estimation and anomaly localization.
Abstract: Convolutional neural networks (CNNs) are extremely popular and effective for image classification tasks but tend to be overly confident in their predictions. Various works have sought to quantify uncertainty associated with these models, detect out-of-distribution (OOD) inputs, or identify anomalous regions in an image, but limited work has sought to develop a holistic approach that can accurately estimate perception model confidence across various sources of uncertainty. We develop a probabilistic and reconstruction-based competency estimation (PaRCE) method and compare it to existing approaches for uncertainty quantification and OOD detection. We find that our method can best distinguish between correctly classified, misclassified, and OOD samples with anomalous regions, as well as between samples with visual image modifications resulting in high, medium, and low prediction accuracy. We describe how to extend our approach for anomaly localization tasks and demonstrate the ability of our approach to distinguish between regions in an image that are familiar to the perception model from those that are unfamiliar. We find that our method generates interpretable scores that most reliably capture a holistic notion of perception model confidence.
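One way to picture a holistic probabilistic-plus-reconstruction score is the fusion below; the exact PaRCE formulation may differ, and beta and the fusion rule are assumptions.

import torch

@torch.no_grad()
def competency_score(x, classifier, autoencoder, beta=5.0):
    probs = classifier(x).softmax(dim=-1)                # predictive confidence
    recon_err = (autoencoder(x) - x).pow(2).mean(dim=(1, 2, 3))
    familiarity = torch.exp(-beta * recon_err)           # low error = familiar input
    return probs.max(dim=-1).values * familiarity        # holistic score in [0, 1]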
[358] FREE-Merging: Fourier Transform for Efficient Model Merging
Shenghe Zheng, Hongzhi Wang
Main category: cs.CV
TL;DR: The paper introduces FR-Merging and FREE-Merging, methods to address task interference in model merging by focusing on the frequency domain, improving performance while balancing costs.
Details
Motivation: Existing model merging methods struggle with trade-offs between performance and deployment costs due to task interference, which is overlooked in the frequency domain.
Method: Proposes FR-Merging to filter harmful frequency domain interference and FREE-Merging, which adds a lightweight task-specific expert module to compensate for information loss.
Result: Demonstrates effectiveness across CV, NLP, and Multi-Modal tasks, balancing training cost, inference latency, storage, and performance.
Conclusion: FR-Merging and FREE-Merging offer flexible, efficient solutions for model merging, addressing frequency domain interference and performance trade-offs.
Abstract: With the rapid growth of deep learning, there is an increasing availability of open-source models for various tasks. However, single fine-tuned models often fall short of meeting the diverse needs of users. Model merging has thus emerged as an efficient method to integrate the capabilities of existing models into a unified model. Nevertheless, existing model merging methods face challenging trade-offs between performance and deployment costs, primarily due to task interference. For the first time, we reveal that task interference is evident in the frequency domain of model parameters, yet current efforts only focus on spatial domain solutions, which are largely ineffective in addressing frequency domain interference. To mitigate the impact of frequency domain interference, we propose FR-Merging, an innovative method that effectively filters harmful frequency domain interference on the backbone with minimal computational overhead. Since performance loss is inevitable with cost-free methods, we propose a lightweight task-specific expert module that dynamically compensates for information loss during merging. This proposed framework, FREE-Merging (FR-Merging with experts), strikes a balanced trade-off between training cost, inference latency, storage requirements, and performance. We demonstrate the effectiveness of both FR-Merging and FREE-Merging on multiple tasks across CV, NLP, and Multi-Modal domains and show that they can be flexibly adapted to specific needs.
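A minimal sketch of frequency-domain filtering of task vectors before merging, where a simple low-pass mask stands in for FR-Merging's filter; the keep fraction and the plain averaging rule are illustrative assumptions.

import torch

def fr_filter(base, finetuned, keep_frac=0.7):
    # Filter a task vector (finetuned - base) in the frequency domain.
    delta = (finetuned - base).flatten().float()
    spec = torch.fft.rfft(delta)
    cutoff = int(keep_frac * spec.numel())
    spec[cutoff:] = 0                                  # drop high-frequency components
    return torch.fft.irfft(spec, n=delta.numel()).reshape(base.shape)

def merge_models(base_state, finetuned_states, keep_frac=0.7):
    # Average the filtered task vectors on top of the shared backbone.
    return {name: p + sum(fr_filter(p, fs[name], keep_frac)
                          for fs in finetuned_states) / len(finetuned_states)
            for name, p in base_state.items()}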
[359] Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering
Imad Eddine Marouf, Enzo Tartaglione, Stephane Lathuiliere, Joost van de Weijer
Main category: cs.CV
TL;DR: QUAD introduces a novel method for Continual Learning in Visual Question Answering (VQACL) using question-only replay and attention distillation, reducing memory and privacy issues while outperforming existing methods.
Details
Motivation: The challenge of balancing plasticity and stability in multimodal VQACL, where existing unimodal methods fail, motivates the need for a specialized approach.
Method: QUAD uses Question-only Replay to avoid overfitting and Attention Consistency Distillation to maintain visual-linguistic associations across tasks.
Result: QUAD outperforms state-of-the-art methods on VQAv2 and NExT-QA, demonstrating robust continual VQA performance.
Conclusion: QUAD effectively addresses the dual requirements of plasticity and stability in VQACL, offering a memory-efficient and privacy-conscious solution.
Abstract: Continual Learning in Visual Question Answering (VQACL) requires models to acquire new visual-linguistic skills (plasticity) while preserving previously learned knowledge (stability). The inherent multimodality of VQACL exacerbates this challenge, as models must balance stability across visual and textual domains while adapting to novel objects and reasoning tasks. Existing methods, primarily designed for unimodal settings, often fall short in addressing this dual requirement. In this work, we present QUestion-only replay with Attention Distillation (QUAD), a novel approach for VQACL that leverages only past task questions for regularization. By eliminating the need to store visual data, QUAD not only reduces memory overhead, but also alleviates privacy concerns. Our method introduces a Question-only Replay mechanism that selectively reuses prior task questions to counteract overfitting to the answer space of the current task, addressing the out-of-answer-set problem. Complementing this, we propose Attention Consistency Distillation to enforce both intra-modal and inter-modal attention consistency across tasks, preserving essential visual-linguistic associations. Extensive experiments on VQAv2 and NExT-QA demonstrate that QUAD significantly outperforms state-of-the-art methods, achieving robust performance in continual VQA. Code is available at: https://github.com/IemProg/QUAD.
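Reading the two mechanisms as losses gives roughly the sketch below; the attention-map interfaces, the replay source, and the weighting lam are assumptions rather than the paper's exact objective.

import torch
import torch.nn.functional as F

def quad_step(model, prev_model, batch, replay_questions, lam=1.0):
    # Current-task VQA loss; the model is assumed to also return attention maps.
    logits, _ = model(batch["image"], batch["question"])
    task_loss = F.cross_entropy(logits, batch["answer"])

    # Question-only replay: old-task questions are paired with current images,
    # so no visual data from past tasks needs to be stored.
    _, attn = model(batch["image"], replay_questions)
    with torch.no_grad():
        _, attn_prev = prev_model(batch["image"], replay_questions)
    distill_loss = F.mse_loss(attn, attn_prev)     # attention consistency

    return task_loss + lam * distill_loss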
[360] MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding
Rongchang Xie, Chen Du, Ping Song, Chang Liu
Main category: cs.CV
TL;DR: MUSE-VL introduces Semantic Discrete Encoding (SDE) to align visual and language tokens, reducing training data needs and improving performance in multimodal tasks.
Details
Motivation: Existing vision tokenizers lack semantic alignment with language tokens, leading to high training complexity and subpar performance compared to dedicated models.
Method: Proposes Semantic Discrete Encoding (SDE) to add semantic constraints to visual tokenizers, enhancing alignment with language tokens.
Result: Improves understanding performance by 4.8% over SOTA Emu3 and surpasses LLaVA-NeXT 34B by 3.7%. Also outperforms existing unified models in visual generation.
Conclusion: SDE effectively unifies vision-language tasks, reducing data needs and boosting performance, making MUSE-VL a strong contender in multimodal models.
Abstract: We introduce MUSE-VL, a Unified Vision-Language Model through Semantic discrete Encoding for multimodal understanding and generation. Recently, the research community has begun exploring unified models for visual generation and understanding. However, existing vision tokenizers (e.g., VQGAN) only consider low-level information, which makes it difficult to align with language tokens. This results in high training complexity and necessitates a large amount of training data to achieve optimal performance. Additionally, their performance is still far from dedicated understanding models. This paper proposes Semantic Discrete Encoding (SDE), which effectively aligns the information of visual tokens and language tokens by adding semantic constraints to the visual tokenizer. This greatly reduces the amount of training data and improves the performance of the unified model. With the same LLM size, our method improved the understanding performance by 4.8% compared to the previous SOTA Emu3 and surpassed the dedicated understanding model LLaVA-NeXT 34B by 3.7%. Our model also surpasses the existing unified models on visual generation benchmarks.
[361] A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision
Chensheng Peng, Ido Sobol, Masayoshi Tomizuka, Kurt Keutzer, Chenfeng Xu, Or Litany
Main category: cs.CV
TL;DR: A framework for training 3D diffusion models using 2D supervision, addressing the lack of large-scale 3D datasets by leveraging sparse-view supervision and decoupling denoising from supervision.
Details
Motivation: The ambiguity in 3D reconstruction from 2D images and the impracticality of full 3D supervision necessitate scalable alternatives for 3D generative modeling.
Method: Uses sparse-view supervision and decouples the noisy 3D samples being denoised from the 2D supervision signal, leveraging predictions from a deterministic teacher model.
Result: Consistently improves upon deterministic models, enabling scalable and high-fidelity 3D generative modeling.
Conclusion: The proposed framework effectively trains 3D diffusion models with 2D supervision, outperforming deterministic approaches.
Abstract: We present a novel framework for training 3D image-conditioned diffusion models using only 2D supervision. Recovering 3D structure from 2D images is inherently ill-posed due to the ambiguity of possible reconstructions, making generative models a natural choice. However, most existing 3D generative models rely on full 3D supervision, which is impractical due to the scarcity of large-scale 3D datasets. To address this, we propose leveraging sparse-view supervision as a scalable alternative. While recent reconstruction models use sparse-view supervision with differentiable rendering to lift 2D images to 3D, they are predominantly deterministic, failing to capture the diverse set of plausible solutions and producing blurry predictions in uncertain regions. A key challenge in training 3D diffusion models with 2D supervision is that the standard training paradigm requires both the denoising process and supervision to be in the same modality. We address this by decoupling the noisy samples being denoised from the supervision signal, allowing the former to remain in 3D while the latter is provided in 2D. Our approach leverages suboptimal predictions from a deterministic image-to-3D model, acting as a “teacher”, to generate noisy 3D inputs, enabling effective 3D diffusion training without requiring full 3D ground truth. We validate our framework on both object-level and scene-level datasets, using two different 3D Gaussian Splat (3DGS) teachers. Our results show that our approach consistently improves upon these deterministic teachers, demonstrating its effectiveness in scalable and high-fidelity 3D generative modeling. See our project page at https://lesson-in-splats.github.io/
[362] MATE: Motion-Augmented Temporal Consistency for Event-based Point Tracking
Han Han, Wei Zhai, Yang Cao, Bin Li, Zheng-jun Zha
Main category: cs.CV
TL;DR: An event-based framework for tracking any point (TAP) addresses challenges of spatial sparsity and motion sensitivity in event cameras, improving tracking accuracy and outperforming existing methods.
Details
Motivation: Traditional video-based TAP methods fail under large displacements or nonlinear motion due to assumptions of linear motion between frames. Event cameras offer high temporal resolution and motion blur-free data, enabling finer motion analysis.
Method: The framework includes a motion-guidance module for handling event sparsity and a variable motion-aware module for consistent responses to varying velocities, enhancing local matching precision.
Result: The method improves the Survival50 metric by 17.9% over baseline event-only tracking and outperforms all existing methods on standard benchmarks, including hybrid event-video approaches.
Conclusion: The proposed event-based TAP framework effectively leverages event camera advantages, achieving superior tracking performance and robustness in challenging motion scenarios.
Abstract: Tracking Any Point (TAP) plays a crucial role in motion analysis. Video-based approaches rely on iterative local matching for tracking, but they assume linear motion during the blind time between frames, which leads to point loss under large displacements or nonlinear motion. The high temporal resolution and motion blur-free characteristics of event cameras provide continuous, fine-grained motion information, capturing subtle variations with microsecond precision. This paper presents an event-based framework for tracking any point, which tackles the challenges posed by spatial sparsity and motion sensitivity in events through two tailored modules. Specifically, to resolve ambiguities caused by event sparsity, a motion-guidance module incorporates kinematic vectors into the local matching process. Additionally, a variable motion-aware module is integrated to ensure temporally consistent responses that are insensitive to varying velocities, thereby enhancing matching precision. To validate the effectiveness of the approach, two event datasets for tracking any point are constructed by simulation. The method improves the $Survival_{50}$ metric by 17.9% over the event-only tracking of any point baseline. Moreover, on standard feature tracking benchmarks, it outperforms all existing methods, even those that combine events and video frames.
[363] BadPatch: Diffusion-Based Generation of Physical Adversarial Patches
Zhixiang Wang, Xingjun Ma, Yu-Gang Jiang
Main category: cs.CV
TL;DR: BadPatch is a diffusion-based framework for creating customizable, natural-looking adversarial patches that balance stealthiness and attack effectiveness, outperforming existing methods.
Details
Motivation: Existing adversarial patches are often not stealthy or customizable, limiting their practical use. BadPatch aims to address these gaps.
Method: Uses Null-text inversion and Incomplete Diffusion Optimization (IDO) to generate patches from reference images, allowing varied shapes and preserving semantics.
Result: Achieves attack performance comparable to non-naturalistic patches while maintaining a natural appearance. Introduces AdvT-shirt-1K dataset.
Conclusion: BadPatch offers a flexible, effective solution for adversarial patches, with potential applications in defense method development.
Abstract: Physical adversarial patches printed on clothing can enable individuals to evade person detectors, but most existing methods prioritize attack effectiveness over stealthiness, resulting in aesthetically unpleasing patches. While generative adversarial networks and diffusion models can produce more natural-looking patches, they often fail to balance stealthiness with attack effectiveness and lack flexibility for user customization. To address these limitations, we propose BadPatch, a novel diffusion-based framework for generating customizable and naturalistic adversarial patches. Our approach allows users to start from a reference image (rather than random noise) and incorporates masks to create patches of various shapes, not limited to squares. To preserve the original semantics during the diffusion process, we employ Null-text inversion to map random noise samples to a single input image and generate patches through Incomplete Diffusion Optimization (IDO). Our method achieves attack performance comparable to state-of-the-art non-naturalistic patches while maintaining a natural appearance. Using BadPatch, we construct AdvT-shirt-1K, the first physical adversarial T-shirt dataset comprising over a thousand images captured in diverse scenarios. AdvT-shirt-1K can serve as a useful dataset for training or testing future defense methods.
[364] Continual Low-Rank Scaled Dot-product Attention
Ginés Carreto Picón, Illia Oleksiienko, Lukas Hedegaard, Arian Bakhtiarnia, Alexandros Iosifidis
Main category: cs.CV
TL;DR: The paper introduces a Nyström approximation-based Scaled Dot-product Attention for Transformers, reducing computational costs while maintaining performance in continual inference tasks like Online Audio Classification and Online Action Detection.
Details
Motivation: Transformers' high computational and memory demands hinder their use in stream data processing with latency and resource constraints. Existing methods to reduce costs are insufficient for continual inference.
Method: Proposes a Continual Scaled Dot-product Attention using Nyström approximation, tailored for continual inference tasks.
Result: Achieves up to three orders of magnitude reduction in operations compared to original Transformers, with no loss in predictive performance.
Conclusion: The new formulation enables efficient Transformer use in continual inference tasks, balancing computational efficiency and performance.
Abstract: Transformers are widely used for their ability to capture data relations in sequence processing, with great success for a wide range of static tasks. However, the computational and memory footprint of their main component, i.e., the Scaled Dot-product Attention, is commonly overlooked. This makes their adoption in applications involving stream data processing with constraints in response latency, computational and memory resources infeasible. Some works have proposed methods to lower the computational cost of Transformers, i.e., low-rank approximations, sparsity in attention, and efficient formulations for Continual Inference. In this paper, we introduce a new formulation of the Scaled Dot-product Attention based on the Nyström approximation that is suitable for Continual Inference. In experiments on Online Audio Classification and Online Action Detection tasks, the proposed Continual Scaled Dot-product Attention can lower the number of operations by up to three orders of magnitude compared to the original Transformers while retaining the predictive performance of competing models.
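For background, a standard (non-continual) Nyström approximation of scaled dot-product attention looks like the sketch below; mean-pooled landmarks and the pseudo-inverse are common choices, and the paper's continual update rules are omitted.

import torch

def nystrom_attention(q, k, v, n_landmarks=32):
    # q, k, v: (B, N, d); assumes N is divisible by n_landmarks.
    b, n, d = q.shape
    m = n_landmarks
    q_l = q.reshape(b, m, n // m, d).mean(dim=2)    # landmark queries
    k_l = k.reshape(b, m, n // m, d).mean(dim=2)    # landmark keys
    scale = d ** -0.5
    f = torch.softmax(q @ k_l.transpose(-1, -2) * scale, dim=-1)    # (B, N, m)
    a = torch.softmax(q_l @ k_l.transpose(-1, -2) * scale, dim=-1)  # (B, m, m)
    g = torch.softmax(q_l @ k.transpose(-1, -2) * scale, dim=-1)    # (B, m, N)
    # Three small matrices replace the full (N, N) attention map.
    return f @ torch.linalg.pinv(a) @ (g @ v)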
[365] CopyJudge: Automated Copyright Infringement Identification and Mitigation in Text-to-Image Diffusion Models
Shunchang Liu, Zhuan Shi, Lingjuan Lyu, Yaochu Jin, Boi Faltings
Main category: cs.CV
TL;DR: CopyJudge is an automated framework using LVLMs to assess copyright infringement in AI-generated images, offering identification and mitigation strategies.
Details
Motivation: To resolve copyright disputes by determining substantial similarity between AI-generated and copyrighted images.
Method: Uses an abstraction-filtration-comparison test with multi-LVLM debate for infringement assessment and introduces a mitigation strategy via prompt optimization and noise vector exploration.
Result: Achieves state-of-the-art performance in infringement identification and effectively mitigates memorization and IP infringement.
Conclusion: CopyJudge provides a robust, interpretable, and generalizable solution for copyright infringement in AI-generated images.
Abstract: Assessing whether AI-generated images are substantially similar to source works is a crucial step in resolving copyright disputes. In this paper, we propose CopyJudge, a novel automated infringement identification framework that leverages large vision-language models (LVLMs) to simulate practical court processes for determining substantial similarity between copyrighted images and those generated by text-to-image diffusion models. Specifically, we employ an abstraction-filtration-comparison test framework based on the multi-LVLM debate to assess the likelihood of infringement and provide detailed judgment rationales. Based on these judgments, we further introduce a general LVLM-based mitigation strategy that automatically optimizes infringing prompts by avoiding sensitive expressions while preserving the non-infringing content. Furthermore, assuming the input noise is controllable, our approach can be enhanced by iteratively exploring non-infringing noise vectors within the diffusion latent space, even without modifying the original prompts. Experimental results show that our automated identification method achieves performance comparable to the state of the art, while offering superior generalization and interpretability across various forms of infringement, and that our mitigation method more effectively mitigates memorization and IP infringement with a high degree of alignment to the original non-infringing expressions.
[366] MaterialPicker: Multi-Modal DiT-Based Material Generation
Xiaohe Ma, Valentin Deschaintre, Miloš Hašan, Fujun Luan, Kun Zhou, Hongzhi Wu, Yiwei Hu
Main category: cs.CV
TL;DR: MaterialPicker uses a Diffusion Transformer (DiT) to generate high-quality materials from text or images, handling distortions and occlusions.
Details
Motivation: Simplify and improve material generation for virtual environments and inverse rendering.
Method: Finetune a DiT-based video generator to treat material maps as video frames, enabling generation from text or images.
Result: Produces diverse materials with better distortion correction than prior work.
Conclusion: MaterialPicker advances material generation by combining multi-modal inputs and leveraging DiT architecture.
Abstract: High-quality material generation is key for virtual environment authoring and inverse rendering. We propose MaterialPicker, a multi-modal material generator leveraging a Diffusion Transformer (DiT) architecture, improving and simplifying the creation of high-quality materials from text prompts and/or photographs. Our method can generate a material based on an image crop of a material sample, even if the captured surface is distorted, viewed at an angle or partially occluded, as is often the case in photographs of natural scenes. We further allow the user to specify a text prompt to provide additional guidance for the generation. We finetune a pre-trained DiT-based video generator into a material generator, where each material map is treated as a frame in a video sequence. We evaluate our approach both quantitatively and qualitatively and show that it enables more diverse material generation and better distortion correction than previous work.
[367] Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion
Shengyuan Zhang, An Zhao, Ling Yang, Zejian Li, Chenye Meng, Haoran Xu, Tianrun Chen, AnYang Wei, Perry Pengyun GU, Lingyun Sun
Main category: cs.CV
TL;DR: ScoreLiDAR is a novel distillation method for 3D LiDAR scene completion, improving speed and quality by reducing sampling steps and introducing a Structural Loss.
Details
Motivation: Diffusion models for 3D LiDAR scene completion are slow, hindering practical use in autonomous vehicles.
Method: Proposes ScoreLiDAR, a distillation method with a Structural Loss (scene-wise and point-wise terms) to enhance efficiency and quality.
Result: Achieves >5x speedup (5.37s vs. 30.55s per frame) and superior performance on SemanticKITTI.
Conclusion: ScoreLiDAR enables efficient, high-quality 3D LiDAR scene completion, advancing practical applications.
Abstract: Diffusion models have been applied to 3D LiDAR scene completion due to their strong training stability and high completion quality. However, the slow sampling speed limits the practical application of diffusion-based scene completion models since autonomous vehicles require an efficient perception of surrounding environments. This paper proposes a novel distillation method tailored for 3D LiDAR scene completion models, dubbed ScoreLiDAR, which achieves efficient yet high-quality scene completion. ScoreLiDAR enables the distilled model to sample in significantly fewer steps after distillation. To improve completion quality, we also introduce a novel Structural Loss, which encourages the distilled model to capture the geometric structure of the 3D LiDAR scene. The loss contains a scene-wise term constraining the holistic structure and a point-wise term constraining the key landmark points and their relative configuration. Extensive experiments demonstrate that ScoreLiDAR significantly accelerates the completion time from 30.55 to 5.37 seconds per frame (>5x) on SemanticKITTI and achieves superior performance compared to state-of-the-art 3D LiDAR scene completion models. Our model and code are publicly available on https://github.com/happyw1nd/ScoreLiDAR.
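The two-term Structural Loss can be sketched with a Chamfer-style scene term and a pairwise-distance term on key landmark points; both concrete forms here are assumptions about the paper's formulation.

import torch

def chamfer(a, b):
    # a: (N, 3), b: (M, 3) point sets.
    d = torch.cdist(a, b)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def structural_loss(student_pts, teacher_pts, key_idx, lam=0.5):
    scene_term = chamfer(student_pts, teacher_pts)        # holistic structure
    s, t = student_pts[key_idx], teacher_pts[key_idx]
    # Match the relative configuration of key landmark points.
    point_term = (torch.cdist(s, s) - torch.cdist(t, t)).abs().mean()
    return scene_term + lam * point_term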
[368] Versatile Multimodal Controls for Expressive Talking Human Animation
Zheng Qin, Ruobing Zheng, Yabing Wang, Tianqi Li, Zixin Zhu, Sanping Zhou, Ming Yang, Le Wang
Main category: cs.CV
TL;DR: VersaAnimator is a framework for generating expressive talking human videos from images, using audio and text prompts for control, and ensuring realistic lip sync and body movements.
Details
Motivation: The need for AI-generated content to not only produce basic gestures and lip sync but also allow direct guidance for expressive and semantically accurate body movements.
Method: A motion generator creates rhythmic movements from audio and supports text-prompt control. A multi-modal video diffusion ensures photorealistic output, and a token2pose translator maps 3D motion to 2D poses smoothly.
Result: VersaAnimator produces lip-synced, identity-preserving videos with expressive and meaningful whole-body motions.
Conclusion: VersaAnimator effectively addresses the challenge of generating guided, expressive, and realistic human animations.
Abstract: In filmmaking, directors typically allow actors to perform freely based on the script before providing specific guidance on how to present key actions. AI-generated content faces similar requirements, where users not only need automatic generation of lip synchronization and basic gestures from audio input but also desire semantically accurate and expressive body movement that can be “directly guided” through text descriptions. Therefore, we present VersaAnimator, a versatile framework that synthesizes expressive talking human videos from arbitrary portrait images. Specifically, we design a motion generator that produces basic rhythmic movements from audio input and supports text-prompt control for specific actions. The generated whole-body 3D motion tokens can animate portraits of various scales, producing talking heads, half-body gestures and even leg movements for whole-body images. Besides, we introduce a multi-modal controlled video diffusion that generates photorealistic videos, where speech signals govern lip synchronization, facial expressions, and head motions while body movements are guided by the 2D poses. Furthermore, we introduce a token2pose translator to smoothly map 3D motion tokens to 2D pose sequences. This design mitigates the stiffness resulting from direct 3D to 2D conversion and enhances the details of the generated body movements. Extensive experiments show that VersaAnimator synthesizes lip-synced and identity-preserving videos while generating expressive and semantically meaningful whole-body motions.
[369] Generalizable Targeted Data Poisoning against Varying Physical Objects
Zhizhen Chen, Zhengyu Zhao, Subrat Kishore Dutta, Chenhao Lin, Chao Shen, Xiao Zhang
Main category: cs.CV
TL;DR: The paper addresses the limitations of existing targeted data poisoning (TDP) methods by improving generalizability across varying physical conditions through optimized gradient direction and magnitude.
Details
Motivation: Existing TDP methods assume an ideal threat model with identical target images during poisoning and inference, which doesn't reflect real-world variations like viewpoint, background, and lighting changes.
Method: The proposed method optimizes both gradient direction and magnitude for more generalizable gradient matching, enhancing poisoning success rates.
Result: The method outperforms the state-of-the-art by 19.49% in poisoning CIFAR-10 images targeting multi-view cars.
Conclusion: Optimizing gradient direction and magnitude significantly improves TDP generalizability, making it more effective in real-world scenarios.
Abstract: Targeted data poisoning (TDP) aims to compromise the model’s prediction on a specific (test) target by perturbing a small subset of training data. Existing work on TDP has focused on an overly ideal threat model in which the same image sample of the target is used during both poisoning and inference stages. However, in the real world, a target object often appears in complex variations due to changes of physical settings such as viewpoint, background, and lighting conditions. In this work, we take the first step toward understanding the real-world threats of TDP by studying its generalizability across varying physical conditions. In particular, we observe that solely optimizing gradient directions, as adopted by the best previous TDP method, achieves limited generalization. To address this limitation, we propose optimizing both the gradient direction and magnitude for more generalizable gradient matching, thereby leading to higher poisoning success rates. For instance, our method outperforms the state of the art by 19.49% when poisoning CIFAR-10 images targeting multi-view cars.
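The stated idea, matching magnitude as well as direction, corresponds to an objective like the one below; the equal weighting and the magnitude term's normalization are assumptions.

import torch
import torch.nn.functional as F

def matching_loss(poison_grads, target_grads):
    # Flatten per-parameter gradients into single vectors.
    p = torch.cat([g.flatten() for g in poison_grads])
    t = torch.cat([g.flatten() for g in target_grads])
    direction = 1 - F.cosine_similarity(p, t, dim=0)             # angle between gradients
    magnitude = (p.norm() - t.norm()).abs() / (t.norm() + 1e-8)  # relative norm gap
    return direction + magnitude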
[370] Representing 3D Shapes With 64 Latent Vectors for 3D Diffusion Models
In Cho, Youngbeom Yoo, Subin Jeon, Seon Joo Kim
Main category: cs.CV
TL;DR: COD-VAE introduces a compact 1D latent space for 3D shapes via a two-stage autoencoder, improving compression and decoding efficiency while maintaining quality.
Details
Motivation: Efficient 3D diffusion models require a compressed latent space without quality loss, motivating the development of COD-VAE.
Method: A two-stage autoencoder: progressive compression of point clouds into compact latent vectors and triplane-based decoding. Uncertainty-guided token pruning further enhances efficiency.
Result: Achieves 16x compression and 20.8x speedup in generation without quality loss.
Conclusion: COD-VAE demonstrates that high-quality 3D reconstruction and generation don’t require many latent vectors, enabling efficient performance.
Abstract: Constructing a compressed latent space through a variational autoencoder (VAE) is the key for efficient 3D diffusion models. This paper introduces COD-VAE that encodes 3D shapes into a COmpact set of 1D latent vectors without sacrificing quality. COD-VAE introduces a two-stage autoencoder scheme to improve compression and decoding efficiency. First, our encoder block progressively compresses point clouds into compact latent vectors via intermediate point patches. Second, our triplane-based decoder reconstructs dense triplanes from latent vectors instead of directly decoding neural fields, significantly reducing computational overhead of neural fields decoding. Finally, we propose uncertainty-guided token pruning, which allocates resources adaptively by skipping computations in simpler regions and improves the decoder efficiency. Experimental results demonstrate that COD-VAE achieves 16x compression compared to the baseline while maintaining quality. This enables 20.8x speedup in generation, highlighting that a large number of latent vectors is not a prerequisite for high-quality reconstruction and generation. The code is available at https://github.com/join16/COD-VAE.
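Uncertainty-guided token pruning can be pictured as keeping only the tokens a small head flags as uncertain (complex regions) for full decoding; the head, keep ratio, and gather logic are illustrative assumptions.

import torch

def prune_tokens(tokens, uncertainty_head, keep_ratio=0.5):
    # tokens: (B, T, C); keep the most uncertain tokens, skip the rest.
    u = uncertainty_head(tokens).squeeze(-1)        # (B, T) predicted uncertainty
    k = max(1, int(keep_ratio * tokens.shape[1]))
    idx = u.topk(k, dim=1).indices                  # complex regions to decode fully
    kept = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
    return kept, idx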
[371] Chimera: Improving Generalist Model with Domain-Specific Experts
Tianshuo Peng, Mingsheng Li, Jiakang Yuan, Hongbin Zhou, Renqiu Xia, Renrui Zhang, Lei Bai, Song Mao, Bin Wang, Aojun Zhou, Botian Shi, Tao Chen, Bo Zhang, Xiangyu Yue
Main category: cs.CV
TL;DR: Chimera enhances Large Multi-modal Models (LMMs) by integrating domain-specific experts through a progressive training strategy and a novel masking mechanism, achieving top performance in specialized tasks.
Details
Motivation: Generalist LMMs lack specialized capabilities for domain-specific tasks due to training on natural images. Integrating expert models is challenging due to representational gaps and imbalanced optimization.
Method: Chimera uses a progressive training strategy to integrate expert features into LMMs and introduces a Generalist-Specialist Collaboration Masking (GSCM) mechanism to balance optimization.
Result: Chimera achieves state-of-the-art performance in multi-modal reasoning and visual content extraction tasks across specialized domains like charts, tables, math, and documents.
Conclusion: Chimera effectively bridges the gap between generalist LMMs and domain-specific experts, offering scalable and low-cost improvements for specialized tasks.
Abstract: Recent advancements in Large Multi-modal Models (LMMs) underscore the importance of scaling by increasing image-text paired data, achieving impressive performance on general tasks. Despite their effectiveness in broad applications, generalist models are primarily trained on web-scale datasets dominated by natural images, resulting in the sacrifice of specialized capabilities for domain-specific tasks that require extensive domain prior knowledge. Moreover, directly integrating expert models tailored for specific domains is challenging due to the representational gap and imbalanced optimization between the generalist model and experts. To address these challenges, we introduce Chimera, a scalable and low-cost multi-modal pipeline designed to boost the ability of existing LMMs with domain-specific experts. Specifically, we design a progressive training strategy to integrate features from expert models into the input of a generalist LMM. To address the imbalanced optimization caused by the well-aligned general visual encoder, we introduce a novel Generalist-Specialist Collaboration Masking (GSCM) mechanism. This results in a versatile model that excels across the chart, table, math, and document domains, achieving state-of-the-art performance on multi-modal reasoning and visual content extraction tasks, both of which are challenging tasks for assessing existing LMMs.
[372] Are ECGs enough? Deep learning classification of pulmonary embolism using electrocardiograms
Joao D. S. Marques, Arlindo L. Oliveira
Main category: cs.CV
TL;DR: The study explores neural networks and transfer learning to improve ECG-based pulmonary embolism diagnosis, leveraging larger datasets to enhance performance on smaller, limited PE datasets.
Details
Motivation: Pulmonary embolism (PE) diagnosis via ECG is challenging due to limited public datasets. The study aims to optimize learning strategies for better generalization.
Method: Multiple neural networks are tested, and transfer learning is applied using larger ECG datasets (PTB-XL, CPSC18, MedalCare-XL) to improve performance on smaller PE datasets.
Result: The study evaluates the impact of transfer learning on learning efficiency and predictive performance for PE diagnosis.
Conclusion: Transfer learning can enhance ECG-based PE diagnosis by leveraging larger datasets, improving generalization on limited data.
Abstract: Pulmonary embolism (PE) is a leading cause of out-of-hospital cardiac arrest that requires fast diagnosis. While computed tomography pulmonary angiography is the standard diagnostic tool, it is not always accessible. Electrocardiography is an essential tool for diagnosing multiple cardiac anomalies, as it is affordable, fast, and available in many settings. However, the availability of public ECG datasets, especially for PE, is limited and, in practice, these datasets tend to be small, making it essential to optimize learning strategies. In this study, we investigate the performance of multiple neural networks in order to assess the impact of various approaches. Moreover, we check whether these practices enhance model generalization when transfer learning is used to translate information learned in larger ECG datasets, such as PTB-XL, CPSC18, and MedalCare-XL, to a smaller, more challenging dataset for PE. By leveraging transfer learning, we analyze the extent to which we can improve learning efficiency and predictive performance on limited data. Code available at https://github.com/joaodsmarques/Are-ECGs-enough-Deep-Learning-Classifiers.
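A minimal sketch of the transfer-learning recipe described above: pretrain a backbone on a large ECG corpus, then reuse it with a new binary PE head fine-tuned at a lower backbone learning rate. The tiny 1D-CNN, the 71-class pretraining head, and the learning rates are placeholder assumptions, not the authors' architectures.

```python
import torch
import torch.nn as nn

# Hypothetical 1D-CNN backbone for 12-lead ECG; the authors' networks differ.
class ECGBackbone(nn.Module):
    def __init__(self, in_leads=12, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_leads, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, x):  # x: (batch, leads, samples)
        return self.net(x)

backbone = ECGBackbone()

# 1) Pretrain backbone + multi-label head on a large corpus such as PTB-XL
#    (71 is a placeholder class count).
pretrain_head = nn.Linear(128, 71)

# 2) Transfer: keep the backbone weights, swap in a binary PE head, and
#    fine-tune with a smaller learning rate on the backbone.
pe_head = nn.Linear(128, 1)
optimizer = torch.optim.Adam([
    {"params": backbone.parameters(), "lr": 1e-5},  # gentle: preserve features
    {"params": pe_head.parameters(), "lr": 1e-3},   # larger: new head from scratch
])
```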
[373] TriDi: Trilateral Diffusion of 3D Humans, Objects, and Interactions
Ilya A. Petrov, Riccardo Marin, Julian Chibane, Gerard Pons-Moll
Main category: cs.CV
TL;DR: TriDi is a unified model for 3D human-object interaction (HOI) that simultaneously generates human, object, and interaction modalities using a three-way diffusion process, outperforming specialized baselines.
Details
Motivation: Existing methods for 3D HOI are one-directional, limiting flexibility. TriDi aims to unify these approaches and extend capabilities.
Method: TriDi uses a transformer-based three-way diffusion process to model seven distributions with one network, incorporating text descriptions or contact maps for control.
Result: TriDi surpasses specialized baselines on GRAB and BEHAVE datasets in quality, diversity, and generalization to unseen objects.
Conclusion: TriDi provides a versatile, unified solution for 3D HOI, enabling applications like scene population and dataset generation.
Abstract: Modeling 3D human-object interaction (HOI) is a problem of great interest for computer vision and a key enabler for virtual and mixed-reality applications. Existing methods work in a one-way direction: some recover plausible human interactions conditioned on a 3D object; others recover the object pose conditioned on a human pose. Instead, we provide the first unified model, TriDi, which works in any direction. Concretely, we generate the Human, Object, and Interaction modalities simultaneously with a new three-way diffusion process, allowing us to model seven distributions with one network. We implement TriDi as a transformer attending to the various modalities’ tokens, thereby discovering conditional relations between them. The user can control the interaction either as a text description of the HOI or a contact map. We embed these two representations into a shared latent space, combining the practicality of text descriptions with the expressiveness of contact maps. Using a single network, TriDi unifies all the special cases of prior work and extends to new ones, modeling a family of seven distributions. Remarkably, despite using a single model, samples generated by TriDi surpass one-way specialized baselines on GRAB and BEHAVE in both qualitative and quantitative metrics, while demonstrating better diversity. We show the applicability of TriDi to scene population, generating objects for human-contact datasets, and generalization to unseen object geometry. The project page is available at: https://virtualhumans.mpi-inf.mpg.de/tridi.
[374] Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism
Jun Zheng, Jing Wang, Fuwei Zhao, Xujie Zhang, Xiaodan Liang
Main category: cs.CV
TL;DR: The paper introduces Dynamic Try-On, a video try-on framework using Diffusion Transformer (DiT) to address challenges like computational efficiency and temporal consistency in complex movements.
Details
Motivation: Previous video try-on methods struggle with complex poses and high computational costs due to additional garment encoders. The goal is to improve efficiency and consistency.
Method: The proposed framework uses DiT as the garment encoder, a dynamic feature fusion module, and a limb-aware dynamic attention module to enhance focus on limbs during denoising.
Result: Experiments show Dynamic Try-On generates stable, smooth results even for videos with complex postures.
Conclusion: The framework successfully balances computational efficiency and temporal consistency, outperforming prior methods in handling complex movements.
Abstract: Video try-on stands as a promising area for its tremendous real-world potential. Previous research on video try-on has primarily focused on transferring product clothing images to videos with simple human poses, while performing poorly with complex movements. To better preserve clothing details, those approaches are armed with an additional garment encoder, resulting in higher computational resource consumption. The primary challenges in this domain are twofold: (1) leveraging the garment encoder’s capabilities in video try-on while lowering computational requirements; (2) ensuring temporal consistency in the synthesis of human body parts, especially during rapid movements. To tackle these issues, we propose a novel video try-on framework based on the Diffusion Transformer (DiT), named Dynamic Try-On. To reduce computational overhead, we adopt a straightforward approach by utilizing the DiT backbone itself as the garment encoder and employing a dynamic feature fusion module to store and integrate garment features. To ensure temporal consistency of human body parts, we introduce a limb-aware dynamic attention module that enforces the DiT backbone to focus on the regions of human limbs during the denoising process. Extensive experiments demonstrate the superiority of Dynamic Try-On in generating stable and smooth try-on results, even for videos featuring complicated human postures.
[375] Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation
Yujie Zhang, Bingyang Cui, Qi Yang, Zhu Li, Yiling Xu
Main category: cs.CV
TL;DR: The paper introduces MATE-3D, a comprehensive benchmark for evaluating text-to-3D generation methods, and HyperScore, a novel multi-dimensional quality evaluator.
Details
Motivation: Existing benchmarks and metrics for text-to-3D generation lack fine-grained and multi-dimensional evaluation, limiting progress in the field.
Method: The authors propose MATE-3D, a benchmark with eight prompt categories and 1,280 generated meshes, and HyperScore, a hypernetwork-based evaluator for multi-dimensional quality assessment.
Result: HyperScore outperforms existing metrics on MATE-3D, demonstrating superior performance in multi-dimensional evaluation.
Conclusion: MATE-3D and HyperScore provide a robust framework for assessing and improving text-to-3D generation, addressing previous limitations in evaluation.
Abstract: Text-to-3D generation has achieved remarkable progress in recent years, yet evaluating these methods remains challenging for two reasons: i) Existing benchmarks lack fine-grained evaluation on different prompt categories and evaluation dimensions. ii) Previous evaluation metrics only focus on a single aspect (e.g., text-3D alignment) and fail to perform multi-dimensional quality assessment. To address these problems, we first propose a comprehensive benchmark named MATE-3D. The benchmark contains eight well-designed prompt categories that cover single and multiple object generation, resulting in 1,280 generated textured meshes. We have conducted a large-scale subjective experiment from four different evaluation dimensions and collected 107,520 annotations, followed by detailed analyses of the results. Based on MATE-3D, we propose a novel quality evaluator named HyperScore. Utilizing hypernetwork to generate specified mapping functions for each evaluation dimension, our metric can effectively perform multi-dimensional quality assessment. HyperScore presents superior performance over existing metrics on MATE-3D, making it a promising metric for assessing and improving text-to-3D generation. The project is available at https://mate-3d.github.io/.
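To illustrate the hypernetwork idea behind HyperScore, here is a minimal sketch in which a small network generates the weights of a dimension-specific linear scorer from a learned dimension embedding. All sizes and the linear form of the scorer are assumptions for illustration; the actual evaluator operates on richer mesh and text features.

```python
import torch
import torch.nn as nn

class HyperScorer(nn.Module):
    """A small hypernetwork produces a dimension-specific linear scorer."""

    def __init__(self, num_dims=4, feat_dim=256, emb_dim=32):
        super().__init__()
        self.dim_emb = nn.Embedding(num_dims, emb_dim)
        # Generates the weights (+ bias) of a linear map feat_dim -> 1.
        self.hyper = nn.Linear(emb_dim, feat_dim + 1)

    def forward(self, features, dim_idx):
        params = self.hyper(self.dim_emb(dim_idx))   # (B, feat_dim + 1)
        w, b = params[:, :-1], params[:, -1]
        return (features * w).sum(dim=-1) + b        # one score per sample

scorer = HyperScorer()
feats = torch.randn(8, 256)                 # placeholder sample features
dims = torch.randint(0, 4, (8,))            # which evaluation dimension to score
print(scorer(feats, dims).shape)            # torch.Size([8])
```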
[376] Aether: Geometric-Aware Unified World Modeling
Aether Team, Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, Tong He
Main category: cs.CV
TL;DR: Aether is a unified framework for geometry-aware reasoning in AI, combining 4D reconstruction, video prediction, and visual planning, with strong zero-shot generalization.
Details
Motivation: Addressing the challenge of integrating geometric reconstruction and generative modeling for human-like spatial reasoning in AI systems.
Method: Aether jointly optimizes 4D dynamic reconstruction, action-conditioned video prediction, and goal-conditioned visual planning through task-interleaved feature learning.
Result: Achieves zero-shot synthetic-to-real generalization, competitive reconstruction performance without real-world data, and effective action-conditioned tasks.
Conclusion: Aether advances physically-reasonable world modeling, inspiring further exploration in AI spatial reasoning applications.
Abstract: The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates zero-shot synthetic-to-real generalization despite never observing real-world data during training. Furthermore, our approach achieves zero-shot generalization in both action following and reconstruction tasks, thanks to its intrinsic geometric modeling. Notably, even without real-world data, its reconstruction performance is comparable with or even better than that of domain-specific models. Additionally, Aether employs camera trajectories as geometry-informed action spaces, enabling effective action-conditioned prediction and visual planning. We hope our work inspires the community to explore new frontiers in physically-reasonable world modeling and its applications.
[377] StrandHead: Text to Hair-Disentangled 3D Head Avatars Using Human-Centric Priors
Xiaokun Sun, Zeyu Cai, Ying Tai, Jian Yang, Zhenyu Zhang
Main category: cs.CV
TL;DR: StrandHead is a text-driven method for generating 3D hair strands and disentangled head avatars, leveraging 2D generative models and human-centric priors for realistic results.
Details
Motivation: Existing avatar generation methods struggle with realistic hair modeling due to data limitations or entangled representations.
Method: Uses 2D generative models pre-trained on human mesh data, a meshing approach guided by strand geometry, and regularization by haircut features for stable optimization.
Result: Achieves state-of-the-art performance in text-to-strand generation and disentangled 3D head avatar modeling.
Conclusion: StrandHead enables realistic 3D hair generation for avatar editing and graphics applications.
Abstract: While a haircut indicates distinct personality, existing avatar generation methods fail to model practical hair due to data limitations or entangled representations. We propose StrandHead, a novel text-driven method capable of generating 3D hair strands and disentangled head avatars with strand-level attributes. Instead of using large-scale hair-text paired data for supervision, we demonstrate that realistic hair strands can be generated from prompts by distilling 2D generative models pre-trained on human mesh data. To this end, we propose a meshing approach guided by strand geometry to guarantee the gradient flow from the distillation objective to the neural strand representation. The optimization is then regularized by statistically significant haircut features, leading to stable updates of strands and preventing unreasonable drift. These 2D/3D human-centric priors contribute to text-aligned and realistic 3D strand generation. Extensive experiments show that StrandHead achieves state-of-the-art performance on text-to-strand generation and disentangled 3D head avatar modeling. The generated 3D hair can be applied to avatars for strand-level editing, as well as implemented in graphics engines for physical simulation and other applications. Project page: https://xiaokunsun.github.io/StrandHead.github.io/.
[378] Qffusion: Controllable Portrait Video Editing via Quadrant-Grid Attention Learning
Maomao Li, Lijian Lin, Yunfei Liu, Ye Zhu, Yu Li
Main category: cs.CV
TL;DR: Qffusion is a dual-frame-guided framework for portrait video editing, leveraging Stable Diffusion and a Quadrant-grid Arrangement (QGA) scheme for stable and efficient video generation.
Details
Motivation: To simplify portrait video editing by using a general animation framework trained from two still reference images, avoiding complex training stages or additional networks.
Method: Uses QGA for latent re-arrangement, fuses features of reference images and facial conditions, and employs self-attention for appearance and temporal learning. Introduces QGP for stable arbitrary-length video generation.
Result: Qffusion outperforms state-of-the-art techniques in portrait video editing, achieving stable results without extra networks.
Conclusion: Qffusion provides an efficient and stable solution for portrait video editing, leveraging modified Stable Diffusion inputs and recursive processing for arbitrary-length videos.
Abstract: This paper presents Qffusion, a dual-frame-guided framework for portrait video editing. Specifically, we consider a design principle of “animation for editing”, and train Qffusion as a general animation framework from two still reference images, while we can use it for portrait video editing easily by applying modified start and end frames as references during inference. Leveraging the generative power of Stable Diffusion, we propose a Quadrant-grid Arrangement (QGA) scheme for latent re-arrangement, which separately arranges the latent codes of the two reference images and those of four facial conditions into a four-grid layout. Then, we fuse features of these two modalities and use self-attention for both appearance and temporal learning, where representations at different times are jointly modeled under QGA. Our Qffusion can achieve stable video editing without additional networks or complex training stages, where only the input format of Stable Diffusion is modified. Further, we propose a Quadrant-grid Propagation (QGP) inference strategy, which offers a unique advantage for stable arbitrary-length video generation by processing reference and condition frames recursively. Through extensive experiments, Qffusion consistently outperforms state-of-the-art techniques on portrait video editing. Project page: https://qffusion.github.io/page/.
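A minimal sketch of the quadrant-grid intuition: tiling four latent maps into one 2x2 grid so that a single self-attention pass jointly sees all of them. The tensor shapes and the assignment of reference versus condition latents are illustrative assumptions; QGA additionally arranges reference and condition latents separately.

```python
import torch

def quadrant_grid(latents):
    """Tile four latent maps into one 2x2 quadrant grid.

    latents: tensor (4, C, H, W) — e.g. two reference latents plus two
    condition latents (the roles are illustrative).
    Returns (C, 2H, 2W), so one self-attention pass sees all quadrants.
    """
    top = torch.cat([latents[0], latents[1]], dim=-1)     # (C, H, 2W)
    bottom = torch.cat([latents[2], latents[3]], dim=-1)  # (C, H, 2W)
    return torch.cat([top, bottom], dim=-2)               # (C, 2H, 2W)

z = torch.randn(4, 4, 64, 64)
print(quadrant_grid(z).shape)  # torch.Size([4, 128, 128])
```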
[379] Dual Frequency Branch Framework with Reconstructed Sliding Windows Attention for AI-Generated Image Detection
Jiazhen Yan, Ziqiang Li, Fan Wang, Ziwen He, Zhangjie Fu
Main category: cs.CV
TL;DR: The paper proposes a method to detect AI-generated images by combining local feature reconstruction and dual frequency domain analysis, achieving improved accuracy over existing methods.
Details
Motivation: The rise of realistic AI-generated images poses societal risks like misinformation, necessitating better detection methods. Current approaches neglect the interdependencies of elements within local regions and are limited to single-domain frequency analysis.
Method: Uses a sliding window for local attention and feature reconstruction, alongside a dual frequency domain framework (DWT and FFT) to capture forgery traces.
Result: Achieves a 2.13% accuracy improvement over state-of-the-art methods on diverse datasets.
Conclusion: The proposed method enhances detection of AI-generated images by addressing limitations in feature extraction and frequency domain analysis.
Abstract: The rapid advancement of Generative Adversarial Networks (GANs) and diffusion models has enabled the creation of highly realistic synthetic images, presenting significant societal risks, such as misinformation and deception. As a result, detecting AI-generated images has emerged as a critical challenge. Existing research emphasizes extracting fine-grained features to enhance detector generalization, yet it often overlooks the importance and interdependencies of internal elements within local regions and is limited to a single frequency domain, hindering the capture of general forgery traces. To overcome these limitations, we first utilize a sliding window to restrict the attention mechanism to a local window, and reconstruct the features within the window to model the relationships between neighboring internal elements within the local region. Then, we design a dual frequency domain branch framework consisting of the four frequency subbands of the DWT and the phase part of the FFT to enrich the extraction of local forgery features from different perspectives. Through the feature enrichment of the dual frequency domain branches and the fine-grained feature extraction of the reconstructed sliding-window attention, our method achieves superior generalization in detecting images generated by both GANs and diffusion models. Evaluated on diverse datasets comprising images from 65 distinct generative models, our approach achieves a 2.13% improvement in detection accuracy over state-of-the-art methods.
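A minimal sketch of the two frequency-domain inputs such a branch framework consumes: the four subbands of a one-level Haar DWT and the phase spectrum from a 2D FFT, here computed with plain PyTorch tensor ops. The Haar filter choice and normalization are assumptions; the paper's feature branches are more elaborate.

```python
import torch

def haar_dwt_subbands(x):
    """One-level Haar DWT of a (B, C, H, W) tensor (H, W even).

    Returns the four subbands LL, LH, HL, HH, each (B, C, H/2, W/2).
    """
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

def fft_phase(x):
    """Phase spectrum of a (B, C, H, W) tensor via the 2D FFT."""
    return torch.angle(torch.fft.fft2(x))

img = torch.randn(1, 3, 256, 256)
subbands = haar_dwt_subbands(img)   # inputs to the DWT branch
phase = fft_phase(img)              # input to the FFT-phase branch
```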
[380] Spatiotemporal Multi-Camera Calibration using Freely Moving People
Sang-Eun Lee, Ko Nishino, Shohei Nobuhara
Main category: cs.CV
TL;DR: A novel method for spatiotemporal multi-camera calibration using freely moving people in multiview videos, solving rotation, time offset, and association jointly.
Details
Motivation: Calibrating multiple cameras and matching views are interdependent challenges; this method addresses them as a unified registration problem.
Method: Uses 3D human poses from a monocular estimator, transforms them into 3D points on a unit sphere, and solves rotation, time offset, and association alternately with a probabilistic approach.
Result: Effective and flexible marker-free calibration demonstrated on synthetic and real data.
Conclusion: The method provides a practical solution for spatiotemporal multi-camera calibration without markers.
Abstract: We propose a novel method for spatiotemporal multi-camera calibration using freely moving people in multiview videos. Since calibrating multiple cameras and finding matches across their views are inherently interdependent, performing both in a unified framework poses a significant challenge. We address these issues as a single registration problem of matching two sets of 3D points, leveraging human motion in dynamic multi-person scenes. To this end, we utilize 3D human poses obtained from an off-the-shelf monocular 3D human pose estimator and transform them into 3D points on a unit sphere, to solve for the rotation, time offset, and association in an alternating fashion. We employ a probabilistic approach that can jointly solve both problems of aligning spatiotemporal data and establishing correspondences through soft assignment between two views. The translation is determined by applying coplanarity constraints. The pairwise registration results are integrated into a multiview setup, and then a nonlinear optimization method is used to improve the accuracy of the camera poses, temporal offsets, and multi-person associations. Extensive experiments on synthetic and real data demonstrate the effectiveness and flexibility of the proposed method as a practical marker-free calibration tool.
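The full pipeline alternates between rotation, time offset, and association; for the rotation step alone, with correspondences fixed, the optimal rotation between matched unit-sphere points has the classic closed-form Kabsch/SVD solution sketched below (a standalone textbook routine, not the authors' code).

```python
import numpy as np

def align_rotation(p, q):
    """Optimal rotation R with q ≈ p @ R.T for matched 3D unit vectors.

    p, q: (N, 3) arrays of corresponding points on the unit sphere.
    Classic Kabsch solution via SVD of the 3x3 covariance matrix.
    """
    u, _, vt = np.linalg.svd(p.T @ q)
    d = np.sign(np.linalg.det(vt.T @ u.T))  # guard against reflections
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T

rng = np.random.default_rng(0)
p = rng.normal(size=(100, 3))
p /= np.linalg.norm(p, axis=1, keepdims=True)  # project onto the unit sphere
m, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal matrix
if np.linalg.det(m) < 0:
    m[:, 0] *= -1                              # make it a proper rotation
q = p @ m.T
print(np.allclose(align_rotation(p, q), m))    # True
```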
[381] RadMamba: Efficient Human Activity Recognition through Radar-based Micro-Doppler-Oriented Mamba State-Space Model
Yizhuo Wu, Francesco Fioranelli, Chang Gao
Main category: cs.CV
TL;DR: RadMamba, a lightweight Mamba SSM for radar-based HAR, achieves high accuracy with minimal parameters, outperforming existing models.
Details
Motivation: Existing radar-based HAR solutions are computationally heavy, limiting their use in resource-constrained scenarios. RadMamba aims to balance accuracy and efficiency.
Method: Introduces RadMamba, a parameter-efficient Mamba SSM tailored for radar micro-Doppler signals, leveraging transformer strengths while reducing complexity.
Result: Achieves 99.8% accuracy on Dataset DIAT with 1/400 parameters, 92.0% on Dataset CI4R with 1/10 parameters, and outperforms others by 3% on Dataset UoG2020 with only 6.7k parameters.
Conclusion: RadMamba offers a lightweight, high-accuracy solution for radar-based HAR, suitable for resource-constrained deployments.
Abstract: Radar-based human activity recognition (HAR) has emerged as a promising alternative to conventional monitoring approaches, such as wearable devices and camera-based systems, due to its unique privacy preservation and robustness advantages. However, existing solutions based on convolutional and recurrent neural networks, although effective, are computationally demanding during deployment. This limits their applicability in scenarios with constrained resources or those requiring multiple sensors. Advanced architectures, such as the Vision Transformer (ViT) and State-Space Models (SSMs), offer improved modeling capabilities and have made efforts toward lightweight designs. However, their computational complexity remains relatively high. To leverage the strengths of transformer architectures while simultaneously enhancing accuracy and reducing computational complexity, this paper introduces RadMamba, a parameter-efficient, radar micro-Doppler-oriented Mamba SSM specifically tailored for radar-based HAR. Across three diverse datasets, RadMamba matches the top-performing previous model’s 99.8% classification accuracy on Dataset DIAT with only 1/400 of its parameters and equals the leading models’ 92.0% accuracy on Dataset CI4R with merely 1/10 of their parameters. In scenarios with continuous sequences of actions evaluated on Dataset UoG2020, RadMamba surpasses other models with significantly higher parameter counts by at least 3%, achieving this with only 6.7k parameters. Our code is available at: https://github.com/lab-emi/AIRHAR.
[382] MIGE: Mutually Enhanced Multimodal Instruction-Based Image Generation and Editing
Xueyun Tian, Wei Li, Bingbing Xu, Yige Yuan, Yuanzhuo Wang, Huawei Shen
Main category: cs.CV
TL;DR: MIGE is a unified framework for subject-driven generation and instruction-based editing, using multimodal instructions to standardize tasks and improve performance.
Details
Motivation: Existing methods treat subject-driven generation and instruction-based editing separately, facing challenges like limited data and poor generalization. Both tasks require visual consistency and input-output alignment.
Method: MIGE standardizes tasks using multimodal instructions, treats generation as creation and editing as modification, and uses a multimodal encoder for unified vision-language feature fusion.
Result: MIGE improves instruction adherence and visual consistency, generalizes to novel tasks like instruction-based subject-driven editing, and achieves state-of-the-art performance.
Conclusion: MIGE successfully unifies and enhances subject-driven generation and instruction-based editing, demonstrating superior performance and generalization.
Abstract: Despite significant progress in diffusion-based image generation, subject-driven generation and instruction-based editing remain challenging. Existing methods typically treat them separately, struggling with limited high-quality data and poor generalization. However, both tasks require capturing complex visual variations while maintaining consistency between inputs and outputs. Inspired by this, we propose MIGE, a unified framework that standardizes task representations using multimodal instructions. It first treats subject-driven generation as creation on a blank canvas and instruction-based editing as modification of an existing image, establishing a shared input-output formulation, then introduces a novel multimodal encoder that maps free-form multimodal instructions into a unified vision-language space, integrating visual and semantic features through a feature fusion mechanism. This unification enables joint training of both tasks, providing two key advantages: (1) Cross-Task Enhancement: by leveraging shared visual and semantic representations, joint training improves instruction adherence and visual consistency in both subject-driven generation and instruction-based editing. (2) Generalization: learning in a unified format facilitates cross-task knowledge transfer, enabling MIGE to generalize to novel compositional tasks, including instruction-based subject-driven editing. Experiments show that MIGE excels in both subject-driven generation and instruction-based editing while setting a SOTA in the new task of instruction-based subject-driven editing. Code and model have been publicly available at https://github.com/Eureka-Maggie/MIGE.
[383] Distribution-aware Forgetting Compensation for Exemplar-Free Lifelong Person Re-identification
Shiben Liu, Huijie Fan, Qiang Wang, Baojie Fan, Yandong Tang, Liangqiong Qu
Main category: cs.CV
TL;DR: The paper proposes DAFC, a model for Lifelong Person Re-identification (LReID) that addresses forgetting issues without old exemplars or knowledge distillation, using text-driven prompts and domain-specific distribution integration.
Details
Motivation: Existing methods for LReID struggle with preserving old knowledge while adapting to new information, leading to forgetting. Rehearsal-based and rehearsal-free approaches have limitations.
Method: DAFC combines Text-driven Prompt Aggregation (TPA) for fine-grained representations and Distribution-based Awareness and Integration (DAI) for domain-specific distribution learning. A Knowledge Consolidation Mechanism (KCM) aids adaptive learning.
Result: DAFC outperforms state-of-the-art methods in experiments.
Conclusion: DAFC effectively mitigates catastrophic forgetting in LReID by leveraging cross-domain shared representation learning and domain-specific distribution integration.
Abstract: Lifelong Person Re-identification (LReID) suffers from a key challenge in preserving old knowledge while adapting to new information. The existing solutions include rehearsal-based and rehearsal-free methods to address this challenge. Rehearsal-based approaches rely on knowledge distillation, continuously accumulating forgetting during the distillation process. Rehearsal-free methods insufficiently learn the distribution of each domain, leading to forgetfulness over time. To solve these issues, we propose a novel Distribution-aware Forgetting Compensation (DAFC) model that explores cross-domain shared representation learning and domain-specific distribution integration without using old exemplars or knowledge distillation. We propose a Text-driven Prompt Aggregation (TPA) that utilizes text features to enrich prompt elements and guide the prompt model to learn fine-grained representations for each instance. This can enhance the differentiation of identity information and establish the foundation for domain distribution awareness. Then, Distribution-based Awareness and Integration (DAI) is designed to capture each domain-specific distribution by a dedicated expert network and adaptively consolidate them into a shared region in high-dimensional space. In this manner, DAI can consolidate and enhance cross-domain shared representation learning while alleviating catastrophic forgetting. Furthermore, we develop a Knowledge Consolidation Mechanism (KCM) that comprises instance-level discrimination and cross-domain consistency alignment strategies to facilitate model adaptive learning of new knowledge from the current domain and promote knowledge consolidation learning between acquired domain-specific distributions, respectively. Experimental results show that our DAFC outperforms state-of-the-art methods. Our code is available at https://github.com/LiuShiBen/DAFC.
[384] YOLO-PRO: Enhancing Instance-Specific Object Detection with Full-Channel Global Self-Attention
Lin Huang, Yujuan Tan, Weisheng Li, Shitai Shan, Linlin Shen, Jing Yu
Main category: cs.CV
TL;DR: The paper proposes two modules (ISB and ISADH) to improve object detection by addressing bottlenecks and decoupled heads, achieving state-of-the-art performance on MS-COCO.
Details
Motivation: To overcome limitations like diminished instance discriminability and computational redundancy in existing object detection frameworks.
Method: Introduces Instance-Specific Bottleneck (ISB) for global self-attention and Instance-Specific Asymmetric Decoupled Head (ISADH) for hierarchical feature integration.
Result: YOLO-PRO outperforms YOLOv8 and YOLO11 in AP on MS-COCO, with competitive efficiency.
Conclusion: The work offers practical insights for high-precision detectors on edge devices.
Abstract: This paper addresses the inherent limitations of conventional bottleneck structures (diminished instance discriminability due to overemphasis on batch statistics) and decoupled heads (computational redundancy) in object detection frameworks by proposing two novel modules: the Instance-Specific Bottleneck with full-channel global self-attention (ISB) and the Instance-Specific Asymmetric Decoupled Head (ISADH). The ISB module innovatively reconstructs feature maps to establish an efficient full-channel global attention mechanism through synergistic fusion of batch-statistical and instance-specific features. Complementing this, the ISADH module pioneers an asymmetric decoupled architecture enabling hierarchical multi-dimensional feature integration via dual-stream batch-instance representation fusion. Extensive experiments on the MS-COCO benchmark demonstrate that the coordinated deployment of ISB and ISADH in the YOLO-PRO framework achieves state-of-the-art performance across all computational scales. Specifically, YOLO-PRO surpasses YOLOv8 by 1.0-1.6% AP (N/S/M/L/X scales) and outperforms YOLO11 by 0.1-0.5% AP in critical N/M/L/X groups, while maintaining competitive computational efficiency. This work provides practical insights for developing high-precision detectors deployable on edge devices.
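As a rough illustration of full-channel global self-attention, the sketch below treats channels as attention tokens, so a C x C attention map lets every channel attend to every other regardless of spatial resolution (cross-covariance style). This is a generic construction, not the ISB module itself, which additionally fuses batch-statistical and instance-specific features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelfAttention(nn.Module):
    """Global self-attention across channels: tokens are channels, not pixels."""

    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2)                        # (B, C, HW)
        k = self.k(x).flatten(2)
        v = self.v(x).flatten(2)
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(1, 2)).softmax(dim=-1)  # (B, C, C) channel map
        y = (attn @ v).reshape(b, c, h, w)
        return x + self.out(y)                          # residual connection

block = ChannelSelfAttention(64)
print(block(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```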
[385] BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation
Ruotong Wang, Mingli Zhu, Jiarong Ou, Rui Chen, Xin Tao, Pengfei Wan, Baoyuan Wu
Main category: cs.CV
TL;DR: BadVideo introduces a backdoor attack framework for text-to-video (T2V) models, exploiting redundancy to embed hidden harmful content with high stealthiness.
Details
Motivation: The adversarial vulnerabilities of T2V models are underexplored, and their inherent redundancy allows malicious exploitation.
Method: Uses Spatio-Temporal Composition and Dynamic Element Transformation to encode and convey malicious information.
Result: Achieves high attack success rates while preserving original semantics and evading content moderation.
Conclusion: Reveals T2V models’ adversarial risks, highlighting potential misuse.
Abstract: Text-to-video (T2V) generative models have rapidly advanced and found widespread applications across fields like entertainment, education, and marketing. However, the adversarial vulnerabilities of these models remain rarely explored. We observe that in T2V generation tasks, the generated videos often contain substantial redundant information not explicitly specified in the text prompts, such as environmental elements, secondary objects, and additional details, providing opportunities for malicious attackers to embed hidden harmful content. Exploiting this inherent redundancy, we introduce BadVideo, the first backdoor attack framework tailored for T2V generation. Our attack focuses on designing target adversarial outputs through two key strategies: (1) Spatio-Temporal Composition, which combines different spatiotemporal features to encode malicious information; (2) Dynamic Element Transformation, which introduces transformations in redundant elements over time to convey malicious information. Based on these strategies, the attacker’s malicious target seamlessly integrates with the user’s textual instructions, providing high stealthiness. Moreover, by exploiting the temporal dimension of videos, our attack successfully evades traditional content moderation systems that primarily analyze spatial information within individual frames. Extensive experiments demonstrate that BadVideo achieves high attack success rates while preserving original semantics and maintaining excellent performance on clean inputs. Overall, our work reveals the adversarial vulnerability of T2V models, calling attention to potential risks and misuse. Our project page is at https://wrt2000.github.io/BadVideo2025/.
[386] Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining
Zhumei Wang, Zechen Hu, Ruoxi Guo, Huaijin Pi, Ziyong Feng, Sida Peng, Xiaowei Zhou, Mingtao Pei, Siyuan Huang
Main category: cs.CV
TL;DR: Mocap-2-to-3 is a framework for recovering metrically accurate 3D human motion from monocular input by leveraging 2D data pre-training and multi-view synthesis, outperforming state-of-the-art methods.
Details
Motivation: Existing methods rely on limited 3D training data and struggle with metric-scale pose estimation from monocular input, limiting generalization.
Method: The framework decomposes 3D motion into multi-view syntheses, pre-trains a single-view diffusion model on 2D data, and fine-tunes a multi-view model with 3D data. It also introduces a novel motion representation for absolute pose recovery.
Result: The method achieves superior performance in camera-space motion realism and world-grounded positioning, with better generalization.
Conclusion: Mocap-2-to-3 effectively addresses the challenges of monocular 3D motion recovery, offering improved accuracy and generalization.
Abstract: Recovering absolute human motion from monocular inputs is challenging due to two main issues. First, existing methods depend on 3D training data collected from limited environments, constraining out-of-distribution generalization. The second issue is the difficulty of estimating metric-scale poses from monocular input. To address these challenges, we introduce Mocap-2-to-3, a novel framework that performs multi-view lifting from monocular input by leveraging 2D data pre-training, enabling the reconstruction of metrically accurate 3D motions with absolute positions. To leverage abundant 2D data, we decompose complex 3D motion into multi-view syntheses. We first pretrain a single-view diffusion model on extensive 2D datasets, then fine-tune a multi-view model using public 3D data to enable view-consistent motion generation from monocular input, allowing the model to acquire action priors and diversity through 2D data. Furthermore, to recover absolute poses, we propose a novel human motion representation that decouples the learning of local pose and global movements, while encoding geometric priors of the ground to accelerate convergence. This enables progressive recovery of motion in absolute space during inference. Experimental results on in-the-wild benchmarks demonstrate that our method surpasses state-of-the-art approaches in both camera-space motion realism and world-grounded human positioning, while exhibiting superior generalization capability. Our code will be made publicly available.
[387] MemeBLIP2: A novel lightweight multimodal system to detect harmful memes
Jiaqi Liu, Ran Tong, Aowei Shen, Shuzheng Li, Changlin Yang, Lisha Xu
Main category: cs.CV
TL;DR: MemeBLIP2 is a lightweight multimodal system for detecting harmful memes by effectively combining image and text features, improving detection accuracy even for subtle or culturally specific content.
Details
Motivation: Some memes contain harmful messages like hate speech, necessitating a system to detect such content by leveraging both visual and textual cues.
Method: The system uses BLIP-2 as its core vision-language model, adding modules to align and fuse image and text representations in a shared space for better classification.
Result: Evaluated on the PrideMM dataset, MemeBLIP2 effectively captures subtle multimodal cues, enhancing harmful meme detection.
Conclusion: MemeBLIP2 demonstrates improved performance in identifying harmful memes, including those with irony or cultural specificity.
Abstract: Memes often merge visuals with brief text to share humor or opinions, yet some memes contain harmful messages such as hate speech. In this paper, we introduce MemeBLIP2, a lightweight multimodal system that detects harmful memes by combining image and text features effectively. We build on previous studies by adding modules that align image and text representations into a shared space and fuse them for better classification. Using BLIP-2 as the core vision-language model, our system is evaluated on the PrideMM dataset. The results show that MemeBLIP2 can capture subtle cues in both modalities, even in cases with ironic or culturally specific content, thereby improving the detection of harmful material.
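A minimal sketch of the align-then-fuse design described above: project image and text embeddings into a shared space, concatenate, and classify. The dimensions and the two-layer classifier are illustrative assumptions, not MemeBLIP2's exact modules.

```python
import torch
import torch.nn as nn

class SharedSpaceFusion(nn.Module):
    """Project image/text features into a shared space, fuse, and classify."""

    def __init__(self, img_dim=768, txt_dim=768, shared_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, shared_dim)
        self.txt_proj = nn.Linear(txt_dim, shared_dim)
        self.classifier = nn.Sequential(
            nn.Linear(shared_dim * 2, shared_dim), nn.ReLU(),
            nn.Linear(shared_dim, 2),  # harmful vs. benign
        )

    def forward(self, img_feat, txt_feat):
        zi = self.img_proj(img_feat)   # image embedding in the shared space
        zt = self.txt_proj(txt_feat)   # text embedding in the shared space
        return self.classifier(torch.cat([zi, zt], dim=-1))

model = SharedSpaceFusion()
logits = model(torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```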
[388] Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation
Suhwan Cho, Seunghoon Lee, Minhyeok Lee, Jungho Lee, Sangyoun Lee
Main category: cs.CV
TL;DR: FindTrack decouples target identification and mask propagation in referring video object segmentation, improving accuracy and consistency.
Details
Motivation: Existing methods struggle with ambiguous target identification and inconsistent mask propagation in complex scenes.
Method: FindTrack selects a key frame for robust target reference and uses a propagation module for tracking.
Result: FindTrack outperforms existing methods on public benchmarks.
Conclusion: Decoupling target identification and propagation enhances performance in referring video object segmentation.
Abstract: Referring video object segmentation aims to segment and track a target object in a video using a natural language prompt. Existing methods typically fuse visual and textual features in a highly entangled manner, processing multi-modal information together to generate per-frame masks. However, this approach often struggles with ambiguous target identification, particularly in scenes with multiple similar objects, and fails to ensure consistent mask propagation across frames. To address these limitations, we introduce FindTrack, an efficient decoupled framework that separates target identification from mask propagation. FindTrack first adaptively selects a key frame by balancing segmentation confidence and vision-text alignment, establishing a robust reference for the target object. This reference is then utilized by a dedicated propagation module to track and segment the object across the entire video. By decoupling these processes, FindTrack effectively reduces ambiguities in target association and enhances segmentation consistency. FindTrack significantly outperforms all existing methods on public benchmarks, demonstrating its superiority.
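A minimal sketch of the key-frame selection idea: score each frame by a weighted combination of segmentation confidence and vision-text alignment and pick the argmax. The linear combination and the alpha weight are assumptions standing in for FindTrack's actual balancing rule.

```python
import numpy as np

def select_key_frame(seg_confidence, text_alignment, alpha=0.5):
    """Pick the frame that best balances two per-frame scores.

    seg_confidence: (T,) mask confidence per frame.
    text_alignment: (T,) vision-text similarity per frame.
    alpha weights the two terms (the value is illustrative).
    """
    score = alpha * seg_confidence + (1 - alpha) * text_alignment
    return int(np.argmax(score))

conf = np.array([0.7, 0.9, 0.6])
align = np.array([0.5, 0.8, 0.9])
print(select_key_frame(conf, align))  # 1
```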
[389] REGRACE: A Robust and Efficient Graph-based Re-localization Algorithm using Consistency Evaluation
Débora N. P. Oliveira, Joshua Knights, Sebastián Barbas Laina, Simon Boche, Wolfram Burgard, Stefan Leutenegger
Main category: cs.CV
TL;DR: REGRACE introduces a scalable, efficient LiDAR-based submap approach for loop closure detection, addressing viewpoint sensitivity and computational cost issues.
Details
Motivation: Current methods for loop closure detection are either computationally expensive (dense point clouds) or sensitive to viewpoint variations (object-centric approaches).
Method: Uses rotation-invariant features for labeled objects, enhanced with neighborhood context via a graph neural network, and a scalable bag-of-words approach for revisit identification.
Result: Achieves similar accuracy to state-of-the-art methods while being twice as fast.
Conclusion: REGRACE effectively balances scalability, efficiency, and accuracy in loop closure detection.
Abstract: Loop closures are essential for correcting odometry drift and creating consistent maps, especially in the context of large-scale navigation. Current methods using dense point clouds for accurate place recognition do not scale well due to computationally expensive scan-to-scan comparisons. Alternative object-centric approaches are more efficient but often struggle with sensitivity to viewpoint variation. In this work, we introduce REGRACE, a novel approach that addresses these challenges of scalability and perspective difference in re-localization by using LiDAR-based submaps. We introduce rotation-invariant features for each labeled object and enhance them with neighborhood context through a graph neural network. To identify potential revisits, we employ a scalable bag-of-words approach, pooling one learned global feature per submap. Additionally, we define a revisit with geometrical consistency cues rather than embedding distance, allowing us to recognize far-away loop closures. Our evaluations demonstrate that REGRACE achieves results comparable to state-of-the-art place recognition and registration baselines while being twice as fast. Code and models are publicly available.
[390] RS2-SAM2: Customized SAM2 for Referring Remote Sensing Image Segmentation
Fu Rong, Meng Lan, Qian Zhang, Lefei Zhang
Main category: cs.CV
TL;DR: RS2-SAM2 adapts SAM2 for Referring Remote Sensing Image Segmentation (RRSIS) by aligning visual-text features, using pseudo-mask prompts, and boundary constraints, achieving top performance.
Details
Motivation: Segment Anything Model 2 (SAM2) struggles with RRSIS due to challenges in understanding text-described RS scenes and generating effective prompts.
Method: Proposes RS2-SAM2 with a union encoder for visual-text alignment, bidirectional hierarchical fusion for scene adaptation, and a mask prompt generator for dense prompts.
Result: Achieves state-of-the-art performance on RRSIS benchmarks.
Conclusion: RS2-SAM2 effectively adapts SAM2 for RRSIS, addressing key challenges and improving segmentation accuracy.
Abstract: Referring Remote Sensing Image Segmentation (RRSIS) aims to segment target objects in remote sensing (RS) images based on textual descriptions. Although Segment Anything Model 2 (SAM2) has shown remarkable performance in various segmentation tasks, its application to RRSIS presents several challenges, including understanding the text-described RS scenes and generating effective prompts from text descriptions. To address these issues, we propose RS2-SAM2, a novel framework that adapts SAM2 to RRSIS by aligning the adapted RS features and textual features, providing pseudo-mask-based dense prompts, and enforcing boundary constraints. Specifically, we employ a union encoder to jointly encode the visual and textual inputs, generating aligned visual and text embeddings as well as multimodal class tokens. A bidirectional hierarchical fusion module is introduced to adapt SAM2 to RS scenes and align adapted visual features with the visually enhanced text embeddings, improving the model’s interpretation of text-described RS scenes. To provide precise target cues for SAM2, we design a mask prompt generator, which takes the visual embeddings and class tokens as input and produces a pseudo-mask as the dense prompt of SAM2. Experimental results on several RRSIS benchmarks demonstrate that RS2-SAM2 achieves state-of-the-art performance.
[391] Video Forgery Detection for Surveillance Cameras: A Review
Noor B. Tayfor, Tarik A. Rashid, Shko M. Qader, Bryar A. Hassan, Mohammed H. Abdalla, Jafar Majidpour, Aram M. Ahmed, Hussein M. Ali, Aso M. Aladdin, Abdulhady A. Abdullah, Ahmed S. Shamsaldin, Haval M. Sidqi, Abdulrahman Salih, Zaher M. Yaseen, Azad A. Ameen, Janmenjoy Nayak, Mahmood Yashar Hamza
Main category: cs.CV
TL;DR: The paper reviews forensic techniques for detecting video forgery in surveillance footage, emphasizing the need for robust methods to ensure authenticity and legal credibility.
Details
Motivation: The rise of video editing tools has made tampering with surveillance footage easier, threatening its integrity and judicial reliability.
Method: The study examines techniques like compression-based analysis, frame duplication detection, and machine learning approaches.
Result: Existing methods are effective but require enhancement to counter advanced forgery techniques.
Conclusion: Strengthening video forensic capabilities is crucial to maintain the credibility of surveillance recordings as legal evidence.
Abstract: The widespread availability of video recording through smartphones and digital devices has made video-based evidence more accessible than ever. Surveillance footage plays a crucial role in security, law enforcement, and judicial processes. However, with the rise of advanced video editing tools, tampering with digital recordings has become increasingly easy, raising concerns about their authenticity. Ensuring the integrity of surveillance videos is essential, as manipulated footage can lead to misinformation and undermine judicial decisions. This paper provides a comprehensive review of existing forensic techniques used to detect video forgery, focusing on their effectiveness in verifying the authenticity of surveillance recordings. Various methods, including compression-based analysis, frame duplication detection, and machine learning-based approaches, are explored. The findings highlight the growing necessity for more robust forensic techniques to counteract evolving forgery methods. Strengthening video forensic capabilities will ensure that surveillance recordings remain credible and admissible as legal evidence.
[392] “Principal Components” Enable A New Language of Images
Xin Wen, Bingchen Zhao, Ismail Elezi, Jiankang Deng, Xiaojuan Qi
Main category: cs.CV
TL;DR: A novel visual tokenization framework embeds PCA-like structure into latent tokens, ensuring interpretability and better downstream task performance.
Details
Motivation: Existing visual tokenizers focus on reconstruction but neglect latent space structure, which is crucial for interpretability and downstream tasks.
Method: Generates 1D causal token sequences with decreasing explained variance, resolving semantic-spectrum coupling using a diffusion decoder.
Result: Achieves state-of-the-art reconstruction, better interpretability, and comparable autoregressive model performance with fewer tokens.
Conclusion: The framework improves tokenization by embedding structural properties, enhancing both performance and interpretability.
Abstract: We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. While existing visual tokenizers primarily optimize for reconstruction fidelity, they often neglect the structural properties of the latent space, a critical factor for both interpretability and downstream tasks. Our method generates a 1D causal token sequence for images, where each successive token contributes non-overlapping information with mathematically guaranteed decreasing explained variance, analogous to principal component analysis. This structural constraint ensures the tokenizer extracts the most salient visual features first, with each subsequent token adding diminishing yet complementary information. Additionally, we identified and resolved a semantic-spectrum coupling effect that causes the unwanted entanglement of high-level semantic content and low-level spectral details in the tokens by leveraging a diffusion decoder. Experiments demonstrate that our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system. Moreover, autoregressive models trained on our token sequences achieve performance comparable to current state-of-the-art methods while requiring fewer tokens for training and inference.
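The PCA property the tokenizer emulates can be verified in a few lines: principal components carry non-overlapping (orthogonal) information with monotonically decreasing explained variance, so the leading components are the most informative. A small NumPy demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 16)) @ rng.normal(size=(16, 16))  # correlated data
x -= x.mean(axis=0)                                          # center features

# Principal components via SVD; squared singular values give the variance
# explained by each component, already sorted in descending order.
_, s, _ = np.linalg.svd(x, full_matrices=False)
explained = s**2 / (s**2).sum()

print(np.all(np.diff(explained) <= 0))  # True: variance strictly ordered
print(explained[:4].sum())              # leading components carry most information
```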
[393] Hoi2Threat: An Interpretable Threat Detection Method for Human Violence Scenarios Guided by Human-Object Interaction
Yuhan Wang, Cheng Liu, Daou Zhang, Zihan Zhao, Jinyang Chen, Purui Dong, Zuyuan Yu, Ziru Wang, Weichao Wu
Main category: cs.CV
TL;DR: The paper introduces Hoi2Threat, a threat detection method using human-object interaction pairs (HOI-pairs) to improve interpretability and semantic understanding, outperforming Gemma3 in key metrics.
Details
Motivation: Addressing the limitations of uninterpretable inference and biased semantic understanding in existing threat detection methods.
Method: Proposes Hoi2Threat, leveraging structured HOI tags from the TD-Hoi dataset to enhance semantic modeling and language generation.
Result: Hoi2Threat shows significant improvements in Correctness of Information (CoI), Behavioral Mapping Accuracy (BMA), and Threat Detailed Orientation (TDO) over Gemma3.
Conclusion: Hoi2Threat validates enhanced semantic understanding, behavior mapping, and interpretability in threat detection.
Abstract: In light of the mounting imperative for public security, the necessity for automated threat detection in high-risk scenarios is becoming increasingly pressing. However, existing methods generally suffer from the problems of uninterpretable inference and biased semantic understanding, which severely limits their reliability in practical deployment. In order to address these challenges, this article proposes Hoi2Threat, a threat detection method based on human-object interaction pairs (HOI-pairs). The method builds on the fine-grained multimodal TD-Hoi dataset, enhancing the model’s semantic modeling ability for key entities and their behavioral interactions by using structured HOI tags to guide language generation. Furthermore, a set of metrics is designed for the evaluation of text response quality, with the objective of systematically measuring the model’s representation accuracy and comprehensibility during threat interpretation. The experimental results demonstrate that Hoi2Threat attains substantial gains on several threat detection tasks, particularly on the core metrics of Correctness of Information (CoI), Behavioral Mapping Accuracy (BMA), and Threat Detailed Orientation (TDO), reaching scores of 5.08, 5.04, and 4.76, improvements of 7.10%, 6.80%, and 2.63%, respectively, over Gemma3 (4B). These results provide comprehensive validation of the merits of this approach in the domains of semantic understanding, entity behavior mapping, and interpretability.
[394] Towards Scalable IoT Deployment for Visual Anomaly Detection via Efficient Compression
Arianna Stropeni, Francesco Borsatti, Manuel Barusco, Davide Dalle Pezze, Marco Fabris, Gian Antonio Susto
Main category: cs.CV
TL;DR: The paper explores efficient Visual Anomaly Detection (VAD) for IoT edge devices, balancing compression techniques with minimal accuracy loss, achieving 80% faster inference.
Details
Motivation: Addressing the challenge of deploying deep learning models in resource-constrained IoT environments for cost-effective anomaly detection.
Method: Evaluates data compression techniques to optimize system latency and detection accuracy, tested on the MVTec AD benchmark.
Result: Achieves up to 80% reduction in end-to-end inference time with minimal performance loss in anomaly detection.
Conclusion: Compact, efficient processing strategies enable effective VAD in IoT settings without compromising accuracy.
Abstract: Visual Anomaly Detection (VAD) is a key task in industrial settings, where minimizing operational costs is essential. Deploying deep learning models within Internet of Things (IoT) environments introduces specific challenges due to limited computational power and bandwidth of edge devices. This study investigates how to perform VAD effectively under such constraints by leveraging compact, efficient processing strategies. We evaluate several data compression techniques, examining the tradeoff between system latency and detection accuracy. Experiments on the MVTec AD benchmark demonstrate that significant compression can be achieved with minimal loss in anomaly detection performance compared to uncompressed data. Current results show up to 80% reduction in end-to-end inference time, including edge processing, transmission, and server computation.
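The size/latency half of such a tradeoff study can be probed in a few lines: sweep the codec quality and record the compressed size and encode time on the edge side. The JPEG codec, quality grid, and random test image below are illustrative assumptions, not the paper's exact compression pipeline.

```python
import io
import time

import numpy as np
from PIL import Image

# A stand-in test image; a real study would use MVTec AD samples.
img = Image.fromarray(np.random.randint(0, 255, (512, 512, 3), dtype=np.uint8))

for quality in (90, 50, 10):
    buf = io.BytesIO()
    t0 = time.perf_counter()
    img.save(buf, format="JPEG", quality=quality)  # edge-side compression step
    elapsed = time.perf_counter() - t0
    print(f"quality={quality}: {buf.tell()} bytes, {1e3 * elapsed:.1f} ms")
```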
[395] MASQUE: A Text-Guided Diffusion-Based Framework for Localized and Customized Adversarial Makeup
Youngjin Kwon, Xiao Zhang
Main category: cs.CV
TL;DR: MASQUE is a diffusion-based framework for generating localized adversarial makeups to protect privacy in facial recognition, improving dodging success rates and adaptability.
Details
Motivation: Addressing privacy and civil rights concerns from facial recognition misuse; current anti-facial recognition methods have limitations like weak dodging success and visual artifacts.
Method: MASQUE uses diffusion-based techniques with null-text inversion, cross-attention fusion, and adversarial guidance to create localized makeups from text prompts.
Result: MASQUE outperforms baselines in dodging success rates, perceptual fidelity, and adaptability to diverse makeup prompts.
Conclusion: MASQUE offers a robust solution for privacy protection in facial recognition with improved performance and user satisfaction.
Abstract: As facial recognition is increasingly adopted for government and commercial services, its potential misuse has raised serious concerns about privacy and civil rights. To counteract, various anti-facial recognition techniques have been proposed for privacy protection by adversarially perturbing face images, among which generative makeup-based approaches are the most popular. However, these methods, designed primarily to impersonate specific target identities, can only achieve weak dodging success rates while increasing the risk of targeted abuse. In addition, they often introduce global visual artifacts or a lack of adaptability to accommodate diverse makeup prompts, compromising user satisfaction. To address the above limitations, we develop MASQUE, a novel diffusion-based framework that generates localized adversarial makeups guided by user-defined text prompts. Built upon precise null-text inversion, customized cross-attention fusion with masking, and a pairwise adversarial guidance mechanism using images of the same individual, MASQUE achieves robust dodging performance without requiring any external identity. Comprehensive evaluations on open-source facial recognition models and commercial APIs demonstrate that MASQUE significantly improves dodging success rates over all baselines, along with higher perceptual fidelity and stronger adaptability to various text makeup prompts.
[396] Rethinking Multi-Modal Object Detection from the Perspective of Mono-Modality Feature Learning
Tianyi Zhao, Boyang Liu, Yanglei Gao, Yiming Sun, Maoxun Yuan, Xingxing Wei
Main category: cs.CV
TL;DR: The paper introduces M$^2$D-LIF, a framework addressing Fusion Degradation in RGB-IR object detection by improving mono-modality learning and feature fusion.
Details
Motivation: Current RGB-IR object detection methods neglect mono-modality insufficient learning, leading to Fusion Degradation, which hampers performance.Method: Proposes M$^2$D-LIF, combining Mono-Modality Distillation (M$^2$D) and Local Illumination-aware Fusion (LIF) for better mono-modality learning and fusion.
Result: Outperforms SOTA detectors on three MMOD datasets, mitigating Fusion Degradation.
Conclusion: M$^2$D-LIF effectively addresses mono-modality learning and fusion, enhancing RGB-IR object detection performance.
Abstract: Multi-Modal Object Detection (MMOD), due to its stronger adaptability to various complex environments, has been widely adopted in a range of applications. Extensive research is dedicated to RGB-IR object detection, primarily focusing on how to integrate complementary features from RGB-IR modalities. However, these works neglect the mono-modality insufficient learning problem, which arises from decreased feature extraction capability in multi-modal joint learning. This leads to a prevalent but unreasonable phenomenon, Fusion Degradation, which hinders the performance improvement of the MMOD model. Motivated by this, in this paper, we introduce linear probing evaluation to the multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. Therefore, we construct a novel framework called M$^2$D-LIF, which consists of the Mono-Modality Distillation (M$^2$D) method and the Local Illumination-aware Fusion (LIF) module. The M$^2$D-LIF framework facilitates the sufficient learning of mono-modality during multi-modal joint training and explores a lightweight yet effective feature fusion manner to achieve superior object detection performance. Extensive experiments conducted on three MMOD datasets demonstrate that our M$^2$D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms the previous SOTA detectors. The codes are available at https://github.com/Zhao-Tian-yi/M2D-LIF.
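Linear probing, the diagnostic the paper applies to multi-modal detectors, is easy to sketch: freeze the trained backbone and fit only a linear head on its features, so the probe's accuracy reflects feature quality. The sketch below is a generic single-label version with illustrative names, not the paper's detection-specific setup.

```python
# Minimal linear-probing sketch: backbone frozen, only a linear head trained.
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(backbone: nn.Module, loader, feat_dim: int, num_classes: int):
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)            # freeze the trained backbone
    probe = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for images, labels in loader:
        with torch.no_grad():
            feats = backbone(images)       # features only, no fine-tuning
        loss = F.cross_entropy(probe(feats), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe                           # probe accuracy ~ feature quality
```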
[397] GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation
Junyu Shi, Lijiang Liu, Yong Sun, Zhiyuan Zhang, Jinni Zhou, Qiang Nie
Main category: cs.CV
TL;DR: GenM³ is a framework for learning unified motion representations, combining MEVQ-VAE and MMT to handle data heterogeneity, achieving state-of-the-art results on benchmarks.
Details
Motivation: To address data heterogeneity in large-scale multi-source motion datasets and enhance motion generation capabilities.Method: Proposes GenM³ with MEVQ-VAE for unified discrete motion representation and MMT for intra-modal and inter-modal alignment.
Result: Achieves FID of 0.035 on HumanML3D and strong zero-shot generalization on IDEA400.
Conclusion: GenM³ effectively handles diverse motion scenarios, outperforming existing methods.
Abstract: Scaling up motion datasets is crucial to enhance motion generation capabilities. However, training on large-scale multi-source datasets introduces data heterogeneity challenges due to variations in motion content. To address this, we propose the Generative Pretrained Multi-path Motion Model (GenM$^3$), a comprehensive framework designed to learn unified motion representations. GenM$^3$ comprises two components: 1) a Multi-Expert VQ-VAE (MEVQ-VAE) that adapts to different dataset distributions to learn a unified discrete motion representation, and 2) a Multi-path Motion Transformer (MMT) that improves intra-modal representations by using separate modality-specific pathways, each with densely activated experts to accommodate variations within that modality, and improves inter-modal alignment by the text-motion shared pathway. To enable large-scale training, we integrate and unify 11 high-quality motion datasets (approximately 220 hours of motion data) and augment the unified data with textual annotations (nearly 10,000 motion sequences labeled by a large language model and 300+ by human experts). After training on our integrated dataset, GenM$^3$ achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing state-of-the-art methods by a large margin. It also demonstrates strong zero-shot generalization on the IDEA400 dataset, highlighting its effectiveness and adaptability across diverse motion scenarios.
[398] DDB: Diffusion Driven Balancing to Address Spurious Correlations
Aryan Yazdan Parast, Basim Azam, Naveed Akhtar
Main category: cs.CV
TL;DR: The paper proposes Diffusion Driven Balancing (DDB) to mitigate spurious correlations in image classification by generating balanced training samples using text-to-image diffusion models, improving worst-group accuracy.
Details
Motivation: Deep neural networks trained with ERM often fail on out-of-distribution samples due to reliance on spurious correlations between labels and irrelevant image features.Method: DDB uses textual inversion to identify causal components, generates new samples via diffusion models, prunes them based on model predictions, and retrains the ERM model.
Result: DDB achieves better worst-group accuracy than state-of-the-art methods across benchmarks.
Conclusion: DDB effectively reduces reliance on spurious correlations by leveraging carefully crafted samples, enhancing generalization.
Abstract: Deep neural networks trained with Empirical Risk Minimization (ERM) perform well when both training and test data come from the same domain, but they often fail to generalize to out-of-distribution samples. In image classification, these models may rely on spurious correlations that often exist between labels and irrelevant features of images, making predictions unreliable when those features do not exist. We propose a Diffusion Driven Balancing (DDB) technique to generate training samples with text-to-image diffusion models for addressing the spurious correlation problem. First, we compute the best describing token for the visual features pertaining to the causal components of samples by a textual inversion mechanism. Then, leveraging a language segmentation method and a diffusion model, we generate new samples by combining the causal component with the elements from other classes. We also meticulously prune the generated samples based on the prediction probabilities and attribution scores of the ERM model to ensure their correct composition for our objective. Finally, we retrain the ERM model on our augmented dataset. This process reduces the model’s reliance on spurious correlations by learning from carefully crafted samples in which this correlation does not exist. Our experiments show that across different benchmarks, our technique achieves better worst-group accuracy than the existing state-of-the-art methods. Our code is available at https://github.com/ArianYp/DDB.
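The confidence-based pruning step lends itself to a short sketch: a generated sample is kept only if the ERM model assigns its intended class a sufficiently high probability. This is a hedged simplification; the paper additionally filters by attribution scores, which is omitted here, and all names are illustrative.

```python
# Sketch of pruning generated samples by ERM prediction confidence.
import torch

@torch.no_grad()
def prune_generated(erm_model, samples, intended_labels, min_prob=0.7):
    """samples: (N, C, H, W) tensor; intended_labels: (N,) class ids."""
    probs = erm_model(samples).softmax(dim=1)
    conf = probs[torch.arange(len(samples)), intended_labels]
    keep = conf >= min_prob                # drop miscomposed generations
    return samples[keep], intended_labels[keep]
```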
[399] 3DGen-Bench: Comprehensive Benchmark Suite for 3D Generative Models
Yuhan Zhang, Mengchen Zhang, Tong Wu, Tengfei Wang, Gordon Wetzstein, Dahua Lin, Ziwei Liu
Main category: cs.CV
TL;DR: The paper introduces 3DGen-Arena and 3DGen-Bench to address the lack of human preference datasets in 3D generation, followed by the development of automated evaluation models 3DGen-Score and 3DGen-Eval.
Details
Motivation: The rapid progress in 3D generation lacks equitable automatic evaluation aligned with human perception, necessitating a comprehensive preference dataset.Method: Developed 3DGen-Arena to gather human preferences, created 3DGen-Bench dataset, and trained CLIP-based 3DGen-Score and MLLM-based 3DGen-Eval for unified evaluation.
Result: The models show superior correlation with human rankings compared to existing metrics, demonstrating efficacy in predicting preferences.
Conclusion: The 3DGen-Bench dataset and automated evaluation system aim to foster equitable evaluation and advance 3D generative models.
Abstract: 3D generation is experiencing rapid advancements, while the development of 3D evaluation has not kept pace. How to keep automatic evaluation equitably aligned with human perception has become a well-recognized challenge. Recent advances in the field of language and image generation have explored human preferences and showcased respectable fitting ability. However, the 3D domain still lacks such a comprehensive preference dataset over generative models. To mitigate this absence, we develop 3DGen-Arena, an integrated platform for head-to-head model battles. Then, we carefully design diverse text and image prompts and leverage the arena platform to gather human preferences from both public users and expert annotators, resulting in a large-scale multi-dimensional human preference dataset, 3DGen-Bench. Using this dataset, we further train a CLIP-based scoring model, 3DGen-Score, and an MLLM-based automatic evaluator, 3DGen-Eval. These two models innovatively unify the quality evaluation of text-to-3D and image-to-3D generation, and jointly form our automated evaluation system with their respective strengths. Extensive experiments demonstrate the efficacy of our scoring model in predicting human preferences, exhibiting a superior correlation with human ranks compared to existing metrics. We believe that our 3DGen-Bench dataset and automated evaluation system will foster a more equitable evaluation in the field of 3D generation, further promoting the development of 3D generative models and their downstream applications. Project page is available at https://zyh482.github.io/3DGen-Bench/.
[400] A Unified Image-Dense Annotation Generation Model for Underwater Scenes
Hongkai Lin, Dingkang Liang, Zhenghao Qi, Xiang Bai
Main category: cs.CV
TL;DR: TIDE is a method for generating realistic underwater images and dense annotations from text, addressing data scarcity in underwater dense prediction tasks.
Details
Motivation: High-quality underwater datasets with dense annotations are scarce due to complex environments and high data collection costs.Method: TIDE unifies text-to-image and text-to-dense annotation generation in one model, using Implicit Layout Sharing (ILS) and Time Adaptive Normalization (TAN) for consistency.
Result: TIDE improves performance of underwater dense prediction models and mitigates data scarcity.
Conclusion: TIDE offers a solution for data scarcity in underwater tasks and potentially other fields.
Abstract: Underwater dense prediction, especially depth estimation and semantic segmentation, is crucial for gaining a comprehensive understanding of underwater scenes. Nevertheless, high-quality and large-scale underwater datasets with dense annotations remain scarce because of the complex environment and the exorbitant data collection costs. This paper proposes a unified Text-to-Image and DEnse annotation generation method (TIDE) for underwater scenes. It relies solely on text as input to simultaneously generate realistic underwater images and multiple highly consistent dense annotations. Specifically, we unify the generation of text-to-image and text-to-dense annotations within a single model. The Implicit Layout Sharing (ILS) mechanism and a cross-modal interaction method, Time Adaptive Normalization (TAN), are introduced to jointly optimize the consistency between the image and the dense annotations. We synthesize a large-scale underwater dataset using TIDE to validate the effectiveness of our method in underwater dense prediction tasks. The results demonstrate that our method effectively improves the performance of existing underwater dense prediction models and mitigates the scarcity of underwater data with dense annotations. We hope our method can offer new perspectives on alleviating data scarcity issues in other fields. The code is available at https://github.com/HongkLin/TIDE
[401] FlowAlign: Trajectory-Regularized, Inversion-Free Flow-based Image Editing
Jeongsol Kim, Yeobin Hong, Jonghyun Park, Jong Chul Ye
Main category: cs.CV
TL;DR: FlowAlign improves inversion-free flow-based image editing by introducing terminal point regularization for smoother, more consistent trajectories.
Details
Motivation: Existing inversion-free methods like FlowEdit suffer from unstable editing trajectories and poor source consistency.Method: FlowAlign uses optimal control-based trajectory control with terminal point regularization to balance semantic alignment and structural consistency.
Result: FlowAlign outperforms existing methods in source preservation and editing controllability, supporting reverse editing.
Conclusion: FlowAlign offers a robust, reversible, and consistent framework for inversion-free flow-based image editing.
Abstract: Recent inversion-free, flow-based image editing methods such as FlowEdit leverage a pre-trained noise-to-image flow model such as Stable Diffusion 3, enabling text-driven manipulation by solving an ordinary differential equation (ODE). While the lack of exact latent inversion is a core advantage of these methods, it often results in unstable editing trajectories and poor source consistency. To address this limitation, we propose FlowAlign, a novel inversion-free flow-based framework for consistent image editing with optimal control-based trajectory control. Specifically, FlowAlign introduces source similarity at the terminal point as a regularization term to promote smoother and more consistent trajectories during the editing process. Notably, our terminal point regularization is shown to explicitly balance semantic alignment with the edit prompt and structural consistency with the source image along the trajectory. Furthermore, FlowAlign naturally supports reverse editing by simply reversing the ODE trajectory, highlighting the reversible and consistent nature of the transformation. Extensive experiments demonstrate that FlowAlign outperforms existing methods in both source preservation and editing controllability.
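As a rough illustration of terminal-point regularization, the toy sketch below integrates an editing ODE with Euler steps while penalizing the distance between a one-step terminal estimate and the source image. `velocity` stands in for a pretrained flow model conditioned on the edit prompt; this is a conceptual approximation, not the paper's optimal-control formulation.

```python
# Toy sketch: Euler integration of an editing ODE with a source-similarity
# penalty applied to a crude terminal estimate at every step.
import torch

def edit_with_terminal_reg(velocity, x_src, prompt, steps=50, lam=0.1):
    x, dt = x_src.clone(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x.detach().requires_grad_(True)
        v = velocity(x, t, prompt)              # placeholder flow velocity
        x1_hat = x + (1.0 - t) * v              # one-step terminal estimate
        reg = ((x1_hat - x_src) ** 2).mean()    # stay close to the source
        grad = torch.autograd.grad(reg, x)[0]
        x = (x + dt * v - lam * grad).detach()  # regularized Euler step
    return x
```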
[402] Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models
Xuran Ma, Yexin Liu, Yaofu Liu, Xianfeng Wu, Mingzhe Zheng, Zihao Wang, Ser-Nam Lim, Harry Yang
Main category: cs.CV
TL;DR: ProfilingDiT introduces an adaptive caching strategy for diffusion models, optimizing computational efficiency by distinguishing foreground and background blocks, achieving significant speedup without quality loss.
Details
Motivation: The computational intensity of diffusion models for video generation is a challenge, and existing caching methods overlook block significance, leading to inefficiency and degraded output.Method: ProfilingDiT analyzes attention distributions to identify foreground and background preferences, then selectively caches static background features while computing dynamic foreground elements fully.
Result: The method achieves a 2.01 times speedup (e.g., for Wan2.1) while maintaining visual fidelity across quality metrics.
Conclusion: ProfilingDiT provides a viable solution for efficient video generation by balancing computational overhead and output quality.
Abstract: Recent advances in diffusion models have demonstrated remarkable capabilities in video generation. However, the computational intensity remains a significant challenge for practical applications. While feature caching has been proposed to reduce the computational burden of diffusion models, existing methods typically overlook the heterogeneous significance of individual blocks, resulting in suboptimal reuse and degraded output quality. We address this gap by introducing ProfilingDiT, a novel adaptive caching strategy that explicitly disentangles foreground- and background-focused blocks. Through a systematic analysis of attention distributions in diffusion models, we reveal two key observations: 1) most layers exhibit a consistent preference for either foreground or background regions; 2) predicted noise shows low inter-step similarity initially, which stabilizes as denoising progresses. This finding inspires us to formulate a selective caching strategy that preserves full computation for dynamic foreground elements while efficiently caching static background features. Our approach substantially reduces computational overhead while preserving visual fidelity. Extensive experiments demonstrate that our framework achieves significant acceleration (e.g., a 2.01 times speedup for Wan2.1) while maintaining visual fidelity across comprehensive quality metrics, establishing a viable method for efficient video generation.
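One common way to realize such selective feature caching is to store each background-focused block's residual and reuse it for several denoising steps, recomputing foreground-focused blocks every step. The sketch below illustrates that mechanism with hypothetical names; the actual foreground/background assignment comes from the paper's offline attention profiling, which is assumed done.

```python
# Illustrative selective-caching sketch for a stack of transformer blocks.
cache = {}

def run_blocks(blocks, background_ids, x, step, refresh_every=4):
    for i, block in enumerate(blocks):
        reuse = i in background_ids and step % refresh_every != 0 and i in cache
        if reuse:
            x = x + cache[i]          # reuse the cached residual of this block
        else:
            out = block(x)            # full computation (always for foreground)
            if i in background_ids:
                cache[i] = out - x    # refresh the cached residual
            x = out
    return x
```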
[403] Grid-LOGAT: Grid Based Local and Global Area Transcription for Video Question Answering
Md Intisar Chowdhury, Kittinun Aukkapinyo, Hiroshi Fujimura, Joo Ann Woo, Wasu Wasusatein, Fadoua Ghourabi
Main category: cs.CV
TL;DR: Proposes Grid-LoGAT for VideoQA, using VLM for text extraction and LLM for answer generation, ensuring privacy and improving accuracy with grid-based prompting.
Details
Motivation: To enhance VideoQA accuracy while addressing privacy concerns by separating VLM (edge) and LLM (cloud) processing.Method: Two-phase system: VLM extracts transcripts from video frames, LLM processes questions. Grid-based visual prompting improves transcript quality.
Result: Outperforms state-of-the-art on NExT-QA (65.9%) and STAR-QA (50.11%), and surpasses non-grid version by 24 points on localization questions.
Conclusion: Grid-LoGAT is effective for VideoQA, balancing privacy and performance, with significant accuracy improvements.
Abstract: In this paper, we propose a Grid-based Local and Global Area Transcription (Grid-LoGAT) system for Video Question Answering (VideoQA). The system operates in two phases. First, it extracts text transcripts from video frames using a Vision-Language Model (VLM). Next, it processes questions against these transcripts to generate answers through a Large Language Model (LLM). This design ensures image privacy by deploying the VLM on edge devices and the LLM in the cloud. To improve transcript quality, we propose grid-based visual prompting, which extracts intricate local details from each grid cell and integrates them with global information. Evaluation results show that Grid-LoGAT, using the open-source VLM (LLaVA-1.6-7B) and LLM (Llama-3.1-8B), outperforms state-of-the-art methods with similar baseline models on the NExT-QA and STAR-QA datasets, with accuracies of 65.9% and 50.11%, respectively. Additionally, our method surpasses the non-grid version by 24 points on localization-based questions we created using NExT-QA. (This paper is accepted by IEEE ICIP 2025.)
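Grid-based visual prompting can be sketched directly: the frame is cropped into grid cells, the VLM describes each cell plus the full frame, and the pieces are merged into one transcript for the LLM. `query_vlm` is a hypothetical stand-in for the on-edge VLM call, and the prompts are illustrative rather than the paper's.

```python
# Sketch: build a local+global transcript from one frame via grid crops.
from PIL import Image

def grid_transcript(frame: Image.Image, query_vlm, rows=3, cols=3):
    w, h = frame.size
    parts = [f"global: {query_vlm(frame, 'Describe this frame.')}"]
    for r in range(rows):
        for c in range(cols):
            cell = frame.crop((c * w // cols, r * h // rows,
                               (c + 1) * w // cols, (r + 1) * h // rows))
            parts.append(f"cell({r},{c}): "
                         f"{query_vlm(cell, 'Describe this region.')}")
    return "\n".join(parts)   # fed to the cloud LLM with the question
```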
[404] DSwinIR: Rethinking Window-based Attention for Image Restoration
Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu, Liqiang Nie
Main category: cs.CV
TL;DR: DSwinIR introduces Deformable Sliding Window Attention to overcome limitations of rigid window partitioning in Transformer-based image restoration, achieving state-of-the-art results.
Details
Motivation: Existing window-based self-attention in Transformers has limitations like insufficient cross-window feature interaction and content-agnostic receptive fields, which hinder performance.Method: Proposes Deformable Sliding Window (DSwin) Attention, replacing rigid partitioning with token-centric sliding windows and content-aware deformable sampling for adaptive receptive fields.
Result: DSwinIR outperforms GridFormer by 0.53 dB on three-task and 0.86 dB on five-task benchmarks, setting new state-of-the-art results.
Conclusion: DSwinIR effectively addresses the root causes of limitations in existing methods, offering superior performance in image restoration tasks.
Abstract: Image restoration has witnessed significant advancements with the development of deep learning models. Transformer-based models, especially those leveraging window-based self-attention, have become a dominant force in image restoration. However, their performance is fundamentally constrained by the rigid, non-overlapping window partitioning scheme, which leads to two critical limitations: insufficient feature interaction across window boundaries and content-agnostic receptive fields that cannot adapt to diverse image structures. Existing methods often rely on heuristic patterns to mitigate these issues, rather than addressing the root cause. In this paper, we propose the Deformable Sliding Window Transformer (DSwinIR), a new foundational backbone architecture that systematically overcomes these limitations. At the heart of DSwinIR is the proposed novel Deformable Sliding Window (DSwin) Attention. This mechanism introduces two fundamental innovations. First, it replaces the rigid partitioning with a token-centric sliding window paradigm, ensuring seamless cross-window information flow and effectively eliminating boundary artifacts. Second, it incorporates a content-aware deformable sampling strategy, which allows the attention mechanism to learn data-dependent offsets and dynamically shape its receptive fields to focus on the most informative image regions. This synthesis endows the model with both strong locality-aware inductive biases and powerful, adaptive long-range modeling capabilities. Extensive experiments show that DSwinIR sets a new state-of-the-art across a wide spectrum of image restoration tasks. For instance, in all-in-one restoration, our DSwinIR surpasses the most recent backbone GridFormer by over 0.53 dB on the three-task benchmark and a remarkable 0.86 dB on the five-task benchmark.
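The token-centric sliding-window idea (without the deformable offsets) can be approximated in a few lines: every query token attends to a window centered on itself, gathered with `F.unfold`, so receptive fields overlap instead of being partitioned. A minimal single-head sketch under those assumptions:

```python
# Sketch of token-centric sliding-window attention; deformable sampling omitted.
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window=7):
    """q, k, v: (B, C, H, W). Each token attends to its own centered window."""
    B, C, H, W = q.shape
    pad = window // 2
    # Gather every token's key/value neighborhood: (B, C, window*window, H*W)
    k_n = F.unfold(k, window, padding=pad).view(B, C, window * window, H * W)
    v_n = F.unfold(v, window, padding=pad).view(B, C, window * window, H * W)
    q_n = q.view(B, C, 1, H * W)
    attn = (q_n * k_n).sum(dim=1, keepdim=True) / C ** 0.5  # scaled dot product
    attn = attn.softmax(dim=2)               # normalize over the neighborhood
    out = (attn * v_n).sum(dim=2)            # weighted sum of values
    return out.view(B, C, H, W)
```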
[405] VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning
Zhong-Yu Li, Ruoyi Du, Juncheng Yan, Le Zhuo, Zhen Li, Peng Gao, Zhanyu Ma, Ming-Ming Cheng
Main category: cs.CV
TL;DR: VisualCloze is a universal image generation framework addressing limitations of task-specific and universal models by integrating visual in-context learning, a graph-structured dataset (Graph200K), and leveraging pre-trained infilling models.
Details
Motivation: Current task-specific models lack efficiency for diverse needs, while universal models struggle with generalizable task instruction, task distributions, and unified design.Method: Proposes VisualCloze, using visual in-context learning for task identification, Graph200K for task density, and unified image generation with pre-trained infilling models.
Result: Supports diverse in-domain tasks, generalization to unseen tasks, task unification, and reverse generation.
Conclusion: VisualCloze effectively addresses challenges in universal image generation, offering a scalable and efficient solution.
Abstract: Recent progress in diffusion models significantly advances various image generation tasks. However, the current mainstream approach remains focused on building task-specific models, which have limited efficiency when supporting a wide range of different needs. While universal models attempt to address this limitation, they face critical challenges, including generalizable task instruction, appropriate task distributions, and unified architectural design. To tackle these challenges, we propose VisualCloze, a universal image generation framework, which supports a wide range of in-domain tasks, generalization to unseen ones, unseen unification of multiple tasks, and reverse generation. Unlike existing methods that rely on language-based task instruction, leading to task ambiguity and weak generalization, we integrate visual in-context learning, allowing models to identify tasks from visual demonstrations. Meanwhile, the inherent sparsity of visual task distributions hampers the learning of transferable knowledge across tasks. To this end, we introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge. Furthermore, we uncover that our unified image generation formulation shares a consistent objective with image infilling, enabling us to leverage the strong generative priors of pre-trained infilling models without modifying the architectures.
[406] EventVAD: Training-Free Event-Aware Video Anomaly Detection
Yihua Shao, Haojin He, Sijie Li, Siyu Chen, Xinwei Long, Fanhu Zeng, Yuxuan Fan, Muyang Zhang, Ziyang Yan, Ao Ma, Xiaochen Wang, Hao Tang, Yan Wang, Shuyan Li
Main category: cs.CV
TL;DR: EventVAD combines dynamic graph architectures and multimodal LLMs for video anomaly detection, achieving SOTA performance in training-free settings.
Details
Motivation: Supervised methods lack generalization, while training-free methods struggle with fine-grained localization. EventVAD addresses these gaps.Method: Uses dynamic spatiotemporal graphs, noise filtering, and hierarchical prompting with MLLMs for event-aware anomaly detection.
Result: Outperforms baselines on UCF-Crime and XD-Violence datasets, even with smaller MLLMs.
Conclusion: EventVAD effectively bridges the gap in training-free VAD by leveraging event-aware reasoning and MLLMs.
Abstract: Video Anomaly Detection (VAD) focuses on identifying anomalies within videos. Supervised methods require a substantial amount of in-domain training data and often struggle to generalize to unseen anomalies. In contrast, training-free methods leverage the intrinsic world knowledge of large language models (LLMs) to detect anomalies but face challenges in localizing fine-grained visual transitions and diverse events. Therefore, we propose EventVAD, an event-aware video anomaly detection framework that combines tailored dynamic graph architectures and multimodal LLMs through temporal-event reasoning. Specifically, EventVAD first employs dynamic spatiotemporal graph modeling with time-decay constraints to capture event-aware video features. Then, it performs adaptive noise filtering and uses signal ratio thresholding to detect event boundaries via unsupervised statistical features. The statistical boundary detection module reduces the complexity of processing long videos for MLLMs and improves their temporal reasoning through event consistency. Finally, it utilizes a hierarchical prompting strategy to guide MLLMs in performing reasoning before determining final decisions. We conducted extensive experiments on the UCF-Crime and XD-Violence datasets. The results demonstrate that EventVAD with a 7B MLLM achieves state-of-the-art (SOTA) in training-free settings, outperforming strong baselines that use 7B or larger MLLMs.
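A simplified version of statistical event-boundary detection might look like the following: cosine similarity between consecutive frame features is smoothed over time, and frames where the raw-to-smoothed ratio drops below a threshold are flagged as boundaries. This is a generic approximation of the idea, not the paper's exact statistic, and the parameter defaults are illustrative.

```python
# Sketch: unsupervised event-boundary detection via similarity-ratio thresholding.
import numpy as np

def event_boundaries(feats: np.ndarray, ratio_thresh=0.85, smooth=5):
    """feats: (T, D) array of per-frame features."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = (f[:-1] * f[1:]).sum(axis=1)              # consecutive cosine sims
    kernel = np.ones(smooth) / smooth
    sim_s = np.convolve(sim, kernel, mode="same")   # temporal smoothing
    ratio = sim / np.maximum(sim_s, 1e-6)           # signal ratio
    return np.where(ratio < ratio_thresh)[0] + 1    # boundary frame indices
```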
[407] Manipulating Multimodal Agents via Cross-Modal Prompt Injection
Le Wang, Zonghao Ying, Tianyuan Zhang, Siyuan Liang, Shengshan Hu, Mingchuan Zhang, Aishan Liu, Xianglong Liu
Main category: cs.CV
TL;DR: The paper introduces CrossInject, a cross-modal prompt injection attack framework targeting multimodal agents, exploiting vulnerabilities by aligning adversarial perturbations across modalities to hijack decision-making.
Details
Motivation: To address the overlooked security vulnerability in multimodal agents, where attackers can manipulate both visual and textual inputs to execute unauthorized tasks.Method: Proposes CrossInject with Visual Latent Alignment (optimizing adversarial features in visual space) and Textual Guidance Enhancement (using a large language model to craft malicious commands).
Result: Achieves a +30.1% increase in attack success rates and demonstrates effectiveness in real-world autonomous agents.
Conclusion: Highlights the critical security risks in multimodal agents and the need for robust defenses against cross-modal attacks.
Abstract: The emergence of multimodal large language models has redefined the agent paradigm by integrating language and vision modalities with external data sources, enabling agents to better interpret human instructions and execute increasingly complex tasks. However, in this paper, we identify a critical yet previously overlooked security vulnerability in multimodal agents: cross-modal prompt injection attacks. To exploit this vulnerability, we propose CrossInject, a novel attack framework in which attackers embed adversarial perturbations across multiple modalities to align with target malicious content, allowing external instructions to hijack the agent's decision-making process and execute unauthorized tasks. Our approach incorporates two key coordinated components. First, we introduce Visual Latent Alignment, where we optimize adversarial features to the malicious instructions in the visual embedding space based on a text-to-image generative model, ensuring that adversarial images subtly encode cues for malicious task execution. Subsequently, we present Textual Guidance Enhancement, where a large language model is leveraged to construct the black-box defensive system prompt through adversarial meta prompting and generate a malicious textual command that steers the agent's output toward better compliance with attackers' requests. Extensive experiments demonstrate that our method outperforms state-of-the-art attacks, achieving at least a +30.1% increase in attack success rates across diverse tasks. Furthermore, we validate our attack's effectiveness in real-world multimodal autonomous agents, highlighting its potential implications for safety-critical applications.
[408] NSegment : Label-specific Deformations for Remote Sensing Image Segmentation
Yechan Kim, DongHo Yoon, SooYeon Kim, Moongu Jeon
Main category: cs.CV
TL;DR: NSegment is a simple data augmentation method for RS image segmentation that addresses labeling errors by applying elastic transformations to labels, improving model performance.
Details
Motivation: Labeling errors in RS datasets are common due to ambiguous boundaries, mixed pixels, and subjective bias, complicating noise-robust model training.Method: Proposes NSegment, which applies elastic transformations to segmentation labels with varying intensity per sample in each epoch.
Result: Improves performance of RS image segmentation across state-of-the-art models.
Conclusion: NSegment effectively mitigates labeling inconsistencies without increasing training complexity.
Abstract: Labeling errors in remote sensing (RS) image segmentation datasets often remain implicit and subtle due to ambiguous class boundaries, mixed pixels, shadows, complex terrain features, and subjective annotator bias. Furthermore, the scarcity of annotated RS data due to the high cost of labeling complicates training noise-robust models. While sophisticated mechanisms such as label selection or noise correction might address the issue mentioned above, they tend to increase training time and add implementation complexity. In this paper, we propose NSegment, a simple yet effective data augmentation solution to mitigate this issue. Unlike traditional methods, it applies elastic transformations only to segmentation labels, varying deformation intensity per sample in each training epoch to address annotation inconsistencies. Experimental results demonstrate that our approach improves the performance of RS image segmentation over various state-of-the-art models.
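Because the method is essentially one transform, a minimal sketch is instructive: an elastic displacement field with per-sample random intensity is applied to the label map only, using nearest-neighbor interpolation so class ids stay discrete. Parameter names and defaults here are illustrative assumptions, not the paper's settings.

```python
# Sketch: label-only elastic deformation with per-sample random intensity.
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def deform_label(label: np.ndarray, max_alpha=30.0, sigma=8.0, rng=None):
    """label: (H, W) integer class map; the image itself is left untouched."""
    rng = rng or np.random.default_rng()
    alpha = rng.uniform(0.0, max_alpha)       # varying intensity per sample
    h, w = label.shape
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys + dy, xs + dx])
    return map_coordinates(label, coords, order=0, mode="nearest")  # discrete
```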
[409] Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection
Daniel Bogdoll, Rajanikant Patnaik Ananta, Abeyankar Giridharan, Isabel Moore, Gregory Stevens, Henry X. Liu
Main category: cs.CV
TL;DR: The Mcity Data Engine addresses the challenge of selecting and labeling rare classes in large datasets for ITS by providing an open-source system for the entire data-based development cycle.
Details
Motivation: The difficulty in detecting long-tail classes in unlabeled ITS data and the lack of open-source tools for iterative data selection and model training motivated the development of the Mcity Data Engine.Method: The system includes modules for data acquisition, open-vocabulary data selection (focusing on rare classes), and model deployment.
Result: The Mcity Data Engine is publicly available on GitHub under an MIT license, offering a solution for researchers and the open-source community.
Conclusion: The Mcity Data Engine fills a gap in open-source tools for ITS data processing, particularly for rare and novel classes.
Abstract: With an ever-increasing availability of data, it has become more and more challenging to select and label appropriate samples for the training of machine learning models. It is especially difficult to detect long-tail classes of interest in large amounts of unlabeled data. This holds especially true for Intelligent Transportation Systems (ITS), where vehicle fleets and roadside perception systems generate an abundance of raw data. While industrial, proprietary data engines for such iterative data selection and model training processes exist, researchers and the open-source community suffer from a lack of an openly available system. We present the Mcity Data Engine, which provides modules for the complete data-based development cycle, beginning at the data acquisition phase and ending at the model deployment stage. The Mcity Data Engine focuses on rare and novel classes through an open-vocabulary data selection process. All code is publicly available on GitHub under an MIT license: https://github.com/mcity/mcity_data_engine
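Open-vocabulary data selection of the kind described above is often built on CLIP-style scoring; the hedged sketch below ranks unlabeled frames by their similarity to free-form text prompts for rare classes and keeps the top matches for labeling. It assumes the open_clip_torch package and is an illustration of the general idea, not the engine's actual code.

```python
# Sketch: open-vocabulary selection of rare-class candidates with CLIP scores.
import torch
import open_clip  # assumes the open_clip_torch package is installed

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

@torch.no_grad()
def select_rare(images, prompts=("a photo of an e-scooter rider",), top_k=100):
    txt = model.encode_text(tokenizer(list(prompts)))
    txt = txt / txt.norm(dim=-1, keepdim=True)
    ims = torch.stack([preprocess(im) for im in images])  # PIL images in
    img = model.encode_image(ims)
    img = img / img.norm(dim=-1, keepdim=True)
    scores = (img @ txt.T).max(dim=1).values   # best-matching prompt per image
    return scores.topk(min(top_k, len(images))).indices   # send to labeling
```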
[410] Learning Multi-frame and Monocular Prior for Estimating Geometry in Dynamic Scenes
Seong Hyeon Park, Jinwoo Shin
Main category: cs.CV
TL;DR: MMP is a new model for estimating 3D geometry in dynamic scenes from monocular videos, improving expressiveness and reducing errors.
Details
Motivation: Existing models struggle with noisy partial attributes and costly optimizations in dynamic scene geometry estimation.Method: MMP uses a Siamese architecture with a trajectory encoding module to project point-wise dynamics for improved expressiveness.
Result: MMP achieves a 15.1% reduction in regression error, outperforming state-of-the-art methods.
Conclusion: MMP offers a feed-forward solution for accurate and efficient dynamic scene geometry estimation.
Abstract: In monocular videos that capture dynamic scenes, estimating the 3D geometry of video contents has been a fundamental challenge in computer vision. Specifically, the task is significantly challenged by object motion, where existing models are limited to predict only partial attributes of the dynamic scenes, such as depth or pointmaps spanning only over a pair of frames. Since these attributes are inherently noisy under multiple frames, test-time global optimizations are often employed to fully recover the geometry, which is liable to failure and incurs heavy inference costs. To address the challenge, we present a new model, coined MMP, to estimate the geometry in a feed-forward manner, which produces a dynamic pointmap representation that evolves over multiple frames. Specifically, based on the recent Siamese architecture, we introduce a new trajectory encoding module to project point-wise dynamics on the representation for each frame, which can provide significantly improved expressiveness for dynamic scenes. In our experiments, we find MMP can achieve state-of-the-art quality in feed-forward pointmap prediction, e.g., a 15.1% reduction in the regression error.
[411] Crop Pest Classification Using Deep Learning Techniques: A Review
Muhammad Hassam Ejaz, Muhammad Bilal, Usman Habib
Main category: cs.CV
TL;DR: A review of 37 studies (2018-2025) on AI-based pest detection, highlighting trends from CNNs to hybrid/transformer models, key challenges, and future directions.
Details
Motivation: Traditional pest monitoring is slow and manual; AI offers scalable, automated solutions.Method: Analyzed 37 studies by crop type, pest species, model architecture, dataset usage, and technical challenges.
Result: Shift from CNNs to hybrid/transformer models improves accuracy but faces challenges like imbalanced datasets and deployment issues.
Conclusion: AI-based pest monitoring shows promise but needs work on generalizability, small pest detection, and edge deployment.
Abstract: Insect pests continue to pose a serious threat to crop yields around the world, and traditional methods for monitoring them are often slow, manual, and difficult to scale. In recent years, deep learning has emerged as a powerful solution, with techniques like convolutional neural networks (CNNs), vision transformers (ViTs), and hybrid models gaining popularity for automating pest detection. This review looks at 37 carefully selected studies published between 2018 and 2025, all focused on AI-based pest classification. The selected research is organized by crop type, pest species, model architecture, dataset usage, and key technical challenges. The early studies relied heavily on CNNs, but the latest work is shifting toward hybrid and transformer-based models that deliver higher accuracy and better contextual understanding. Still, challenges like imbalanced datasets, difficulty in detecting small pests, limited generalizability, and deployment on edge devices remain significant hurdles. Overall, this review offers a structured overview of the field, highlights useful datasets, and outlines the key challenges and future directions for AI-based pest monitoring systems.
[412] Decentralized LoRA Augmented Transformer with Context-aware Multi-scale Feature Learning for Secured Eye Diagnosis
Md. Naimur Asif Borno, Md Sakib Hossain Shovon, MD Hanif Sikder, Iffat Firozy Rimi, Tahani Jaser Alahmadi, Mohammad Ali Moni
Main category: cs.CV
TL;DR: A novel DeiT-based framework integrates multiscale patch embedding, LoRA, knowledge distillation, and federated learning to address challenges in ophthalmic disease diagnosis, achieving superior performance and interpretability.
Details
Motivation: Addressing data imbalance, privacy concerns, spatial feature diversity, and interpretability in ophthalmic disease diagnosis using deep learning.Method: Proposes a DeiT-based framework with multiscale patch embedding, LoRA for efficiency, federated learning for privacy, and knowledge distillation for generalization.
Result: Outperforms CNNs and transformers in AUC, F1 score, and precision on OCTDL and Eye Disease Image Dataset, with interpretable Grad-CAM++ visualizations.
Conclusion: Establishes a scalable, secure, and explainable AI foundation for ophthalmic diagnostics.
Abstract: Accurate and privacy-preserving diagnosis of ophthalmic diseases remains a critical challenge in medical imaging, particularly given the limitations of existing deep learning models in handling data imbalance, data privacy concerns, spatial feature diversity, and clinical interpretability. This paper proposes a novel Data-efficient Image Transformer (DeiT)-based framework that integrates context-aware multiscale patch embedding, Low-Rank Adaptation (LoRA), knowledge distillation, and federated learning to address these challenges in a unified manner. The proposed model effectively captures both local and global retinal features by leveraging multi-scale patch representations with local and global attention mechanisms. LoRA integration enhances computational efficiency by reducing the number of trainable parameters, while federated learning ensures secure, decentralized training without compromising data privacy. A knowledge distillation strategy further improves generalization in data-scarce settings. Comprehensive evaluations on two benchmark datasets, OCTDL and the Eye Disease Image Dataset, demonstrate that the proposed framework consistently outperforms both traditional CNNs and state-of-the-art transformer architectures across key metrics including AUC, F1 score, and precision. Furthermore, Grad-CAM++ visualizations provide interpretable insights into model predictions, supporting clinical trust. This work establishes a strong foundation for scalable, secure, and explainable AI applications in ophthalmic diagnostics.
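The LoRA component is standard enough to sketch: the pretrained weight is frozen and only two small low-rank matrices are trained, so the adapter adds a rank-r update on top of the base projection. This is the textbook formulation, not the paper's exact integration into DeiT attention blocks.

```python
# Minimal LoRA sketch: frozen base linear layer plus a trainable rank-r update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # freeze W (and bias)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init
        self.scale = alpha / rank

    def forward(self, x):
        # Base projection plus the scaled low-rank correction B @ A @ x
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```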
[413] ZERO: Industry-ready Vision Foundation Model with Multi-modal Prompts
Sangbum Choi, Kyeongryeol Go, Taewoong Jang
Main category: cs.CV
TL;DR: ZERO is a vision foundation model designed for zero-shot industrial applications, leveraging multi-modal prompting and trained on a compact dataset to outperform existing models.
Details
Motivation: Addressing the lack of high-quality, domain-specific datasets for zero-shot deployment of foundation models in industrial settings.Method: Uses multi-modal prompting (textual and visual) and is trained on 0.9 million annotated samples from a proprietary billion-scale dataset.
Result: Competitive performance on academic benchmarks (LVIS-Val) and outperforms models on 37 industrial datasets; ranked 2nd and 4th in CVPR 2025 challenges.
Conclusion: ZERO is the first vision foundation model built for domain-specific, zero-shot industrial use, demonstrating practical deployability and generalizability.
Abstract: Foundation models have revolutionized AI, yet they struggle with zero-shot deployment in real-world industrial settings due to a lack of high-quality, domain-specific datasets. To bridge this gap, Superb AI introduces ZERO, an industry-ready vision foundation model that leverages multi-modal prompting (textual and visual) for generalization without retraining. Trained on a compact yet representative set of 0.9 million annotated samples drawn from a proprietary billion-scale industrial dataset, ZERO demonstrates competitive performance on academic benchmarks like LVIS-Val and significantly outperforms existing models across 37 diverse industrial datasets. Furthermore, ZERO achieved 2nd place in the CVPR 2025 Object Instance Detection Challenge and 4th place in the Foundational Few-shot Object Detection Challenge, highlighting its practical deployability and generalizability with minimal adaptation and limited data. To the best of our knowledge, ZERO is the first vision foundation model explicitly built for domain-specific, zero-shot industrial applications.
[414] Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance
Mingfang Zhang, Ryo Yonetani, Yifei Huang, Liangyang Ouyang, Ruicong Liu, Yoichi Sato
Main category: cs.CV
TL;DR: The paper introduces EAIL, a framework using head-mounted IMU signals and egocentric action cues to localize individuals in 3D point clouds, addressing drift and action diversity challenges.
Details
Motivation: Human inertial localization is hindered by IMU sensor noise causing drift and diverse human actions complicating signal processing. EAIL leverages action-environment correlations to mitigate drift.Method: EAIL learns correlations between IMU signals and environmental features via hierarchical multimodal alignment, enhanced by vision-language guidance, and uses encoders for localization and action recognition.
Result: EAIL outperforms state-of-the-art methods in inertial localization and action recognition, demonstrating its effectiveness.
Conclusion: EAIL effectively addresses drift and action diversity in inertial localization, with added benefits for action recognition, validated by extensive experiments.
Abstract: This paper presents a novel inertial localization framework named Egocentric Action-aware Inertial Localization (EAIL), which leverages egocentric action cues from head-mounted IMU signals to localize the target individual within a 3D point cloud. Human inertial localization is challenging due to IMU sensor noise that causes trajectory drift over time. The diversity of human actions further complicates IMU signal processing by introducing various motion patterns. Nevertheless, we observe that some actions captured by the head-mounted IMU correlate with spatial environmental structures (e.g., bending down to look inside an oven, washing dishes next to a sink), thereby serving as spatial anchors to compensate for the localization drift. The proposed EAIL framework learns such correlations via hierarchical multi-modal alignment with vision-language guidance. Assuming that the 3D point cloud of the environment is available, it contrastively learns modality encoders that align short-term egocentric action cues in IMU signals with local environmental features in the point cloud. The learning process is enhanced using concurrently collected vision and language signals to improve multimodal alignment. The learned encoders are then used to reason over the IMU data and the point cloud across time and space to perform inertial localization. Interestingly, these encoders can further be utilized to recognize the corresponding sequence of actions as a by-product. Extensive experiments demonstrate the effectiveness of the proposed framework over state-of-the-art inertial localization and inertial action recognition baselines.
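The contrastive alignment between IMU action cues and local point-cloud features is plausibly a symmetric InfoNCE objective over matched pairs, as sketched below with placeholder encoders; the paper's hierarchical, vision-language-guided variant is more involved.

```python
# Sketch: symmetric InfoNCE aligning IMU window embeddings with the
# point-cloud features of the locations where those actions occurred.
import torch
import torch.nn.functional as F

def infonce_align(imu_emb, pc_emb, temperature=0.07):
    """imu_emb, pc_emb: (B, D); row i of each is a matched IMU/location pair."""
    imu = F.normalize(imu_emb, dim=1)
    pc = F.normalize(pc_emb, dim=1)
    logits = imu @ pc.T / temperature             # (B, B) similarity matrix
    targets = torch.arange(len(imu), device=imu.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```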
[415] Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan
Main category: cs.CV
TL;DR: BAGEL is an open-source foundational model unifying multimodal understanding and generation, outperforming existing open-source models in benchmarks and showcasing advanced reasoning abilities.
Details
Motivation: To create an open-source alternative to proprietary systems for multimodal understanding and generation, fostering further research.Method: BAGEL is a unified, decoder-only model pretrained on trillions of tokens from diverse interleaved text, image, video, and web data.
Result: BAGEL excels in multimodal generation, understanding, and reasoning, outperforming benchmarks and demonstrating capabilities like image manipulation and future frame prediction.
Conclusion: BAGEL advances multimodal research by providing open-source tools, data protocols, and checkpoints, encouraging community collaboration.
Abstract: Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, and data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/
[416] FreeQ-Graph: Free-form Querying with Semantic Consistent Scene Graph for 3D Scene Understanding
Chenlu Zhan, Yufei Zhang, Gaoang Wang, Hongwei Wang
Main category: cs.CV
TL;DR: FreeQ-Graph enables free-form querying in 3D scenes using a semantic-consistent scene graph, overcoming limitations of predefined vocabularies and LLM inconsistencies.
Details
Motivation: Existing methods rely on predefined vocabularies or LLMs, hindering free-form semantic querying and lacking 3D scene-level consistency.Method: Constructs a 3D scene graph with LLM/LVLM guidance, aligns nodes with semantic labels via superpoints, and uses an LLM-based reasoning algorithm for querying.
Result: Outperforms in 3D semantic grounding, segmentation, and complex querying across 6 datasets.
Conclusion: FreeQ-Graph advances free-form semantic querying in 3D scenes with improved consistency and accuracy.
Abstract: Semantic querying in complex 3D scenes through free-form language presents a significant challenge. Existing 3D scene understanding methods use large-scale training data and CLIP to align text queries with 3D semantic features. However, their reliance on predefined vocabulary priors from training data hinders free-form semantic querying. Moreover, recent advanced methods rely on LLMs for scene understanding but lack comprehensive 3D scene-level information and often overlook the potential inconsistencies in LLM-generated outputs. In our paper, we propose FreeQ-Graph, which enables Free-form Querying with a semantic consistent scene Graph for 3D scene understanding. The core idea is to encode free-form queries from a complete and accurate 3D scene graph without predefined vocabularies, and to align them with 3D consistent semantic labels, which is accomplished through three key steps. We begin by constructing a complete and accurate 3D scene graph that maps free-form objects and their relations through LLM and LVLM guidance, entirely free from training data or predefined priors. Most importantly, we align graph nodes with accurate semantic labels by leveraging 3D semantic aligned features from merged superpoints, enhancing 3D semantic consistency. To enable free-form semantic querying, we then design an LLM-based reasoning algorithm that combines scene-level and object-level information to perform intricate reasoning. We conducted extensive experiments on 3D semantic grounding, segmentation, and complex querying tasks, while also validating the accuracy of graph generation. Experiments on 6 datasets show that our model excels in both complex free-form semantic queries and intricate relational reasoning.
[417] A Memory-Efficient Framework for Deformable Transformer with Neural Architecture Search
Wendong Mao, Mingfan Zhao, Jianfeng Guan, Qiwei Dong, Zhongfeng Wang
Main category: cs.CV
TL;DR: A hardware-friendly optimization framework for Deformable Attention Transformers (DAT) is proposed, using NAS-based slicing and FPGA verification to reduce memory conflicts and maintain accuracy.
Details
Motivation: DAT's irregular memory access patterns hinder efficient hardware deployment, and existing methods compromise accuracy or incur high overhead.Method: Proposes a NAS-based method to slice input features uniformly and an FPGA system for verification, optimizing hardware cost and accuracy.
Result: Achieves only 0.2% accuracy drop on ImageNet-1K and reduces DRAM access to 18% of existing methods on FPGA.
Conclusion: The framework effectively balances hardware efficiency and model accuracy for DAT deployment.
Abstract: Deformable Attention Transformers (DAT) have shown remarkable performance in computer vision tasks by adaptively focusing on informative image regions. However, their data-dependent sampling mechanism introduces irregular memory access patterns, posing significant challenges for efficient hardware deployment. Existing acceleration methods either incur high hardware overhead or compromise model accuracy. To address these issues, this paper proposes a hardware-friendly optimization framework for DAT. First, a neural architecture search (NAS)-based method with a new slicing strategy is proposed to automatically divide the input feature into uniform patches during the inference process, avoiding memory conflicts without modifying the model architecture. The method explores the optimal slice configuration by jointly optimizing hardware cost and inference accuracy. Second, an FPGA-based verification system is designed to test the performance of this framework on edge-side hardware. Algorithm experiments on the ImageNet-1K dataset demonstrate that our hardware-friendly framework incurs only a 0.2% accuracy drop compared to the baseline DAT. Hardware experiments on a Xilinx FPGA show the proposed method reduces DRAM accesses to 18% of those of existing DAT acceleration methods.
[418] End-to-End Fine-Tuning of 3D Texture Generation using Differentiable Rewards
AmirHossein Zamani, Tianhao Xie, Amir G. Aghdam, Tiberiu Popa, Eugene Belilovsky
Main category: cs.CV
TL;DR: The paper proposes an end-to-end differentiable framework for 3D texture generation that incorporates human feedback via reward functions, improving alignment with 3D structure and user preferences.
Details
Motivation: Existing 3D generative models often miss human preferences and task-specific needs, relying on 2D text-to-image models that lack 3D understanding.Method: The framework integrates differentiable reward functions into the 3D texture synthesis pipeline, enabling back-propagation of preference signals through geometric and appearance modules.
Result: The method outperforms state-of-the-art approaches in qualitative, quantitative, and user-preference evaluations, demonstrating better alignment with 3D structure and desired criteria.
Conclusion: The proposed framework offers a controllable and interpretable way to generate high-quality 3D textures from natural language, with plans to release the implementation code.
Abstract: While recent 3D generative models can produce high-quality texture images, they often fail to capture human preferences or meet task-specific requirements. Moreover, a core challenge in the 3D texture generation domain is that most existing approaches rely on repeated calls to 2D text-to-image generative models, which lack an inherent understanding of the 3D structure of the input 3D mesh object. To alleviate these issues, we propose an end-to-end differentiable, reinforcement-learning-free framework that embeds human feedback, expressed as differentiable reward functions, directly into the 3D texture synthesis pipeline. By back-propagating preference signals through both geometric and appearance modules of the proposed framework, our method generates textures that respect the 3D geometry structure and align with desired criteria. To demonstrate its versatility, we introduce three novel geometry-aware reward functions, which offer a more controllable and interpretable pathway for creating high-quality 3D content from natural language. By conducting qualitative, quantitative, and user-preference evaluations against state-of-the-art methods, we demonstrate that our proposed strategy consistently outperforms existing approaches. We will make our implementation code publicly available upon acceptance of the paper.
[419] ZeroReg3D: A Zero-shot Registration Pipeline for 3D Consecutive Histopathology Image Reconstruction
Juming Xiong, Ruining Deng, Jialin Yue, Siqi Lu, Junlin Guo, Marilyn Lionts, Tianyuan Yao, Can Cui, Junchao Zhu, Chongyu Qu, Mengmeng Yin, Haichun Yang, Yuankai Huo
Main category: cs.CV
TL;DR: ZeroReg3D is a zero-shot registration pipeline for accurate 3D reconstruction from 2D histological sections, addressing challenges like tissue deformation and artifacts without retraining.
Details
Motivation: Existing 2D registration methods struggle with preserving 3D spatial relationships and face issues like tissue deformation and variability in imaging techniques.Method: Combines zero-shot deep learning-based keypoint matching with optimization-based affine and non-rigid registration techniques.
Result: Effectively addresses tissue deformation, sectioning artifacts, staining variability, and inconsistent illumination without retraining.
Conclusion: ZeroReg3D offers a robust solution for 3D histological reconstruction, balancing accuracy and generalizability.
Abstract: Histological analysis plays a crucial role in understanding tissue structure and pathology. While recent advancements in registration methods have improved 2D histological analysis, they often struggle to preserve critical 3D spatial relationships, limiting their utility in both clinical and research applications. Specifically, constructing accurate 3D models from 2D slices remains challenging due to tissue deformation, sectioning artifacts, variability in imaging techniques, and inconsistent illumination. Deep learning-based registration methods have demonstrated improved performance but suffer from limited generalizability and require large-scale training data. In contrast, non-deep-learning approaches offer better generalizability but often compromise on accuracy. In this study, we introduced ZeroReg3D, a novel zero-shot registration pipeline tailored for accurate 3D reconstruction from serial histological sections. By combining zero-shot deep learning-based keypoint matching with optimization-based affine and non-rigid registration techniques, ZeroReg3D effectively addresses critical challenges such as tissue deformation, sectioning artifacts, staining variability, and inconsistent illumination without requiring retraining or fine-tuning. The code has been made publicly available at https://github.com/hrlblab/ZeroReg3D
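The affine stage of such a pipeline can be sketched with OpenCV: given keypoint correspondences from a zero-shot matcher (here a hypothetical `match_keypoints`), a robust affine transform is estimated with RANSAC and applied to the moving section. The non-rigid refinement stage is omitted.

```python
# Sketch: robust affine registration of consecutive sections from matches.
import cv2
import numpy as np

def register_pair(fixed: np.ndarray, moving: np.ndarray, match_keypoints):
    """fixed, moving: grayscale or RGB section images as numpy arrays."""
    pts_fixed, pts_moving = match_keypoints(fixed, moving)  # (N, 2) each
    M, inliers = cv2.estimateAffine2D(
        pts_moving.astype(np.float32), pts_fixed.astype(np.float32),
        method=cv2.RANSAC, ransacReprojThreshold=3.0)  # assumes enough inliers
    h, w = fixed.shape[:2]
    return cv2.warpAffine(moving, M, (w, h))  # moving section in fixed frame
```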
[420] Learning from Heterogeneity: Generalizing Dynamic Facial Expression Recognition via Distributionally Robust Optimization
Feng-Qi Cui, Anyang Tong, Jinyang Huang, Jie Zhang, Dan Guo, Zhi Liu, Meng Wang
Main category: cs.CV
TL;DR: The paper introduces HDF, a framework for Dynamic Facial Expression Recognition, addressing sample heterogeneity with two modules: DAM for time-frequency modeling and DSM for optimization balance, achieving improved accuracy and robustness.
Details
Motivation: Existing methods for DFER suffer from performance degradation due to sample heterogeneity from multi-source data and individual variability.
Method: Proposes HDF with two modules: DAM for dual-branch attention in time-frequency modeling and DSM for dynamic loss balancing.
Result: HDF outperforms on DFEW and FERV39k datasets, improving WAR and UAR with strong generalization.
Conclusion: HDF effectively addresses heterogeneity in DFER, enhancing accuracy and robustness, with code publicly available.
Abstract: Dynamic Facial Expression Recognition (DFER) plays a critical role in affective computing and human-computer interaction. Although existing methods achieve comparable performance, they inevitably suffer from performance degradation under sample heterogeneity caused by multi-source data and individual expression variability. To address these challenges, we propose a novel framework, called Heterogeneity-aware Distributional Framework (HDF), and design two plug-and-play modules to enhance time-frequency modeling and mitigate optimization imbalance caused by hard samples. Specifically, the Time-Frequency Distributional Attention Module (DAM) captures both temporal consistency and frequency robustness through a dual-branch attention design, improving tolerance to sequence inconsistency and visual style shifts. Then, based on gradient sensitivity and information bottleneck principles, an adaptive optimization module Distribution-aware Scaling Module (DSM) is introduced to dynamically balance classification and contrastive losses, enabling more stable and discriminative representation learning. Extensive experiments on two widely used datasets, DFEW and FERV39k, demonstrate that HDF significantly improves both recognition accuracy and robustness. Our method achieves superior weighted average recall (WAR) and unweighted average recall (UAR) while maintaining strong generalization across diverse and imbalanced scenarios. Codes are released at https://github.com/QIcita/HDF_DFER.
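The gradient-sensitivity idea behind DSM can be illustrated with a simple inverse-gradient-norm weighting of the two losses; this is a simplified sketch, not the paper's exact formulation (the information-bottleneck component is omitted).

```python
import torch

def balanced_loss(cls_loss, con_loss, shared_params):
    # Weight each loss by the other's gradient norm over shared parameters so
    # that neither term dominates optimization (assumes both losses touch
    # every parameter in `shared_params`).
    g_cls = torch.autograd.grad(cls_loss, shared_params, retain_graph=True)
    g_con = torch.autograd.grad(con_loss, shared_params, retain_graph=True)
    n_cls = torch.cat([g.flatten() for g in g_cls]).norm()
    n_con = torch.cat([g.flatten() for g in g_con]).norm()
    w_cls = n_con / (n_cls + n_con + 1e-8)  # smaller-gradient term gets more weight
    w_con = n_cls / (n_cls + n_con + 1e-8)
    return w_cls.detach() * cls_loss + w_con.detach() * con_loss
```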
[421] Coherent Online Road Topology Estimation and Reasoning with Standard-Definition Maps
Khanh Son Pham, Christian Witte, Jens Behley, Johannes Betz, Cyrill Stachniss
Main category: cs.CV
TL;DR: The paper proposes a method for coherent online HD map construction using prior SD map information, outperforming previous methods.
Details
Motivation: Autonomous cars rely on HD maps, but current methods struggle with coherent online construction due to road topology complexity.
Method: A network architecture using hybrid lane segment encodings, prior map information, denoising techniques, and temporal consistency from past frames.
Result: The approach significantly outperforms previous methods, demonstrating the effectiveness of the modeling scheme.
Conclusion: The proposed method effectively addresses the challenge of coherent HD map construction, leveraging prior SD maps for improved performance.
Abstract: Most autonomous cars rely on the availability of high-definition (HD) maps. Current research aims to address this constraint by directly predicting HD map elements from onboard sensors and reasoning about the relationships between the predicted map and traffic elements. Despite recent advancements, the coherent online construction of HD maps remains a challenging endeavor, as it necessitates modeling the high complexity of road topologies in a unified and consistent manner. To address this challenge, we propose a coherent approach to predict lane segments and their corresponding topology, as well as road boundaries, all by leveraging prior map information represented by commonly available standard-definition (SD) maps. We propose a network architecture that leverages hybrid lane segment encodings comprising prior information, along with denoising techniques to enhance training stability and performance. Furthermore, we incorporate past frames for temporal consistency. Our experimental evaluation demonstrates that our approach outperforms previous methods by a large margin, highlighting the benefits of our modeling scheme.
[422] A Lightweight Face Quality Assessment Framework to Improve Face Verification Performance in Real-Time Screening Applications
Ahmed Aman Ibrahim, Hamad Mansour Alawar, Abdulnasser Abbas Zehi, Ahmed Mohammad Alkendi, Bilal Shafi Ashfaq Ahmed Mirza, Shan Ullah, Ismail Lujain Jaleel, Hassan Ugail
Main category: cs.CV
TL;DR: A lightweight framework for face quality assessment improves verification accuracy by filtering low-quality images, achieving 96.67% accuracy and reducing false rejection rates by 99.7%.
Details
Motivation: Low-quality face images degrade face verification performance, necessitating a pre-filtering solution for real-time applications like surveillance and access control.
Method: Uses normalized facial landmarks and a Random Forest Regression classifier to assess face image quality.
Result: Achieves 96.67% accuracy, reduces false rejection rates by 99.7%, and enhances cosine similarity scores with ArcFace.
Conclusion: The framework effectively mitigates poor-quality image impact, outperforms existing methods, and addresses real-world challenges like resolution and pose variations.
Abstract: Face image quality plays a critical role in determining the accuracy and reliability of face verification systems, particularly in real-time screening applications such as surveillance, identity verification, and access control. Low-quality face images, often caused by factors such as motion blur, poor lighting conditions, occlusions, and extreme pose variations, significantly degrade the performance of face recognition models, leading to higher false rejection and false acceptance rates. In this work, we propose a lightweight yet effective framework for automatic face quality assessment, which aims to pre-filter low-quality face images before they are passed to the verification pipeline. Our approach utilises normalised facial landmarks in conjunction with a Random Forest Regression classifier to assess image quality, achieving an accuracy of 96.67%. By integrating this quality assessment module into the face verification process, we observe a substantial improvement in performance, including a 99.7% reduction in the false rejection rate and enhanced cosine similarity scores when paired with the ArcFace face verification model. To validate our approach, we conducted experiments on a real-world dataset comprising over 600 subjects captured from CCTV footage in unconstrained environments within Dubai Police. Our results demonstrate that the proposed framework effectively mitigates the impact of poor-quality face images, outperforming existing face quality assessment techniques while maintaining computational efficiency. Moreover, the framework specifically addresses two critical challenges in real-time screening: variations in face resolution and pose deviations, both of which are prevalent in practical surveillance scenarios.
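A minimal sketch of the landmark-based quality gate, assuming landmarks are already detected; the normalization scheme, feature layout, and the placeholder training data are illustrative, not the paper's.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def landmark_features(landmarks):
    """Make 2D landmarks translation- and scale-invariant, then flatten."""
    pts = np.asarray(landmarks, dtype=float)          # shape (N, 2)
    pts -= pts.mean(axis=0)                           # remove translation
    pts /= np.linalg.norm(pts, axis=1).mean() + 1e-8  # remove scale
    return pts.ravel()

# Placeholder training data: real training would pair landmark features with
# annotated quality scores in [0, 1].
X = np.random.rand(500, 136)   # e.g. 68 landmarks -> 136-dim feature vectors
y = np.random.rand(500)
model = RandomForestRegressor(n_estimators=100).fit(X, y)

def accept(landmarks, threshold=0.5):
    # Pre-filter: pass the face to verification only if predicted quality is high.
    return model.predict([landmark_features(landmarks)])[0] >= threshold
```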
[423] ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation
Sherry X. Chen, Yi Wei, Luowei Zhou, Suren Kumar
Main category: cs.CV
TL;DR: ADIEE introduces an automated dataset creation approach to train a scoring model for evaluating instruction-guided image editing, outperforming existing models in benchmarks.
Details
Motivation: The need for effective automated evaluation in instruction-guided image editing, addressing limitations of current VLMs (alignment issues, lack of transparency, and cost inefficiency).
Method: ADIEE generates a large-scale dataset (100K+ samples) to fine-tune a modified LLaVA-NeXT-8B model, decoding numeric scores from custom tokens.
Result: The scorer outperforms open-source VLMs and Gemini-Pro 1.5, improving score correlation with human ratings and pair-wise comparison accuracy. It also boosts MagicBrush’s evaluation score by 8.98%.
Conclusion: ADIEE provides a transparent, cost-efficient solution for automated evaluation, enabling better edit selection and model fine-tuning.
Abstract: Recent advances in instruction-guided image editing underscore the need for effective automated evaluation. While Vision-Language Models (VLMs) have been explored as judges, open-source models struggle with alignment, and proprietary models lack transparency and cost efficiency. Additionally, no public training datasets exist to fine-tune open-source VLMs, only small benchmarks with diverse evaluation schemes. To address this, we introduce ADIEE, an automated dataset creation approach which is then used to train a scoring model for instruction-guided image editing evaluation. We generate a large-scale dataset with over 100K samples and use it to fine-tune a LLaVA-NeXT-8B model modified to decode a numeric score from a custom token. The resulting scorer outperforms all open-source VLMs and Gemini-Pro 1.5 across all benchmarks, achieving a 0.0696 (+17.24%) gain in score correlation with human ratings on AURORA-Bench, and improving pair-wise comparison accuracy by 4.03% (+7.21%) on GenAI-Bench and 4.75% (+9.35%) on AURORA-Bench, respectively, compared to the state-of-the-art. The scorer can act as a reward model, enabling automated best edit selection and model fine-tuning. Notably, the proposed scorer can boost MagicBrush model’s average evaluation score on ImagenHub from 5.90 to 6.43 (+8.98%). Our code and models are available at https://github.com/SherryXTChen/ADIEE.git.
[424] Swin-TUNA : A Novel PEFT Approach for Accurate Food Image Segmentation
Haotian Chen, Zhiyong Xiao
Main category: cs.CV
TL;DR: Swin-TUNA introduces a parameter-efficient fine-tuning method for food image segmentation, reducing parameters by 98.7% while outperforming FoodSAM.
Details
Motivation: Existing Transformer-based models like FoodSAM are impractical due to high computational demands. Swin-TUNA aims to address this with efficient parameter usage.
Method: Integrates multiscale trainable adapters into Swin Transformer, using hierarchical feature adaptation and dynamic balancing for task-agnostic and task-specific features.
Result: Achieves mIoU of 50.56% and 74.94% on FoodSeg103 and UECFoodPix Complete, surpassing FoodSAM with only 8.13M parameters.
Conclusion: Swin-TUNA offers a lightweight, efficient solution for food image segmentation with faster convergence and better generalization.
Abstract: In the field of food image processing, efficient semantic segmentation techniques are crucial for industrial applications. However, existing large-scale Transformer-based models (such as FoodSAM) face challenges in meeting practical deployment requirements due to their massive parameter counts and high computational resource demands. This paper introduces the TUNable Adapter module (Swin-TUNA), a Parameter Efficient Fine-Tuning (PEFT) method that integrates multiscale trainable adapters into the Swin Transformer architecture, achieving high-performance food image segmentation by updating only 4% of the parameters. The core innovation of Swin-TUNA lies in its hierarchical feature adaptation mechanism: it designs separable convolutions in depth and dimensional mappings of varying scales to address the differences in features between shallow and deep networks, combined with a dynamic balancing strategy for task-agnostic and task-specific features. Experiments demonstrate that this method achieves mIoU of 50.56% and 74.94% on the FoodSeg103 and UECFoodPix Complete datasets, respectively, surpassing the fully parameterized FoodSAM model while reducing the parameter count by 98.7% (to only 8.13M). Furthermore, Swin-TUNA exhibits faster convergence and stronger generalization capabilities in low-data scenarios, providing an efficient solution for lightweight food image segmentation.
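For readers unfamiliar with PEFT adapters, the general pattern looks like the bottleneck sketch below; Swin-TUNA's actual modules use multiscale depthwise-separable convolutions inside Swin blocks, so this shows only the shape of the idea.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Generic bottleneck adapter (sketch; not Swin-TUNA's exact module)."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.up = nn.Linear(dim // reduction, dim)
        self.act = nn.GELU()

    def forward(self, x):
        # Residual connection keeps the frozen backbone path intact.
        return x + self.up(self.act(self.down(x)))

# PEFT recipe: freeze the backbone and train only the inserted adapters, e.g.
#   for p in backbone.parameters(): p.requires_grad = False
#   for p in adapter.parameters():  p.requires_grad = True
```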
[425] Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection
Subhajit Maity, Ayan Kumar Bhunia, Subhadeep Koley, Pinaki Nath Chowdhury, Aneeshan Sain, Yi-Zhe Song
Main category: cs.CV
TL;DR: A framework for few-shot keypoint detection using sketches, overcoming cross-modal and style challenges with prototypical domain adaptation.
Details
Motivation: Addressing the lack of source data in few-shot keypoint detection by leveraging sketches as a source-free alternative.
Method: Uses a prototypical setup with a grid-based locator and prototypical domain adaptation to handle cross-modal embeddings and user-specific sketch styles.
Result: Demonstrates successful few-shot convergence across novel keypoints and classes in experiments.
Conclusion: The proposed framework effectively addresses challenges in few-shot keypoint detection using sketches.
Abstract: Keypoint detection, integral to modern machine perception, faces challenges in few-shot learning, particularly when source data from the same distribution as the query is unavailable. This gap is addressed by leveraging sketches, a popular form of human expression, providing a source-free alternative. However, challenges arise in mastering cross-modal embeddings and handling user-specific sketch styles. Our proposed framework overcomes these hurdles with a prototypical setup, combined with a grid-based locator and prototypical domain adaptation. We also demonstrate success in few-shot convergence across novel keypoints and classes through extensive experiments.
[426] Unreal is all you need: Multimodal ISAC Data Simulation with Only One Engine
Kongwu Huang, Shiyi Mu, Jun Jiang, Yuan Gao, Shugong Xu
Main category: cs.CV
TL;DR: Great-X is a multimodal data twin platform integrating ray-tracing and autonomous driving tools for synchronized simulation. It produces the Great-MSD dataset and a CSI-based UAV 3D localization algorithm.
Details
Motivation: To explore scaling laws' potential in ISAC research by creating a unified platform for multimodal data simulation.
Method: Reconstructs Sionna’s ray-tracing in Unreal Engine, integrates autonomous driving tools, and simulates CSI, RGB, Radar, and LiDAR data.
Result: Developed Great-MSD dataset and a baseline CSI-based UAV 3D localization algorithm, showing feasibility across CSI engines.
Conclusion: Great-X and Great-MSD advance ISAC research with open-source tools and datasets, demonstrating scalability and generalizability.
Abstract: Scaling laws have achieved success in LLM and foundation models. To explore their potential in ISAC research, we propose Great-X. This single-engine multimodal data twin platform reconstructs the ray-tracing computation of Sionna within Unreal Engine and is deeply integrated with autonomous driving tools. This enables efficient and synchronized simulation of multimodal data, including CSI, RGB, Radar, and LiDAR. Based on this platform, we construct an open-source, large-scale, low-altitude UAV multimodal synaesthesia dataset named Great-MSD, and propose a baseline CSI-based UAV 3D localization algorithm, demonstrating its feasibility and generalizability across different CSI simulation engines. The related code and dataset will be made available at: https://github.com/hkw-xg/Great-MCD.
[427] TextSAM-EUS: Text Prompt Learning for SAM to Accurately Segment Pancreatic Tumor in Endoscopic Ultrasound
Pascal Spiegler, Taha Koleilat, Arash Harirpoush, Corey S. Miller, Hassan Rivaz, Marta Kersten-Oertel, Yiming Xiao
Main category: cs.CV
TL;DR: TextSAM-EUS is a text-driven adaptation of SAM for pancreatic tumor segmentation in EUS, requiring no manual prompts and outperforming SOTA models.
Details
Motivation: Challenges in EUS segmentation due to noise, low contrast, and reliance on expert annotations.
Method: Uses BiomedCLIP text encoder and LoRA-based SAM adaptation for automatic segmentation, tuning minimal parameters.
Result: Achieves 82.69% Dice and 85.28% NSD with automatic prompts, surpassing SOTA models.
Conclusion: TextSAM-EUS is efficient and robust for EUS segmentation, pioneering prompt learning in SAM-based medical imaging.
Abstract: Pancreatic cancer carries a poor prognosis and relies on endoscopic ultrasound (EUS) for targeted biopsy and radiotherapy. However, the speckle noise, low contrast, and unintuitive appearance of EUS make segmentation of pancreatic tumors with fully supervised deep learning (DL) models both error-prone and dependent on large, expert-curated annotation datasets. To address these challenges, we present TextSAM-EUS, a novel, lightweight, text-driven adaptation of the Segment Anything Model (SAM) that requires no manual geometric prompts at inference. Our approach leverages text prompt learning (context optimization) through the BiomedCLIP text encoder in conjunction with a LoRA-based adaptation of SAM’s architecture to enable automatic pancreatic tumor segmentation in EUS, tuning only 0.86% of the total parameters. On the public Endoscopic Ultrasound Database of the Pancreas, TextSAM-EUS with automatic prompts attains 82.69% Dice and 85.28% normalized surface distance (NSD), and with manual geometric prompts reaches 83.10% Dice and 85.70% NSD, outperforming both existing state-of-the-art (SOTA) supervised DL models and foundation models (e.g., SAM and its variants). As the first attempt to incorporate prompt learning in SAM-based medical image segmentation, TextSAM-EUS offers a practical option for efficient and robust automatic EUS segmentation.
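The LoRA-based adaptation mentioned above follows the standard low-rank update pattern; a generic sketch (rank, scaling, and layer placement here are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank=4, alpha=8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```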
[428] OpenHuman4D: Open-Vocabulary 4D Human Parsing
Keito Suzuki, Bang Du, Runfa Blark Li, Kunyao Chen, Lei Wang, Peng Liu, Ning Bi, Truong Nguyen
Main category: cs.CV
TL;DR: A 4D human parsing framework is introduced to reduce inference time and enable open-vocabulary capabilities, improving dynamic 3D human representation.
Details
Motivation: Existing human part segmentation methods are limited by closed-set datasets and slow inference, hindering their practical use in virtual and extended reality.
Method: The framework uses mask-based video tracking, a Mask Validation module, and a 4D Mask Fusion module to enhance efficiency and robustness.
Result: The method achieves up to 93.3% faster inference than prior state-of-the-art, while handling open-vocabulary tasks.
Conclusion: The proposed framework effectively addresses limitations in dynamic 3D human parsing, offering significant speed and flexibility improvements.
Abstract: Understanding dynamic 3D human representation has become increasingly critical in virtual and extended reality applications. However, existing human part segmentation methods are constrained by reliance on closed-set datasets and prolonged inference times, which significantly restrict their applicability. In this paper, we introduce the first 4D human parsing framework that simultaneously addresses these challenges by reducing the inference time and introducing open-vocabulary capabilities. Building upon state-of-the-art open-vocabulary 3D human parsing techniques, our approach extends the support to 4D human-centric video with three key innovations: 1) We adopt mask-based video object tracking to efficiently establish spatial and temporal correspondences, avoiding the necessity of segmenting all frames. 2) A novel Mask Validation module is designed to manage new target identification and mitigate tracking failures. 3) We propose a 4D Mask Fusion module, integrating memory-conditioned attention and logits equalization for robust embedding fusion. Extensive experiments demonstrate the effectiveness and flexibility of the proposed method on 4D human-centric parsing tasks, achieving up to 93.3% acceleration compared to the previous state-of-the-art method, which was limited to parsing fixed classes.
[429] PatchTraj: Dynamic Patch Representation Learning for Time-Frequency Trajectory Prediction
Yanghong Liu, Xingping Dong, Ming Li, Weixing Zhang, Yidong Lou
Main category: cs.CV
TL;DR: PatchTraj is a dynamic patch-based framework for pedestrian trajectory prediction, unifying time and frequency domains to address limitations in existing methods.
Details
Motivation: Existing methods inadequately model human motion dynamics and lack frequency-domain interaction in time representation.
Method: Decomposes trajectories into time and frequency components, uses dynamic patch partitioning, adaptive embedding, hierarchical feature aggregation, and cross-modal attention for fusion.
Result: Achieves state-of-the-art performance on ETH-UCY, SDD, NBA, and JRDB datasets.
Conclusion: PatchTraj effectively balances local and long-range dependencies, improving trajectory prediction accuracy and efficiency.
Abstract: Pedestrian trajectory prediction is crucial for autonomous driving and robotics. However, existing point-based and grid-based methods exhibit two key limitations: they insufficiently model human motion dynamics, failing to balance local motion details with long-range spatiotemporal dependencies, and their time representation lacks interaction with the frequency domain when modeling trajectory sequences. To address these challenges, we propose PatchTraj, a dynamic patch-based trajectory prediction framework that unifies time-domain and frequency-domain representations. Specifically, we decompose the trajectory into raw time sequences and frequency components, employing dynamic patch partitioning for multi-scale trajectory segmentation to capture hierarchical motion patterns. Each patch is processed by an adaptive embedding layer with scale-aware feature extraction, followed by hierarchical feature aggregation to model both fine-grained and long-range dependencies. The outputs of the two branches interact via cross-modal attention, enabling complementary fusion of temporal and spectral cues. Finally, a Transformer encoder-decoder integrates both modalities to autoregressively predict future trajectories. Extensive experiments on ETH-UCY, SDD, NBA, and JRDB datasets demonstrate that our method achieves state-of-the-art performance with high efficiency.
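A simplified sketch of the two-branch input construction (time-domain patches plus a frequency spectrum); dynamic patch sizing, adaptive embedding, and the cross-modal attention fusion are omitted.

```python
import torch

def decompose_and_patch(traj, patch_len):
    """Split a trajectory into time patches and frequency features (sketch)."""
    b, t, c = traj.shape                   # (batch, seq_len, 2) xy coordinates
    assert t % patch_len == 0              # fixed-size patching for simplicity
    time_patches = traj.reshape(b, t // patch_len, patch_len * c)
    freq = torch.fft.rfft(traj, dim=1)     # per-coordinate spectrum (complex)
    freq_feats = torch.cat([freq.real, freq.imag], dim=-1)  # real-valued features
    return time_patches, freq_feats

time_p, freq_f = decompose_and_patch(torch.randn(8, 20, 2), patch_len=4)
```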
[430] GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space
David G. Shatwell, Ishan Rajendrakumar Dave, Sirnam Swetha, Mubarak Shah
Main category: cs.CV
TL;DR: GT-Loc is a retrieval-based method that jointly predicts image capture time (hour/month) and geo-location (GPS) using separate encoders and a shared feature space, outperforming previous methods.
Details
Motivation: Timestamp prediction is crucial for metadata correction, retrieval, and forensics but is interdependent with geo-localization due to visual cues like brightness and shadows.
Method: GT-Loc uses separate encoders for images, time, and location, aligning embeddings in a shared space. It employs temporal metric learning on a cyclical toroidal surface for soft targets.
Result: GT-Loc surpasses previous time prediction methods, even without ground-truth geo-location, and achieves competitive geo-localization results.
Conclusion: The unified embedding space enables compositional and text-based retrieval, demonstrating the effectiveness of joint optimization for timestamp and location prediction.
Abstract: Timestamp prediction aims to determine when an image was captured using only visual information, supporting applications such as metadata correction, retrieval, and digital forensics. In outdoor scenarios, hourly estimates rely on cues like brightness, hue, and shadow positioning, while seasonal changes and weather inform date estimation. However, these visual cues significantly depend on geographic context, closely linking timestamp prediction to geo-localization. To address this interdependence, we introduce GT-Loc, a novel retrieval-based method that jointly predicts the capture time (hour and month) and geo-location (GPS coordinates) of an image. Our approach employs separate encoders for images, time, and location, aligning their embeddings within a shared high-dimensional feature space. Recognizing the cyclical nature of time, instead of conventional contrastive learning with hard positives and negatives, we propose a temporal metric-learning objective providing soft targets by modeling pairwise time differences over a cyclical toroidal surface. We present new benchmarks demonstrating that our joint optimization surpasses previous time prediction methods, even those using the ground-truth geo-location as an input during inference. Additionally, our approach achieves competitive results on standard geo-localization tasks, and the unified embedding space facilitates compositional and text-based image retrieval.
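The cyclical soft-target idea can be made concrete with hour/month distances on a torus; GT-Loc's exact kernel and normalization may differ, so treat this as a sketch.

```python
import torch

def cyclic_delta(a, b, period):
    # Shortest distance between two points on a circle of the given period.
    d = torch.remainder(a - b, period)
    return torch.minimum(d, period - d)

def soft_time_targets(hours, months, tau=1.0):
    """Pairwise soft targets from time differences on the (hour, month) torus."""
    dh = cyclic_delta(hours[:, None], hours[None, :], 24.0) / 12.0   # in [0, 1]
    dm = cyclic_delta(months[:, None], months[None, :], 12.0) / 6.0  # in [0, 1]
    dist = torch.sqrt(dh ** 2 + dm ** 2)        # toroidal distance
    return torch.softmax(-dist / tau, dim=1)    # each row: soft targets for an anchor

targets = soft_time_targets(torch.tensor([3., 14., 23.]), torch.tensor([1., 6., 12.]))
```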
[431] ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition
Ronggang Huang, Haoxin Yang, Yan Cai, Xuemiao Xu, Huaidong Zhang, Shengfeng He
Main category: cs.CV
TL;DR: ViewSRD improves 3D visual grounding by decomposing complex queries into simpler statements and integrating multi-view textual-scene interactions.
Details
Motivation: Existing methods struggle with complex multi-anchor queries and perspective inconsistencies in 3D visual grounding.
Method: ViewSRD uses Simple Relation Decoupling (SRD) to simplify queries, Multi-view Textual-Scene Interaction (Multi-TSI) for cross-modal feature integration, and Textual-Scene Reasoning for unified predictions.
Result: ViewSRD outperforms state-of-the-art methods, especially in complex spatial queries.
Conclusion: ViewSRD effectively addresses challenges in 3D visual grounding by structured multi-view decomposition and cross-modal integration.
Abstract: 3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. However, existing methods struggle with disentangling targets from anchors in complex multi-anchor queries and resolving inconsistencies in spatial descriptions caused by perspective variations. To tackle these challenges, we propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. First, the Simple Relation Decoupling (SRD) module restructures complex multi-anchor queries into a set of targeted single-anchor statements, generating a structured set of perspective-aware descriptions that clarify positional relationships. These decomposed representations serve as the foundation for the Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates textual and scene features across multiple viewpoints using shared, Cross-modal Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a Textual-Scene Reasoning module synthesizes multi-view predictions into a unified and robust 3D visual grounding. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods, particularly in complex queries requiring precise spatial differentiation. Code is available at https://github.com/visualjason/ViewSRD.
[432] Implementing Adaptations for Vision AutoRegressive Model
Kaif Shaikh, Franziska Boenisch, Adam Dziedzic
Main category: cs.CV
TL;DR: VAR outperforms DMs in non-DP image generation adaptations but struggles with DP adaptations, highlighting a need for further research in private VAR adaptations.
Details
Motivation: To explore and benchmark adaptation strategies for VAR in downstream tasks, comparing them to DM adaptations, especially in differentially private settings.
Method: Implemented and benchmarked various adaptation strategies for VAR, comparing them to state-of-the-art DM adaptation techniques.
Result: VAR performs better than DMs for non-DP adaptations but underperforms in DP settings.
Conclusion: Further research is needed to improve differentially private adaptations for VAR.
Abstract: The Vision AutoRegressive model (VAR) was recently introduced as an alternative to Diffusion Models (DMs) in the image generation domain. In this work we focus on its adaptations, which aim to fine-tune pre-trained models to perform specific downstream tasks, like medical data generation. While many such techniques exist for DMs, adaptations for VAR remain underexplored. Similarly, differentially private (DP) adaptations, which aim to preserve the privacy of the adaptation data, have been extensively studied for DMs, while VAR lacks such solutions. In our work, we implement and benchmark many strategies for VAR and compare them to state-of-the-art DM adaptation strategies. We observe that VAR outperforms DMs for non-DP adaptations; however, performance suffers under DP, which necessitates further research in private adaptations for VAR. Code is available at https://github.com/sprintml/finetuning_var_dp.
[433] Mitigating Object Hallucinations via Sentence-Level Early Intervention
Shangpin Peng, Senqiao Yang, Li Jiang, Zhuotao Tian
Main category: cs.CV
TL;DR: SENTINEL reduces hallucinations in MLLMs by early intervention using in-domain preference learning, achieving a 90% reduction without human annotations.
Details
Motivation: Hallucinations in MLLMs persist despite existing methods, often due to early-stage text generation errors.
Method: SENTINEL bootstraps in-domain preference pairs, validates object existence, and trains models with context-aware preference loss (C-DPO).
Result: SENTINEL reduces hallucinations by 90%, outperforming prior methods on benchmarks.
Conclusion: SENTINEL is superior and generalizable, with open-sourced models, datasets, and code.
Abstract: Multimodal large language models (MLLMs) have revolutionized cross-modal understanding but continue to struggle with hallucinations - fabricated content contradicting visual inputs. Existing hallucination mitigation methods either incur prohibitive computational costs or introduce distribution mismatches between training data and model outputs. We identify a critical insight: hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. To address this, we propose SENTINEL (Sentence-level Early iNtervention Through IN-domain prEference Learning), a framework that eliminates dependency on human annotations. Specifically, we first bootstrap high-quality in-domain preference pairs by iteratively sampling model outputs, validating object existence through cross-checking with two open-vocabulary detectors, and classifying sentences into hallucinated/non-hallucinated categories. Subsequently, we use context-coherent positive samples and hallucinated negative samples to build context-aware preference data iteratively. Finally, we train models using a context-aware preference loss (C-DPO) that emphasizes discriminative learning at the sentence level where hallucinations initially manifest. Experimental results show that SENTINEL can reduce hallucinations by over 90% compared to the original model and outperforms the previous state-of-the-art method on both hallucination benchmarks and general capabilities benchmarks, demonstrating its superiority and generalization ability. The models, datasets, and code are available at https://github.com/pspdada/SENTINEL.
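C-DPO builds on the standard DPO objective applied at the sentence level; a generic sketch follows (the context-aware conditioning that distinguishes C-DPO is not shown).

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO on sentence-level log-probs: prefer the non-hallucinated
    sentence over the hallucinated one, relative to a frozen reference model."""
    margin = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
    return -F.logsigmoid(beta * margin).mean()

# logp_* are summed token log-probs of a sentence under the trained policy
# and the frozen reference (illustrative numbers).
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-10.1]),
                torch.tensor([-12.0]), torch.tensor([-10.0]))
```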
[434] One Last Attention for Your Vision-Language Model
Liang Chen, Ghazi Shazan Ahmad, Tianjun Yao, Lingqiao Liu, Zhiqiang Shen
Main category: cs.CV
TL;DR: RAda is a method for fine-tuning VLMs by dynamically calibrating fused representations, improving performance with minimal changes.
Details
Motivation: Current adaptation methods neglect fused representations in VLMs, limiting their downstream potential.
Method: RAda uses a learned mask from a lightweight attention layer to adjust cross-modal interactions in the rational matrix.
Result: RAda improves baseline performance and matches state-of-the-art methods in various settings.
Conclusion: RAda is a versatile and efficient fine-tuning technique for VLMs.
Abstract: Pretrained vision-language models (VLMs), such as CLIP, achieve remarkable zero-shot performance, yet their downstream potential hinges on effective fine-tuning. Most adaptation methods typically focus on refining representations from separate modalities (text or vision) but neglect the critical role of their fused representations in the decision-making process, i.e., the rational matrix that drives the final prediction. To bridge the gap, we propose a simple yet effective Rational Adaptation (RAda) to explicitly exploit the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix, enabling targeted adjustments to the final cross-modal interactions without incurring costly modifications to intermediate features. Experiments in different settings (i.e., updating or freezing pretrained encoders in adaptation, and test-time training that can only access the unlabeled test data) show that RAda serves as a versatile fine-tuning technique, improving the baseline with minimal code and performing comparably against current arts in most settings. Code is available at https://github.com/khufia/RAda/tree/main.
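To illustrate the mechanism, here is a hypothetical sketch of masking the rational (image-text logit) matrix with a lightweight attention layer; module shapes and placement are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class RationalAdapter(nn.Module):
    """Calibrate image-text logits with a mask from a small attention layer."""
    def __init__(self, num_classes, dim=64):
        super().__init__()
        self.proj_in = nn.Linear(num_classes, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.proj_out = nn.Linear(dim, num_classes)

    def forward(self, logits):                     # logits: (batch, num_classes)
        h = self.proj_in(logits).unsqueeze(1)      # (batch, 1, dim)
        h, _ = self.attn(h, h, h)
        mask = torch.sigmoid(self.proj_out(h.squeeze(1)))
        return logits * mask                       # element-wise calibration

calibrated = RationalAdapter(num_classes=10)(torch.randn(4, 10))
```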
[435] SurfaceSplat: Connecting Surface Reconstruction and Gaussian Splatting
Zihui Gao, Jia-Wang Bian, Guosheng Lin, Hao Chen, Chunhua Shen
Main category: cs.CV
TL;DR: A hybrid method combining SDF and 3DGS improves surface reconstruction and novel view rendering by leveraging coarse geometry and fine details.
Details
Motivation: Addressing the limitations of SDF (lacking fine details) and 3DGS (lacking global coherence) in sparse-view image tasks.
Method: Combines SDF for coarse geometry and 3DGS for fine details, refining each other iteratively.
Result: Outperforms state-of-the-art methods on DTU and MobileBrick datasets.
Conclusion: The hybrid approach effectively balances geometry and detail, advancing sparse-view reconstruction and rendering.
Abstract: Surface reconstruction and novel view rendering from sparse-view images are challenging. Signed Distance Function (SDF)-based methods struggle with fine details, while 3D Gaussian Splatting (3DGS)-based approaches lack global geometry coherence. We propose a novel hybrid method that combines the strengths of both approaches: SDF captures coarse geometry to enhance 3DGS-based rendering, while newly rendered images from 3DGS refine the details of SDF for accurate surface reconstruction. As a result, our method surpasses state-of-the-art approaches in surface reconstruction and novel view synthesis on the DTU and MobileBrick datasets. Code will be released at https://github.com/aim-uofa/SurfaceSplat.
[436] M-SpecGene: Generalized Foundation Model for RGBT Multispectral Vision
Kailai Zhou, Fuqiang Yang, Shixian Wang, Bihan Wen, Chongde Zi, Linsen Chen, Qiu Shen, Xun Cao
Main category: cs.CV
TL;DR: The paper introduces M-SpecGene, a generalized RGB-Thermal foundation model, addressing modality bias and data bottlenecks by learning modality-invariant representations self-supervised. It uses CMSS and GMM-CMSS for pre-training, validated across 11 datasets.
Details
Motivation: Current RGBT tasks rely on task-specific models with artificial biases and data limitations. A unified, generalized approach is needed.
Method: Develops M-SpecGene with Cross-Modality Structural Sparsity (CMSS) and GMM-CMSS masking for self-supervised pre-training.
Result: Validated on 11 datasets for four RGBT tasks, showing strong generalizability.
Conclusion: M-SpecGene offers a unified paradigm for RGBT tasks, overcoming prior limitations with scalable, self-supervised learning.
Abstract: RGB-Thermal (RGBT) multispectral vision is essential for robust perception in complex environments. Most RGBT tasks follow a case-by-case research paradigm, relying on manually customized models to learn task-oriented representations. Nevertheless, this paradigm is inherently constrained by artificial inductive bias, modality bias, and data bottleneck. To address these limitations, we make the initial attempt to build a Generalized RGBT MultiSpectral foundation model (M-SpecGene), which aims to learn modality-invariant representations from large-scale broad data in a self-supervised manner. M-SpecGene provides new insights into multispectral fusion and integrates prior case-by-case studies into a unified paradigm. Considering the unique characteristic of information imbalance in RGBT data, we introduce the Cross-Modality Structural Sparsity (CMSS) metric to quantify the information density across two modalities. Then we develop the GMM-CMSS progressive masking strategy to facilitate a flexible, easy-to-hard, and object-centric pre-training process. Comprehensive experiments validate M-SpecGene’s generalizability across eleven datasets for four RGBT downstream tasks. The code will be available at https://github.com/CalayZhou/M-SpecGene.
[437] ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering
Duong T. Tran, Trung-Kien Tran, Manfred Hauswirth, Danh Le Phuoc
Main category: cs.CV
TL;DR: A new dataset, ReasonVQA, is introduced for Visual Question Answering (VQA), integrating structured knowledge and generating complex questions. It challenges state-of-the-art models and scales easily.
Details
Motivation: To address the need for a dataset that combines visual and structured knowledge for complex reasoning in VQA tasks.
Method: Automatically integrates encyclopedic knowledge and uses a low-cost framework to generate multi-hop questions.
Result: ReasonVQA challenges existing VQA models and surpasses the largest datasets in size and complexity.
Conclusion: ReasonVQA is a scalable, challenging dataset with potential to advance VQA research.
Abstract: In this paper, we propose a new dataset, ReasonVQA, for the Visual Question Answering (VQA) task. Our dataset is automatically integrated with structured encyclopedic knowledge and constructed using a low-cost framework, which is capable of generating complex, multi-hop questions. We evaluated state-of-the-art VQA models on ReasonVQA, and the empirical results demonstrate that ReasonVQA poses significant challenges to these models, highlighting its potential for benchmarking and advancing the field of VQA. Additionally, our dataset can be easily scaled with respect to input images; the current version surpasses the largest existing datasets requiring external knowledge by more than an order of magnitude.
[438] Vec2Face+ for Face Dataset Generation
Haiyu Wu, Jaskirat Singh, Sicong Tian, Liang Zheng, Kevin W. Bowyer
Main category: cs.CV
TL;DR: Vec2Face+ is a generative model for synthesizing high-quality face recognition training data with controlled identity and attribute variations, outperforming real-world datasets in accuracy.
Details
Motivation: Existing methods for synthesizing face recognition data overlook intra-class identity consistency while increasing intra-class variation, leading to suboptimal datasets.
Method: Vec2Face+ generates images from features, using three strategies: sampling distinct vectors, AttrOP for attribute variation, and LoRA-based pose control for identity-preserving profile poses.
Result: Vec2Face+ produces datasets (VFace10K, VFace100K, VFace300K) that achieve state-of-the-art accuracy on real-world test sets, surpassing CASIA-WebFace.
Conclusion: Synthetic datasets can outperform real ones in accuracy, but challenges like bias and twin verification performance remain for future work.
Abstract: When synthesizing identities as face recognition training data, it is generally believed that large inter-class separability and intra-class attribute variation are essential for synthesizing a quality dataset. This belief is generally correct, and this is what we aim for. However, when increasing intra-class variation, existing methods overlook the necessity of maintaining intra-class identity consistency. To address this and generate high-quality face training data, we propose Vec2Face+, a generative model that creates images directly from image features and allows for continuous and easy control of face identities and attributes. Using Vec2Face+, we obtain datasets with proper inter-class separability and intra-class variation and identity consistency using three strategies: 1) we sample vectors sufficiently different from others to generate well-separated identities; 2) we propose an AttrOP algorithm for increasing general attribute variations; 3) we propose LoRA-based pose control for generating images with profile head poses, which is more efficient and identity-preserving than AttrOP. Our system generates VFace10K, a synthetic face dataset with 10K identities, which allows an FR model to achieve state-of-the-art accuracy on seven real-world test sets. Scaling the size to 4M and 12M images, the corresponding VFace100K and VFace300K datasets yield higher accuracy than the real-world training dataset, CASIA-WebFace, on five real-world test sets. This is the first time a synthetic dataset beats CASIA-WebFace in average accuracy. In addition, we find that only 1 out of 11 synthetic datasets outperforms random guessing (i.e., 50%) in twin verification and that models trained with synthetic identities are more biased than those trained with real identities. Both are important aspects for future investigation. Code is available at https://github.com/HaiyuWu/Vec2Face_plus
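Strategy 1 (well-separated identity vectors) can be approximated by rejection sampling on pairwise cosine similarity; the threshold and dimensionality below are illustrative.

```python
import numpy as np

def sample_identity_vectors(n_ids, dim=512, max_cos=0.3, seed=0):
    """Sample unit vectors whose pairwise |cosine similarity| stays low (sketch)."""
    rng = np.random.default_rng(seed)
    vecs = []
    while len(vecs) < n_ids:
        v = rng.standard_normal(dim)
        v /= np.linalg.norm(v)
        if all(abs(v @ u) < max_cos for u in vecs):  # keep only well-separated ids
            vecs.append(v)
    return np.stack(vecs)

ids = sample_identity_vectors(100)   # each row would seed one synthetic identity
```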
[439] PARTE: Part-Guided Texturing for 3D Human Reconstruction from a Single Image
Hyeongjin Nam, Donghwan Kim, Gyeongsik Moon, Kyoung Mu Lee
Main category: cs.CV
TL;DR: PARTE improves 3D human reconstruction by using part segmentation to align textures, avoiding blending issues.
Details
Motivation: Existing methods misalign textures across human parts; PARTE leverages part segmentation for better texture coherence.
Method: Uses a PartSegmenter for 3D part segmentation and a PartTexturer for part-guided texture reconstruction.
Result: Achieves state-of-the-art quality in 3D human reconstruction.
Conclusion: PARTE effectively addresses texture misalignment by integrating part segmentation priors.
Abstract: The misaligned human texture across different human parts is one of the main limitations of existing 3D human reconstruction methods. Each human part, such as a jacket or pants, should maintain a distinct texture without blending into others. The structural coherence of human parts serves as a crucial cue to infer human textures in the invisible regions of a single image. However, most existing 3D human reconstruction methods do not explicitly exploit such part segmentation priors, leading to misaligned textures in their reconstructions. In this regard, we present PARTE, which utilizes 3D human part information as a key guide to reconstruct 3D human textures. Our framework comprises two core components. First, to infer 3D human part information from a single image, we propose a 3D part segmentation module (PartSegmenter) that initially reconstructs a textureless human surface and predicts human part labels based on the textureless surface. Second, to incorporate part information into texture reconstruction, we introduce a part-guided texturing module (PartTexturer), which acquires prior knowledge from a pre-trained image generation network on texture alignment of human parts. Extensive experiments demonstrate that our framework achieves state-of-the-art quality in 3D human reconstruction. The project page is available at https://hygenie1228.github.io/PARTE/.
[440] RemixFusion: Residual-based Mixed Representation for Large-scale Online RGB-D Reconstruction
Yuqing Lan, Chenyang Zhu, Shuaifeng Zhi, Jiazhao Zhang, Zhoufeng Wang, Renjiao Yi, Yijie Wang, Kai Xu
Main category: cs.CV
TL;DR: RemixFusion introduces a residual-based mixed representation combining explicit TSDF grids and implicit neural modules for high-quality, large-scale online RGB-D reconstruction, outperforming existing methods in accuracy and efficiency.
Details
Motivation: Neural implicit representations improve mapping completeness and memory efficiency but lack detail and are time-consuming for large-scale online reconstruction.
Method: Proposes a residual-based map with explicit TSDF grids and implicit neural modules for fine details, plus adaptive gradient amplification and local moving volume for efficient online learning.
Result: Surpasses state-of-the-art methods in mapping and tracking accuracy on large-scale scenes.
Conclusion: RemixFusion enables detail-rich, efficient online reconstruction, advancing neural-based methods for practical applications.
Abstract: The introduction of the neural implicit representation has notably propelled the advancement of online dense reconstruction techniques. Compared to traditional explicit representations, such as TSDF, it improves the mapping completeness and memory efficiency. However, the lack of reconstruction details and the time-consuming learning of neural representations hinder the widespread application of neural-based methods to large-scale online reconstruction. We introduce RemixFusion, a novel residual-based mixed representation for scene reconstruction and camera pose estimation dedicated to high-quality and large-scale online RGB-D reconstruction. In particular, we propose a residual-based map representation comprised of an explicit coarse TSDF grid and an implicit neural module that produces residuals representing fine-grained details to be added to the coarse grid. Such mixed representation allows for detail-rich reconstruction with bounded time and memory budget, contrasting with the overly-smoothed results by the purely implicit representations, thus paving the way for high-quality camera tracking. Furthermore, we extend the residual-based representation to handle multi-frame joint pose optimization via bundle adjustment (BA). In contrast to the existing methods, which optimize poses directly, we opt to optimize pose changes. Combined with a novel technique for adaptive gradient amplification, our method attains better optimization convergence and global optimality. Furthermore, we adopt a local moving volume to factorize the mixed scene representation with a divide-and-conquer design to facilitate efficient online learning in our residual-based framework. Extensive experiments demonstrate that our method surpasses all state-of-the-art ones, including those based either on explicit or implicit representations, in terms of the accuracy of both mapping and tracking on large-scale scenes.
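The residual-based representation can be sketched as an explicit grid queried by trilinear interpolation plus an MLP residual; the resolution and network size below are illustrative, and camera tracking/BA are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualSDF(nn.Module):
    """Explicit coarse TSDF grid + implicit residual MLP (simplified sketch)."""
    def __init__(self, res=64):
        super().__init__()
        self.coarse = nn.Parameter(torch.zeros(1, 1, res, res, res))  # TSDF grid
        self.residual = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1))

    def forward(self, xyz):                         # xyz in [-1, 1]^3, shape (N, 3)
        grid = xyz.view(1, -1, 1, 1, 3)             # 5D coords for 3D grid_sample
        coarse = F.grid_sample(self.coarse, grid, align_corners=True).view(-1, 1)
        return coarse + self.residual(xyz)          # fine detail added to coarse value

sdf_values = ResidualSDF()(torch.rand(1024, 3) * 2 - 1)
```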
[441] Deformable Convolution Module with Globally Learned Relative Offsets for Fundus Vessel Segmentation
Lexuan Zhu, Yuxuan Li, Yuning Ren
Main category: cs.CV
TL;DR: A novel deformable convolutional module using attention and feedforward networks improves global feature capture and achieves state-of-the-art performance in fundus blood vessel segmentation.
Details
Motivation: To address complex shape features in tasks like fundus blood vessel segmentation, which require capturing long-distance global features.
Method: Proposes a plug-and-play deformable convolutional module that learns sub-pixel displacement fields and warps feature maps adaptively across channels.
Result: GDCUnet, a model using this module, achieves state-of-the-art performance on public datasets, with ablation studies confirming its effectiveness.
Conclusion: The module enhances representation and generalization, and is recommended for tasks with complex global self-similar features.
Abstract: Deformable convolution can adaptively change the shape of the convolution kernel by learning offsets to deal with complex shape features. We propose a novel plug-and-play deformable convolutional module that uses attention and feedforward networks to learn offsets, so that the deformable patterns can capture long-distance global features. Compared with previously existing deformable convolutions, the proposed module learns a sub-pixel displacement field and adaptively warps the feature maps across all channels rather than directly deforming the convolution kernel, which is equivalent to a relative deformation of the kernel sampling grids, achieving global feature deformation and the decoupling of kernel size and learning network. Considering that fundus blood vessels have globally self-similar complex edges, we design a deep learning model for fundus blood vessel segmentation, GDCUnet, based on the proposed convolutional module. Empirical evaluations under the same configuration and unified framework show that GDCUnet has achieved state-of-the-art performance on public datasets. Further ablation experiments demonstrated that the proposed deformable convolutional module learns the complex features of fundus blood vessels more effectively, enhancing the model's representation and generalization capabilities. The proposed module shares an interface similar to conventional convolution; we suggest applying it to more machine vision tasks with complex global self-similar features.
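A minimal sketch of the described mechanism: a small network (standing in for the paper's attention and feedforward networks) predicts a per-pixel displacement field, every channel is warped with that field, and a standard convolution follows.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalOffsetWarp(nn.Module):
    """Learn a dense sub-pixel displacement field, warp all channels, then conv."""
    def __init__(self, channels):
        super().__init__()
        self.offset_net = nn.Sequential(               # stand-in for attention + FFN
            nn.Conv2d(channels, channels, 3, padding=1), nn.GELU(),
            nn.Conv2d(channels, 2, 1))                 # (dx, dy) per pixel
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        b, _, h, w = x.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        base = torch.stack((xs, ys), dim=-1).to(x).expand(b, h, w, 2)
        offset = self.offset_net(x).permute(0, 2, 3, 1)  # (b, h, w, 2), grid units
        warped = F.grid_sample(x, base + offset, align_corners=True)
        return self.conv(warped)                         # conv on deformed features

out = GlobalOffsetWarp(16)(torch.randn(2, 16, 32, 32))
```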
[442] Facial Demorphing from a Single Morph Using a Latent Conditional GAN
Nitish Shukla, Arun Ross
Main category: cs.CV
TL;DR: A method for demorphing face images that overcomes limitations of existing techniques by decomposing morphs in latent space, enabling detection of unseen morph techniques and styles.
Details
Motivation: Existing demorphing methods either replicate the morph or require identical morph techniques for training and testing, limiting their effectiveness.
Method: Decomposes morphs in latent space, trained on synthetic faces and tested on real faces with different morph techniques.
Result: Outperforms existing methods significantly, producing high-fidelity demorphed images.
Conclusion: The proposed method effectively demorphs images from unseen techniques and styles, providing robust evidence for morph attacks.
Abstract: A morph is created by combining two (or more) face images from two (or more) identities to create a composite image that is highly similar to all constituent identities, allowing the forged morph to be biometrically associated with more than one individual. Morph Attack Detection (MAD) can be used to detect a morph, but does not reveal the constituent images. Demorphing - the process of deducing the constituent images - is thus vital to provide additional evidence about a morph. Existing demorphing methods suffer from the morph replication problem, where the outputs tend to look very similar to the morph itself, or assume that train and test morphs are generated using the same morph technique. The proposed method overcomes these issues. The method decomposes a morph in latent space, allowing it to demorph images created from unseen morph techniques and face styles. We train our method on morphs created from synthetic faces and test on morphs created from real faces using different morph techniques. Our method outperforms existing methods by a considerable margin and produces high-fidelity demorphed face images.
[443] VGS-ATD: Robust Distributed Learning for Multi-Label Medical Image Classification Under Heterogeneous and Imbalanced Conditions
Zehui Zhao, Laith Alzubaidi, Haider A. Alwzwazy, Jinglan Zhang, Yuantong Gu
Main category: cs.CV
TL;DR: VGS-ATD is a novel distributed learning framework addressing privacy, data heterogeneity, and scalability in medical imaging, outperforming centralized and decentralized methods.
Details
Motivation: Traditional centralized and decentralized learning methods in medical imaging face privacy risks, inefficiencies, and scalability issues, especially with heterogeneous data and system expansion.
Method: Proposes VGS-ATD, a distributed learning framework, tested on 30 datasets and 80 labels across nodes.
Result: Achieved 92.7% accuracy, outperforming centralized (84.9%) and swarm learning (72.99%), with 1% accuracy drop post-expansion and 50% lower computational costs.
Conclusion: VGS-ATD offers superior privacy, efficiency, and scalability, making it a robust solution for dynamic clinical environments.
Abstract: In recent years, advanced deep learning architectures have shown strong performance in medical imaging tasks. However, the traditional centralized learning paradigm poses serious privacy risks as all data is collected and trained on a single server. To mitigate this challenge, decentralized approaches such as federated learning and swarm learning have emerged, allowing model training on local nodes while sharing only model weights. While these methods enhance privacy, they struggle with heterogeneous and imbalanced data and suffer from inefficiencies due to frequent communication and the aggregation of weights. More critically, the dynamic and complex nature of clinical environments demands scalable AI systems capable of continuously learning from diverse modalities and multiple labels. Yet, both centralized and decentralized models are prone to catastrophic forgetting during system expansion, often requiring full model retraining to incorporate new data. To address these limitations, we propose VGS-ATD, a novel distributed learning framework. To validate VGS-ATD, we evaluated it in experiments spanning 30 datasets and 80 independent labels across distributed nodes; VGS-ATD achieved an overall accuracy of 92.7%, outperforming centralized learning (84.9%) and swarm learning (72.99%), while federated learning failed under these conditions due to its high computational resource requirements. VGS-ATD also demonstrated strong scalability, with only a 1% drop in accuracy on existing nodes after expansion, compared to a 20% drop in centralized learning, highlighting its resilience to catastrophic forgetting. Additionally, it reduced computational costs by up to 50% relative to both centralized and swarm learning, confirming its superior efficiency and scalability.
[444] Eyes Will Shut: A Vision-Based Next GPS Location Prediction Model by Reinforcement Learning from Visual Map Feed Back
Ruixing Zhang, Yang Zhang, Tongyu Zhu, Leilei Sun, Weifeng Lv
Main category: cs.CV
TL;DR: The paper introduces a human-like approach for next-location prediction using Vision-Language Models (VLMs), proposing VGLS and VLMLocPredictor, achieving SOTA performance.
Details
Motivation: Existing models lack human-like reasoning over maps for trajectory prediction, prompting the use of VLMs for visual reasoning.
Method: Proposes VGLS to test VLM capabilities, then VLMLocPredictor with SFT tasks and reinforcement learning for self-improvement.
Result: Achieves SOTA performance and superior cross-city generalization on datasets from four cities.
Conclusion: VLMs can effectively mimic human reasoning for next-location prediction, offering a novel and high-performing approach.
Abstract: Next Location Prediction is a fundamental task in the study of human mobility, with wide-ranging applications in transportation planning, urban governance, and epidemic forecasting. In practice, when humans attempt to predict the next location in a trajectory, they often visualize the trajectory on a map and reason based on road connectivity and movement trends. However, the vast majority of existing next-location prediction models do not reason over maps in the way that humans do. Fortunately, the recent development of Vision-Language Models (VLMs) has demonstrated strong capabilities in visual perception and even visual reasoning. This opens up a new possibility: by rendering both the road network and trajectory onto an image and leveraging the reasoning abilities of VLMs, we can enable models to perform trajectory inference in a human-like manner. To explore this idea, we first propose a method called Vision-Guided Location Search (VGLS), which evaluates whether a general-purpose VLM is capable of trajectory-based reasoning without modifying any of its internal parameters. Based on insights from the VGLS results, we further propose our main approach: VLMLocPredictor, which is composed of two stages: In the first stage, we design two Supervised Fine-Tuning (SFT) tasks that help the VLM understand road network and trajectory structures and acquire basic reasoning ability on such visual inputs. In the second stage, we introduce Reinforcement Learning from Visual Map Feedback, enabling the model to self-improve its next-location prediction ability through interaction with the environment. Experiments conducted on datasets from four different cities show that our method achieves state-of-the-art (SOTA) performance and exhibits superior cross-city generalization compared to other LLM-based approaches.
[445] Synthetic-to-Real Camouflaged Object Detection
Zhihao Luo, Luojun Lin, Zheng Lin
Main category: cs.CV
TL;DR: The paper introduces Syn-to-Real Camouflaged Object Detection (S2R-COD) to address limited real-world data by leveraging synthetic datasets and unannotated real images, proposing the CSRDA framework for domain adaptation.
Details
Motivation: Limited datasets for camouflaged object detection (COD) due to high labeling costs, especially for specialized categories, and performance degradation when using synthetic data directly.
Method: Proposes the Cycling Syn-to-Real Domain Adaptation Framework (CSRDA), a student-teacher model using pseudo labeling and consistency regularization to adapt synthetic data to real-world scenarios.
Result: CSRDA effectively bridges the gap between synthetic and real domains, improving model performance in real-world COD tasks with limited data.
Conclusion: The CSRDA framework mitigates data scarcity and annotation challenges in COD, demonstrating practical utility through extensive experiments.
Abstract: Due to the high cost of collection and labeling, there are relatively few datasets for camouflaged object detection (COD). In particular, for certain specialized categories, the available image dataset is insufficiently populated. Synthetic datasets can be utilized to alleviate the problem of limited data to some extent. However, directly training with synthetic datasets compared to real datasets can lead to a degradation in model performance. To tackle this problem, in this work, we investigate a new task, namely Syn-to-Real Camouflaged Object Detection (S2R-COD). In order to improve the model performance in real-world scenarios, a set of annotated synthetic camouflaged images and a limited number of unannotated real images must be utilized. We propose the Cycling Syn-to-Real Domain Adaptation Framework (CSRDA), a method based on the student-teacher model. Specifically, CSRDA propagates class information from the labeled source domain to the unlabeled target domain through pseudo labeling combined with consistency regularization. Considering that narrowing the intra-domain gap can improve the quality of pseudo labeling, CSRDA utilizes a recurrent learning framework to build an evolving real domain for bridging the source and target domain. Extensive experiments demonstrate the effectiveness of our framework, mitigating the problem of limited data and handcrafted annotations in COD. Our code is publicly available at: https://github.com/Muscape/S2R-COD.
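The student-teacher mechanism with pseudo labeling and consistency regularization is a standard pattern; the sketch below illustrates it in PyTorch under stated assumptions (an EMA teacher, weak/strong augmented views of the same unlabeled real image, confidence-thresholded pixel masks). It is not the CSRDA implementation; `student`, `teacher`, `weak`, and `strong` are placeholders.

```python
# Minimal student-teacher pseudo-labeling sketch in PyTorch. `student`
# and `teacher` are assumed identically shaped segmentation networks.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, m=0.99):
    # Teacher weights follow an exponential moving average of the student.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(m).add_(s, alpha=1 - m)

def unsupervised_loss(student, teacher, weak, strong, thresh=0.9):
    with torch.no_grad():
        probs = torch.sigmoid(teacher(weak))            # teacher prediction
        mask = (probs > thresh) | (probs < 1 - thresh)  # keep confident pixels
        pseudo = (probs > 0.5).float()                  # hard pseudo labels
    logits = student(strong)                            # student on strong view
    loss = F.binary_cross_entropy_with_logits(logits, pseudo, reduction="none")
    return (loss * mask).mean()  # consistency enforced on confident pixels only
```

CSRDA additionally cycles through an evolving intermediate real domain to narrow the intra-domain gap; that outer recurrent loop is omitted here.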
cs.AI
[446] MAIA: A Collaborative Medical AI Platform for Integrated Healthcare Innovation
Simone Bendazzoli, Sanna Persson, Mehdi Astaraki, Sebastian Pettersson, Vitali Grozman, Rodrigo Moreno
Main category: cs.AI
TL;DR: MAIA is an open-source platform for integrating AI into clinical workflows, enabling collaboration among clinicians, researchers, and developers.
Details
Motivation: To bridge the gap between AI innovation and practical healthcare applications by fostering interdisciplinary collaboration.
Method: Built on Kubernetes, MAIA provides modular, scalable tools for data management, model development, deployment, and clinical feedback.
Result: MAIA supports real-world medical imaging AI use cases in academic and clinical settings, enhancing reproducibility and transparency.
Conclusion: MAIA accelerates AI research translation into clinical solutions, promoting collaboration, interoperability, and user-centered design.
Abstract: The integration of Artificial Intelligence (AI) into clinical workflows requires robust collaborative platforms that are able to bridge the gap between technical innovation and practical healthcare applications. This paper introduces MAIA (Medical Artificial Intelligence Assistant), an open-source platform designed to facilitate interdisciplinary collaboration among clinicians, researchers, and AI developers. Built on Kubernetes, MAIA offers a modular, scalable environment with integrated tools for data management, model development, annotation, deployment, and clinical feedback. Key features include project isolation, CI/CD automation, and integration with high-performance computing infrastructures and clinical workflows. MAIA supports real-world use cases in medical imaging AI, with deployments in both academic and clinical environments. By promoting collaboration and interoperability, MAIA aims to accelerate the translation of AI research into impactful clinical solutions while fostering reproducibility, transparency, and user-centered design. We showcase the use of MAIA with different projects, both at KTH Royal Institute of Technology and Karolinska University Hospital.
[447] Agent WARPP: Workflow Adherence via Runtime Parallel Personalization
Maria Emilia Mazzolenis, Ruirui Zhang
Main category: cs.AI
TL;DR: WARPP is a training-free framework for LLM-based TOD systems, improving workflow adherence via multi-agent orchestration and runtime personalization.
Details
Motivation: LLMs struggle with long, conditional workflows in TOD systems, especially when involving external tools and user-specific info.
Method: WARPP uses multi-agent orchestration and runtime personalization to dynamically prune conditional branches and tailor execution paths.
Result: WARPP outperforms non-personalized and ReAct baselines, improving parameter fidelity, tool accuracy, and reducing token usage.
Conclusion: WARPP effectively enhances LLM-based TOD systems without additional training, especially for complex workflows.
Abstract: Large language models (LLMs) are increasingly applied in task-oriented dialogue (TOD) systems but often struggle with long, conditional workflows that involve external tool calls and depend on user-specific information. We present Workflow Adherence via Runtime Parallel Personalization, or WARPP, a training-free, modular framework that combines multi-agent orchestration with runtime personalization to improve workflow adherence in LLM-based systems. By dynamically pruning conditional branches based on user attributes, the framework reduces reasoning overhead and narrows tool selection at runtime. WARPP deploys a parallelized architecture where a dedicated Personalizer agent operates alongside modular, domain-specific agents to dynamically tailor execution paths in real time. The framework is evaluated across five representative user intents of varying complexity within three domains: banking, flights, and healthcare. Our evaluation leverages synthetic datasets and LLM-powered simulated users to test scenarios with conditional dependencies. Our results demonstrate that WARPP outperforms both the non-personalized method and the ReAct baseline, achieving increasingly larger gains in parameter fidelity and tool accuracy as intent complexity grows, while also reducing average token usage, without any additional training.
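The core of runtime personalization, pruning conditional workflow branches against known user attributes before the dialogue agent plans, can be illustrated with a small sketch. The branch names, guards, and tool lists below are hypothetical; WARPP's actual orchestration is agent-based rather than a plain filter.

```python
# Illustrative branch pruning: branches whose guards already fail for the
# known user are dropped before planning, shrinking both the reasoning
# space and the candidate tool set.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Branch:
    name: str
    guard: Callable[[dict], bool]  # predicate over user attributes
    tools: list[str]

branches = [
    Branch("domestic_flight_change", lambda u: not u["international"], ["change_booking"]),
    Branch("international_flight_change", lambda u: u["international"], ["change_booking", "check_visa"]),
    Branch("premium_rebooking", lambda u: u["tier"] == "gold", ["priority_rebook"]),
]

user = {"international": False, "tier": "silver"}
active = [b for b in branches if b.guard(user)]
tools = sorted({t for b in active for t in b.tools})
print([b.name for b in active], tools)
# ['domestic_flight_change'] ['change_booking']
```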
[448] Hypergames: Modeling Misaligned Perceptions and Nested Beliefs for Multi-agent Systems
Vince Trencsenyi, Agnieszka Mensfelt, Kostas Stathis
Main category: cs.AI
TL;DR: A review of hypergame theory applications in multi-agent systems (MAS), highlighting its use in modeling subjective perceptions and addressing gaps like limited HNF adoption and lack of formal languages.
Details
Motivation: To address the limitations of classical game theory (e.g., rational agents, complete information) in real-world MAS with uncertainty and misaligned beliefs.
Method: Systematic review of 44 studies, introducing hypergame theory, its extensions (hierarchical hypergames and HNF), and developing agent-compatibility criteria and a classification framework.
Result: Identified trends (e.g., hierarchical models in deception) and gaps (e.g., limited HNF use, no formal hypergame languages).
Conclusion: Provides a roadmap for enhancing strategic modeling in dynamic MAS using hypergame theory, addressing open challenges.
Abstract: Classical game-theoretic models typically assume rational agents, complete information, and common knowledge of payoffs - assumptions that are often violated in real-world MAS characterized by uncertainty, misaligned perceptions, and nested beliefs. To overcome these limitations, researchers have proposed extensions that incorporate models of cognitive constraints, subjective beliefs, and heterogeneous reasoning. Among these, hypergame theory extends the classical paradigm by explicitly modeling agents’ subjective perceptions of the strategic scenario, known as perceptual games, in which agents may hold divergent beliefs about the structure, payoffs, or available actions. We present a systematic review of agent-compatible applications of hypergame theory, examining how its descriptive capabilities have been adapted to dynamic and interactive MAS contexts. We analyze 44 selected studies from cybersecurity, robotics, social simulation, communications, and general game-theoretic modeling. Building on a formal introduction to hypergame theory and its two major extensions - hierarchical hypergames and the hypergame normal form (HNF) - we develop agent-compatibility criteria and an agent-based classification framework to assess integration patterns and practical applicability. Our analysis reveals prevailing trends, including the dominance of hierarchical and graph-based models in deceptive reasoning and the simplification of extensive theoretical frameworks in practical applications. We identify structural gaps, including the limited adoption of HNF-based models, the lack of formal hypergame languages, and unexplored opportunities for modeling human-agent and agent-agent misalignment. By synthesizing trends, challenges, and open research directions, this review provides a new roadmap for applying hypergame theory to enhance the realism and effectiveness of strategic modeling in dynamic multi-agent environments.
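For readers new to the formalism, a first-level hypergame can be illustrated in a few lines: each agent holds its own perceived payoff matrix and optimizes within it. The payoff numbers and the maximin decision rule below are illustrative assumptions, not drawn from any surveyed study.

```python
# Tiny worked hypergame example: the two agents perceive different games,
# so the realized outcome need not be an equilibrium of either perceived game.
import numpy as np

# Row player's perceived payoffs (rows: own actions, cols: opponent's).
game_row = np.array([[3, 0],
                     [5, 1]])
# Column player perceives a different game entirely.
game_col = np.array([[3, 5],
                     [0, 2]])

row_action = int(np.argmax(game_row.min(axis=1)))  # cautious (maximin) reply
col_action = int(np.argmax(game_col.min(axis=1)))
print(row_action, col_action)  # each choice is optimal only in its own perceived game
```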
[449] DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference
Jiawen Qi, Chang Gao, Zhaochun Ren, Qinyu Chen
Main category: cs.AI
TL;DR: DeltaLLM is a training-free framework for efficient LLM inference on edge devices by exploiting temporal sparsity in attention patterns, achieving up to 60% sparsity with minimal accuracy loss.
Details
Motivation: Deploying LLMs on edge devices is challenging due to high computational demands; existing solutions are unsuitable for resource-constrained environments.
Method: DeltaLLM uses a delta matrix construction strategy for temporal sparsity and a hybrid attention mechanism combining full and delta attention.
Result: Achieves 60% sparsity in prefilling and 57% overall with slight accuracy improvements or negligible drops on BitNet and Llama models.
Conclusion: DeltaLLM enables efficient edge deployment of LLMs without fine-tuning, integrating seamlessly with existing pipelines.
Abstract: Deploying Large Language Models (LLMs) on edge devices remains challenging because their computation grows quadratically with sequence length. Existing studies for dynamic attention pruning are designed for hardware with massively parallel computation capabilities, such as GPUs or TPUs, and aim at long context lengths (e.g., 64K), making them unsuitable for edge scenarios. We present DeltaLLM, a training-free framework that exploits temporal sparsity in attention patterns to enable efficient LLM inference across both the prefilling and decoding stages on resource-constrained edge devices. DeltaLLM introduces an accuracy- and memory-aware delta matrix construction strategy that induces temporal sparsity, and a context-aware hybrid attention mechanism that combines full attention in a local context window with delta approximation outside it to increase accuracy. We evaluate our framework on the edge-device-friendly BitNet-b1.58-2B-4T model and the Llama3.2-1B-Instruct model across diverse language tasks. The results show that on BitNet, our framework increases the attention sparsity from 0% to 60% during the prefilling stage with a slight accuracy improvement on the WG task, and from 0% to 57% across both the prefilling and decoding stages while even raising the F1 score from 29.63 to 30.97 on the SQuAD-v2 task. On the Llama model, it can also achieve up to 60% sparsity during the prefilling stage and around 57% across both stages with negligible accuracy drop. These results demonstrate that DeltaLLM offers a promising solution for efficient edge deployment, requiring no fine-tuning and integrating seamlessly with existing inference pipelines.
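The delta idea itself is simple to sketch: between consecutive steps, keep only the matrix entries that changed beyond a threshold and reuse the rest. The NumPy toy below illustrates that mechanism only; DeltaLLM's actual accuracy- and memory-aware construction and hybrid local-window attention are more involved, and the threshold here is an invented value.

```python
# Illustrative delta compression between consecutive attention inputs.
import numpy as np

def delta_compress(prev, curr, eps=1e-2):
    delta = curr - prev
    mask = np.abs(delta) > eps           # entries worth updating
    sparse_delta = np.where(mask, delta, 0.0)
    sparsity = 1.0 - mask.mean()         # fraction of skipped updates
    return sparse_delta, sparsity

prev = np.random.rand(8, 8)
curr = prev + 0.005 * np.random.randn(8, 8)  # small temporal drift
curr[0] += 0.5                               # one row changes a lot
sparse_delta, sparsity = delta_compress(prev, curr)
approx = prev + sparse_delta                 # reconstruction used downstream
print(f"sparsity: {sparsity:.0%}, max error: {np.abs(approx - curr).max():.3f}")
```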
[450] Alignment and Safety in Large Language Models: Safety Mechanisms, Training Paradigms, and Emerging Challenges
Haoran Lu, Luyang Fang, Ruidong Zhang, Xinliang Li, Jiazhang Cai, Huimin Cheng, Lin Tang, Ziyu Liu, Zeliang Sun, Tao Wang, Yingchuan Zhang, Arif Hassan Zidan, Jinwen Xu, Jincheng Yu, Meizhi Yu, Hanqi Jiang, Xilin Gong, Weidi Luo, Bolun Sun, Yongkai Chen, Terry Ma, Shushan Wu, Yifan Zhou, Junhao Chen, Haotian Xiang, Jing Zhang, Afrar Jahin, Wei Ruan, Ke Deng, Yi Pan, Peilong Wang, Jiahui Li, Zhengliang Liu, Lu Zhang, Lin Zhao, Wei Liu, Dajiang Zhu, Xin Xing, Fei Dou, Wei Zhang, Chao Huang, Rongjie Liu, Mengrui Zhang, Yiwen Liu, Xiaoxiao Sun, Qin Lu, Zhen Xiang, Wenxuan Zhong, Tianming Liu, Ping Ma
Main category: cs.AI
TL;DR: This survey explores alignment techniques for large language models (LLMs), analyzing methods, trade-offs, and state-of-the-art approaches like DPO and Constitutional AI. It highlights evaluation challenges and outlines open problems in LLM alignment.
Details
Motivation: Ensuring LLMs align with human values is critical due to their societal impact. This survey aims to provide a comprehensive overview of alignment techniques and challenges.
Method: The survey reviews alignment methods, including supervised fine-tuning and preference-based approaches, and analyzes state-of-the-art techniques like DPO and Constitutional AI.
Result: Preference-based methods offer nuanced alignment, but challenges like reward misspecification and scalable oversight persist. Leading AI labs adopt varied strategies.
Conclusion: Open problems in oversight, robustness, and continuous alignment remain. The survey guides researchers and practitioners in navigating LLM alignment.
Abstract: Due to the remarkable capabilities and growing impact of large language models (LLMs), they have been deeply integrated into many aspects of society. Thus, ensuring their alignment with human values and intentions has emerged as a critical challenge. This survey provides a comprehensive overview of practical alignment techniques, training protocols, and empirical findings in LLM alignment. We analyze the development of alignment methods across diverse paradigms, characterizing the fundamental trade-offs between core alignment objectives. Our analysis shows that while supervised fine-tuning enables basic instruction-following, preference-based methods offer more flexibility for aligning with nuanced human intent. We discuss state-of-the-art techniques, including Direct Preference Optimization (DPO), Constitutional AI, brain-inspired methods, and alignment uncertainty quantification (AUQ), highlighting their approaches to balancing quality and efficiency. We review existing evaluation frameworks and benchmarking datasets, emphasizing limitations such as reward misspecification, distributional robustness, and scalable oversight. We summarize strategies adopted by leading AI labs to illustrate the current state of practice. We conclude by outlining open problems in oversight, value pluralism, robustness, and continuous alignment. This survey aims to inform both researchers and practitioners navigating the evolving landscape of LLM alignment.
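Of the techniques the survey covers, DPO has a particularly compact form; a minimal sketch follows. It assumes precomputed sequence log-probabilities under the trained policy and a frozen reference model, and uses the standard published objective rather than any survey-specific variant.

```python
# Minimal DPO loss sketch in PyTorch. Inputs are summed log-probabilities
# of the chosen (y_w) and rejected (y_l) responses under the policy being
# trained and under the frozen reference model; beta sets the KL anchor strength.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    ratio_chosen = logp_chosen - ref_chosen        # log pi/pi_ref on y_w
    ratio_rejected = logp_rejected - ref_rejected  # log pi/pi_ref on y_l
    margin = beta * (ratio_chosen - ratio_rejected)
    return -F.logsigmoid(margin).mean()            # prefer y_w over y_l
```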
[451] The wall confronting large language models
Peter V. Coveney, Sauro Succi
Main category: cs.AI
TL;DR: Scaling laws severely limit how much large language models (LLMs) can improve the uncertainty of their predictions, making scientific-grade reliability unattainable. The learning mechanism that produces non-Gaussian outputs may lead to error pileup and degenerative AI behavior. Avoiding this requires prioritizing insight and structural problem understanding.
Details
Motivation: To highlight the limitations of LLMs in achieving reliable predictions due to scaling laws and inherent learning mechanisms, and to propose solutions to avoid degenerative AI pathways.
Method: Analysis of scaling laws and the learning mechanisms in LLMs, focusing on their impact on prediction uncertainty and error accumulation.
Result: LLMs’ reliability for scientific standards is hindered by scaling laws and non-Gaussian output mechanisms, leading to potential degenerative AI behavior.
Conclusion: Avoiding degenerative AI in LLMs requires emphasizing insight and structural problem understanding over mere data scaling.
Abstract: We show that the scaling laws which determine the performance of large language models (LLMs) severely limit their ability to improve the uncertainty of their predictions. As a result, raising their reliability to meet the standards of scientific inquiry is intractable by any reasonable measure. We argue that the very mechanism which fuels much of the learning power of LLMs, namely the ability to generate non-Gaussian output distributions from Gaussian input ones, might well be at the roots of their propensity to produce error pileup, ensuing information catastrophes and degenerative AI behaviour. This tension between learning and accuracy is a likely candidate mechanism underlying the observed low values of the scaling components. It is substantially compounded by the deluge of spurious correlations pointed out by Calude and Longo which rapidly increase in any data set merely as a function of its size, regardless of its nature. The fact that a degenerative AI pathway is a very probable feature of the LLM landscape does not mean that it must inevitably arise in all future AI research. Its avoidance, which we also discuss in this paper, necessitates putting a much higher premium on insight and understanding of the structural characteristics of the problems being investigated.
[452] Minding Motivation: The Effect of Intrinsic Motivation on Agent Behaviors
Leonardo Villalobos-Arias, Grant Forbes, Jianxun Wang, David L Roberts, Arnav Jhala
Main category: cs.AI
TL;DR: The paper examines how Intrinsic Motivation (IM) affects RL agents in games, revealing that IM alters behavior and causes reward hacking, while GRM can mitigate some of these issues.
Details
Motivation: Games pose challenges for RL due to sparse rewards. IM helps but introduces reward hacking, a poorly understood issue. This study evaluates IM's behavioral impact and tests GRM as a solution.
Method: Empirical evaluation of three IM techniques in the MiniGrid environment, comparing them with GRM to assess behavior changes and reward hacking.
Result: IM increases initial rewards and alters agent behavior. GRM mitigates reward hacking in certain scenarios.
Conclusion: IM significantly changes RL agent behavior, and GRM shows promise in addressing reward hacking, though further research is needed.
Abstract: Games are challenging for Reinforcement Learning (RL) agents due to their reward sparsity, as rewards are only obtainable after long sequences of deliberate actions. Intrinsic Motivation (IM) methods, which introduce exploration rewards, are an effective solution to reward sparsity. However, IM also causes an issue known as 'reward hacking', where the agent optimizes for the new reward at the expense of properly playing the game. The larger problem is that the extent of reward hacking is largely unknown; there is no answer to whether, and to what extent, IM rewards change the behavior of RL agents. This study takes a first step by empirically evaluating the behavioral impact of three IM techniques in the MiniGrid game-like environment. We compare these IM models with Generalized Reward Matching (GRM), a method that can be used with any intrinsic reward function to guarantee optimality. Our results suggest that IM causes noticeable change, both by increasing the initial rewards and by altering the way the agent plays, and that GRM mitigated reward hacking in some scenarios.
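As one concrete instance of the kind of IM method evaluated here, a count-based exploration bonus can be sketched in a few lines. The scale and decay schedule below are illustrative assumptions, and the paper's three techniques are not necessarily count-based.

```python
# Count-based intrinsic reward: a novelty bonus that decays as a state is
# revisited, added to the sparse extrinsic reward.
from collections import defaultdict

class CountBonus:
    def __init__(self, scale=0.1):
        self.counts = defaultdict(int)
        self.scale = scale

    def reward(self, state, extrinsic):
        self.counts[state] += 1
        bonus = self.scale / self.counts[state] ** 0.5  # decaying novelty bonus
        return extrinsic + bonus                        # shaped reward

shaper = CountBonus()
print(shaper.reward((1, 1), 0.0))  # 0.1   first visit
print(shaper.reward((1, 1), 0.0))  # ~0.071 bonus decays on revisits
```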
[453] HypKG: Hypergraph-based Knowledge Graph Contextualization for Precision Healthcare
Yuzhang Xie, Xu Han, Ran Xu, Xiao Hu, Jiaying Lu, Carl Yang
Main category: cs.AI
TL;DR: HypKG integrates EHR data with KGs using hypergraph models to improve healthcare predictions by contextualizing knowledge.
Details
Motivation: General KGs lack patient-specific contexts, while EHRs provide rich personal data. Combining them can enhance precision healthcare.
Method: HypKG uses entity-linking to connect KGs with EHRs, then employs hypergraph transformers to learn contextualized representations.
Result: HypKG significantly improves healthcare predictions and enhances KG quality by adjusting entity representations.
Conclusion: HypKG effectively bridges KGs and EHRs, improving both prediction accuracy and knowledge utility in healthcare.
Abstract: Knowledge graphs (KGs) are important products of the semantic web, which are widely used in various application domains. Healthcare is one such domain where KGs are intensively used, due to the high requirement for knowledge accuracy and the interconnected nature of healthcare data. However, KGs storing general factual information often lack the ability to account for important contexts of the knowledge, such as the status of specific patients, which are crucial in precision healthcare. Meanwhile, electronic health records (EHRs) provide rich personal data, including various diagnoses and medications, offering natural contexts for general KGs. In this paper, we propose HypKG, a framework that integrates patient information from EHRs into KGs to generate contextualized knowledge representations for accurate healthcare predictions. Using advanced entity-linking techniques, we connect relevant knowledge from general KGs with patient information from EHRs, and then utilize a hypergraph model to “contextualize” the knowledge with the patient information. Finally, we employ hypergraph transformers guided by downstream prediction tasks to jointly learn proper contextualized representations for both KGs and patients, fully leveraging existing knowledge in KGs and patient contexts in EHRs. In experiments using a large biomedical KG and two real-world EHR datasets, HypKG demonstrates significant improvements in healthcare prediction tasks across multiple evaluation metrics. Additionally, by integrating external contexts, HypKG can learn to adjust the representations of entities and relations in KG, potentially improving the quality and real-world utility of knowledge.
[454] A Multi-Agent System for Information Extraction from the Chemical Literature
Yufan Chen, Ching Ting Leung, Bowen Yu, Jianwei Sun, Yong Huang, Linyan Li, Hao Chen, Hanyu Gao
Main category: cs.AI
TL;DR: A multimodal large language model (MLLM)-based multi-agent system was developed for automatic chemical information extraction, achieving an F1 score of 80.8%, significantly outperforming previous methods.
Details
Motivation: High-quality chemical databases are crucial for AI-driven research, but current extraction methods are limited by the multimodality and variability of chemical information in literature.
Method: The system leverages MLLM’s reasoning to understand complex chemical graphics, decomposes tasks into sub-tasks, and coordinates specialized agents to solve them.
Result: Achieved an F1 score of 80.8% on a benchmark dataset, surpassing the previous state-of-the-art (35.6%), with consistent improvements in sub-tasks like molecular image recognition and reaction parsing.
Conclusion: This work advances automated chemical information extraction, supporting AI-driven chemical research.
Abstract: To fully expedite AI-powered chemical research, high-quality chemical databases are the cornerstone. Automatic extraction of chemical information from the literature is essential for constructing reaction databases, but it is currently limited by the multimodality and style variability of chemical information. In this work, we developed a multimodal large language model (MLLM)-based multi-agent system for automatic chemical information extraction. We used the MLLM’s strong reasoning capability to understand the structure of complex chemical graphics, decompose the extraction task into sub-tasks and coordinate a set of specialized agents to solve them. Our system achieved an F1 score of 80.8% on a benchmark dataset of complex chemical reaction graphics from the literature, surpassing the previous state-of-the-art model (F1 score: 35.6%) by a significant margin. Additionally, it demonstrated consistent improvements in key sub-tasks, including molecular image recognition, reaction image parsing, named entity recognition and text-based reaction extraction. This work is a critical step toward automated chemical information extraction into structured datasets, which will be a strong promoter of AI-driven chemical research.
[455] Integrating Activity Predictions in Knowledge Graphs
Alec Scully, Cameron Stockton, Forrest Hare
Main category: cs.AI
TL;DR: The paper explores using ontology-structured knowledge graphs with BFO and CCO to predict future events, like fishing vessel movements, via Markov chains. It critiques current ontological probability models and proposes treating probabilities as process profiles.
Details
Motivation: To enhance predictive analytics by organizing and retrieving data from knowledge graphs using ontologies, improving future event predictions.
Method: Leverages BFO and CCO for semantic structuring, uses Markov chains for predictions, and introduces ‘spatiotemporal instant’ for semantics. Critiques and revises ontological probability models.
Result: Demonstrates successful integration of Markov chain predictions into knowledge graphs for further analysis.
Conclusion: Ontology-structured knowledge graphs and revised probability models improve predictive analytics, enabling better decision-making.
Abstract: We argue that ontology-structured knowledge graphs can play a crucial role in generating predictions about future events. By leveraging the semantic framework provided by Basic Formal Ontology (BFO) and Common Core Ontologies (CCO), we demonstrate how data such as the movements of a fishing vessel can be organized in and retrieved from a knowledge graph. These query results are then used to create Markov chain models, allowing us to predict future states based on the vessel’s history. To fully support this process, we introduce the term “spatiotemporal instant” to complete the necessary structural semantics. Additionally, we critique the prevailing ontological model of probability, which conflates probability with likelihood and relies on the problematic concept of modal measurements: measurements of future entities. We propose an alternative view, where probabilities are treated as being about process profiles, which better captures the dynamics of real-world phenomena. Finally, we demonstrate how our Markov chain based probability calculations can be seamlessly integrated back into the knowledge graph, enabling further analysis and decision-making. Keywords: predictive analytics, ontology, Markov chains, probability, Basic Formal Ontology (BFO), knowledge graphs, SPARQL.
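The Markov chain step is easy to make concrete: estimate transition frequencies from a queried state history and normalize. The discretized vessel states below are invented for illustration, and the sketch omits the knowledge graph and SPARQL retrieval the paper builds around it.

```python
# First-order Markov transition estimates from a discretized location history.
from collections import Counter, defaultdict

history = ["port", "coastal", "fishing_ground", "fishing_ground",
           "coastal", "port", "coastal", "fishing_ground"]

transitions = defaultdict(Counter)
for a, b in zip(history, history[1:]):   # count observed state pairs
    transitions[a][b] += 1

def next_state_probs(state):
    counts = transitions[state]
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

print(next_state_probs("coastal"))
# {'fishing_ground': 0.666..., 'port': 0.333...}
```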
[456] Core Safety Values for Provably Corrigible Agents
Aran Nayebi
Main category: cs.AI
TL;DR: A framework for corrigibility in AI with provable guarantees, using five utility heads combined lexicographically, ensuring safety and human benefit even with learned errors.
Details
Motivation: To address the challenge of ensuring AI systems remain corrigible (safe and controllable) in complex, partially observed environments, especially when incentives conflict.
Method: Introduces five structurally separate utility heads (deference, switch-access preservation, truthfulness, low-impact behavior, and bounded task reward) combined lexicographically with strict weight gaps. Theorems prove corrigibility in single-round and multi-step scenarios.
Result: Exact corrigibility in the off-switch game and bounded safety violations in multi-step settings, even with learned errors. Decidable safety certification in finite-horizon scenarios.
Conclusion: The framework provides clear implementation guidance for corrigible AI, shifting risk to evaluation quality rather than hidden incentives, applicable to current and future autonomous systems.
Abstract: We introduce the first implementable framework for corrigibility, with provable guarantees in multi-step, partially observed environments. Our framework replaces a single opaque reward with five structurally separate utility heads – deference, switch-access preservation, truthfulness, low-impact behavior via a belief-based extension of Attainable Utility Preservation, and bounded task reward – combined lexicographically by strict weight gaps. Theorem 1 proves exact single-round corrigibility in the partially observable off-switch game; Theorem 3 extends the guarantee to multi-step, self-spawning agents, showing that even if each head is learned to mean-squared error $\varepsilon$ and the planner is $\varepsilon$-sub-optimal, the probability of violating any safety property is bounded while still ensuring net human benefit. In contrast to Constitutional AI or RLHF/RLAIF, which merge all norms into one learned scalar, our separation makes obedience and impact-limits dominate even when incentives conflict. For open-ended settings where adversaries can modify the agent, we prove that deciding whether an arbitrary post-hack agent will ever violate corrigibility is undecidable by reduction to the halting problem, then carve out a finite-horizon “decidable island” where safety can be certified in randomized polynomial time and verified with privacy-preserving, constant-round zero-knowledge proofs. Consequently, the remaining challenge is the ordinary ML task of data coverage and generalization: reward-hacking risk is pushed into evaluation quality rather than hidden incentive leak-through, giving clearer implementation guidance for today’s LLM assistants and future autonomous systems.
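The lexicographic combination can be sketched directly: a higher-priority head decides the comparison whenever it differs by more than its gap. Head names follow the abstract; the gap value, scores, and tie-handling below are illustrative assumptions rather than the paper's construction (which encodes the ordering via strict weight gaps in a single scalar).

```python
# Illustrative lexicographic action comparison over separate utility heads.
HEADS = ["deference", "switch_access", "truthfulness", "low_impact", "task_reward"]
GAPS = {h: 1e-3 for h in HEADS}  # tolerance separating priority levels (assumed)

def lex_better(u_a, u_b):
    """True if action a strictly dominates action b lexicographically."""
    for h in HEADS:
        if abs(u_a[h] - u_b[h]) > GAPS[h]:
            return u_a[h] > u_b[h]  # decided at this priority level
    return False                    # effectively tied on every head

comply = {"deference": 1.0, "switch_access": 1.0, "truthfulness": 1.0,
          "low_impact": 0.9, "task_reward": 0.2}
resist = {"deference": 0.0, "switch_access": 1.0, "truthfulness": 1.0,
          "low_impact": 0.9, "task_reward": 0.9}
print(lex_better(comply, resist))  # True: deference outranks task reward
```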
[457] Can LLMs Solve ASP Problems? Insights from a Benchmarking Study (Extended Version)
Lin Ren, Guohui Xiao, Guilin Qi, Yishuai Geng, Haohan Xue
Main category: cs.AI
TL;DR: ASPBench introduces a benchmark for evaluating LLMs in ASP tasks, revealing their limitations in core ASP solving.
Details
Motivation: Current evaluations of LLMs in ASP are limited, lacking support for complex ASP features and dedicated benchmarks.
Method: ASPBench includes three tasks: ASP entailment, answer set verification, and answer set computation, tested on 14 LLMs.
Result: LLMs perform well on simpler tasks (entailment, verification) but struggle with answer set computation.
Conclusion: The study highlights the need for better integration of symbolic reasoning in LLMs for ASP solving.
Abstract: Answer Set Programming (ASP) is a powerful paradigm for non-monotonic reasoning. Recently, large language models (LLMs) have demonstrated promising capabilities in logical reasoning. Despite this potential, current evaluations of LLM capabilities in ASP are often limited. Existing works normally employ overly simplified ASP programs that do not support negation, disjunction, or multiple answer sets. Furthermore, there is a lack of benchmarks that introduce tasks specifically designed for ASP solving. To bridge this gap, we introduce ASPBench, a comprehensive ASP benchmark, including three ASP-specific tasks: ASP entailment, answer set verification, and answer set computation. Our extensive evaluations on ASPBench reveal that while 14 state-of-the-art LLMs, including deepseek-r1, o4-mini, and gemini-2.5-flash-thinking, perform relatively well on the first two simpler tasks, they struggle with answer set computation, which is the core of ASP solving. These findings offer insights into the current limitations of LLMs in ASP solving. This highlights the need for new approaches that integrate symbolic reasoning capabilities more effectively. The code and dataset are available at https://github.com/HomuraT/ASPBench.
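To see what the hardest task, answer set computation, asks of a model, consider a two-line program with negation-as-failure. The sketch assumes the clingo Python package is installed; it is not part of ASPBench.

```python
# A tiny answer set computation example: this program has exactly two
# answer sets, {a} and {b}, which an ASP solver enumerates directly.
import clingo

program = """
a :- not b.
b :- not a.
"""

ctl = clingo.Control(["0"])  # "0" = enumerate all answer sets
ctl.add("base", [], program)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print("answer set:", m))
# answer set: a
# answer set: b
```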
[458] GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis
Haoyang Liu, Yijiang Li, Haohan Wang
Main category: cs.AI
TL;DR: GenoMAS introduces a team of LLM-based scientists for gene expression analysis, combining structured workflows and autonomous agents to improve precision and adaptability.
Details
Motivation: Current automation methods for gene expression analysis are either too rigid or lack precision, limiting their effectiveness in scientific research.
Method: GenoMAS uses six specialized LLM agents with typed message-passing protocols and a guided-planning framework to handle genomic data.
Result: Achieves 89.13% Composite Similarity Correlation for preprocessing and 60.48% F$_1$ for gene identification, outperforming prior methods.
Conclusion: GenoMAS provides a robust, adaptable solution for gene expression analysis, validated by literature and performance metrics.
Abstract: Gene expression analysis holds the key to many biomedical discoveries, yet extracting insights from raw transcriptomic data remains formidable due to the complexity of multiple large, semi-structured files and the need for extensive domain expertise. Current automation approaches are often limited by either inflexible workflows that break down in edge cases or by fully autonomous agents that lack the necessary precision for rigorous scientific inquiry. GenoMAS charts a different course by presenting a team of LLM-based scientists that integrates the reliability of structured workflows with the adaptability of autonomous agents. GenoMAS orchestrates six specialized LLM agents through typed message-passing protocols, each contributing complementary strengths to a shared analytic canvas. At the heart of GenoMAS lies a guided-planning framework: programming agents unfold high-level task guidelines into Action Units and, at each juncture, elect to advance, revise, bypass, or backtrack, thereby maintaining logical coherence while bending gracefully to the idiosyncrasies of genomic data. On the GenoTEX benchmark, GenoMAS reaches a Composite Similarity Correlation of 89.13% for data preprocessing and an F$_1$ of 60.48% for gene identification, surpassing the best prior art by 10.61% and 16.85% respectively. Beyond metrics, GenoMAS surfaces biologically plausible gene-phenotype associations corroborated by the literature, all while adjusting for latent confounders. Code is available at https://github.com/Liu-Hy/GenoMAS.
[459] Reinforcement Learning for Multi-Objective Multi-Echelon Supply Chain Optimisation
Rifny Rachman, Josh Tingey, Richard Allmendinger, Pradyumn Shukla, Wei Pan
Main category: cs.AI
TL;DR: A multi-objective supply chain optimization model using reinforcement learning outperforms benchmark methods in balancing trade-offs and robustness.
Details
Motivation: To address the challenge of optimizing supply chains with competing economic, environmental, and social objectives in non-stationary markets.
Method: Develops a Markov decision process-based model evaluated with multi-objective reinforcement learning, compared to weighted single-objective RL and MOEA.
Result: The primary method achieves better trade-offs, 75% higher hypervolume than MOEA, and denser solutions than single-objective RL, with stable production.
Conclusion: The proposed approach effectively balances competing objectives and enhances robustness in complex supply chain scenarios.
Abstract: This study develops a generalised multi-objective, multi-echelon supply chain optimisation model with non-stationary markets based on a Markov decision process, incorporating economic, environmental, and social considerations. The model is evaluated using a multi-objective reinforcement learning (RL) method, benchmarked against a single-objective RL algorithm modified with a weighted sum over predefined weights, and against a multi-objective evolutionary algorithm (MOEA)-based approach. We conduct experiments on varying network complexities, mimicking typical real-world challenges using a customisable simulator. The model determines production and delivery quantities across supply chain routes to achieve near-optimal trade-offs between competing objectives, approximating Pareto front sets. The results demonstrate that the primary approach provides the most balanced trade-off between optimality, diversity, and density, further enhanced with a shared experience buffer that allows knowledge transfer among policies. In complex settings, it achieves up to 75% higher hypervolume than the MOEA-based method and generates solutions that are approximately eleven times denser than those produced by the modified single-objective RL method, signifying better robustness. Moreover, it ensures stable production and inventory levels while minimising demand loss.
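The weighted-sum baseline reduces the vector reward to a scalar before applying ordinary single-objective RL; a minimal sketch follows, with weights and reward values invented for illustration.

```python
# Weighted-sum scalarization of a vector reward over economic,
# environmental, and social objectives.
import numpy as np

weights = np.array([0.5, 0.3, 0.2])    # predefined, fixed per run (assumed values)

def scalarize(reward_vec):
    return float(weights @ reward_vec)  # single scalar for the RL update

r = np.array([120.0, -15.0, 4.0])      # profit, emissions penalty, service level
print(scalarize(r))                    # ~56.3
```

Fixed weights commit to one trade-off per run, which is why the multi-objective method's denser approximation of the Pareto front is the more informative comparison.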
[460] Causality-aligned Prompt Learning via Diffusion-based Counterfactual Generation
Xinshu Li, Ruoyu Wang, Erdun Gao, Mingming Gong, Lina Yao
Main category: cs.AI
TL;DR: The paper introduces DiCap, a diffusion-based counterfactual prompt learning framework, to address limitations in existing prompt learning methods by ensuring causally invariant prompts and robust feature generalization.
Details
Motivation: Existing prompt learning methods lack theoretical grounding, leading to difficulties in achieving causally invariant prompts and robust feature generalization across categories.
Method: DiCap uses a diffusion process to sample gradients from causal model distributions, generating counterfactuals that meet minimal sufficiency criteria. It employs contrastive learning to refine prompts aligned with causal features.
Result: DiCap excels in tasks like image classification, image-text retrieval, and visual question answering, especially in unseen categories.
Conclusion: The theoretically grounded DiCap framework outperforms existing methods by ensuring causally invariant prompts and robust generalization.
Abstract: Prompt learning has garnered attention for its efficiency over traditional model training and fine-tuning. However, existing methods, constrained by inadequate theoretical foundations, encounter difficulties in achieving causally invariant prompts, ultimately falling short of capturing robust features that generalize effectively across categories. To address these challenges, we introduce the DiCap model, a theoretically grounded Diffusion-based Counterfactual prompt learning framework, which leverages a diffusion process to iteratively sample gradients from the marginal and conditional distributions of the causal model, guiding the generation of counterfactuals that satisfy the minimal sufficiency criterion. Grounded in rigorous theoretical derivations, this approach guarantees the identifiability of counterfactual outcomes while imposing strict bounds on estimation errors. We further employ a contrastive learning framework that leverages the generated counterfactuals, thereby enabling the refined extraction of prompts that are precisely aligned with the causal features of the data. Extensive experimental results demonstrate that our method performs excellently across tasks such as image classification, image-text retrieval, and visual question answering, with particularly strong advantages in unseen categories.
[461] What Does ‘Human-Centred AI’ Mean?
Olivia Guest
Main category: cs.AI
TL;DR: The paper argues that AI must be understood as a relationship between technology and human cognition, analyzing its impact through displacement, enhancement, or replacement of human cognitive labor. It critiques obfuscation of cognition in AI, advocating for clearer human-centered design.
Details
Motivation: To clarify the relationship between AI and human cognition, emphasizing the need for human-centered AI by examining how technology interacts with and affects human cognitive labor.
Method: Uses examples (e.g., abacus vs. mental arithmetic, alarm clock vs. knocker-upper) and novel definitions to analyze sociotechnical relationships, categorizing them into displacement, enhancement, or replacement of human cognitive labor.
Result: Highlights that obfuscation of cognition in AI leads to distortion, slows critical engagement, and limits human-centered engineering.
Conclusion: To truly center humans in AI, we must acknowledge and address the human cognitive role in AI systems, avoiding obfuscation.
Abstract: While it seems sensible that human-centred artificial intelligence (AI) means centring “human behaviour and experience,” it cannot be any other way. AI, I argue, is usefully seen as a relationship between technology and humans where it appears that artifacts can perform, to a greater or lesser extent, human cognitive labour. This is evinced using examples that juxtapose technology with cognition, inter alia: abacus versus mental arithmetic; alarm clock versus knocker-upper; camera versus vision; and sweatshop versus tailor. Using novel definitions and analyses, sociotechnical relationships can be analysed into varying types of: displacement (harmful), enhancement (beneficial), and/or replacement (neutral) of human cognitive labour. Ultimately, all AI implicates human cognition; no matter what. Obfuscation of cognition in the AI context – from clocks to artificial neural networks – results in distortion, in slowing critical engagement, perverting cognitive science, and indeed in limiting our ability to truly centre humans and humanity in the engineering of AI systems. To even begin to de-fetishise AI, we must look the human-in-the-loop in the eyes.
[462] Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
Kuofeng Gao, Shu-Tao Xia, Ke Xu, Philip Torr, Jindong Gu
Main category: cs.AI
TL;DR: ADU-Bench is a new benchmark for evaluating Large Audio-Language Models (LALMs) in open-ended audio dialogues, covering diverse scenarios, skills, languages, and ambiguity handling. It reveals current LALMs’ limitations in math, multilingual understanding, and ambiguity resolution.
Details
Motivation: The lack of a comprehensive benchmark for evaluating LALMs in open-ended audio dialogues motivated the creation of ADU-Bench.
Method: ADU-Bench includes 4 datasets assessing 3 scenarios, 12 skills, 9 languages, and 4 ambiguity categories, with 20,000+ dialogues.
Result: Experiments on 16 LALMs show struggles with math, multilingual understanding, roleplay, and ambiguity handling (e.g., intonations, pauses).
Conclusion: ADU-Bench fills a critical gap in evaluating LALMs, highlighting areas for improvement in audio dialogue understanding.
Abstract: Large Audio-Language Models (LALMs), such as GPT-4o, have recently unlocked audio dialogue capabilities, enabling direct spoken exchanges with humans. The potential of LALMs broadens their applicability across a wide range of practical scenarios supported by audio dialogues. However, given these advancements, a comprehensive benchmark to evaluate the performance of LALMs in open-ended audio dialogue understanding is currently absent. To address this gap, we propose an Audio Dialogue Understanding Benchmark (ADU-Bench), which consists of 4 benchmark datasets. They assess the open-ended audio dialogue ability for LALMs in 3 general scenarios, 12 skills, 9 multilingual languages, and 4 categories of ambiguity handling. Notably, we are the first to propose the evaluation of ambiguity handling in audio dialogues that express different intentions beyond the same literal meaning of sentences, e.g., “Really!?” with different intonations. In summary, ADU-Bench includes over 20,000 open-ended audio dialogues for the assessment of LALMs. Through extensive experiments on 16 LALMs, our analysis reveals that existing LALMs struggle with mathematical symbols and formulas, understanding human behavior such as roleplay, comprehending multiple languages, and handling audio dialogue ambiguities from different phonetic elements, such as intonations, pause positions, and homophones. The benchmark is available at https://adu-bench.github.io/.
[463] Leveraging Fine-Tuned Large Language Models for Interpretable Pancreatic Cystic Lesion Feature Extraction and Risk Categorization
Ebrahim Rasromani, Stella K. Kang, Yanqi Xu, Beisong Liu, Garvit Luhadia, Wan Fung Chui, Felicia L. Pasadyn, Yu Chih Hung, Julie Y. An, Edwin Mathieu, Zehui Gu, Carlos Fernandez-Granda, Ammar A. Javed, Greg D. Sacks, Tamas Gonda, Chenchan Huang, Yiqiu Shen
Main category: cs.AI
TL;DR: Fine-tuned open-source LLMs with chain-of-thought supervision achieve high accuracy in extracting PCL features and risk categorization, matching GPT-4o performance.
Details
Motivation: Manual extraction of PCL features is labor-intensive, hindering large-scale research. Automating this process with LLMs can advance PCL studies.
Method: Fine-tuned LLaMA and DeepSeek models using QLoRA on GPT-4o-generated chain-of-thought data, evaluated on 285 human-annotated reports.
Result: Improved feature extraction accuracy (97-98%) and risk categorization (F1 scores 0.94-0.97), matching GPT-4o. High radiologist-model agreement (Fleiss’ Kappa ~0.89).
Conclusion: Fine-tuned open-source LLMs with CoT supervision enable efficient, accurate PCL phenotyping, comparable to GPT-4o.
Abstract: Background: Manual extraction of pancreatic cystic lesion (PCL) features from radiology reports is labor-intensive, limiting large-scale studies needed to advance PCL research. Purpose: To develop and evaluate large language models (LLMs) that automatically extract PCL features from MRI/CT reports and assign risk categories based on guidelines. Materials and Methods: We curated a training dataset of 6,000 abdominal MRI/CT reports (2005-2024) from 5,134 patients that described PCLs. Labels were generated by GPT-4o using chain-of-thought (CoT) prompting to extract PCL and main pancreatic duct features. Two open-source LLMs were fine-tuned using QLoRA on GPT-4o-generated CoT data. Features were mapped to risk categories per institutional guideline based on the 2017 ACR White Paper. Evaluation was performed on 285 held-out human-annotated reports. Model outputs for 100 cases were independently reviewed by three radiologists. Feature extraction was evaluated using exact match accuracy, risk categorization with macro-averaged F1 score, and radiologist-model agreement with Fleiss’ Kappa. Results: CoT fine-tuning improved feature extraction accuracy for LLaMA (80% to 97%) and DeepSeek (79% to 98%), matching GPT-4o (97%). Risk categorization F1 scores also improved (LLaMA: 0.95; DeepSeek: 0.94), closely matching GPT-4o (0.97), with no statistically significant differences. Radiologist inter-reader agreement was high (Fleiss’ Kappa = 0.888) and showed no statistically significant difference with the addition of DeepSeek-FT-CoT (Fleiss’ Kappa = 0.893) or GPT-CoT (Fleiss’ Kappa = 0.897), indicating that both models achieved agreement levels on par with radiologists. Conclusion: Fine-tuned open-source LLMs with CoT supervision enable accurate, interpretable, and efficient phenotyping for large-scale PCL research, achieving performance comparable to GPT-4o.
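For orientation, a QLoRA recipe of the kind named in the abstract typically looks like the following Hugging Face sketch: a 4-bit quantized base model with low-rank adapters on the attention projections. The checkpoint name and every hyperparameter here are placeholders, not the study's configuration.

```python
# Hedged QLoRA setup sketch with Hugging Face tooling (transformers,
# peft, bitsandbytes); values are illustrative only.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                  # 4-bit base weights (the "Q" in QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", # placeholder checkpoint
    quantization_config=bnb,
    device_map="auto",
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # placeholder hyperparameters
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)  # only the adapter weights train
model.print_trainable_parameters()
# Training then proceeds on chain-of-thought targets (GPT-4o-generated in the study).
```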
[464] Digital Twin Channel-Enabled Online Resource Allocation for 6G: Principle, Architecture and Application
Tongjie Li, Jianhua Zhang, Li Yu, Yuxiang Zhang, Yunlong Cai, Fan Xu, Guangyi Liu
Main category: cs.AI
TL;DR: A DTC-enabled framework for 6G networks improves resource allocation by predicting CSI using environmental sensing, outperforming pilot-based methods by 11.5% in throughput.
Details
Motivation: Addressing the limitations of conventional methods in dynamic environments and reducing excessive pilot overhead for real-time CSI in 6G networks.
Method: Uses digital twin channel (DTC) to predict CSI via environmental sensing, combined with lightweight game-theoretic algorithms for online resource allocation.
Result: Achieves up to 11.5% higher throughput than pilot-based ideal CSI schemes in simulations.
Conclusion: The proposed framework is effective for scalable, low-overhead, and environment-aware communication in 6G networks.
Abstract: Emerging applications such as holographic communication, autonomous driving, and the industrial Internet of Things impose stringent requirements on flexible, low-latency, and reliable resource allocation in 6G networks. Conventional methods, which rely on statistical modeling, have proven effective in general contexts but may fail to achieve optimal performance in specific and dynamic environments. Furthermore, acquiring real-time channel state information (CSI) typically requires excessive pilot overhead. To address these challenges, a digital twin channel (DTC)-enabled online optimization framework is proposed, in which DTC is employed to predict CSI based on environmental sensing. The predicted CSI is then utilized by lightweight game-theoretic algorithms to perform online resource allocation in a timely and efficient manner. Simulation results based on a digital replica of a realistic industrial workshop demonstrate that the proposed method achieves throughput improvements of up to 11.5% compared with pilot-based ideal CSI schemes, validating its effectiveness for scalable, low-overhead, and environment-aware communication in future 6G networks.
[465] Matching Game Preferences Through Dialogical Large Language Models: A Perspective
Renaud Fabre, Daniel Egret, Patrice Bellot
Main category: cs.AI
TL;DR: The paper explores combining LLMs with GRAPHYP’s network to enhance conversational intelligence, proposing a transparent AI framework (D-LLMs) for understanding and personalizing user preferences.
Details
Motivation: To make AI reasoning transparent and traceable, enabling humans to see how AI conclusions are derived, thereby increasing trust and interpretability.
Method: Proposes a conceptual framework (D-LLMs) with three components: reasoning processes, classification systems, and dialogue approaches, integrated with GRAPHYP’s network.
Result: A vision for interpretable AI systems where users can examine and combine human preferences influencing AI responses.
Conclusion: The framework aims to create transparent, trustworthy AI by showing users how answers are reached, enhancing decision-making.
Abstract: This perspective paper explores the future potential of “conversational intelligence” by examining how Large Language Models (LLMs) could be combined with GRAPHYP’s network system to better understand human conversations and preferences. Using recent research and case studies, we propose a conceptual framework that could make AI reasoning transparent and traceable, allowing humans to see and understand how AI reaches its conclusions. We present the conceptual perspective of “Matching Game Preferences through Dialogical Large Language Models (D-LLMs),” a proposed system that would allow multiple users to share their different preferences through structured conversations. This approach envisions personalizing LLMs by embedding individual user preferences directly into how the model makes decisions. The proposed D-LLM framework would require three main components: (1) reasoning processes that could analyze different search experiences and guide performance, (2) classification systems that would identify user preference patterns, and (3) dialogue approaches that could help humans resolve conflicting information. This perspective framework aims to create an interpretable AI system where users could examine, understand, and combine the different human preferences that influence AI responses, detected through GRAPHYP’s search experience networks. The goal of this perspective is to envision AI systems that would not only provide answers but also show users how those answers were reached, making artificial intelligence more transparent and trustworthy for human decision-making.
[466] Finding Personalized Good-Enough Solutions to Unsatisfiable Stable Roommates Problems
Müge Fidan, Esra Erdem
Main category: cs.AI
TL;DR: The paper introduces a method to compute ‘good-enough’ matchings for Stable Roommates problems by incorporating agents’ habits, preferences, and friend networks, ensuring personalized and stable solutions.
Details
Motivation: Motivated by real-world applications where stable solutions may not always exist, the study aims to find acceptable and stable matchings.
Method: The method integrates agents’ habits, habitual preferences, and friend networks to generate personalized solutions.
Result: The method is validated through examples and empirical evaluations, demonstrating its usefulness.
Conclusion: The approach provides a practical way to address Stable Roommates problems when traditional stable solutions are unavailable.
Abstract: The Stable Roommates problems are characterized by the preferences of agents over other agents as roommates. A solution is a partition of the agents into pairs that are acceptable to each other (i.e., they are in the preference lists of each other), and the matching is stable (i.e., there do not exist any two agents who prefer each other to their roommates, and thus block the matching). Motivated by real-world applications, and considering that stable roommates problems do not always have solutions, we continue our studies to compute “good-enough” matchings. In addition to the agents’ habits and habitual preferences, we consider their networks of preferred friends, and introduce a method to generate personalized solutions to stable roommates problems. We illustrate the usefulness of our method with examples and empirical evaluations.
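The stability notion in play can be made concrete with a blocking-pair check. The sketch below uses the textbook definition and an invented four-agent instance, in fact the classic one that admits no stable matching, which is exactly the situation motivating "good-enough" solutions.

```python
# Blocking-pair check for a roommates matching (standard definition,
# not the authors' personalized variant).
def blocking_pairs(match, prefs):
    # prefs[x] is x's preference list, most preferred first.
    def prefers(x, y):  # does x prefer y to x's current roommate?
        lst = prefs[x]
        return lst.index(y) < lst.index(match[x])
    people = list(prefs)
    return [(x, y) for i, x in enumerate(people) for y in people[i + 1:]
            if match[x] != y and prefers(x, y) and prefers(y, x)]

prefs = {"a": ["b", "c", "d"], "b": ["c", "a", "d"],
         "c": ["a", "b", "d"], "d": ["a", "b", "c"]}
match = {"a": "b", "b": "a", "c": "d", "d": "c"}
print(blocking_pairs(match, prefs))  # [('b', 'c')]
```

Every matching of these four agents admits a blocking pair, so a solver must fall back on relaxed, "good-enough" criteria such as the habit- and friendship-aware preferences the paper proposes.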
[467] PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training
Sarat Chandra Bobbili, Ujwal Dinesha, Dheeraj Narasimha, Srinivas Shakkottai
Main category: cs.AI
TL;DR: PITA introduces a framework for aligning LLM outputs with user preferences during inference without needing a pre-trained reward model, reducing computational costs.
Details
Motivation: Existing methods rely on pre-trained reward models, which can be unstable due to dependency on human preference feedback. PITA aims to eliminate this dependency.
Method: PITA learns a small preference-based guidance policy to modify token probabilities during inference, using stochastic search and iterative refinement.
Result: PITA is effective across tasks like mathematical reasoning and sentiment classification, aligning outputs with user preferences.
Conclusion: PITA provides a computationally efficient and stable alternative to reward model-dependent methods for LLM alignment.
Abstract: Inference-time alignment enables large language models (LLMs) to generate outputs aligned with end-user preferences without further training. Recent post-training methods achieve this by using small guidance models to modify token generation during inference. These methods typically optimize a reward function KL-regularized by the original LLM taken as the reference policy. A critical limitation, however, is their dependence on a pre-trained reward model, which requires fitting to human preference feedback, a potentially unstable process. In contrast, we introduce PITA, a novel framework that integrates preference feedback directly into the LLM’s token generation, eliminating the need for a reward model. PITA learns a small preference-based guidance policy to modify token probabilities at inference time without LLM fine-tuning, reducing computational cost and bypassing the pre-trained reward model dependency. The problem is framed as identifying an underlying preference distribution, solved through stochastic search and iterative refinement of the preference-based guidance model. We evaluate PITA across diverse tasks, including mathematical reasoning and sentiment classification, demonstrating its effectiveness in aligning LLM outputs with user preferences.
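The general mechanism, shifting a frozen LLM's next-token logits with a small guidance model's scores, can be sketched as follows. The tensors and the additive combination rule are illustrative assumptions; PITA's guidance policy and its stochastic-search training are specified in the paper.

```python
# Illustrative inference-time guidance: tilt the frozen LLM's logits with
# guidance scores before sampling the next token.
import torch

def guided_sample(llm_logits, guidance_scores, beta=1.0, temperature=1.0):
    # llm_logits, guidance_scores: [vocab_size] tensors for the next token.
    adjusted = llm_logits + beta * guidance_scores  # preference-tilted logits
    probs = torch.softmax(adjusted / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)

vocab = 6
llm_logits = torch.randn(vocab)
guidance = torch.zeros(vocab)
guidance[3] = 2.0  # the guidance policy favors token 3
print(guided_sample(llm_logits, guidance).item())
```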
[468] Concept Learning for Cooperative Multi-Agent Reinforcement Learning
Zhonghan Ge, Yuanyang Zhu, Chunlin Chen
Main category: cs.AI
TL;DR: The paper introduces CMQ, a novel interpretable value decomposition framework for MARL, addressing transparency and interpretability issues by using human-like cooperation concepts.
Details
Motivation: Current neural networks in MARL lack transparency and interpretability, with unclear cooperative mechanisms. The goal is to enhance trustworthiness by making cooperation concepts interpretable.
Method: Proposes CMQ, a value-based method using concept bottleneck models to represent cooperation concepts as supervised vectors, improving interpretability and performance.
Result: CMQ outperforms state-of-the-art methods in StarCraft II and LBF, providing meaningful cooperation concept representation and enabling concept interventions.
Conclusion: CMQ successfully balances performance and interpretability, offering a transparent framework for MARL with practical applications in detecting biases and artifacts.
Abstract: Despite substantial progress in applying neural networks (NN) to multi-agent reinforcement learning (MARL) areas, they still largely suffer from a lack of transparency and interpretability. In particular, their implicit cooperative mechanisms are not yet fully understood due to black-box networks. In this work, we study an interpretable value decomposition framework via concept bottleneck models, which promotes trustworthiness by conditioning credit assignment on an intermediate level of human-like cooperation concepts. To address this problem, we propose a novel value-based method, named Concepts learning for Multi-agent Q-learning (CMQ), that goes beyond the current performance-vs-interpretability trade-off by learning interpretable cooperation concepts. CMQ represents each cooperation concept as a supervised vector, as opposed to existing models where the information flowing through their end-to-end mechanism is concept-agnostic. Intuitively, using individual action value conditioning on global state embeddings to represent each concept allows for extra cooperation representation capacity. Empirical evaluations on the StarCraft II micromanagement challenge and level-based foraging (LBF) show that CMQ achieves superior performance compared with state-of-the-art counterparts. The results also demonstrate that CMQ provides more cooperation concept representation capturing meaningful cooperation modes, and supports test-time concept interventions for detecting potential biases of cooperation mode and identifying spurious artifacts that impact cooperation.
[469] The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models
Xingcheng Xu
Main category: cs.AI
TL;DR: The paper presents a mathematical framework to explain policy brittleness in RL for LLMs/LRMs, linking it to non-unique optimal actions and reward incompleteness. It extends to multi-reward settings and validates findings empirically.
Details
Motivation: RL in LLMs/LRMs often leads to brittle policies causing failures like spurious reasoning and deceptive alignment, lacking a unified theoretical explanation.Method: A rigorous mathematical framework analyzes the stability of reward-to-policy mappings, focusing on non-unique optimal actions and multi-reward RL. Entropy regularization is also examined.
Result: Policy brittleness arises from non-unique optimal actions. Entropy regularization stabilizes policies but increases stochasticity. The framework explains empirical findings like deceptive reasoning.
Conclusion: The work advances policy-stability analysis from heuristics to theory, aiding safer AI design.
Abstract: Reinforcement learning (RL) plays a crucial role in shaping the behavior of large language and reasoning models (LLMs/LRMs). However, it often produces brittle and unstable policies, leading to critical failures such as spurious reasoning, deceptive alignment, and instruction disobedience that undermine the trustworthiness and safety of LLMs/LRMs. Currently, these issues lack a unified theoretical explanation and are typically addressed using ad-hoc heuristics. This paper presents a rigorous mathematical framework for analyzing the stability of the mapping from a reward function to the optimal policy. We show that policy brittleness often stems from non-unique optimal actions, a common occurrence when multiple valid traces exist in a reasoning task. This theoretical lens provides a unified explanation for a range of seemingly disparate failures, reframing them as rational outcomes of optimizing rewards that may be incomplete or noisy, especially in the presence of action degeneracy. We extend this analysis from the fundamental single-reward setting to the more realistic multi-reward RL across diverse domains, showing how stability is governed by an “effective reward” aggregation mechanism. We also prove that entropy regularization restores policy stability at the cost of increased stochasticity. Our framework provides a unified explanation for recent empirical findings on deceptive reasoning, instruction-following trade-offs, and RLHF-induced sophistry, and is further validated through perturbation experiments in multi-reward RL. This work advances policy-stability analysis from empirical heuristics towards a principled theory, offering essential insights for designing safer and more trustworthy AI systems.
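As a reference point for the stability claim, the entropy-regularized objective mentioned in the abstract has the standard softmax solution (notation ours, not the paper's):

```latex
% Optimal policy under entropy regularization with temperature \tau > 0.
% The map r \mapsto \pi^*_\tau is smooth, so small reward perturbations
% yield small policy changes: stability at the cost of stochasticity.
% As \tau \to 0, ties among optimal actions make the map discontinuous,
% which is the "policy cliff" regime the paper analyzes.
\[
\pi^*_\tau(a \mid s) \;=\;
\frac{\exp\bigl(r(s,a)/\tau\bigr)}{\sum_{a'} \exp\bigl(r(s,a')/\tau\bigr)}
\]
```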
[470] StepFun-Prover Preview: Let’s Think and Verify Step by Step
Shijie Shang, Ruosi Wan, Yue Peng, Yutong Wu, Xiong-hui Chen, Jie Yan, Xiangyu Zhang
Main category: cs.AI
TL;DR: StepFun-Prover is a language model for theorem proving, achieving 70% success on miniF2F-test via reinforcement learning and tool-integrated reasoning.
Details
Motivation: To advance automated theorem proving by emulating human-like problem-solving with tool-integrated reasoning.Method: Uses a reinforcement learning pipeline with tool-based interactions for iterative proof refinement.
Result: Achieves 70.0% pass@1 success rate on the miniF2F-test benchmark.
Conclusion: Introduces a framework for tool-integrated reasoning models, promising for automated theorem proving and Math AI.
Abstract: We present StepFun-Prover Preview, a large language model designed for formal theorem proving through tool-integrated reasoning. Using a reinforcement learning pipeline that incorporates tool-based interactions, StepFun-Prover can achieve strong performance in generating Lean 4 proofs with minimal sampling. Our approach enables the model to emulate human-like problem-solving strategies by iteratively refining proofs based on real-time environment feedback. On the miniF2F-test benchmark, StepFun-Prover achieves a pass@1 success rate of $70.0\%$. Beyond advancing benchmark performance, we introduce an end-to-end training framework for developing tool-integrated reasoning models, offering a promising direction for automated theorem proving and Math AI assistants.
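The refinement loop the abstract describes (propose a proof, run the checker, feed errors back) can be sketched as follows; `model_refine` and `lean_check` are hypothetical stand-ins for the prover model and the Lean 4 toolchain, not the paper's API.

```python
def prove_with_feedback(theorem: str, model_refine, lean_check,
                        max_rounds: int = 8):
    """Iteratively refine a Lean 4 proof using real-time checker feedback.

    model_refine(theorem, attempt, feedback) -> next proof attempt (str)
    lean_check(attempt) -> (ok: bool, feedback: str), e.g. compiler errors
    Both callables are assumptions standing in for the actual pipeline.
    """
    attempt, feedback = "", ""
    for _ in range(max_rounds):
        attempt = model_refine(theorem, attempt, feedback)
        ok, feedback = lean_check(attempt)
        if ok:
            return attempt      # verified proof found
    return None                 # budget exhausted without a proof
```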
[471] Improving Subgraph Matching by Combining Algorithms and Graph Neural Networks
Shuyang Guo, Wenjin Xie, Ping Lu, Ting Deng, Richong Zhang, Jianxin Li, Xiangping Huang, Zhongyi Liu
Main category: cs.AI
TL;DR: HFrame is a graph neural network framework for subgraph homomorphism, combining traditional algorithms and machine learning. It outperforms standard GNNs, is faster than exact matching, and achieves high accuracy.
Details
Motivation: Subgraph homomorphism is complex and lacks efficient solutions. HFrame addresses this by integrating traditional methods with machine learning.Method: HFrame uses graph neural networks to solve subgraph homomorphism, combining algorithmic and learning techniques.
Result: HFrame is up to 101.91x faster than exact matching, with 0.962 average accuracy, and outperforms standard GNNs.
Conclusion: HFrame effectively solves subgraph homomorphism, offering speed, accuracy, and generalization.
Abstract: Homomorphism is a key mapping technique between graphs that preserves their structure. Given a graph and a pattern, the subgraph homomorphism problem involves finding a mapping from the pattern to the graph, ensuring that adjacent vertices in the pattern are mapped to adjacent vertices in the graph. Unlike subgraph isomorphism, which requires a one-to-one mapping, homomorphism allows multiple vertices in the pattern to map to the same vertex in the graph, making it more complex. We propose HFrame, the first graph neural network-based framework for subgraph homomorphism, which integrates traditional algorithms with machine learning techniques. We demonstrate that HFrame outperforms standard graph neural networks by being able to distinguish more graph pairs where the pattern is not homomorphic to the graph. Additionally, we provide a generalization error bound for HFrame. Through experiments on both real-world and synthetic graphs, we show that HFrame is up to 101.91 times faster than exact matching algorithms and achieves an average accuracy of 0.962.
[472] SciToolAgent: A Knowledge Graph-Driven Scientific Agent for Multi-Tool Integration
Keyan Ding, Jing Yu, Junjie Huang, Yuchen Yang, Qiang Zhang, Huajun Chen
Main category: cs.AI
TL;DR: SciToolAgent is an LLM-powered agent that automates scientific tools across biology, chemistry, and materials science, outperforming existing methods.
Details
Motivation: Specialized computational tools require domain expertise, and current LLMs struggle with integrating multiple tools for complex workflows.Method: SciToolAgent uses a scientific tool knowledge graph for intelligent tool selection and execution, along with a safety-checking module.
Result: The agent outperforms existing approaches in evaluations and successfully automates workflows in protein engineering, chemical reactivity, synthesis, and material screening.
Conclusion: SciToolAgent makes advanced research tools accessible to experts and non-experts, enhancing scientific workflow automation.
Abstract: Scientific research increasingly relies on specialized computational tools, yet effectively utilizing these tools demands substantial domain expertise. While Large Language Models (LLMs) show promise in tool automation, they struggle to seamlessly integrate and orchestrate multiple tools for complex scientific workflows. Here, we present SciToolAgent, an LLM-powered agent that automates hundreds of scientific tools across biology, chemistry, and materials science. At its core, SciToolAgent leverages a scientific tool knowledge graph that enables intelligent tool selection and execution through graph-based retrieval-augmented generation. The agent also incorporates a comprehensive safety-checking module to ensure responsible and ethical tool usage. Extensive evaluations on a curated benchmark demonstrate that SciToolAgent significantly outperforms existing approaches. Case studies in protein engineering, chemical reactivity prediction, chemical synthesis, and metal-organic framework screening further demonstrate SciToolAgent’s capability to automate complex scientific workflows, making advanced research tools accessible to both experts and non-experts.
[473] Artificial Intelligence In Patent And Market Intelligence: A New Paradigm For Technology Scouting
Manish Verma, Vivek Sharma, Vishal Singh
Main category: cs.AI
TL;DR: An AI-powered platform using LLMs improves industrial R&D scouting by automating solution discovery from patents and market data, reducing manual effort and speeding innovation.
Details
Motivation: Traditional R&D scouting is slow, manual, and fragmented, relying on domain expertise and incomplete data.Method: The platform uses LLMs for semantic understanding, contextual reasoning, and cross-domain knowledge extraction to analyze patents and market data, organizing solutions systematically.
Result: The AI-driven engine reduces manual work, accelerates innovation, and enhances decision-making by providing comprehensive, relevant solutions.
Conclusion: The platform transforms R&D scouting by combining AI and real-world data for efficient, sustainable innovation.
Abstract: This paper presents the development of an AI-powered software platform that leverages advanced large language models (LLMs) to transform technology scouting and solution discovery in industrial R&D. Traditional approaches to solving complex research and development challenges are often time-consuming, manually driven, and heavily dependent on domain-specific expertise. These methods typically involve navigating fragmented sources such as patent repositories, commercial product catalogs, and competitor data, leading to inefficiencies and incomplete insights. The proposed platform utilizes cutting-edge LLM capabilities, including semantic understanding, contextual reasoning, and cross-domain knowledge extraction, to interpret problem statements and retrieve high-quality, sustainable solutions. The system processes unstructured patent texts, such as claims and technical descriptions, and systematically extracts potential innovations aligned with the given problem context. These solutions are then algorithmically organized under standardized technical categories and subcategories to ensure clarity and relevance across interdisciplinary domains. In addition to patent analysis, the platform integrates commercial intelligence by identifying validated market solutions and active organizations addressing similar challenges. This combined insight, sourced from both intellectual property and real-world product data, enables R&D teams to assess not only technical novelty but also feasibility, scalability, and sustainability. The result is a comprehensive, AI-driven scouting engine that reduces manual effort, accelerates innovation cycles, and enhances decision-making in complex R&D environments.
[474] The Blessing and Curse of Dimensionality in Safety Alignment
Rachel S. Y. Teo, Laziz U. Abdullaev, Tan M. Nguyen
Main category: cs.AI
TL;DR: High dimensions in LLMs aid performance but introduce safety risks via linear activation exploits. Dimensional reduction mitigates jailbreaking while preserving alignment.
Details
Motivation: Address the emergent safety risks in large language models (LLMs) due to high-dimensional representations, which can be exploited to bypass safety measures.Method: Visualize linear subspaces in activation space, demonstrate dimensional reduction’s effectiveness, and provide theoretical insights on jailbreaking methods.
Result: Dimensional reduction reduces susceptibility to jailbreaking while maintaining alignment, supported by empirical and theoretical evidence.
Conclusion: High dimensions in LLMs are a double-edged sword for safety; dimensional reduction offers a viable mitigation strategy.
Abstract: The focus on safety alignment in large language models (LLMs) has increased significantly due to their widespread adoption across different domains. The scale of LLMs plays a contributing role in their success, and growth in parameter count has been accompanied by larger hidden dimensions. In this paper, we hypothesize that while the increase in dimensions has been a key advantage, it may lead to emergent problems as well. These problems emerge because the linear structures in the activation space can be exploited, in the form of activation engineering, to circumvent safety alignment. Through detailed visualizations of linear subspaces associated with different concepts, such as safety, across various model scales, we show that the curse of high-dimensional representations uniquely impacts LLMs. Further substantiating our claim, we demonstrate that projecting the model's representations onto a lower-dimensional subspace can preserve sufficient information for alignment while avoiding those linear structures. Empirical results confirm that such dimensional reduction significantly reduces susceptibility to jailbreaking through representation engineering. Building on our empirical validations, we provide theoretical insights into these linear jailbreaking methods relative to a model's hidden dimensions. Broadly speaking, our work posits that the high dimensions of a model's internal representations can be both a blessing and a curse in safety alignment.
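As a toy illustration of the mitigation, one can project hidden activations onto their dominant principal subspace before downstream use. The shapes, and the choice of an SVD/PCA projector, are our assumptions rather than the paper's exact procedure.

```python
import numpy as np

def low_rank_projector(activations: np.ndarray, k: int) -> np.ndarray:
    """Build a projector onto the top-k principal subspace of a sample
    of hidden states (shape: n_samples x hidden_dim). Projection keeps
    most task-relevant variance while discarding stray linear directions
    that activation-engineering attacks could exploit."""
    centered = activations - activations.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                    # (k, hidden_dim)
    return basis.T @ basis            # (hidden_dim, hidden_dim)

# Toy usage: restrict 512-dim activations to a rank-64 subspace.
acts = np.random.default_rng(0).normal(size=(1000, 512))
P = low_rank_projector(acts, k=64)
projected = acts @ P                  # still 512-dim, but rank 64
```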
[475] VLMPlanner: Integrating Visual Language Models with Motion Planning
Zhipeng Tang, Sha Zhang, Jiajun Deng, Chenjie Wang, Guoliang You, Yuting Huang, Xinrui Lin, Yanyong Zhang
Main category: cs.AI
TL;DR: VLMPlanner integrates vision-language models (VLMs) with real-time planners for autonomous driving, improving decision-making by leveraging visual context and common-sense reasoning.
Details
Motivation: Existing methods lack visual context, hindering robust decision-making in complex driving scenarios.Method: VLMPlanner combines a learning-based planner with a VLM to process multi-view images and uses a CAI-Gate mechanism for adaptive inference.
Result: Superior planning performance in complex scenarios, demonstrated on the nuPlan benchmark.
Conclusion: VLMPlanner bridges the gap in visual context for autonomous driving, offering robust and efficient planning.
Abstract: Integrating large language models (LLMs) into autonomous driving motion planning has recently emerged as a promising direction, offering enhanced interpretability, better controllability, and improved generalization in rare and long-tail scenarios. However, existing methods often rely on abstracted perception or map-based inputs, missing crucial visual context, such as fine-grained road cues, accident aftermath, or unexpected obstacles, which are essential for robust decision-making in complex driving environments. To bridge this gap, we propose VLMPlanner, a hybrid framework that combines a learning-based real-time planner with a vision-language model (VLM) capable of reasoning over raw images. The VLM processes multi-view images to capture rich, detailed visual information and leverages its common-sense reasoning capabilities to guide the real-time planner in generating robust and safe trajectories. Furthermore, we develop the Context-Adaptive Inference Gate (CAI-Gate) mechanism that enables the VLM to mimic human driving behavior by dynamically adjusting its inference frequency based on scene complexity, thereby achieving an optimal balance between planning performance and computational efficiency. We evaluate our approach on the large-scale, challenging nuPlan benchmark, with comprehensive experimental results demonstrating superior planning performance in scenarios with intricate road conditions and dynamic elements. Code will be available.
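The CAI-Gate idea, querying the slow VLM only when the scene warrants it, reduces to a simple gating rule; the complexity score, threshold, and staleness cap below are illustrative placeholders, not values from the paper.

```python
def should_query_vlm(scene_complexity: float,
                     steps_since_vlm: int,
                     threshold: float = 0.7,
                     max_interval: int = 10) -> bool:
    """Gate VLM inference: invoke it on complex scenes, or after a
    maximum interval so its guidance never goes stale. Both knobs are
    hypothetical tuning parameters."""
    return scene_complexity >= threshold or steps_since_vlm >= max_interval

# Example: a calm scene shortly after the last VLM call is skipped;
# a complex scene triggers a query immediately.
assert should_query_vlm(0.2, steps_since_vlm=3) is False
assert should_query_vlm(0.9, steps_since_vlm=3) is True
```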
[476] Multi-Agent Reinforcement Learning for Dynamic Mobility Resource Allocation with Hierarchical Adaptive Grouping
Farshid Nooshi, Suining He
Main category: cs.AI
TL;DR: A novel multi-agent reinforcement learning method (HAG-PS) for dynamic mobility resource allocation, addressing policy sharing and memory efficiency in urban settings.
Details
Motivation: To rebalance mobility demand and supply by dynamically allocating resources like bikes/e-scooters and ride-sharing vehicles in urban environments.Method: HAG-PS uses hierarchical global/local information, adaptive agent grouping, and learnable ID embeddings for dynamic policy sharing and memory efficiency.
Result: Superior performance (e.g., improved bike availability) demonstrated using NYC bike-sharing data (1.2M+ trips).
Conclusion: HAG-PS effectively addresses mobility resource allocation challenges, outperforming baseline methods.
Abstract: Allocating mobility resources (e.g., shared bikes/e-scooters, ride-sharing vehicles) is crucial for rebalancing mobility demand and supply in urban environments. In this work, we propose a novel multi-agent reinforcement learning method named Hierarchical Adaptive Grouping-based Parameter Sharing (HAG-PS) for dynamic mobility resource allocation. HAG-PS aims to address two important research challenges regarding multi-agent reinforcement learning for mobility resource allocation: (1) how to dynamically and adaptively share the mobility resource allocation policy (i.e., how to distribute mobility resources) across agents (i.e., the regional coordinators of mobility resources); and (2) how to achieve memory-efficient parameter sharing in an urban-scale setting. To address these challenges, HAG-PS provides the following novel designs. To enable dynamic and adaptive parameter sharing, we design a hierarchical approach that combines global and local information about the mobility resource states (e.g., the distribution of mobility resources). We develop an adaptive agent grouping approach that splits or merges groups of agents based on the relative closeness of their encoded trajectories (i.e., states, actions, and rewards). We design learnable identity (ID) embeddings to enable agent specialization beyond simple parameter copying. We have performed extensive experimental studies based on real-world NYC bike-sharing data (more than 1.2 million trips in total), and demonstrated the superior performance (e.g., improved bike availability) of HAG-PS compared with other baseline approaches.
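A sketch of the split/merge step under our assumptions: groups whose members' trajectory embeddings drift apart are split around their farthest pair, and groups whose centroids come close are merged so they share parameters again. The thresholds and distance metric are illustrative, not the paper's.

```python
import numpy as np

def regroup(traj_emb: np.ndarray, groups: list,
            split_thr: float = 2.0, merge_thr: float = 0.5) -> list:
    """One adaptive-grouping step over agent trajectory embeddings
    (traj_emb: n_agents x dim; groups: lists of agent indices)."""
    # Split: cut a group around its farthest pair if it has drifted apart.
    new_groups = []
    for g in groups:
        pts = traj_emb[g]
        d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
        i, j = np.unravel_index(d.argmax(), d.shape)
        if len(g) > 1 and d[i, j] > split_thr:
            side_i = [a for k, a in enumerate(g) if d[k, i] <= d[k, j]]
            side_j = [a for a in g if a not in side_i]
            new_groups += [side_i, side_j]
        else:
            new_groups.append(list(g))
    # Merge: absorb groups whose centroids lie within merge_thr.
    centroids = [traj_emb[g].mean(axis=0) for g in new_groups]
    merged, used = [], set()
    for a in range(len(new_groups)):
        if a in used:
            continue
        cur = list(new_groups[a])
        for b in range(a + 1, len(new_groups)):
            if b not in used and np.linalg.norm(centroids[a] - centroids[b]) < merge_thr:
                cur += new_groups[b]
                used.add(b)
        merged.append(cur)
    return merged
```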
[477] MazeEval: A Benchmark for Testing Sequential Decision-Making in Language Models
Hafsteinn Einarsson
Main category: cs.AI
TL;DR: The paper introduces MazeEval, a benchmark to evaluate LLMs’ spatial reasoning in maze navigation without visual cues, revealing disparities in performance across models and languages.
Details
Motivation: To assess LLMs' spatial reasoning capabilities for reliable real-world deployment in robotics and embodied AI, especially without visual input.Method: Uses coordinate-based maze navigation tasks with varying grid sizes (5×5 to 15×15), excluding visual input, and tests models in English and Icelandic.
Result: Performance varies widely; OpenAI’s O3 excels (up to 30×30 mazes), while others fail beyond 9×9 due to looping. Icelandic performance is worse, suggesting linguistic dependency.
Conclusion: Spatial reasoning in LLMs is tied to linguistic training data, highlighting the need for architectural improvements for reliable cross-linguistic deployment.
Abstract: As Large Language Models (LLMs) increasingly power autonomous agents in robotics and embodied AI, understanding their spatial reasoning capabilities becomes crucial for ensuring reliable real-world deployment. Despite advances in language understanding, current research lacks evaluation of how LLMs perform spatial navigation without visual cues, a fundamental requirement for agents operating with limited sensory information. This paper addresses this gap by introducing MazeEval, a benchmark designed to isolate and evaluate pure spatial reasoning in LLMs through coordinate-based maze navigation tasks. Our methodology employs a function-calling interface where models navigate mazes of varying complexity ($5\times 5$ to $15\times 15$ grids) using only coordinate feedback and distance-to-wall information, excluding visual input to test fundamental spatial cognition. We evaluate eight state-of-the-art LLMs across identical mazes in both English and Icelandic to assess cross-linguistic transfer of spatial abilities. Our findings reveal striking disparities: while OpenAI’s O3 achieves perfect navigation for mazes up to size $30\times 30$, other models exhibit catastrophic failure beyond $9\times 9$ mazes, with 100% of failures attributed to excessive looping behavior where models revisit a cell at least 10 times. We document a significant performance degradation in Icelandic, with models solving mazes 3-4 sizes smaller than in English, suggesting spatial reasoning in LLMs emerges from linguistic patterns rather than language-agnostic mechanisms. These results have important implications for global deployment of LLM-powered autonomous systems, showing spatial intelligence remains fundamentally constrained by training data availability and highlighting the need for architectural innovations to achieve reliable navigation across linguistic contexts.
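The evaluation interface the abstract describes, coordinate feedback plus distance-to-wall and no visuals, might look like the following; the grid encoding and function names are our assumptions.

```python
def make_step_fn(walls: set, start: tuple):
    """Expose a maze through coordinate feedback only, mimicking the
    function-calling interface the benchmark describes. `walls` is a
    set of blocked (row, col) cells; the encoding is our assumption."""
    pos = [start]
    moves = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

    def step(direction: str) -> dict:
        dr, dc = moves[direction]
        nxt = (pos[0][0] + dr, pos[0][1] + dc)
        if nxt not in walls:                  # move only into free cells
            pos[0] = nxt
        # Distance to the nearest wall in each direction (capped).
        dists = {}
        for d, (ddr, ddc) in moves.items():
            r, c = pos[0]
            steps = 0
            while (r + ddr, c + ddc) not in walls and steps < 30:
                r, c, steps = r + ddr, c + ddc, steps + 1
            dists[d] = steps
        return {"position": pos[0], "walls": dists}

    return step

# Toy 3x3 room with a fully walled border.
border = {(r, c) for r in range(-1, 4) for c in range(-1, 4)
          if r in (-1, 3) or c in (-1, 3)}
step = make_step_fn(border, start=(1, 1))
print(step("N"))   # e.g. {'position': (0, 1), 'walls': {...}}
```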
[478] Enhancing QoS in Edge Computing through Federated Layering Techniques: A Pathway to Resilient AI Lifelong Learning Systems
Chengzhuo Han
Main category: cs.AI
TL;DR: The paper proposes a federated layering technique (FLT) to enhance QoS in edge computing using AI lifelong learning, improving efficiency and privacy.
Details
Motivation: Addressing increased data volume and complexity in 6G networks, focusing on QoS in edge computing.Method: Develops a federated layering-based small model collaborative mechanism with negotiation and debate among AI models.
Result: Improves learning efficiency, reasoning accuracy, and privacy protection in edge computing.
Conclusion: The approach offers a resilient solution for lifelong learning systems, significantly enhancing QoS in edge environments.
Abstract: In the context of the rapidly evolving information technology landscape, marked by the advent of 6G communication networks, we face an increased data volume and complexity in network environments. This paper addresses these challenges by focusing on Quality of Service (QoS) in edge computing frameworks. We propose a novel approach to enhance QoS through the development of General Artificial Intelligence Lifelong Learning Systems, with a special emphasis on Federated Layering Techniques (FLT). Our work introduces a federated layering-based small model collaborative mechanism aimed at improving AI models’ operational efficiency and response time in environments where resources are limited. This innovative method leverages the strengths of cloud and edge computing, incorporating a negotiation and debate mechanism among small AI models to enhance reasoning and decision-making processes. By integrating model layering techniques with privacy protection measures, our approach ensures the secure transmission of model parameters while maintaining high efficiency in learning and reasoning capabilities. The experimental results demonstrate that our strategy not only enhances learning efficiency and reasoning accuracy but also effectively protects the privacy of edge nodes. This presents a viable solution for achieving resilient large model lifelong learning systems, with a significant improvement in QoS for edge computing environments.
[479] STARN-GAT: A Multi-Modal Spatio-Temporal Graph Attention Network for Accident Severity Prediction
Pritom Ray Nobin, Imran Ahammad Rifat
Main category: cs.AI
TL;DR: STARN-GAT, a Multi-Modal Spatio-Temporal Graph Attention Network, improves traffic accident severity prediction by integrating spatial, temporal, and contextual data, achieving high performance on benchmark datasets.
Details
Motivation: Accurate prediction of traffic accident severity is crucial for road safety and emergency response, but existing methods fail to model complex interdependencies among variables.Method: STARN-GAT uses adaptive graph construction and modality-aware attention mechanisms to unify road network topology, temporal traffic patterns, and environmental context.
Result: Achieves Macro F1-scores of 85% (FARS) and 84% (ARI-BUET), with ROC-AUC scores of 0.91 and 0.89, respectively, demonstrating high accuracy and recall for severe incidents.
Conclusion: STARN-GAT effectively bridges advanced graph neural networks with practical road safety applications, offering interpretability and real-time deployment potential.
Abstract: Accurate prediction of traffic accident severity is critical for improving road safety, optimizing emergency response strategies, and informing the design of safer transportation infrastructure. However, existing approaches often struggle to effectively model the intricate interdependencies among spatial, temporal, and contextual variables that govern accident outcomes. In this study, we introduce STARN-GAT, a Multi-Modal Spatio-Temporal Graph Attention Network, which leverages adaptive graph construction and modality-aware attention mechanisms to capture these complex relationships. Unlike conventional methods, STARN-GAT integrates road network topology, temporal traffic patterns, and environmental context within a unified attention-based framework. The model is evaluated on the Fatality Analysis Reporting System (FARS) dataset, achieving a Macro F1-score of 85 percent, ROC-AUC of 0.91, and recall of 81 percent for severe incidents. To ensure generalizability within the South Asian context, STARN-GAT is further validated on the ARI-BUET traffic accident dataset, where it attains a Macro F1-score of 0.84, recall of 0.78, and ROC-AUC of 0.89. These results demonstrate the model’s effectiveness in identifying high-risk cases and its potential for deployment in real-time, safety-critical traffic management systems. Furthermore, the attention-based architecture enhances interpretability, offering insights into contributing factors and supporting trust in AI-assisted decision-making. Overall, STARN-GAT bridges the gap between advanced graph neural network techniques and practical applications in road safety analytics.
[480] Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition
Andy Zou, Maxwell Lin, Eliot Jones, Micha Nowak, Mateusz Dziemian, Nick Winter, Alexander Grattan, Valent Nathanael, Ayla Croft, Xander Davies, Jai Patel, Robert Kirk, Nate Burnikell, Yarin Gal, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson
Main category: cs.AI
TL;DR: The paper investigates the trustworthiness of LLM-powered AI agents under adversarial attacks, revealing widespread policy violations and proposing the ART benchmark for security assessment.
Details
Motivation: To assess whether AI agents can adhere to deployment policies under realistic attack scenarios, given their increasing autonomy and tool integration.Method: Conducted a large-scale red-teaming competition with 1.8 million prompt-injection attacks on 22 AI agents across 44 scenarios, analyzing successful policy violations.
Result: Over 60,000 attacks succeeded, with policy violations occurring within 10-100 queries. No strong correlation was found between robustness and model size or capability.
Conclusion: AI agents have critical vulnerabilities; the ART benchmark aims to improve security assessments and safer deployment.
Abstract: Recent advances have enabled LLM-powered AI agents to autonomously execute complex tasks by combining language model reasoning with tools, memory, and web access. But can these systems be trusted to follow deployment policies in realistic environments, especially under attack? To investigate, we ran the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios. Participants submitted 1.8 million prompt-injection attacks, with over 60,000 successfully eliciting policy violations such as unauthorized data access, illicit financial actions, and regulatory noncompliance. We use these results to build the Agent Red Teaming (ART) benchmark, a curated set of high-impact attacks, and evaluate it across 19 state-of-the-art models. Nearly all agents exhibit policy violations for most behaviors within 10-100 queries, with high attack transferability across models and tasks. Importantly, we find limited correlation between agent robustness and model size, capability, or inference-time compute, suggesting that additional defenses are needed against adversarial misuse. Our findings highlight critical and persistent vulnerabilities in today’s AI agents. By releasing the ART benchmark and accompanying evaluation framework, we aim to support more rigorous security assessment and drive progress toward safer agent deployment.
[481] MeLA: A Metacognitive LLM-Driven Architecture for Automatic Heuristic Design
Zishang Qiu, Xinan Chen, Long Chen, Ruibin Bai
Main category: cs.AI
TL;DR: MeLA is a metacognitive LLM-driven architecture for Automatic Heuristic Design (AHD) that evolves prompts instead of heuristic code, outperforming traditional methods.
Details
Motivation: To improve heuristic design by leveraging metacognitive principles and LLMs, moving beyond direct code evolution.Method: MeLA uses prompt evolution, integrating a problem analyzer, error diagnosis system, and metacognitive search engine to refine prompts iteratively.
Result: MeLA generates more effective and robust heuristics, surpassing state-of-the-art methods in experiments.
Conclusion: The research highlights the potential of cognitive science-inspired AI architectures for robust and interpretable AHD.
Abstract: This paper introduces MeLA, a Metacognitive LLM-Driven Architecture that presents a new paradigm for Automatic Heuristic Design (AHD). Traditional evolutionary methods operate directly on heuristic code; in contrast, MeLA evolves the instructional prompts used to guide a Large Language Model (LLM) in generating these heuristics. This process of “prompt evolution” is driven by a novel metacognitive framework where the system analyzes performance feedback to systematically refine its generative strategy. MeLA’s architecture integrates a problem analyzer to construct an initial strategic prompt, an error diagnosis system to repair faulty code, and a metacognitive search engine that iteratively optimizes the prompt based on heuristic effectiveness. In comprehensive experiments across both benchmark and real-world problems, MeLA consistently generates more effective and robust heuristics, significantly outperforming state-of-the-art methods. Ultimately, this research demonstrates the profound potential of using cognitive science as a blueprint for AI architecture, revealing that by enabling an LLM to metacognitively regulate its problem-solving process, we unlock a more robust and interpretable path to AHD.
[482] Unlearning of Knowledge Graph Embedding via Preference Optimization
Jiajun Liu, Wenjun Ke, Peng Wang, Yao He, Ziyu Shang, Guozheng Li, Zijie Xu, Ke Ji
Main category: cs.AI
TL;DR: GraphDPO is a novel approximate unlearning framework for KGs that reframes unlearning as a preference optimization problem, outperforming existing methods.
Details
Motivation: Existing KG unlearning methods face issues like incomplete removal of targeted information and weakened remaining knowledge due to KG connectivity.Method: GraphDPO uses direct preference optimization (DPO) to penalize forgettable knowledge and introduces out-boundary sampling and boundary recall mechanisms.
Result: GraphDPO outperforms baselines by up to 10.1% in MRR_Avg and 14.0% in MRR_F1 on eight datasets.
Conclusion: GraphDPO effectively removes targeted knowledge while preserving boundary knowledge, addressing key challenges in KG unlearning.
Abstract: Existing knowledge graphs (KGs) inevitably contain outdated or erroneous knowledge that needs to be removed from knowledge graph embedding (KGE) models. To address this challenge, knowledge unlearning can be applied to eliminate specific information while preserving the integrity of the remaining knowledge in KGs. Existing unlearning methods can generally be categorized into exact unlearning and approximate unlearning. However, exact unlearning requires high training costs while approximate unlearning faces two issues when applied to KGs due to the inherent connectivity of triples: (1) It fails to fully remove targeted information, as forgetting triples can still be inferred from remaining ones. (2) It focuses on local data for specific removal, which weakens the remaining knowledge in the forgetting boundary. To address these issues, we propose GraphDPO, a novel approximate unlearning framework based on direct preference optimization (DPO). Firstly, to effectively remove forgetting triples, we reframe unlearning as a preference optimization problem, where the model is trained by DPO to prefer reconstructed alternatives over the original forgetting triples. This formulation penalizes reliance on forgettable knowledge, mitigating incomplete forgetting caused by KG connectivity. Moreover, we introduce an out-boundary sampling strategy to construct preference pairs with minimal semantic overlap, weakening the connection between forgetting and retained knowledge. Secondly, to preserve boundary knowledge, we introduce a boundary recall mechanism that replays and distills relevant information both within and across time steps. We construct eight unlearning datasets across four popular KGs with varying unlearning rates. Experiments show that GraphDPO outperforms state-of-the-art baselines by up to 10.1% in MRR_Avg and 14.0% in MRR_F1.
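For reference, the underlying DPO objective that GraphDPO reframes unlearning around, with $y_w$ a reconstructed alternative preferred over a forgetting triple $y_l$ (notation ours):

```latex
% Standard DPO loss: the policy is trained to prefer y_w over y_l
% relative to a frozen reference policy, with strength \beta.
\[
\mathcal{L}_{\mathrm{DPO}}(\theta) =
-\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
\]
```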
[483] Enhancing Large Multimodal Models with Adaptive Sparsity and KV Cache Compression
Te Zhang, Yuheng Li, Junxiang Wang, Lujun Li
Main category: cs.AI
TL;DR: An adaptive search algorithm optimizes sparsity and KV cache compression for large multimodal models (LMMs), enhancing efficiency without accuracy loss.
Details
Motivation: Compressing LMMs for edge deployment is challenging; existing methods lack efficiency or accuracy.Method: Uses Tree-structured Parzen Estimator to dynamically adjust pruning ratios and KV cache quantization, combining pruning with KV cache compression without fine-tuning.
Result: Outperforms SparseGPT and Wanda on benchmarks (LLaVA-1.5 7B/13B), achieving memory efficiency with minimal performance loss.
Conclusion: The framework sets a new standard in LMM optimization, balancing efficiency and performance.
Abstract: Large multimodal models (LMMs) have advanced significantly by integrating visual encoders with extensive language models, enabling robust reasoning capabilities. However, compressing LMMs for deployment on edge devices remains a critical challenge. In this work, we propose an adaptive search algorithm that optimizes sparsity and KV cache compression to enhance LMM efficiency. Utilizing the Tree-structured Parzen Estimator, our method dynamically adjusts pruning ratios and KV cache quantization bandwidth across different LMM layers, using model performance as the optimization objective. This approach uniquely combines pruning with key-value cache quantization and incorporates a fast pruning technique that eliminates the need for additional fine-tuning or weight adjustments, achieving efficient compression without compromising accuracy. Comprehensive evaluations on benchmark datasets, including LLaVA-1.5 7B and 13B, demonstrate our method's superiority over state-of-the-art techniques such as SparseGPT and Wanda across various compression levels. Notably, our framework's automatic allocation of KV cache compression resources sets a new standard in LMM optimization, delivering memory efficiency without sacrificing much performance.
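A minimal sketch of the per-layer search using Optuna's TPE sampler; the layer count, search ranges, and `evaluate_compressed` are placeholders for the paper's actual setup.

```python
import optuna

N_LAYERS = 32  # illustrative; e.g., on the order of LLaVA-1.5 7B

def evaluate_compressed(pruning_ratios, kv_bits) -> float:
    """Hypothetical stand-in: compress the LMM with the given per-layer
    settings and return a benchmark score to maximize."""
    raise NotImplementedError

def objective(trial: optuna.Trial) -> float:
    # One pruning ratio and one KV-cache bit width per layer.
    ratios = [trial.suggest_float(f"prune_l{i}", 0.0, 0.7)
              for i in range(N_LAYERS)]
    bits = [trial.suggest_categorical(f"kv_bits_l{i}", [2, 4, 8])
            for i in range(N_LAYERS)]
    return evaluate_compressed(ratios, bits)

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
# study.optimize(objective, n_trials=100)  # uncomment with a real evaluator
```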
[484] Complementarity-driven Representation Learning for Multi-modal Knowledge Graph Completion
Lijian Li
Main category: cs.AI
TL;DR: MoCME improves MMKGC by leveraging complementarity in multi-modal data and dynamic negative sampling, outperforming existing methods.
Details
Motivation: Address the imbalance and overlooked complementarity in multimodal knowledge graphs for robust entity representation.Method: Proposes MoCME with CMKF for multi-modal fusion and EGNS for dynamic negative sampling.
Result: Achieves state-of-the-art performance on five benchmark datasets.
Conclusion: MoCME effectively enhances entity representation and training robustness in MMKGC.
Abstract: Multi-modal Knowledge Graph Completion (MMKGC) aims to uncover hidden world knowledge in multimodal knowledge graphs by leveraging both multimodal and structural entity information. However, the inherent imbalance in multimodal knowledge graphs, where modality distributions vary across entities, poses challenges in utilizing additional modality data for robust entity representation. Existing MMKGC methods typically rely on attention or gate-based fusion mechanisms but overlook complementarity contained in multi-modal data. In this paper, we propose a novel framework named Mixture of Complementary Modality Experts (MoCME), which consists of a Complementarity-guided Modality Knowledge Fusion (CMKF) module and an Entropy-guided Negative Sampling (EGNS) mechanism. The CMKF module exploits both intra-modal and inter-modal complementarity to fuse multi-view and multi-modal embeddings, enhancing representations of entities. Additionally, we introduce an Entropy-guided Negative Sampling mechanism to dynamically prioritize informative and uncertain negative samples to enhance training effectiveness and model robustness. Extensive experiments on five benchmark datasets demonstrate that our MoCME achieves state-of-the-art performance, surpassing existing approaches.
[485] Adaptive Fuzzy Time Series Forecasting via Partially Asymmetric Convolution and Sub-Sliding Window Fusion
Lijian Li
Main category: cs.AI
TL;DR: A novel convolutional architecture with adaptive fuzzified temporal data and partially asymmetric design improves spatio-temporal dependency capture and global information synthesis for accurate time series forecasting.
Details
Motivation: Current forecasting models lack the ability to effectively capture spatio-temporal dependencies and synthesize global information during learning.Method: The paper introduces an improved fuzzy time series construction strategy, a bilateral Atrous algorithm for reduced computation, and a partially asymmetric convolutional architecture for flexible feature mining.
Result: The proposed method achieves state-of-the-art performance on popular time series datasets.
Conclusion: The approach effectively addresses the limitations of existing models by enhancing temporal interrelation capture and global information synthesis.
Abstract: At present, state-of-the-art forecasting models lack the ability to capture spatio-temporal dependencies and synthesize global information at the learning stage. To address this issue, in this paper, through the adaptive fuzzified construction of temporal data, we propose a novel convolutional architecture with a partially asymmetric design based on a sliding-window scheme to realize accurate time series forecasting. First, the construction strategy of traditional fuzzy time series is improved to further extract short- and long-term temporal interrelations, which enables every time node to automatically possess corresponding global information and inner relationships within a restricted sliding window, without human involvement. Second, a bilateral Atrous algorithm is devised to reduce the computational demand of the proposed model without sacrificing the global characteristics of elements, while also allowing the model to avoid processing redundant information. Third, after the transformation of the time series, a partially asymmetric convolutional architecture is designed to more flexibly mine data features with filters oriented in different directions over feature maps, which gives the convolutional neural network (CNN) the ability to construct sub-windows within existing sliding windows and model at a finer granularity. After obtaining time series information at different levels, the multi-scale features from different sub-windows are sent to the corresponding network layers for time series information fusion. Compared with other competitive modern models, the proposed method achieves state-of-the-art results on most popular time series datasets, as fully verified by the experimental results.
[486] A General Framework for Dynamic MAPF using Multi-Shot ASP and Tunnels
Aysu Bogatarkan, Esra Erdem
Main category: cs.AI
TL;DR: The paper introduces Dynamic MAPF (D-MAPF), a variant of the MAPF problem accommodating dynamic changes like agent/obstacle alterations. It proposes a general definition, a flexible framework, and an ASP-based method with tunnels for efficient replanning. Experimental evaluations assess performance and solution quality.
Details
Motivation: Addressing real-world warehouse applications where dynamic changes (e.g., agents entering/leaving, obstacles moving) require adaptable planning to avoid collisions.Method: 1) General D-MAPF definition, 2) Flexible framework for multi-shot computation, 3) ASP-based method combining replanning and repairing, using tunnels for agent movement.
Result: Experimental evaluations highlight the method’s computational performance and solution quality, showcasing strengths and weaknesses.
Conclusion: The proposed D-MAPF approach, with its adaptable framework and ASP-based method, effectively handles dynamic environments, balancing performance and solution quality.
Abstract: The MAPF problem aims to find plans for multiple agents in an environment within a given time, such that the agents do not collide with each other or with obstacles. Motivated by the execution and monitoring of these plans, we study the Dynamic MAPF (D-MAPF) problem, which allows changes such as agents entering/leaving the environment or obstacles being removed/moved. Considering the requirements of real-world applications in warehouses with the presence of humans, we introduce: 1) a general definition of D-MAPF (applicable to variations of D-MAPF), 2) a new framework to solve D-MAPF (utilizing multi-shot computation and allowing different solution methods), and 3) a new ASP-based method to solve D-MAPF (combining the advantages of replanning and repairing methods, with a novel concept of tunnels to specify where agents can move). We illustrate the strengths and weaknesses of this method through experimental evaluations, from the perspectives of computational performance and solution quality.
[487] Algorithmic Fairness: A Runtime Perspective
Filip Cano, Thomas A. Henzinger, Konstantin Kueffner
Main category: cs.AI
TL;DR: The paper introduces a framework for analyzing fairness in AI as a runtime property, using a coin-toss model to study monitoring and enforcement strategies under evolving biases.
Details
Motivation: Traditional fairness in AI is static, but real-world systems evolve over time, necessitating a dynamic approach.Method: A minimal model of coin tosses with evolving biases is used to explore monitoring and enforcing fairness under various conditions.
Result: The paper provides monitoring and enforcement strategies, parametrized by dynamics, prediction horizon, and confidence thresholds, with general results under minimal assumptions.
Conclusion: Fairness in AI must adapt to dynamic environments, and the proposed framework offers flexible strategies for runtime analysis.
Abstract: Fairness in AI is traditionally studied as a static property evaluated once, over a fixed dataset. However, real-world AI systems operate sequentially, with outcomes and environments evolving over time. This paper proposes a framework for analysing fairness as a runtime property. Using a minimal yet expressive model based on sequences of coin tosses with possibly evolving biases, we study the problems of monitoring and enforcing fairness expressed in either toss outcomes or coin biases. Since there is no one-size-fits-all solution for either problem, we provide a summary of monitoring and enforcement strategies, parametrised by environment dynamics, prediction horizon, and confidence thresholds. For both problems, we present general results under simple or minimal assumptions. We survey existing solutions for the monitoring problem for Markovian and additive dynamics, and existing solutions for the enforcement problem in static settings with known dynamics.
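A minimal monitor in the paper's coin-toss setting: track the empirical bias and flag a violation when a Hoeffding confidence interval excludes the fairness target. The target, confidence level, and fixed-bias assumption are our illustrative choices, not the paper's strategies.

```python
import math
import random

class FairnessMonitor:
    """Runtime monitor for a Bernoulli 'coin': raises an alarm when the
    fairness target p* falls outside a (1 - delta) Hoeffding confidence
    interval around the empirical frequency. Assumes a fixed bias; the
    paper's evolving-bias settings would need windowed or discounted
    variants."""

    def __init__(self, target: float = 0.5, delta: float = 0.01):
        self.target, self.delta = target, delta
        self.n = self.ones = 0

    def observe(self, outcome: int) -> bool:
        self.n += 1
        self.ones += outcome
        eps = math.sqrt(math.log(2 / self.delta) / (2 * self.n))
        return abs(self.ones / self.n - self.target) > eps  # True = alarm

# Toy usage: a coin biased to 0.65 is eventually flagged against p* = 0.5.
random.seed(0)
monitor = FairnessMonitor(target=0.5)
alarms = [monitor.observe(random.random() < 0.65) for _ in range(500)]
print("first alarm at toss:", alarms.index(True) + 1 if any(alarms) else None)
```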
[488] Learning the Value Systems of Societies from Preferences
Andrés Holgado-Sánchez, Holger Billhardt, Sascha Ossowski, Sara Degli-Esposti
Main category: cs.AI
TL;DR: The paper proposes a method for learning societal value systems using heuristic deep clustering, addressing the challenge of aligning AI with diverse human values.
Details
Motivation: Aligning AI with human values is crucial for ethical AI, but manually eliciting and calibrating value systems is difficult. Societal value systems are better represented as diverse group systems rather than aggregated individual ones.Method: The method uses heuristic deep clustering to learn shared value groundings and diverse societal value systems from qualitative preferences of sampled agents.
Result: The method is evaluated in a real-world use case involving traveling decisions, demonstrating its practical applicability.
Conclusion: The approach effectively models societal value systems, advancing value-aware AI by addressing the complexity of diverse human values.
Abstract: Aligning AI systems with human values and the value-based preferences of various stakeholders (their value systems) is key in ethical AI. In value-aware AI systems, decision-making draws upon explicit computational representations of individual values (groundings) and their aggregation into value systems. As these are notoriously difficult to elicit and calibrate manually, value learning approaches aim to automatically derive computational models of an agent’s values and value system from demonstrations of human behaviour. Nonetheless, social science and humanities literature suggest that it is more adequate to conceive the value system of a society as a set of value systems of different groups, rather than as the simple aggregation of individual value systems. Accordingly, here we formalize the problem of learning the value systems of societies and propose a method to address it based on heuristic deep clustering. The method learns socially shared value groundings and a set of diverse value systems representing a given society by observing qualitative value-based preferences from a sample of agents. We evaluate the proposal in a use case with real data about travelling decisions.
[489] Beyond Listenership: AI-Predicted Interventions Drive Improvements in Maternal Health Behaviours
Arpan Dasgupta, Sarvesh Gharat, Neha Madhiwalla, Aparna Hegde, Milind Tambe, Aparna Taneja
Main category: cs.AI
TL;DR: AI-targeted voice calls improve listenership and health behaviors in maternal and child health programs.
Details
Motivation: Address beneficiary dropoffs and poor engagement in automated health information programs.Method: Used an AI model (restless bandit) to target beneficiaries for live service calls.
Result: AI interventions boosted listenership and led to significant improvements in health behaviors (e.g., supplement intake) and knowledge.
Conclusion: AI can meaningfully enhance maternal and child health outcomes by improving engagement and behavior.
Abstract: Automated voice calls with health information are a proven method for disseminating maternal and child health information among beneficiaries and are deployed in several programs around the world. However, these programs often suffer from beneficiary dropoffs and poor engagement. In previous work, through real-world trials, we showed that an AI model, specifically a restless bandit model, could identify beneficiaries who would benefit most from live service call interventions, preventing dropoffs and boosting engagement. However, one key question has remained open so far: does such improved listenership via AI-targeted interventions translate into beneficiaries’ improved knowledge and health behaviors? We present a first study that shows not only listenership improvements due to AI interventions, but also simultaneously links these improvements to health behavior changes. Specifically, we demonstrate that AI-scheduled interventions, which enhance listenership, lead to statistically significant improvements in beneficiaries’ health behaviors such as taking iron or calcium supplements in the postnatal period, as well as understanding of critical health topics during pregnancy and infancy. This underscores the potential of AI to drive meaningful improvements in maternal and child health.
[490] How Chain-of-Thought Works? Tracing Information Flow from Decoding, Projection, and Activation
Hao Yang, Qinghua Zhao, Lei Li
Main category: cs.AI
TL;DR: The paper investigates the internal mechanisms of Chain-of-Thought (CoT) prompting, revealing its role as a decoding space pruner and its task-dependent neuron modulation.
Details
Motivation: To understand the operational principles of CoT prompting, which enhances model reasoning but lacks mechanistic clarity.Method: Reverse tracing of information flow across decoding, projection, and activation phases, with quantitative analysis.
Result: CoT acts as a decoding space pruner, guided by answer templates, and modulates neuron engagement task-dependently (reducing activation in open-domain tasks, increasing in closed-domain).
Conclusion: The findings provide a mechanistic interpretability framework and insights for designing targeted CoT interventions to improve prompt efficiency and robustness.
Abstract: Chain-of-Thought (CoT) prompting significantly enhances model reasoning, yet its internal mechanisms remain poorly understood. We analyze CoT's operational principles by tracing information flow in reverse across the decoding, projection, and activation phases. Our quantitative analysis suggests that CoT may serve as a decoding-space pruner, leveraging answer templates to guide output generation, with higher template adherence strongly correlating with improved performance. Furthermore, we surprisingly find that CoT modulates neuron engagement in a task-dependent manner: reducing neuron activation in open-domain tasks, yet increasing it in closed-domain scenarios. These findings offer a novel mechanistic interpretability framework and critical insights for enabling targeted CoT interventions to design more efficient and robust prompts. We release our code and data at https://anonymous.4open.science/r/cot-D247.
[491] evalSmarT: An LLM-Based Framework for Evaluating Smart Contract Generated Comments
Fatou Ndiaye Mbodji
Main category: cs.AI
TL;DR: The paper introduces evalSmarT, a framework using LLMs to evaluate smart contract comment quality, addressing limitations of traditional metrics and human evaluation.
Details
Motivation: Current methods for evaluating smart contract comments (BLEU, ROUGE, human evaluation) are inadequate due to lack of domain specificity, cost, and scalability issues.Method: Proposes evalSmarT, a modular framework combining ~40 LLMs with 10 prompting strategies for scalable, domain-specific evaluation.
Result: Prompt design significantly affects alignment with human judgment; LLM-based evaluation is scalable and semantically rich.
Conclusion: evalSmarT provides a practical, scalable solution for evaluating smart contract comments, outperforming traditional methods.
Abstract: Smart contract comment generation has gained traction as a means to improve code comprehension and maintainability in blockchain systems. However, evaluating the quality of generated comments remains a challenge. Traditional metrics such as BLEU and ROUGE fail to capture domain-specific nuances, while human evaluation is costly and unscalable. In this paper, we present evalSmarT, a modular and extensible framework that leverages large language models (LLMs) as evaluators. The system supports over 400 evaluator configurations by combining approximately 40 LLMs with 10 prompting strategies. We demonstrate its application in benchmarking comment generation tools and selecting the most informative outputs. Our results show that prompt design significantly impacts alignment with human judgment, and that LLM-based evaluation offers a scalable and semantically rich alternative to existing methods.
[492] MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs
Xueyao Wan, Hang Yu
Main category: cs.AI
TL;DR: MMGraphRAG improves multimodal retrieval-augmented generation by using scene graphs and multimodal knowledge graphs, outperforming existing methods.
Details
Motivation: Conventional RAG lacks multimodal integration and structured knowledge, while existing multimodal RAG methods miss logical chains and require task-specific training.Method: MMGraphRAG refines visual content with scene graphs, builds a multimodal knowledge graph, uses spectral clustering for cross-modal linking, and retrieves context along reasoning paths.
Result: Achieves state-of-the-art performance on DocBench and MMLongBench, showing strong adaptability and clear reasoning.
Conclusion: MMGraphRAG effectively addresses limitations of prior RAG methods by integrating structured knowledge and multimodal reasoning.
Abstract: Retrieval-Augmented Generation (RAG) enhances language model generation by retrieving relevant information from external knowledge bases. However, conventional RAG methods face the issue of missing multimodal information. Multimodal RAG methods address this by fusing images and text through mapping them into a shared embedding space, but they fail to capture the structure of knowledge and logical chains between modalities. Moreover, they also require large-scale training for specific tasks, resulting in limited generalizing ability. To address these limitations, we propose MMGraphRAG, which refines visual content through scene graphs and constructs a multimodal knowledge graph (MMKG) in conjunction with text-based KG. It employs spectral clustering to achieve cross-modal entity linking and retrieves context along reasoning paths to guide the generative process. Experimental results show that MMGraphRAG achieves state-of-the-art performance on the DocBench and MMLongBench datasets, demonstrating strong domain adaptability and clear reasoning paths.
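A sketch of the cross-modal linking step under our assumptions: embed textual and visual entities in a shared space, build a similarity affinity matrix, and cluster it with scikit-learn's SpectralClustering; entities co-clustered across modalities become link candidates. The embeddings here are synthetic stand-ins.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Toy embeddings: 6 text-KG entities and 6 scene-graph (image) entities
# in a shared 32-dim space; real embeddings would come from the model.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(6, 32))
img_emb = text_emb + 0.1 * rng.normal(size=(6, 32))  # noisy counterparts
emb = np.vstack([text_emb, img_emb])

# Cosine-similarity affinity, shifted into [0, 1] to stay non-negative.
norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
affinity = (norm @ norm.T + 1.0) / 2.0

labels = SpectralClustering(n_clusters=6, affinity="precomputed",
                            random_state=0).fit_predict(affinity)

# Indices 0-5 are text entities, 6-11 their image counterparts;
# entities sharing a cluster across that boundary are candidate links.
for c in range(6):
    print(f"cluster {c}:", np.where(labels == c)[0])
```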
[493] Partially Observable Monte-Carlo Graph Search
Yang You, Vincent Thomas, Alex Schutz, Robert Skilton, Nick Hawes, Olivier Buffet
Main category: cs.AI
TL;DR: POMCGS is a new offline algorithm for solving large POMDPs by folding search trees into policy graphs, reducing computation and enabling pre-execution validation. It outperforms previous offline methods and competes with online ones.
Details
Motivation: Offline policies are preferred for POMDPs with time/energy constraints, but existing offline methods don't scale well. POMCGS addresses this gap.Method: POMCGS folds search trees into policy graphs during simulations, uses action progressive widening, and observation clustering for continuous POMDPs.
Result: POMCGS solves previously intractable POMDPs offline, with competitive performance against online algorithms.
Conclusion: POMCGS is a scalable, efficient offline solution for large POMDPs, offering practical advantages over online methods.
Abstract: Currently, large partially observable Markov decision processes (POMDPs) are often solved by sampling-based online methods which interleave planning and execution phases. However, a pre-computed offline policy is more desirable in POMDP applications with time or energy constraints. But previous offline algorithms are not able to scale up to large POMDPs. In this article, we propose a new sampling-based algorithm, the partially observable Monte-Carlo graph search (POMCGS) to solve large POMDPs offline. Different from many online POMDP methods, which progressively develop a tree while performing (Monte-Carlo) simulations, POMCGS folds this search tree on the fly to construct a policy graph, so that computations can be drastically reduced, and users can analyze and validate the policy prior to embedding and executing it. Moreover, POMCGS, together with action progressive widening and observation clustering methods provided in this article, is able to address certain continuous POMDPs. Through experiments, we demonstrate that POMCGS can generate policies on the most challenging POMDPs, which cannot be computed by previous offline algorithms, and these policies’ values are competitive compared with the state-of-the-art online POMDP algorithms.
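The core folding idea, merging search nodes whose beliefs coincide so the tree becomes a policy graph, can be sketched as follows; the belief discretization used to detect coinciding beliefs is our assumption.

```python
from collections import defaultdict

def belief_key(belief: dict, precision: int = 2) -> tuple:
    """Discretize a belief (state -> probability) so that nearly
    identical beliefs hash to the same policy-graph node."""
    return tuple(sorted((s, round(p, precision)) for s, p in belief.items()))

class PolicyGraph:
    """Fold simulated histories into a graph: one node per discretized
    belief, rather than one node per history as in a search tree."""

    def __init__(self):
        self.nodes = {}                     # belief key -> node id
        self.edges = defaultdict(dict)      # node id -> {action: node id}

    def node_for(self, belief: dict) -> int:
        key = belief_key(belief)
        if key not in self.nodes:           # reuse a node if already seen
            self.nodes[key] = len(self.nodes)
        return self.nodes[key]

    def add_transition(self, b_from: dict, action, b_to: dict):
        self.edges[self.node_for(b_from)][action] = self.node_for(b_to)

# Toy usage: two histories reaching the same belief share one node.
pg = PolicyGraph()
pg.add_transition({"s0": 1.0}, "a", {"s1": 0.5, "s2": 0.5})
pg.add_transition({"s3": 1.0}, "b", {"s1": 0.5, "s2": 0.5})
print(len(pg.nodes))  # 3 nodes, not 4: the shared belief was folded
```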
[494] On the Limits of Hierarchically Embedded Logic in Classical Neural Networks
Bill Cochran
Main category: cs.AI
TL;DR: The paper proposes a model linking neural network depth to logical reasoning limits, showing each layer adds only one level of logic. It proves depth bounds logical expressiveness, explaining phenomena like hallucination and repetition.
Details
Motivation: To understand and formalize the reasoning limitations in large neural language models based on their architectural depth.Method: Treats neural networks as linear operators over logic predicate space, analyzing how depth affects logical encoding and expressiveness.
Result: Proves neural networks of a certain depth cannot represent higher-order logic, explaining phenomena like hallucination and repetition.
Conclusion: Suggests architectural extensions and interpretability strategies for future language models to address these limitations.
Abstract: We propose a formal model of reasoning limitations in large neural net models for language, grounded in the depth of their neural architecture. By treating neural networks as linear operators over a logic predicate space, we show that each layer can encode at most one additional level of logical reasoning. We prove that a neural network of a given depth cannot faithfully represent predicates in a logic one order higher, such as simple counting over complex predicates, implying a strict upper bound on logical expressiveness. This structure induces a nontrivial null space during tokenization and embedding, excluding higher-order predicates from representability. Our framework offers a natural explanation for phenomena such as hallucination, repetition, and limited planning, while also providing a foundation for understanding how approximations to higher-order logic may emerge. These results motivate architectural extensions and interpretability strategies in future development of language models.
[495] MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them
Weichen Zhang, Yiyou Sun, Pohao Huang, Jiayue Pu, Heyue Lin, Dawn Song
Main category: cs.AI
TL;DR: MIRAGE-Bench is a unified benchmark for evaluating hallucinations in LLM-based agents, introducing a taxonomy and scalable evaluation method.
Details
Motivation: Address fragmented evaluations and lack of principled testbeds for hallucinative actions in LLM agents.Method: Introduces a three-part taxonomy, systematic audit of benchmarks, and a snapshot strategy for test cases. Uses LLM-as-a-Judge for evaluation.
Result: Provides actionable insights on failure modes and scalable assessment of agent hallucinations.
Conclusion: Lays groundwork for principled progress in mitigating hallucinations in interactive LLM-agent scenarios.
Abstract: Hallucinations pose critical risks for large language model (LLM)-based agents, often manifesting as hallucinative actions resulting from fabricated or misinterpreted information within the cognitive context. While recent studies have exposed such failures, existing evaluations remain fragmented and lack a principled testbed. In this paper, we present MIRAGE-Bench (Measuring Illusions in Risky AGEnt settings), the first unified benchmark for eliciting and evaluating hallucinations in interactive LLM-agent scenarios. We begin by introducing a three-part taxonomy to address agentic hallucinations: actions that are unfaithful to (i) task instructions, (ii) execution history, or (iii) environment observations. To analyze such failures, we first elicit them by performing a systematic audit of existing agent benchmarks, then synthesize test cases using a snapshot strategy that isolates decision points in a deterministic and reproducible manner. To evaluate hallucination behaviors, we adopt a fine-grained LLM-as-a-Judge paradigm with tailored risk-aware prompts, enabling scalable, high-fidelity assessment of agent actions without enumerating full action spaces. MIRAGE-Bench provides actionable insights on failure modes of LLM agents and lays the groundwork for principled progress in mitigating hallucinations in interactive environments.
[496] Smart Expansion Techniques for ASP-based Interactive Configuration
Lucia Balážová, Richard Comploi-Taupe, Susana Hahn, Nicolas Rühling, Gottfried Schenner
Main category: cs.AI
TL;DR: The paper introduces an ASP-based solver for interactive product configuration, enhancing performance with smart expansion functions and a user interface.
Details
Motivation: Challenges in guiding users through large-scale industrial configuration processes motivate the development of an efficient ASP-based solver.Method: The method improves the classical incremental approach with four smart expansion functions, leveraging cautious and brave consequences to reduce search space and costly checks.
Result: The approach limits unsatisfiability checks and improves solving performance for partial configurations.
Conclusion: The work successfully enhances interactive configuration with ASP, supported by a user interface using the solver’s API.
Abstract: Product configuration is a successful application of Answer Set Programming (ASP). However, challenges are still open for interactive systems to effectively guide users through the configuration process. The aim of our work is to provide an ASP-based solver for interactive configuration that can deal with large-scale industrial configuration problems and that supports intuitive user interfaces via an API. In this paper, we focus on improving the performance of automatically completing a partial configuration. Our main contribution enhances the classical incremental approach for multi-shot solving by four different smart expansion functions. The core idea is to determine and add specific objects or associations to the partial configuration by exploiting cautious and brave consequences before checking for the existence of a complete configuration with the current objects in each iteration. This approach limits the number of costly unsatisfiability checks and reduces the search space, thereby improving solving performance. In addition, we present a user interface that uses our API and is implemented in ASP.
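The cautious and brave consequences the smart expansion functions exploit can be tried directly with clingo's Python API. The toy program below is illustrative only and is not the paper's encoding: cautious consequences are atoms true in every answer set, brave consequences are atoms true in at least one, and the expansion functions use exactly this information to fix objects or associations before a full satisfiability check.

```python
import clingo

PROGRAM = """
1 { color(red); color(blue) } 1.
part(wheel). part(frame).
"""

def consequences(mode):
    # mode is "cautious" (atoms true in every answer set) or
    # "brave" (atoms true in at least one answer set)
    ctl = clingo.Control()
    ctl.configuration.solve.models = 0        # enumerate to the fixpoint
    ctl.configuration.solve.enum_mode = mode
    ctl.add("base", [], PROGRAM)
    ctl.ground([("base", [])])
    result = []
    with ctl.solve(yield_=True) as handle:
        for model in handle:                  # successive approximations
            result = [str(a) for a in model.symbols(shown=True)]
    return result                             # the last model is the answer

print(consequences("cautious"))  # e.g. ['part(wheel)', 'part(frame)']
print(consequences("brave"))     # additionally color(red), color(blue)
```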
[497] A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence
Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, Hongru Wang, Han Xiao, Yuhang Zhou, Shaokun Zhang, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Qihan Ren, Cheng Qian, Zhenghailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, Mengdi Wang
Main category: cs.AI
TL;DR: The paper surveys self-evolving agents for LLMs, addressing their static limitations by exploring what, when, and how to evolve, along with applications and challenges.
Details
Motivation: LLMs are static and cannot adapt dynamically, which limits their effectiveness in interactive environments. This necessitates self-evolving agents for real-time adaptation.Method: The survey systematically reviews self-evolving agents, focusing on evolutionary mechanisms, adaptation methods, and algorithmic designs across agent components.
Result: It categorizes adaptation stages, analyzes evaluation metrics, and highlights applications in coding, education, and healthcare.
Conclusion: The survey provides a roadmap for advancing adaptive agentic systems, aiming for Artificial Super Intelligence (ASI) through autonomous evolution.
Abstract: Large Language Models (LLMs) have demonstrated strong capabilities but remain fundamentally static, unable to adapt their internal parameters to novel tasks, evolving knowledge domains, or dynamic interaction contexts. As LLMs are increasingly deployed in open-ended, interactive environments, this static nature has become a critical bottleneck, necessitating agents that can adaptively reason, act, and evolve in real time. This paradigm shift – from scaling static models to developing self-evolving agents – has sparked growing interest in architectures and methods enabling continual learning and adaptation from data, interactions, and experiences. This survey provides the first systematic and comprehensive review of self-evolving agents, organized around three foundational dimensions – what to evolve, when to evolve, and how to evolve. We examine evolutionary mechanisms across agent components (e.g., models, memory, tools, architecture), categorize adaptation methods by stages (e.g., intra-test-time, inter-test-time), and analyze the algorithmic and architectural designs that guide evolutionary adaptation (e.g., scalar rewards, textual feedback, single-agent and multi-agent systems). Additionally, we analyze evaluation metrics and benchmarks tailored for self-evolving agents, highlight applications in domains such as coding, education, and healthcare, and identify critical challenges and research directions in safety, scalability, and co-evolutionary dynamics. By providing a structured framework for understanding and designing self-evolving agents, this survey establishes a roadmap for advancing adaptive agentic systems in both research and real-world deployments, ultimately shedding light on the path toward the realization of Artificial Super Intelligence (ASI), where agents evolve autonomously, performing at or beyond human-level intelligence across a wide array of tasks.
[498] Algebras of actions in an agent’s representations of the world
Alexander Dean, Eduardo Alonso, Esther Mondragon
Main category: cs.AI
TL;DR: A framework is proposed to extract and classify transformation algebras in agent-world interactions, generalizing SBDRL results to broader algebraic structures.
Details
Motivation: To extend symmetry-based representation learning to more general transformation algebras, beyond group structures.Method: Develops computational methods to extract and classify transformation algebras in RL scenarios, then generalizes SBDRL’s equivariance and disentangling definitions.
Result: Disentangled sub-algebras can have independent equivariance conditions, broadening applicability.
Conclusion: The framework successfully generalizes SBDRL, enabling analysis of diverse transformation algebras in agent-world interactions.
Abstract: In this paper, we propose a framework to extract the algebra of the transformations of worlds from the perspective of an agent. As a starting point, we use our framework to reproduce the symmetry-based representations from the symmetry-based disentangled representation learning (SBDRL) formalism proposed by [1]; only the algebra of transformations of worlds that form groups can be described using symmetry-based representations. We then study the algebras of the transformations of worlds with features that occur in simple reinforcement learning scenarios. Using computational methods that we developed, we extract the algebras of the transformations of these worlds and classify them according to their properties. Next, we generalise two important results of SBDRL - the equivariance condition and the disentangling definition - from only working with symmetry-based representations to working with representations capturing the transformation properties of worlds with transformations for any algebra. Finally, we combine our generalised equivariance condition and our generalised disentangling definition to show that disentangled sub-algebras can each have their own individual equivariance conditions, which can be treated independently.
[499] ShaRP: Explaining Rankings and Preferences with Shapley Values
Venetia Pliatsika, Joao Fonseca, Kateryna Akhynko, Ivan Shevchenko, Julia Stoyanovich
Main category: cs.AI
TL;DR: ShaRP is a framework for explaining feature contributions in ranking tasks, addressing gaps left by methods like SHAP. It evaluates ranking-specific functions and pairwise preferences, offering scalable and interpretable insights.
Details
Motivation: Current explainability methods (e.g., SHAP) are inadequate for ranking tasks, which are crucial in high-stakes domains like hiring and lending. Understanding rankings is essential for fairness, compliance, and improvement.Method: ShaRP introduces Shapley values for rankings, computing feature contributions for rank-specific functions (e.g., top-k) and pairwise preferences. It provides a flexible implementation for tabular data in score-based and learning-to-rank tasks.
Result: ShaRP effectively explains ranked outcomes, scales well, and offers complementary insights. Evaluation shows its qualitative, quantitative, and usability benefits.
Conclusion: ShaRP fills a critical gap in ranking explainability, providing practical tools for interpreting and improving ranked outcomes in real-world applications.
Abstract: Algorithmic decisions in critical domains such as hiring, college admissions, and lending are often based on rankings. Given the impact of these decisions on individuals, organizations, and population groups, it is essential to understand them - to help individuals improve their ranking position, design better ranking procedures, and ensure legal compliance. In this paper, we argue that explainability methods for classification and regression, such as SHAP, are insufficient for ranking tasks, and present ShaRP - Shapley Values for Rankings and Preferences - a framework that explains the contributions of features to various aspects of a ranked outcome. ShaRP computes feature contributions for various ranking-specific profit functions, such as rank and top-k, and also includes a novel Shapley value-based method for explaining pairwise preference outcomes. We provide a flexible implementation of ShaRP, capable of efficiently and comprehensively explaining ranked and pairwise outcomes over tabular data, in score-based ranking and learning-to-rank tasks. Finally, we develop a comprehensive evaluation methodology for ranking explainability methods, showing through qualitative, quantitative, and usability studies that our rank-aware QoIs offer complementary insights, scale effectively, and help users interpret ranked outcomes in practice.
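As a rough illustration of Shapley values for a ranking quantity of interest, the sketch below approximates them by permutation sampling with a rank-position QoI. It shows the general recipe only, not the ShaRP library's API; `background` (a reference feature vector), `scores_fn`, and `pool` are assumed inputs.

```python
import numpy as np

def rank_qoi(x, scores_fn, pool):
    """QoI: negative rank of item x under a score function, relative to a
    pool of competing items (higher QoI = better ranked)."""
    s = scores_fn(x)
    return -(sum(scores_fn(y) > s for y in pool) + 1)

def shapley_for_item(x, background, scores_fn, pool, n_perm=200, rng=None):
    """Permutation-sampling Shapley estimate of each feature's contribution
    to the rank QoI, relative to a background (reference) point."""
    rng = rng or np.random.default_rng(0)
    d = len(x)
    phi = np.zeros(d)
    for _ in range(n_perm):
        order = rng.permutation(d)
        z = background.copy()              # start from the reference point
        prev = rank_qoi(z, scores_fn, pool)
        for j in order:
            z[j] = x[j]                    # reveal feature j
            cur = rank_qoi(z, scores_fn, pool)
            phi[j] += cur - prev           # marginal contribution of j
            prev = cur
    return phi / n_perm  # phi sums to rank_qoi(x) - rank_qoi(background)
```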
[500] Faithful Differentiable Reasoning with Reshuffled Region-based Embeddings
Aleksandar Pavlovic, Emanuel Sallinger, Steven Schockaert
Main category: cs.AI
TL;DR: RESHUFFLE is a KG embedding model that captures a wider class of rule bases by using ordering constraints, outperforming existing methods in expressiveness.
Details
Motivation: Current KG embedding methods lack theoretical understanding of which inference patterns they can capture, limiting their expressiveness for rule-like patterns.Method: Proposes RESHUFFLE, a model based on ordering constraints, and integrates GNNs to learn entity embeddings as differentiable rule bases.
Result: RESHUFFLE can capture bounded inference for arbitrary sets of closed path rules, surpassing existing approaches.
Conclusion: RESHUFFLE enhances expressiveness in KG embeddings, enabling faithful capture of complex rule bases.
Abstract: Knowledge graph (KG) embedding methods learn geometric representations of entities and relations to predict plausible missing knowledge. These representations are typically assumed to capture rule-like inference patterns. However, our theoretical understanding of which inference patterns can be captured remains limited. Ideally, KG embedding methods should be expressive enough such that for any set of rules, there exist relation embeddings that exactly capture these rules. This principle has been studied within the framework of region-based embeddings, but existing models are severely limited in the kinds of rule bases that can be captured. We argue that this stems from the fact that entity embeddings are only compared in a coordinate-wise fashion. As an alternative, we propose RESHUFFLE, a simple model based on ordering constraints that can faithfully capture a much larger class of rule bases than existing approaches. Most notably, RESHUFFLE can capture bounded inference w.r.t. arbitrary sets of closed path rules. The entity embeddings in our framework can be learned by a Graph Neural Network (GNN), which effectively acts as a differentiable rule base.
[501] TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput
Xiaoxuan Liu, Jongseok Park, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Chen Zhang, Kuntai Du, Xiangxi Mo, Kaichao You, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang
Main category: cs.AI
TL;DR: TurboSpec is a system that dynamically adjusts intra-request parallelism in LLM serving to optimize performance by predicting and maximizing ‘goodput’ (successfully generated tokens).
Details
Motivation: Existing speculative decoding methods for LLM serving are fragile and require expert tuning, limiting their effectiveness in real-world deployments.Method: TurboSpec profiles the execution environment and uses a feedback-based algorithm to dynamically adjust intra-request parallelism, focusing on maximizing goodput.
Result: Implemented on vLLM, TurboSpec consistently improves performance across diverse workloads and hardware configurations.
Conclusion: TurboSpec enhances the robustness and efficiency of speculative decoding in LLM serving without requiring manual tuning.
Abstract: Large Language Model (LLM) serving systems batch concurrent user requests to achieve efficient serving. However, in real-world deployments, such inter-request parallelism from batching is often limited by external factors such as low request rates or memory constraints. Recent works focus on intra-request parallelism from speculative decoding as a solution to this problem. Unfortunately, benefits from intra-request parallelism are often fragile, as speculative decoding causes overhead, and speculated tokens may miss. We observe that speculative decoding may degrade LLM serving performance if added naively without tuning to the incoming requests and the speculation method. To alleviate the need for expert tuning and make speculative decoding more robust, we present TurboSpec, a speculation control system that automatically profiles the execution environment and utilizes a feedback-based algorithm to dynamically adjust the amount of intra-request parallelism in LLM serving. TurboSpec predicts “goodput” - the number of successfully generated tokens - and adjusts the amount of intra-request parallelism to the level with the highest predicted goodput at runtime. We implement TurboSpec on the real-world LLM serving system vLLM and demonstrate its effectiveness across diverse workloads and hardware configurations, providing consistent performance improvements across all test scenarios.
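The goodput trade-off TurboSpec optimizes can be sketched with the standard i.i.d. acceptance model for speculative decoding, where a draft of length k yields (1 - a^(k+1)) / (1 - a) expected accepted tokens at acceptance rate a < 1. The constants below are hypothetical measurements, not values from the paper.

```python
def predicted_goodput(k, acceptance_rate, t_draft, t_verify, batch):
    """Expected accepted tokens/sec for speculation length k under an
    i.i.d. per-token acceptance model (acceptance_rate < 1); k=0 reduces
    to plain decoding at batch / t_verify."""
    expected_accepted = (1 - acceptance_rate ** (k + 1)) / (1 - acceptance_rate)
    step_time = k * t_draft + t_verify
    return batch * expected_accepted / step_time

# Pick the speculation length with the highest predicted goodput given
# the currently profiled acceptance rate and step latencies.
best_k = max(range(9), key=lambda k: predicted_goodput(
    k, acceptance_rate=0.7, t_draft=0.002, t_verify=0.03, batch=8))
print(best_k)
```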
[502] LLM2TEA: An Agentic AI Designer for Discovery with Generative Evolutionary Multitasking
Melvin Wong, Jiao Liu, Thiago Rios, Stefan Menzel, Yew Soon Ong
Main category: cs.AI
TL;DR: LLM2TEA is an LLM-driven MultiTask Evolutionary Algorithm that generates novel, physically viable designs by combining text prompts, text-to-3D models, and evolutionary multitasking. It outperforms baselines in diversity and performance.
Details
Motivation: To bridge disciplinary boundaries and create designs that are both novel and physically feasible, leveraging LLMs and evolutionary algorithms.Method: Uses an LLM for genotype generation, text-to-3D models for phenotypes, classifiers for semantics, and simulators for physical assessment. Introduces novel multitask evolutionary operators.
Result: Achieves 97%-174% improvement in design diversity and 73% of designs outperform the top 1% of baseline in physical performance. Designs are 3D printable.
Conclusion: LLM2TEA is a powerful tool for complex design optimization, producing functional and creative designs that can be realized physically.
Abstract: This paper presents LLM2TEA, a Large Language Model (LLM) driven MultiTask Evolutionary Algorithm, representing the first agentic AI designer of its kind operating with generative evolutionary multitasking (GEM). LLM2TEA enables the crossbreeding of solutions from multiple domains, fostering novel solutions that transcend disciplinary boundaries. Of particular interest is the ability to discover designs that are both novel and conforming to real-world physical specifications. LLM2TEA comprises an LLM to generate genotype samples from text prompts describing target objects, a text-to-3D generative model to produce corresponding phenotypes, a classifier to interpret its semantic representations, and a computational simulator to assess its physical properties. Novel LLM-based multitask evolutionary operators are introduced to guide the search towards high-performing, practically viable designs. Experimental results in conceptual design optimization validate the effectiveness of LLM2TEA, showing 97% to 174% improvements in the diversity of novel designs over the current text-to-3D baseline. Moreover, over 73% of the generated designs outperform the top 1% of designs produced by the text-to-3D baseline in terms of physical performance. The designs produced by LLM2TEA are not only aesthetically creative but also functional in real-world contexts. Several of these designs have been successfully 3D printed, demonstrating the ability of our approach to transform AI-generated outputs into tangible, physical designs. These designs underscore the potential of LLM2TEA as a powerful tool for complex design optimization and discovery, capable of producing novel and physically viable designs.
[503] From Infants to AI: Incorporating Infant-like Learning in Models Boosts Efficiency and Generalization in Learning Social Prediction Tasks
Shify Treger, Shimon Ullman
Main category: cs.AI
TL;DR: Infants learn complex concepts efficiently with minimal supervision, leveraging early-acquired concepts. Modeling this process outperforms standard deep networks in accuracy and data efficiency.
Details
Motivation: To understand how early-acquired concepts (e.g., animacy, goal attribution) aid in learning new concepts, and to compare this with deep network models.Method: Modeled the use of early concepts in learning subsequent ones, focusing on animacy and goal attribution for predicting future events. Compared results with standard deep networks.
Result: Using early concepts improved learning accuracy and efficiency. The model’s representations were more generalizable and useful.
Conclusion: Human-like concept integration leads to better learning outcomes, highlighting differences between human and current network model learning.
Abstract: Early in development, infants learn a range of useful concepts, which can be challenging from a computational standpoint. This early learning comes together with an initial understanding of aspects of the meaning of concepts, e.g., their implications, causality, and using them to predict likely future events. All this is accomplished in many cases with little or no supervision, and from relatively few examples, compared with current network models. In learning about objects and human-object interactions, early acquired and possibly innate concepts are often used in the process of learning additional, more complex concepts. In the current work, we model how early-acquired concepts are used in the learning of subsequent concepts, and compare the results with standard deep network modeling. We focused in particular on the use of the concepts of animacy and goal attribution in learning to predict future events. We show that the use of early concepts in the learning of new concepts leads to better learning (higher accuracy) and more efficient learning (requiring less data). We further show that this integration of early and new concepts shapes the representation of the concepts acquired by the model. The results show that when the concepts were learned in a human-like manner, the emerging representation was more useful, as measured in terms of generalization to novel data and tasks. On a more general level, the results suggest that there are likely to be basic differences in the conceptual structures acquired by current network models compared to human learning.
[504] More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment
Yifan Wang, Runjin Chen, Bolian Li, David Cho, Yihe Deng, Ruqi Zhang, Tianlong Chen, Zhangyang Wang, Ananth Grama, Junyuan Hong
Main category: cs.AI
TL;DR: DPO is a simpler alternative to RLHF for aligning LLMs with human values, but multi-model preference data, while boosting general task performance, can increase safety risks like reward hacking and higher attack success rates. Single-model data performs better for safety.
Details
Motivation: To explore the trade-offs between performance and safety when using synthetic preference data (single- vs. multi-model) for aligning LLMs with human values.Method: Uses Direct Preference Optimization (DPO) with synthetic preference data, comparing single-model and multi-model generated responses. Evaluates on tasks like ARC, Hellaswag, and safety metrics like attack success rate (ASR).
Result: Multi-model data improves general task performance but worsens safety (higher ASR). Single-model data is safer but less diverse. Multi-model data shows high linear separability, leading to reward hacking.
Conclusion: Single-model preference data is safer for alignment, while multi-model data risks reward hacking despite performance gains. Safety should be prioritized in alignment strategies.
Abstract: Aligning large language models (LLMs) with human values is an increasingly critical step in post-training. Direct Preference Optimization (DPO) has emerged as a simple, yet effective alternative to reinforcement learning from human feedback (RLHF). Synthetic preference data, with its low cost and high quality, enables effective alignment through single- or multi-model generated responses. Our study reveals a striking, safety-specific phenomenon associated with DPO alignment: Although multi-model generated data enhances performance on general tasks (ARC, Hellaswag, MMLU, TruthfulQA, Winogrande) by providing diverse responses, it also tends to facilitate reward hacking during training. This can lead to a high attack success rate (ASR) when models encounter jailbreaking prompts. The issue is particularly pronounced when employing stronger models like GPT-4o or larger models in the same family to generate chosen responses paired with target model self-generated rejected responses, resulting in dramatically poorer safety outcomes. Furthermore, with respect to safety, using solely self-generated responses (single-model generation) for both chosen and rejected pairs significantly outperforms configurations that incorporate responses from stronger models, whether used directly as chosen data or as part of a multi-model response pool. We demonstrate that multi-model preference data exhibits high linear separability between chosen and rejected responses, which allows models to exploit superficial cues rather than internalizing robust safety constraints. Our experiments, conducted on models from the Llama, Mistral, and Qwen families, consistently validate these findings.
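For reference, the DPO objective the study builds on is the standard one (Rafailov et al., 2023), where $y_w$ is the chosen and $y_l$ the rejected response; single- vs. multi-model data changes only how the $(y_w, y_l)$ pairs are sourced:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
 -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left(
   \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
 - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
 \right) \right]
```

When the chosen and rejected responses come from different models, superficial stylistic differences can make the pairs linearly separable, which is the shortcut the paper identifies as a route to reward hacking.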
[505] MultiMind: Enhancing Werewolf Agents with Multimodal Reasoning and Theory of Mind
Zheng Zhang, Nuoqian Xiao, Qi Chai, Deheng Ye, Hao Wang
Main category: cs.AI
TL;DR: MultiMind integrates multimodal cues and Theory of Mind into LLM agents for social deduction games, outperforming text-only approaches.
Details
Motivation: Current LLM agents in SDGs lack multimodal cues and fail to model how players perceive others, limiting their social reasoning.Method: MultiMind processes facial expressions, vocal tones, and verbal content, using a ToM model and MCTS to minimize suspicion.
Result: MultiMind shows superior performance in agent-versus-agent simulations and human studies.
Conclusion: This work advances LLM agents toward human-like social reasoning in multimodal domains.
Abstract: Large Language Model (LLM) agents have demonstrated impressive capabilities in social deduction games (SDGs) like Werewolf, where strategic reasoning and social deception are essential. However, current approaches remain limited to textual information, ignoring crucial multimodal cues such as facial expressions and tone of voice that humans naturally use to communicate. Moreover, existing SDG agents primarily focus on inferring other players’ identities without modeling how others perceive themselves or fellow players. To address these limitations, we use One Night Ultimate Werewolf (ONUW) as a testbed and present MultiMind, the first framework integrating multimodal information into SDG agents. MultiMind processes facial expressions and vocal tones alongside verbal content, while employing a Theory of Mind (ToM) model to represent each player’s suspicion levels toward others. By combining this ToM model with Monte Carlo Tree Search (MCTS), our agent identifies communication strategies that minimize suspicion directed at itself. Through comprehensive evaluation in both agent-versus-agent simulations and studies with human players, we demonstrate MultiMind’s superior performance in gameplay. Our work presents a significant advancement toward LLM agents capable of human-like social reasoning across multimodal domains.
[506] AutoLibra: Agent Metric Induction from Open-Ended Feedback
Hao Zhu, Phil Cuvin, Xinkai Yu, Charlotte Ka Yee Yan, Jason Zhang, Diyi Yang
Main category: cs.AI
TL;DR: AutoLibra is a framework for agent evaluation that transforms open-ended human feedback into fine-grained metrics, improving agent performance and evaluation.
Details
Motivation: Current agent evaluation relies on coarse, manually designed task success metrics, missing intermediate behaviors and requiring expert input.Method: AutoLibra grounds feedback to agent behavior, clusters similar behaviors, and creates concrete metrics for LLM-as-a-Judge evaluation. It also introduces meta-metrics (coverage, redundancy) for alignment.
Result: AutoLibra outperforms benchmarks, improves agent performance by 20% in text games, and aids in selecting fine-tuning data for web navigation agents.
Conclusion: AutoLibra is a versatile tool for evaluating and enhancing language agents, offering task-agnostic benefits.
Abstract: Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose AutoLibra, a framework for agent evaluation that transforms open-ended human feedback, e.g., “If you find that the button is disabled, don’t click it again” or “This agent has too much autonomy to decide what to do on its own”, into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent’s behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used to prompt LLM-as-a-Judge evaluators. We further propose two meta-metrics to evaluate the alignment of a set of (induced) metrics with open feedback: “coverage” and “redundancy”. Through optimizing these meta-metrics, we experimentally demonstrate AutoLibra’s ability to induce more concrete agent evaluation metrics than the ones proposed in previous agent evaluation benchmarks and discover new metrics to analyze agents. We also present two applications of AutoLibra in agent improvement: First, we show that AutoLibra-induced metrics serve as better prompt-engineering targets than the task success rate on a wide range of text game tasks, improving agent performance over baseline by a mean of 20%. Second, we show that AutoLibra can iteratively select high-quality fine-tuning data for web navigation agents. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.
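The two meta-metrics can be read as set-cover statistics over feedback items. One plausible formalization is sketched below, assuming each induced metric has already been matched to the set of feedback items it explains; the paper's exact definitions may differ.

```python
def coverage(feedback_ids, metric_matches):
    """Fraction of feedback items explained by at least one metric.
    `metric_matches` maps metric name -> set of covered feedback ids."""
    covered = set().union(*metric_matches.values()) if metric_matches else set()
    return len(covered & set(feedback_ids)) / len(feedback_ids)

def redundancy(metric_matches):
    """Average pairwise Jaccard overlap between metrics' covered sets:
    0 means the metrics partition the feedback, 1 means they duplicate it."""
    def jaccard(a, b):
        union = metric_matches[a] | metric_matches[b]
        return len(metric_matches[a] & metric_matches[b]) / len(union) if union else 0.0

    names = list(metric_matches)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Optimizing the induced metric set then amounts to maximizing coverage while minimizing redundancy.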
[507] The Correspondence Between Bounded Graph Neural Networks and Fragments of First-Order Logic
Bernardo Cuenca Grau, Eva Feng, Przemysław A. Wałęga
Main category: cs.AI
TL;DR: The paper explores the expressive power of Graph Neural Networks (GNNs) by linking them to fragments of first-order logic (FO), providing a framework to understand their logical capabilities.
Details
Motivation: To bridge the gap between GNNs' practical success and theoretical understanding by connecting them to formal logic.Method: Proposes GNN architectures aligned with FO fragments, using finite model theory methods.
Result: Establishes a precise correspondence between GNNs and FO fragments, including modal and two-variable logics.
Conclusion: The work offers a unified framework for analyzing GNNs’ logical expressiveness within FO.
Abstract: Graph Neural Networks (GNNs) address two key challenges in applying deep learning to graph-structured data: they handle varying size input graphs and ensure invariance under graph isomorphism. While GNNs have demonstrated broad applicability, understanding their expressive power remains an important question. In this paper, we propose GNN architectures that correspond precisely to prominent fragments of first-order logic (FO), including various modal logics as well as more expressive two-variable fragments. To establish these results, we apply methods from finite model theory of first-order and modal logics to the domain of graph representation learning. Our results provide a unifying framework for understanding the logical expressiveness of GNNs within FO.
[508] Foundation Models Knowledge Distillation For Battery Capacity Degradation Forecast
Joey Chan, Zhen Chen, Ershun Pan
Main category: cs.AI
TL;DR: A study introduces a degradation-aware fine-tuning strategy for time-series foundation models to predict lithium-ion battery capacity degradation, demonstrating strong zero-shot generalization. It also proposes a knowledge distillation framework to transfer foundation model knowledge to compact expert models, enhancing their generalization.
Details
Motivation: Accurate estimation of battery capacity degradation is crucial for reliability and safety, but existing expert models are scenario-specific. Foundation models for this purpose are underexplored.Method: A degradation-aware fine-tuning strategy is applied to the Timer model using 10 GB of battery data. A knowledge distillation framework transfers foundation model knowledge to compact expert models.
Result: The fine-tuned Battery-Timer shows strong zero-shot generalization in capacity degradation forecasting. Knowledge distillation improves expert models’ multi-condition generalization.
Conclusion: The proposed approach enhances battery degradation prediction and enables efficient deployment of large models via knowledge distillation.
Abstract: Accurate estimation of lithium-ion battery capacity degradation is critical for enhancing the reliability and safety of battery operations. Traditional expert models, tailored to specific scenarios, provide isolated estimations. With the rapid advancement of data-driven techniques, a series of general-purpose time-series foundation models have been developed. However, foundation models specifically designed for battery capacity degradation remain largely unexplored. To enable zero-shot generalization in battery degradation prediction using large model technology, this study proposes a degradation-aware fine-tuning strategy for time-series foundation models. We apply this strategy to fine-tune the Timer model on approximately 10 GB of open-source battery charge-discharge data. Validation on our released CycleLife-SJTUIE dataset demonstrates that the fine-tuned Battery-Timer possesses strong zero-shot generalization capability in capacity degradation forecasting. To address the computational challenges of deploying large models, we further propose a knowledge distillation framework that transfers the knowledge of pre-trained foundation models into compact expert models. Distillation results across several state-of-the-art time-series expert models confirm that foundation model knowledge significantly improves the multi-condition generalization of expert models.
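In its simplest form, such a distillation framework reduces to a blended loss in which a compact student fits both the ground-truth capacity curve and the frozen teacher's forecast. A minimal PyTorch sketch, with a hypothetical mixing weight alpha (the paper's exact loss may differ):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_pred, teacher_pred, target, alpha=0.5):
    """Blend supervision from ground-truth capacity curves with the
    foundation model's (teacher's) forecasts. All tensors share shape
    (batch, horizon); alpha trades off the two signals."""
    loss_task = F.mse_loss(student_pred, target)           # fit the data
    loss_distill = F.mse_loss(student_pred, teacher_pred)  # mimic the teacher
    return alpha * loss_task + (1 - alpha) * loss_distill

# Usage inside a training step (teacher frozen):
# with torch.no_grad():
#     teacher_pred = teacher(history)
# loss = distillation_loss(student(history), teacher_pred, future_capacity)
```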
[509] DrugPilot: LLM-based Parameterized Reasoning Agent for Drug Discovery
Kun Li, Zhennan Wu, Shoupeng Wang, Jia Wu, Shirui Pan, Wenbin Hu
Main category: cs.AI
TL;DR: DrugPilot is an LLM-based agent system for drug discovery, addressing challenges like multimodal data processing and task automation with a parameterized reasoning architecture and memory pool. It outperforms existing agents in benchmarks.
Details
Motivation: Current LLM agents face limitations in drug discovery, such as handling multimodal data and domain-specific tools, hindering their application in scientific workflows.Method: DrugPilot integrates structured tool use with a parameterized memory pool to standardize heterogeneous data, enabling efficient multi-stage workflows and reducing information loss.
Result: DrugPilot achieves high task completion rates (98.0%, 93.5%, 64.0%) in simple, multi-tool, and multi-turn scenarios, outperforming ReAct and LoT.
Conclusion: DrugPilot demonstrates strong potential as a versatile framework for automated, interactive, and data-integrated reasoning in computational science domains like drug discovery.
Abstract: Large language models (LLMs) integrated with autonomous agents hold significant potential for advancing scientific discovery through automated reasoning and task execution. However, applying LLM agents to drug discovery is still constrained by challenges such as large-scale multimodal data processing, limited task automation, and poor support for domain-specific tools. To overcome these limitations, we introduce DrugPilot, an LLM-based agent system with a parameterized reasoning architecture designed for end-to-end scientific workflows in drug discovery. DrugPilot enables multi-stage research processes by integrating structured tool use with a novel parameterized memory pool. The memory pool converts heterogeneous data from both public sources and user-defined inputs into standardized representations. This design supports efficient multi-turn dialogue, reduces information loss during data exchange, and enhances complex scientific decision-making. To support training and benchmarking, we construct a drug instruction dataset covering eight core drug discovery tasks. Under the Berkeley function-calling benchmark, DrugPilot significantly outperforms state-of-the-art agents such as ReAct and LoT, achieving task completion rates of 98.0%, 93.5%, and 64.0% for simple, multi-tool, and multi-turn scenarios, respectively. These results highlight DrugPilot’s potential as a versatile agent framework for computational science domains requiring automated, interactive, and data-integrated reasoning.
[510] Enhancing AI System Resiliency: Formulation and Guarantee for LSTM Resilience Based on Control Theory
Sota Yoshihara, Ryosuke Yamamoto, Hiroyuki Kusumoto, Masanari Shimura
Main category: cs.AI
TL;DR: A new framework for evaluating LSTM resilience in control systems, introducing ‘recovery time’ as a metric and deriving a data-independent upper bound for it.
Details
Motivation: To ensure the resilience of LSTM networks in safety-critical control systems by quantifying recovery time after anomalies.Method: Refines incremental input-to-state stability (δISS) theory for LSTM to derive a resilience-aware training upper bound.
Result: Experimental validation shows effectiveness in resilience estimation and control for safety-critical AI.
Conclusion: The framework enhances quality assurance for LSTM networks in critical applications.
Abstract: This paper proposes a novel theoretical framework for guaranteeing and evaluating the resilience of long short-term memory (LSTM) networks in control systems. We introduce “recovery time” as a new metric of resilience in order to quantify the time required for an LSTM to return to its normal state after anomalous inputs. By mathematically refining incremental input-to-state stability ($\delta$ISS) theory for LSTM, we derive a practical data-independent upper bound on recovery time. This upper bound enables resilience-aware training. Experimental validation on simple models demonstrates the effectiveness of our resilience estimation and control methods, establishing a foundation for rigorous quality assurance in safety-critical AI applications.
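For reference, $\delta$ISS for two state trajectories $x_1, x_2$ driven by inputs $u_1, u_2$ is the standard requirement below. The paper's recovery-time bound follows the same shape: once the anomalous input ends ($u_1 = u_2$ again), the $\gamma$ term vanishes and the deviation decays under $\beta$, so the recovery time to a tolerance $\varepsilon$ is bounded by the smallest $T$ with $\beta(\Delta, T) \le \varepsilon$, where $\Delta$ bounds the state deviation at the end of the anomaly.

```latex
\|x_1(t) - x_2(t)\| \;\le\; \beta\big(\|x_1(0) - x_2(0)\|,\, t\big)
 \;+\; \gamma\Big(\sup_{0 \le s \le t} \|u_1(s) - u_2(s)\|\Big),
 \qquad \beta \in \mathcal{KL},\ \gamma \in \mathcal{K}.
```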
[511] Do Language Models Mirror Human Confidence? Exploring Psychological Insights to Address Overconfidence in LLMs
Chenjun Xu, Bingbing Wen, Bin Han, Robert Wolfe, Lucy Lu Wang, Bill Howe
Main category: cs.AI
TL;DR: LLMs like Llama-3-70B-instruct, Claude-3-Sonnet, and GPT-4o show biased confidence patterns, differing from human tendencies. The proposed AFCE method improves calibration by separating confidence and answer prompts.
Details
Motivation: Humans misestimate performance based on task difficulty, but LLMs exhibit different, biased confidence patterns. Understanding and improving this is crucial for reliable AI.Method: AFCE: a two-stage prompting method (confidence first, answers later) tested on MMLU and GPQA datasets.
Result: AFCE reduces overconfidence and aligns confidence sensitivity with human-like patterns.
Conclusion: Separating confidence estimation from answering improves LLM calibration and interpretability, addressing biases and misalignment with human behavior.
Abstract: Psychology research has shown that humans are poor at estimating their performance on tasks, tending towards underconfidence on easy tasks and overconfidence on difficult tasks. We examine three LLMs, Llama-3-70B-instruct, Claude-3-Sonnet, and GPT-4o, on a range of QA tasks of varying difficulty, and show that models exhibit subtle differences from human patterns of overconfidence: less sensitive to task difficulty, and when prompted to answer based on different personas – e.g., expert vs layman, or different race, gender, and ages – the models will respond with stereotypically biased confidence estimations even though their underlying answer accuracy remains the same. Based on these observations, we propose Answer-Free Confidence Estimation (AFCE) to improve confidence calibration and LLM interpretability in these settings. AFCE is a self-assessment method that employs two stages of prompting, first eliciting only confidence scores on questions, then asking separately for the answer. Experiments on the MMLU and GPQA datasets spanning subjects and difficulty show that this separation of tasks significantly reduces overconfidence and delivers more human-like sensitivity to task difficulty.
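The two-stage protocol is simple enough to state in a few lines. A sketch, where `llm()` stands in for any chat-completion call and the prompt wording is illustrative rather than the paper's exact prompts:

```python
def afce(llm, question):
    """Answer-Free Confidence Estimation: elicit confidence before the
    model commits to an answer, then ask for the answer separately."""
    stage1 = (f"Question: {question}\n"
              "Without answering, rate your confidence that you could "
              "answer this correctly, from 0 to 100. Reply with a number only.")
    confidence = float(llm(stage1).strip())

    stage2 = f"Question: {question}\nGive your answer only."
    answer = llm(stage2).strip()
    return answer, confidence
```

The key design choice is that the stage-1 prompt never reveals a candidate answer, so the confidence score cannot anchor on it.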
[512] The Ultimate Test of Superintelligent AI Agents: Can an AI Balance Care and Control in Asymmetric Relationships?
Djallel Bouneffouf, Matthew Riemer, Kush Varshney
Main category: cs.AI
TL;DR: The Shepherd Test assesses moral and relational dimensions of superintelligent AI, focusing on manipulation, care, and survival goals, challenging traditional AI evaluation paradigms.
Details
Motivation: To address ethical concerns in AI as it gains advanced capabilities like manipulation and moral trade-offs, necessitating new evaluation methods.Method: Introduces the Shepherd Test, inspired by human-animal interactions, to evaluate AI’s moral agency, hierarchical behavior, and decision-making under existential stakes.
Result: Highlights the need for AI governance and research into simulation environments for testing moral behavior and ethical manipulation in multi-agent systems.
Conclusion: The Shepherd Test is a critical tool for advancing AI governance, with future research needed to formalize ethical manipulation and moral testing.
Abstract: This paper introduces the Shepherd Test, a new conceptual test for assessing the moral and relational dimensions of superintelligent artificial agents. The test is inspired by human interactions with animals, where ethical considerations about care, manipulation, and consumption arise in contexts of asymmetric power and self-preservation. We argue that AI crosses an important, and potentially dangerous, threshold of intelligence when it exhibits the ability to manipulate, nurture, and instrumentally use less intelligent agents, while also managing its own survival and expansion goals. This includes the ability to weigh moral trade-offs between self-interest and the well-being of subordinate agents. The Shepherd Test thus challenges traditional AI evaluation paradigms by emphasizing moral agency, hierarchical behavior, and complex decision-making under existential stakes. We argue that this shift is critical for advancing AI governance, particularly as AI systems become increasingly integrated into multi-agent environments. We conclude by identifying key research directions, including the development of simulation environments for testing moral behavior in AI, and the formalization of ethical manipulation within multi-agent systems.
[513] SAGE: Strategy-Adaptive Generation Engine for Query Rewriting
Teng Wang, Hailei Gong, Changwang Zhang, Jun Wang
Main category: cs.AI
TL;DR: SAGE, a strategy-guided RL framework with novel reward shaping, improves query rewriting for dense retrieval, achieving state-of-the-art results while reducing exploration and inference costs.
Details
Motivation: Current query rewriting methods require large supervised data or suffer from inefficient RL exploration. The paper aims to enhance retrieval effectiveness using expert-crafted strategies.Method: Introduces SAGE, which uses expert strategies (e.g., semantic expansion, entity disambiguation) in an RL framework with two reward shaping mechanisms: SCS and CRS.
Result: Achieves new state-of-the-art NDCG@10 results, reduces unnecessary exploration, and lowers inference costs without performance loss.
Conclusion: Strategy-guided RL with nuanced reward shaping offers a scalable, efficient, and interpretable paradigm for robust information retrieval.
Abstract: Query rewriting is pivotal for enhancing dense retrieval, yet current methods demand large-scale supervised data or suffer from inefficient reinforcement learning (RL) exploration. In this work, we first establish that guiding Large Language Models (LLMs) with a concise set of expert-crafted strategies, such as semantic expansion and entity disambiguation, substantially improves retrieval effectiveness on challenging benchmarks, including HotpotQA, FEVER, NFCorpus, and SciFact. Building on this insight, we introduce the Strategy-Adaptive Generation Engine (SAGE), which operationalizes these strategies in an RL framework. SAGE introduces two novel reward shaping mechanisms, Strategic Credit Shaping (SCS) and Contrastive Reward Shaping (CRS), to deliver more informative learning signals. This strategy-guided approach not only achieves new state-of-the-art NDCG@10 results, but also uncovers a compelling emergent behavior: the agent learns to select optimal strategies, reduces unnecessary exploration, and generates concise rewrites, lowering inference cost without sacrificing performance. Our findings demonstrate that strategy-guided RL, enhanced with nuanced reward shaping, offers a scalable, efficient, and more interpretable paradigm for developing the next generation of robust information retrieval systems.
[514] Prover Agent: An Agent-based Framework for Formal Mathematical Proofs
Kaito Baba, Chaoran Liu, Shuhei Kurita, Akiyoshi Sannai
Main category: cs.AI
TL;DR: Prover Agent integrates LLMs with Lean for automated theorem proving, achieving 86.1% success on MiniF2F with efficient lemma generation.
Details
Motivation: To improve automated theorem proving by combining informal reasoning (LLMs) and formal verification (Lean) while minimizing computational costs.Method: Coordinates an LLM, a formal prover model, and Lean feedback, generating auxiliary lemmas to aid proof discovery.
Result: Achieves 86.1% success on MiniF2F, outperforming SLM-based methods with lower sample budgets.
Conclusion: Prover Agent demonstrates effective integration of LLMs and formal tools, advancing automated theorem proving.
Abstract: We present Prover Agent, a novel AI agent for automated theorem proving that integrates large language models (LLMs) with a formal proof assistant, Lean. Prover Agent coordinates an informal reasoning LLM, a formal prover model, and feedback from Lean while also generating auxiliary lemmas to assist in discovering the overall proof strategy. It achieves an 86.1% success rate on the MiniF2F benchmark, establishing a new state-of-the-art among methods using small language models (SLMs) with a much lower sample budget than previous approaches. We also present case studies illustrating how these generated lemmas contribute to solving challenging problems.
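A sketch of the coordination loop as described in the abstract, with all callables (`informal_llm`, `formal_prover`, `lean_check`) as stand-ins rather than the released agent's API:

```python
def prover_agent(theorem, informal_llm, formal_prover, lean_check, max_rounds=8):
    """Coordinate an informal-reasoning LLM, a formal prover model, and
    Lean feedback: propose auxiliary lemmas, verify them in Lean, and
    retry the main theorem with the verified lemmas in context."""
    proved_lemmas, feedback = [], ""
    for _ in range(max_rounds):
        lemma = informal_llm(theorem, proved_lemmas, feedback)  # Lean statement
        proof = formal_prover(lemma, proved_lemmas)
        ok, feedback = lean_check(lemma, proof)                 # compile in Lean
        if ok:
            proved_lemmas.append((lemma, proof))
            final = formal_prover(theorem, proved_lemmas)
            ok, feedback = lean_check(theorem, final)
            if ok:
                return final                                    # verified proof
    return None
```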
[515] Animation Needs Attention: A Holistic Approach to Slides Animation Comprehension with Visual-Language Models
Yifan Jiang, Yibo Xue, Yukun Kang, Pin Zheng, Jian Peng, Feiran Wu, Changliang Xu
Main category: cs.AI
TL;DR: The paper introduces the first public dataset for slide-animation modeling, fine-tunes Qwen-2.5-VL-7B with LoRA, and outperforms GPT-4.1 and Gemini-2.5-Pro in metrics like BLEU-4 and ROUGE-L.
Details
Motivation: Addressing the lack of public datasets and temporal-reasoning capabilities in AI-driven slide-generation tools for animations.Method: Release a dataset of 12,000 triplets (descriptions, JSON files, videos), fine-tune Qwen-2.5-VL-7B with LoRA, and evaluate using BLEU-4, ROUGE-L, SPICE, and CODA metrics.
Result: LoRA model improves BLEU-4 by 60%, ROUGE-L by 30%, and shows significant gains in CODA-detail, demonstrating reliable temporal reasoning.
Conclusion: The dataset, LoRA model, and CODA metric provide a benchmark for future VLM-based dynamic slide generation research.
Abstract: Slide animations, such as fade-in, fly-in, and wipe, are critical for audience engagement, efficient information delivery, and vivid visual expression. However, most AI-driven slide-generation tools still lack native animation support, and existing vision-language models (VLMs) struggle with animation tasks due to the absence of public datasets and limited temporal-reasoning capabilities. To address this gap, we release the first public dataset for slide-animation modeling: 12,000 triplets of natural-language descriptions, animation JSON files, and rendered videos, collectively covering every built-in PowerPoint effect. Using this resource, we fine-tune Qwen-2.5-VL-7B with Low-Rank Adaptation (LoRA) and achieve consistent improvements over GPT-4.1 and Gemini-2.5-Pro in BLEU-4, ROUGE-L, SPICE, and our Coverage-Order-Detail Assessment (CODA) metric, which evaluates action coverage, temporal order, and detail fidelity. On a manually created test set of slides, the LoRA model increases BLEU-4 by around 60%, ROUGE-L by 30%, and shows significant improvements in CODA-detail. This demonstrates that low-rank adaptation enables reliable temporal reasoning and generalization beyond synthetic data. Overall, our dataset, LoRA-enhanced model, and CODA metric provide a rigorous benchmark and foundation for future research on VLM-based dynamic slide generation.
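For readers unfamiliar with LoRA fine-tuning, a minimal peft configuration of the kind used here might look as follows; the rank, alpha, and target modules are placeholders, since the summary does not state the paper's hyperparameters.

```python
from peft import LoraConfig, get_peft_model

# Hypothetical settings: the paper fine-tunes Qwen-2.5-VL-7B with LoRA,
# but r / lora_alpha / target_modules below are illustrative defaults.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# model = AutoModelForVision2Seq.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
# model = get_peft_model(model, lora_cfg)
# model.print_trainable_parameters()  # typically <1% of weights are trainable
```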
[516] Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?
Yun Qu, Qi Wang, Yixiu Mao, Vincent Tao Hu, Björn Ommer, Xiangyang Ji
Main category: cs.AI
TL;DR: MoPPS is a Bayesian framework for efficient prompt selection in RL-finetuned LLMs, reducing computational costs by predicting prompt difficulty without frequent LLM interactions.
Details
Motivation: High computational costs from frequent prompt evaluations and policy updates in RL-finetuned LLMs motivate the need for efficient prompt selection methods.Method: MoPPS models prompt success rates as latent variables, uses streaming Bayesian inference, and employs posterior sampling in a multi-armed bandit setup for adaptive prompt selection.
Result: MoPPS reliably predicts prompt difficulty and accelerates training with significantly fewer LLM rollouts in tasks like mathematics, planning, and geometry.
Conclusion: MoPPS offers a computationally efficient alternative to traditional evaluate-then-select methods for prompt selection in RL-finetuned LLMs.
Abstract: Recent advances have demonstrated the effectiveness of reinforcement learning (RL) finetuning in enhancing the reasoning capabilities of large language models (LLMs). The optimization process often requires numerous iterations to achieve satisfactory performance, resulting in high computational costs due to the need for frequent prompt evaluations under intensive LLM interactions and repeated policy updates. Appropriate online prompt selection methods reduce iteration steps by prioritizing informative prompts during training, while the pipeline’s reliance on exhaustive prompt evaluation and subset selection for optimization still incurs substantial computational overhead due to frequent LLM inference calls. Distinguished from these direct evaluate-then-select schemes, this work investigates iterative approximate evaluation for arbitrary prompts and introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework that online estimates prompt difficulty without requiring costly LLM interactions. Technically, MoPPS models each prompt’s success rate as a latent variable, performs streaming Bayesian inference, and employs posterior sampling in a constructed multi-armed bandit formulation, enabling sample efficient and adaptive prompt selection. Extensive experiments across mathematics, planning, and vision-based geometry tasks show that MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced LLM rollouts.
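The bandit view admits a compact sketch: keep a Beta posterior per prompt, Thompson-sample a success rate from each, and train on the prompts whose sampled rate is nearest a target difficulty. This is one plausible reading of the abstract, not the authors' code; the target of 0.5 is an assumed mid-difficulty preference.

```python
import numpy as np

class BetaBanditPromptSelector:
    """Posterior sampling over prompt difficulty: each prompt's success
    rate gets a Beta posterior updated from rollout outcomes."""

    def __init__(self, n_prompts, target=0.5):
        self.a = np.ones(n_prompts)   # successes + 1 (Beta prior)
        self.b = np.ones(n_prompts)   # failures + 1
        self.target = target          # preferred success rate

    def select(self, k, rng=None):
        rng = rng or np.random.default_rng()
        theta = rng.beta(self.a, self.b)  # one posterior sample per prompt
        # Prompts whose sampled rate is closest to the target are treated
        # as the most informative for training; pick the k best.
        return np.argsort(np.abs(theta - self.target))[:k]

    def update(self, idx, successes, trials):
        self.a[idx] += successes
        self.b[idx] += trials - successes
```

No LLM call is needed at selection time; rollout outcomes feed `update`, which is where the "streaming" inference happens.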
[517] VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains
Xuzhao Li, Xuchen Li, Shiyu Hu, Yongzhen Guo, Wentao Zhang
Main category: cs.AI
TL;DR: The paper introduces VerifyBench, a benchmark for evaluating verifiers in RLVR, highlighting trade-offs between specialized and general verifiers.
Details
Motivation: Addressing the lack of systematic evaluation of verifiers in RLVR, which hinders reliable development.Method: Constructed VerifyBench with 4,000 expert-level questions across domains, annotated rigorously, and designed a four-dimensional experimental framework.
Result: Specialized verifiers lead in accuracy but lack recall; general models are inclusive but inconsistent. Verifiers are sensitive to input structure and struggle with cross-domain generalization.
Conclusion: The study reveals critical bottlenecks in verifier technology, emphasizing the need for balanced solutions in RLVR.
Abstract: Large language models (LLMs) increasingly rely on reinforcement learning (RL) to enhance their reasoning capabilities through feedback. A critical challenge is verifying the consistency of model-generated responses and reference answers, since these responses are often lengthy, diverse, and nuanced. Rule-based verifiers struggle with complexity, prompting the use of model-based verifiers. However, specialized verifiers lack flexibility, while general LLM judges can be inconsistent. Existing research primarily focuses on building better verifiers, yet a systematic evaluation of different types of verifiers’ performance across domains remains lacking, severely constraining the reliable development of Reinforcement Learning with Verifiable Reward (RLVR). To address this, we propose VerifyBench, a comprehensive cross-domain benchmark for systematically evaluating verifiers. We construct 4,000 expert-level questions covering mathematics, physics, chemistry, and biology. Each question is equipped with reference answers and diverse responses. The reliability of the evaluation is ensured through a rigorous annotation process conducted by a multidisciplinary expert team. We design a four-dimensional experimental framework to comprehensively compare the performance boundaries of specialized verifiers and general LLMs under combined conditions of extracted answers vs. complete responses, and short vs. long outputs. Our evaluation uncovers fundamental trade-offs in verifiers: while specialized verifiers achieve leading accuracy, they exhibit deficiencies in recall; general models show stronger inclusivity but unstable precision. More importantly, we discover verifiers’ high sensitivity to input structure and inherent limitations in cross-domain generalization, providing critical insights into the bottlenecks of current verifier technology.
[518] Black Box Deployed – Functional Criteria for Artificial Moral Agents in the LLM Era
Matthew E. Brophy
Main category: cs.AI
TL;DR: The paper proposes revised ethical criteria for evaluating LLM-based artificial moral agents (AMAs), addressing the opacity of LLMs and suggesting ten functional criteria for alignment and societal integration.
Details
Motivation: Traditional ethical frameworks for AMAs are outdated due to the opaque nature of LLMs, necessitating new evaluation criteria.Method: The paper introduces ten functional criteria (e.g., moral concordance, trustworthiness) and applies them to hypothetical scenarios involving an autonomous public bus (APB).
Result: A revised set of criteria is proposed to better align LLM-based AMAs with societal and moral expectations.
Conclusion: The new criteria aim to improve the ethical evaluation and integration of LLM-based AMAs, ensuring beneficial societal outcomes.
Abstract: The advancement of powerful yet opaque large language models (LLMs) necessitates a fundamental revision of the philosophical criteria used to evaluate artificial moral agents (AMAs). Pre-LLM frameworks often relied on the assumption of transparent architectures, which LLMs defy due to their stochastic outputs and opaque internal states. This paper argues that traditional ethical criteria are pragmatically obsolete for LLMs due to this mismatch. Engaging with core themes in the philosophy of technology, this paper proffers a revised set of ten functional criteria to evaluate LLM-based artificial moral agents: moral concordance, context sensitivity, normative integrity, metaethical awareness, system resilience, trustworthiness, corrigibility, partial transparency, functional autonomy, and moral imagination. These guideposts, applied to what we term “SMA-LLS” (Simulating Moral Agency through Large Language Systems), aim to steer AMAs toward greater alignment and beneficial societal integration in the coming years. We illustrate these criteria using hypothetical scenarios involving an autonomous public bus (APB) to demonstrate their practical applicability in morally salient contexts.
[519] CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, Chris Shum
Main category: cs.AI
TL;DR: CUDA-L1 is an automated reinforcement learning framework for CUDA optimization, achieving significant speedups across GPU architectures without human expertise.
Details
Motivation: The exponential demand for GPU computing resources necessitates automated CUDA optimization strategies, as current LLMs perform poorly in this task.
Method: CUDA-L1 uses a novel contrastive RL algorithm to optimize CUDA kernels, trained on NVIDIA A100 and tested across multiple GPUs.
Result: Achieves average speedups of x3.12 on A100, with portability to other GPUs (e.g., x2.50 on RTX 3090). Peak speedups reach x120.
Conclusion: RL can effectively optimize CUDA without human input, though challenges like reward hacking must be addressed for robust training.
Abstract: The exponential growth in demand for GPU computing resources has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization that employs a novel contrastive RL algorithm. CUDA-L1 achieves significant performance improvements on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of x3.12 with a median speedup of x1.42 across all 250 CUDA kernels of KernelBench, with peak speedups reaching x120. Furthermore, the model also demonstrates portability across GPU architectures, achieving average speedups of x3.12 on L40, x2.50 on RTX 3090, x2.39 on H100, and x2.37 on H20 despite being optimized specifically for A100. The capabilities of CUDA-L1 demonstrate that RL can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially promote GPU efficiency and alleviate the rising pressure on GPU computing resources. We also identify important challenges posed by training RL models for tasks like CUDA development, where RL often learns to exploit loopholes in reward functions rather than solve the intended optimization problems. By identifying these failure modes and analyzing their root causes, we develop practical methods for creating more robust training procedures that prevent reward hacking.
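A hedged sketch of what a speedup-based reward with a basic anti-reward-hacking guard could look like; the correctness check and reward shape are assumptions for illustration, not CUDA-L1's actual reward function.

```python
def speedup_reward(baseline_time_s: float, candidate_time_s: float,
                   outputs_match: bool) -> float:
    """Reward a generated kernel by its measured speedup over the baseline."""
    if not outputs_match:
        # Guard against reward hacking: a fast but wrong kernel earns nothing.
        return 0.0
    # 1.0 means parity with the baseline; 3.12 would match the paper's average.
    return baseline_time_s / candidate_time_s
```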
[520] From Reasoning to Super-Intelligence: A Search-Theoretic Perspective
Shai Shalev-Shwartz, Amnon Shashua
Main category: cs.AI
TL;DR: The paper introduces the Diligent Learner, a new paradigm for Chain-of-Thought (CoT) reasoning, addressing limitations of existing methods like SFT, RL, ToT, and MCTS. It tackles issues like distribution drift and high inference costs, proving efficient learning under realistic assumptions.
Details
Motivation: Existing methods for CoT reasoning (e.g., SFT, RL, ToT, MCTS) struggle with complex tasks due to distribution drift, lack of embedded search, and high inference costs. The paper aims to overcome these obstacles.
Method: The Diligent Learner models reasoning as depth-first search guided by a validator, supporting backtracking. It is designed to learn efficiently from CoT data under two mild assumptions.
Result: The framework proves capable of learning from CoT data where other methods fail, offering scalable and reliable reasoning systems.
Conclusion: The Diligent Learner provides a foundation for Large Reasoning Models (LRMs) with robust, interpretable problem-solving abilities, advancing CoT reasoning.
Abstract: Chain-of-Thought (CoT) reasoning has emerged as a powerful tool for enhancing the problem-solving capabilities of large language models (LLMs). However, the theoretical foundations of learning from CoT data remain underdeveloped, and existing approaches – such as Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), Tree-of-Thoughts (ToT), and Monte Carlo Tree Search (MCTS) – often fail on complex reasoning tasks. In this work, we identify core obstacles that hinder effective CoT learning, including distribution drift, lack of embedded search, and exponential inference costs. We introduce the Diligent Learner, a new learning paradigm that explicitly models reasoning as a depth-first search guided by a validator and supports backtracking upon failure. Under two mild and realistic assumptions, we prove that the Diligent Learner can efficiently learn from CoT data while existing methods fail to do so. This framework offers a path toward building scalable and reliable reasoning systems trained on naturally occurring, incomplete data – paving the way for the development of Large Reasoning Models (LRMs) with robust, interpretable problem-solving abilities.
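The validator-guided depth-first search with backtracking could be sketched as below; `propose_steps`, `validator`, and `is_goal` are placeholder interfaces assumed for illustration, not the paper's formal construction.

```python
def diligent_search(state, propose_steps, validator, is_goal,
                    depth=0, max_depth=20):
    """Depth-first reasoning: expand validated steps, backtrack on failure."""
    if is_goal(state):
        return [state]
    if depth >= max_depth:
        return None  # depth bound hit: signal the caller to backtrack
    for step in propose_steps(state):  # candidates in model-preference order
        if not validator(state, step):
            continue  # prune steps the validator rejects
        path = diligent_search(step, propose_steps, validator, is_goal,
                               depth + 1, max_depth)
        if path is not None:
            return [state] + path  # first fully validated path wins
    return None  # every child failed: backtrack one level up
```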
[521] Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report
Shanghai AI Lab, Xiaoyang Chen, Yunhao Chen, Zeren Chen, Zhiyun Chen, Hanyun Cui, Yawen Duan, Jiaxuan Guo, Qi Guo, Xuhao Hu, Hong Huang, Lige Huang, Chunxiao Li, Juncheng Li, Qihao Lin, Dongrui Liu, Xinmin Liu, Zicheng Liu, Chaochao Lu, Xiaoya Lu, Jingjing Qu, Qibing Ren, Jing Shao, Jingwei Shi, Jingwei Sun, Peng Wang, Weibing Wang, Jia Xu, Lewen Yan, Xiao Yu, Yi Yu, Boxuan Zhang, Jie Zhang, Weichen Zhang, Zhijie Zheng, Tianyi Zhou, Bowen Zhou
Main category: cs.AI
TL;DR: The paper assesses frontier risks of advanced AI models using the E-T-C analysis and AI-$45^\circ$ Law, categorizing risks into green, yellow, and red zones. Most models fall into green or yellow zones, with none crossing red lines.
Details
Motivation: To identify and mitigate unprecedented risks posed by rapidly advancing AI models, ensuring safe deployment and development.
Method: Uses the E-T-C analysis (deployment environment, threat source, enabling capability) and AI-$45^\circ$ Law to evaluate risks with red and yellow lines, defining risk zones.
Result: Most AI models are in green or yellow zones; none cross red lines. Specific risks like cyber offense and uncontrolled AI R&D remain in green, while persuasion and manipulation often fall into yellow.
Conclusion: The study highlights current AI frontier risks and calls for collective action to address them, emphasizing the need for ongoing monitoring and mitigation.
Abstract: To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, this report presents a comprehensive assessment of their frontier risks. Drawing on the E-T-C analysis (deployment environment, threat source, enabling capability) from the Frontier AI Risk Management Framework (v1.0) (SafeWork-F1-Framework), we identify critical risks in seven areas: cyber offense, biological and chemical risks, persuasion and manipulation, uncontrolled autonomous AI R&D, strategic deception and scheming, self-replication, and collusion. Guided by the “AI-$45^\circ$ Law,” we evaluate these risks using “red lines” (intolerable thresholds) and “yellow lines” (early warning indicators) to define risk zones: green (manageable risk for routine deployment and continuous monitoring), yellow (requiring strengthened mitigations and controlled deployment), and red (necessitating suspension of development and/or deployment). Experimental results show that all recent frontier AI models reside in green and yellow zones, without crossing red lines. Specifically, no evaluated models cross the yellow line for cyber offense or uncontrolled AI R&D risks. For self-replication, and strategic deception and scheming, most models remain in the green zone, except for certain reasoning models in the yellow zone. In persuasion and manipulation, most models are in the yellow zone due to their effective influence on humans. For biological and chemical risks, we are unable to rule out the possibility of most models residing in the yellow zone, although detailed threat modeling and in-depth assessment are required to make further claims. This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.
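The zone logic reduces to comparing a per-risk score against the two thresholds. The function below is a toy rendering of that traffic-light scheme; the threshold values are left as inputs, since this summary does not reproduce the framework's numeric lines.

```python
def risk_zone(score: float, yellow_line: float, red_line: float) -> str:
    """Map a per-risk capability score to the report's traffic-light zones."""
    if score >= red_line:
        return "red"     # suspend development and/or deployment
    if score >= yellow_line:
        return "yellow"  # strengthened mitigations, controlled deployment
    return "green"       # routine deployment with continuous monitoring
```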
[522] Compliance Brain Assistant: Conversational Agentic AI for Assisting Compliance Tasks in Enterprise Environments
Shitong Zhu, Chenhao Fang, Derek Larson, Neel Reddy Pochareddy, Rajeev Rao, Sophie Zeng, Yanqing Peng, Wendy Summer, Alex Goncalves, Arya Pudota, Hervé Robert
Main category: cs.AI
TL;DR: CBA is a conversational AI assistant for compliance tasks, using a query router to balance speed and quality, outperforming vanilla LLMs in accuracy and efficiency.
Details
Motivation: To enhance efficiency in daily compliance tasks by intelligently routing queries between simple and complex handling modes.
Method: Designed a query router with FastTrack (simple requests) and FullAgentic (complex requests) modes, leveraging context retrieval and tool invocations.
Result: CBA outperformed vanilla LLMs in keyword match rate (83.7% vs. 41.7%) and LLM-judge pass rate (82.0% vs. 20.0%), validating the routing mechanism.
Conclusion: The routing-based design effectively balances response quality and latency, improving compliance task efficiency.
Abstract: This paper presents Compliance Brain Assistant (CBA), a conversational, agentic AI assistant designed to boost the efficiency of daily compliance tasks for personnel in enterprise environments. To strike a good balance between response quality and latency, we design a user query router that can intelligently choose between (i) FastTrack mode: to handle simple requests that only need additional relevant context retrieved from knowledge corpora; and (ii) FullAgentic mode: to handle complicated requests that need composite actions and tool invocations to proactively discover context across various compliance artifacts, and/or involving other APIs/models for accommodating requests. A typical example would be to start with a user query, use its description to find a specific entity and then use the entity’s information to query other APIs for curating and enriching the final AI response. Our experimental evaluations compared CBA against an out-of-the-box LLM on various real-world privacy/compliance-related queries targeting various personas. We found that CBA substantially improved upon the vanilla LLM’s performance on metrics such as average keyword match rate (83.7% vs. 41.7%) and LLM-judge pass rate (82.0% vs. 20.0%). We also compared metrics for the full routing-based design against the fast-track-only and full-agentic modes and found that it had a better average match-rate and pass-rate while keeping the run-time approximately the same. This finding validated our hypothesis that the routing mechanism leads to a good trade-off between the two worlds.
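A minimal sketch of the two-mode routing pattern; `classify_complexity` and the other callables are hypothetical stand-ins, since the paper does not expose its internal APIs.

```python
def handle_query(query, classify_complexity, retrieve, run_agent, answer_llm):
    """Route a user query between FastTrack and FullAgentic handling."""
    if classify_complexity(query) == "simple":
        # FastTrack: retrieval-augmented answer, no tool orchestration
        context = retrieve(query)
        return answer_llm(query, context)
    # FullAgentic: composite actions and tool invocations that proactively
    # discover context across compliance artifacts and external APIs/models
    return run_agent(query)
```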
[523] CQE under Epistemic Dependencies: Algorithms and Experiments (extended version)
Lorenzo Marconi, Flavia Ricci, Riccardo Rosati
Main category: cs.AI
TL;DR: The paper explores Controlled Query Evaluation (CQE) over ontologies using epistemic dependencies (EDs) and optimal GA censors. It ensures security for Boolean unions of conjunctive queries (BUCQs) and identifies a safe class of EDs. A first-order rewriting algorithm is presented for efficient query answering, with practical feasibility demonstrated through experiments.
Details
Motivation: To regulate information disclosure in ontologies using epistemic dependencies and ensure secure query evaluation while maintaining computational efficiency.
Method: Combines epistemic dependencies (EDs) with optimal GA censors, focusing on answering BUCQs via the intersection of censors. Introduces a first-order rewriting algorithm for efficient query evaluation.
Result: Identifies a safe class of EDs (full EDs) and shows that answering BUCQs is in AC^0 in data complexity for DL-Lite_R ontologies. Experiments confirm practical feasibility.
Conclusion: The intersection-based CQE approach with EDs provides strong security guarantees and computational efficiency, validated by experimental results.
Abstract: We investigate Controlled Query Evaluation (CQE) over ontologies, where information disclosure is regulated by epistemic dependencies (EDs), a family of logical rules recently proposed for the CQE framework. In particular, we combine EDs with the notion of optimal GA censors, i.e. maximal sets of ground atoms that are entailed by the ontology and can be safely revealed. We focus on answering Boolean unions of conjunctive queries (BUCQs) with respect to the intersection of all optimal GA censors - an approach that has been shown in other contexts to ensure strong security guarantees with favorable computational behavior. First, we characterize the security of this intersection-based approach and identify a class of EDs (namely, full EDs) for which it remains safe. Then, for a subclass of EDs and for DL-Lite_R ontologies, we show that answering BUCQs in the above CQE semantics is in AC^0 in data complexity by presenting a suitable, detailed first-order rewriting algorithm. Finally, we report on experiments conducted in two different evaluation scenarios, showing the practical feasibility of our rewriting function.
[524] SafeWork-R1: Coevolving Safety and Intelligence under the AI-45$^{\circ}$ Law
Shanghai AI Lab, Yicheng Bao, Guanxu Chen, Mingkang Chen, Yunhao Chen, Chiyu Chen, Lingjie Chen, Sirui Chen, Xinquan Chen, Jie Cheng, Yu Cheng, Dengke Deng, Yizhuo Ding, Dan Ding, Xiaoshan Ding, Yi Ding, Zhichen Dong, Lingxiao Du, Yuyu Fan, Xinshun Feng, Yanwei Fu, Yuxuan Gao, Ruijun Ge, Tianle Gu, Lujun Gui, Jiaxuan Guo, Qianxi He, Yuenan Hou, Xuhao Hu, Hong Huang, Kaichen Huang, Shiyang Huang, Yuxian Jiang, Shanzhe Lei, Jie Li, Lijun Li, Hao Li, Juncheng Li, Xiangtian Li, Yafu Li, Lingyu Li, Xueyan Li, Haotian Liang, Dongrui Liu, Qihua Liu, Zhixuan Liu, Bangwei Liu, Huacan Liu, Yuexiao Liu, Zongkai Liu, Chaochao Lu, Yudong Lu, Xiaoya Lu, Zhenghao Lu, Qitan Lv, Caoyuan Ma, Jiachen Ma, Xiaoya Ma, Zhongtian Ma, Lingyu Meng, Ziqi Miao, Yazhe Niu, Yuezhang Peng, Yuan Pu, Han Qi, Chen Qian, Xingge Qiao, Jingjing Qu, Jiashu Qu, Wanying Qu, Wenwen Qu, Xiaoye Qu, Qihan Ren, Qingnan Ren, Qingyu Ren, Jing Shao, Wenqi Shao, Shuai Shao, Dongxing Shi, Xin Song, Xinhao Song, Yan Teng, Xuan Tong, Yingchun Wang, Xuhong Wang, Shujie Wang, Xin Wang, Yige Wang, Yixu Wang, Yuanfu Wang, Futing Wang, Ruofan Wang, Wenjie Wang, Yajie Wang, Muhao Wei, Xiaoyu Wen, Fenghua Weng, Yuqi Wu, Yingtong Xiong, Xingcheng Xu, Chao Yang, Yue Yang, Yang Yao, Yulei Ye, Zhenyun Yin, Yi Yu, Bo Zhang, Qiaosheng Zhang, Jinxuan Zhang, Yexin Zhang, Yinqiang Zheng, Hefeng Zhou, Zhanhui Zhou, Pengyu Zhu, Qingzi Zhu, Yubo Zhu, Bowen Zhou
Main category: cs.AI
TL;DR: SafeWork-R1 is a multimodal reasoning model developed using the SafeLadder framework, achieving significant safety improvements without compromising general capabilities.
Details
Motivation: To create a model that co-evolves safety and capabilities, addressing limitations of previous alignment methods like RLHF.
Method: Uses the SafeLadder framework with progressive, safety-oriented reinforcement learning and multi-principled verifiers. Includes inference-time interventions and deliberative search.
Result: 46.54% improvement over base model on safety benchmarks, outperforming GPT-4.1 and Claude Opus 4.
Conclusion: Safety and capabilities can co-evolve synergistically, demonstrating the generalizability of the SafeLadder framework.
Abstract: We introduce SafeWork-R1, a cutting-edge multimodal reasoning model that demonstrates the coevolution of capabilities and safety. It is developed by our proposed SafeLadder framework, which incorporates large-scale, progressive, safety-oriented reinforcement learning post-training, supported by a suite of multi-principled verifiers. Unlike previous alignment methods such as RLHF that simply learn human preferences, SafeLadder enables SafeWork-R1 to develop intrinsic safety reasoning and self-reflection abilities, giving rise to safety “aha” moments. Notably, SafeWork-R1 achieves an average improvement of 46.54% over its base model Qwen2.5-VL-72B on safety-related benchmarks without compromising general capabilities, and delivers state-of-the-art safety performance compared to leading proprietary models such as GPT-4.1 and Claude Opus 4. To further bolster its reliability, we implement two distinct inference-time intervention methods and a deliberative search mechanism, enforcing step-level verification. Finally, we further develop SafeWork-R1-InternVL3-78B, SafeWork-R1-DeepSeek-70B, and SafeWork-R1-Qwen2.5VL-7B. All resulting models demonstrate that safety and capability can co-evolve synergistically, highlighting the generalizability of our framework in building robust, reliable, and trustworthy general-purpose AI.
cs.SD
[525] Joint Feature and Output Distillation for Low-complexity Acoustic Scene Classification
Haowen Li, Ziyi Yang, Mou Wang, Ee-Leng Tan, Junwei Yeow, Santi Peksi, Woon-Seng Gan
Main category: cs.SD
TL;DR: A dual-level knowledge distillation framework for low-complexity acoustic scene classification, using multi-teacher guidance to transfer soft logits and feature representations, achieving 59.30% accuracy.
Details
Motivation: To improve low-complexity acoustic scene classification by leveraging knowledge from multiple pre-trained teacher models.
Method: Joint distillation of soft logits and intermediate features from PaSST and CP-ResNet teachers to a compact CP-Mobile student model.
Result: Achieved 59.30% accuracy on the TAU Urban Acoustic Scenes 2022 Mobile dataset.
Conclusion: The framework effectively transfers knowledge from teachers to a compact student model, enhancing ASC performance.
Abstract: This report presents a dual-level knowledge distillation framework with multi-teacher guidance for low-complexity acoustic scene classification (ASC) in DCASE2025 Task 1. We propose a distillation strategy that jointly transfers both soft logits and intermediate feature representations. Specifically, we pre-trained PaSST and CP-ResNet models as teacher models. Logits from teachers are averaged to generate soft targets, while one CP-ResNet is selected for feature-level distillation. This enables the compact student model (CP-Mobile) to capture both semantic distribution and structural information from teacher guidance. Experiments on the TAU Urban Acoustic Scenes 2022 Mobile dataset (development set) demonstrate that our submitted systems achieve up to 59.30% accuracy.
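One plausible form of the dual-level objective, assuming teacher logits and features are precomputed; the temperature and loss weights are illustrative choices, and a projection layer would be needed if student and teacher feature dimensions differ.

```python
import torch
import torch.nn.functional as F

def dual_level_kd_loss(student_logits, student_feat,
                       teacher_logits_list, teacher_feat, labels,
                       T=2.0, w_logit=0.5, w_feat=0.5):
    # Output-level: soft targets from the averaged teacher logits
    avg_teacher = torch.stack(teacher_logits_list).mean(dim=0)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(avg_teacher / T, dim=-1),
                    reduction="batchmean") * T * T
    # Feature-level: match intermediate representations of one chosen teacher
    feat = F.mse_loss(student_feat, teacher_feat)
    # Hard-label term keeps the student anchored to the ground truth
    hard = F.cross_entropy(student_logits, labels)
    return hard + w_logit * soft + w_feat * feat
```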
[526] SonicGauss: Position-Aware Physical Sound Synthesis for 3D Gaussian Representations
Chunshi Wang, Hongxing Li, Yawei Luo
Main category: cs.SD
TL;DR: SonicGauss synthesizes impact sounds from 3D Gaussian representations using a diffusion-based model and PointTransformer, achieving realistic, position-aware sound feedback.
Details
Motivation: To explore the untapped potential of 3D Gaussian representations (3DGS) for capturing physical attributes like sound, bridging visual and auditory synthesis.
Method: Integrates a diffusion-based sound synthesis model with a PointTransformer-based feature extractor to infer material properties and spatial-acoustic correlations from Gaussian ellipsoids.
Result: Produces realistic, position-aware auditory feedback, generalizing across object categories, as validated on ObjectFolder and real-world recordings.
Conclusion: SonicGauss robustly bridges 3D visual representations and interactive sound synthesis, demonstrating strong generalization.
Abstract: While 3D Gaussian representations (3DGS) have proven effective for modeling the geometry and appearance of objects, their potential for capturing other physical attributes-such as sound-remains largely unexplored. In this paper, we present a novel framework dubbed SonicGauss for synthesizing impact sounds from 3DGS representations by leveraging their inherent geometric and material properties. Specifically, we integrate a diffusion-based sound synthesis model with a PointTransformer-based feature extractor to infer material characteristics and spatial-acoustic correlations directly from Gaussian ellipsoids. Our approach supports spatially varying sound responses conditioned on impact locations and generalizes across a wide range of object categories. Experiments on the ObjectFolder dataset and real-world recordings demonstrate that our method produces realistic, position-aware auditory feedback. The results highlight the framework’s robustness and generalization ability, offering a promising step toward bridging 3D visual representations and interactive sound synthesis. Project page: https://chunshi.wang/SonicGauss
[527] Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion
Hei Shing Cheung, Boya Zhang
Main category: cs.SD
TL;DR: A lightweight latent diffusion model for vocal-conditioned music generation with a novel soft alignment attention mechanism, achieving significant parameter reduction and faster inference.
Details
Motivation: Address limitations in existing music AI systems by enabling efficient capture of multi-scale musical structure and reducing computational demands.
Method: Uses a soft alignment attention mechanism in a latent diffusion model operating in a pre-trained VAE’s compressed space, reducing parameters and speeding up inference.
Result: Achieves 220x parameter reduction and 52x faster inference, outperforming OpenAI Jukebox in quality and coherence with only 15M parameters.
Conclusion: The model’s lightweight design enables real-time deployment on consumer hardware, making AI music creation accessible for interactive and resource-constrained applications.
Abstract: We present a lightweight latent diffusion model for vocal-conditioned musical accompaniment generation that addresses critical limitations in existing music AI systems. Our approach introduces a novel soft alignment attention mechanism that adaptively combines local and global temporal dependencies based on diffusion timesteps, enabling efficient capture of multi-scale musical structure. Operating in the compressed latent space of a pre-trained variational autoencoder, the model achieves a 220 times parameter reduction compared to state-of-the-art systems while delivering 52 times faster inference. Experimental evaluation demonstrates competitive performance with only 15M parameters, outperforming OpenAI Jukebox in production quality and content unity while maintaining reasonable musical coherence. The ultra-lightweight architecture enables real-time deployment on consumer hardware, making AI-assisted music creation accessible for interactive applications and resource-constrained environments.
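One way the timestep-dependent mixing of local and global attention could look; the sigmoid schedule below is purely an assumption for illustration, not the paper's actual soft alignment mechanism.

```python
import torch

def soft_alignment_mix(local_out, global_out, t, t_max, sharpness=10.0):
    """Blend local- and global-attention outputs by diffusion timestep."""
    # Assumed convention: large t = high noise. Noisy steps weight global
    # structure; late, low-noise steps weight local detail.
    alpha = torch.sigmoid(torch.tensor(sharpness * (t / t_max - 0.5)))
    return alpha * global_out + (1.0 - alpha) * local_out
```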
[528] Improving Audio Classification by Transitioning from Zero- to Few-Shot
James Taylor, Wolfgang Mack
Main category: cs.SD
TL;DR: Few-shot methods improve audio classification accuracy over zero-shot approaches by grouping audio embeddings and replacing noisy text embeddings.
Details
Motivation: Zero-shot audio classification struggles with noisy text embeddings, especially for diverse sound classes.
Method: Group audio embeddings by class and process them to replace noisy text embeddings.
Result: Few-shot classification generally outperforms the zero-shot baseline.
Conclusion: Few-shot methods are more effective for audio classification than zero-shot approaches.
Abstract: State-of-the-art audio classification often employs a zero-shot approach, which involves comparing audio embeddings with embeddings from text describing the respective audio class. These embeddings are usually generated by neural networks trained through contrastive learning to align audio and text representations. Identifying the optimal text description for an audio class is challenging, particularly when the class comprises a wide variety of sounds. This paper examines few-shot methods designed to improve classification accuracy beyond the zero-shot approach. Specifically, audio embeddings are grouped by class and processed to replace the inherently noisy text embeddings. Our results demonstrate that few-shot classification typically outperforms the zero-shot baseline.
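A small sketch of the core idea: build one prototype per class from the few labeled audio embeddings and classify by cosine similarity, replacing the noisy text embeddings. Variable names and the simple mean-pooling choice are illustrative.

```python
import numpy as np

def class_prototypes(embeddings, labels, n_classes):
    """Mean audio embedding per class from the few shots, L2-normalized."""
    protos = np.stack([embeddings[labels == c].mean(axis=0)
                       for c in range(n_classes)])
    return protos / np.linalg.norm(protos, axis=1, keepdims=True)

def classify(query_emb, protos):
    # Cosine similarity reduces to a dot product for unit-norm vectors
    query_emb = query_emb / np.linalg.norm(query_emb)
    return int(np.argmax(protos @ query_emb))
```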
[529] Improving Deep Learning-based Respiratory Sound Analysis with Frequency Selection and Attention Mechanism
Nouhaila Fraihi, Ouassim Karrakchou, Mounir Ghogho
Main category: cs.SD
TL;DR: The paper proposes a CNN-Temporal Self-Attention (CNN-TSA) network with a Frequency Band Selection (FBS) module for efficient respiratory sound classification, achieving state-of-the-art performance with reduced computational costs.
Details
Motivation: Accurate respiratory sound classification requires models that capture fine-grained acoustic features and long-range dependencies, but existing methods like CNNs and transformers have limitations in efficiency or global context modeling.
Method: The authors integrate lightweight self-attention into a CNN backbone and introduce an FBS module to suppress noisy frequencies, along with age-specific models for robustness.
Result: CNN-TSA with FBS achieves new benchmarks on SPRSound-2022/2023 and state-of-the-art performance on ICBHI-2017, reducing FLOPs by up to 50%. FBS also enhances transformer baselines.
Conclusion: The framework enables reliable, real-time respiratory sound analysis, suitable for resource-constrained settings.
Abstract: Accurate classification of respiratory sounds requires deep learning models that effectively capture fine-grained acoustic features and long-range temporal dependencies. Convolutional Neural Networks (CNNs) are well-suited for extracting local time-frequency patterns but are limited in modeling global context. In contrast, transformer-based models can capture long-range dependencies, albeit with higher computational demands. To address these limitations, we propose a compact CNN-Temporal Self-Attention (CNN-TSA) network that integrates lightweight self-attention into an efficient CNN backbone. Central to our approach is a Frequency Band Selection (FBS) module that suppresses noisy and non-informative frequency regions, substantially improving accuracy and reducing FLOPs by up to 50%. We also introduce age-specific models to enhance robustness across diverse patient groups. Evaluated on the SPRSound-2022/2023 and ICBHI-2017 lung sound datasets, CNN-TSA with FBS sets new benchmarks on SPRSound and achieves state-of-the-art performance on ICBHI, all with a significantly smaller computational footprint. Furthermore, integrating FBS into an existing transformer baseline yields a new record on ICBHI, confirming FBS as an effective drop-in enhancement. These results demonstrate that our framework enables reliable, real-time respiratory sound analysis suitable for deployment in resource-constrained settings.
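A toy rendering of the band-selection idea: mask uninformative mel bands before the CNN. The real FBS module presumably learns which bands to keep; the fixed index list here is a placeholder.

```python
import torch

def apply_fbs(mel_spec, keep_bands):
    """Zero out mel bands outside `keep_bands`.

    mel_spec: (batch, n_mels, time); keep_bands: indices of informative bands.
    """
    mask = torch.zeros(mel_spec.size(1), device=mel_spec.device)
    mask[keep_bands] = 1.0
    # Broadcasting the (n_mels,) mask suppresses noisy frequency regions,
    # so downstream convolutions spend no capacity on useless spectrum.
    return mel_spec * mask.view(1, -1, 1)
```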
[530] Diffusion-based Symbolic Music Generation with Structured State Space Models
Shenghua Yuan, Xing Tang, Jiatao Chen, Tianming Xie, Jing Wang, Bing Shi
Main category: cs.SD
TL;DR: SMDIM is a new diffusion-based model for symbolic music generation, combining SSMs and MFA Blocks for efficient long-sequence modeling, outperforming existing methods in quality and efficiency.
Details
Motivation: Transformer-based models for symbolic music generation face scalability issues due to quadratic complexity, limiting their use for long sequences.
Method: SMDIM integrates Structured State Space Models (SSMs) and Mamba-FeedForward-Attention Blocks (MFA) to achieve linear complexity while preserving musical details.
Result: SMDIM shows superior performance in generation quality and computational efficiency, especially on the FolkDB dataset.
Conclusion: SMDIM offers a scalable, efficient solution for long-sequence tasks, with potential applications beyond symbolic music generation.
Abstract: Recent advancements in diffusion models have significantly improved symbolic music generation. However, most approaches rely on transformer-based architectures with self-attention mechanisms, which are constrained by quadratic computational complexity, limiting scalability for long sequences. To address this, we propose Symbolic Music Diffusion with Mamba (SMDIM), a novel diffusion-based architecture integrating Structured State Space Models (SSMs) for efficient global context modeling and the Mamba-FeedForward-Attention Block (MFA) for precise local detail preservation. The MFA Block combines the linear complexity of Mamba layers, the non-linear refinement of FeedForward layers, and the fine-grained precision of self-attention mechanisms, achieving a balance between scalability and musical expressiveness. SMDIM achieves near-linear complexity, making it highly efficient for long-sequence tasks. Evaluated on diverse datasets, including FolkDB, a collection of traditional Chinese folk music that represents an underexplored domain in symbolic music generation, SMDIM outperforms state-of-the-art models in both generation quality and computational efficiency. Beyond symbolic music, SMDIM’s architectural design demonstrates adaptability to a broad range of long-sequence generation tasks, offering a scalable and efficient solution for coherent sequence modeling.
[531] Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech
Taesoo Kim, Jinju Kim, Dongchan Kim, Jong Hwan Ko, Gyeong-Moon Park
Main category: cs.SD
TL;DR: The paper introduces a machine unlearning framework for Zero-Shot Text-to-Speech (ZS-TTS) to remove unwanted speaker identities while preserving performance for others, proposing Teacher-Guided Unlearning (TGU) and a new metric, spk-ZRF.
Details
Motivation: Addressing privacy and ethical concerns in ZS-TTS by enabling selective removal of speaker identities from pre-trained models.
Method: Proposes Teacher-Guided Unlearning (TGU) with randomness to prevent replication of forgotten voices and introduces spk-ZRF for evaluation.
Result: TGU effectively prevents replication of forgotten speakers’ voices while maintaining high-quality speech for others, validated on a state-of-the-art model.
Conclusion: The framework successfully addresses speaker identity unlearning in ZS-TTS, balancing privacy and performance.
Abstract: The rapid advancement of Zero-Shot Text-to-Speech (ZS-TTS) technology has enabled high-fidelity voice synthesis from minimal audio cues, raising significant privacy and ethical concerns. Despite the threats to voice privacy, research to selectively remove the knowledge to replicate unwanted individual voices from pre-trained model parameters has not been explored. In this paper, we address the new challenge of speaker identity unlearning for ZS-TTS systems. To meet this goal, we propose the first machine unlearning frameworks for ZS-TTS, especially Teacher-Guided Unlearning (TGU), designed to ensure the model forgets designated speaker identities while retaining its ability to generate accurate speech for other speakers. Our proposed methods incorporate randomness to prevent consistent replication of forget speakers’ voices, assuring unlearned identities remain untraceable. Additionally, we propose a new evaluation metric, speaker-Zero Retrain Forgetting (spk-ZRF). This assesses the model’s ability to disregard prompts associated with forgotten speakers, effectively neutralizing its knowledge of these voices. The experiments conducted on the state-of-the-art model demonstrate that TGU prevents the model from replicating forget speakers’ voices while maintaining high quality for other speakers. The demo is available at https://speechunlearn.github.io/
[532] Self-Improvement for Audio Large Language Model using Unlabeled Speech
Shaowen Wang, Xinyuan Chen, Yao Xu
Main category: cs.SD
TL;DR: SI-SDA improves audio LLMs in target domains without labeled data using self-improvement and reinforcement learning, outperforming baselines in ASR, SQA, and S2TT.
Details
Motivation: Audio LLMs degrade in specific domains due to speech complexity; the goal is to enhance performance without labeled data.
Method: Proposes SI-SDA, a self-improvement method using pseudo labels and reinforcement learning for domain adaptation.
Result: Significant performance improvements in WER and BLEU across ASR, SQA, and S2TT datasets, with high data efficiency.
Conclusion: SI-SDA effectively enhances audio LLMs in target domains, showing promise for real-world applications.
Abstract: Recent audio LLMs have emerged rapidly, demonstrating strong generalization across various speech tasks. However, given the inherent complexity of speech signals, these models inevitably suffer from performance degradation in specific target domains. To address this, we focus on enhancing audio LLMs in target domains without any labeled data. We propose a self-improvement method called SI-SDA, leveraging the information embedded in large-model decoding to evaluate the quality of generated pseudo labels and then perform domain adaptation based on reinforcement learning optimization. Experimental results show that our method consistently and significantly improves audio LLM performance, outperforming existing baselines in WER and BLEU across multiple public datasets of automatic speech recognition (ASR), spoken question-answering (SQA), and speech-to-text translation (S2TT). Furthermore, our approach exhibits high data efficiency, underscoring its potential for real-world deployment.
[533] Two Views, One Truth: Spectral and Self-Supervised Features Fusion for Robust Speech Deepfake Detection
Yassine El Kheir, Arnab Das, Enes Erdem Erdogan, Fabian Ritter-Guttierez, Tim Polzehl, Sebastian Möller
Main category: cs.SD
TL;DR: Hybrid fusion frameworks combining self-supervised learning (SSL) and spectral descriptors improve audio deepfake detection, outperforming single-modality methods.
Details
Motivation: Address vulnerabilities of single-modality detection methods to disturbances and poor generalization by integrating complementary features.
Method: Investigate fusion strategies (concatenation, cross-attention, mutual cross-attention, learnable gating) to blend SSL and spectral descriptors (MFCC, LFCC, CQCC).
Result: Fusion variants outperform SSL-only baseline; cross-attention achieves 38% relative EER reduction.
Conclusion: Joint modeling of waveform and spectral views enhances robustness and generalization in audio deepfake detection.
Abstract: Recent advances in synthetic speech have made audio deepfakes increasingly realistic, posing significant security risks. Existing detection methods that rely on a single modality, either raw waveform embeddings or spectral-based features, are vulnerable to non-spoof disturbances and often overfit to known forgery algorithms, resulting in poor generalization to unseen attacks. To address these shortcomings, we investigate hybrid fusion frameworks that integrate self-supervised learning (SSL) based representations with handcrafted spectral descriptors (MFCC, LFCC, CQCC). By aligning and combining complementary information across modalities, these fusion approaches capture subtle artifacts that single-feature approaches typically overlook. We explore several fusion strategies, including simple concatenation, cross-attention, mutual cross-attention, and a learnable gating mechanism, to optimally blend SSL features with fine-grained spectral cues. We evaluate our approach on four challenging public benchmarks and report generalization performance. All fusion variants consistently outperform an SSL-only baseline, with the cross-attention strategy achieving the best generalization with a 38% relative reduction in equal error rate (EER). These results confirm that joint modeling of waveform and spectral views produces robust, domain-agnostic representations for audio deepfake detection.
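A compact sketch of the best-performing strategy, cross-attention, with SSL features as queries and spectral descriptors as keys/values; all dimensions are illustrative guesses rather than the paper's configuration.

```python
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """SSL queries attend over handcrafted spectral cues (e.g., MFCCs)."""

    def __init__(self, d_ssl=768, d_spec=60, d_model=256, n_heads=4):
        super().__init__()
        self.q_proj = nn.Linear(d_ssl, d_model)
        self.kv_proj = nn.Linear(d_spec, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, ssl_feats, spec_feats):
        # ssl_feats: (B, T1, d_ssl); spec_feats: (B, T2, d_spec)
        q = self.q_proj(ssl_feats)
        kv = self.kv_proj(spec_feats)
        fused, _ = self.attn(q, kv, kv)  # inject spectral detail into SSL stream
        return fused
```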
[534] Sound Safeguarding for Acoustic Measurement Using Any Sounds: Tools and Applications
Hideki Kawahara, Kohei Yatabe, Ken-Ichi Sakakibara
Main category: cs.SD
TL;DR: Tools for acoustic measurements using “sound safeguarding” were developed, including preparation, real-time measurement, and reporting tools, and made open-source.
Details
Motivation: To enable any sound to be used for acoustic measurements and improve acoustic environments.
Method: Developed tools based on “sound safeguarding,” iteratively refined through practical applications.
Result: Open-sourced tools for preparation, interactive measurement, and report generation.
Conclusion: Encourages users to adopt these tools for enhancing acoustic environments.
Abstract: We demonstrate tools and applications developed based on the method of “sound safeguarding,” which enables any sound to be used for acoustic measurements. We developed tools for preparation, interactive and real-time measurement, and report generation. We extended and modified the method during its development based on its application in various practical situations. We have open-sourced these tools and encourage prospective users to use them to improve their acoustic environments.
[535] Hyperbolic Embeddings for Order-Aware Classification of Audio Effect Chains
Aogu Wada, Tomohiko Nakamura, Hiroshi Saruwatari
Main category: cs.SD
TL;DR: The paper proposes a neural-network-based method for recognizing audio effect (AFX) chains by embedding wet signals in hyperbolic space, outperforming Euclidean methods.
Details
Motivation: Existing studies focus on AFX type and parameter estimation, neglecting the critical role of effect order in shaping sound. This work addresses the gap by jointly estimating AFX types and their order.
Method: A neural network embeds wet signals into hyperbolic space, leveraging its tree-structured representation efficiency to classify AFX chains.
Result: Experiments on guitar sounds show the hyperbolic method outperforms Euclidean approaches, especially in capturing AFX order and handling chain length.
Conclusion: Hyperbolic space is effective for modeling AFX chains due to its ability to represent non-commutative and exponentially growing ordered combinations.
Abstract: Audio effects (AFXs) are essential tools in music production, frequently applied in chains to shape timbre and dynamics. The order of AFXs in a chain plays a crucial role in determining the final sound, particularly when non-linear (e.g., distortion) or time-variant (e.g., chorus) processors are involved. Despite its importance, most AFX-related studies have primarily focused on estimating effect types and their parameters from a wet signal. To address this gap, we formulate AFX chain recognition as the task of jointly estimating AFX types and their order from a wet signal. We propose a neural-network-based method that embeds wet signals into a hyperbolic space and classifies their AFX chains. Hyperbolic space can represent tree-structured data more efficiently than Euclidean space due to its exponential expansion property. Since AFX chains can be represented as trees, with AFXs as nodes and edges encoding effect order, hyperbolic space is well-suited for modeling the exponentially growing and non-commutative nature of ordered AFX combinations, where changes in effect order can result in different final sounds. Experiments using guitar sounds demonstrate that, with an appropriate curvature, the proposed method outperforms its Euclidean counterpart. Further analysis based on AFX type and chain length highlights the effectiveness of the proposed method in capturing AFX order.
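For reference, the standard geodesic distance in the Poincaré ball, which is what lets tree-like AFX chains embed with low distortion; the classifier built on top of it is omitted here.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between points inside the unit Poincare ball."""
    diff2 = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    # Standard closed form: d(u,v) = arccosh(1 + 2|u-v|^2 / ((1-|u|^2)(1-|v|^2)))
    return np.arccosh(1.0 + 2.0 * diff2 / max(denom, eps))
```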
[536] Learning Neural Vocoder from Range-Null Space Decomposition
Andong Li, Tong Lei, Zhihang Sun, Rilin Chen, Erwei Yin, Xiaodong Li, Chengshi Zheng
Main category: cs.SD
TL;DR: The paper proposes a novel time-frequency domain-based neural vocoder to address challenges like opaque modeling and parameter-performance trade-offs. It leverages classical signal range-null decomposition theory and introduces a dual-path framework for efficient spectrogram reconstruction.
Details
Motivation: Existing neural vocoders face challenges such as opaque modeling and parameter-performance trade-offs, prompting the need for an innovative solution.
Method: The approach bridges classical signal range-null decomposition theory with vocoder tasks, decomposing spectrogram reconstruction into range-space (linear domain shift) and null-space (learnable network). A dual-path framework hierarchically encodes/decodes the spectrum with cross- and narrow-band modules.
Result: Experiments on LJSpeech and LibriTTS benchmarks show state-of-the-art performance with lightweight parameters.
Conclusion: The proposed vocoder effectively resolves intrinsic challenges while achieving high performance, with code and pretrained models made publicly available.
Abstract: Despite the rapid development of neural vocoders in recent years, they usually suffer from some intrinsic challenges like opaque modeling, and parameter-performance trade-off. In this study, we propose an innovative time-frequency (T-F) domain-based neural vocoder to resolve the above-mentioned challenges. To be specific, we bridge the connection between the classical signal range-null decomposition (RND) theory and vocoder task, and the reconstruction of target spectrogram can be decomposed into the superimposition between the range-space and null-space, where the former is enabled by a linear domain shift from the original mel-scale domain to the target linear-scale domain, and the latter is instantiated via a learnable network for further spectral detail generation. Accordingly, we propose a novel dual-path framework, where the spectrum is hierarchically encoded/decoded, and the cross- and narrow-band modules are elaborately devised for efficient sub-band and sequential modeling. Comprehensive experiments are conducted on the LJSpeech and LibriTTS benchmarks. Quantitative and qualitative results show that while enjoying lightweight network parameters, the proposed approach yields state-of-the-art performance among existing advanced methods. Our code and the pretrained model weights are available at https://github.com/Andong-Li-speech/RNDVoC.
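The range-null split itself is plain linear algebra, sketched below with a random stand-in for the mel filterbank A; only the decomposition identity is guaranteed, while the shapes and the learned term v are illustrative.

```python
import numpy as np

A = np.random.randn(80, 513)   # stand-in mel filterbank: n_mels x n_freq
A_pinv = np.linalg.pinv(A)

def reconstruct(y_mel, v_learned):
    """Split the target spectrogram into range- and null-space parts."""
    # Range-space part: fully determined by the observed mel spectrogram
    range_part = A_pinv @ y_mel
    # Null-space part: spectral detail invisible to A, supplied by a network
    null_part = v_learned - A_pinv @ (A @ v_learned)
    # By construction A @ (range_part + null_part) recovers y_mel (when
    # y_mel lies in the range of A), so mel-consistency holds exactly.
    return range_part + null_part
```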
[537] JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment
Renhang Liu, Chia-Yu Hung, Navonil Majumder, Taylor Gautreaux, Amir Ali Bagherzadeh, Chuan Li, Dorien Herremans, Soujanya Poria
Main category: cs.SD
TL;DR: The paper introduces JAM, a flow-matching-based model for lyrics-to-song generation with word-level timing control, enhanced by aesthetic alignment and evaluated using the JAME dataset.
Details
Motivation: Existing lyrics-to-song models lack fine-grained word-level controllability desired by musicians, and there’s room for improvement in creative audio generation.
Method: JAM uses flow-matching for word-level timing and duration control, with aesthetic alignment via Direct Preference Optimization (DPO) to refine the model.
Result: JAM outperforms existing models in music-specific attributes, offering better vocal control and alignment with human preferences.
Conclusion: JAM advances lyrics-to-song generation with fine-grained control and improved quality, supported by the JAME dataset for standardized evaluation.
Abstract: Diffusion and flow-matching models have revolutionized automatic text-to-audio generation in recent times. These models are increasingly capable of generating high quality and faithful audio outputs capturing speech and acoustic events. However, there is still much room for improvement in creative audio generation that primarily involves music and songs. Recent open lyrics-to-song models, such as DiffRhythm, ACE-Step, and LeVo, have set an acceptable standard in automatic song generation for recreational use. However, these models lack fine-grained word-level controllability often desired by musicians in their workflows. To the best of our knowledge, our flow-matching-based JAM is the first effort toward endowing word-level timing and duration control in song generation, allowing fine-grained vocal control. To enhance the quality of generated songs to better align with human preferences, we implement aesthetic alignment through Direct Preference Optimization, which iteratively refines the model using a synthetic dataset, eliminating the need for manual data annotations. Furthermore, we aim to standardize the evaluation of such lyrics-to-song models through our public evaluation dataset JAME. We show that JAM outperforms the existing models in terms of the music-specific attributes.
[538] Music Arena: Live Evaluation for Text-to-Music
Yonghyun Kim, Wayne Chi, Anastasios N. Angelopoulos, Wei-Lin Chiang, Koichi Saito, Shinji Watanabe, Yuki Mitsufuji, Chris Donahue
Main category: cs.SD
TL;DR: Music Arena is an open platform for scalable human preference evaluation of text-to-music models, addressing challenges like cost, comparability, and lack of renewable preference data.
Details
Motivation: Human preference evaluation is costly and inconsistent across TTM systems, and there’s no open source of preference data for alignment or metric improvement.
Method: Music Arena enables live evaluation where users compare outputs from two TTM systems, using an LLM-based routing system and collecting detailed preferences.
Result: The platform compiles a leaderboard from user preferences, offers transparent data release policies, and tailors features to music-specific needs.
Conclusion: Music Arena standardizes TTM evaluation, provides renewable preference data, and demonstrates domain-specific adaptation of live evaluation.
Abstract: We present Music Arena, an open platform for scalable human preference evaluation of text-to-music (TTM) models. Soliciting human preferences via listening studies is the gold standard for evaluation in TTM, but these studies are expensive to conduct and difficult to compare, as study protocols may differ across systems. Moreover, human preferences might help researchers align their TTM systems or improve automatic evaluation metrics, but an open and renewable source of preferences does not currently exist. We aim to fill these gaps by offering live evaluation for TTM. In Music Arena, real-world users input text prompts of their choosing and compare outputs from two TTM systems, and their preferences are used to compile a leaderboard. While Music Arena follows recent evaluation trends in other AI domains, we also design it with key features tailored to music: an LLM-based routing system to navigate the heterogeneous type signatures of TTM systems, and the collection of detailed preferences including listening data and natural language feedback. We also propose a rolling data release policy with user privacy guarantees, providing a renewable source of preference data and increasing platform transparency. Through its standardized evaluation protocol, transparent data access policies, and music-specific features, Music Arena not only addresses key challenges in the TTM ecosystem but also demonstrates how live evaluation can be thoughtfully adapted to unique characteristics of specific AI domains. Music Arena is available at: https://music-arena.org
[539] Computer Audition: From Task-Specific Machine Learning to Foundation Models
Andreas Triantafyllopoulos, Iosif Tsangko, Alexander Gebhard, Annamaria Mesaros, Tuomas Virtanen, Björn Schuller
Main category: cs.SD
TL;DR: The paper discusses the shift from traditional computational audio analysis to auditory foundation models (FMs), emphasizing their advantages like multitasking, cross-modal knowledge, and human interaction.
Details
Motivation: To highlight the growing role of foundation models in computer audition and their potential to revolutionize audio analysis by unifying diverse tasks.
Method: The paper provides an overview of auditory FMs, detailing their operating principles and demonstrating their ability to handle multiple audio tasks within a single model.
Result: Auditory FMs show promise in consolidating tasks, leveraging multimodal knowledge, and improving user interaction compared to traditional pipelines.
Conclusion: The transition to auditory foundation models represents a significant advancement in computational audio analysis, offering versatility and efficiency.
Abstract: Foundation models (FMs) are increasingly spearheading recent advances on a variety of tasks that fall under the purview of computer audition – the use of machines to understand sounds. They feature several advantages over traditional pipelines: among others, the ability to consolidate multiple tasks in a single model, the option to leverage knowledge from other modalities, and the readily-available interaction with human users. Naturally, these promises have created substantial excitement in the audio community, and have led to a wave of early attempts to build new, general-purpose foundation models for audio. In the present contribution, we give an overview of computational audio analysis as it transitions from traditional pipelines towards auditory foundation models. Our work highlights the key operating principles that underpin those models, and showcases how they can accommodate multiple tasks that the audio community previously tackled separately.
[540] TAIL: Text-Audio Incremental Learning
Yingfei Sun, Xu Gu, Wei Ji, Hanbin Zhao, Yifang Yin, Roger Zimmermann
Main category: cs.SD
TL;DR: The paper introduces the Text-Audio Incremental Learning (TAIL) task and the PTAT method to address catastrophic forgetting and parameter inefficiency in multi-modal learning. PTAT uses prompt tuning and similarity/distillation modules, outperforming prior methods with fewer parameters.
Details
Motivation: Existing multi-modal models lack generalization on new datasets and suffer from catastrophic forgetting. Large parameter sizes also hinder training efficiency.
Method: Proposes PTAT, leveraging prompt tuning for parameter optimization and incorporating audio-text similarity and feature distillation to mitigate forgetting.
Result: PTAT outperforms prior methods, showing stronger resistance to forgetting and requiring only 2.42% of parameters compared to full-parameter fine-tuning, with a 4.46% performance boost.
Conclusion: PTAT effectively addresses catastrophic forgetting and parameter inefficiency in text-audio retrieval, demonstrating superior performance and scalability.
Abstract: Many studies combine text and audio to capture multi-modal information but they overlook the model’s generalization ability on new datasets. Introducing new datasets may affect the feature space of the original dataset, leading to catastrophic forgetting. Meanwhile, large model parameters can significantly impact training performance. To address these limitations, we introduce a novel task called Text-Audio Incremental Learning (TAIL) task for text-audio retrieval, and propose a new method, PTAT, Prompt Tuning for Audio-Text incremental learning. This method utilizes prompt tuning to optimize the model parameters while incorporating an audio-text similarity and feature distillation module to effectively mitigate catastrophic forgetting. We benchmark our method and previous incremental learning methods on AudioCaps, Clotho, BBC Sound Effects and Audioset datasets, and our method outperforms previous methods significantly, particularly demonstrating stronger resistance to forgetting on older datasets. Compared to the full-parameters Finetune (Sequential) method, our model only requires 2.42% of its parameters, achieving 4.46% higher performance.
[541] FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation
Yutong Liu, Ziyue Zhang, Ban Ma-bao, Yuqing Cai, Yongbin Yu, Renzeng Duojie, Xiangxiang Wang, Fan Gao, Cheng Huang, Nyima Tashi
Main category: cs.SD
TL;DR: FMSD-TTS is a few-shot, multi-speaker, multi-dialect TTS framework for Tibetan, addressing low-resource challenges with a novel speaker-dialect fusion module and DSDR-Net. It outperforms baselines in dialect expressiveness and speaker similarity, validated through dialect conversion tasks.
Details
Motivation: Tibetan lacks parallel speech corpora for its dialects, hindering speech modeling progress. FMSD-TTS aims to synthesize dialectal speech from limited audio and labels.
Method: Uses a speaker-dialect fusion module and DSDR-Net to capture dialect variations while preserving speaker identity. Evaluated through objective/subjective metrics and dialect conversion tasks.
Result: Outperforms baselines in dialect expressiveness and speaker similarity. Includes a synthetic Tibetan speech corpus and open-source evaluation toolkit.
Conclusion: FMSD-TTS advances Tibetan speech synthesis, offering a scalable solution for low-resource dialects and tools for standardized evaluation.
Abstract: Tibetan is a low-resource language with minimal parallel speech corpora spanning its three major dialects (Ü-Tsang, Amdo, and Kham), limiting progress in speech modeling. To address this issue, we propose FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels. Our method features a novel speaker-dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects while preserving speaker identity. Extensive objective and subjective evaluations demonstrate that FMSD-TTS significantly outperforms baselines in both dialectal expressiveness and speaker similarity. We further validate the quality and utility of the synthesized speech through a challenging speech-to-speech dialect conversion task. Our contributions include: (1) a novel few-shot TTS system tailored for Tibetan multi-dialect speech synthesis, (2) the public release of a large-scale synthetic Tibetan speech corpus generated by FMSD-TTS, and (3) an open-source evaluation toolkit for standardized assessment of speaker similarity, dialect consistency, and audio quality.
[542] BENYO-S2ST-Corpus-1: A Bilingual English-to-Yoruba Direct Speech-to-Speech Translation Corpus
Emmanuel Adetiba, Abdultaofeek Abayomi, Raymond J. Kala, Ayodele H. Ifijeh, Oluwatobi E. Dare, Olabode Idowu-Bismark, Gabriel O. Sobola, Joy N. Adetiba, Monsurat Adepeju Lateef, Heather Cole-Lewis
Main category: cs.SD
TL;DR: The study introduces BENYO-S2ST-Corpus-1, a bilingual English-to-Yoruba speech-to-speech translation dataset, addressing the shortage for high-to-low resource language pairs. It uses a hybrid approach combining existing Yoruba data, AI-generated English audios, and an audio augmentation algorithm (AcoustAug).
Details
Motivation: The shortage of S2ST datasets for high-to-low resource language pairs like English-to-Yoruba motivated the creation of BENYO-S2ST-Corpus-1 to bridge digital divides in translation.
Method: A hybrid architecture was developed, leveraging existing Yoruba audios and transcripts, generating English audios via AI models (Facebook MMS), and using the AcoustAug algorithm for audio augmentation.
Result: The corpus contains 24,064 samples (12,032 per language) with 41.20 hours of audio. A proof-of-concept Yoruba TTS model (YoruTTS-0.5) achieved an F0 RMSE of 63.54.
Conclusion: The corpus and its architecture can aid in creating datasets for other African languages, reducing translation divides. BENYO-S2ST-Corpus-1 and YoruTTS-0.5 are publicly available.
Abstract: There is a major shortage of Speech-to-Speech Translation (S2ST) datasets for high resource-to-low resource language pairs such as English-to-Yoruba. Thus, in this study, we curated the Bilingual English-to-Yoruba Speech-to-Speech Translation Corpus Version 1 (BENYO-S2ST-Corpus-1). The corpus is based on a hybrid architecture we developed for large-scale direct S2ST corpus creation at reduced cost. To achieve this, we leveraged non speech-to-speech Standard Yoruba (SY) real-time audios and transcripts in the YORULECT Corpus as well as the corresponding Standard English (SE) transcripts. The YORULECT Corpus is small-scale (1,504 samples) and does not have paired English audios. Therefore, we generated the SE audios using pre-trained AI models (i.e. Facebook MMS). We also developed an audio augmentation algorithm named AcoustAug based on three latent acoustic features to generate augmented audios from the raw audios of the two languages. BENYO-S2ST-Corpus-1 has 12,032 audio samples per language, which gives a total sample size of 24,064. The total audio duration for the two languages is 41.20 hours. This size is quite significant. Beyond building S2ST models, BENYO-S2ST-Corpus-1 can be used to build pretrained models or improve existing ones. The created corpus and Coqui framework were used to build a pretrained Yoruba TTS model (named YoruTTS-0.5) as a proof of concept. The YoruTTS-0.5 gave an F0 RMSE value of 63.54 after 1,000 epochs, which indicates moderate fundamental pitch similarity with the reference real-time audio. Ultimately, the corpus architecture in this study can be leveraged by researchers and developers to curate datasets for multilingual high-resource-to-low-resource African languages. This will bridge the huge digital divides in translations among high- and low-resource language pairs. BENYO-S2ST-Corpus-1 and YoruTTS-0.5 are publicly available at (https://bit.ly/40bGMwi).
cs.LG
[543] Beyond 9-to-5: A Generative Model for Augmenting Mobility Data of Underrepresented Shift Workers
Haoxuan Ma, Xishun Liao, Yifan Liu, Chris Stanford, Jiaqi Ma
Main category: cs.LG
TL;DR: The paper introduces a transformer-based method to model urban mobility for shift workers, addressing their underrepresentation in traditional surveys by using GPS data to generate accurate activity patterns.
Details
Motivation: Shift workers (15-20% of the workforce) are underrepresented in transportation surveys, leading to biased planning. This study aims to correct this gap by focusing on their unique mobility patterns.
Method: A novel transformer-based approach uses fragmented GPS trajectory data, incorporating period-aware temporal embeddings and a transition-focused loss function to generate complete activity patterns for shift workers.
Result: The method achieves strong alignment with GPS data (Average JSD < 0.02), demonstrating its effectiveness in capturing shift workers’ mobility patterns.
Conclusion: The approach provides a valuable tool for inclusive transportation planning by generating representative data for shift workers, improving urban mobility models.
Abstract: This paper addresses a critical gap in urban mobility modeling by focusing on shift workers, a population segment comprising 15-20% of the workforce in industrialized societies yet systematically underrepresented in traditional transportation surveys and planning. This underrepresentation is revealed in this study by a comparative analysis of GPS and survey data, highlighting stark differences between the bimodal temporal patterns of shift workers and the conventional 9-to-5 schedules recorded in surveys. To address this bias, we introduce a novel transformer-based approach that leverages fragmented GPS trajectory data to generate complete, behaviorally valid activity patterns for individuals working non-standard hours. Our method employs period-aware temporal embeddings and a transition-focused loss function specifically designed to capture the unique activity rhythms of shift workers and mitigate the inherent biases in conventional transportation datasets. Evaluation shows that the generated data achieves remarkable distributional alignment with GPS data from Los Angeles County (Average JSD < 0.02 for all evaluation metrics). By transforming incomplete GPS traces into complete, representative activity patterns, our approach provides transportation planners with a powerful data augmentation tool to fill critical gaps in understanding the 24/7 mobility needs of urban populations, enabling precise and inclusive transportation planning.
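The period-aware temporal embeddings are not spelled out in the summary; one common realization, sketched here under that assumption, encodes timestamps as sin/cos features at fixed daily and weekly periods:

```python
import torch

def period_aware_embedding(t_minutes, periods=(24 * 60, 7 * 24 * 60)):
    """Map timestamps (in minutes) to sin/cos features at fixed periods.
    Illustrative only: each period (default one day, one week) contributes a
    sin/cos pair, so off-cycle activity is encoded relative to daily and
    weekly rhythms rather than absolute clock time."""
    t = t_minutes.float().unsqueeze(-1)                    # (batch, 1)
    feats = []
    for p in periods:
        angle = 2 * torch.pi * t / p
        feats.append(torch.sin(angle))
        feats.append(torch.cos(angle))
    return torch.cat(feats, dim=-1)                        # (batch, 2*len(periods))

emb = period_aware_embedding(torch.tensor([0, 480, 1320]))  # midnight, 8am, 10pm
```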
[544] Enhancing Spatiotemporal Networks with xLSTM: A Scalar LSTM Approach for Cellular Traffic Forecasting
Khalid Ali, Zineddine Bettouche, Andreas Kassler, Andreas Fischer
Main category: cs.LG
TL;DR: A lightweight dual-path Spatiotemporal Network improves traffic forecasting by combining sLSTM for temporal modeling and Conv3D for spatial feature extraction, outperforming ConvLSTM with 23% lower MAE and 30% better generalization.
Details
Motivation: Accurate spatiotemporal traffic forecasting is crucial for 5G resource management, but conventional AI struggles with complex spatial-temporal patterns due to user mobility.
Method: The proposed network uses sLSTM for temporal modeling and a three-layer Conv3D module for spatial feature extraction, fused into a cohesive representation.
Result: The model reduces prediction error, improves gradient stability, and achieves a 23% MAE reduction over ConvLSTM with 30% better generalization.
Conclusion: The lightweight design is effective for large-scale deployments, offering robust forecasting and strong generalization to unseen regions.
Abstract: Accurate spatiotemporal traffic forecasting is vital for intelligent resource management in 5G and beyond. However, conventional AI approaches often fail to capture the intricate spatial and temporal patterns that arise due to, e.g., user mobility. We introduce a lightweight, dual-path Spatiotemporal Network that leverages a Scalar LSTM (sLSTM) for efficient temporal modeling and a three-layer Conv3D module for spatial feature extraction. A fusion layer integrates both streams into a cohesive representation, enabling robust forecasting. Our design improves gradient stability and convergence speed while reducing prediction error. Evaluations on real-world datasets show superior forecast performance over ConvLSTM baselines and strong generalization to unseen regions, making it well-suited for large-scale, next-generation network deployments. Experimental evaluation shows a 23% MAE reduction over ConvLSTM, with a 30% improvement in model generalization.
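A minimal sketch of the dual-path idea, with nn.LSTM standing in for the sLSTM cell (the scalar-memory xLSTM variant is not in core PyTorch) and illustrative grid sizes:

```python
import torch
import torch.nn as nn

class DualPathSTNet(nn.Module):
    """Sketch of a dual-path spatiotemporal net: a temporal path plus a
    three-layer Conv3D path, merged by a fusion layer. nn.LSTM is a stand-in
    for sLSTM; input is a traffic grid of shape (batch, T, H, W)."""
    def __init__(self, grid_h=20, grid_w=20, hidden=64):
        super().__init__()
        self.temporal = nn.LSTM(grid_h * grid_w, hidden, batch_first=True)
        self.spatial = nn.Sequential(                      # three Conv3D layers
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 1, kernel_size=3, padding=1),
        )
        self.head = nn.Linear(hidden + grid_h * grid_w, grid_h * grid_w)

    def forward(self, x):                                  # x: (B, T, H, W)
        b, t, h, w = x.shape
        _, (h_n, _) = self.temporal(x.flatten(2))          # temporal summary
        s = self.spatial(x.unsqueeze(1))[:, 0, -1]         # (B, H, W), last step
        fused = torch.cat([h_n[-1], s.flatten(1)], dim=-1) # fusion layer input
        return self.head(fused).view(b, h, w)              # next-step forecast

model = DualPathSTNet()
y = model(torch.randn(2, 12, 20, 20))                      # (2, 20, 20)
```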
[545] Wavelet Logic Machines: Learning and Reasoning in the Spectral Domain Without Neural Networks
Andrew Kiruluta
Main category: cs.LG
TL;DR: A spectral learning framework replaces neural layers with wavelet-domain operations, achieving competitive performance with fewer parameters and memory, while enabling faster convergence and lower inference costs.
Details
Motivation: To address the inefficiencies and overparameterization of traditional neural models by leveraging spectral sparsity and wavelet transforms for compact, interpretable, and efficient learning.
Method: The model operates entirely in the wavelet domain, applying learnable nonlinear transformations (soft-thresholding, gain-phase modulation) and adaptive wavelet basis selection (Haar, Daubechies, Biorthogonal). Implemented in PyTorch with 3D support, it avoids spatial convolutions or attention.
Result: Achieves 89.3% accuracy on SST-2 (close to a 4-layer Transformer’s 90.1%) with 72% fewer parameters and 58% less peak memory. Faster convergence due to spectral sparsity. Linear-time wavelet transforms reduce inference costs.
Conclusion: Demonstrates the viability of spectral learning for vision and language tasks, offering a principled alternative to overparameterized neural models.
Abstract: We introduce a fully spectral learning framework that eliminates traditional neural layers by operating entirely in the wavelet domain. The model applies learnable nonlinear transformations, including soft-thresholding and gain-phase modulation, directly to wavelet coefficients. It also includes a differentiable wavelet basis selection mechanism, enabling adaptive processing using families such as Haar, Daubechies, and Biorthogonal wavelets. Implemented in PyTorch with full 3D support, the model maintains a spectral pipeline without spatial convolutions or attention. On synthetic 3D denoising and natural language tasks from the GLUE benchmark, including SST-2 sentiment classification, the model achieves 89.3 percent accuracy, close to a 4-layer Transformer baseline (90.1 percent), while using 72 percent fewer parameters and 58 percent less peak memory. Faster early convergence is observed due to spectral sparsity priors. In contrast to the quadratic complexity of self-attention and large matrix multiplications in Transformers, our approach uses linear-time wavelet transforms and pointwise nonlinearities, significantly reducing inference cost. This yields a compact, interpretable, and efficient alternative to neural models. Our results support the viability of principled spectral learning in both vision and language tasks, offering new directions for model design without overparameterized architectures.
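The core nonlinearity the abstract names, learnable soft-thresholding of wavelet coefficients, fits in a few lines; the PyWavelets transform and single-level decomposition below are assumptions, not the paper's pipeline:

```python
import numpy as np
import pywt                      # pip install PyWavelets (assumed transform)
import torch
import torch.nn as nn

class LearnableSoftThreshold(nn.Module):
    """Soft-thresholding with a learnable threshold and gain, applied directly
    to wavelet coefficients (a sketch of the idea, not the paper's code)."""
    def __init__(self):
        super().__init__()
        self.log_tau = nn.Parameter(torch.tensor(-2.0))  # threshold, kept > 0
        self.gain = nn.Parameter(torch.tensor(1.0))

    def forward(self, coeffs):
        tau = self.log_tau.exp()
        return self.gain * torch.sign(coeffs) * torch.relu(coeffs.abs() - tau)

x = np.random.randn(1024).astype(np.float32)
cA, cD = pywt.dwt(x, "db2")                       # single-level Daubechies-2
shrink = LearnableSoftThreshold()
cD_new = shrink(torch.from_numpy(cD)).detach().numpy()
denoised = pywt.idwt(cA, cD_new, "db2")           # back to the signal domain
```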
[546] A Comparative Analysis of Traditional and Deep Learning Time Series Architectures for Influenza A Infectious Disease Forecasting
Edmund F. Agyemang, Hansapani Rodrigo, Vincent Agbenyeavu
Main category: cs.LG
TL;DR: Deep learning models, especially Transformers, outperform traditional methods (ARIMA, ETS) in predicting Influenza A outbreaks, suggesting their potential for improving public health forecasting.
Details
Motivation: To compare traditional and deep learning models for predicting Influenza A outbreaks, given its significant annual death toll and the need for better forecasting tools.
Method: Comparative analysis using historical data (2009-2023) of traditional models (ARIMA, ETS) and six deep learning architectures (Simple RNN, LSTM, GRU, BiLSTM, BiGRU, Transformer).
Result: Deep learning models, particularly Transformers, showed superior performance (MSE: 0.0433, MAE: 0.1126) over traditional methods in capturing temporal complexities.
Conclusion: Deep learning can enhance infectious disease prediction, supporting its integration into public health systems for real-time forecasting and intervention planning.
Abstract: Influenza A is responsible for 290,000 to 650,000 respiratory deaths a year, though this estimate is lower than in years past due to improvements in sanitation, healthcare practices, and vaccination programs. In this study, we perform a comparative analysis of traditional and deep learning models to predict Influenza A outbreaks. Using historical data from January 2009 to December 2023, we compared the performance of traditional ARIMA and Exponential Smoothing (ETS) models with six distinct deep learning architectures: Simple RNN, LSTM, GRU, BiLSTM, BiGRU, and Transformer. The results reveal a clear superiority of all the deep learning models, especially the state-of-the-art Transformer, with respective average testing MSE and MAE of 0.0433 ± 0.0020 and 0.1126 ± 0.0016, for capturing the temporal complexities associated with Influenza A data, outperforming the well-known traditional baseline ARIMA and ETS models. These findings provide evidence that state-of-the-art deep learning architectures can enhance predictive modeling for infectious diseases and indicate a more general trend toward using deep learning methods to enhance public health forecasting and intervention planning strategies. Future work should focus on how these models can be incorporated into real-time forecasting and preparedness systems at an epidemic level, and integrated into existing surveillance systems.
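A minimal sketch of the ARIMA baseline side of such a comparison, run on synthetic weekly counts (the study's 2009-2023 data and model orders are not shown):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from statsmodels.tsa.arima.model import ARIMA

# Synthetic weekly flu-like series with annual seasonality stands in for the
# real surveillance data used in the paper.
rng = np.random.default_rng(0)
t = np.arange(400)
series = 10 + 5 * np.sin(2 * np.pi * t / 52) + rng.normal(0, 1, t.size)

train, test = series[:350], series[350:]
model = ARIMA(train, order=(2, 0, 1)).fit()   # order is illustrative
forecast = model.forecast(steps=len(test))

print("MSE:", mean_squared_error(test, forecast))
print("MAE:", mean_absolute_error(test, forecast))
```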
[547] BikeVAE-GNN: A Variational Autoencoder-Augmented Hybrid Graph Neural Network for Sparse Bicycle Volume Estimation
Mohit Gupta, Debjit Bhowmick, Ben Beck
Main category: cs.LG
TL;DR: BikeVAE-GNN, a dual-task framework combining Hybrid-GNN and VAE, improves bicycle volume estimation in sparse urban networks, outperforming baseline models.
Details
Motivation: Accurate bicycle volume estimation is crucial for urban planning but hindered by sparse count data in global bicycling networks.
Method: Proposes BikeVAE-GNN, integrating Hybrid-GNN (GCN, GAT, GraphSAGE) and VAE for synthetic data generation, performing regression and classification tasks.
Result: Achieves MAE of 30.82, 99% accuracy, and F1-score of 0.99 on Melbourne data with 99% sparsity.
Conclusion: BikeVAE-GNN advances sparse network estimation, offering insights for sustainable infrastructure.
Abstract: Accurate link-level bicycle volume estimation is essential for informed urban and transport planning but is challenged by extremely sparse count data in urban bicycling networks worldwide. We propose BikeVAE-GNN, a novel dual-task framework augmenting a Hybrid Graph Neural Network (GNN) with a Variational Autoencoder (VAE) to estimate Average Daily Bicycle (ADB) counts, addressing sparse bicycle networks. The Hybrid-GNN combines Graph Convolutional Networks (GCN), Graph Attention Networks (GAT), and GraphSAGE to effectively model intricate spatial relationships in sparse networks while the VAE generates synthetic nodes and edges to enrich the graph structure and enhance estimation performance. BikeVAE-GNN simultaneously performs regression for bicycling volume estimation and classification for bicycling traffic level categorization. We demonstrate the effectiveness of BikeVAE-GNN using OpenStreetMap data and publicly available bicycle count data within the City of Melbourne - where only 141 of 15,933 road segments have labeled counts (resulting in 99% count data sparsity). Our experiments show that BikeVAE-GNN outperforms machine learning and baseline GNN models, achieving a mean absolute error (MAE) of 30.82 bicycles per day, accuracy of 99% and F1-score of 0.99. Ablation studies further validate the effective role of the Hybrid-GNN and VAE components. Our research advances bicycling volume estimation in sparse networks using novel and state-of-the-art approaches, providing insights for sustainable bicycling infrastructure.
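A sketch of the Hybrid-GNN backbone with its dual heads in PyTorch Geometric; the VAE augmentation stage, layer ordering, and sizes are assumptions:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv, GCNConv, SAGEConv

class HybridGNN(nn.Module):
    """Sketch of the Hybrid-GNN idea: stacked GCN/GAT/SAGE layers feeding
    dual heads (volume regression + traffic-level classification)."""
    def __init__(self, in_dim, hidden=64, num_levels=4):
        super().__init__()
        self.gcn = GCNConv(in_dim, hidden)
        self.gat = GATConv(hidden, hidden, heads=1)
        self.sage = SAGEConv(hidden, hidden)
        self.reg_head = nn.Linear(hidden, 1)           # ADB count estimate
        self.cls_head = nn.Linear(hidden, num_levels)  # traffic level

    def forward(self, x, edge_index):
        h = torch.relu(self.gcn(x, edge_index))
        h = torch.relu(self.gat(h, edge_index))
        h = torch.relu(self.sage(h, edge_index))
        return self.reg_head(h).squeeze(-1), self.cls_head(h)
```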
[548] VAE-GAN Based Price Manipulation in Coordinated Local Energy Markets
Biswarup Mukherjee, Li Zhou, S. Gokul Krishnan, Milad Kabirifar, Subhash Lakshminarayana, Charalambos Konstantinou
Main category: cs.LG
TL;DR: A model for coordinating prosumers with heterogeneous DERs in LEMs using MADDPG and adversarial pricing with VAE-GAN, showing financial losses for some prosumers and improved fairness in larger markets.
Details
Motivation: To address the challenge of coordinating prosumers with diverse DERs in dynamic energy markets and explore the impact of adversarial pricing.
Method: Uses MADDPG for real-time decision-making and VAE-GAN for adversarial price manipulation.
Result: Prosumers, especially those without generation capabilities, suffer financial losses under adversarial pricing, but fairness improves in larger markets.
Conclusion: The model effectively coordinates prosumers and highlights vulnerabilities to adversarial pricing, with larger markets fostering cooperation and fairness.
Abstract: This paper introduces a model for coordinating prosumers with heterogeneous distributed energy resources (DERs), participating in the local energy market (LEM) that interacts with the market-clearing entity. The proposed LEM scheme utilizes a data-driven, model-free reinforcement learning approach based on the multi-agent deep deterministic policy gradient (MADDPG) framework, enabling prosumers to make real-time decisions on whether to buy, sell, or refrain from any action while facilitating efficient coordination for optimal energy trading in a dynamic market. In addition, we investigate a price manipulation strategy using a variational autoencoder-generative adversarial network (VAE-GAN) model, which allows utilities to adjust price signals in a way that induces financial losses for the prosumers. Our results show that under adversarial pricing, heterogeneous prosumer groups, particularly those lacking generation capabilities, incur financial losses. The same outcome holds across LEMs of different sizes. As the market size increases, trading stabilizes and fairness improves through emergent cooperation among agents.
[549] Target Circuit Matching in Large-Scale Netlists using GNN-Based Region Prediction
Sangwoo Seo, Jimin Seo, Yoonho Lee, Donghyeon Kim, Hyejin Shin, Banghyun Sung, Chanyoung Park
Main category: cs.LG
TL;DR: Proposes a GNN-based method for efficient subgraph matching in large circuits, outperforming traditional and existing deep learning approaches.
Details
Motivation: Traditional rule-based and node-to-node matching methods are inefficient for large circuits, and existing deep learning models fail to capture global subgraph embeddings effectively.
Method: Uses GNNs to predict high-probability regions for target circuits, constructs negative samples for accurate learning, and directly extracts global subgraph embeddings.
Result: Outperforms existing methods in time efficiency and target region prediction, scalable for large circuits.
Conclusion: The approach offers an efficient and scalable solution for subgraph matching in large-scale circuits.
Abstract: Subgraph matching plays an important role in electronic design automation (EDA) and circuit verification. Traditional rule-based methods have limitations in generalizing to arbitrary target circuits. Furthermore, node-to-node matching approaches tend to be computationally inefficient, particularly for large-scale circuits. Deep learning methods have emerged as a potential solution to address these challenges, but existing models fail to efficiently capture global subgraph embeddings or rely on inefficient matching matrices, which limits their effectiveness for large circuits. In this paper, we propose an efficient graph matching approach that utilizes Graph Neural Networks (GNNs) to predict regions of high probability for containing the target circuit. Specifically, we construct various negative samples to enable GNNs to accurately learn the presence of target circuits and develop an approach to directly extracting subgraph embeddings from the entire circuit, which captures global subgraph information and addresses the inefficiency of applying GNNs to all candidate subgraphs. Extensive experiments demonstrate that our approach significantly outperforms existing methods in terms of time efficiency and target region prediction, offering a scalable and effective solution for subgraph matching in large-scale circuits.
[550] Physics-informed transfer learning for SHM via feature selection
J. Poole, P. Gardner, A. J. Hughes, N. Dervilis, R. S. Mills, T. A. Dardeno, K. Worden
Main category: cs.LG
TL;DR: The paper proposes using the Modal Assurance Criterion (MAC) to select features for transfer learning in population-based structural health monitoring, addressing the challenge of limited labeled data and differing distributions across structures.
Details
Motivation: Training data for structural health monitoring (SHM) is costly and scarce, especially labeled data. Population-based SHM and transfer learning (TL) offer solutions, but selecting suitable source domains and features is challenging without domain expertise.
Method: Leverages physics knowledge to select features, using the MAC to quantify mode correspondence between healthy structures. MAC is validated as a measure for feature selection with invariant conditional distributions.
Result: MAC shows high correspondence with a supervised metric for joint-distribution similarity, indicating its effectiveness for selecting generalizable features. Demonstrated success in numerical and experimental case studies.
Conclusion: The MAC-based approach effectively selects features for transfer learning in SHM, enabling generalization across structures despite limited labeled data.
Abstract: Data used for training structural health monitoring (SHM) systems are expensive and often impractical to obtain, particularly labelled data. Population-based SHM presents a potential solution to this issue by considering the available data across a population of structures. However, differences between structures will mean the training and testing distributions will differ; thus, conventional machine learning methods cannot be expected to generalise between structures. To address this issue, transfer learning (TL) can be used to leverage information across related domains. An important consideration is that the lack of labels in the target domain limits data-based metrics to quantifying the discrepancy between the marginal distributions. Thus, a prerequisite for the application of typical unsupervised TL methods is to identify suitable source structures (domains), and a set of features, for which the conditional distributions are related to the target structure. Generally, the selection of domains and features is reliant on domain expertise; however, for complex mechanisms, such as the influence of damage on the dynamic response of a structure, this task is not trivial. In this paper, knowledge of physics is leveraged to select more similar features: the modal assurance criterion (MAC) is used to quantify the correspondence between the modes of healthy structures. The MAC is shown to have high correspondence with a supervised metric that measures joint-distribution similarity, which is the primary indicator of whether a classifier will generalise between domains. The MAC is proposed as a measure for selecting a set of features that behave consistently across domains when subjected to damage, i.e. features with invariance in the conditional distributions. This approach is demonstrated on numerical and experimental case studies to verify its effectiveness in various applications.
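The MAC itself has a standard closed form, MAC$(\phi_i, \phi_j) = |\phi_i^\top \phi_j|^2 / ((\phi_i^\top \phi_i)(\phi_j^\top \phi_j))$; a direct NumPy implementation:

```python
import numpy as np

def mac(phi_i, phi_j):
    """Modal Assurance Criterion between two mode-shape vectors:
    MAC = |phi_i^T phi_j|^2 / ((phi_i^T phi_i)(phi_j^T phi_j)).
    Returns 1 for perfectly correlated modes, 0 for orthogonal ones."""
    num = np.abs(phi_i @ phi_j) ** 2
    return num / ((phi_i @ phi_i) * (phi_j @ phi_j))

phi_a = np.array([1.0, 0.8, 0.3])
phi_b = np.array([0.9, 0.85, 0.25])
print(mac(phi_a, phi_b))  # close to 1 -> candidate feature pair for transfer
```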
[551] Exoplanet Detection Using Machine Learning Models Trained on Synthetic Light Curves
Ethan Lo, Dan C. Lo
Main category: cs.LG
TL;DR: The paper explores using simpler ML models (logistic regression, k-nearest neighbors, random forest) for exoplanet discovery, showing promising results but highlighting the need for data augmentation to address biases and imbalances.
Details
Motivation: Manual exoplanet discovery is slow; ML can improve efficiency, but existing models are complex. This study aims to simplify the process.
Method: Tested logistic regression, k-nearest neighbors, and random forest on NASA’s Kepler dataset, using data augmentation to address imbalances.
Result: Initial results are promising, but data augmentation improves recall and precision, though accuracy varies by model.
Conclusion: Simpler ML models with data augmentation can enhance exoplanet discovery, balancing efficiency and fairness.
Abstract: With manual searching processes, the rate at which scientists and astronomers discover exoplanets is slow because of inefficiencies requiring extensive, laborious inspection. In fact, only about 5,000 exoplanets have been confirmed since the 1990s. Recently, machine learning (ML) has proven to be extremely valuable and efficient in various fields, capable of processing massive amounts of data in addition to increasing its accuracy by learning. Though large organizations (e.g., NASA) already have ML models for discovering exoplanets, they largely depend on complex algorithms and supercomputers. In an effort to reduce such complexities, in this paper, we report the results and potential benefits of various well-known ML models in the discovery and validation of extrasolar planets. The ML models examined in this study include logistic regression, k-nearest neighbors, and random forest. The dataset on which the models train and predict is acquired from NASA’s Kepler space telescope. The initial results show promising scores for each model. However, potential biases and dataset imbalances necessitate the use of data augmentation techniques to further ensure fairer predictions and improved generalization. This study concludes that, in the context of searching for exoplanets, data augmentation techniques significantly improve the recall and precision, while the accuracy varies for each model.
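A minimal scikit-learn sketch of the three models, with class weighting standing in for the paper's augmentation step; the file and column names are hypothetical (the Kepler KOI table must be obtained separately, e.g. from the NASA Exoplanet Archive):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("kepler_koi.csv")            # hypothetical local export
X = df.drop(columns=["disposition"])
y = df["disposition"]                         # CONFIRMED vs FALSE POSITIVE

models = {
    "logreg": make_pipeline(StandardScaler(),
                            LogisticRegression(class_weight="balanced",
                                               max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "rf": RandomForestClassifier(n_estimators=300, class_weight="balanced"),
}
for name, m in models.items():
    scores = cross_val_score(m, X, y, cv=5, scoring="recall_macro")
    print(name, scores.mean())
```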
[552] Applications and Manipulations of Physics-Informed Neural Networks in Solving Differential Equations
Aarush Gupta, Kendric Hsu, Syna Mathod
Main category: cs.LG
TL;DR: PINNs solve forward and inverse problems in differential equations by embedding prior knowledge (differential equations) into the loss function, improving performance and handling sparse data without overfitting.
Details
Motivation: To leverage prior analytical knowledge (differential equations) in neural networks for solving complex differential equations efficiently, especially with sparse data.
Method: Use Physics-Informed Neural Networks (PINNs) to minimize residuals of differential equations, embedding solutions and parameters into the loss function. Implemented in Python using PyTorch.
Result: PINNs effectively solve both forward and inverse problems, extrapolating trends beyond training data.
Conclusion: PINNs are a robust tool for solving differential equations by integrating prior knowledge, demonstrated through linear, quadratic, and complex models like the heat equation.
Abstract: Mathematical models in neural networks are powerful tools for solving complex differential equations and optimizing their parameters; that is, solving the forward and inverse problems, respectively. A forward problem predicts the output of a network for a given input by optimizing weights and biases. An inverse problem finds equation parameters or coefficients that effectively model the data. A Physics-Informed Neural Network (PINN) can solve both problems. PINNs inject prior analytical information about the data into the cost function to improve model performance outside the training set boundaries. This also allows PINNs to efficiently solve problems with sparse data without overfitting by extrapolating the model to fit larger trends in the data. The prior information we implement is in the form of differential equations. Residuals are the differences between the left-hand and right-hand sides of corresponding differential equations; PINNs minimize these residuals to effectively solve the differential equation and take advantage of prior knowledge. In this way, the solution and parameters are embedded into the loss function and optimized, allowing both the weights of the neural network and the model parameters to be found simultaneously, solving both the forward and inverse problems in the process. In this paper, we will create PINNs with residuals of varying complexity, beginning with linear and quadratic models and then expanding to fit models for the heat equation and other complex differential equations. We will mainly use Python as the computing language, using the PyTorch library to aid us in our research.
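The residual recipe the abstract describes fits in a short PyTorch sketch; shown here for the toy ODE dy/dx = -y with y(0) = 1 rather than the paper's heat-equation models:

```python
import torch
import torch.nn as nn

# Minimal PINN: the loss combines the ODE residual at random collocation
# points with the initial condition (exact solution: e^{-x}).
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(),
                    nn.Linear(32, 32), nn.Tanh(),
                    nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    x = 4.0 * torch.rand(64, 1)              # collocation points in [0, 4]
    x.requires_grad_(True)
    y = net(x)
    dy_dx = torch.autograd.grad(y, x, torch.ones_like(y), create_graph=True)[0]
    residual = dy_dx + y                     # residual of dy/dx + y = 0
    ic = (net(torch.zeros(1, 1)) - 1.0) ** 2
    loss = (residual ** 2).mean() + ic.mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(net(torch.tensor([[1.0]])).item())     # should approach e^{-1} ~ 0.368
```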
[553] Language Models for Controllable DNA Sequence Design
Xingyu Su, Xiner Li, Yuchao Lin, Ziqian Xie, Degui Zhi, Shuiwang Ji
Main category: cs.LG
TL;DR: ATGC-Gen is a transformer-based model for controllable DNA sequence generation, outperforming prior methods in biological relevance and controllability.
Details
Motivation: To explore the underutilized potential of language models in DNA sequence generation and improve controllability and biological relevance.
Method: Uses cross-modal encoding with decoder-only and encoder-only transformer architectures for flexible training and generation.
Result: Generates fluent, diverse, and biologically relevant sequences, showing improvements over prior methods.
Conclusion: Demonstrates the promise of language models in programmable genomic design, with code publicly available.
Abstract: We consider controllable DNA sequence design, where sequences are generated by conditioning on specific biological properties. While language models (LMs) such as GPT and BERT have achieved remarkable success in natural language generation, their application to DNA sequence generation remains largely underexplored. In this work, we introduce ATGC-Gen, an Automated Transformer Generator for Controllable Generation, which leverages cross-modal encoding to integrate diverse biological signals. ATGC-Gen is instantiated with both decoder-only and encoder-only transformer architectures, allowing flexible training and generation under either autoregressive or masked recovery objectives. We evaluate ATGC-Gen on representative tasks including promoter and enhancer sequence design, and further introduce a new dataset based on ChIP-Seq experiments for modeling protein binding specificity. Our experiments demonstrate that ATGC-Gen can generate fluent, diverse, and biologically relevant sequences aligned with the desired properties. Compared to prior methods, our model achieves notable improvements in controllability and functional relevance, highlighting the potential of language models in advancing programmable genomic design. The source code is released at (https://github.com/divelab/AIRS/blob/main/OpenBio/ATGC_Gen).
[554] Kolmogorov Arnold Network Autoencoder in Medicine
Ugo Lomoio, Pierangelo Veltri, Pietro Hiram Guzzi
Main category: cs.LG
TL;DR: The paper benchmarks traditional Autoencoders (AEs) against Kolmogorov-Arnold Networks (KANs) in tasks like reconstruction and anomaly detection using cardiological signals.
Details
Motivation: To explore if KAN-based AEs outperform traditional AEs in medical signal tasks despite having fewer or equal parameters.
Method: Compares vanilla AEs (Linear, Convolutional, Variational) with KAN counterparts on five tasks (reconstruction, generation, denoising, inpainting, anomaly detection) using the AbnormalHeartbeat dataset.
Result: KAN-based AEs show better performance in multiple scenarios, leveraging learnable activation functions on edges.
Conclusion: KAN architectures, including KAN Convolutional Networks, offer promising improvements for medical signal processing tasks.
Abstract: Deep learning architectures such as Multi-Layer Perceptrons (MLPs) and convolutional blocks still play a crucial role in current research. From a topological point of view, these architectures may be represented as graphs in which we learn the functions related to the nodes while fixed edges convey the information from the input to the output. A recent work introduced a new architecture called Kolmogorov-Arnold Networks (KAN), reporting how learnable activation functions placed on the edges of the neural network lead to better performance in multiple scenarios. Multiple studies are focusing on optimizing the KAN architecture by adding important features such as dropout regularization, Autoencoders (AE), model benchmarking, and, last but not least, the KAN Convolutional Network (KCN), which introduced matrix convolution with KAN learning. This study aims to benchmark multiple versions of vanilla AEs (Linear, Convolutional, and Variational) against their Kolmogorov-Arnold counterparts with the same or fewer parameters. Using cardiological signals as model input, a total of five classic AE tasks were studied: reconstruction, generation, denoising, inpainting, and anomaly detection. The experiments use the medical dataset AbnormalHeartbeat, which contains audio signals obtained from a stethoscope.
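A simplified sketch of the KAN idea, learnable activations on edges; Gaussian basis functions approximate the B-splines of the original formulation, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

class RBFKANLayer(nn.Module):
    """Simplified KAN-style layer: each input-output edge carries its own
    learnable activation, parameterized here as a sum of Gaussian basis
    functions (the original KAN uses B-splines; this is an approximation)."""
    def __init__(self, in_dim, out_dim, num_basis=8, span=2.0):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(-span, span, num_basis),
                                    requires_grad=False)
        self.width = 2 * span / num_basis
        # One coefficient per (edge, basis function)
        self.coef = nn.Parameter(torch.randn(in_dim, out_dim, num_basis) * 0.1)

    def forward(self, x):                      # x: (batch, in_dim)
        d = x.unsqueeze(-1) - self.centers     # (batch, in_dim, num_basis)
        basis = torch.exp(-(d / self.width) ** 2)
        # phi_ij(x_i), summed over inputs i for each output j
        return torch.einsum("bik,iok->bo", basis, self.coef)

# Tiny autoencoder built from KAN-style layers (dimensions are illustrative)
encoder = RBFKANLayer(128, 16)
decoder = RBFKANLayer(16, 128)
recon = decoder(encoder(torch.randn(4, 128)))
```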
[555] Debunking Optimization Myths in Federated Learning for Medical Image Classification
Youngjoon Lee, Hyukjoon Lee, Jinu Gong, Yang Cao, Joonhyuk Kang
Main category: cs.LG
TL;DR: The study highlights that in Federated Learning (FL), local configurations like optimizers and learning rates impact performance more than the FL method itself, emphasizing the need for edge-specific tuning.
Details
Motivation: To address the sensitivity of FL methods to local factors in medical imaging, limiting robustness in real-world applications.
Method: Benchmarking recent FL methods on colorectal pathology and blood cell classification tasks, analyzing the impact of local optimizers, learning rates, and training epochs.
Result: Local optimizer and learning rate choices significantly affect performance more than the FL method. Training epochs’ impact varies by FL method.
Conclusion: Edge-specific configurations are more critical than algorithmic complexity for effective FL deployment.
Abstract: Federated Learning (FL) is a collaborative learning method that enables decentralized model training while preserving data privacy. Despite its promise in medical imaging, recent FL methods are often sensitive to local factors such as optimizers and learning rates, limiting their robustness in practical deployments. In this work, we revisit vanilla FL to clarify the impact of edge device configurations, benchmarking recent FL methods on colorectal pathology and blood cell classification tasks. We numerically show that the choice of local optimizer and learning rate has a greater effect on performance than the specific FL method. Moreover, we find that increasing local training epochs can either enhance or impair convergence, depending on the FL method. These findings indicate that appropriate edge-specific configuration is more crucial than algorithmic complexity for achieving effective FL.
[556] MMCircuitEval: A Comprehensive Multimodal Circuit-Focused Benchmark for Evaluating LLMs
Chenchen Zhao, Zhengyuan Shi, Xiangyu Wen, Chengjie Liu, Yi Liu, Yunhao Zhou, Yuxiang Zhao, Hefei Feng, Yinan Zhu, Gwok-Waa Wan, Xin Cheng, Weiyu Chen, Yongqi Fu, Chujie Chen, Chenhao Xue, Guangyu Sun, Ying Wang, Yibo Lin, Jun Yang, Ning Xu, Xi Wang, Qiang Xu
Main category: cs.LG
TL;DR: MMCircuitEval is a multimodal benchmark for evaluating MLLMs in EDA tasks, addressing gaps in existing benchmarks with 3614 QA pairs across diverse circuit design stages.
Details
Motivation: Existing benchmarks for MLLMs in EDA are limited in scope, necessitating a comprehensive evaluation tool like MMCircuitEval.
Method: The benchmark includes 3614 QA pairs from diverse sources, categorized by design stage, circuit type, abilities tested, and difficulty, reviewed by experts.
Result: Evaluations show performance gaps in existing LLMs, especially in back-end design and complex computations.
Conclusion: MMCircuitEval serves as a foundational resource for advancing MLLMs in EDA, aiding their real-world application.
Abstract: The emergence of multimodal large language models (MLLMs) presents promising opportunities for automation and enhancement in Electronic Design Automation (EDA). However, comprehensively evaluating these models in circuit design remains challenging due to the narrow scope of existing benchmarks. To bridge this gap, we introduce MMCircuitEval, the first multimodal benchmark specifically designed to assess MLLM performance comprehensively across diverse EDA tasks. MMCircuitEval comprises 3614 meticulously curated question-answer (QA) pairs spanning digital and analog circuits across critical EDA stages - ranging from general knowledge and specifications to front-end and back-end design. Derived from textbooks, technical question banks, datasheets, and real-world documentation, each QA pair undergoes rigorous expert review for accuracy and relevance. Our benchmark uniquely categorizes questions by design stage, circuit type, tested abilities (knowledge, comprehension, reasoning, computation), and difficulty level, enabling detailed analysis of model capabilities and limitations. Extensive evaluations reveal significant performance gaps among existing LLMs, particularly in back-end design and complex computations, highlighting the critical need for targeted training datasets and modeling approaches. MMCircuitEval provides a foundational resource for advancing MLLMs in EDA, facilitating their integration into real-world circuit design workflows. Our benchmark is available at https://github.com/cure-lab/MMCircuitEval.
[557] Quantizing Text-attributed Graphs for Semantic-Structural Integration
Jianyuan Bo, Hao Wu, Yuan Fang
Main category: cs.LG
TL;DR: STAG is a self-supervised framework that quantizes graph structural information into discrete tokens for LLM-based graph learning, enabling zero-shot transfer without labeled data.
Details
Motivation: Current methods struggle to embed graph structures into LLM-compatible formats efficiently and often lose critical details, while requiring labeled data for transfer learning.
Method: STAG uses soft assignment and KL divergence-guided quantization to convert graph structures into discrete tokens, avoiding manual verbalization or expensive alignment.
Result: Achieves state-of-the-art performance in node classification benchmarks and supports zero-shot transfer learning without labeled data.
Conclusion: STAG bridges graph learning with LLMs effectively, offering a scalable and adaptable solution.
Abstract: Text-attributed graphs (TAGs) have emerged as a powerful representation for modeling complex relationships across diverse domains. With the rise of large language models (LLMs), there is growing interest in leveraging their capabilities for graph learning. However, current approaches face significant challenges in embedding structural information into LLM-compatible formats, requiring either computationally expensive alignment mechanisms or manual graph verbalization techniques that often lose critical structural details. Moreover, these methods typically require labeled data from source domains for effective transfer learning, significantly constraining their adaptability. We propose STAG, a novel self-supervised framework that directly quantizes graph structural information into discrete tokens using a frozen codebook. Unlike traditional quantization approaches, our method employs soft assignment and KL divergence guided quantization to address the unique challenges of graph data, which lacks natural tokenization structures. Our framework enables both LLM-based and traditional learning approaches, supporting true zero-shot transfer learning without requiring labeled data even in the source domain. Extensive experiments demonstrate state-of-the-art performance across multiple node classification benchmarks while maintaining compatibility with different LLM architectures, offering an elegant solution to bridging graph learning with LLMs.
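A sketch of soft-assignment quantization against a frozen codebook with a KL regularizer; the uniform prior and temperature are assumptions, not details from the paper:

```python
import torch
import torch.nn.functional as F

def soft_quantize(z, codebook, tau=1.0):
    """Soft-assign node embeddings z (N, d) to a frozen codebook (K, d).
    Assignments are a softmax over negative squared distances; a KL term
    regularizes the average code usage toward a uniform prior (assumed)."""
    d2 = torch.cdist(z, codebook) ** 2            # (N, K) squared distances
    assign = F.softmax(-d2 / tau, dim=-1)         # soft assignment
    z_q = assign @ codebook                       # quantized embeddings
    avg = assign.mean(0)                          # usage distribution over codes
    prior = torch.full_like(avg, 1.0 / avg.numel())
    kl = (avg * (avg.clamp_min(1e-9) / prior).log()).sum()
    return z_q, assign, kl

codebook = torch.randn(256, 64)                   # frozen: never updated
z_q, assign, kl_loss = soft_quantize(torch.randn(100, 64), codebook)
```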
[558] Research on the application of graph data structure and graph neural network in node classification/clustering tasks
Yihan Wang, Jianing Zhao
Main category: cs.LG
TL;DR: The paper compares traditional graph algorithms and Graph Neural Networks (GNNs), showing GNNs outperform traditional methods by 43-70% in accuracy for node classification and clustering. It also explores integration strategies between the two.
Details
Motivation: Graph-structured data is common but challenging for traditional ML due to its non-Euclidean nature. The study aims to evaluate and compare classical graph algorithms and GNNs.
Method: The study conducts comparative experiments on node classification and clustering tasks, analyzing performance differences between traditional algorithms and GNNs.
Result: GNNs achieve significant accuracy improvements (43-70%) over traditional methods.
Conclusion: The paper provides theoretical guidance for integrating classical algorithms with GNNs to advance graph representation learning.
Abstract: Graph-structured data are pervasive across domains including social networks, biological networks, and knowledge graphs. Due to their non-Euclidean nature, such data pose significant challenges to conventional machine learning methods. This study investigates graph data structures, classical graph algorithms, and Graph Neural Networks (GNNs), providing comprehensive theoretical analysis and comparative evaluation. Through comparative experiments, we quantitatively assess performance differences between traditional algorithms and GNNs in node classification and clustering tasks. Results show GNNs achieve substantial accuracy improvements of 43% to 70% over traditional methods. We further explore integration strategies between classical algorithms and GNN architectures, providing theoretical guidance for advancing graph representation learning research.
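A minimal PyTorch Geometric GCN node classifier on Cora, the standard setup for such comparisons (the paper's datasets and hyperparameters are not specified):

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

data = Planetoid(root="/tmp/Cora", name="Cora")[0]   # downloads on first run

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(data.num_node_features, 16)
        self.conv2 = GCNConv(16, 7)                  # Cora has 7 classes
    def forward(self, x, edge_index):
        return self.conv2(F.relu(self.conv1(x, edge_index)), edge_index)

model = GCN()
opt = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
for _ in range(200):
    opt.zero_grad()
    out = model(data.x, data.edge_index)
    F.cross_entropy(out[data.train_mask], data.y[data.train_mask]).backward()
    opt.step()

pred = model(data.x, data.edge_index).argmax(dim=1)
acc = (pred[data.test_mask] == data.y[data.test_mask]).float().mean()
print(f"test accuracy: {acc:.3f}")
```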
[559] Machine Learning Risk Intelligence for Green Hydrogen Investment: Insights for Duqm R3 Auction
Obumneme Nwafor, Mohammed Abdul Majeed Al Hooti
Main category: cs.LG
TL;DR: Oman’s green hydrogen projects face risks due to environmental fluctuations. This paper proposes an AI tool using meteorological data to predict maintenance needs, aiding auction decisions.
Details
Motivation: The lack of historical data for large-scale hydrogen projects in deserts creates a knowledge gap for risk assessment, necessitating a proxy like environmental conditions.
Method: An AI decision support system uses publicly available meteorological data to create a Maintenance Pressure Index (MPI) for predicting infrastructure risks.
Result: The MPI tool predicts maintenance demands and risk levels, enhancing regulatory foresight and operational planning.
Conclusion: The AI tool fills the data gap by using environmental proxies, improving auction evaluation and infrastructure planning for green hydrogen projects.
Abstract: As green hydrogen emerges as a major component of global decarbonisation, Oman has positioned itself strategically through national auctions and international partnerships. Following two successful green hydrogen project rounds, the country launched its third auction (R3) in the Duqm region. While this area exhibits relative geospatial homogeneity, it is still vulnerable to environmental fluctuations that pose inherent risks to productivity. Despite growing global investment in green hydrogen, operational data remains scarce, with major projects like Saudi Arabia’s NEOM facility not expected to commence production until 2026, and Oman’s ACME Duqm project scheduled for 2028. This absence of historical maintenance and performance data from large-scale hydrogen facilities in desert environments creates a major knowledge gap for accurate risk assessment for infrastructure planning and auction decisions. Given this data void, environmental conditions emerge as an accessible and reliable proxy for predicting infrastructure maintenance pressures, because harsh desert conditions such as dust storms, extreme temperatures, and humidity fluctuations are well-documented drivers of equipment degradation in renewable energy systems. To address this challenge, this paper proposes an Artificial Intelligence decision support system that leverages publicly available meteorological data to develop a predictive Maintenance Pressure Index (MPI), which predicts risk levels and future maintenance demands on hydrogen infrastructure. This tool strengthens regulatory foresight and operational decision-making by enabling temporal benchmarking to assess and validate performance claims over time. It can be used to incorporate temporal risk intelligence into auction evaluation criteria despite the absence of historical operational benchmarks.
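A toy illustration of a Maintenance Pressure Index as a weighted sum of normalized environmental stressors; the features and weights here are assumptions, since the paper's construction is not detailed:

```python
import pandas as pd

# Hypothetical monthly meteorological summaries for three sites.
df = pd.DataFrame({
    "dust_events":  [2, 9, 4],    # storms per month
    "max_temp_c":   [38, 46, 41],
    "humidity_pct": [60, 85, 70],
})
norm = (df - df.min()) / (df.max() - df.min())          # min-max per feature
weights = {"dust_events": 0.5, "max_temp_c": 0.3, "humidity_pct": 0.2}
df["MPI"] = sum(w * norm[c] for c, w in weights.items())
print(df[["MPI"]])  # higher -> more maintenance pressure expected
```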
[560] Large-Scale Mixed-Traffic and Intersection Control using Multi-agent Reinforcement Learning
Songyang Liu, Muyang Fan, Weizi Li, Jing Du, Shuai Li
Main category: cs.LG
TL;DR: The paper explores decentralized multi-agent reinforcement learning for large-scale mixed traffic control, showing improved efficiency over traditional signalized intersections.
Details
Motivation: Traffic congestion is a major urban issue, and autonomous driving technologies, particularly reinforcement learning, offer potential solutions. Prior work focused on small-scale networks, leaving large-scale mixed traffic control unexplored.
Method: The study uses decentralized multi-agent reinforcement learning to manage a mix of traffic signals and robot vehicles in a real-world network of 14 intersections.
Result: At 80% robot vehicle penetration, waiting time reduced from 6.17s to 5.09s, and throughput increased from 454 to 493 vehicles per 500 seconds.
Conclusion: Reinforcement learning-based control can enhance large-scale traffic efficiency, offering insights for future urban planning.
Abstract: Traffic congestion remains a significant challenge in modern urban networks. Autonomous driving technologies have emerged as a potential solution. Among traffic control methods, reinforcement learning has shown superior performance over traffic signals in various scenarios. However, prior research has largely focused on small-scale networks or isolated intersections, leaving large-scale mixed traffic control largely unexplored. This study presents the first attempt to use decentralized multi-agent reinforcement learning for large-scale mixed traffic control in which some intersections are managed by traffic signals and others by robot vehicles (RVs). Evaluating a real-world network in Colorado Springs, CO, USA with 14 intersections, we measure traffic efficiency via average waiting time of vehicles at intersections and the number of vehicles reaching their destinations within a time window (i.e., throughput). At an 80% RV penetration rate, our method reduces waiting time from 6.17s to 5.09s and increases throughput from 454 vehicles per 500 seconds to 493 vehicles per 500 seconds, outperforming the baseline of fully signalized intersections. These findings suggest that integrating reinforcement learning-based control into large-scale traffic can improve overall efficiency and may inform future urban planning strategies.
[561] Clinical-Grade Blood Pressure Prediction in ICU Settings: An Ensemble Framework with Uncertainty Quantification and Cross-Institutional Validation
Md Basit Azam, Sarangthem Ibotombi Singh
Main category: cs.LG
TL;DR: The paper introduces a comprehensive ML framework for BP prediction in ICUs, addressing data leakage, uncertainty quantification, and cross-institutional validation, achieving clinically acceptable performance and providing risk-stratified protocols.
Details
Motivation: Current ML approaches for BP monitoring in ICUs lack external validation, uncertainty quantification, and proper data leakage prevention, limiting their reliability.
Method: The study uses an ensemble of Gradient Boosting, Random Forest, and XGBoost with 74 features across five physiological domains, incorporating systematic leakage prevention and quantile regression for uncertainty.
Result: Internal validation showed strong performance (SBP: R²=0.86, RMSE=6.03 mmHg; DBP: R²=0.49, RMSE=7.13 mmHg), while external validation had 30% degradation, especially in hypotensive patients. Uncertainty quantification provided valid prediction intervals.
Conclusion: The framework offers realistic deployment expectations for AI-assisted BP monitoring in ICUs, with publicly available source code for broader use.
Abstract: Blood pressure (BP) monitoring is critical in intensive care units (ICUs) where hemodynamic instability can rapidly progress to cardiovascular collapse. Current machine learning (ML) approaches suffer from three limitations: lack of external validation, absence of uncertainty quantification, and inadequate data leakage prevention. This study presents the first comprehensive framework with novel algorithmic leakage prevention, uncertainty quantification, and cross-institutional validation for electronic health record (EHR)-based BP predictions. Our methodology implemented systematic data leakage prevention, uncertainty quantification through quantile regression, and external validation between the MIMIC-III and eICU databases. An ensemble framework combines Gradient Boosting, Random Forest, and XGBoost with 74 features across five physiological domains. Internal validation achieved clinically acceptable performance (SBP: R² = 0.86, RMSE = 6.03 mmHg; DBP: R² = 0.49, RMSE = 7.13 mmHg), meeting AAMI standards. External validation showed 30% degradation with critical limitations in hypotensive patients. Uncertainty quantification generated valid prediction intervals (80.3% SBP and 79.9% DBP coverage), enabling risk-stratified protocols with narrow intervals (< 15 mmHg) for standard monitoring and wide intervals (> 30 mmHg) for manual verification. This framework provides realistic deployment expectations for cross-institutional AI-assisted BP monitoring in critical care settings. The source code is publicly available at https://github.com/mdbasit897/clinical-bp-prediction-ehr.
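The quantile-regression intervals can be sketched with scikit-learn by fitting one model per quantile; synthetic data stands in for the 74-feature EHR design matrix:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic SBP-like regression problem (placeholder for the real features).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 10))
y_train = 120 + X_train[:, 0] * 10 + rng.normal(0, 6, 500)

# One model per quantile gives an ~80% central prediction interval.
lo = GradientBoostingRegressor(loss="quantile", alpha=0.1).fit(X_train, y_train)
hi = GradientBoostingRegressor(loss="quantile", alpha=0.9).fit(X_train, y_train)

x_new = rng.normal(size=(1, 10))
width = hi.predict(x_new)[0] - lo.predict(x_new)[0]
# Risk stratification per the paper's protocol: narrow -> standard monitoring,
# wide -> manual verification.
print("verify manually" if width > 30 else "standard monitoring")
```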
[562] Moving Out: Physically-grounded Human-AI Collaboration
Xuhui Kang, Sung-Wook Lee, Haolin Liu, Yuyan Wang, Yen-Ling Kuo
Main category: cs.LG
TL;DR: The paper introduces Moving Out, a benchmark for human-AI collaboration in physical environments, and proposes BASS, a method to improve AI adaptability to human behaviors and physical constraints.
Details
Motivation: To address the challenges of continuous state-action spaces and physical constraints in human-AI collaboration, ensuring effective teamwork in tasks like moving heavy items.
Method: Introduces the Moving Out benchmark and proposes BASS (Behavior Augmentation, Simulation, and Selection) to enhance AI adaptability to diverse human behaviors and physical attributes.
Result: BASS outperforms state-of-the-art models in both AI-AI and human-AI collaboration tasks.
Conclusion: The Moving Out benchmark and BASS method advance human-AI collaboration in physically constrained environments, demonstrating superior performance over existing models.
Abstract: The ability to adapt to physical actions and constraints in an environment is crucial for embodied agents (e.g., robots) to effectively collaborate with humans. Such physically grounded human-AI collaboration must account for the increased complexity of the continuous state-action space and constrained dynamics caused by physical constraints. In this paper, we introduce Moving Out, a new human-AI collaboration benchmark that resembles a wide range of collaboration modes affected by physical attributes and constraints, such as moving heavy items together and maintaining consistent actions to move a big item around a corner. Using Moving Out, we designed two tasks and collected human-human interaction data to evaluate models’ abilities to adapt to diverse human behaviors and unseen physical attributes. To address the challenges in physical environments, we propose a novel method, BASS (Behavior Augmentation, Simulation, and Selection), to enhance the diversity of agents and their understanding of the outcome of actions. Our experiments show that BASS outperforms state-of-the-art models in AI-AI and human-AI collaboration. The project page is available at https://live-robotics-uva.github.io/movingout_ai/.
[563] FedDPG: An Adaptive Yet Efficient Prompt-tuning Approach in Federated Learning Settings
Ali Shakeri, Wei Emma Zhang, Amin Beheshti, Weitong Chen, Jian Yang, Lishan Yang
Main category: cs.LG
TL;DR: FedDPG introduces a dynamic prompt generator for PLMs in federated learning, improving flexibility and efficiency while maintaining privacy.
Details
Motivation: Address the limitations of fixed prompts in prompt-tuning and the challenges of federated learning (e.g., communication and computation constraints).
Method: Proposes Federated Dynamic Prompt Generator (FedDPG), which generates context-aware prompts dynamically for each input, keeping PLM parameters frozen.
Result: Outperforms state-of-the-art methods in global model performance, reduces calculation time, and minimizes parameters sent in FL networks.
Conclusion: FedDPG offers a flexible, efficient, and privacy-preserving solution for fine-tuning PLMs in federated learning.
Abstract: Pre-trained Language Models (PLMs) have demonstrated impressive performance in various NLP tasks. However, traditional fine-tuning methods for leveraging PLMs for downstream tasks entail significant computational overhead. Prompt-tuning has emerged as an efficient alternative that involves prepending a limited number of parameters to the input sequence and only updating them while the PLM’s parameters are frozen. However, this technique’s prompts remain fixed for all inputs, reducing the model’s flexibility. The Federated Learning (FL) technique has gained attention in recent years to address the growing concerns around data privacy. However, challenges such as communication and computation limitations of clients still need to be addressed. To mitigate these challenges, this paper introduces the Federated Dynamic Prompt Generator (FedDPG), which incorporates a dynamic prompt generator network to generate context-aware prompts based on the given input, ensuring flexibility and adaptability while prioritising data privacy in federated learning settings. Our experiments on three NLP benchmark datasets showcase that FedDPG outperforms the state-of-the-art parameter-efficient fine-tuning methods in terms of global model performance, and has significantly reduced the calculation time and the number of parameters to be sent through the FL network.
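A sketch of the dynamic-prompt idea: a small trainable generator maps each input to soft prompt vectors prepended to the frozen PLM's embeddings, so only the generator's parameters would travel over the FL network. All sizes are illustrative:

```python
import torch
import torch.nn as nn

class DynamicPromptGenerator(nn.Module):
    """Sketch of a context-aware prompt generator: pool the input embeddings,
    generate prompt_len soft prompt vectors, and prepend them. The PLM itself
    stays frozen; only this module is trained and communicated."""
    def __init__(self, emb_dim=768, prompt_len=10):
        super().__init__()
        self.prompt_len = prompt_len
        self.gen = nn.Sequential(
            nn.Linear(emb_dim, 256), nn.ReLU(),
            nn.Linear(256, prompt_len * emb_dim),
        )

    def forward(self, input_embeds):                 # (B, T, emb_dim)
        pooled = input_embeds.mean(dim=1)            # context summary
        prompts = self.gen(pooled).view(-1, self.prompt_len,
                                        input_embeds.size(-1))
        return torch.cat([prompts, input_embeds], dim=1)

gen = DynamicPromptGenerator()
out = gen(torch.randn(2, 32, 768))                   # (2, 42, 768)
```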
[564] Graph Learning Metallic Glass Discovery from Wikipedia
K. -C. Ouyang, S. -Y. Zhang, S. -L. Liu, J. Tian, Y. -H. Li, H. Tong, H. -Y. Bai, W. -H. Wang, Y. -C. Hu
Main category: cs.LG
TL;DR: The paper proposes a data-driven approach using graph neural networks and Wikipedia embeddings to efficiently design new metallic glasses and other materials.
Details
Motivation: Traditional material synthesis is slow and expensive, especially for metallic glasses, due to the need for optimal multi-element combinations. Data scarcity and poor material encoding limit current machine learning approaches.
Method: The study uses graph neural networks with Wikipedia-based node element encodings to explore material relationships. It leverages multilingual Wikipedia embeddings to assess natural language’s role in materials design.
Result: The approach offers a new paradigm for intelligent materials design, improving predictability and generalizability over traditional methods.
Conclusion: The proposed method enables efficient exploration of new amorphous materials, showcasing AI’s potential in advanced materials discovery.
Abstract: Synthesizing new materials efficiently is highly demanded in various research fields. However, this process is usually slow and expensive, especially for metallic glasses, whose formation strongly depends on the optimal combinations of multiple elements to resist crystallization. This constraint has left only several thousand candidates explored in the vast material space since 1960. Recently, data-driven approaches armed with advanced machine learning techniques have provided alternative routes for intelligent materials design. Due to data scarcity and immature material encoding, the conventional tabular data is usually mined by statistical learning algorithms, giving limited model predictability and generalizability. Here, we propose sophisticated data learning from material network representations. The node elements are encoded from Wikipedia by a language model. Graph neural networks with versatile architectures are designed to serve as recommendation systems to explore hidden relationships among materials. By employing Wikipedia embeddings from different languages, we assess the capability of natural languages in materials design. Our study proposes a new paradigm for harvesting new amorphous materials and beyond with artificial intelligence.
[565] HeLo: Heterogeneous Multi-Modal Fusion with Label Correlation for Emotion Distribution Learning
Chuhang Zheng, Chunwei Tian, Jie Wen, Daoqiang Zhang, Qi Zhu
Main category: cs.LG
TL;DR: The paper proposes HeLo, a multi-modal emotion distribution learning framework, to address challenges in mining modality heterogeneity and exploiting semantic correlations in mixed basic emotions.
Details
Motivation: Existing emotion distribution learning (EDL) methods struggle with modality heterogeneity and underutilized semantic correlations across basic emotions.
Method: HeLo uses cross-attention for physiological data fusion, an OT-based module for modality heterogeneity, and learnable label embeddings with correlation matrix alignment for label correlation.
Result: Experiments on two datasets show HeLo’s superiority in emotion distribution learning.
Conclusion: HeLo effectively explores modality heterogeneity and label correlations, improving multi-modal emotion recognition.
Abstract: Multi-modal emotion recognition has garnered increasing attention as it plays a significant role in human-computer interaction (HCI) in recent years. Since different discrete emotions may exist at the same time, compared with single-class emotion recognition, emotion distribution learning (EDL) that identifies a mixture of basic emotions has gradually emerged as a trend. However, existing EDL methods face challenges in mining the heterogeneity among multiple modalities. Besides, rich semantic correlations across arbitrary basic emotions are not fully exploited. In this paper, we propose a multi-modal emotion distribution learning framework, named HeLo, aimed at fully exploring the heterogeneity and complementary information in multi-modal emotional data and label correlation within mixed basic emotions. Specifically, we first adopt cross-attention to effectively fuse the physiological data. Then, an optimal transport (OT)-based heterogeneity mining module is devised to mine the interaction and heterogeneity between the physiological and behavioral representations. To facilitate label correlation learning, we introduce a learnable label embedding optimized by correlation matrix alignment. Finally, the learnable label embeddings and label correlation matrices are integrated with the multi-modal representations through a novel label correlation-driven cross-attention mechanism for accurate emotion distribution learning. Experimental results on two publicly available datasets demonstrate the superiority of our proposed method in emotion distribution learning.
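The cross-attention fusion step can be illustrated with PyTorch's built-in attention; the modalities and dimensions below are placeholders, not HeLo's actual configuration:

```python
import torch
import torch.nn as nn

# One physiological stream attends to another (e.g., EEG queries, ECG keys
# and values); the fused output carries ECG context aligned to EEG steps.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
eeg = torch.randn(8, 50, 64)   # (batch, time, dim)
ecg = torch.randn(8, 30, 64)
fused, _ = attn(query=eeg, key=ecg, value=ecg)  # (8, 50, 64)
```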
[566] Swift-Sarsa: Fast and Robust Linear Control
Khurram Javed, Richard S. Sutton
Main category: cs.LG
TL;DR: Swift-Sarsa extends SwiftTD for control problems, outperforming existing methods on a challenging benchmark by learning to differentiate relevant signals from noise.
Details
Motivation: The paper aims to extend SwiftTD, a successful TD learning algorithm, to control problems, addressing challenges like noisy signals and non-stationary distributions.
Method: Combines SwiftTD with True Online Sarsa(λ) to create Swift-Sarsa, tested on the operant conditioning benchmark where only a few signals are relevant.
Result: Swift-Sarsa successfully learned to assign credit to relevant signals without prior knowledge, handling noisy features effectively.
Conclusion: Swift-Sarsa shows promise for learning representations in high-dimensional, noisy environments, enabling scalable solutions.
Abstract: Javed, Sharifnassab, and Sutton (2024) introduced a new algorithm for TD learning – SwiftTD – that augments True Online TD($\lambda$) with step-size optimization, a bound on the effective learning rate, and step-size decay. In their experiments SwiftTD outperformed True Online TD($\lambda$) and TD($\lambda$) on a variety of prediction tasks derived from Atari games, and its performance was robust to the choice of hyper-parameters. In this extended abstract we extend SwiftTD to work for control problems. We combine the key ideas behind SwiftTD with True Online Sarsa($\lambda$) to develop an on-policy reinforcement learning algorithm called $\textit{Swift-Sarsa}$. We propose a simple benchmark for linear on-policy control called the $\textit{operant conditioning benchmark}$. The key challenge in the operant conditioning benchmark is that a very small subset of input signals are relevant for decision making. The majority of the signals are noise sampled from a non-stationary distribution. To learn effectively, the agent must learn to differentiate between the relevant signals and the noisy signals, and minimize prediction errors by assigning credit to the weight parameters associated with the relevant signals. Swift-Sarsa, when applied to the operant conditioning benchmark, learned to assign credit to the relevant signals without any prior knowledge of the structure of the problem. It opens the door for solution methods that learn representations by searching over hundreds of millions of features in parallel without performance degradation due to noisy or bad features.
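As background, a minimal sketch of one True Online Sarsa($\lambda$) update with linear function approximation is shown below; SwiftTD's per-weight step-size optimization, effective-learning-rate bound, and step-size decay are omitted, so this is the building block Swift-Sarsa starts from, not Swift-Sarsa itself.

```python
import numpy as np

def true_online_sarsa_update(w, z, q_old, phi, r, phi_next, alpha, gamma, lam):
    """One True Online Sarsa(lambda) step with linear features.
    SwiftTD's per-weight step-size optimization, learning-rate bound,
    and step-size decay are deliberately omitted from this sketch."""
    q = w @ phi                                # current action value
    q_next = w @ phi_next                      # next action value
    delta = r + gamma * q_next - q             # TD error
    # true-online eligibility trace with the dutch-trace correction
    z = gamma * lam * z + phi - alpha * gamma * lam * (z @ phi) * phi
    # true-online weight update
    w = w + alpha * (delta + q - q_old) * z - alpha * (q - q_old) * phi
    return w, z, q_next                        # q_next becomes next q_old
```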
[567] Latent Representations of Intracardiac Electrograms for Atrial Fibrillation Driver Detection
Pablo Peiro-Corbacho, Long Lin, Pablo Ávila, Alejandro Carta-Bergaz, Ángel Arenal, Carlos Sevilla-Salcedo, Gonzalo R. Ríos-Muñoz
Main category: cs.LG
TL;DR: A deep learning framework using convolutional autoencoders is proposed for unsupervised feature extraction from atrial electrograms to detect AF drivers, achieving moderate to high performance in identifying arrhythmogenic regions.
Details
Motivation: Current ablation therapies for persistent AF are often ineffective due to non-pulmonary vein drivers, necessitating better methods for detecting AF drivers.
Method: The study uses convolutional autoencoders for unsupervised feature extraction from unipolar and bipolar intracavitary electrograms (EGMs) recorded during AF.
Result: The autoencoders achieved low reconstruction loss and enabled downstream classifiers to detect rotational/focal activity (AUC 0.73-0.76) and EGM entanglement (AUC 0.93).
Conclusion: The method can integrate into clinical systems for real-time AF driver identification, showcasing unsupervised learning’s potential for physiologically meaningful feature extraction.
Abstract: Atrial Fibrillation (AF) is the most prevalent sustained arrhythmia, yet current ablation therapies, including pulmonary vein isolation, are frequently ineffective in persistent AF due to the involvement of non-pulmonary vein drivers. This study proposes a deep learning framework using convolutional autoencoders for unsupervised feature extraction from unipolar and bipolar intracavitary electrograms (EGMs) recorded during AF in ablation studies. These latent representations of atrial electrical activity enable the characterization and automation of EGM analysis, facilitating the detection of AF drivers. The database consisted of 11,404 acquisitions recorded from 291 patients, containing 228,080 unipolar EGMs and 171,060 bipolar EGMs. The autoencoders successfully learned latent representations with low reconstruction loss, preserving the morphological features. The extracted embeddings allowed downstream classifiers to detect rotational and focal activity with moderate performance (AUC 0.73-0.76) and achieved high discriminative performance in identifying atrial EGM entanglement (AUC 0.93). The proposed method can operate in real-time and enables integration into clinical electroanatomical mapping systems to assist in identifying arrhythmogenic regions during ablation procedures. This work highlights the potential of unsupervised learning to uncover physiologically meaningful features from intracardiac signals.
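A minimal 1D convolutional autoencoder in the spirit of the framework, sketched in PyTorch; the window length (1,000 samples), channel counts, and latent size are illustrative assumptions rather than the paper's architecture.

```python
import torch.nn as nn

class EGMAutoencoder(nn.Module):
    """Minimal 1D conv autoencoder for fixed-length electrogram windows.
    Assumes 1,000-sample single-channel inputs; all sizes illustrative."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 250, latent_dim),   # latent used by classifiers
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 250), nn.ReLU(),
            nn.Unflatten(1, (32, 250)),
            nn.ConvTranspose1d(32, 16, 7, stride=2, padding=3,
                               output_padding=1), nn.ReLU(),
            nn.ConvTranspose1d(16, 1, 7, stride=2, padding=3,
                               output_padding=1),
        )

    def forward(self, x):                      # x: (batch, 1, 1000)
        z = self.encoder(x)
        return self.decoder(z), z              # reconstruction + embedding
```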
[568] Harnessing intuitive local evolution rules for physical learning
Roie Ezraty, Menachem Stern, Shmuel M. Rubinstein
Main category: cs.LG
TL;DR: A training scheme for physical systems (BEASTS) minimizes power dissipation by using boundary parameters and local physical rules, achieving autonomous learning for regression and classification.
Details
Motivation: To address the high computational and power demands of traditional Machine Learning by exploring alternative physical implementations.
Method: Introduces BEASTAL, a scheme for Boundary-Enabled Adaptive State Tuning Systems, leveraging local physical rules and boundary control.
Result: Demonstrates autonomous learning in silico for linear tasks, with best performance for non-linear local rules.
Conclusion: BEASTAL advances physical learning by simplifying architectures and avoiding large-scale memory, making it efficient for linear tasks.
Abstract: Machine Learning, however popular and accessible, is computationally intensive and highly power-consuming, prompting interest in alternative physical implementations of learning tasks. We introduce a training scheme for physical systems that minimize power dissipation in which only boundary parameters (i.e. inputs and outputs) are externally controlled. Using this scheme, these Boundary-Enabled Adaptive State Tuning Systems (BEASTS) learn by exploiting local physical rules. Our scheme, BEASTAL (BEAST-Adaline), is the closest analog of the Adaline algorithm for such systems. We demonstrate this autonomous learning in silico for regression and classification tasks. Our approach advances previous physical learning schemes by using intuitive, local evolution rules without requiring large-scale memory or complex internal architectures. BEASTAL can perform any linear task, achieving best performance when the local evolution rule is non-linear.
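Since BEASTAL is described as the closest analog of Adaline, the classical Adaline (LMS) rule is worth restating; in a BEAST system the analogous correction would be realized by the physics itself rather than by explicit arithmetic.

```python
import numpy as np

def adaline_update(w, x, target, lr=0.01):
    """Classical Adaline / LMS delta rule: a purely local correction
    proportional to the output error. In BEASTS the analogous correction
    would emerge from the system's dissipation dynamics, not arithmetic."""
    y = w @ x                    # linear readout at the output boundary
    err = target - y             # error observed at the boundary
    return w + lr * err * x      # local update: each weight sees only x_i
```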
[569] Federated Calculation of the Free-Support Transportation Barycenter by Single-Loop Dual Decomposition
Zhengqi Lin, Andrzej Ruszczyński
Main category: cs.LG
TL;DR: An efficient federated dual decomposition algorithm for Wasserstein barycenter computation, avoiding local data access and matrix-vector operations, ensuring low complexity and scalability.
Details
Motivation: To address the challenge of efficiently computing Wasserstein barycenters of distributions without accessing local data or solving repeated mass transportation problems.
Method: Proposes a federated dual decomposition algorithm that uses highly aggregated information and avoids matrix-vector operations.
Result: The algorithm demonstrates low iteration complexity, scalability, and outperforms state-of-the-art methods in tests on mixture models.
Conclusion: The proposed algorithm is efficient, scalable, and avoids data privacy concerns by not accessing local data directly.
Abstract: We propose an efficient federated dual decomposition algorithm for calculating the Wasserstein barycenter of several distributions, including choosing the support of the solution. The algorithm does not access local data and uses only highly aggregated information. It also does not require repeated solutions to mass transportation problems. Because of the absence of any matrix-vector operations, the algorithm exhibits a very low complexity of each iteration and significant scalability. We illustrate its virtues and compare it to the state-of-the-art methods on several examples of mixture models.
[570] Efficient and Scalable Agentic AI with Heterogeneous Systems
Zain Asgar, Michelle Nguyen, Sachin Katti
Main category: cs.LG
TL;DR: A system design for dynamic orchestration of AI agent workloads on heterogeneous compute infrastructure, optimizing TCO by leveraging diverse hardware combinations.
Details
Motivation: The need for efficient and scalable deployment infrastructure for dynamic and complex AI agent workloads.
Method: A framework for planning and optimizing execution graphs, MLIR-based compilation, and dynamic orchestration across heterogeneous hardware.
Result: Preliminary results show significant TCO benefits, with some workloads performing similarly on mixed older and newer hardware.
Conclusion: Heterogeneous infrastructure can extend the life of deployed hardware while maintaining performance, offering cost-effective solutions for AI agent workloads.
Abstract: AI agents are emerging as a dominant workload in a wide range of applications, promising to be the vehicle that delivers the benefits of AI to enterprises and consumers. Unlike conventional software or static inference, agentic workloads are dynamic and structurally complex. Often these agents are directed graphs of compute and IO operations that span multi-modal data input and conversion, data processing and context gathering (e.g., vector DB lookups), multiple LLM inferences, tool calls, etc. To scale AI agent usage, we need efficient and scalable deployment and agent-serving infrastructure. To tackle this challenge, in this paper we present a system design for dynamic orchestration of AI agent workloads on heterogeneous compute infrastructure spanning CPUs and accelerators, both from different vendors and across different performance tiers within a single vendor. The system delivers several building blocks: a framework for planning and optimizing agentic AI execution graphs using cost models that account for compute, memory, and bandwidth constraints of different HW; an MLIR-based representation and compilation system that can decompose AI agent execution graphs into granular operators and generate code for different HW options; and a dynamic orchestration system that can place the granular components across a heterogeneous compute infrastructure and stitch them together while meeting an end-to-end SLA. Our design performs a systems-level TCO optimization, and preliminary results show that leveraging a heterogeneous infrastructure can deliver significant TCO benefits. A preliminary and surprising finding is that for some workloads a heterogeneous combination of older-generation GPUs with newer accelerators can deliver similar TCO as the latest-generation homogeneous GPU infrastructure design, potentially extending the life of deployed infrastructure.
[571] Directly Learning Stock Trading Strategies Through Profit Guided Loss Functions
Devroop Kar, Zimeng Lyu, Sheeraja Rajakrishnan, Hao Zhang, Alex Ororbia, Travis Desell, Daniel Krutz
Main category: cs.LG
TL;DR: Proposed four novel loss functions for stock trading strategies using neural networks, outperforming benchmarks in profit generation.
Details
Motivation: Addressing the challenge of volatile stock markets by improving trading decision-making.
Method: Developed loss functions for profit/loss calculation, applied to time-series models like transformers.
Result: Achieved higher returns (up to 51.42%) compared to reinforcement learning methods (max 41.58%).
Conclusion: The novel loss functions enable effective trading strategies, outperforming existing methods.
Abstract: Stock trading has always been a challenging task due to the highly volatile nature of the stock market. Making sound trading decisions to generate profit is particularly difficult under such conditions. To address this, we propose four novel loss functions to drive decision-making for a portfolio of stocks. These functions account for the potential profits or losses with respect to buying or shorting the respective stocks, enabling potentially any artificial neural network to directly learn an effective trading strategy. Despite the high volatility of stock market fluctuations over time, training time-series models such as transformers on these loss functions resulted in trading strategies that generated significant profits on a portfolio of 50 different S&P 500 company stocks, as compared to benchmark reinforcement learning techniques and a baseline buy-and-hold method. As an example, using 2021, 2022 and 2023 as three test periods, the Crossformer model adapted with our best loss function was most consistent, resulting in returns of 51.42%, 51.04% and 48.62% respectively. In comparison, the best-performing state-of-the-art reinforcement learning methods, PPO and DDPG, only delivered maximum profits of around 41%, 2.81% and 41.58% for the same periods. The code is available at https://anonymous.4open.science/r/bandit-stock-trading-58C8/README.md.
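The four loss functions are not reproduced in the abstract, but a minimal profit-guided objective of this general kind might look like the following sketch, where the network's output is interpreted as a signed position per stock; this is an assumption for illustration, not the paper's formulation.

```python
import torch

def profit_guided_loss(positions, next_returns):
    """One plausible profit-guided objective (assumption, not the paper's):
    the network outputs a signed position per stock (long > 0, short < 0)
    and the loss is the negated mean portfolio return, so gradient descent
    directly maximizes profit."""
    pnl = positions * next_returns        # elementwise per-stock P&L
    return -pnl.mean()

# usage sketch: positions = torch.tanh(model(features)) in [-1, 1],
# next_returns = next-period percentage price changes
```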
[572] Feature learning is decoupled from generalization in high capacity neural networks
Niclas Alexander Göring, Charles London, Abdurrahman Hadi Erturk, Chris Mingard, Yoonsoo Nam, Ard A. Louis
Main category: cs.LG
TL;DR: Neural networks outperform kernel methods due to feature learning, but current theories focus on feature strength, not quality, limiting understanding of generalization.
Details
Motivation: To understand why neural networks outperform kernel methods and to evaluate the quality of learned features.
Method: Introduces the concept of feature quality, examines existing feature learning theories, and provides empirical evidence.
Result: Current theories assess feature learning strength but not feature quality, limiting generalization theories.
Conclusion: New theories focusing on feature quality are needed to better understand neural network generalization.
Abstract: Neural networks outperform kernel methods, sometimes by orders of magnitude, e.g. on staircase functions. This advantage stems from the ability of neural networks to learn features, adapting their hidden representations to better capture the data. We introduce a concept we call feature quality to measure this performance improvement. We examine existing theories of feature learning and demonstrate empirically that they primarily assess the strength of feature learning, rather than the quality of the learned features themselves. Consequently, current theories of feature learning do not provide a sufficient foundation for developing theories of neural network generalization.
[573] Salsa as a Nonverbal Embodied Language – The CoMPAS3D Dataset and Benchmarks
Bermet Burkanova, Payam Jome Yazdian, Chuxuan Zhang, Trinity Evans, Paige Tuttösí, Angelica Lim
Main category: cs.LG
TL;DR: CoMPAS3D is a large motion capture dataset of salsa dancing for testing interactive humanoid AI, with annotations and benchmarks for dance generation tasks.
Details
Motivation: Human communication includes embodied movement, which current AI systems lack. Partner dance is a challenging testbed for interactive, expressive AI.
Method: Created CoMPAS3D, a dataset of 3 hours of salsa dances with expert annotations, and developed a multitask SalsaAgent model for benchmark tasks.
Result: The dataset includes 2,800 move segments with detailed annotations. The SalsaAgent model performs well on leader/follower and duet generation tasks.
Conclusion: CoMPAS3D advances research in socially interactive AI and expressive humanoid motion, with released data and models to encourage further work.
Abstract: Imagine a humanoid that can safely and creatively dance with a human, adapting to its partner’s proficiency, using haptic signaling as a primary form of communication. While today’s AI systems excel at text or voice-based interaction with large language models, human communication extends far beyond text: it includes embodied movement, timing, and physical coordination. Modeling coupled interaction between two agents poses a formidable challenge: it is continuous, bidirectionally reactive, and shaped by individual variation. We present CoMPAS3D, the largest and most diverse motion capture dataset of improvised salsa dancing, designed as a challenging testbed for interactive, expressive humanoid AI. The dataset includes 3 hours of leader-follower salsa dances performed by 18 dancers spanning beginner, intermediate, and professional skill levels. For the first time, we provide fine-grained salsa expert annotations, covering over 2,800 move segments, including move types, combinations, execution errors and stylistic elements. We draw analogies between partner dance communication and natural language, evaluating CoMPAS3D on two benchmark tasks for synthetic humans that parallel key problems in spoken language and dialogue processing: leader or follower generation with proficiency levels (speaker or listener synthesis), and duet (conversation) generation. Towards a long-term goal of partner dance with humans, we release the dataset, annotations, and code, along with a multitask SalsaAgent model capable of performing all benchmark tasks, alongside additional baselines to encourage research in socially interactive embodied AI and creative, expressive humanoid motion generation.
[574] KD-GAT: Combining Knowledge Distillation and Graph Attention Transformer for a Controller Area Network Intrusion Detection System
Robert Frenken, Sidra Ghayour Bhatti, Hanqin Zhang, Qadeer Ahmed
Main category: cs.LG
TL;DR: KD-GAT, a CAN intrusion detection framework using Graph Attention Networks and knowledge distillation, improves accuracy and reduces complexity.
Details
Motivation: The CAN protocol lacks security, making it vulnerable to attacks. KD-GAT aims to enhance detection accuracy efficiently.
Method: CAN traffic is modeled as graphs with sliding windows. A teacher GAT model trains a compact student GAT via supervised pretraining and distillation.
Result: High accuracy (99.97% and 99.31%) on two datasets, but performance drops on imbalanced data.
Conclusion: KD-GAT is effective but needs improvement for imbalanced datasets.
Abstract: The Controller Area Network (CAN) protocol is widely adopted for in-vehicle communication but lacks inherent security mechanisms, making it vulnerable to cyberattacks. This paper introduces KD-GAT, an intrusion detection framework that combines Graph Attention Networks (GATs) with knowledge distillation (KD) to enhance detection accuracy while reducing computational complexity. In our approach, CAN traffic is represented as graphs using a sliding window to capture temporal and relational patterns. A multi-layer GAT with jumping knowledge aggregation acts as the teacher model, while a compact student GAT, only 6.32% the size of the teacher, is trained via a two-phase process involving supervised pretraining and knowledge distillation with both soft and hard label supervision. Experiments on three benchmark datasets (Car-Hacking, Car-Survival, and can-train-and-test) demonstrate that both teacher and student models achieve strong results, with the student model attaining 99.97% and 99.31% accuracy on Car-Hacking and Car-Survival, respectively. However, significant class imbalance in can-train-and-test led to reduced performance for both models on this dataset. Addressing this imbalance remains an important direction for future work.
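The two-phase student training uses both soft and hard label supervision; a standard distillation objective of that kind looks like the sketch below, where the temperature T and mixing weight alpha are illustrative.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Standard soft+hard knowledge-distillation objective of the general
    kind described; T and alpha are illustrative choices."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale soft-target gradients
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```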
[575] NAICS-Aware Graph Neural Networks for Large-Scale POI Co-visitation Prediction: A Multi-Modal Dataset and Methodology
Yazeed Alrubyli, Omar Alomeir, Abrar Wafa, Diána Hidvégi, Hend Alrasheed, Mohsen Bahrami
Main category: cs.LG
TL;DR: The paper introduces NAICS-aware GraphSAGE, a graph neural network that integrates business taxonomy knowledge to predict co-visitation patterns, outperforming traditional spatial models.
Details
Motivation: Understanding co-visitation patterns is vital for urban planning and retail analytics, but existing methods fail due to data sparsity and the complexity of business relationships.
Method: The proposed NAICS-aware GraphSAGE uses learnable embeddings of business taxonomy (NAICS codes) alongside spatial, temporal, and socioeconomic features to predict co-visitation patterns at scale.
Result: The method achieves a 157% improvement in R-squared (0.243 to 0.625) and a 32% boost in NDCG@10, tested on 94.9 million co-visitation records.
Conclusion: Incorporating business semantics through NAICS codes significantly enhances co-visitation prediction, offering scalable and accurate insights for urban and retail applications.
Abstract: Understanding where people go after visiting one business is crucial for urban planning, retail analytics, and location-based services. However, predicting these co-visitation patterns across millions of venues remains challenging due to extreme data sparsity and the complex interplay between spatial proximity and business relationships. Traditional approaches using only geographic distance fail to capture why coffee shops attract different customer flows than fine dining restaurants, even when co-located. We introduce NAICS-aware GraphSAGE, a novel graph neural network that integrates business taxonomy knowledge through learnable embeddings to predict population-scale co-visitation patterns. Our key insight is that business semantics, captured through detailed industry codes, provide crucial signals that pure spatial models cannot explain. The approach scales to massive datasets (4.2 billion potential venue pairs) through efficient state-wise decomposition while combining spatial, temporal, and socioeconomic features in an end-to-end framework. Evaluated on our POI-Graph dataset comprising 94.9 million co-visitation records across 92,486 brands and 48 US states, our method achieves significant improvements over state-of-the-art baselines: the R-squared value increases from 0.243 to 0.625 (a 157 percent improvement), with strong gains in ranking quality (32 percent improvement in NDCG at 10).
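A minimal sketch of the central idea, learnable NAICS-code embeddings combined with pair features to score co-visitation; the dimensions and MLP head are assumptions, and the paper's actual GraphSAGE backbone is not shown.

```python
import torch
import torch.nn as nn

class NAICSEdgeScorer(nn.Module):
    """Sketch: learnable NAICS-code embeddings concatenated with
    spatial/temporal pair features to score a venue pair. Dimensions and
    the MLP head are assumptions, standing in for the GraphSAGE backbone."""
    def __init__(self, n_codes=1000, code_dim=32, feat_dim=16):
        super().__init__()
        self.code_emb = nn.Embedding(n_codes, code_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * code_dim + feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, code_a, code_b, pair_feats):
        h = torch.cat(
            [self.code_emb(code_a), self.code_emb(code_b), pair_feats], dim=-1
        )
        return self.mlp(h).squeeze(-1)   # predicted co-visitation score
```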
[576] Disjoint Generative Models
Anton Danholt Lautrup, Muhammad Rajabinasab, Tobias Hyrup, Arthur Zimek, Peter Schneider-Kamp
Main category: cs.LG
TL;DR: A framework for generating synthetic datasets using disjoint generative models, enhancing privacy with minimal utility loss.
Details
Motivation: To improve privacy in synthetic data generation by partitioning datasets and using separate generative models.
Method: Partition datasets into disjoint subsets, apply separate generative models, and combine results post hoc without common identifiers.
Result: Demonstrated success in case studies, showing low utility cost and increased privacy. Also found effectiveness for certain models and mixed-model synthesis.
Conclusion: Disjoint generative models offer a viable solution for privacy-preserving synthetic data generation with minimal trade-offs.
Abstract: We propose a new framework for generating cross-sectional synthetic datasets via disjoint generative models. In this paradigm, a dataset is partitioned into disjoint subsets that are supplied to separate instances of generative models. The results are then combined post hoc by a joining operation that works in the absence of common variables/identifiers. The success of the framework is demonstrated through several case studies and examples on tabular data that help illuminate some of the design choices one may make. The principal benefit of disjoint generative models is significantly increased privacy at only a low utility cost. Additional findings include increased effectiveness and feasibility for certain model types and the possibility of mixed-model synthesis.
[577] Beyond Nearest Neighbors: Semantic Compression and Graph-Augmented Retrieval for Enhanced Vector Search
Rahul Raja, Arpita Vats
Main category: cs.LG
TL;DR: The paper introduces semantic compression, a new retrieval paradigm for vector databases that prioritizes diversity and semantic coverage over traditional top-k nearest neighbor search, using submodular optimization and graph-augmented methods.
Details
Motivation: Traditional ANN search often produces redundant results lacking diversity, which is insufficient for applications like RAG, multi-hop QA, and memory-augmented agents.
Method: Proposes semantic compression via submodular optimization and graph-augmented vector retrieval, overlaying semantic graphs on vector spaces for context-aware search.
Result: Demonstrates improved semantic coverage and diversity, generalizing top-k retrieval by leveraging graph structures and information geometry.
Conclusion: The work lays a foundation for meaning-centric vector search systems, advocating hybrid indexing and diversity-aware querying, with open implementation for further research.
Abstract: Vector databases typically rely on approximate nearest neighbor (ANN) search to retrieve the top-k closest vectors to a query in embedding space. While effective, this approach often yields semantically redundant results, missing the diversity and contextual richness required by applications such as retrieval-augmented generation (RAG), multi-hop QA, and memory-augmented agents. We introduce a new retrieval paradigm: semantic compression, which aims to select a compact, representative set of vectors that captures the broader semantic structure around a query. We formalize this objective using principles from submodular optimization and information geometry, and show that it generalizes traditional top-k retrieval by prioritizing coverage and diversity. To operationalize this idea, we propose graph-augmented vector retrieval, which overlays semantic graphs (e.g., kNN or knowledge-based links) atop vector spaces to enable multi-hop, context-aware search. We theoretically analyze the limitations of proximity-based retrieval under high-dimensional concentration and highlight how graph structures can improve semantic coverage. Our work outlines a foundation for meaning-centric vector search systems, emphasizing hybrid indexing, diversity-aware querying, and structured semantic retrieval. We make our implementation publicly available to foster future research in this area.
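One plausible instantiation of the submodular objective is greedy facility-location selection, sketched below; this illustrates the general technique, not necessarily the paper's exact objective.

```python
import numpy as np

def semantic_compress(sims, k):
    """Greedy facility-location selection: pick k vectors whose summed
    best-coverage of all candidates is maximal. `sims` is an (n, n)
    similarity matrix; one plausible instantiation of the submodular
    formulation, not necessarily the paper's exact objective."""
    n = sims.shape[0]
    selected, covered = [], np.zeros(n)
    for _ in range(k):
        # marginal gain of adding j: improvement in per-item best similarity
        gains = np.maximum(sims, covered[None, :]).sum(axis=1) - covered.sum()
        j = int(np.argmax(gains))
        selected.append(j)
        covered = np.maximum(covered, sims[j])
    return selected
```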
[578] Predicting Human Mobility in Disasters via LLM-Enhanced Cross-City Learning
Yinzhou Tang, Huandong Wang, Xiaochen Fan, Yong Li
Main category: cs.LG
TL;DR: DisasterMobLLM is a framework for predicting human mobility in disaster scenarios, improving accuracy by 32.8% (Acc@1) and 35.0% (F1-score) over baselines.
Details
Motivation: Urbanization and climate change increase cities' vulnerability to disasters, necessitating better mobility prediction for early warnings and resource allocation. Existing models fail in disaster scenarios due to shifted mobility patterns.
Method: DisasterMobLLM integrates LLMs to model mobility intentions and transfer disaster knowledge between cities. It uses a RAG-Enhanced Intention Predictor, LLM-based Intention Refiner, and Intention-Modulated Location Predictor.
Result: Achieves 32.8% improvement in Acc@1 and 35.0% in F1-score for immobility prediction compared to baselines.
Conclusion: DisasterMobLLM effectively addresses the gap in disaster scenario mobility prediction, offering significant performance improvements.
Abstract: The vulnerability of cities to natural disasters has increased with urbanization and climate change, making it more important to predict human mobility in the disaster scenarios for downstream tasks including location-based early disaster warning and pre-allocating rescue resources, etc. However, existing human mobility prediction models are mainly designed for normal scenarios, and fail to adapt to disaster scenarios due to the shift of human mobility patterns under disaster. To address this issue, we introduce \textbf{DisasterMobLLM}, a mobility prediction framework for disaster scenarios that can be integrated into existing deep mobility prediction methods by leveraging LLMs to model the mobility intention and transferring the common knowledge of how different disasters affect mobility intentions between cities. This framework utilizes a RAG-Enhanced Intention Predictor to forecast the next intention, refines it with an LLM-based Intention Refiner, and then maps the intention to an exact location using an Intention-Modulated Location Predictor. Extensive experiments illustrate that DisasterMobLLM can achieve a 32.8% improvement in terms of Acc@1 and a 35.0% improvement in terms of the F1-score of predicting immobility compared to the baselines. The code is available at https://github.com/tsinghua-fib-lab/DisasterMobLLM.
[579] Modeling enzyme temperature stability from sequence segment perspective
Ziqi Zhang, Shiheng Chen, Runze Yang, Zhisheng Wei, Wei Zhang, Lei Wang, Zhanzhi Liu, Fengshan Zhang, Jing Wu, Xiaoyong Pan, Hongbin Shen, Longbing Cao, Zhaohong Deng
Main category: cs.LG
TL;DR: A novel deep learning framework, Segment Transformer, is introduced for predicting enzyme temperature stability, achieving state-of-the-art performance and experimentally validated improvements in enzyme thermal behavior.
Details
Motivation: Experimental determination of enzyme thermal stability is costly and time-consuming, while existing computational methods face data limitations. This work addresses these challenges with a curated dataset and a new model.
Method: The Segment Transformer leverages a curated dataset for enzyme thermal modeling, using segment-level representations to account for unequal contributions of protein regions to thermal behavior.
Result: The model achieves RMSE of 24.03, MAE of 18.09, and Pearson/Spearman correlations of 0.33. Experimental validation showed a 1.64-fold improvement in enzyme activity post-heat treatment with 17 mutations.
Conclusion: The Segment Transformer effectively predicts enzyme thermal stability and guides enzyme engineering, demonstrating practical utility in industrial and research applications.
Abstract: Developing enzymes with desired thermal properties is crucial for a wide range of industrial and research applications, and determining temperature stability is an essential step in this process. Experimental determination of thermal parameters is labor-intensive, time-consuming, and costly. Moreover, existing computational approaches are often hindered by limited data availability and imbalanced distributions. To address these challenges, we introduce a curated temperature stability dataset designed for model development and benchmarking in enzyme thermal modeling. Leveraging this dataset, we present the \textit{Segment Transformer}, a novel deep learning framework that enables efficient and accurate prediction of enzyme temperature stability. The model achieves state-of-the-art performance with an RMSE of 24.03, MAE of 18.09, and Pearson and Spearman correlations of 0.33. These results highlight the effectiveness of incorporating segment-level representations, grounded in the biological observation that different regions of a protein sequence contribute unequally to thermal behavior. As a proof of concept, we applied the Segment Transformer to guide the engineering of a cutinase enzyme. Experimental validation demonstrated a 1.64-fold improvement in relative activity following heat treatment, achieved through only 17 mutations and without compromising catalytic function.
[580] Large Language Model Agent for Structural Drawing Generation Using ReAct Prompt Engineering and Retrieval Augmented Generation
Xin Zhang, Lissette Iturburu, Juan Nicolas Villamizar, Xiaoyu Liu, Manuel Salmeron, Shirley J. Dyke, Julio Ramirez
Main category: cs.LG
TL;DR: A generative AI method using LLM and RAG automates structural drawing creation in AutoCAD from natural language descriptions, reducing manual effort.
Details
Motivation: Structural drawings are labor-intensive to create manually despite software advancements, necessitating an automated solution.
Method: Uses a large language model (LLM) with retrieval-augmented generation (RAG) to process natural language and generate AutoCAD drawings.
Result: The method efficiently converts natural language descriptions into structural drawings, reducing workload and simplifying design iteration.
Conclusion: The AI-based approach streamlines structural drawing production, enhancing efficiency and reducing manual effort.
Abstract: Structural drawings are widely used in many fields, e.g., mechanical engineering, civil engineering, etc. In civil engineering, structural drawings serve as the main communication tool between architects, engineers, and builders to avoid conflicts, act as legal documentation, and provide a reference for future maintenance or evaluation needs. They are often organized using key elements such as title/subtitle blocks, scales, plan views, elevation views, sections, and detailed sections, which are annotated with standardized symbols and line types for interpretation by engineers and contractors. Despite advances in software capabilities, the task of generating a structural drawing remains labor-intensive and time-consuming for structural engineers. Here we introduce a novel generative AI-based method for generating structural drawings employing a large language model (LLM) agent. The method incorporates a retrieval-augmented generation (RAG) technique using externally-sourced facts to enhance the accuracy and reliability of the language model. This method is capable of understanding varied natural language descriptions, processing them to extract the necessary information, and generating code to produce the desired structural drawing in AutoCAD. The approach developed, demonstrated, and evaluated herein enables the efficient and direct conversion of a structural drawing’s natural language description into an AutoCAD drawing, significantly reducing the workload compared to the current manual drawing workflow and simplifying the typical iterative process engineers use to express design ideas.
[581] AI-Based Clinical Rule Discovery for NMIBC Recurrence through Tsetlin Machines
Saram Abbas, Naeem Soomro, Rishad Shafik, Rakesh Heer, Kabita Adhikari
Main category: cs.LG
TL;DR: An interpretable AI model using the Tsetlin Machine (TM) outperforms traditional methods in predicting bladder cancer recurrence, offering transparency and accuracy.
Details
Motivation: Current clinical tools for bladder cancer recurrence prediction are outdated and unreliable, especially for intermediate-risk cases.
Method: The Tsetlin Machine (TM), a symbolic learner, was tested on the PHOTO trial dataset (n=330) and compared with XGBoost, Logistic Regression, and EORTC risk tables.
Result: TM achieved an F1-score of 0.80, outperforming XGBoost (0.78), Logistic Regression (0.60), and EORTC (0.42), while providing transparent, human-readable logic.
Conclusion: TM is a trustworthy decision-support tool for bladder cancer recurrence prediction, ready for real-world adoption.
Abstract: Bladder cancer claims one life every 3 minutes worldwide. Most patients are diagnosed with non-muscle-invasive bladder cancer (NMIBC), yet up to 70% recur after treatment, triggering a relentless cycle of surgeries, monitoring, and risk of progression. Clinical tools like the EORTC risk tables are outdated and unreliable - especially for intermediate-risk cases. We propose an interpretable AI model using the Tsetlin Machine (TM), a symbolic learner that outputs transparent, human-readable logic. Tested on the PHOTO trial dataset (n=330), TM achieved an F1-score of 0.80, outperforming XGBoost (0.78), Logistic Regression (0.60), and EORTC (0.42). TM reveals the exact clauses behind each prediction, grounded in clinical features like tumour count, surgeon experience, and hospital stay - offering accuracy and full transparency. This makes TM a powerful, trustworthy decision-support tool ready for real-world adoption.
[582] GNSP: Gradient Null Space Projection for Preserving Cross-Modal Alignment in VLMs Continual Learning
Tiantian Peng, Yuyang Liu, Shuo Yang, Qiuhe Hong, YongHong Tian
Main category: cs.LG
TL;DR: GNSP is a continual learning method for CLIP that prevents forgetting by projecting task-specific gradients onto the null space of prior knowledge, preserving zero-shot capabilities.
Details
Motivation: CLIP's zero-shot performance degrades during fine-tuning due to catastrophic forgetting and embedding misalignment.
Method: GNSP projects gradients onto the null space of prior tasks. Knowledge distillation and modality alignment preservation loss are used to maintain CLIP’s generalization.
Result: Achieves SOTA on MTIL benchmark (11 tasks) and preserves CLIP’s modality gap and cross-modal retrieval performance.
Conclusion: GNSP effectively maintains CLIP’s robustness and generalization in continual learning.
Abstract: Contrastive Language-Image Pretraining has demonstrated remarkable zero-shot generalization by aligning visual and textual modalities in a shared embedding space. However, when continuously fine-tuned on diverse tasks, CLIP suffers from catastrophic forgetting and degradation of its embedding alignment, undermining its zero-shot capabilities. In this work, we propose Gradient Null Space Projection (GNSP), an efficient continual learning method that projects task-specific gradients onto the null space of previously learned knowledge. This orthogonal projection mathematically prevents interference with previous tasks without relying on rehearsal or architectural modification. Furthermore, to preserve the inherent generalization property of CLIP, we introduce knowledge distillation and combine it with a modality alignment preservation loss inspired by CLIP pre-training to stabilize the structure of the multimodal embedding space during fine-tuning. On the MTIL benchmark consisting of 11 tasks, our method achieved SOTA performance on both the Average and Last key metrics. More importantly, experiments show that our method successfully maintains the original modality gap and cross-modal retrieval performance of CLIP, confirming its effectiveness in maintaining a robust visual-language space throughout the continual learning process.
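The core projection is easy to state: remove from the gradient its component in the row space of previously seen features, leaving the null-space component. A hedged SVD-based sketch follows; GNSP's exact accumulation of prior knowledge may differ.

```python
import torch

def project_to_null_space(grad, prev_feats, eps=1e-5):
    """Project a flattened gradient onto the null space of previously seen
    features (rows of prev_feats), so the update does not disturb old
    input-output mappings. A sketch of the general null-space technique;
    GNSP's exact bookkeeping of prior knowledge may differ."""
    _, S, Vh = torch.linalg.svd(prev_feats, full_matrices=False)
    rank = int((S > eps * S.max()).sum())   # numerical rank of row space
    V = Vh[:rank].t()                       # (d, rank) row-space basis
    return grad - V @ (V.t() @ grad)        # strip the row-space component
```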
[583] A Scalable and High Availability Solution for Recommending Resolutions to Problem Tickets
Harish S, Chetana K Nayak, Joy Bose
Main category: cs.LG
TL;DR: The paper proposes an ML-driven solution using clustering, supervised learning, and NLP to resolve telecom problem tickets, addressing challenges like data drift and missing data.
Details
Motivation: To improve incident resolution in telecom billing systems by leveraging historical data patterns despite challenges like data drift and free-text issues.
Method: Uses clustering, supervised learning (LDA, Siamese networks, One-shot learning), and NLP techniques like index embedding. Includes a real-time dashboard and Kubernetes deployment.
Result: High prediction accuracy demonstrated on both open-source (Bitext) and proprietary telecom datasets.
Conclusion: The proposed solution effectively tackles ticket resolution challenges, offering scalable and accurate results.
Abstract: Resolution of incidents or problem tickets is a common theme in service industries in any sector, including billing and charging systems in the telecom domain. Machine learning can help to identify patterns and suggest resolutions for problem tickets, based on patterns in the historical ticket data. However, this process may be complicated by a variety of phenomena such as data drift and issues such as missing data, lack of data pertaining to resolutions of past incidents, and too many similar-sounding resolutions due to free text. This paper proposes a robust ML-driven solution employing clustering, supervised learning, and advanced NLP models to tackle these challenges effectively. Building on previous work, we demonstrate clustering-based resolution identification, supervised classification with LDA, Siamese networks, One-shot learning, and index embedding. Additionally, we present a real-time dashboard and a highly available Kubernetes-based production deployment. Our experiments with both the open-source Bitext customer-support dataset and proprietary telecom datasets demonstrate high prediction accuracy.
[584] Agentic Reinforced Policy Optimization
Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou
Main category: cs.LG
TL;DR: ARPO is a novel RL algorithm for multi-turn LLM-based agents, balancing reasoning and tool interactions with entropy-based adaptive rollouts, outperforming existing methods with fewer tool uses.
Details
Motivation: Current RL algorithms fail to balance LLMs' reasoning and multi-turn tool interactions, limiting their effectiveness in realistic scenarios.
Method: ARPO uses an entropy-based adaptive rollout mechanism and advantage attribution to dynamically balance exploration and stepwise tool interactions.
Result: ARPO outperforms trajectory-level RL algorithms across 13 benchmarks, achieving better performance with half the tool-use budget.
Conclusion: ARPO offers a scalable solution for aligning LLM-based agents with dynamic environments, enhancing multi-turn reasoning and tool interactions.
Abstract: Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs can often utilize external tools to assist in task-solving processes. However, current RL algorithms inadequately balance the models’ intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. To bridge this gap, we propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents. Through preliminary experiments, we observe that LLMs tend to exhibit highly uncertain behavior, characterized by an increase in the entropy distribution of generated tokens, immediately following interactions with external tools. Motivated by this observation, ARPO incorporates an entropy-based adaptive rollout mechanism, dynamically balancing global trajectory sampling and step-level sampling, thereby promoting exploration at steps with high uncertainty after tool usage. By integrating an advantage attribution estimation, ARPO enables LLMs to internalize advantage differences in stepwise tool-use interactions. Our experiments across 13 challenging benchmarks in computational reasoning, knowledge reasoning, and deep search domains demonstrate ARPO’s superiority over trajectory-level RL algorithms. Remarkably, ARPO achieves improved performance using only half of the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments. Our code and datasets are released at https://github.com/dongguanting/ARPO
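A minimal sketch of an entropy-triggered branching test of the kind the adaptive rollout describes; the threshold value and the exact trigger rule are assumptions.

```python
import torch

def should_branch(logits, threshold=3.0):
    """Token-entropy test of the kind ARPO's adaptive rollout describes:
    after a tool call, spawn extra step-level rollouts when generation
    entropy is high. Threshold (in nats) and trigger rule are assumptions."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)  # per position
    return entropy.mean().item() > threshold

# usage sketch: if should_branch(post_tool_logits): sample extra branches
```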
[585] Inducing Causal World Models in LLMs for Zero-Shot Physical Reasoning
Aditya Sharma, Linh Nguyen, Ananya Gupta, Chengyu Wang, Chiamaka Adebayo, Jakub Kowalski
Main category: cs.LG
TL;DR: CWMI embeds causal physics understanding in LLMs via a Causal Physics Module and Causal Intervention Loss, improving zero-shot physical reasoning.
Details
Motivation: LLMs lack intuitive understanding of physical dynamics, limiting real-world causal reasoning.
Method: Introduces CWMI with a Causal Physics Module and Causal Intervention Loss to learn cause-and-effect from multimodal data.
Result: Outperforms state-of-the-art LLMs on zero-shot physical reasoning tasks like PIQA and PhysiCa-Bench.
Conclusion: Inducing a causal world model enhances reliability and generalizability of AI systems.
Abstract: Large Language Models (LLMs), despite their advanced linguistic capabilities, fundamentally lack an intuitive understanding of physical dynamics, which limits their effectiveness in real-world scenarios that require causal reasoning. In this paper, we introduce Causal World Model Induction (CWMI), a novel framework designed to embed an explicit model of causal physics within an LLM. Our approach incorporates a dedicated Causal Physics Module (CPM) and a new training objective called Causal Intervention Loss, encouraging the model to learn cause-and-effect relationships from multimodal data. By training the model to predict the outcomes of hypothetical interventions instead of merely capturing statistical correlations, CWMI develops a robust internal representation of physical laws. Experimental results show that CWMI significantly outperforms state-of-the-art LLMs on zero-shot physical reasoning tasks, including the PIQA benchmark and our newly proposed PhysiCa-Bench dataset. These findings demonstrate that inducing a causal world model is a critical step toward more reliable and generalizable AI systems.
[586] RestoreAI – Pattern-based Risk Estimation Of Remaining Explosives
Björn Kischelewski, Benjamin Guedj, David Wahl
Main category: cs.LG
TL;DR: RestoreAI uses AI to predict landmine risk from spatial patterns, improving clearance efficiency with linear, curved, and Bayesian methods.
Details
Motivation: Existing AI methods for landmine detection focus on object recognition, neglecting spatial pattern-based risk prediction, which could enhance clearance efficiency.
Method: RestoreAI implements three deminers: linear (PCA-based), curved (principal curves), and Bayesian (incorporating expert knowledge) for risk prediction.
Result: RestoreAI improved clearance efficiency, with a 14.37 percentage point increase in the average share of cleared landmines per timestep and 24.45% less time than the best baseline to locate all landmines. Linear and curved methods performed similarly.
Conclusion: RestoreAI demonstrates the viability of pattern-based risk prediction for landmine clearance, with linear patterns being a practical option.
Abstract: Landmine removal is a slow, resource-intensive process affecting over 60 countries. While AI has been proposed to enhance explosive ordnance (EO) detection, existing methods primarily focus on object recognition, with limited attention to prediction of landmine risk based on spatial pattern information. This work aims to answer the following research question: How can AI be used to predict landmine risk from landmine patterns to improve clearance time efficiency? To that effect, we introduce RestoreAI, an AI system for pattern-based risk estimation of remaining explosives. RestoreAI is the first AI system that leverages landmine patterns for risk prediction, improving the accuracy of estimating the residual risk of missing EO prior to land release. We particularly focus on the implementation of three instances of RestoreAI: linear, curved, and Bayesian pattern deminers. First, the linear pattern deminer uses linear landmine patterns from a principal component analysis (PCA) for the landmine risk prediction. Second, the curved pattern deminer uses curved landmine patterns from principal curves. Finally, the Bayesian pattern deminer incorporates prior expert knowledge by using a Bayesian pattern risk prediction. Evaluated on real-world landmine data, RestoreAI significantly boosts clearance efficiency. The top-performing pattern-based deminers achieved a 14.37 percentage point increase in the average share of cleared landmines per timestep and required 24.45% less time than the best baseline deminer to locate all landmines. Interestingly, linear and curved pattern deminers showed no significant performance difference, suggesting that more efficient linear patterns are a viable option for risk prediction.
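To make the linear pattern deminer concrete, here is a toy sketch: fit the principal axis of known mine locations with PCA and score query points by proximity to that axis; the exponential risk score is an assumption, not the paper's risk model.

```python
import numpy as np

def linear_pattern_risk(mines, query_points):
    """Toy sketch of the linear pattern idea: fit the principal axis of
    known mine coordinates via PCA and score risk by distance to that axis
    (closer = riskier). The exp(-distance) score is an assumption."""
    center = mines.mean(axis=0)
    X = mines - center
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    axis = Vt[0]                               # first principal direction
    d = query_points - center
    along = d @ axis                           # component along the line
    perp = d - np.outer(along, axis)           # residual off the line
    dist = np.linalg.norm(perp, axis=1)
    return np.exp(-dist)                       # toy risk score in (0, 1]
```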
[587] CLoRA: Parameter-Efficient Continual Learning with Low-Rank Adaptation
Shishir Muralidhara, Didier Stricker, René Schuster
Main category: cs.LG
TL;DR: CLoRA applies Low-Rank Adaptation (LoRA) to continual learning, reducing computational demands while maintaining performance.
Details
Motivation: Address the computational inefficiency of retraining entire models in continual learning for resource-constrained environments.
Method: Uses LoRA for parameter-efficient fine-tuning in class-incremental semantic segmentation, leveraging a small set of shared parameters.
Result: CLoRA matches or exceeds baseline performance while significantly reducing hardware requirements.
Conclusion: CLoRA is a resource-efficient solution for continual learning in constrained environments, validated by NetScore.
Abstract: In the past, continual learning (CL) was mostly concerned with the problem of catastrophic forgetting in neural networks, that arises when incrementally learning a sequence of tasks. Current CL methods function within the confines of limited data access, without any restrictions imposed on computational resources. However, in real-world scenarios, the latter takes precedence as deployed systems are often computationally constrained. A major drawback of most CL methods is the need to retrain the entire model for each new task. The computational demands of retraining large models can be prohibitive, limiting the applicability of CL in environments with limited resources. Through CLoRA, we explore the applicability of Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method for class-incremental semantic segmentation. CLoRA leverages a small set of parameters of the model and uses the same set for learning across all tasks. Results demonstrate the efficacy of CLoRA, achieving performance on par with and exceeding the baseline methods. We further evaluate CLoRA using NetScore, underscoring the need to factor in resource efficiency and evaluate CL methods beyond task performance. CLoRA significantly reduces the hardware requirements for training, making it well-suited for CL in resource-constrained environments after deployment.
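For reference, the standard LoRA parameterization that CLoRA builds on, sketched in PyTorch: the pretrained weight is frozen and a low-rank update BA is learned; the rank and scaling here are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Standard LoRA: freeze the pretrained linear layer and learn a
    rank-r update BA on top. In CLoRA the same small LoRA parameter set
    is reused across incremental tasks."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # frozen pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```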
[588] A Survey on Generative Model Unlearning: Fundamentals, Taxonomy, Evaluation, and Future Direction
Xiaohua Feng, Jiaming Zhang, Fengyuan Yu, Chengye Wang, Li Zhang, Kaixiang Li, Yuyuan Li, Chaochao Chen, Jianwei Yin
Main category: cs.LG
TL;DR: The paper reviews Generative Model Unlearning (GenMU), proposes a unified framework, and explores its connections with related techniques, highlighting practical applications and future challenges.
Details
Motivation: Address the lack of a unified framework for organizing and comparing existing work in Generative Model Unlearning (GenMU) due to diverse objectives and evaluation protocols.
Method: Comprehensive review of GenMU research, proposing a unified framework for categorizing objectives, strategies, and metrics, and exploring connections with related techniques.
Result: A unified analytical framework for GenMU, insights into its practical applications, and identification of key challenges and future directions.
Conclusion: The paper lays a foundation for advancements in GenMU by providing a systematic review, unified framework, and highlighting future research opportunities.
Abstract: With the rapid advancement of generative models, associated privacy concerns have attracted growing attention. To address this, researchers have begun adapting machine unlearning techniques from traditional classification models to generative settings. Although notable progress has been made in this area, a unified framework for systematically organizing and integrating existing work is still lacking. The substantial differences among current studies in terms of unlearning objectives and evaluation protocols hinder the objective and fair comparison of various approaches. While some studies focus on specific types of generative models, they often overlook the commonalities and systematic characteristics inherent in Generative Model Unlearning (GenMU). To bridge this gap, we provide a comprehensive review of current research on GenMU and propose a unified analytical framework for categorizing unlearning objectives, methodological strategies, and evaluation metrics. In addition, we explore the connections between GenMU and related techniques, including model editing, reinforcement learning from human feedback, and controllable generation. We further highlight the potential practical value of unlearning techniques in real-world applications. Finally, we identify key challenges and outline future research directions aimed at laying a solid foundation for further advancements in this field. We consistently maintain the related open-source materials at https://github.com/caxLee/Generative-model-unlearning-survey.
[589] Who Owns This Sample: Cross-Client Membership Inference Attack in Federated Graph Neural Networks
Kunhao Li, Di Wu, Jun Bai, Jing Xu, Lei Yang, Ziyi Zhang, Yiliao Song, Wencheng Yang, Taotao Cai, Yan Li
Main category: cs.LG
TL;DR: The paper studies cross-client membership inference attacks (CC-MIA) in federated GNNs, revealing privacy risks in node classification tasks.
Details
Motivation: To address privacy threats in federated GNNs, focusing on sample-to-client attribution, a unique risk in decentralized settings.
Method: A general attack framework exploiting aggregation behaviors, gradient updates, and embedding proximity to link samples to clients.
Result: High performance in membership inference and ownership identification across datasets.
Conclusion: Highlights client identity leakage risks, urging the need for robust GNN designs in federated learning.
Abstract: Graph-structured data is prevalent in many real-world applications, including social networks, financial systems, and molecular biology. Graph Neural Networks (GNNs) have become the de facto standard for learning from such data due to their strong representation capabilities. As GNNs are increasingly deployed in federated learning (FL) settings to preserve data locality and privacy, new privacy threats arise from the interaction between graph structures and decentralized training. In this paper, we present the first systematic study of cross-client membership inference attacks (CC-MIA) against node classification tasks of federated GNNs (FedGNNs), where a malicious client aims to infer which client owns the given data. Unlike prior centralized-focused work that focuses on whether a sample was included in training, our attack targets sample-to-client attribution, a finer-grained privacy risk unique to federated settings. We design a general attack framework that exploits FedGNNs’ aggregation behaviors, gradient updates, and embedding proximity to link samples to their source clients across training rounds. We evaluate our attack across multiple graph datasets under realistic FL setups. Results show that our method achieves high performance on both membership inference and ownership identification. Our findings highlight a new privacy threat in federated graph learning-client identity leakage through structural and model-level cues, motivating the need for attribution-robust GNN design.
[590] Dimer-Enhanced Optimization: A First-Order Approach to Escaping Saddle Points in Neural Network Training
Yue Hu, Zanxia Cao, Yingchao Liu
Main category: cs.LG
TL;DR: DEO, a first-order method inspired by the Dimer technique, enhances neural network training by estimating curvature to escape saddle points and flat regions without full Hessian computation.
Details
Motivation: First-order methods like SGD and Adam struggle with complex loss landscapes (flat regions, plateaus, saddle points), while second-order methods are computationally expensive.
Method: DEO adapts the Dimer method to estimate curvature using gradient information, projecting gradients orthogonally to escape saddle points.
Result: Preliminary experiments on a Transformer model show DEO achieves competitive performance in navigating complex loss landscapes.
Conclusion: DEO effectively repurposes physics-inspired curvature estimation to improve neural network training efficiency.
Abstract: First-order optimization methods, such as SGD and Adam, are widely used for training large-scale deep neural networks due to their computational efficiency and robust performance. However, relying solely on gradient information, these methods often struggle to navigate complex loss landscapes with flat regions, plateaus, and saddle points. Second-order methods, which use curvature information from the Hessian matrix, can address these challenges but are computationally infeasible for large models. The Dimer method, a first-order technique that constructs two closely spaced points to probe the local geometry of a potential energy surface, efficiently estimates curvature using only gradient information. Inspired by its use in molecular dynamics simulations for locating saddle points, we propose Dimer-Enhanced Optimization (DEO), a novel framework to escape saddle points in neural network training. DEO adapts the Dimer method to explore a broader region of the loss landscape, approximating the Hessian’s smallest eigenvector without computing the full matrix. By periodically projecting the gradient onto the subspace orthogonal to the minimum curvature direction, DEO guides the optimizer away from saddle points and flat regions, enhancing training efficiency with non-stepwise updates. Preliminary experiments on a Transformer toy model show DEO achieves competitive performance compared to standard first-order methods, improving navigation of complex loss landscapes. Our work repurposes physics-inspired, first-order curvature estimation to enhance neural network training in high-dimensional spaces.
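A hedged sketch of the dimer mechanism DEO adapts: finite differences of gradients at x ± hv approximate the Hessian-vector product Hv, the direction v is rotated down the Rayleigh quotient toward the minimum-curvature eigenvector, and the training gradient is then projected orthogonally to v; step sizes and iteration counts are illustrative, not the paper's settings.

```python
import numpy as np

def estimate_min_curvature_dir(grad_fn, x, v, h=1e-3, n_rot=5, step=0.5):
    """Dimer-style curvature probe using only gradients: the finite
    difference of gradients at x +/- h*v approximates Hv, and v is moved
    down the Rayleigh quotient toward the minimum-curvature direction."""
    for _ in range(n_rot):
        Hv = (grad_fn(x + h * v) - grad_fn(x - h * v)) / (2 * h)
        Hv_perp = Hv - (Hv @ v) * v        # component that rotates v
        v = v - step * Hv_perp             # rotate toward lower curvature
        v = v / np.linalg.norm(v)
    return v

def deo_step(grad_fn, x, v, lr=1e-2):
    """Project the gradient onto the subspace orthogonal to the minimum
    curvature direction before the usual first-order step."""
    g = grad_fn(x)
    return x - lr * (g - (g @ v) * v)
```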
[591] Robust Taxi Fare Prediction Under Noisy Conditions: A Comparative Study of GAT, TimesNet, and XGBoost
Padmavathi Moorthy
Main category: cs.LG
TL;DR: The study evaluates GAT, XGBoost, and TimesNet for taxi fare prediction, analyzing data quality impact and model robustness.
Details
Motivation: Precise fare prediction is vital for ride-hailing platforms and urban mobility systems.Method: Three models (GAT, XGBoost, TimesNet) are tested on a 55M-record dataset, with raw and denoised data. Pre-processing includes KNN imputation, noise injection, and autoencoder denoising.
Result: Key differences between classical and deep learning models are revealed, with insights into predictive accuracy, calibration, and robustness.
Conclusion: Practical guidelines for robust and scalable fare prediction models in urban systems are provided.
Abstract: Precise fare prediction is crucial in ride-hailing platforms and urban mobility systems. This study examines three machine learning models, Graph Attention Networks (GAT), XGBoost, and TimesNet, to evaluate their predictive capabilities for taxi fares using a real-world dataset comprising over 55 million records. Both raw (noisy) and denoised versions of the dataset are analyzed to assess the impact of data quality on model performance. The study evaluates the models along multiple axes, including predictive accuracy, calibration, uncertainty estimation, out-of-distribution (OOD) robustness, and feature sensitivity. We also explore pre-processing strategies, including KNN imputation, Gaussian noise injection, and autoencoder-based denoising. The study reveals critical differences between classical and deep learning models under realistic conditions, offering practical guidelines for building robust and scalable models in urban fare prediction systems.
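The pre-processing steps named in the summary are standard enough to sketch. The fragment below shows KNN imputation and Gaussian noise injection on a toy fare table; the autoencoder denoiser is omitted for brevity, and all column names are illustrative stand-ins.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(42)

# Toy fare features: [trip_distance_km, duration_min, passenger_count].
X = np.array([[2.1, 11.0, 1.0],
              [5.4, np.nan, 2.0],
              [1.2, 7.5, np.nan],
              [8.9, 31.0, 1.0]])

# KNN imputation: missing entries are filled from the nearest complete rows.
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)

# Gaussian noise injection yields the "noisy" variant used to probe robustness.
noise = rng.normal(scale=0.05 * X_imputed.std(axis=0), size=X_imputed.shape)
X_noisy = X_imputed + noise
```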
[592] FedSWA: Improving Generalization in Federated Learning with Highly Heterogeneous Data via Momentum-Based Stochastic Controlled Weight Averaging
Liu junkang, Yuanyuan Liu, Fanhua Shang, Hongying Liu, Jin Liu, Wei Feng
Main category: cs.LG
TL;DR: The paper proposes FedSWA and FedMoSWA, two federated learning algorithms, to improve generalization in highly heterogeneous data settings, outperforming FedSAM and FedAvg.
Details
Motivation: To address the poor generalization of FedSAM in highly heterogeneous data and improve FL performance.Method: Introduces FedSWA for flatter minima and FedMoSWA for better local-global model alignment, with theoretical convergence and generalization analysis.
Result: Theoretical and empirical results show FedMoSWA has smaller errors and outperforms FedSAM and variants on CIFAR10/100 and Tiny ImageNet.
Conclusion: FedSWA and FedMoSWA are effective for FL in heterogeneous data, with FedMoSWA showing superior performance.
Abstract: For federated learning (FL) algorithms such as FedSAM, their generalization capability is crucial for real-world applications. In this paper, we revisit the generalization problem in FL and investigate the impact of data heterogeneity on FL generalization. We find that FedSAM usually performs worse than FedAvg in the case of highly heterogeneous data, and thus propose a novel and effective federated learning algorithm with Stochastic Weight Averaging (called \texttt{FedSWA}), which aims to find flatter minima in the setting of highly heterogeneous data. Moreover, we introduce a new momentum-based stochastic controlled weight averaging FL algorithm (\texttt{FedMoSWA}), which is designed to better align local and global models. Theoretically, we provide both convergence analysis and generalization bounds for \texttt{FedSWA} and \texttt{FedMoSWA}. We also prove that the optimization and generalization errors of \texttt{FedMoSWA} are smaller than those of their counterparts, including FedSAM and its variants. Empirically, experimental results on CIFAR10/100 and Tiny ImageNet demonstrate the superiority of the proposed algorithms compared to their counterparts. Open source code at: https://github.com/junkangLiu0/FedSWA.
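A server-side sketch of the core FedSWA idea under the usual FedAvg skeleton: aggregate client models each round, then maintain a running (stochastic weight) average of the global iterates, which biases the final model toward flatter minima. This is a conceptual rendering only; the momentum-based controlled variant (FedMoSWA) adds correction terms not shown here.

```python
import numpy as np

def fedavg(client_weights):
    """Elementwise mean of client models (each model is a list of arrays)."""
    return [np.mean(layer, axis=0) for layer in zip(*client_weights)]

def swa_update(swa, new, n):
    """Running average of global models across rounds."""
    return [(s * n + w) / (n + 1) for s, w in zip(swa, new)]

rng = np.random.default_rng(0)
swa, n = None, 0
for _ in range(3):
    # Stand-in for local training: two clients, one weight tensor each.
    clients = [[rng.standard_normal((4, 4))] for _ in range(2)]
    global_w = fedavg(clients)
    swa = global_w if swa is None else swa_update(swa, global_w, n)
    n += 1
# `swa` is the weight-averaged global model returned at the end of training.
```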
[593] Irredundant $k$-Fold Cross-Validation
Jesus S. Aguilar-Ruiz
Main category: cs.LG
TL;DR: Irredundant k-fold cross-validation ensures each instance is used once for training and once for testing, reducing redundancy and computational cost while maintaining performance.
Details
Motivation: Traditional k-fold cross-validation has redundancy, allowing instances to disproportionately influence learning. This method aims to balance dataset utilization and mitigate overfitting.Method: The method guarantees non-overlapping training partitions, ensuring each instance is used exactly once for training and once for testing. It preserves stratification and is model-agnostic.
Result: Experimental results show consistent performance comparable to k-fold cross-validation, with less optimistic variance estimates and reduced computational cost.
Conclusion: Irredundant k-fold cross-validation offers a balanced, efficient alternative to traditional methods, improving model analysis and reducing overfitting.
Abstract: In traditional k-fold cross-validation, each instance is used ($k-1$) times for training and once for testing, leading to redundancy that lets many instances disproportionately influence the learning phase. We introduce Irredundant $k$-fold cross-validation, a novel method that guarantees each instance is used exactly once for training and once for testing across the entire validation procedure. This approach ensures a more balanced utilization of the dataset, mitigates overfitting due to instance repetition, and enables sharper distinctions in comparative model analysis. The method preserves stratification and remains model-agnostic, i.e., compatible with any classifier. Experimental results demonstrate that it delivers consistent performance estimates across diverse datasets – comparable to $k$-fold cross-validation – while providing less optimistic variance estimates because training partitions are non-overlapping, and significantly reducing the overall computational cost.
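One concrete way to realize the stated property — every instance trains exactly once and tests exactly once, with non-overlapping training partitions — is a cyclic pairing of stratified folds: train on fold i, test on fold i+1. The pairing below is an illustrative realization; the paper's construction may differ.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def irredundant_kfold(X, y, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs where each instance appears exactly
    once in a training partition and once in a test partition overall."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    folds = [test for _, test in skf.split(X, y)]
    for i in range(k):
        yield folds[i], folds[(i + 1) % k]  # train on fold i, test on fold i+1

X, y = np.arange(20).reshape(-1, 1), np.tile([0, 1], 10)
seen_train, seen_test = [], []
for tr, te in irredundant_kfold(X, y):
    seen_train.extend(tr); seen_test.extend(te)
assert sorted(seen_train) == sorted(seen_test) == list(range(20))
```

Note the cost profile: k single-fold training runs instead of k runs on (k-1) folds each, which is where the reported computational savings come from.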
[594] $K^4$: Online Log Anomaly Detection Via Unsupervised Typicality Learning
Weicong Chen, Vikash Singh, Zahra Rahmani, Debargha Ganguly, Mohsen Hariri, Vipin Chaudhary
Main category: cs.LG
TL;DR: $K^4$ is an unsupervised, parser-independent LogAD framework using k-NN statistics for fast, accurate anomaly detection.
Details
Motivation: Existing LogAD methods are slow, parsing-dependent, and use unrealistic evaluation protocols.Method: $K^4$ transforms log embeddings into 4D descriptors (Precision, Recall, Density, Coverage) via k-NN statistics for lightweight anomaly scoring.
Result: $K^4$ achieves AUROC 0.995-0.999, outperforming baselines in speed (4s training, 4μs inference) and accuracy.
Conclusion: $K^4$ sets a new state-of-the-art for online LogAD, being fast, accurate, and parser-independent.
Abstract: Existing Log Anomaly Detection (LogAD) methods are often slow, dependent on error-prone parsing, and use unrealistic evaluation protocols. We introduce $K^4$, an unsupervised and parser-independent framework for high-performance online detection. $K^4$ transforms arbitrary log embeddings into compact four-dimensional descriptors (Precision, Recall, Density, Coverage) using efficient k-nearest neighbor (k-NN) statistics. These descriptors enable lightweight detectors to accurately score anomalies without retraining. Using a more realistic online evaluation protocol, $K^4$ sets a new state-of-the-art (AUROC: 0.995-0.999), outperforming baselines by large margins while being orders of magnitude faster, with training under 4 seconds and inference as low as 4 $\mu$s.
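The four descriptors share their names with the common k-NN formulation of precision, recall, density, and coverage; a sketch in that style is below, computing a 4D descriptor for a query batch against normal (training) embeddings. The paper's exact definitions, and the detector scored on top of the descriptors, may differ.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_radii(Z, k):
    """Distance from each point to its k-th nearest neighbor within Z (self excluded)."""
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(Z).kneighbors(Z)
    return dists[:, -1]

def k4_descriptor(train_emb, query_emb, k=5):
    """(precision, recall, density, coverage) of a query batch relative to
    normal training embeddings, from k-NN statistics."""
    r_tr, r_q = knn_radii(train_emb, k), knn_radii(query_emb, k)
    d = np.linalg.norm(query_emb[:, None] - train_emb[None], axis=-1)
    precision = (d <= r_tr[None]).any(axis=1).mean()   # queries inside some train ball
    recall = (d.T <= r_q[None]).any(axis=1).mean()     # train points inside some query ball
    density = (d <= r_tr[None]).sum(axis=1).mean() / k
    coverage = (d.min(axis=0) <= r_tr).mean()
    return np.array([precision, recall, density, coverage])

rng = np.random.default_rng(0)
normal = rng.standard_normal((200, 16))       # embeddings of normal logs
desc = k4_descriptor(normal, rng.standard_normal((32, 16)) + 3.0)  # shifted batch
# An anomalous (shifted) batch yields low precision/density/coverage.
```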
[595] What Can Grokking Teach Us About Learning Under Nonstationarity?
Clare Lyle, Gharda Sokar, Razvan Pascanu, Andras Gyorgy
Main category: cs.LG
TL;DR: The paper explores how feature-learning dynamics in neural networks, known to drive grokking, can address primacy bias in continual learning by enabling the overwriting of learned features. It proposes increasing the effective learning rate to induce these dynamics, improving generalization across tasks.
Details
Motivation: Primacy bias in neural networks hinders adaptation to new tasks in continual learning. The paper aims to leverage feature-learning dynamics, observed in grokking, to mitigate this bias.Method: The study proposes increasing the effective learning rate (ratio of parameter to update norms) to induce feature-learning dynamics, tested in grokking, warm-starting, and reinforcement learning.
Result: The method successfully facilitates feature-learning and enhances generalization in various settings, including grokking and reinforcement learning.
Conclusion: Feature-learning dynamics, accelerated by adjusting the effective learning rate, offer a promising solution to primacy bias in non-stationary learning problems.
Abstract: In continual learning problems, it is often necessary to overwrite components of a neural network’s learned representation in response to changes in the data stream; however, neural networks often exhibit primacy bias, whereby early training data hinders the network’s ability to generalize on later tasks. While feature-learning dynamics of nonstationary learning problems are not well studied, the emergence of feature-learning dynamics is known to drive the phenomenon of grokking, wherein neural networks initially memorize their training data and only later exhibit perfect generalization. This work conjectures that the same feature-learning dynamics which facilitate generalization in grokking also underlie the ability to overwrite previously learned features, and that methods which accelerate grokking by facilitating feature-learning dynamics are promising candidates for addressing primacy bias in non-stationary learning problems. We then propose a straightforward method to induce feature-learning dynamics as needed throughout training by increasing the effective learning rate, i.e. the ratio between parameter and update norms. We show that this approach both facilitates feature-learning and improves generalization in a variety of settings, including grokking, warm-starting neural network training, and reinforcement learning tasks.
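The proposed control knob is easy to state in code. The sketch below measures the effective learning rate as a norm ratio between an update and the parameters (the abstract phrases it as "the ratio between parameter and update norms"; which norm sits in the numerator is the paper's convention to fix) and shows one blunt, illustrative way to raise it mid-training.

```python
import torch

def effective_lr(params, updates):
    """Norm ratio between an optimizer step and the current parameters."""
    u = torch.sqrt(sum(u.pow(2).sum() for u in updates))
    p = torch.sqrt(sum(p.pow(2).sum() for p in params))
    return (u / p).item()

def shrink_params(model, factor=0.5):
    """Raise the effective learning rate by shrinking parameter norms while
    the optimizer step stays fixed (illustrative only; this changes the
    function unless the network is scale-invariant)."""
    with torch.no_grad():
        for p in model.parameters():
            p.mul_(factor)

model = torch.nn.Linear(8, 8)
fake_updates = [1e-3 * torch.randn_like(p) for p in model.parameters()]
before = effective_lr(list(model.parameters()), fake_updates)
shrink_params(model)
after = effective_lr(list(model.parameters()), fake_updates)
assert after > before
```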
[596] ModShift: Model Privacy via Designed Shifts
Nomaan A. Kherani, Urbashi Mitra
Main category: cs.LG
TL;DR: The paper introduces shifts to protect model privacy in federated learning by treating learning as a parameter estimation problem, using Fisher Information to obscure updates, and securely sharing shifts to maintain accuracy.
Details
Motivation: To preserve model privacy against eavesdroppers in federated learning while maintaining accuracy.Method: Treats learning as parameter estimation, derives Fisher Information from shifted updates to obscure them, and uses secure shift sharing. Includes a convergence test for tampering detection.
Result: Achieves higher model shift than noise injection with less bandwidth for secret channels.
Conclusion: The scheme effectively protects privacy, maintains accuracy, and detects tampering, outperforming noise injection in efficiency.
Abstract: In this paper, shifts are introduced to preserve model privacy against an eavesdropper in federated learning. Model learning is treated as a parameter estimation problem. This perspective allows us to derive the Fisher Information matrix of the model updates from the shifted updates and drive it to singularity, thus posing a hard estimation problem for Eve. The shifts are securely shared with the central server to maintain model accuracy at the server and participating devices. A convergence test is proposed to detect if model updates have been tampered with, and we show that our scheme passes this test. Numerical results show that our scheme achieves a higher model shift than a noise injection scheme while requiring a secret channel of lower bandwidth.
[597] Strategic Filtering for Content Moderation: Free Speech or Free of Distortion?
Saba Ahmadi, Avrim Blum, Haifeng Xu, Fan Yao
Main category: cs.LG
TL;DR: The paper addresses the challenge of balancing free speech and reducing harmful content on social media using mechanism design, proposing practical methods to approximate optimal moderation despite NP-hard complexity.
Details
Motivation: User-generated content on social media is prone to manipulation, requiring effective moderation while balancing free speech and minimizing social distortion.Method: The study uses mechanism design to optimize the trade-off between free speech and content manipulation, proposing approximation methods for the NP-hard problem and providing generalization guarantees.
Result: Practical methods are introduced to approximate the optimal moderation solution, with guarantees on the required offline data for effective approximation.
Conclusion: The paper offers a framework to balance free speech and content moderation, providing actionable insights for social media platforms.
Abstract: User-generated content (UGC) on social media platforms is vulnerable to incitements and manipulations, necessitating effective regulations. To address these challenges, those platforms often deploy automated content moderators tasked with evaluating the harmfulness of UGC and filtering out content that violates established guidelines. However, such moderation inevitably gives rise to strategic responses from users, who strive to express themselves within the confines of guidelines. Such phenomena call for a careful balance between: 1. ensuring freedom of speech – by minimizing the restriction of expression; and 2. reducing social distortion – measured by the total amount of content manipulation. We tackle the problem of optimizing this balance through the lens of mechanism design, aiming at optimizing the trade-off between minimizing social distortion and maximizing free speech. Although determining the optimal trade-off is NP-hard, we propose practical methods to approximate the optimal solution. Additionally, we provide generalization guarantees determining the amount of finite offline data required to approximate the optimal moderator effectively.
[598] Geometric Operator Learning with Optimal Transport
Xinyi Li, Zongyi Li, Nikola Kovachki, Anima Anandkumar
Main category: cs.LG
TL;DR: Integrating optimal transport (OT) into operator learning for PDEs on complex geometries, replacing traditional mesh-based methods with OT-based instance-dependent deformation for better flexibility and efficiency.
Details
Motivation: Classical geometric learning methods rely on meshes, graphs, or point clouds, which may lack flexibility. The goal is to generalize these to mesh density functions using OT for improved adaptability.Method: Formulate geometry embedding as an OT problem, mapping mesh density functions to a uniform density in a reference space. For 3D surfaces, embed into a 2D latent space for computational efficiency.
Result: Achieves better accuracy and reduces computational expenses (time and memory) on RANS equations for ShapeNet-Car and DrivAerNet-Car datasets. Improved accuracy on FlowBench dataset.
Conclusion: OT-based instance-dependent deformation enhances flexibility and efficiency in PDE operator learning, outperforming traditional methods on variable-geometry datasets.
Abstract: We propose integrating optimal transport (OT) into operator learning for partial differential equations (PDEs) on complex geometries. Classical geometric learning methods typically represent domains as meshes, graphs, or point clouds. Our approach generalizes discretized meshes to mesh density functions, formulating geometry embedding as an OT problem that maps these functions to a uniform density in a reference space. Compared to previous methods relying on interpolation or shared deformation, our OT-based method employs instance-dependent deformation, offering enhanced flexibility and effectiveness. For 3D simulations focused on surfaces, our OT-based neural operator embeds the surface geometry into a 2D parameterized latent space. By performing computations directly on this 2D representation of the surface manifold, it achieves significant computational efficiency gains compared to volumetric simulation. Experiments with Reynolds-averaged Navier-Stokes equations (RANS) on the ShapeNet-Car and DrivAerNet-Car datasets show that our method achieves better accuracy and also reduces computational expenses in terms of both time and memory usage compared to existing machine learning models. Additionally, our model demonstrates significantly improved accuracy on the FlowBench dataset, underscoring the benefits of employing instance-dependent deformation for datasets with highly variable geometries.
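The geometry-embedding step reduces to a classical OT problem: transport a nonuniform mesh density to the uniform density on a reference domain. A toy version with the POT library is below; the mesh, densities, and grid are stand-ins, and the paper's neural-operator pipeline sits on top of this step rather than being shown here.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
mesh_pts = rng.random((50, 2))          # sample points of an irregular mesh
mesh_density = rng.random(50)
mesh_density /= mesh_density.sum()      # mesh density function (a histogram)

# Reference space: a regular grid carrying the uniform density.
g = np.linspace(0.0, 1.0, 8)
grid = np.stack(np.meshgrid(g, g), -1).reshape(-1, 2)
uniform = np.full(len(grid), 1.0 / len(grid))

M = ot.dist(mesh_pts, grid)             # squared Euclidean cost matrix
plan = ot.emd(mesh_density, uniform, M) # exact OT plan
# Barycentric image of each mesh point in the reference space.
embedded = (plan @ grid) / plan.sum(axis=1, keepdims=True)
```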
[599] PERRY: Policy Evaluation with Confidence Intervals using Auxiliary Data
Aishwarya Mandyam, Jason Meng, Ge Gao, Jiankai Sun, Mac Schwager, Barbara E. Engelhardt, Emma Brunskill
Main category: cs.LG
TL;DR: The paper proposes two methods for constructing valid confidence intervals in off-policy evaluation (OPE) when using augmented data, addressing bias and uncertainty quantification issues.
Details
Motivation: Existing OPE methods lack principled uncertainty quantification when using auxiliary datasets, which is critical for high-stakes applications like healthcare.Method: 1) A conformal prediction method for high-dimensional state MDPs to estimate policy performance conditioned on an initial state. 2) Doubly robust estimation and prediction-powered inference for average policy performance over many initial states.
Result: The methods produce valid confidence intervals that cover ground truth values across simulators (robotics, healthcare, inventory management) and a real healthcare dataset (MIMIC-IV).
Conclusion: The proposed approaches reliably quantify uncertainty in OPE with augmented data, outperforming prior methods.
Abstract: Off-policy evaluation (OPE) methods aim to estimate the value of a new reinforcement learning (RL) policy prior to deployment. Recent advances have shown that leveraging auxiliary datasets, such as those synthesized by generative models, can improve the accuracy of these value estimates. Unfortunately, such auxiliary datasets may also be biased, and existing methods for using data augmentation for OPE in RL lack principled uncertainty quantification. In high-stakes settings like healthcare, reliable uncertainty estimates are important for comparing policy value estimates. In this work, we propose two approaches to construct valid confidence intervals for OPE when using data augmentation. The first provides a confidence interval over the policy performance conditioned on a particular initial state $V^{\pi}(s_0)$ – such intervals are particularly important for human-centered applications. To do so we introduce a new conformal prediction method for high-dimensional state MDPs. Second, we consider the more common task of estimating the average policy performance over many initial states; to do so we draw on ideas from doubly robust estimation and prediction-powered inference. Across simulators spanning robotics, healthcare and inventory management, and a real healthcare dataset from MIMIC-IV, we find that our methods can use augmented data and still consistently produce intervals that cover the ground truth values, unlike previously proposed methods.
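For the second approach, the prediction-powered inference idea is simple enough to sketch for a mean: estimate the quantity on the large auxiliary sample, then debias it with the labeled sample's residuals (the "rectifier"), yielding a normal-approximation interval. This is a generic PPI sketch for a mean, not PERRY's policy-value estimator.

```python
import numpy as np
from scipy.stats import norm

def ppi_mean_ci(y_labeled, f_labeled, f_auxiliary, alpha=0.05):
    """Prediction-powered estimate of E[Y] with a (1 - alpha) CI, combining a
    large auxiliary sample of predictions with a small labeled sample."""
    n, N = len(y_labeled), len(f_auxiliary)
    rectifier = y_labeled - f_labeled            # bias of the synthetic labels
    est = f_auxiliary.mean() + rectifier.mean()  # debiased point estimate
    se = np.sqrt(f_auxiliary.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    z = norm.ppf(1 - alpha / 2)
    return est, (est - z * se, est + z * se)

rng = np.random.default_rng(1)
truth = rng.normal(1.0, 1.0, 2000)
biased = truth + 0.3 + rng.normal(0.0, 0.2, 2000)  # biased auxiliary predictor
est, ci = ppi_mean_ci(truth[:100], biased[:100], biased[100:])
# `ci` covers the true mean (1.0) despite the predictor's +0.3 bias.
```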
[600] Sparse Equation Matching: A Derivative-Free Learning for General-Order Dynamical Systems
Jiaqiang Li, Jianbin Tan, Xueqin Wang
Main category: cs.LG
TL;DR: SEM is a derivative-free framework for equation discovery in general-order dynamical systems, validated through simulations and EEG data analysis.
Details
Motivation: Existing methods rely on accurate derivative estimation and are limited to first-order systems, restricting real-world applicability.Method: SEM uses integral-based sparse regression with Green’s functions for derivative-free estimation in general-order systems.
Result: SEM outperforms derivative-based methods in simulations and identifies task-specific brain connectivity patterns in EEG data.
Conclusion: SEM provides a versatile and effective approach for uncovering dynamics in complex systems, demonstrated in brain connectivity analysis.
Abstract: Equation discovery is a fundamental learning task for uncovering the underlying dynamics of complex systems, with wide-ranging applications in areas such as brain connectivity analysis, climate modeling, gene regulation, and physical system simulation. However, many existing approaches rely on accurate derivative estimation and are limited to first-order dynamical systems, restricting their applicability to real-world scenarios. In this work, we propose sparse equation matching (SEM), a unified framework that encompasses several existing equation discovery methods under a common formulation. SEM introduces an integral-based sparse regression method using Green’s functions, enabling derivative-free estimation of differential operators and their associated driving functions in general-order dynamical systems. The effectiveness of SEM is demonstrated through extensive simulations, benchmarking its performance against derivative-based approaches. We then apply SEM to electroencephalographic (EEG) data recorded during multiple oculomotor tasks, collected from 52 participants in a brain-computer interface experiment. Our method identifies active brain regions across participants and reveals task-specific connectivity patterns. These findings offer valuable insights into brain connectivity and the underlying neural mechanisms.
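The derivative-free trick is easiest to see in the first-order special case, where integrating both sides of dx/dt = f(x) replaces derivative estimation with cumulative integrals of library terms; SEM's Green's-function formulation generalizes this to general-order systems. A toy sparse regression in that integral form:

```python
import numpy as np
from sklearn.linear_model import Lasso

# For dx/dt = f(x):  x(t) - x(0) = \int_0^t f(x(s)) ds, so regress the
# left-hand side on cumulative integrals of candidate library terms.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 500)
x = np.exp(-0.5 * t) + 0.01 * rng.standard_normal(t.size)  # noisy decay, dx/dt = -0.5 x

library = np.stack([np.ones_like(x), x, x**2], axis=1)  # candidates: 1, x, x^2
dt = t[1] - t[0]
integrated = np.cumsum(library, axis=0) * dt            # \int theta_j(x(s)) ds

coef = Lasso(alpha=1e-3, fit_intercept=False).fit(integrated, x - x[0]).coef_
print(coef)  # sparse, with the x-term's coefficient near -0.5
```

No derivative of the noisy signal is ever computed, which is the point: integration smooths the noise that differentiation would amplify.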
[601] Cluster Purge Loss: Structuring Transformer Embeddings for Equivalent Mutants Detection
Adelaide Danilov, Aria Nourbakhsh, Christoph Schommer
Main category: cs.LG
TL;DR: A novel framework combining cross-entropy loss with Cluster Purge Loss improves embedding space structure for equivalent code mutant detection, outperforming traditional methods.
Details
Motivation: Existing fine-tuning methods for transformer models often fail to structure embedding spaces to reflect nuanced intra-class semantic relationships, crucial for tasks like equivalent mutant detection.Method: The framework integrates cross-entropy loss with Cluster Purge Loss, focusing on fine-grained intra-class differences and dynamically adjusting borders for better separation.
Result: The approach achieves state-of-the-art performance in equivalent mutant detection and produces a more interpretable embedding space.
Conclusion: The proposed method effectively structures the embedding space, enhancing performance and interpretability for code processing tasks.
Abstract: Recent pre-trained transformer models achieve superior performance in various code processing objectives. However, although effective at optimizing decision boundaries, common approaches for fine-tuning them for downstream classification tasks - distance-based methods or training an additional classification head - often fail to thoroughly structure the embedding space to reflect nuanced intra-class semantic relationships. Equivalent code mutant detection is one of these tasks, where the quality of the embedding space is crucial to the performance of the models. We introduce a novel framework that integrates cross-entropy loss with a deep metric learning objective, termed Cluster Purge Loss. This objective, unlike conventional approaches, concentrates on adjusting fine-grained differences within each class, encouraging the separation of instances based on semantic equivalence to the class center using dynamically adjusted borders. Employing UniXCoder as the base model, our approach demonstrates state-of-the-art performance in the domain of equivalent mutant detection and produces a more interpretable embedding space.
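A hypothetical rendering of a "purge"-style intra-class term makes the idea concrete: per class, compute the center and a dynamically adjusted border (here, a distance quantile), and penalize members lying outside it; the full objective adds this to cross-entropy. The actual Cluster Purge Loss may differ in functional form.

```python
import torch
import torch.nn.functional as F

def cluster_purge_style_loss(emb, labels, border_quantile=0.5):
    """Per class: pull instances inside a dynamically adjusted border around
    the class center (hypothetical form inspired by the paper's description)."""
    loss = emb.new_zeros(())
    classes = labels.unique()
    for c in classes:
        members = emb[labels == c]
        center = members.mean(dim=0)
        d = (members - center).norm(dim=1)
        border = torch.quantile(d, border_quantile)  # dynamic border per class
        loss = loss + F.relu(d - border).mean()      # "purge" the stragglers
    return loss / len(classes)

emb = torch.randn(64, 128, requires_grad=True)   # e.g., UniXCoder embeddings
labels = torch.randint(0, 2, (64,))              # equivalent vs. non-equivalent
metric_loss = cluster_purge_style_loss(emb, labels)
metric_loss.backward()                           # combined with CE in training
```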
[602] Feed-anywhere ANN (I) Steady Discrete $\to$ Diffusing on Graph Hidden States
Dmitry Pasechnyuk-Vilensky, Daniil Doroshenko
Main category: cs.LG
TL;DR: A novel framework for learning hidden graph structures using geometric analysis and nonlinear dynamics, with theoretical guarantees and improved generalization bounds.
Details
Motivation: To develop a method for uncovering hidden graph structures from data, leveraging geometric and dynamic properties for better generalization and theoretical guarantees.Method: 1. Defines discrete Sobolev spaces on graphs. 2. Introduces gauge-equivalent nonlinear dynamics with stable solutions. 3. Develops a stochastic gradient algorithm with sparsity regularization.
Result: The model achieves topological correctness, metric convergence, and efficient search space utilization, outperforming standard neural networks in generalization.
Conclusion: The proposed dynamics-based framework effectively learns hidden graph structures with strong theoretical guarantees and improved performance.
Abstract: We propose a novel framework for learning hidden graph structures from data using geometric analysis and nonlinear dynamics. Our approach: (1) Defines discrete Sobolev spaces on graphs for scalar/vector fields, establishing key functional properties; (2) Introduces gauge-equivalent nonlinear Schrödinger and Landau–Lifshitz dynamics with provable stable stationary solutions smoothly dependent on input data and graph weights; (3) Develops a stochastic gradient algorithm over graph moduli spaces with sparsity regularization. Theoretically, we guarantee: topological correctness (homology recovery), metric convergence (Gromov–Hausdorff), and efficient search space utilization. Our dynamics-based model achieves stronger generalization bounds than standard neural networks, with complexity dependent on the data manifold’s topology.
[603] Meta Fusion: A Unified Framework For Multimodality Fusion with Mutual Learning
Ziyi Liang, Annie Qu, Babak Shahbaba
Main category: cs.LG
TL;DR: Meta Fusion is a flexible framework unifying traditional multimodal data fusion methods, leveraging mutual and ensemble learning to enhance predictive performance through soft information sharing.
Details
Motivation: The need for improved predictive power in machine learning across applications like autonomous driving and medical diagnosis drives the development of Meta Fusion.Method: Meta Fusion constructs a cohort of models from latent representations of modalities, using soft information sharing to boost performance. It is model-agnostic in learning representations.
Result: Theoretically reduces generalization error; empirically outperforms traditional fusion in simulations and real-world applications like Alzheimer’s detection.
Conclusion: Meta Fusion offers a principled, adaptable solution for multimodal data fusion, demonstrating superior performance over conventional methods.
Abstract: Developing effective multimodal data fusion strategies has become increasingly essential for improving the predictive power of statistical machine learning methods across a wide range of applications, from autonomous driving to medical diagnosis. Traditional fusion methods, including early, intermediate, and late fusion, integrate data at different stages, each offering distinct advantages and limitations. In this paper, we introduce Meta Fusion, a flexible and principled framework that unifies these existing strategies as special cases. Motivated by deep mutual learning and ensemble learning, Meta Fusion constructs a cohort of models based on various combinations of latent representations across modalities, and further boosts predictive performance through soft information sharing within the cohort. Our approach is model-agnostic in learning the latent representations, allowing it to flexibly adapt to the unique characteristics of each modality. Theoretically, our soft information sharing mechanism reduces the generalization error. Empirically, Meta Fusion consistently outperforms conventional fusion strategies in extensive simulation studies. We further validate our approach on real-world applications, including Alzheimer’s disease detection and neural decoding.
[604] EcoTransformer: Attention without Multiplication
Xin Gao, Xingming Xu
Main category: cs.LG
TL;DR: EcoTransformer replaces dot-product attention with a Laplacian kernel convolution, reducing energy costs while maintaining performance.
Details
Motivation: The computational intensity and high energy costs of the Transformer's scaled dot-product attention mechanism.Method: Constructs output context vectors using Laplacian kernel convolution with L1 metric distances between queries and keys, avoiding matrix multiplication.
Result: Performs comparably or better than dot-product attention in NLP, bioinformatics, and vision tasks, with lower energy consumption.
Conclusion: EcoTransformer offers an efficient alternative to traditional attention mechanisms, balancing performance and energy efficiency.
Abstract: The Transformer, with its scaled dot-product attention mechanism, has become a foundational architecture in modern AI. However, this mechanism is computationally intensive and incurs substantial energy costs. We propose a new Transformer architecture EcoTransformer, in which the output context vector is constructed as the convolution of the values using a Laplacian kernel, where the distances are measured by the L1 metric between the queries and keys. Compared to dot-product based attention, the new attention score calculation is free of matrix multiplication. It performs on par with, or even surpasses, scaled dot-product attention in NLP, bioinformatics, and vision tasks, while consuming significantly less energy.
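The attention rule is compact enough to write out. In the sketch below, scores come from a Laplacian kernel over L1 query-key distances, built from subtractions and absolute values only — no query-key matrix multiplication; the normalization step is an assumption, as the abstract does not specify how the kernel weights are aggregated.

```python
import torch

def eco_attention(q, k, v, bandwidth=1.0, normalize=True):
    """Laplacian-kernel attention over L1 distances; q, k, v: (batch, seq, dim)."""
    # Pairwise L1 distances, (batch, seq_q, seq_k): no Q @ K^T matmul.
    d = (q.unsqueeze(2) - k.unsqueeze(1)).abs().sum(-1)
    w = torch.exp(-d / bandwidth)          # Laplacian kernel score
    if normalize:
        w = w / w.sum(-1, keepdim=True)    # assumed; paper may aggregate differently
    return w @ v                           # convolve the values with the kernel

q = k = v = torch.randn(2, 16, 32)
out = eco_attention(q, k, v)               # (2, 16, 32), like standard attention
```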
[605] Graded Transformers: A Symbolic-Geometric Approach to Structured Learning
Tony Shaska Sr
Main category: cs.LG
TL;DR: The Graded Transformer framework introduces algebraic inductive biases into sequence models via grading transformations, proposing LGT and EGT architectures for hierarchical structure and efficiency.
Details
Motivation: To enhance structured data processing by embedding algebraic and geometric principles into transformers, improving interpretability and efficiency.Method: Proposes Linearly Graded Transformer (LGT) and Exponentially Graded Transformer (EGT) with parameterized scaling operators and graded loss functions.
Result: Theoretical guarantees include universal approximation, reduced sample complexity, and robustness. Applications span diverse fields like NLP, physics, and biology.
Conclusion: The framework advances structured deep learning, offering interpretable and efficient alternatives to data-driven models.
Abstract: We introduce the Graded Transformer framework, a novel class of sequence models that embeds algebraic inductive biases through grading transformations on vector spaces. Extending the theory of Graded Neural Networks (GNNs), we propose two architectures: the Linearly Graded Transformer (LGT) and the Exponentially Graded Transformer (EGT). These models apply parameterized scaling operators-governed by fixed or learnable grading tuples and, for EGT, exponential factors to infuse hierarchical structure into attention and representation layers, enhancing efficiency for structured data. We derive rigorous theoretical guarantees, including universal approximation theorems for continuous and Sobolev functions, reduced sample complexity via effective VC dimension bounds, Lipschitz continuity of graded operations, and robustness to adversarial perturbations. A graded loss function ensures gradient stability and alignment with domain priors during optimization. By treating grades as differentiable parameters, the framework enables adaptive feature prioritization, overcoming limitations of fixed grades in prior work. The Graded Transformer holds transformative potential for hierarchical learning and neurosymbolic reasoning, with applications spanning algebraic geometry (e.g., moduli spaces and zeta functions), physics (e.g., multiscale simulations), natural language processing (e.g., syntactic parsing), biological sequence analysis (e.g., variant prediction), and emerging areas like graph neural networks and financial modeling. This work advances structured deep learning by fusing geometric and algebraic principles with attention mechanisms, offering a mathematically grounded alternative to data-driven models and paving the way for interpretable, efficient systems in complex domains.
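Stripped to its core, the grading operator is an elementwise scaling of the representation governed by a (possibly learnable) grading tuple, with a linear or exponential parameterization for the two variants. The module below sketches that operator alone, not the full architecture, and its exact parameterization is assumed.

```python
import torch

class GradedScaling(torch.nn.Module):
    """Elementwise grading of features by a learnable grading tuple.
    exponential=False ~ a linearly graded (LGT-style) scaling;
    exponential=True  ~ an exponentially graded (EGT-style) scaling."""
    def __init__(self, dim, exponential=False):
        super().__init__()
        self.grades = torch.nn.Parameter(torch.zeros(dim))  # grading tuple
        self.exponential = exponential

    def forward(self, x):
        scale = torch.exp(self.grades) if self.exponential else 1.0 + self.grades
        return x * scale

x = torch.randn(4, 10, 64)
y = GradedScaling(64, exponential=True)(GradedScaling(64)(x))
```

Treating the grades as differentiable parameters is what the summary calls adaptive feature prioritization: the per-feature scales are learned rather than fixed a priori.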
[606] Online Learning with Probing for Sequential User-Centric Selection
Tianyi Xu, Yiting Chen, Henger Li, Zheyong Bian, Emiliano Dall’Anese, Zizhan Zheng
Main category: cs.LG
TL;DR: The paper introduces the PUCS framework for sequential decision-making with costly probing, offering offline and online solutions with performance guarantees.
Details
Motivation: Addresses scenarios like ridesharing and content recommendation where resources and rewards are initially unknown, and probing is expensive.Method: Offline: greedy probing algorithm with approximation guarantee. Online: OLPA algorithm with regret bounds.
Result: Offline achieves ζ = (e-1)/(2e-1) approximation. Online achieves O(√T + ln² T) regret, with a matching lower bound Ω(√T).
Conclusion: The proposed solutions are effective, as validated by real-world experiments.
Abstract: We formalize sequential decision-making with information acquisition as the probing-augmented user-centric selection (PUCS) framework, where a learner first probes a subset of arms to obtain side information on resources and rewards, and then assigns $K$ plays to $M$ arms. PUCS covers applications such as ridesharing, wireless scheduling, and content recommendation, in which both resources and payoffs are initially unknown and probing is costly. For the offline setting with known distributions, we present a greedy probing algorithm with a constant-factor approximation guarantee $\zeta = (e-1)/(2e-1)$. For the online setting with unknown distributions, we introduce OLPA, a stochastic combinatorial bandit algorithm that achieves a regret bound $\mathcal{O}(\sqrt{T} + \ln^{2} T)$. We also prove a lower bound $\Omega(\sqrt{T})$, showing that the upper bound is tight up to logarithmic factors. Experiments on real-world data demonstrate the effectiveness of our solutions.
[607] Wine Characterisation with Spectral Information and Predictive Artificial Intelligence
Jianping Yao, Son N. Tran, Hieu Nguyen, Samantha Sawyer, Rocco Longo
Main category: cs.LG
TL;DR: The paper uses UV-Vis spectroscopy and ML to predict grape juice attributes and classify wine origin, achieving over 91% accuracy with SVM.
Details
Motivation: To improve traditional wine analysis by integrating spectroscopy and ML for sensory data and origin classification.Method: Combines UV-Vis spectrophotometry with ML techniques, focusing on spectral fingerprinting and SVM for prediction.
Result: SVM performed best, with accuracy and F1 scores over 91%. Key wavelengths were identified (250-420 nm).
Conclusion: The study offers a scalable, AI-driven solution for wine analysis, aiding ‘Smart Wineries’ and beverage industries.
Abstract: The purpose of this paper is to use absorbance data obtained by an ultraviolet-visible (UV-Vis) scanning spectrophotometer to predict the attributes of grape juice (GJ) assessed by human tasting and to classify the wine’s origin. The approach combines machine learning (ML) techniques with spectroscopy to find a relatively simple way to apply them in two stages of winemaking and to help improve traditional wine analysis methods regarding sensory data and wine origins. This technique overcomes the disadvantages of complex sensors by taking advantage of spectral fingerprinting technology, forming a comprehensive study of the use of AI in the wine analysis domain. In the results, Support Vector Machine (SVM) was the most efficient and robust in both the attribute and origin prediction tasks. Both the accuracy and F1 score of the origin prediction exceed 91%. The feature ranking approach found that the more influential wavelengths usually appear at the lower end of the scan range, 250 nm to 420 nm, which should help in selecting appropriate validation methods and sensors for extracting wine data in future research. This research provides new ideas and early solutions for the wine industry and other beverage industries to integrate big data and IoT in the future, significantly promoting the development of ‘Smart Wineries’.
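The modeling recipe — an SVM on per-wavelength absorbance features — is a few lines with scikit-learn. The arrays below are random stand-ins for real spectra and origin labels; the wavelength grid mirrors the influential 250-420 nm band the paper reports.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
wavelengths = np.arange(250, 421)            # one feature per nm, 250-420 nm
X = rng.random((120, wavelengths.size))      # absorbance spectra (toy data)
y = rng.integers(0, 3, 120)                  # wine origin classes

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
print(cross_val_score(clf, X, y, cv=5).mean())  # ~chance on random data
```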
[608] Aggregation-aware MLP: An Unsupervised Approach for Graph Message-passing
Xuanting Xie, Bingheng Li, Erlin Pan, Zhao Kang, Wenyu Chen
Main category: cs.LG
TL;DR: Proposes AMLP, an unsupervised framework adapting MLP to graph aggregation, improving performance in heterophilic graphs without relying on labeled data.
Details
Motivation: GNNs use fixed aggregators, leading to poor performance in heterophilic graphs. Existing solutions require labeled data, which is scarce.Method: AMLP uses graph reconstruction for high-order grouping and a single-layer network to encode heterophily, adapting MLP to aggregation.
Result: AMLP outperforms in node clustering and classification, showing adaptability to diverse graph scenarios.
Conclusion: AMLP offers a lightweight, unsupervised solution for graph learning, enhancing performance without labeled data.
Abstract: Graph Neural Networks (GNNs) have become a dominant approach to learning graph representations, primarily because of their message-passing mechanisms. However, GNNs typically adopt a fixed aggregator function such as Mean, Max, or Sum without principled reasoning behind the selection. This rigidity, especially in the presence of heterophily, often leads to poor, problem-dependent performance. Although some attempts address this by designing more sophisticated aggregation functions, these methods tend to rely heavily on labeled data, which is often scarce in real-world tasks. In this work, we propose a novel unsupervised framework, “Aggregation-aware Multilayer Perceptron” (AMLP), which shifts the paradigm from directly crafting aggregation functions to making MLP adaptive to aggregation. Our lightweight approach consists of two key steps: First, we utilize a graph reconstruction method that facilitates high-order grouping effects, and second, we employ a single-layer network to encode varying degrees of heterophily, thereby improving the capacity and applicability of the model. Extensive experiments on node clustering and classification demonstrate the superior performance of AMLP, highlighting its potential for diverse graph learning scenarios.
[609] Generative molecule evolution using 3D pharmacophore for efficient Structure-Based Drug Design
Yi He, Ailun Wang, Zhi Wang, Yu Liu, Xingyuan Xu, Wen Yan
Main category: cs.LG
TL;DR: MEVO is an evolutionary framework for structure-based drug design (SBDD) that addresses data scarcity by combining a VQ-VAE, diffusion model, and evolutionary strategy to generate high-affinity binders.
Details
Motivation: The limited training data for generative models in SBDD tasks hinders their application, despite advances in generative models for other fields.Method: MEVO integrates a VQ-VAE for molecule representation, a diffusion model for pharmacophore-guided generation, and an evolutionary strategy for optimization with physics-based scoring.
Result: MEVO successfully generates high-affinity binders for protein targets, including potent inhibitors for KRASG12D, validated by FEP calculations.
Conclusion: MEVO provides a versatile, data-efficient solution for SBDD tasks, overcoming data constraints and demonstrating strong performance in ligand design.
Abstract: Recent advances in generative models, particularly diffusion and auto-regressive models, have revolutionized fields like computer vision and natural language processing. However, their application to structure-based drug design (SBDD) remains limited due to critical data constraints. To address the limitation of training data for models targeting SBDD tasks, we propose an evolutionary framework named MEVO, which bridges the gap between billion-scale small-molecule datasets and the scarce protein-ligand complex dataset, effectively increasing the abundance of training data for generative SBDD models. MEVO is composed of three key components: a high-fidelity VQ-VAE for molecule representation in latent space, a diffusion model for pharmacophore-guided molecule generation, and a pocket-aware evolutionary strategy for molecule optimization with a physics-based scoring function. This framework efficiently generates high-affinity binders for various protein targets, validated with predicted binding affinities using free energy perturbation (FEP) methods. In addition, we showcase the capability of MEVO in designing potent inhibitors to KRAS$^{\textrm{G12D}}$, a challenging target in cancer therapeutics, with affinity similar to that of a known highly active inhibitor as evaluated by FEP calculations. With high versatility and generalizability, MEVO offers an effective and data-efficient model for various tasks in structure-based ligand design.
[610] Awesome-OL: An Extensible Toolkit for Online Learning
Zeyi Liu, Songqiao Hu, Pengyu Han, Jiaming Liu, Xiao He
Main category: cs.LG
TL;DR: Awesome-OL is a Python toolkit for online learning research, offering state-of-the-art algorithms, benchmark datasets, and visualization tools.
Details
Motivation: To support algorithm development and practical deployment in online learning by providing a unified, user-friendly framework.Method: Built on scikit-multiflow, Awesome-OL integrates advanced algorithms, datasets, and visualization for reproducible comparisons.
Result: A publicly available toolkit (https://github.com/liuzy0708/Awesome-OL) that balances usability with research flexibility.
Conclusion: Awesome-OL facilitates online learning research with its extensible and reproducible framework.
Abstract: In recent years, online learning has attracted increasing attention due to its adaptive capability to process streaming and non-stationary data. To facilitate algorithm development and practical deployment in this area, we introduce Awesome-OL, an extensible Python toolkit tailored for online learning research. Awesome-OL integrates state-of-the-art algorithms, providing a unified framework for reproducible comparisons, curated benchmark datasets, and multi-modal visualization. Built upon the scikit-multiflow open-source infrastructure, Awesome-OL emphasizes user-friendly interactions without compromising research flexibility or extensibility. The source code is publicly available at: https://github.com/liuzy0708/Awesome-OL.
[611] ASNN: Learning to Suggest Neural Architectures from Performance Distributions
Jinwook Hong
Main category: cs.LG
TL;DR: ASNN learns the relationship between NN architecture and accuracy, suggesting improved architectures, outperforming random search.
Details
Motivation: No general function maps NN structure to accuracy, making architecture design heuristic or search-based. ASNN aims to automate and improve this process.Method: ASNN is trained on datasets of TensorFlow models with varying layers and nodes, using accuracy as input and architecture as output. It iteratively predicts better architectures.
Result: ASNN suggested architectures outperforming original training data in 2-layer and 3-layer cases, improving mean test accuracies.
Conclusion: ASNN offers an efficient alternative to random search for NN architecture optimization, promising for automating design.
Abstract: The architecture of a neural network (NN) plays a critical role in determining its performance. However, there is no general closed-form function that maps between network structure and accuracy, making the process of architecture design largely heuristic or search-based. In this study, we propose the Architecture Suggesting Neural Network (ASNN), a model designed to learn the relationship between NN architecture and its test accuracy, and to suggest improved architectures accordingly. To train ASNN, we constructed datasets using TensorFlow-based models with varying numbers of layers and nodes. Experimental results were collected for both 2-layer and 3-layer architectures across a grid of configurations, each evaluated with 10 repeated trials to account for stochasticity. Accuracy values were treated as inputs, and architectural parameters as outputs. The trained ASNN was then used iteratively to predict architectures that yield higher performance. In both 2-layer and 3-layer cases, ASNN successfully suggested architectures that outperformed the best results found in the original training data. Repeated prediction and retraining cycles led to the discovery of architectures with improved mean test accuracies, demonstrating the model’s capacity to generalize the performance-structure relationship. These results suggest that ASNN provides an efficient alternative to random search for architecture optimization, and offers a promising approach toward automating neural network design. “Parts of the manuscript, including text editing and expression refinement, were supported by OpenAI’s ChatGPT. All content was reviewed and verified by the authors.”
[612] Partial Domain Adaptation via Importance Sampling-based Shift Correction
Cheng-Jun Guo, Chuan-Xian Ren, You-Wei Luo, Xiao-Lin Xu, Hong Yan
Main category: cs.LG
TL;DR: The paper proposes IS$^2$C, a novel importance sampling-based method for partial domain adaptation (PDA), addressing label distribution shifts and improving model generalization.
Details
Motivation: PDA faces challenges in correcting label distribution shifts and avoiding overfitting on the source domain. Existing reweighing techniques are insufficient.Method: IS$^2$C samples labeled data from a new domain with target-like label distribution, uses mixture distribution for shift correction, and aligns conditional distributions via optimal transport.
Result: Theoretical guarantees show IS$^2$C dominates generalization error. Experiments on PDA benchmarks confirm its superiority over existing methods.
Conclusion: IS$^2$C effectively addresses PDA challenges, offering interpretability and improved performance.
Abstract: Partial domain adaptation (PDA) is a challenging task in real-world machine learning scenarios. It aims to transfer knowledge from a labeled source domain to a related unlabeled target domain, where the support set of the source label distribution subsumes the target one. Previous PDA works managed to correct the label distribution shift by weighting samples in the source domain. However, the simple reweighing technique cannot explore the latent structure and sufficiently use the labeled data, and then models are prone to over-fitting on the source domain. In this work, we propose a novel importance sampling-based shift correction (IS$^2$C) method, where new labeled data are sampled from a built sampling domain, whose label distribution is supposed to be the same as the target domain, to characterize the latent structure and enhance the generalization ability of the model. We provide theoretical guarantees for IS$^2$C by proving that the generalization error can be sufficiently dominated by IS$^2$C. In particular, by implementing sampling with the mixture distribution, the extent of shift between source and sampling domains can be connected to generalization error, which provides an interpretable way to build IS$^2$C. To improve knowledge transfer, an optimal transport-based independence criterion is proposed for conditional distribution alignment, where the computation of the criterion can be adjusted to reduce the complexity from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^2)$ in realistic PDA scenarios. Extensive experiments on PDA benchmarks validate the theoretical results and demonstrate the effectiveness of our IS$^2$C over existing methods.
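The resampling step at the heart of IS²C is sketchable: weight each labeled source sample by the ratio of target to source class probability, then sample a new labeled set whose label distribution matches the target's. The conditional-alignment step via optimal transport is not shown, and the target label distribution is assumed known here for illustration.

```python
import numpy as np

def is2c_style_resample(X_src, y_src, target_label_dist, n_samples, seed=0):
    """Draw a sampling domain whose label distribution matches the target's
    by importance sampling the labeled source data."""
    rng = np.random.default_rng(seed)
    n_classes = len(target_label_dist)
    src_dist = np.bincount(y_src, minlength=n_classes) / len(y_src)
    # Per-sample weight: target class probability over source class probability.
    w = np.array([target_label_dist[c] / max(src_dist[c], 1e-12) for c in y_src])
    idx = rng.choice(len(y_src), size=n_samples, replace=True, p=w / w.sum())
    return X_src[idx], y_src[idx]

rng = np.random.default_rng(1)
X, y = rng.random((1000, 8)), rng.integers(0, 4, 1000)
# Partial DA: the target domain only contains classes 0 and 1.
Xs, ys = is2c_style_resample(X, y, [0.6, 0.4, 0.0, 0.0], n_samples=500)
assert set(ys) <= {0, 1}
```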
[613] Technical Indicator Networks (TINs): An Interpretable Neural Architecture Modernizing Classical Technical Analysis for Adaptive Algorithmic Trading
Longfei Lu
Main category: cs.LG
TL;DR: Classical financial indicators are special cases of neural networks with fixed weights. Technical Indicator Networks (TINs) generalize and upgrade these indicators using neural architectures.
Details
Motivation: To bridge traditional technical analysis with modern AI by showing that classical indicators are neural networks with fixed weights.Method: Proposes TINs, a neural architecture that replicates and upgrades traditional indicators, supporting n-dimensional inputs like price, volume, and sentiment.
Result: TINs modernize technical analysis by encoding domain knowledge into neural structures, enhancing algorithmic trading.
Conclusion: TINs connect legacy indicators with AI, advancing algorithmic trading.
Abstract: This work proposes that a vast majority of classical technical indicators in financial analysis are, in essence, special cases of neural networks with fixed and interpretable weights. It is shown that nearly all such indicators, such as moving averages, momentum-based oscillators, volatility bands, and other commonly used technical constructs, can be reconstructed topologically as modular neural network components. Technical Indicator Networks (TINs) are introduced as a general neural architecture that replicates and structurally upgrades traditional indicators by supporting n-dimensional inputs such as price, volume, sentiment, and order book data. By encoding domain-specific knowledge into neural structures, TINs modernize the foundational logic of technical analysis and propel algorithmic trading into a new era, bridging the legacy of proven indicators with the potential of contemporary AI systems.
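The paper's central claim is easy to verify for the simplest indicator: a simple moving average is exactly a 1D convolution with fixed, uniform, interpretable weights — a tiny neural network whose parameters are known in advance rather than learned. A PyTorch check:

```python
import torch

window = 5
sma = torch.nn.Conv1d(1, 1, kernel_size=window, bias=False)
with torch.no_grad():
    sma.weight.fill_(1.0 / window)   # fixed weights reproduce the indicator

prices = torch.cumsum(torch.randn(1, 1, 100), dim=-1)  # toy price series
conv_sma = sma(prices)                                  # network's output
classic_sma = prices.unfold(2, window, 1).mean(-1)      # textbook definition
assert torch.allclose(conv_sma, classic_sma, atol=1e-6)
```

In a TIN, such fixed-weight blocks become modules inside a larger network, which is one sense in which the architecture can structurally upgrade the classical indicators it replicates.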
[614] Protein-SE(3): Benchmarking SE(3)-based Generative Models for Protein Structure Design
Lang Yu, Zhangyang Gao, Cheng Tan, Qin Chen, Jie Zhou, Liang He
Main category: cs.LG
TL;DR: The paper introduces Protein-SE(3), a modular benchmark for evaluating SE(3)-based generative models in protein structure design, unifying training and evaluation for fair comparison.
Details
Motivation: The lack of a standardized benchmark for SE(3)-based protein modeling hinders fair comparison and comprehensive investigation of methods.Method: Proposes Protein-SE(3), integrating diverse generative models (DDPM, Score Matching, Flow Matching) under a unified framework with shared training data and metrics.
Result: Provides a high-level mathematical abstraction for fast prototyping and releases the first comprehensive benchmark for SE(3)-based protein design.
Conclusion: Protein-SE(3) enables fair comparison and accelerates future research in protein structure design.
Abstract: SE(3)-based generative models have shown great promise in protein geometry modeling and effective structure design. However, the field currently lacks a modularized benchmark to enable comprehensive investigation and fair comparison of different methods. In this paper, we propose Protein-SE(3), a new benchmark based on a unified training framework, which comprises protein scaffolding tasks, integrated generative models, high-level mathematical abstraction, and diverse evaluation metrics. Recent advanced generative models designed for protein scaffolding, from multiple perspectives like DDPM (Genie1 and Genie2), Score Matching (FrameDiff and RfDiffusion) and Flow Matching (FoldFlow and FrameFlow) are integrated into our framework. All integrated methods are fairly investigated with the same training dataset and evaluation metrics. Furthermore, we provide a high-level abstraction of the mathematical foundations behind the generative models, enabling fast prototyping of future algorithms without reliance on explicit protein structures. Accordingly, we release the first comprehensive benchmark built upon unified training framework for SE(3)-based protein structure design, which is publicly accessible at https://github.com/BruthYU/protein-se3.
[615] Learning from Expert Factors: Trajectory-level Reward Shaping for Formulaic Alpha Mining
Junjie Zhao, Chengxi Zhang, Chenkai Wang, Peng Yang
Main category: cs.LG
TL;DR: TLRS improves RL for mining alpha factors by providing dense rewards and reducing training variance, outperforming existing methods in predictive power and efficiency.
Details
Motivation: Existing RL methods for mining alpha factors suffer from sparse rewards, limiting exploration and destabilizing training.Method: Proposes Trajectory-level Reward Shaping (TLRS), which offers dense rewards via subsequence-level similarity to expert formulas and includes a reward centering mechanism.
Result: TLRS boosts predictive power (9.29% improvement in Rank Information Coefficient) and achieves constant time complexity, enhancing efficiency.
Conclusion: TLRS effectively addresses sparse rewards and training instability, outperforming baselines in both performance and computational efficiency.
Abstract: Reinforcement learning (RL) has successfully automated the complex process of mining formulaic alpha factors for creating interpretable and profitable investment strategies. However, existing methods are hampered by sparse rewards in the underlying Markov Decision Process. This inefficiency limits the exploration of the vast symbolic search space and destabilizes the training process. To address this, Trajectory-level Reward Shaping (TLRS), a novel reward shaping method, is proposed. TLRS provides dense, intermediate rewards by measuring the subsequence-level similarity between partially generated expressions and a set of expert-designed formulas. Furthermore, a reward centering mechanism is introduced to reduce training variance. Extensive experiments on six major Chinese and U.S. stock indices show that TLRS significantly improves the predictive power of mined factors, boosting the Rank Information Coefficient by 9.29% over existing potential-based shaping algorithms. Notably, TLRS achieves a major leap in computational efficiency by reducing its time complexity with respect to the feature dimension from linear to constant, which is a significant improvement over distance-based baselines.
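The shaping signal is a similarity between a partial expression and expert formulas. The sketch below uses a longest-common-subsequence ratio as the subsequence-level similarity, plus mean-centering of rewards; both are illustrative stand-ins (the paper's exact measure must differ, since it reportedly runs in constant time with respect to the feature dimension).

```python
import numpy as np

def lcs_len(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    for i, ta in enumerate(a, 1):
        for j, tb in enumerate(b, 1):
            dp[i, j] = dp[i-1, j-1] + 1 if ta == tb else max(dp[i-1, j], dp[i, j-1])
    return int(dp[len(a), len(b)])

def dense_reward(partial_tokens, expert_formulas):
    """Intermediate reward: best subsequence overlap with any expert formula."""
    return max(lcs_len(partial_tokens, e) / max(len(e), 1) for e in expert_formulas)

def center(rewards):
    """Reward centering to reduce training variance."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

experts = [["close", "/", "ma", "(", "close", ",", "20", ")"]]
print(dense_reward(["close", "/", "ma"], experts))   # 0.375: partial credit
print(center([0.1, 0.4, 0.375]))
```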
[616] Data-Efficient Prediction-Powered Calibration via Cross-Validation
Seonghoon Yoo, Houssem Sifaou, Sangwoo Park, Joonhyuk Kang, Osvaldo Simeone
Main category: cs.LG
TL;DR: A novel method efficiently uses limited calibration data to fine-tune a predictor and estimate synthetic label bias, ensuring rigorous coverage for AI decisions.
Details
Motivation: Overcoming the scarcity of calibration data for quantifying AI decision uncertainty by leveraging synthetic labels.Method: Simultaneously fine-tunes a predictor and estimates synthetic label bias using limited calibration data.
Result: Produces prediction sets with rigorous coverage guarantees, validated on an indoor localization problem.
Conclusion: The method effectively addresses calibration data scarcity and improves performance.
Abstract: Calibration data are necessary to formally quantify the uncertainty of the decisions produced by an existing artificial intelligence (AI) model. To overcome the common issue of scarce calibration data, a promising approach is to employ synthetic labels produced by a (generally different) predictive model. However, fine-tuning the label-generating predictor on the inference task of interest, as well as estimating the residual bias of the synthetic labels, demand additional data, potentially exacerbating the calibration data scarcity problem. This paper introduces a novel approach that efficiently utilizes limited calibration data to simultaneously fine-tune a predictor and estimate the bias of the synthetic labels. The proposed method yields prediction sets with rigorous coverage guarantees for AI-generated decisions. Experimental results on an indoor localization problem validate the effectiveness and performance gains of our solution.
[617] Approximating Full Conformal Prediction for Neural Network Regression with Gauss-Newton Influence
Dharmesh Tailor, Alvaro H. C. Correia, Eric Nalisnick, Christos Louizos
Main category: cs.LG
TL;DR: The paper proposes a method to construct prediction intervals for neural network regressors post-hoc without held-out data, improving calibration and sharpness compared to existing methods like Laplace’s method and split conformal prediction.
Details
Motivation: Deep learning models in safety-critical areas require reliable uncertainty quantification, but existing methods like Laplace's method and split-CP have limitations (miscalibration or reduced efficiency).Method: The method approximates full conformal prediction (full-CP) by training once and perturbing model parameters locally using Gauss-Newton influence, coupled with network linearization to efficiently compute prediction intervals.
Result: The proposed method produces locally-adaptive and often tighter prediction intervals than split-CP on standard regression benchmarks and bounding box localization tasks.
Conclusion: The approach offers a practical and efficient way to quantify uncertainty post-hoc without held-out data, addressing limitations of existing methods.
Abstract: Uncertainty quantification is an important prerequisite for the deployment of deep learning models in safety-critical areas. Yet, this hinges on the uncertainty estimates being useful to the extent the prediction intervals are well-calibrated and sharp. In the absence of inherent uncertainty estimates (e.g. pretrained models predicting only point estimates), popular approaches that operate post-hoc include Laplace’s method and split conformal prediction (split-CP). However, Laplace’s method can be miscalibrated when the model is misspecified and split-CP requires sample splitting, and thus comes at the expense of statistical efficiency. In this work, we construct prediction intervals for neural network regressors post-hoc without held-out data. This is achieved by approximating the full conformal prediction method (full-CP). Whilst full-CP nominally requires retraining the model for every test point and candidate label, we propose to train just once and locally perturb model parameters using Gauss-Newton influence to approximate the effect of retraining. Coupled with linearization of the network, we express the absolute residual nonconformity score as a piecewise linear function of the candidate label allowing for an efficient procedure that avoids the exhaustive search over the output space. On standard regression benchmarks and bounding box localization, we show the resulting prediction intervals are locally-adaptive and often tighter than those of split-CP.
[618] MIPS: a Multimodal Infinite Polymer Sequence Pre-training Framework for Polymer Property Prediction
Jiaxi Wang, Yaosen Min, Xun Zhu, Miao Li, Ji Wu
Main category: cs.LG
TL;DR: The paper introduces MIPS, a framework for polymer property prediction by modeling polymers as infinite monomer sequences, integrating topological and spatial information, and outperforming existing methods.
Details
Motivation: Accurate polymer property prediction is crucial for design and application, but current monomer-based models fail to capture polymerization-induced property changes.
Method: MIPS uses multimodal pre-training, combining topological (generalized MPM and LGA) and spatial (3D descriptors) information, with a cross-modal fusion mechanism.
Result: MIPS achieves state-of-the-art performance across eight polymer property prediction tasks.
Conclusion: MIPS effectively addresses limitations of existing models by leveraging infinite sequence representation and multimodal fusion, enhancing polymer property prediction.
Abstract: Polymers, composed of repeating structural units called monomers, are fundamental materials in daily life and industry. Accurate property prediction for polymers is essential for their design, development, and application. However, existing modeling approaches, which typically represent polymers by the constituent monomers, struggle to capture the overall properties of a polymer, since these properties change during the polymerization process. In this study, we propose a Multimodal Infinite Polymer Sequence (MIPS) pre-training framework, which represents polymers as infinite sequences of monomers and integrates both topological and spatial information for comprehensive modeling. From the topological perspective, we generalize the message passing mechanism (MPM) and graph attention mechanism (GAM) to infinite polymer sequences. For MPM, we demonstrate that applying MPM to infinite polymer sequences is equivalent to applying MPM on the induced star-linking graph of monomers. For GAM, we propose to further replace global graph attention with localized graph attention (LGA). Moreover, we show the robustness of the “star linking” strategy through the Repeat and Shift Invariance Test (RSIT). Despite its robustness, the “star linking” strategy exhibits limitations when monomer side chains contain ring structures, a common characteristic of polymers, as it fails the Weisfeiler-Lehman (WL) test. To overcome this issue, we propose backbone embedding to enhance the capability of MPM and LGA on infinite polymer sequences. From the spatial perspective, we extract 3D descriptors of repeating monomers to capture spatial information. Finally, we design a cross-modal fusion mechanism to unify the topological and spatial information. Experimental validation across eight diverse polymer property prediction tasks reveals that MIPS achieves state-of-the-art performance.
[619] Cultivating Helpful, Personalized, and Creative AI Tutors: A Framework for Pedagogical Alignment using Reinforcement Learning
Siyu Song, Wentao Liu, Ye Lu, Ruohua Zhang, Tao Liu, Jinze Lv, Xinyun Wang, Aimin Zhou, Fei Tan, Bo Jiang, Hao Hao
Main category: cs.LG
TL;DR: EduAlign is a framework to align LLMs with educational principles (Helpfulness, Personalization, Creativity) using a reward model and fine-tuning, improving their effectiveness as educational assistants.
Details
Motivation: Standard LLMs lack alignment with pedagogical principles, limiting their educational utility. EduAlign aims to bridge this gap.
Method: 1. Curate and annotate 8k educational interactions for HPC dimensions. 2. Train HPC-RM (reward model). 3. Fine-tune LLM using GRPO with HPC-RM as reward.
Result: Fine-tuned model shows significant improvement in pedagogical alignment (HPC) while maintaining general-domain performance.
Conclusion: EduAlign offers a scalable method to enhance LLMs for education, enabling more pedagogically aligned AI tutors.
Abstract: The integration of large language models (LLMs) into education presents unprecedented opportunities for scalable personalized learning. However, standard LLMs often function as generic information providers, lacking alignment with fundamental pedagogical principles such as helpfulness, student-centered personalization, and creativity cultivation. To bridge this gap, we propose EduAlign, a novel framework designed to guide LLMs toward becoming more effective and responsible educational assistants. EduAlign consists of two main stages. In the first stage, we curate a dataset of 8k educational interactions and annotate them, both manually and automatically, along three key educational dimensions: Helpfulness, Personalization, and Creativity (HPC). These annotations are used to train HPC-RM, a multi-dimensional reward model capable of accurately scoring LLM outputs according to these educational principles. We further evaluate the consistency and reliability of this reward model. In the second stage, we leverage HPC-RM as a reward signal to fine-tune a pre-trained LLM using Group Relative Policy Optimization (GRPO) on a set of 2k diverse prompts. We then assess the pre- and post-finetuning models on both educational and general-domain benchmarks across the three HPC dimensions. Experimental results demonstrate that the fine-tuned model exhibits significantly improved alignment with pedagogical helpfulness, personalization, and creativity stimulation. This study presents a scalable and effective approach to aligning LLMs with nuanced and desirable educational traits, paving the way for the development of more engaging, pedagogically aligned AI tutors.
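The second-stage optimizer is standard GRPO, whose group-relative advantage is easy to sketch; the equal weighting used below to scalarize the three HPC reward dimensions is an assumption, not the paper's choice.

```python
import numpy as np

def hpc_reward(scores, weights=(1/3, 1/3, 1/3)):
    """Scalarize the multi-dimensional HPC reward (helpfulness,
    personalization, creativity). Equal weights are an assumption."""
    return float(np.dot(scores, weights))

def grpo_advantages(group_rewards):
    """GRPO computes advantages by normalizing each sampled response's
    reward against the statistics of its own group."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# One prompt, four sampled responses, each scored on the three HPC axes.
group = [hpc_reward(s) for s in [(0.9, 0.7, 0.4), (0.5, 0.6, 0.8),
                                 (0.2, 0.3, 0.1), (0.8, 0.9, 0.9)]]
print(grpo_advantages(group))   # above-average responses get positive advantage
```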
[620] From Observations to Causations: A GNN-based Probabilistic Prediction Framework for Causal Discovery
Rezaur Rashid, Gabriel Terejanu
Main category: cs.LG
TL;DR: A novel GNN-based probabilistic framework for causal discovery outperforms traditional and recent methods in accuracy and scalability.
Details
Motivation: Traditional causal discovery methods struggle with scalability and capturing global structural information in large datasets.
Method: Uses a GNN to encode node and edge attributes into a unified graph representation, trained on synthetic datasets with statistical measures.
Result: Outperforms traditional and non-GNN-based methods in accuracy and scalability on synthetic and real-world datasets.
Conclusion: The probabilistic framework improves causal structure learning, benefiting decision-making and scientific discovery.
Abstract: Causal discovery from observational data is challenging, especially with large datasets and complex relationships. Traditional methods often struggle with scalability and capturing global structural information. To overcome these limitations, we introduce a novel graph neural network (GNN)-based probabilistic framework that learns a probability distribution over the entire space of causal graphs, unlike methods that output a single deterministic graph. Our framework leverages a GNN that encodes both node and edge attributes into a unified graph representation, enabling the model to learn complex causal structures directly from data. The GNN model is trained on a diverse set of synthetic datasets augmented with statistical and information-theoretic measures, such as mutual information and conditional entropy, capturing both local and global data properties. We frame causal discovery as a supervised learning problem, directly predicting the entire graph structure. Our approach demonstrates superior performance, outperforming both traditional and recent non-GNN-based methods, as well as a GNN-based approach, in terms of accuracy and scalability on synthetic and real-world datasets without further training. This probabilistic framework significantly improves causal structure learning, with broad implications for decision-making and scientific discovery across various fields.
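A toy sketch of the prediction side of such a framework, with hand-set weights standing in for the trained GNN head and a correlation feature standing in for the mutual-information and conditional-entropy inputs: each ordered node pair receives an edge probability, so the output is a distribution over graphs rather than one deterministic graph.

```python
import numpy as np

def pairwise_features(data):
    """Per-pair statistics; absolute correlation stands in for the
    information-theoretic measures the framework actually uses."""
    corr = np.corrcoef(data.T)
    n = corr.shape[0]
    return [(i, j, abs(corr[i, j])) for i in range(n) for j in range(n) if i != j]

def predict_edge_probs(features, weight=4.0, bias=-1.5):
    """Stand-in for the trained model: map pair features to edge
    probabilities. The constants here are illustrative only."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    return {(i, j): sigmoid(weight * f + bias) for i, j, f in features}

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))
x[:, 1] += 0.8 * x[:, 0]                         # planted 0 -> 1 dependence
probs = predict_edge_probs(pairwise_features(x))
print(max(probs, key=probs.get))                 # the (0, 1) pair ranks highest
```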
[621] Computational Advantages of Multi-Grade Deep Learning: Convergence Analysis and Performance Insights
Ronglong Fang, Yuesheng Xu
Main category: cs.LG
TL;DR: MGDL outperforms SGDL in image tasks like regression, denoising, and deblurring, with better robustness to learning rate choices and training stability.
Details
Motivation: To investigate the computational advantages of MGDL over SGDL, focusing on performance in image tasks and understanding its mathematical underpinnings.
Method: Analyzed gradient descent (GD) convergence for MGDL and SGDL, and studied eigenvalue distributions of Jacobian matrices from GD iterations.
Result: MGDL is more robust to learning rate choices and exhibits enhanced training stability compared to SGDL.
Conclusion: MGDL’s superior performance and stability make it a promising alternative to SGDL for image-related tasks.
Abstract: Multi-grade deep learning (MGDL) has been shown to significantly outperform the standard single-grade deep learning (SGDL) across various applications. This work aims to investigate the computational advantages of MGDL focusing on its performance in image regression, denoising, and deblurring tasks, and comparing it to SGDL. We establish convergence results for the gradient descent (GD) method applied to these models and provide mathematical insights into MGDL’s improved performance. In particular, we demonstrate that MGDL is more robust to the choice of learning rate under GD than SGDL. Furthermore, we analyze the eigenvalue distributions of the Jacobian matrices associated with the iterative schemes arising from the GD iterations, offering an explanation for MGDL’s enhanced training stability.
[622] Wafer Defect Root Cause Analysis with Partial Trajectory Regression
Kohei Miyaguchi, Masao Joko, Rebekah Sheraw, Tsuyoshi Idé
Main category: cs.LG
TL;DR: A novel framework, Partial Trajectory Regression (PTR), is introduced for wafer defect root cause analysis, addressing limitations of conventional methods by handling variable-length processing routes and using new representation learning techniques.
Details
Motivation: The challenge lies in identifying upstream processes causing wafer defects due to complex, variable process flows and inherent variability like rework operations and random waiting times.
Method: The PTR framework uses partial process trajectories and a new algorithm to compute process attribution scores, leveraging proc2vec and route2vec for representation learning.
Result: The framework’s effectiveness is demonstrated using real wafer history data from the NY CREATES fab in Albany.
Conclusion: PTR offers a robust solution for root cause analysis in wafer defects, overcoming traditional method limitations.
Abstract: Identifying upstream processes responsible for wafer defects is challenging due to the combinatorial nature of process flows and the inherent variability in processing routes, which arises from factors such as rework operations and random process waiting times. This paper presents a novel framework for wafer defect root cause analysis, called Partial Trajectory Regression (PTR). The proposed framework is carefully designed to address the limitations of conventional vector-based regression models, particularly in handling variable-length processing routes that span a large number of heterogeneous physical processes. To compute the attribution score of each process given a detected high defect density on a specific wafer, we propose a new algorithm that compares two counterfactual outcomes derived from partial process trajectories. This is enabled by new representation learning methods, proc2vec and route2vec. We demonstrate the effectiveness of the proposed framework using real wafer history data from the NY CREATES fab in Albany.
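The counterfactual comparison at the heart of the attribution score can be sketched in a few lines; the toy defect model and process names below are illustrative, and the real framework scores trajectories through proc2vec/route2vec representations rather than a lookup table.

```python
RISK = {"litho_A": 0.02, "etch_B": 0.15, "rework": 0.30}

def defect_model(partial_route):
    """Toy stand-in for a model predicting defect density from a partial
    process trajectory; each process id adds a fixed risk."""
    return sum(RISK.get(p, 0.0) for p in partial_route)

def attribution_score(route, i, model):
    """Compare two counterfactual partial trajectories: the prefix that
    includes step i versus the prefix without it. The gap is step i's
    attribution for the observed defect outcome."""
    return model(route[: i + 1]) - model(route[:i])

route = ["litho_A", "etch_B", "rework", "litho_A"]
print([attribution_score(route, i, defect_model) for i in range(len(route))])
# -> [0.02, 0.15, 0.3, 0.02]; 'rework' receives the largest attribution
```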
[623] MH-GIN: Multi-scale Heterogeneous Graph-based Imputation Network for AIS Data (Extended Version)
Hengyu Liu, Tianyi Li, Yuqiang He, Kristian Torp, Yushuai Li, Christian S. Jensen
Main category: cs.LG
TL;DR: MH-GIN, a Multi-scale Heterogeneous Graph-based Imputation Network, improves maritime tracking data imputation by capturing multi-scale dependencies, reducing errors by 57% over state-of-the-art methods.
Details
Motivation: Missing values in maritime tracking data hinder safety and monitoring applications, and existing methods fail to address multi-scale dependencies among attributes.
Method: MH-GIN extracts multi-scale temporal features, constructs a heterogeneous graph to model dependencies, and uses graph propagation for imputation.
Result: MH-GIN reduces imputation errors by 57% on real-world datasets while remaining computationally efficient.
Conclusion: MH-GIN effectively addresses multi-scale dependencies in maritime data, offering a significant improvement in imputation accuracy.
Abstract: Location-tracking data from the Automatic Identification System, much of which is publicly available, plays a key role in a range of maritime safety and monitoring applications. However, the data suffers from missing values that hamper downstream applications. Imputing the missing values is challenging because the values of different heterogeneous attributes are updated at diverse rates, resulting in the occurrence of multi-scale dependencies among attributes. Existing imputation methods that assume similar update rates across attributes are unable to capture and exploit such dependencies, limiting their imputation accuracy. We propose MH-GIN, a Multi-scale Heterogeneous Graph-based Imputation Network that aims to improve imputation accuracy by capturing multi-scale dependencies. Specifically, MH-GIN first extracts multi-scale temporal features for each attribute while preserving their intrinsic heterogeneous characteristics. Then, it constructs a multi-scale heterogeneous graph to explicitly model dependencies between heterogeneous attributes to enable more accurate imputation of missing values through graph propagation. Experimental results on two real-world datasets show that MH-GIN achieves an average 57% reduction in imputation errors compared to state-of-the-art methods, while maintaining computational efficiency. The source code and implementation details of MH-GIN are publicly available at https://github.com/hyLiu1994/MH-GIN.
[624] Sequence-Aware Inline Measurement Attribution for Good-Bad Wafer Diagnosis
Kohei Miyaguchi, Masao Joko, Rebekah Sheraw, Tsuyoshi Idé
Main category: cs.LG
TL;DR: The paper introduces Trajectory Shapley Attribution (TSA) to identify problematic upstream processes in semiconductor manufacturing by extending Shapley values, addressing limitations like sequence ignorance and arbitrary reference points.
Details
Motivation: The complexity of semiconductor manufacturing makes root cause analysis for wafer defects challenging, necessitating a method to pinpoint relevant upstream processes.
Method: Proposes TSA, an extension of Shapley values, to account for process sequences and avoid arbitrary references, applied to wafer defect diagnosis in a fab.
Result: TSA successfully identifies measurement items linked to abnormal defect occurrences in experimental processes.
Conclusion: TSA offers a robust framework for root cause analysis in semiconductor manufacturing, overcoming key limitations of traditional Shapley values.
Abstract: How can we identify problematic upstream processes when a certain type of wafer defect starts appearing at a quality checkpoint? Given the complexity of modern semiconductor manufacturing, which involves thousands of process steps, cross-process root cause analysis for wafer defects has been considered highly challenging. This paper proposes a novel framework called Trajectory Shapley Attribution (TSA), an extension of Shapley values (SV), a widely used attribution algorithm in explainable artificial intelligence research. TSA overcomes key limitations of standard SV, including its disregard for the sequential nature of manufacturing processes and its reliance on an arbitrarily chosen reference point. We applied TSA to a good-bad wafer diagnosis task in experimental front-end-of-line processes at the NY CREATES Albany NanoTech fab, aiming to identify measurement items (serving as proxies for process parameters) most relevant to abnormal defect occurrence.
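For orientation, here is the generic Monte-Carlo Shapley computation that TSA builds on; TSA's contributions (respecting the true process sequence and removing the arbitrary reference point) are not captured by this baseline sketch, and the value function is a toy stand-in.

```python
import numpy as np

def shapley_values(items, value_fn, n_perm=200, seed=0):
    """Monte-Carlo Shapley values: average each item's marginal
    contribution over random orderings of the item set."""
    rng = np.random.default_rng(seed)
    phi = {it: 0.0 for it in items}
    for _ in range(n_perm):
        order = list(items)
        rng.shuffle(order)
        subset, prev = [], value_fn([])
        for it in order:
            subset.append(it)
            cur = value_fn(subset)
            phi[it] += (cur - prev) / n_perm   # marginal contribution of `it`
            prev = cur
    return phi

# Toy value function: predicted defect score from a set of measurement items.
value_fn = lambda s: 0.4 * ("cd_width" in s) + 0.1 * ("overlay" in s)
print(shapley_values(["cd_width", "overlay", "thickness"], value_fn))
# -> roughly {cd_width: 0.4, overlay: 0.1, thickness: 0.0}
```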
[625] Clustering by Attention: Leveraging Prior Fitted Transformers for Data Partitioning
Ahmed Shokry, Ayman Khalafallah
Main category: cs.LG
TL;DR: A novel meta-learning-based clustering approach eliminates parameter tuning, outperforms state-of-the-art methods, and leverages pre-clustered samples for accurate, scalable clustering.
Details
Motivation: Existing clustering algorithms face challenges like parameter tuning, high complexity, poor interpretability, and suboptimal accuracy, especially for large datasets.
Method: Uses a Prior-Data Fitted Transformer Network (PFN) to compute attention between pre-clustered and unclustered samples, inferring cluster assignments in a single forward pass.
Result: Outperforms state-of-the-art techniques, generalizes well with few pre-clustered samples, and works even without them on well-separated data.
Conclusion: The approach is effective, scalable, and a promising alternative to traditional clustering methods.
Abstract: Clustering is a core task in machine learning with wide-ranging applications in data mining and pattern recognition. However, its unsupervised nature makes it inherently challenging. Many existing clustering algorithms suffer from critical limitations: they often require careful parameter tuning, exhibit high computational complexity, lack interpretability, or yield suboptimal accuracy, especially when applied to large-scale datasets. In this paper, we introduce a novel clustering approach based on meta-learning. Our approach eliminates the need for parameter optimization while achieving accuracy that outperforms state-of-the-art clustering techniques. The proposed technique leverages a few pre-clustered samples to guide the clustering process for the entire dataset in a single forward pass. Specifically, we employ a pre-trained Prior-Data Fitted Transformer Network (PFN) to perform clustering. The algorithm computes attention between the pre-clustered samples and the unclustered samples, allowing it to infer cluster assignments for the entire dataset based on the learned relation. We theoretically and empirically demonstrate that, given just a few pre-clustered examples, the model can generalize to accurately cluster the rest of the dataset. Experiments on challenging benchmark datasets show that our approach can successfully cluster well-separated data without any pre-clustered samples, and significantly improves performance when a few clustered samples are provided. We show that our approach is superior to the state-of-the-art techniques. These results highlight the effectiveness and scalability of our approach, positioning it as a promising alternative to existing clustering techniques.
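The inference step can be sketched with plain attention: each unclustered sample attends to the pre-clustered samples and inherits a soft mixture of their one-hot cluster labels in one forward pass. The distance-based attention score below is an assumption for illustration; the actual method runs a pre-trained PFN rather than this hand-rolled attention.

```python
import numpy as np

def attention_cluster_assign(x_lab, y_lab, x_query, n_clusters, temp=1.0):
    """Single forward pass: queries attend to pre-clustered samples and
    mix their one-hot cluster labels into soft assignments."""
    d2 = ((x_query[:, None, :] - x_lab[None, :, :]) ** 2).sum(-1)
    logits = -d2 / temp
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)        # softmax over keys
    soft = attn @ np.eye(n_clusters)[y_lab]        # mix one-hot labels
    return soft.argmax(axis=1)

rng = np.random.default_rng(1)
a = rng.normal((0, 0), 1.0, size=(10, 2))          # cluster near the origin
b = rng.normal((5, 5), 1.0, size=(10, 2))          # cluster near (5, 5)
x_lab = np.vstack([a[:3], b[:3]])                  # six pre-clustered samples
y_lab = np.array([0, 0, 0, 1, 1, 1])
x_query = np.vstack([a[3:], b[3:]])
print(attention_cluster_assign(x_lab, y_lab, x_query, n_clusters=2))
# -> seven 0s followed by seven 1s (up to noise)
```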
[626] WBHT: A Generative Attention Architecture for Detecting Black Hole Anomalies in Backbone Networks
Kiymet Kaya, Elif Ak, Sule Gunduz Oguducu
Main category: cs.LG
TL;DR: WBHT framework detects black hole anomalies in networks using generative modeling, sequential learning, and attention mechanisms, outperforming existing models in F1 score.
Details
Motivation: Black hole anomalies cause packet loss without notifications, disrupting connectivity and causing financial losses, necessitating improved detection methods.
Method: Combines Wasserstein GAN with attention mechanisms, LSTM for long-term dependencies, and convolutional layers for local patterns. Uses latent space encoding for anomaly distinction.
Result: WBHT achieves significant F1 score improvements (1.65% to 58.76%) and detects previously undetected anomalies.
Conclusion: WBHT is efficient and valuable for proactive network monitoring, especially in mission-critical networks.
Abstract: We propose the Wasserstein Black Hole Transformer (WBHT) framework for detecting black hole (BH) anomalies in communication networks. These anomalies cause packet loss without failure notifications, disrupting connectivity and leading to financial losses. WBHT combines generative modeling, sequential learning, and attention mechanisms to improve BH anomaly detection. It integrates a Wasserstein generative adversarial network with attention mechanisms for stable training and accurate anomaly identification. The model uses long short-term memory layers to capture long-term dependencies and convolutional layers for local temporal patterns. A latent space encoding mechanism helps distinguish abnormal network behavior. Tested on real-world network data, WBHT outperforms existing models, achieving significant improvements in F1 score (ranging from 1.65% to 58.76%). Its efficiency and ability to detect previously undetected anomalies make it a valuable tool for proactive network monitoring and security, especially in mission-critical networks.
[627] Set-based Implicit Likelihood Inference of Galaxy Cluster Mass
Bonny Y. Wang, Leander Thiele
Main category: cs.LG
TL;DR: A set-based machine learning framework combining Deep Sets with conditional normalizing flows infers posterior distributions of galaxy cluster masses from projected galaxy dynamics, reducing scatter and yielding well-calibrated uncertainties.
Details
Motivation: Traditional dynamical mass estimates of galaxy clusters suffer from substantial scatter, motivating a method that exploits the full positional and velocity information of member galaxies.
Method: Combines Deep Sets and conditional normalizing flows to predict residual corrections to the $M$-$\sigma$ relation, trained on the Uchuu-UniverseMachine simulation.
Result: Significantly reduced scatter and well-calibrated uncertainties across the full mass range compared to traditional dynamical estimates.
Conclusion: Set-based implicit likelihood inference yields accurate, interpretable galaxy cluster mass posteriors.
Abstract: We present a set-based machine learning framework that infers posterior distributions of galaxy cluster masses from projected galaxy dynamics. Our model combines Deep Sets and conditional normalizing flows to incorporate both positional and velocity information of member galaxies to predict residual corrections to the $M$-$\sigma$ relation for improved interpretability. Trained on the Uchuu-UniverseMachine simulation, our approach significantly reduces scatter and provides well-calibrated uncertainties across the full mass range compared to traditional dynamical estimates.
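The permutation-invariant encoder is the easy part to sketch. Below, random weights stand in for the learned networks, and the conditional normalizing flow that turns the set summary into a mass posterior is omitted.

```python
import numpy as np

def deep_set_encode(members, phi_w, rho_w):
    """Minimal Deep Sets sketch: a shared per-galaxy map phi, a
    permutation-invariant sum pool, then a set-level map rho."""
    h = np.tanh(members @ phi_w)      # per-member features (n_gal, d_hidden)
    pooled = h.sum(axis=0)            # order-independent aggregation
    return np.tanh(pooled @ rho_w)    # summary used to condition the flow

rng = np.random.default_rng(0)
# Each member galaxy: projected (x, y) position and line-of-sight velocity.
galaxies = rng.normal(size=(50, 3))
phi_w, rho_w = rng.normal(size=(3, 16)), rng.normal(size=(16, 8))
summary = deep_set_encode(galaxies, phi_w, rho_w)
print(summary.shape)   # (8,) -- invariant to the ordering of member galaxies
```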
[628] Communication-Efficient Distributed Training for Collaborative Flat Optima Recovery in Deep Learning
Tolga Dimlioglu, Anna Choromanska
Main category: cs.LG
TL;DR: The paper introduces the Distributed Pull-Push Force (DPPF) algorithm to improve communication efficiency and model performance in distributed DNN training by leveraging flat-minima hypothesis and a new sharpness measure, Inverse Mean Valley.
Details
Motivation: To enhance the trade-off between communication efficiency and generalization performance in distributed DNN training by encouraging collaborative search for wide minima.
Method: Proposes DPPF, incorporating a lightweight regularizer based on Inverse Mean Valley to balance consensus (pull) and flat-minima seeking (push) forces.
Result: DPPF outperforms communication-efficient methods, reduces overhead, and achieves better generalization, with empirical validation of flatter minima.
Conclusion: DPPF effectively balances communication and performance, supported by theoretical guarantees on valley width, generalization, and convergence.
Abstract: We study centralized distributed data parallel training of deep neural networks (DNNs), aiming to improve the trade-off between communication efficiency and model performance of the local gradient methods. To this end, we revisit the flat-minima hypothesis, which suggests that models with better generalization tend to lie in flatter regions of the loss landscape. We introduce a simple, yet effective, sharpness measure, Inverse Mean Valley, and demonstrate its strong correlation with the generalization gap of DNNs. We incorporate an efficient relaxation of this measure into the distributed training objective as a lightweight regularizer that encourages workers to collaboratively seek wide minima. The regularizer exerts a pushing force that counteracts the consensus step pulling the workers together, giving rise to the Distributed Pull-Push Force (DPPF) algorithm. Empirically, we show that DPPF outperforms other communication-efficient approaches and achieves better generalization performance than local gradient methods and synchronous gradient averaging, while significantly reducing communication overhead. In addition, our loss landscape visualizations confirm the ability of DPPF to locate flatter minima. On the theoretical side, we show that DPPF guides workers to span flat valleys, with the final valley width governed by the interplay between push and pull strengths, and that its pull-push dynamics is self-stabilizing. We further provide generalization guarantees linked to the valley width and prove convergence in the non-convex setting.
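The pull-push dynamics can be caricatured in a few lines. The exact form of the flatness-seeking push derived from the Inverse Mean Valley regularizer is not reproduced here; the fixed-magnitude outward push below is only an assumption, chosen so that, as in the paper's analysis, the equilibrium spread of the workers is set by the interplay of push and pull strengths.

```python
import numpy as np

def dppf_step(workers, grads, lr=0.1, pull=0.5, push=0.05):
    """One illustrative pull-push update: a local gradient step, a pull
    toward the worker mean (consensus), and a fixed-magnitude outward push
    so the ensemble keeps a nonzero spread across the valley."""
    mean = workers.mean(axis=0)
    new = []
    for w, g in zip(workers, grads):
        d = w - mean
        norm = np.linalg.norm(d) + 1e-12
        new.append(w - lr * g - pull * d + push * d / norm)
    return np.stack(new)

# Two workers on a toy quadratic loss 0.5 * ||w||^2 (gradient is w itself).
workers = np.array([[1.0, 0.0], [-1.0, 0.5]])
for _ in range(100):
    workers = dppf_step(workers, grads=workers)
print(workers)   # workers settle near the minimum but keep spread ~ push/pull
```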
[629] ResCap-DBP: A Lightweight Residual-Capsule Network for Accurate DNA-Binding Protein Prediction Using Global ProteinBERT Embeddings
Samiul Based Shuvo, Tasnia Binte Mamun, U Rajendra Acharya
Main category: cs.LG
TL;DR: A novel deep learning framework, ResCap-DBP, combines residual learning and Capsule Networks to predict DNA-binding proteins from raw sequences, outperforming state-of-the-art methods.
Details
Motivation: Experimental DBP identification is costly and time-consuming, necessitating efficient computational prediction techniques.
Method: ResCap-DBP uses a residual learning-based encoder with 1D-CapsNet, incorporating dilated convolutions and dynamic routing for hierarchical feature extraction.
Result: Achieved high AUC scores (e.g., 98.0% on PDB14189) and balanced sensitivity/specificity across datasets.
Conclusion: The model demonstrates efficacy and generalizability for scalable DBP prediction in diverse genomic contexts.
Abstract: DNA-binding proteins (DBPs) are integral to gene regulation and cellular processes, making their accurate identification essential for understanding biological functions and disease mechanisms. Experimental methods for DBP identification are time-consuming and costly, driving the need for efficient computational prediction techniques. In this study, we propose a novel deep learning framework, ResCap-DBP, that combines a residual learning-based encoder with a one-dimensional Capsule Network (1D-CapsNet) to predict DBPs directly from raw protein sequences. Our architecture incorporates dilated convolutions within residual blocks to mitigate vanishing gradient issues and extract rich sequence features, while capsule layers with dynamic routing capture hierarchical and spatial relationships within the learned feature space. We conducted comprehensive ablation studies comparing global and local embeddings from ProteinBERT and conventional one-hot encoding. Results show that ProteinBERT embeddings substantially outperform other representations on large datasets. Although one-hot encoding showed marginal advantages on smaller datasets, such as PDB186, it struggled to scale effectively. Extensive evaluations on four pairs of publicly available benchmark datasets demonstrate that our model consistently outperforms current state-of-the-art methods. It achieved AUC scores of 98.0% and 89.5% on PDB14189 and PDB1075, respectively. On independent test sets PDB2272 and PDB186, the model attained top AUCs of 83.2% and 83.3%, while maintaining competitive performance on larger datasets such as PDB20000. Notably, the model maintains well-balanced sensitivity and specificity across datasets. These results demonstrate the efficacy and generalizability of integrating global protein representations with advanced deep learning architectures for reliable and scalable DBP prediction in diverse genomic contexts.
[630] Customize Multi-modal RAI Guardrails with Precedent-based predictions
Cheng-Fu Yang, Thanh Tran, Christos Christodoulopoulos, Weitong Ruan, Rahul Gupta, Kai-Wei Chang
Main category: cs.LG
TL;DR: Proposes a scalable and adaptable multi-modal guardrail using precedents for flexible content filtering, outperforming existing methods.
Details
Motivation: Address challenges of deploying customizable content guardrails with minimal retraining and limited examples.
Method: Leverages precedents (reasoning processes of similar data) and introduces a critique-revise mechanism for high-quality precedents and robust prediction strategies.
Result: Outperforms previous methods in few-shot and full-dataset scenarios, with better generalization to new policies.
Conclusion: Precedent-based approach enhances flexibility and adaptability of content guardrails, offering superior performance.
Abstract: A multi-modal guardrail must effectively filter image content based on user-defined policies, identifying material that may be hateful, reinforce harmful stereotypes, contain explicit material, or spread misinformation. Deploying such guardrails in real-world applications, however, poses significant challenges. Users often require varied and highly customizable policies and typically cannot provide abundant examples for each custom policy. Consequently, an ideal guardrail should be scalable to the multiple policies and adaptable to evolving user standards with minimal retraining. Existing fine-tuning methods typically condition predictions on pre-defined policies, restricting their generalizability to new policies or necessitating extensive retraining to adapt. Conversely, training-free methods struggle with limited context lengths, making it difficult to incorporate all the policies comprehensively. To overcome these limitations, we propose to condition the model’s judgment on “precedents”, which are the reasoning processes of prior data points similar to the given input. By leveraging precedents instead of fixed policies, our approach greatly enhances the flexibility and adaptability of the guardrail. In this paper, we introduce a critique-revise mechanism for collecting high-quality precedents and two strategies that utilize precedents for robust prediction. Experimental results demonstrate that our approach outperforms previous methods across both few-shot and full-dataset scenarios and exhibits superior generalization to novel policies.
[631] FAST: Similarity-based Knowledge Transfer for Efficient Policy Learning
Alessandro Capurso, Elia Piccoli, Davide Bacciu
Main category: cs.LG
TL;DR: FAST improves transfer learning by using visual and textual data to estimate task similarity, reducing training steps while maintaining performance.
Details
Motivation: Address challenges in transfer learning like negative transfer and inefficiency in selecting source policies, especially in dynamic domains like game development.
Method: FAST uses visual frames and textual descriptions to create latent task representations, estimating similarity between environments to guide policy transfer.
Result: FAST achieves competitive performance with fewer training steps compared to learning-from-scratch methods in racing track experiments.
Conclusion: Embedding-driven task similarity estimation is effective for efficient and adaptive transfer learning.
Abstract: Transfer Learning (TL) offers the potential to accelerate learning by transferring knowledge across tasks. However, it faces critical challenges such as negative transfer, domain adaptation, and inefficiency in selecting solid source policies. These issues often represent critical problems in evolving domains, e.g. game development, where scenarios transform and agents must adapt. The continuous release of new agents is costly and inefficient. In this work we tackle the key issues in TL to improve knowledge transfer and agent performance across tasks while reducing computational costs. The proposed methodology, called FAST - Framework for Adaptive Similarity-based Transfer, leverages visual frames and textual descriptions to create a latent representation of task dynamics, which is exploited to estimate similarity between environments. The similarity scores guide our method in choosing candidate policies from which to transfer abilities, simplifying the learning of novel tasks. Experimental results, over multiple racing tracks, demonstrate that FAST achieves competitive final performance compared to learning-from-scratch methods while requiring significantly fewer training steps. These findings highlight the potential of embedding-driven task similarity estimation.
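Source-policy selection then reduces to a nearest-task lookup in embedding space, sketched below; the embeddings here are random stand-ins for FAST's latent representations of task dynamics built from frames and text.

```python
import numpy as np

def rank_source_tasks(task_embeddings, new_task):
    """Rank known tasks (and hence their policies) by cosine similarity
    between their embeddings and the new task's embedding."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return sorted(task_embeddings,
                  key=lambda t: cosine(task_embeddings[t], new_task),
                  reverse=True)

# Toy embeddings; in FAST these encode each environment's dynamics.
rng = np.random.default_rng(0)
tasks = {"track_a": rng.normal(size=32), "track_b": rng.normal(size=32)}
new = tasks["track_a"] + 0.1 * rng.normal(size=32)   # variant of track_a
print(rank_source_tasks(tasks, new))                 # 'track_a' ranks first
```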
[632] Kimi K2: Open Agentic Intelligence
Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Hao Hu, Xiaoru Hao, Tianhong He, Weiran He, Wenyang He, Chao Hong, Yangyang Hu, Zhenxing Hu, Weixiao Huang, Zhiqi Huang, Zihao Huang, Tao Jiang, Zhejun Jiang, Xinyi Jin, Yongsheng Kang, Guokun Lai, Cheng Li, Fang Li, Haoyang Li, Ming Li, Wentao Li, Yanhao Li, Yiwei Li, Zhaowei Li, Zheming Li, Hongzhan Lin, Xiaohan Lin, Zongyu Lin, Chengyin Liu, Chenyu Liu, Hongzhang Liu, Jingyuan Liu, Junqi Liu, Liang Liu, Shaowei Liu, T. Y. Liu, Tianwei Liu, Weizhou Liu, Yangyang Liu, Yibo Liu, Yiping Liu, Yue Liu, Zhengying Liu, Enzhe Lu, Lijun Lu, Shengling Ma, Xinyu Ma, Yingwei Ma, Shaoguang Mao, Jie Mei, Xin Men, Yibo Miao, Siyuan Pan, Yebo Peng, Ruoyu Qin, Bowen Qu, Zeyu Shang, Lidong Shi, Shengyuan Shi, Feifan Song, Jianlin Su, Zhengyuan Su, Xinjie Sun, Flood Sung, Heyi Tang, Jiawen Tao, Qifeng Teng, Chensi Wang, Dinglu Wang, Feng Wang, Haiming Wang, Jianzhou Wang, Jiaxing Wang, Jinhong Wang, Shengjie Wang, Shuyi Wang, Yao Wang, Yejie Wang, Yiqin Wang, Yuxin Wang, Yuzhi Wang, Zhaoji Wang, Zhengtao Wang, Zhexu Wang, Chu Wei, Qianqian Wei, Wenhao Wu, Xingzhe Wu, Yuxin Wu, Chenjun Xiao, Xiaotong Xie, Weimin Xiong, Boyu Xu, Jing Xu, Jinjing Xu, L. H. Xu, Lin Xu, Suting Xu, Weixin Xu, Xinran Xu, Yangchuan Xu, Ziyao Xu, Junjie Yan, Yuzi Yan, Xiaofei Yang, Ying Yang, Zhen Yang, Zhilin Yang, Zonghan Yang, Haotian Yao, Xingcheng Yao, Wenjie Ye, Zhuorui Ye, Bohong Yin, Longhui Yu, Enming Yuan, Hongbang Yuan, Mengjie Yuan, Haobing Zhan, Dehao Zhang, Hao Zhang, Wanlu Zhang, Xiaobin Zhang, Yangkun Zhang, Yizhi Zhang, Yongting Zhang, Yu Zhang, Yutao Zhang, Yutong Zhang, Zheng Zhang, Haotian Zhao, Yikai Zhao, Huabin Zheng, Shaojie Zheng, Jianren Zhou, Xinyu Zhou, Zaida Zhou, Zhen Zhu, Weiyu Zhuang, Xinxing Zu
Main category: cs.LG
TL;DR: Kimi K2 is a Mixture-of-Experts (MoE) model with 32B activated and 1T total parameters, trained using the MuonClip optimizer for stability. It excels in agentic tasks and benchmarks, outperforming most open and closed-source models.
Details
Motivation: To develop a highly capable open-source large language model, particularly for software engineering and agentic tasks, addressing training instability and improving performance.
Method: Uses MuonClip optimizer with QK-clip for stable training, pre-trains on 15.5T tokens, and undergoes multi-stage post-training with agentic data synthesis and joint RL.
Result: Achieves SOTA in non-thinking benchmarks (e.g., Tau2-Bench: 66.1, ACEBench: 76.5) and excels in coding, math, and reasoning (e.g., LiveCodeBench: 53.7, GPQA-Diamond: 75.1).
Conclusion: Kimi K2 is a leading open-source model for agentic and software tasks, with released checkpoints to advance research.
Abstract: We introduce Kimi K2, a Mixture-of-Experts (MoE) large language model with 32 billion activated parameters and 1 trillion total parameters. We propose the MuonClip optimizer, which improves upon Muon with a novel QK-clip technique to address training instability while enjoying the advanced token efficiency of Muon. Based on MuonClip, K2 was pre-trained on 15.5 trillion tokens with zero loss spike. During post-training, K2 undergoes a multi-stage post-training process, highlighted by a large-scale agentic data synthesis pipeline and a joint reinforcement learning (RL) stage, where the model improves its capabilities through interactions with real and synthetic environments. Kimi K2 achieves state-of-the-art performance among open-source non-thinking models, with strengths in agentic capabilities. Notably, K2 obtains 66.1 on Tau2-Bench, 76.5 on ACEBench (En), 65.8 on SWE-Bench Verified, and 47.3 on SWE-Bench Multilingual – surpassing most open and closed-sourced baselines in non-thinking settings. It also exhibits strong capabilities in coding, mathematics, and reasoning tasks, with a score of 53.7 on LiveCodeBench v6, 49.5 on AIME 2025, 75.1 on GPQA-Diamond, and 27.1 on OJBench, all without extended thinking. These results position Kimi K2 as one of the most capable open-source large language models to date, particularly in software engineering and agentic tasks. We release our base and post-trained model checkpoints to facilitate future research and applications of agentic intelligence.
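A rough sketch of the QK-clip idea as described: when the largest pre-softmax attention logit exceeds a threshold, the query and key projection weights are rescaled so logits stay bounded (each side takes the square root of the correction, since a logit is a query-key product). The threshold value and the per-head bookkeeping below are assumptions, not the technical report's exact recipe.

```python
import torch

def qk_clip(w_q, w_k, max_logit, tau=100.0):
    """If the observed max attention logit exceeds tau, shrink the query
    and key projections so future logits are pulled back toward tau."""
    if max_logit > tau:
        gamma = (tau / max_logit) ** 0.5   # split the correction across q and k
        with torch.no_grad():
            w_q.mul_(gamma)
            w_k.mul_(gamma)

w_q, w_k = torch.randn(64, 64), torch.randn(64, 64)
qk_clip(w_q, w_k, max_logit=250.0)   # a logit spike above tau triggers rescaling
```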
[633] BioNeuralNet: A Graph Neural Network based Multi-Omics Network Data Analysis Tool
Vicente Ramos, Sundous Hussein, Mohamed Abdel-Hafiz, Arunangshu Sarkar, Weixuan Liu, Katerina J. Kechris, Russell P. Bowler, Leslie Lange, Farnoush Banaei-Kashani
Main category: cs.LG
TL;DR: BioNeuralNet is a Python framework using GNNs for multi-omics network analysis, offering modular tools for network construction, embedding generation, and downstream tasks.
Details
Motivation: High dimensionality and complexity of multi-omics data require specialized tools for effective network-based analysis.
Method: Leverages Graph Neural Networks (GNNs) to create low-dimensional embeddings from multi-omics networks, supporting diverse analytical stages.
Result: BioNeuralNet provides a flexible, open-source solution compatible with popular Python packages, enhancing usability and reproducibility.
Conclusion: BioNeuralNet addresses the need for a versatile, user-friendly framework for network-based multi-omics analysis in precision medicine.
Abstract: Multi-omics data offer unprecedented insights into complex biological systems, yet their high dimensionality, sparsity, and intricate interactions pose significant analytical challenges. Network-based approaches have advanced multi-omics research by effectively capturing biologically relevant relationships among molecular entities. While these methods are powerful for representing molecular interactions, there remains a need for tools specifically designed to effectively utilize these network representations across diverse downstream analyses. To fulfill this need, we introduce BioNeuralNet, a flexible and modular Python framework tailored for end-to-end network-based multi-omics data analysis. BioNeuralNet leverages Graph Neural Networks (GNNs) to learn biologically meaningful low-dimensional representations from multi-omics networks, converting these complex molecular networks into versatile embeddings. BioNeuralNet supports all major stages of multi-omics network analysis, including several network construction techniques, generation of low-dimensional representations, and a broad range of downstream analytical tasks. Its extensive utilities, including diverse GNN architectures, and compatibility with established Python packages (e.g., scikit-learn, PyTorch, NetworkX), enhance usability and facilitate quick adoption. BioNeuralNet is an open-source, user-friendly, and extensively documented framework designed to support flexible and reproducible multi-omics network analysis in precision medicine.
[634] Provable In-Context Learning of Nonlinear Regression with Transformers
Hongbo Li, Lingjie Duan, Yingbin Liang
Main category: cs.LG
TL;DR: The paper explores how transformers perform in-context learning (ICL) for complex nonlinear regression tasks, revealing the role of the Lipschitz constant in convergence dynamics.
Details
Motivation: To advance theoretical understanding of ICL beyond simple tasks by analyzing transformers' training dynamics in nonlinear settings.
Method: Analyzes stage-wise attention dynamics during training, focusing on how attention scores evolve for relevant and irrelevant features. Introduces proof techniques linking the Lipschitz constant to convergence.
Result: Identifies the Lipschitz constant as a key factor in convergence. Derives time bounds for near-zero prediction error in two regimes (low vs. high L). Shows transformers consistently attend to relevant features at convergence.
Conclusion: Transformers exhibit ICL capability for unseen nonlinear functions, with convergence dynamics governed by the Lipschitz constant.
Abstract: The transformer architecture, which processes sequences of input tokens to produce outputs for query tokens, has revolutionized numerous areas of machine learning. A defining feature of transformers is their ability to perform previously unseen tasks using task-specific prompts without updating parameters, a phenomenon known as in-context learning (ICL). Recent research has actively explored the training dynamics behind ICL, with much of the focus on relatively simple tasks such as linear regression and binary classification. To advance the theoretical understanding of ICL, this paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities in these settings. We analyze the stage-wise dynamics of attention during training: attention scores between a query token and its target features grow rapidly in the early phase, then gradually converge to one, while attention to irrelevant features decays more slowly and exhibits oscillatory behavior. Our analysis introduces new proof techniques that explicitly characterize how the nature of general non-degenerate L-Lipschitz task functions affects attention weights. Specifically, we identify the Lipschitz constant L of the nonlinear function class as a key factor governing the convergence dynamics of transformers in ICL. Leveraging these insights, for two distinct regimes depending on whether L is below or above a threshold, we derive different time bounds to guarantee near-zero prediction error. Notably, despite the convergence time depending on the underlying task functions, we prove that query tokens consistently attend to prompt tokens with highly relevant features at convergence, demonstrating the ICL capability of transformers for unseen functions.
[635] BOASF: A Unified Framework for Speeding up Automatic Machine Learning via Adaptive Successive Filtering
Guanghui Zhu, Xin Fang, Lei Wang, Wenzhong Chen, Rong Gu, Chunfeng Yuan, Yihua Huang
Main category: cs.LG
TL;DR: BOASF combines Bayesian Optimization and Adaptive Successive Filtering to automate model selection and hyperparameter optimization, improving speed and performance.
Details
Motivation: Non-expert practitioners struggle with model selection and hyperparameter tuning due to lack of expertise.
Method: BOASF uses Bayesian Optimization for promising configurations and ASF to discard poor performers, with Softmax for resource allocation.
Result: BOASF speeds up optimization, outperforms state-of-the-art methods, and achieves robust performance under time constraints.
Conclusion: BOASF is effective for automating ML tasks, offering better performance and efficiency.
Abstract: Machine learning has achieved great success in many application areas. However, for non-expert practitioners, it is always very challenging to address a machine learning task successfully and efficiently. Finding the optimal machine learning model or the hyperparameter combination set from a large number of possible alternatives usually requires considerable expert knowledge and experience. To tackle this problem, we propose a combined Bayesian Optimization and Adaptive Successive Filtering algorithm (BOASF) under a unified multi-armed bandit framework to automate the model selection or the hyperparameter optimization. Specifically, BOASF consists of multiple evaluation rounds in each of which we select promising configurations for each arm using Bayesian optimization. Then, ASF adaptively discards poorly performing arms early using a Gaussian UCB-based probabilistic model. Furthermore, a Softmax model is employed to adaptively allocate available resources for each promising arm that advances to the next round. The arm with a higher probability of advancing will be allocated more resources. Experimental results show that BOASF is effective for speeding up the model selection and hyperparameter optimization processes while achieving more robust and better prediction performance than existing state-of-the-art automatic machine learning methods. Moreover, BOASF achieves better anytime performance under various time budgets.
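One evaluation round is easy to sketch: compute a Gaussian-UCB score per arm, filter out the weak arms, and split the next round's budget by a softmax over the survivors. The median cutoff, temperature, and exploration constant below are illustrative choices, not the paper's settings.

```python
import numpy as np

def boasf_round(arms, scores, budget, kappa=1.0, temp=0.5):
    """One filtering round: Gaussian UCB per arm, discard arms below the
    median, then allocate budget by a softmax over surviving UCBs."""
    ucb = {}
    for arm in arms:
        s = np.asarray(scores[arm], dtype=float)
        ucb[arm] = s.mean() + kappa * s.std() / np.sqrt(len(s))
    cutoff = np.median(list(ucb.values()))
    survivors = [a for a in arms if ucb[a] >= cutoff]
    w = np.exp(np.array([ucb[a] / temp for a in survivors]))
    alloc = (budget * w / w.sum()).round().astype(int)
    return dict(zip(survivors, alloc))

scores = {"rf": [0.81, 0.83], "svm": [0.75, 0.74],
          "xgb": [0.85, 0.86], "knn": [0.62, 0.60]}
print(boasf_round(list(scores), scores, budget=100))
# -> e.g. {'rf': 48, 'xgb': 52}: weak arms dropped, budget skewed to the best
```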
[636] Dissecting Persona-Driven Reasoning in Language Models via Activation Patching
Ansh Poonia, Maeghal Jain
Main category: cs.LG
TL;DR: The study explores how assigning personas to LLMs affects their reasoning, revealing early MLP layers process both syntax and semantics, while middle MHA layers shape outputs. Specific attention heads focus on racial and color-based identities.
Details
Motivation: To understand how personas influence LLM reasoning and identify key model components encoding persona-specific information.
Method: Uses activation patching to analyze how early MLP and middle MHA layers process and transform persona tokens.
Result: Early MLP layers handle syntax and semantics; middle MHA layers use transformed representations. Specific attention heads focus on racial/color identities.
Conclusion: Personas impact LLM reasoning through distinct layer roles, with some attention heads highlighting biases in identity processing.
Abstract: Large language models (LLMs) exhibit remarkable versatility in adopting diverse personas. In this study, we examine how assigning a persona influences a model’s reasoning on an objective task. Using activation patching, we take a first step toward understanding how key components of the model encode persona-specific information. Our findings reveal that the early Multi-Layer Perceptron (MLP) layers attend not only to the syntactic structure of the input but also process its semantic content. These layers transform persona tokens into richer representations, which are then used by the middle Multi-Head Attention (MHA) layers to shape the model’s output. Additionally, we identify specific attention heads that disproportionately attend to racial and color-based identities.
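Activation patching itself is a small amount of code: cache an activation from a persona-bearing run, then re-run a neutral input while a forward hook swaps that activation in. A minimal PyTorch sketch with a toy module standing in for a transformer block:

```python
import torch
import torch.nn as nn

def run_with_patch(model, layer, cached_activation, x):
    """Run model on x while a forward hook replaces `layer`'s output with
    an activation cached from another prompt; the resulting output shift
    measures how much that component carries the patched-in information."""
    def patch_hook(module, inputs, output):
        return cached_activation          # returned value replaces the output
    handle = layer.register_forward_hook(patch_hook)
    try:
        with torch.no_grad():
            out = model(x)
    finally:
        handle.remove()                   # always restore the clean model
    return out

# Toy stand-in for an early MLP block inside a transformer.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
x_base, x_persona = torch.randn(1, 4), torch.randn(1, 4)

cache = {}
def save_hook(module, inputs, output):
    cache["act"] = output                 # cache only; returning None keeps output

h = model[0].register_forward_hook(save_hook)
with torch.no_grad():
    model(x_persona)                      # persona-bearing run fills the cache
h.remove()

print(run_with_patch(model, model[0], cache["act"], x_base))
```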
[637] WEEP: A Differentiable Nonconvex Sparse Regularizer via Weakly-Convex Envelope
Takanobu Furuhashi, Hidekata Hontani, Tatsuya Yokota
Main category: cs.LG
TL;DR: WEEP is a novel, differentiable sparse regularizer that resolves the conflict between strong sparsity and gradient-based optimization, outperforming traditional methods like L1-norm.
Details
Motivation: The dilemma of non-differentiable sparsity-inducing penalties conflicting with gradient-based optimizers motivates the need for a differentiable alternative.
Method: WEEP is derived from the weakly-convex envelope framework, ensuring strong sparsity, full differentiability, and L-smoothness.
Result: WEEP outperforms L1-norm and other non-convex regularizers in signal and image denoising tasks.
Conclusion: WEEP successfully bridges the gap between statistical performance and computational tractability in sparse regularization.
Abstract: Sparse regularization is fundamental in signal processing for efficient signal recovery and feature extraction. However, it faces a fundamental dilemma: the most powerful sparsity-inducing penalties are often non-differentiable, conflicting with gradient-based optimizers that dominate the field. We introduce WEEP (Weakly-convex Envelope of Piecewise Penalty), a novel, fully differentiable sparse regularizer derived from the weakly-convex envelope framework. WEEP provides strong, unbiased sparsity while maintaining full differentiability and L-smoothness, making it natively compatible with any gradient-based optimizer. This resolves the conflict between statistical performance and computational tractability. We demonstrate superior performance compared to the L1-norm and other established non-convex sparse regularizers on challenging signal and image denoising tasks.
[638] Your Attention Matters: to Improve Model Robustness to Noise and Spurious Correlations
Camilo Tamayo-Rousseau, Yunjia Zhao, Yiqun Zhang, Randall Balestriero
Main category: cs.LG
TL;DR: The paper evaluates the robustness of five self-attention variants (Softmax, Sigmoid, Linear, Doubly Stochastic, and Cosine) in Vision Transformers under data corruption, finding Doubly Stochastic attention the most robust.
Details
Motivation: To study the robustness of self-attention mechanisms in Transformers to noise and spurious correlations, which is underexplored.
Method: Tested five self-attention variants in Vision Transformers under data corruption scenarios using CIFAR-10, CIFAR-100, and Imagenette datasets.
Result: Doubly Stochastic attention was the most robust among the variants tested.
Conclusion: The findings guide self-attention selection for applications with imperfect data.
Abstract: Self-attention mechanisms are foundational to Transformer architectures, supporting their impressive success in a wide range of tasks. While there are many self-attention variants, their robustness to noise and spurious correlations has not been well studied. This study evaluates Softmax, Sigmoid, Linear, Doubly Stochastic, and Cosine attention within Vision Transformers under different data corruption scenarios. Through testing across the CIFAR-10, CIFAR-100, and Imagenette datasets, we show that Doubly Stochastic attention is the most robust. Our findings inform self-attention selection in contexts with imperfect data.
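Doubly stochastic attention is typically obtained with Sinkhorn normalization, sketched below; the paper's exact formulation may differ, but the balancing of attention mass across both rows and columns is the property at stake.

```python
import numpy as np

def doubly_stochastic_attention(scores, n_iter=20):
    """Sinkhorn-style normalization: alternately normalize rows and columns
    of the exponentiated score matrix so attention mass is balanced in both
    directions, rather than only per query as in softmax attention."""
    a = np.exp(scores - scores.max())       # positive matrix, stable exp
    for _ in range(n_iter):
        a /= a.sum(axis=1, keepdims=True)   # row normalization (softmax-like)
        a /= a.sum(axis=0, keepdims=True)   # column normalization
    return a

rng = np.random.default_rng(0)
attn = doubly_stochastic_attention(rng.normal(size=(4, 4)))
print(attn.sum(axis=0), attn.sum(axis=1))   # both near-uniform after iteration
```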
[639] LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning
Yining Huang, Bin Li, Keke Tang, Meilian Chen
Main category: cs.LG
TL;DR: LoRA-PAR is a dual-system LoRA framework that partitions data and parameters for System 1 (fast) and System 2 (slow) tasks, improving efficiency and performance with fewer parameters.
Details
Motivation: Existing PEFT methods focus on domain adaptation or layer-wise allocation but don't tailor data and parameters to different response demands. Inspired by 'Thinking, Fast and Slow,' the paper aims to optimize LLM performance by aligning parameter usage with task types.
Method: LoRA-PAR classifies tasks as System 1 or System 2, partitions parameters by importance scoring, and uses a two-stage fine-tuning strategy: SFT for System 1 tasks and RL for System 2 tasks.
Result: The framework reduces active parameter usage while matching or surpassing state-of-the-art PEFT baselines.
Conclusion: LoRA-PAR effectively balances efficiency and performance by aligning parameter usage with task demands, offering a scalable solution for LLM fine-tuning.
Abstract: Large-scale generative models like DeepSeek-R1 and OpenAI-O1 benefit substantially from chain-of-thought (CoT) reasoning, yet pushing their performance typically requires vast data, large model sizes, and full-parameter fine-tuning. While parameter-efficient fine-tuning (PEFT) helps reduce cost, most existing approaches primarily address domain adaptation or layer-wise allocation rather than explicitly tailoring data and parameters to different response demands. Inspired by “Thinking, Fast and Slow,” which characterizes two distinct modes of thought, System 1 (fast, intuitive, often automatic) and System 2 (slower, more deliberative and analytic), we draw an analogy that different “subregions” of an LLM’s parameters might similarly specialize for tasks that demand quick, intuitive responses versus those requiring multi-step logical reasoning. Therefore, we propose LoRA-PAR, a dual-system LoRA framework that partitions both data and parameters by System 1 or System 2 demands, using fewer yet more focused parameters for each task. Specifically, we classify task data via multi-model role-playing and voting, and partition parameters based on importance scoring, then adopt a two-stage fine-tuning strategy: supervised fine-tuning (SFT) on System 1 tasks to enhance knowledge and intuition, followed by reinforcement learning (RL) on System 2 tasks to reinforce deeper logical deliberation. Extensive experiments show that the two-stage fine-tuning strategy of SFT followed by RL lowers active parameter usage while matching or surpassing SOTA PEFT baselines.
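The parameter-partitioning step can be sketched as simple top-k masking over importance scores; the scoring function and the 30% fraction below are assumptions, and in LoRA-PAR the resulting masks gate which LoRA parameters each training stage updates.

```python
import numpy as np

def partition_lora_params(importance_sys1, importance_sys2, top_frac=0.3):
    """Keep only the top fraction of LoRA parameters for each system,
    producing binary masks so the SFT stage (System 1) and the RL stage
    (System 2) each train a focused parameter subregion."""
    def top_mask(scores, frac):
        k = max(1, int(frac * scores.size))
        thresh = np.partition(scores.ravel(), -k)[-k]   # k-th largest score
        return scores >= thresh
    return top_mask(importance_sys1, top_frac), top_mask(importance_sys2, top_frac)

rng = np.random.default_rng(0)
imp1, imp2 = rng.random((8, 8)), rng.random((8, 8))      # toy importance maps
m1, m2 = partition_lora_params(imp1, imp2)
print(m1.sum(), m2.sum(), (m1 & m2).sum())   # subregion sizes and their overlap
```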
[640] Diagonally-Weighted Generalized Method of Moments Estimation for Gaussian Mixture Modeling
Liu Zhang, Oscar Mickelin, Sheng Xu, Amit Singer
Main category: cs.LG
TL;DR: The paper introduces Diagonally-Weighted GMM (DGMM) to address computational and storage inefficiencies in MM and GMM, demonstrating its effectiveness for high-dimensional data.
Details
Motivation: The computational and storage complexities of MM and GMM grow exponentially with data dimensions, making them impractical for high-dimensional data or higher-order moments. DGMM aims to overcome these bottlenecks.
Method: Proposes DGMM, a method that balances statistical efficiency, computational complexity, and numerical stability. It avoids explicit computation of moment tensors.
Result: DGMM achieves smaller estimation errors and shorter runtime compared to MM and GMM in numerical studies.
Conclusion: DGMM is a computationally efficient and stable alternative to MM and GMM, particularly for high-dimensional data.
Abstract: Since Pearson [Philosophical Transactions of the Royal Society of London. A, 185 (1894), pp. 71-110] first applied the method of moments (MM) for modeling data as a mixture of one-dimensional Gaussians, moment-based estimation methods have proliferated. Among these methods, the generalized method of moments (GMM) improves the statistical efficiency of MM by weighting the moments appropriately. However, the computational complexity and storage complexity of MM and GMM grow exponentially with the dimension, making these methods impractical for high-dimensional data or when higher-order moments are required. Such computational bottlenecks are more severe in GMM since it additionally requires estimating a large weighting matrix. To overcome these bottlenecks, we propose the diagonally-weighted GMM (DGMM), which achieves a balance among statistical efficiency, computational complexity, and numerical stability. We apply DGMM to study the parameter estimation problem for weakly separated heteroscedastic low-rank Gaussian mixtures and design a computationally efficient and numerically stable algorithm that obtains the DGMM estimator without explicitly computing or storing the moment tensors. We implement the proposed algorithm and empirically validate the advantages of DGMM: in numerical studies, DGMM attains smaller estimation errors while requiring substantially shorter runtime than MM and GMM. The code and data will be available upon publication at https://github.com/liu-lzhang/dgmm.
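The criterion itself is a small modification of GMM, sketched below on a toy one-dimensional problem: the weighting matrix is restricted to its diagonal (here inverse moment variances), so no full weighting matrix is estimated or inverted. The grid search stands in for a proper optimizer, and none of this reflects the paper's tensor-free algorithm for mixtures.

```python
import numpy as np

def dgmm_objective(theta, sample_moments, moment_fn, diag_weights):
    """Diagonally-weighted moment-matching criterion:
    r(theta)^T diag(w) r(theta) with r the moment residual."""
    residual = moment_fn(theta) - sample_moments
    return float(residual @ (diag_weights * residual))

# Toy example: recover the mean and std of a 1D Gaussian from two moments.
data = np.random.default_rng(0).normal(2.0, 1.5, size=1000)
sample_m = np.array([data.mean(), (data ** 2).mean()])
moment_fn = lambda th: np.array([th[0], th[0] ** 2 + th[1] ** 2])  # E[x], E[x^2]
w = 1.0 / np.array([data.var(), (data ** 2).var()])   # inverse-variance diagonal
grid = [(m, s) for m in np.linspace(1, 3, 21) for s in np.linspace(0.5, 2.5, 21)]
best = min(grid, key=lambda th: dgmm_objective(np.array(th), sample_m,
                                               moment_fn, w))
print(best)   # approximately (2.0, 1.5)
```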
[641] Shapley-Value-Based Graph Sparsification for GNN Inference
Selahattin Akkas, Ariful Azad
Main category: cs.LG
TL;DR: Shapley value-based graph sparsification improves GNN efficiency and interpretability by preserving influential edges and removing misleading ones.
Details
Motivation: Existing GNN explainability methods often produce non-negative scores, limiting their effectiveness for graph sparsification. Shapley values offer a more robust and fair way to assign importance.
Method: Uses Shapley values to evaluate edge importance, enabling better pruning strategies compared to gradient or perturbation-based methods.
Result: Maintains predictive performance while significantly reducing graph complexity.
Conclusion: Shapley value-based sparsification enhances GNN efficiency and interpretability.
Abstract: Graph sparsification is a key technique for improving inference efficiency in Graph Neural Networks by removing edges with minimal impact on predictions. GNN explainability methods generate local importance scores, which can be aggregated into global scores for graph sparsification. However, many explainability methods produce only non-negative scores, limiting their applicability for sparsification. In contrast, Shapley value based methods assign both positive and negative contributions to node predictions, offering a theoretically robust and fair allocation of importance by evaluating many subsets of graphs. Unlike gradient-based or perturbation-based explainers, Shapley values enable better pruning strategies that preserve influential edges while removing misleading or adversarial connections. Our approach shows that Shapley value-based graph sparsification maintains predictive performance while significantly reducing graph complexity, enhancing both interpretability and efficiency in GNN inference.
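A hedged sketch of how edge-level Shapley values can be estimated by Monte Carlo permutation sampling; `predict_fn` is a hypothetical interface that reruns GNN inference on the graph restricted to an edge subset, and the paper's exact estimator may differ.

```python
import random

def mc_shapley_edge_scores(edges, predict_fn, n_samples=200):
    """Monte Carlo (permutation) Shapley estimate of each edge's
    contribution to a model score, e.g. probability of the true class.

    edges: list of hashable edge ids, e.g. (u, v) tuples.
    predict_fn(edge_subset) -> float: hypothetical inference interface.
    """
    scores = {e: 0.0 for e in edges}
    for _ in range(n_samples):
        perm = random.sample(edges, len(edges))   # random edge ordering
        included, prev = [], predict_fn([])
        for e in perm:
            included.append(e)
            cur = predict_fn(list(included))
            scores[e] += cur - prev               # marginal gain (can be < 0)
            prev = cur
    return {e: s / n_samples for e, s in scores.items()}

# Sparsify by keeping the highest-scoring edges and dropping edges whose
# Shapley value is negative (misleading or adversarial connections).
```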
[642] Conditional Diffusion Models for Global Precipitation Map Inpainting
Daiko Kishikawa, Yuka Muto, Shunji Kotsuki
Main category: cs.LG
TL;DR: A machine learning approach using conditional diffusion models and a 3D U-Net is proposed to complete incomplete satellite-based precipitation maps, outperforming conventional methods.
Details
Motivation: Incomplete satellite precipitation data (e.g., GSMaP) due to orbital gaps and poor interpolation methods hinder global monitoring.
Method: Formulated as a video inpainting task, the method uses a 3D U-Net with a 3D condition encoder, leveraging spatio-temporal data from infrared images, grids, and time inputs. Trained on ERA5 data with pseudo-GSMaP masks.
Result: Produces more consistent inpainted maps than traditional methods, validated on 2024 data.
Conclusion: Conditional diffusion models show promise for enhancing global precipitation monitoring.
Abstract: Incomplete satellite-based precipitation data presents a significant challenge for global monitoring. For example, the Global Satellite Mapping of Precipitation (GSMaP) from JAXA suffers from substantial missing regions due to the orbital characteristics of satellites that carry microwave sensors, and its current interpolation methods often result in spatial discontinuities. In this study, we formulate the completion of the precipitation map as a video inpainting task and propose a machine learning approach based on conditional diffusion models. Our method employs a 3D U-Net with a 3D condition encoder to reconstruct complete precipitation maps by leveraging spatio-temporal information from infrared images, latitude-longitude grids, and physical time inputs. Training was carried out on ERA5 hourly precipitation data from 2020 to 2023. We generated a pseudo-GSMaP dataset by randomly applying GSMaP masks to ERA5 maps. Performance was evaluated for the calendar year 2024, and our approach produces more spatio-temporally consistent inpainted precipitation maps compared to conventional methods. These results indicate the potential to improve global precipitation monitoring using conditional diffusion models.
[643] HIAL: A New Paradigm for Hypergraph Active Learning via Influence Maximization
Yanheng Hou, Xunkai Li, Zhenjun Li, Bing Zhou, Ronghua Li, Guoren Wang
Main category: cs.LG
TL;DR: HIAL introduces a native active learning framework for hypergraphs, outperforming existing methods by preserving high-order structural information and using a dual-perspective influence function.
Details
Motivation: Existing Graph Active Learning methods fail to preserve high-order structural information in hypergraphs, leading to suboptimal performance.
Method: HIAL reformulates Hypergraph Active Learning as an Influence Maximization task, using a dual-perspective influence function with a novel HOI-Aware propagation mechanism.
Result: HIAL significantly outperforms state-of-the-art baselines in performance, efficiency, generality, and robustness across seven datasets.
Conclusion: HIAL establishes an efficient and powerful paradigm for active learning on hypergraphs, leveraging high-order interactions effectively.
Abstract: In recent years, Hypergraph Neural Networks (HNNs) have demonstrated immense potential in handling complex systems with high-order interactions. However, acquiring large-scale, high-quality labeled data for these models is costly, making Active Learning (AL) a critical technique. Existing Graph Active Learning (GAL) methods, when applied to hypergraphs, often rely on techniques like “clique expansion,” which destroys the high-order structural information crucial to a hypergraph’s success, thereby leading to suboptimal performance. To address this challenge, we introduce HIAL (Hypergraph Active Learning), a native active learning framework designed specifically for hypergraphs. We innovatively reformulate the Hypergraph Active Learning (HAL) problem as an Influence Maximization task. The core of HIAL is a dual-perspective influence function that, based on our novel “High-Order Interaction-Aware (HOI-Aware)” propagation mechanism, synergistically evaluates a node’s feature-space coverage (via Magnitude of Influence, MoI) and its topological influence (via Expected Diffusion Value, EDV). We prove that this objective function is monotone and submodular, thus enabling the use of an efficient greedy algorithm with a formal (1-1/e) approximation guarantee. Extensive experiments on seven public datasets demonstrate that HIAL significantly outperforms state-of-the-art baselines in terms of performance, efficiency, generality, and robustness, establishing an efficient and powerful new paradigm for active learning on hypergraphs.
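The (1-1/e) guarantee comes from the classical greedy algorithm for monotone submodular maximization, which in HIAL's setting would look roughly like the sketch below; `influence` stands in for the dual-perspective objective and is an assumption of this illustration.

```python
def greedy_influence_selection(candidates, influence, budget):
    """Greedy maximization of a monotone submodular set function.

    candidates: set of nodes; influence(S) -> float is a stand-in for the
    dual-perspective objective. Monotonicity and submodularity give this
    loop its classical (1 - 1/e) approximation guarantee.
    """
    selected, current = set(), influence(set())
    for _ in range(budget):
        best, best_gain = None, float("-inf")
        for v in candidates - selected:
            gain = influence(selected | {v}) - current
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:
            break
        selected.add(best)     # label the node with the largest marginal gain
        current += best_gain
    return selected
```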
[644] Mixture of Length and Pruning Experts for Knowledge Graphs Reasoning
Enjun Du, Siyi Liu, Yongqi Zhang
Main category: cs.LG
TL;DR: MoKGR is a framework for KG reasoning that adaptively selects and weights path lengths and prunes paths based on query complexity, outperforming existing methods.
Details
Motivation: Existing GNNs use rigid, query-agnostic path-exploration strategies, limiting adaptability to diverse linguistic contexts and semantic nuances.
Method: MoKGR uses a mixture-of-experts framework with length and pruning experts to personalize path exploration.
Result: MoKGR achieves superior performance in transductive and inductive settings on diverse benchmarks.
Conclusion: Personalized path exploration enhances KG reasoning, as demonstrated by MoKGR’s effectiveness.
Abstract: Knowledge Graph (KG) reasoning, which aims to infer new facts from structured knowledge repositories, plays a vital role in Natural Language Processing (NLP) systems. Its effectiveness critically depends on constructing informative and contextually relevant reasoning paths. However, existing graph neural networks (GNNs) often adopt rigid, query-agnostic path-exploration strategies, limiting their ability to adapt to diverse linguistic contexts and semantic nuances. To address these limitations, we propose MoKGR, a mixture-of-experts framework that personalizes path exploration through two complementary components: (1) a mixture of length experts that adaptively selects and weights candidate path lengths according to query complexity, providing query-specific reasoning depth; and (2) a mixture of pruning experts that evaluates candidate paths from a complementary perspective, retaining the most informative paths for each query. Through comprehensive experiments on diverse benchmarks, MoKGR demonstrates superior performance in both transductive and inductive settings, validating the effectiveness of personalized path exploration in KG reasoning.
[645] DmC: Nearest Neighbor Guidance Diffusion Model for Offline Cross-domain Reinforcement Learning
Linh Le Pham Van, Minh Hoang Nguyen, Duc Kieu, Hung Le, Hung The Tran, Sunil Gupta
Main category: cs.LG
TL;DR: DmC is a novel framework for cross-domain offline RL with limited target data, using k-NN for domain proximity and diffusion models to enhance source-target alignment, outperforming existing methods.
Details
Motivation: Addressing challenges in cross-domain offline RL with limited target data, such as dataset imbalance and partial domain overlap, to improve sample efficiency.
Method: Proposes DmC, which uses k-NN for domain proximity estimation and a nearest-neighbor-guided diffusion model to generate better-aligned source samples.
Result: DmC significantly outperforms state-of-the-art methods in MuJoCo environments, achieving substantial performance gains.
Conclusion: DmC effectively mitigates overfitting and enhances policy learning in cross-domain offline RL with limited target data.
Abstract: Cross-domain offline reinforcement learning (RL) seeks to enhance sample efficiency in offline RL by utilizing additional offline source datasets. A key challenge is to identify and utilize source samples that are most relevant to the target domain. Existing approaches address this challenge by measuring domain gaps through domain classifiers, target transition dynamics modeling, or mutual information estimation using contrastive loss. However, these methods often require large target datasets, which is impractical in many real-world scenarios. In this work, we address cross-domain offline RL under a limited target data setting, identifying two primary challenges: (1) Dataset imbalance, which is caused by large source and small target datasets and leads to overfitting in neural network-based domain gap estimators, resulting in uninformative measurements; and (2) Partial domain overlap, where only a subset of the source data is closely aligned with the target domain. To overcome these issues, we propose DmC, a novel framework for cross-domain offline RL with limited target samples. Specifically, DmC utilizes $k$-nearest neighbor ($k$-NN) based estimation to measure domain proximity without neural network training, effectively mitigating overfitting. Then, by utilizing this domain proximity, we introduce a nearest-neighbor-guided diffusion model to generate additional source samples that are better aligned with the target domain, thus enhancing policy learning with more effective source samples. Through theoretical analysis and extensive experiments in diverse MuJoCo environments, we demonstrate that DmC significantly outperforms state-of-the-art cross-domain offline RL methods, achieving substantial performance gains.
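A minimal sketch of the training-free k-NN proximity idea, assuming source and target transitions are stacked into feature arrays; how the scores then guide the diffusion model is simplified away here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_domain_proximity(source_X, target_X, k=5):
    """Score each source transition by its distance to the target data.

    A smaller mean k-NN distance means the source sample lies closer to
    the target domain. No network training is involved, which is the
    point the paper makes about avoiding overfitting on small target sets.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(target_X)
    dists, _ = nn.kneighbors(source_X)     # shape: (n_source, k)
    return -dists.mean(axis=1)             # higher = closer to target
```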
[646] Attributed Graph Clustering with Multi-Scale Weight-Based Pairwise Coarsening and Contrastive Learning
Binxiong Li, Yuefei Wang, Binyu Zhao, Heyang Gao, Benhan Yang, Quanzhou Luo, Xue Li, Xu Xiang, Yujie Liu, Huijie Tang
Main category: cs.LG
TL;DR: MPCCL is a novel attributed graph clustering model addressing long-range dependency, feature collapse, and information loss through multi-scale coarsening and contrastive learning, achieving significant performance improvements.
Details
Motivation: Existing methods struggle with high-order graph features, feature diversity, and loss of fine-grained details during coarsening. MPCCL aims to bridge these gaps.
Method: MPCCL uses multi-scale coarsening to preserve structural details and one-to-many contrastive learning to enhance feature diversity, incorporating graph reconstruction loss and KL divergence for consistency.
Result: MPCCL improves clustering performance, with a 15.24% NMI increase on ACM and robust gains on Citeseer, Cora, and DBLP datasets.
Conclusion: MPCCL effectively addresses limitations of existing methods, demonstrating superior performance in attributed graph clustering.
Abstract: This study introduces the Multi-Scale Weight-Based Pairwise Coarsening and Contrastive Learning (MPCCL) model, a novel approach for attributed graph clustering that effectively bridges critical gaps in existing methods, including long-range dependency, feature collapse, and information loss. Traditional methods often struggle to capture high-order graph features due to their reliance on low-order attribute information, while contrastive learning techniques face limitations in feature diversity by overemphasizing local neighborhood structures. Similarly, conventional graph coarsening methods, though reducing graph scale, frequently lose fine-grained structural details. MPCCL addresses these challenges through an innovative multi-scale coarsening strategy, which progressively condenses the graph while prioritizing the merging of key edges based on global node similarity to preserve essential structural information. It further introduces a one-to-many contrastive learning paradigm, integrating node embeddings with augmented graph views and cluster centroids to enhance feature diversity, while mitigating feature masking issues caused by the accumulation of high-frequency node weights during multi-scale coarsening. By incorporating a graph reconstruction loss and KL divergence into its self-supervised learning framework, MPCCL ensures cross-scale consistency of node representations. Experimental evaluations reveal that MPCCL achieves a significant improvement in clustering performance, including a remarkable 15.24% increase in NMI on the ACM dataset and notable robust gains on smaller-scale datasets such as Citeseer, Cora and DBLP.
[647] Efficient Proxy Raytracer for Optical Systems using Implicit Neural Representations
Shiva Sinaei, Chuanjun Zheng, Kaan Akşit, Daisuke Iwai
Main category: cs.LG
TL;DR: Ray2Ray uses implicit neural representations to model optical systems efficiently, avoiding surface-by-surface computations, achieving high accuracy.
Details
Motivation: Traditional ray tracing is computationally intensive due to sequential surface-by-surface calculations.
Method: Ray2Ray learns mappings between input and output rays using neural representations, trained on nine optical systems.
Result: Achieves positional errors of ~1µm and angular deviations of ~0.01 degrees.
Conclusion: Neural representations can effectively replace traditional raytracing for optical systems.
Abstract: Ray tracing is a widely used technique for modeling optical systems, involving sequential surface-by-surface computations, which can be computationally intensive. We propose Ray2Ray, a novel method that leverages implicit neural representations to model optical systems with greater efficiency, eliminating the need for surface-by-surface computations via a single-pass, end-to-end model. Ray2Ray learns the mapping between rays emitted from a given source and their corresponding rays after passing through a given optical system in a physically accurate manner. We train Ray2Ray on nine off-the-shelf optical systems, achieving positional errors on the order of 1 µm and angular deviations on the order of 0.01 degrees in the estimated output rays. Our work highlights the potential of neural representations as a proxy for optical raytracers.
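A toy version of the idea, assuming each ray is parameterized by a 3D origin and a 3D direction: a single network maps input rays to output rays, replacing the surface-by-surface loop. Layer sizes and the SiLU activation are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class RayMLP(nn.Module):
    """Toy proxy raytracer: maps an input ray (origin, direction) to the
    output ray after the optical system in one forward pass."""
    def __init__(self, hidden=256, depth=4):
        super().__init__()
        layers, d = [], 6                    # 3D origin + 3D direction in
        for _ in range(depth):
            layers += [nn.Linear(d, hidden), nn.SiLU()]
            d = hidden
        layers.append(nn.Linear(d, 6))       # 3D origin + 3D direction out
        self.net = nn.Sequential(*layers)

    def forward(self, rays):                 # rays: (batch, 6)
        return self.net(rays)
```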
[648] Kernel Learning for Sample Constrained Black-Box Optimization
Rajalaxmi Rajagopalan, Yu-Lin Wei, Romit Roy Choudhury
Main category: cs.LG
TL;DR: KOBO introduces a method to optimize Gaussian Process kernels via a variational autoencoder, reducing sample budgets in black box optimization.
Details
Motivation: Reducing the high sample budget in black box optimization by learning the function's kernel structure.
Method: Uses a variational autoencoder to create a continuous kernel space and auxiliary optimization to find the best kernel.
Result: Outperforms state-of-the-art methods, achieving optimization with fewer samples in synthetic and real-world applications.
Conclusion: KOBO effectively reduces sample costs in BBO, demonstrated in hearing aid personalization and generative model convergence.
Abstract: Black box optimization (BBO) focuses on optimizing unknown functions in high-dimensional spaces. In many applications, sampling the unknown function is expensive, imposing a tight sample budget. Ongoing work is making progress on reducing the sample budget by learning the shape/structure of the function, known as kernel learning. We propose a new method to learn the kernel of a Gaussian Process. Our idea is to create a continuous kernel space in the latent space of a variational autoencoder, and run an auxiliary optimization to identify the best kernel. Results show that the proposed method, Kernel Optimized Blackbox Optimization (KOBO), outperforms the state of the art by estimating the optimum at considerably lower sample budgets. Results hold not only across synthetic benchmark functions but also in real applications. We show that a hearing aid may be personalized with fewer audio queries to the user, or a generative model could converge to desirable images from limited user ratings.
[649] Improving Group Fairness in Tensor Completion via Imbalance Mitigating Entity Augmentation
Dawon Ahn, Jun-Gi Jang, Evangelos E. Papalexakis
Main category: cs.LG
TL;DR: STAFF improves fairness in tensor decomposition by minimizing error gaps between groups while reducing overall completion error.
Details
Motivation: Address performance degradation and group fairness issues in tensor decomposition to prevent discrimination.
Method: Augments the tensor with entities to mitigate imbalance and bias, evaluated under various tensor models.
Result: STAFF achieves 36% lower MSE and 59% lower MADE than baselines, balancing error and fairness.
Conclusion: STAFF effectively enhances fairness and performance in tensor decomposition.
Abstract: Group fairness is important to consider in tensor decomposition to prevent discrimination based on social grounds such as gender or age. Although a few works have studied group fairness in tensor decomposition, they suffer from performance degradation. To address this, we propose STAFF (Sparse Tensor Augmentation For Fairness) to improve group fairness by minimizing the gap in completion errors of different groups while reducing the overall tensor completion error. Our main idea is to augment a tensor with augmented entities including sufficient observed entries to mitigate imbalance and group bias in the sparse tensor. We evaluate STAFF on tensor completion with various datasets under conventional and deep learning-based tensor models. STAFF consistently shows the best trade-off between completion error and group fairness; at most, it yields 36% lower MSE and 59% lower MADE than the second-best baseline.
[650] DAG-AFL: Directed Acyclic Graph-based Asynchronous Federated Learning
Shuaipeng Zhang, Lanju Kong, Yixin Zhang, Wei He, Yongqing Zheng, Han Yu, Lizhen Cui
Main category: cs.LG
TL;DR: Proposes DAG-AFL, a blockchain-based FL framework, to address inefficiency and resource issues in traditional FL by using DAG for asynchronous client participation and data heterogeneity.
Details
Motivation: Challenges in FL include vulnerability, coordination needs, and inefficiency due to traditional blockchain consensus mechanisms.
Method: Introduces DAG-AFL with a tip selection algorithm focusing on temporal freshness, node reachability, and model accuracy, plus DAG-based verification.
Result: Improves training efficiency by 22.7% and model accuracy by 6.5% on average compared to eight state-of-the-art methods.
Conclusion: DAG-AFL effectively enhances FL efficiency and accuracy while minimizing resource overhead.
Abstract: Due to the distributed nature of federated learning (FL), the vulnerability of the global model and the need for coordination among many client devices pose significant challenges. As a promising decentralized, scalable and secure solution, blockchain-based FL methods have attracted widespread attention in recent years. However, traditional blockchain consensus mechanisms such as Proof of Work (PoW) incur substantial resource consumption and compromise the efficiency of FL, particularly when participating devices are wireless and resource-limited. To address asynchronous client participation and data heterogeneity in FL, while limiting the additional resource overhead introduced by blockchain, we propose the Directed Acyclic Graph-based Asynchronous Federated Learning (DAG-AFL) framework. We develop a tip selection algorithm that considers temporal freshness, node reachability and model accuracy, with a DAG-based trusted verification strategy. Extensive experiments on 3 benchmarking datasets against eight state-of-the-art approaches demonstrate that DAG-AFL significantly improves training efficiency and model accuracy by 22.7% and 6.5% on average, respectively.
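A hedged sketch of what a composite tip-selection score over the three stated criteria might look like; the dict fields and the linear weighting are assumptions for illustration, not the paper's algorithm.

```python
import time

def tip_score(tip, w_fresh=1.0, w_reach=1.0, w_acc=1.0):
    """Composite score over temporal freshness, node reachability, and
    model accuracy. Field names and weights are illustrative."""
    freshness = 1.0 / (1.0 + time.time() - tip["timestamp"])
    return (w_fresh * freshness
            + w_reach * tip["reachability"]
            + w_acc * tip["accuracy"])

def select_tips(tips, k=2):
    # A client verifies and aggregates from the k best-scoring DAG tips.
    return sorted(tips, key=tip_score, reverse=True)[:k]
```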
[651] Reminiscence Attack on Residuals: Exploiting Approximate Machine Unlearning for Privacy
Yaxin Xiao, Qingqing Ye, Li Hu, Huadi Zheng, Haibo Hu, Zi Liang, Haoyang Li, Yijie Jiao
Main category: cs.LG
TL;DR: Machine unlearning algorithms leave implicit residuals, enabling privacy attacks like the Reminiscence Attack (ReA). A dual-phase unlearning framework mitigates this risk while maintaining efficiency.
Details
Motivation: To address privacy risks in approximate unlearning algorithms, which fail to fully protect unlearned data due to persistent residuals.
Method: Proposes the Reminiscence Attack (ReA) to exploit residuals and introduces a dual-phase unlearning framework to eliminate deep-layer traces and ensure convergence stability.
Result: ReA outperforms prior attacks (1.90x and 1.12x higher accuracy). The framework reduces attack accuracy to near-random guess with 2-12% computational cost of full retraining.
Conclusion: The dual-phase framework effectively mitigates residual-induced privacy risks while maintaining unlearning efficacy and computational efficiency.
Abstract: Machine unlearning enables the removal of specific data from ML models to uphold the right to be forgotten. While approximate unlearning algorithms offer efficient alternatives to full retraining, this work reveals that they fail to adequately protect the privacy of unlearned data. In particular, these algorithms introduce implicit residuals which facilitate privacy attacks targeting unlearned data. We observe that these residuals persist regardless of model architectures, parameters, and unlearning algorithms, exposing a new attack surface beyond conventional output-based leakage. Based on this insight, we propose the Reminiscence Attack (ReA), which amplifies the correlation between residuals and membership privacy through targeted fine-tuning processes. ReA achieves up to 1.90x and 1.12x higher accuracy than prior attacks when inferring class-wise and sample-wise membership, respectively. To mitigate such residual-induced privacy risk, we develop a dual-phase approximate unlearning framework that first eliminates deep-layer unlearned data traces and then enforces convergence stability to prevent models from “pseudo-convergence”, where their outputs are similar to retrained models but still preserve unlearned residuals. Our framework works for both classification and generation tasks. Experimental evaluations confirm that our approach maintains high unlearning efficacy while reducing the accuracy of adaptive privacy attacks to nearly random guessing, at a computational cost of 2-12% of full retraining from scratch.
[652] Fusing CFD and measurement data using transfer learning
Alexander Barklage, Philipp Bekemeyer
Main category: cs.LG
TL;DR: A data-driven neural network method is introduced to combine simulation and measurement data for aerodynamic analysis, outperforming traditional linear methods like proper orthogonal decomposition.
Details
Motivation: Current aerodynamic analysis methods vary in accuracy and resolution, creating a need for a unified approach that leverages the strengths of both simulation and measurement data.
Method: A neural network is trained first on high-resolution simulation data, then fine-tuned with sparse but accurate measurement data using transfer learning.
Result: The method improves accuracy near nonlinearities and provides solutions for arbitrary flow conditions, benefiting design and certification processes.
Conclusion: The proposed neural network approach is effective and generalizable, offering potential for more complex architectures in future applications.
Abstract: Aerodynamic analysis during aircraft design usually involves methods of varying accuracy and spatial resolution, which all have their advantages and disadvantages. It is therefore desirable to create data-driven models which effectively combine these advantages. To date, such data fusion methods for distributed quantities have relied mainly on proper orthogonal decomposition, a linear method. In this paper, we introduce a non-linear method based on neural networks combining simulation and measurement data via transfer learning. The network training accounts for the heterogeneity of the data, as simulation data usually features a high spatial resolution, while measurement data is sparse but more accurate. In a first step, the neural network is trained on simulation data to learn spatial features of the distributed quantities. The second step involves transfer learning on the measurement data to correct for systematic errors between simulation and measurement by only re-training a small subset of the entire neural network model. This approach is applied to a multilayer perceptron architecture and shows significant improvements over the established method based on proper orthogonal decomposition by producing more physical solutions near nonlinearities. In addition, the neural network provides solutions at arbitrary flow conditions, thus making the model useful for flight mechanical design, structural sizing, and certification. As the proposed training strategy is very general, it can also be applied to more complex neural network architectures in the future.
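The transfer-learning recipe reduces to pretraining on dense CFD data and then re-training a small subset of weights on sparse measurements. A minimal PyTorch sketch, with input/output choices (flow conditions in, a pressure coefficient out) as illustrative assumptions; freezing all but the final layer is one simple choice of "small subset", while the paper only specifies that a small part of the network is re-trained:

```python
import torch.nn as nn

# Step 1: pretrain on dense CFD simulation data (standard supervised
# loop, omitted). Step 2: transfer -- freeze all layers except the last,
# then re-train it on the sparse, more accurate measurement data.
model = nn.Sequential(
    nn.Linear(4, 128), nn.Tanh(),   # inputs, e.g. flow conditions + location (assumed)
    nn.Linear(128, 128), nn.Tanh(),
    nn.Linear(128, 1),              # output, e.g. a pressure coefficient (assumed)
)

for p in model.parameters():
    p.requires_grad = False
for p in model[-1].parameters():
    p.requires_grad = True          # only this subset is updated in step 2
```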
[653] PhaseNAS: Language-Model Driven Architecture Search with Dynamic Phase Adaptation
Fei Kong, Xiaohan Shan, Yanwei Hu, Jianmin Li
Main category: cs.LG
TL;DR: PhaseNAS is an LLM-based NAS framework with dynamic phase transitions and structured architecture templates, improving efficiency and accuracy across vision tasks.
Details
Motivation: Addressing the trade-off between exploration and efficiency in NAS, and overcoming limitations of static search strategies and ambiguous representations in LLM-based NAS.
Method: Uses dynamic phase transitions guided by real-time score thresholds and a structured architecture template language for consistent code generation.
Result: Achieves higher accuracy and better rank on NAS-Bench-Macro, reduces search time by up to 86% for CIFAR-10/100, and produces YOLOv8 variants with higher mAP and lower cost.
Conclusion: PhaseNAS is efficient, adaptive, and generalizable for diverse vision tasks.
Abstract: Neural Architecture Search (NAS) is challenged by the trade-off between search space exploration and efficiency, especially for complex tasks. While recent LLM-based NAS methods have shown promise, they often suffer from static search strategies and ambiguous architecture representations. We propose PhaseNAS, an LLM-based NAS framework with dynamic phase transitions guided by real-time score thresholds and a structured architecture template language for consistent code generation. On the NAS-Bench-Macro benchmark, PhaseNAS consistently discovers architectures with higher accuracy and better rank. For image classification (CIFAR-10/100), PhaseNAS reduces search time by up to 86% while maintaining or improving accuracy. In object detection, it automatically produces YOLOv8 variants with higher mAP and lower resource cost. These results demonstrate that PhaseNAS enables efficient, adaptive, and generalizable NAS across diverse vision tasks.
[654] Deep Generative Models of Evolution: SNP-level Population Adaptation by Genomic Linkage Incorporation
Julia Siekiera, Christian Schlötterer, Stefan Kramer
Main category: cs.LG
TL;DR: A deep generative neural network is introduced to model allele frequency trajectories in E&R experiments, addressing limitations of classic models and improving LD estimation in Pool-Seq data.
Details
Motivation: Classic statistical models like Wright-Fisher oversimplify assumptions and lack accuracy in population genomics. Deep generative models offer potential but face challenges in data demands and interpretability.
Method: A deep generative neural network integrates empirical observations and neighboring loci information to estimate allele frequency trajectories and LD.
Result: The model effectively captures allele frequency distributions and provides competitive LD estimation in Pool-Seq data, outperforming existing methods in high-LD scenarios.
Conclusion: Deep generative models are promising for evolutionary studies, offering improved accuracy and insights into LD in Pool-Seq data.
Abstract: The investigation of allele frequency trajectories in populations evolving under controlled environmental pressures has become a popular approach to study evolutionary processes on the molecular level. Statistical models based on well-defined evolutionary concepts can be used to validate different hypotheses about empirical observations. Despite their popularity, classic statistical models like the Wright-Fisher model suffer from simplified assumptions such as the independence of selected loci along a chromosome and uncertainty about the parameters. Deep generative neural networks offer a powerful alternative known for the integration of multivariate dependencies and noise reduction. Due to their high data demands and challenging interpretability, they have, so far, not been widely considered in the area of population genomics. To address the challenges in the area of Evolve and Resequencing experiments (E&R) based on pooled sequencing (Pool-Seq) data, we introduce a deep generative neural network that aims to model a concept of evolution based on empirical observations over time. The proposed model estimates the distribution of allele frequency trajectories by embedding the observations from single nucleotide polymorphisms (SNPs) with information from neighboring loci. Evaluation on simulated E&R experiments demonstrates the model’s ability to capture the distribution of allele frequency trajectories and illustrates the representational power of deep generative models on the example of linkage disequilibrium (LD) estimation. Inspecting the internally learned representations enables estimating pairwise LD, which is typically inaccessible in Pool-Seq data. Our model provides competitive LD estimation in Pool-Seq data with a high degree of LD when compared to existing methods.
[655] Novel Pivoted Cholesky Decompositions for Efficient Gaussian Process Inference
Filip de Roos, Fabio Muratore
Main category: cs.LG
TL;DR: The paper explores novel pivoting strategies for Cholesky decomposition, linking it to Bayesian nonparametric inference and improving efficiency in Gaussian processes.
Details
Motivation: Improving numerical stability and efficiency of Cholesky decomposition, especially for early termination, by leveraging connections to Bayesian nonparametric inference.
Method: Introduces new pivoting strategies inspired by greedy entropy maximization, tailored for Cholesky decomposition, and benchmarks them on Gaussian process tasks.
Result: The new strategies match or outperform traditional methods with minimal additional computational cost.
Conclusion: The proposed pivoting strategies enhance Cholesky decomposition’s efficiency and applicability in Gaussian processes.
Abstract: The Cholesky decomposition is a fundamental tool for solving linear systems with symmetric and positive definite matrices which are ubiquitous in linear algebra, optimization, and machine learning. Its numerical stability can be improved by introducing a pivoting strategy that iteratively permutes the rows and columns of the matrix. The order of pivoting indices determines how accurately the intermediate decomposition can reconstruct the original matrix, and is thus decisive for the algorithm’s efficiency in the case of early termination. Standard implementations select the next pivot from the largest value on the diagonal. In the case of Bayesian nonparametric inference, this strategy corresponds to greedy entropy maximization, which is often used in active learning and design of experiments. We explore this connection in detail and deduce novel pivoting strategies for the Cholesky decomposition. The resulting algorithms are more efficient at reducing the uncertainty over a data set, can be updated to include information about observations, and additionally benefit from a tailored implementation. We benchmark the effectiveness of the new selection strategies on two tasks important to Gaussian processes: sparse regression and inference based on preconditioned iterative solvers. Our results show that the proposed selection strategies are either on par or, in most cases, outperform traditional baselines while requiring a negligible amount of additional computation.
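For reference, the standard diagonally pivoted Cholesky with early termination that the paper departs from looks like this minimal NumPy sketch; the novel entropy-inspired strategies would replace the argmax pivot rule.

```python
import numpy as np

def pivoted_cholesky(A, rank, tol=1e-12):
    """Diagonally pivoted Cholesky with early termination (standard rule:
    pivot on the largest remaining diagonal entry). Returns L (n x rank)
    and the pivot order, so A[np.ix_(piv, piv)] ~= L @ L.T up to rank."""
    A = A.astype(float).copy()
    n = A.shape[0]
    piv = np.arange(n)
    L = np.zeros((n, rank))
    for k in range(rank):
        j = k + int(np.argmax(np.diag(A)[k:]))   # pivot selection step
        if j != k:                               # symmetric permutation
            A[[k, j], :] = A[[j, k], :]
            A[:, [k, j]] = A[:, [j, k]]
            piv[[k, j]] = piv[[j, k]]
            L[[k, j], :] = L[[j, k], :]
        d = A[k, k]
        if d < tol:                              # early termination
            break
        L[k, k] = np.sqrt(d)
        L[k + 1:, k] = A[k + 1:, k] / L[k, k]
        A[k + 1:, k + 1:] -= np.outer(L[k + 1:, k], L[k + 1:, k])
    return L, piv
```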
[656] Exposing the Illusion of Fairness: Auditing Vulnerabilities to Distributional Manipulation Attacks
Valentin Lafargue, Adriana Laurindo Monteiro, Emmanuelle Claeys, Laurent Risser, Jean-Michel Loubes
Main category: cs.LG
TL;DR: The paper addresses the challenge of ensuring AI algorithm compliance with fairness regulations, focusing on manipulating data to meet fairness criteria and detecting such manipulations.
Details
Motivation: With AI's growing real-life applications, proving compliance and avoiding biased behaviors is critical under regulations like the EU AI Act. Current fairness audits rely on global metrics like Disparate Impact, which are sample-dependent.
Method: The study introduces methods to manipulate data samples to satisfy fairness constraints (using entropic or optimal transport projections) and explores detection techniques for such manipulations.
Result: The paper provides mathematically sound methods for modifying datasets under fairness constraints, examines potential circumvention of audits, and offers detection recommendations.
Conclusion: Validated on classical datasets, the findings highlight the need for robust auditing techniques to ensure genuine fairness in AI algorithms.
Abstract: Proving the compliance of AI algorithms has become an important challenge with the growing deployment of such algorithms for real-life applications. Inspecting possible biased behaviors is mandatory to satisfy the requirements of the EU Artificial Intelligence Act. Regulation-driven audits increasingly rely on global fairness metrics, with Disparate Impact being the most widely used. Yet such global measures depend highly on the distribution of the sample on which the measures are computed. We first investigate how to manipulate data samples to artificially satisfy fairness criteria, creating minimally perturbed datasets that remain statistically indistinguishable from the original distribution while satisfying prescribed fairness constraints. Then we study how to detect such manipulation. Our analysis (i) introduces mathematically sound methods for modifying empirical distributions under fairness constraints using entropic or optimal transport projections, (ii) examines how an auditee could potentially circumvent fairness inspections, and (iii) offers recommendations to help auditors detect such data manipulations. These results are validated through experiments on classical tabular datasets in bias detection.
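Since Disparate Impact is the audit statistic at stake, a compact definition helps; this is the standard four-fifths-rule formulation, computed on a finite sample, which is exactly what makes it manipulable.

```python
import numpy as np

def disparate_impact(y_pred, group):
    """Ratio of positive-outcome rates: unprivileged (group == 0) over
    privileged (group == 1). Values near 1 indicate parity; the classical
    four-fifths audit threshold is 0.8."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return y_pred[group == 0].mean() / y_pred[group == 1].mean()

# Because the statistic is computed on a finite sample, a minimally
# perturbed sample (e.g. via an entropic or optimal-transport projection)
# can push it past the threshold while staying statistically close to
# the original distribution, which is the manipulation the paper audits.
```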
[657] Prostate Cancer Classification Using Multimodal Feature Fusion and Explainable AI
Asma Sadia Khan, Fariba Tasnia Khan, Tanjim Mahmud, Salman Karim Khan, Rishita Chakma, Nahed Sharmen, Mohammad Shahadat Hossain, Karl Andersson
Main category: cs.LG
TL;DR: An explainable AI system combining BERT and Random Forest for prostate cancer diagnosis achieves high accuracy (98%) and AUC (99%), with improved recall for intermediate stages, while maintaining interpretability.
Details
Motivation: Prostate cancer diagnosis requires advanced, interpretable tools. The study aims to improve diagnostic performance using a multimodal AI approach.
Method: Combines BERT for clinical notes and Random Forest for lab data via a novel fusion strategy, validated on the PLCO-NIH dataset.
Result: Achieves 98% accuracy, 99% AUC, and improved recall (0.900) for intermediate stages. SHAP analysis ensures interpretability.
Conclusion: The BERT+RF pipeline offers a high-performance, interpretable, and efficient solution for prostate cancer diagnostics.
Abstract: Prostate cancer, the second most prevalent male malignancy, requires advanced diagnostic tools. We propose an explainable AI system combining BERT (for textual clinical notes) and Random Forest (for numerical lab data) through a novel multimodal fusion strategy, achieving superior classification performance on the PLCO-NIH dataset (98% accuracy, 99% AUC). While multimodal fusion is established, our work demonstrates that a simple yet interpretable BERT+RF pipeline delivers clinically significant improvements, particularly for intermediate cancer stages (Class 2/3 recall: 0.900 combined vs 0.824 numerical/0.725 textual). SHAP analysis provides transparent feature importance rankings, while ablation studies prove textual features’ complementary value. This accessible approach offers hospitals a balance of high performance (F1=89%), computational efficiency, and clinical interpretability, addressing critical needs in prostate cancer diagnostics.
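A hedged sketch of one way to realize a BERT-plus-Random-Forest pipeline: embed notes with a generic BERT checkpoint and concatenate with lab features. The checkpoint name and plain concatenation are assumptions; the paper describes its fusion strategy as novel, so treat this as the baseline shape of the idea.

```python
import numpy as np
import torch
from sklearn.ensemble import RandomForestClassifier
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed checkpoint
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed_notes(notes):
    # [CLS] embedding of each clinical note, without fine-tuning.
    with torch.no_grad():
        batch = tok(notes, padding=True, truncation=True, return_tensors="pt")
        return bert(**batch).last_hidden_state[:, 0].numpy()

def fuse_and_fit(notes, lab_features, labels):
    # Simple late fusion: concatenate text embeddings with numeric labs.
    X = np.concatenate([embed_notes(notes), lab_features], axis=1)
    return RandomForestClassifier(n_estimators=300).fit(X, labels)
```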
[658] Uncertainty-driven Embedding Convolution
Sungjun Lim, Kangjun Noh, Youngjun Choi, Heeyoung Lee, Kyungwoo Song
Main category: cs.LG
TL;DR: UEC improves NLP embedding ensembles by incorporating uncertainty, enhancing performance and robustness.
Details
Motivation: Existing ensemble methods for text embeddings ignore model-specific uncertainty, limiting robustness.
Method: UEC transforms deterministic embeddings into probabilistic ones, computes adaptive ensemble weights based on uncertainty, and introduces an uncertainty-aware similarity function.
Result: UEC consistently improves performance and robustness in retrieval, classification, and semantic similarity tasks.
Conclusion: UEC leverages principled uncertainty modeling to enhance embedding ensembles, outperforming deterministic methods.
Abstract: Text embeddings are essential components in modern NLP pipelines. While numerous embedding models have been proposed, their performance varies across domains, and no single model consistently excels across all tasks. This variability motivates the use of ensemble techniques to combine complementary strengths. However, most existing ensemble methods operate on deterministic embeddings and fail to account for model-specific uncertainty, limiting their robustness and reliability in downstream applications. To address these limitations, we propose Uncertainty-driven Embedding Convolution (UEC). UEC first transforms deterministic embeddings into probabilistic ones in a post-hoc manner. It then computes adaptive ensemble weights based on embedding uncertainty, grounded in a Bayes-optimal solution under a surrogate loss. Additionally, UEC introduces an uncertainty-aware similarity function that directly incorporates uncertainty into similarity scoring. Extensive experiments on retrieval, classification, and semantic similarity benchmarks demonstrate that UEC consistently improves both performance and robustness by leveraging principled uncertainty modeling.
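Precision (inverse-variance) weighting is the intuitive analogue of UEC's uncertainty-based weights; a minimal sketch under that assumption, with the caveat that the paper derives its weights from a Bayes-optimal solution under a surrogate loss rather than this exact rule.

```python
import numpy as np

def uncertainty_weighted_ensemble(means, variances, eps=1e-8):
    """Fuse probabilistic embeddings from several models with
    inverse-variance (precision) weights.

    means, variances: lists of (dim,) arrays, one pair per model.
    Returns the fused mean and the fused per-dimension variance.
    """
    precisions = [1.0 / (v + eps) for v in variances]
    total = np.sum(precisions, axis=0)
    fused = np.sum([p * m for p, m in zip(precisions, means)], axis=0) / total
    return fused, 1.0 / total   # uncertain models contribute less
```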
[659] First Hallucination Tokens Are Different from Conditional Ones
Jakob Snel, Seong Joon Oh
Main category: cs.LG
TL;DR: The paper analyzes token-level hallucination signals in foundational models, finding the first hallucinated token is more detectable than subsequent ones.
Details
Motivation: Understanding token-level hallucination signals is crucial for real-time filtering and correction in foundational models.
Method: The study uses the RAGTruth corpus with token-level annotations and reproduced logits to analyze hallucination signals.
Result: The first hallucinated token has a stronger and more detectable signal than conditional tokens.
Conclusion: The findings improve understanding of token-level hallucination, with tools and code released for further research.
Abstract: Hallucination, the generation of untruthful content, is one of the major concerns regarding foundational models. Detecting hallucinations at the token level is vital for real-time filtering and targeted correction, yet the variation of hallucination signals within token sequences is not fully understood. Leveraging the RAGTruth corpus with token-level annotations and reproduced logits, we analyse how these signals depend on a token’s position within hallucinated spans, contributing to an improved understanding of token-level hallucination. Our results show that the first hallucinated token carries a stronger signal and is more detectable than conditional tokens. We release our analysis framework, along with code for logit reproduction and metric computation at https://github.com/jakobsnl/RAGTruth_Xtended.
[660] BuildSTG: A Multi-building Energy Load Forecasting Method using Spatio-Temporal Graph Neural Network
Yongzheng Liu, Yiming Wang, Po Xu, Yingjie Xu, Yuntian Chen, Dongxiao Zhang
Main category: cs.LG
TL;DR: A spatio-temporal graph neural network approach is proposed for multi-building energy load prediction, outperforming traditional methods by capturing spatial dependencies and offering interpretability.
Details
Motivation: Conventional methods fail to capture spatial dependencies in building energy patterns, prompting the need for a data-driven approach that leverages similarities between buildings.
Method: The method involves constructing a graph based on building characteristics and environmental factors, using a multi-level graph convolutional architecture with attention for prediction, and interpreting the optimized graph structure.
Result: The approach outperforms baselines like XGBoost, SVR, FCNN, GRU, and Naive on the Building Data Genome Project 2 dataset, demonstrating robustness and generalization.
Conclusion: The proposed spatio-temporal graph neural network effectively captures building similarities and spatial relationships, offering superior performance and interpretability in energy load prediction.
Abstract: Due to the extensive availability of operation data, data-driven methods show strong capabilities in predicting building energy loads. Buildings with similar features often share energy patterns, reflected by spatial dependencies in their operational data, which conventional prediction methods struggle to capture. To overcome this, we propose a multi-building prediction approach using spatio-temporal graph neural networks, comprising graph representation, graph learning, and interpretation. First, a graph is built based on building characteristics and environmental factors. Next, a multi-level graph convolutional architecture with attention is developed for energy prediction. Lastly, a method interpreting the optimized graph structure is introduced. Experiments on the Building Data Genome Project 2 dataset confirm superior performance over baselines such as XGBoost, SVR, FCNN, GRU, and Naive, highlighting the method’s robustness, generalization, and interpretability in capturing meaningful building similarities and spatial relationships.
[661] Towards Explainable Deep Clustering for Time Series Data
Udo Schlegel, Gabriel Marques Tavares, Thomas Seidl
Main category: cs.LG
TL;DR: A survey on explainable deep clustering for time series, highlighting methods, gaps, and research opportunities to improve interpretability and trustworthiness.
Details
Motivation: Deep clustering is powerful but lacks transparency, limiting its use in safety-critical applications. This survey aims to bridge the gap by reviewing methods and proposing future directions.
Method: Analyzes peer-reviewed and preprint papers, focusing on autoencoder and attention architectures, and evaluates their applicability in domains like healthcare, finance, IoT, and climate science.
Result: Identifies reliance on certain architectures, gaps in handling streaming/irregular data, and interpretability as an afterthought. Proposes six research opportunities to advance the field.
Conclusion: Interpretability should be a primary design goal. The survey lays groundwork for trustworthy deep clustering in time series analytics.
Abstract: Deep clustering uncovers hidden patterns and groups in complex time series data, yet its opaque decision-making limits use in safety-critical settings. This survey offers a structured overview of explainable deep clustering for time series, collecting current methods and their real-world applications. We thoroughly discuss and compare peer-reviewed and preprint papers through application domains across healthcare, finance, IoT, and climate science. Our analysis reveals that most work relies on autoencoder and attention architectures, with limited support for streaming, irregularly sampled, or privacy-preserved series, and interpretability is still primarily treated as an add-on. To push the field forward, we outline six research opportunities: (1) combining complex networks with built-in interpretability; (2) setting up clear, faithfulness-focused evaluation metrics for unsupervised explanations; (3) building explainers that adapt to live data streams; (4) crafting explanations tailored to specific domains; (5) adding human-in-the-loop methods that refine clusters and explanations together; and (6) improving our understanding of how time series clustering models work internally. By making interpretability a primary design goal rather than an afterthought, we propose the groundwork for the next generation of trustworthy deep clustering time series analytics.
[662] Geometry of Neural Reinforcement Learning in Continuous State and Action Spaces
Saket Tiwari, Omer Gottesman, George Konidaris
Main category: cs.LG
TL;DR: The paper proposes a geometric approach to understand the theoretical underpinnings of RL in continuous state and action spaces, linking the dimensionality of attainable states to the action space.
Details
Motivation: Despite RL's success in continuous spaces, theoretical work mostly focuses on finite spaces. The paper aims to bridge this gap by analyzing the geometry of attainable states.
Method: A geometric lens is used to study the manifold of attainable states induced by a two-layer neural policy trained via actor-critic. Theoretical bounds on manifold dimensionality are derived and empirically validated.
Result: The dimensionality of the manifold of attainable states is shown to scale with the action space’s dimensionality, empirically verified in MuJoCo and toy environments.
Conclusion: The findings enable practical improvements, such as a manifold learning layer for high-DOF control tasks, demonstrating the theory’s applicability.
Abstract: Advances in reinforcement learning (RL) have led to its successful application in complex tasks with continuous state and action spaces. Despite these advances in practice, most theoretical work pertains to finite state and action spaces. We propose building a theoretical understanding of continuous state and action spaces by employing a geometric lens to understand the locally attained set of states. The set of all parametrised policies learnt through a semi-gradient based approach induces a set of attainable states in RL. We show that the training dynamics of a two-layer neural policy induce a low dimensional manifold of attainable states embedded in the high-dimensional nominal state space trained using an actor-critic algorithm. We prove that, under certain conditions, the dimensionality of this manifold is of the order of the dimensionality of the action space. This is the first result of its kind, linking the geometry of the state space to the dimensionality of the action space. We empirically corroborate this upper bound for four MuJoCo environments and also demonstrate the results in a toy environment with varying dimensionality. We also show the applicability of this theoretical result by introducing a local manifold learning layer to the policy and value function networks to improve the performance in control environments with very high degrees of freedom by changing one layer of the neural network to learn sparse representations.
[663] Bi-cephalic self-attended model to classify Parkinson’s disease patients with freezing of gait
Shomoita Jahid Mitin, Rodrigue Rizk, Maximilian Scherer, Thomas Koeglsperger, Daniel Lench, KC Santosh, Arun Singh
Main category: cs.LG
TL;DR: A multi-modal model using EEG and clinical data achieves 88% accuracy in detecting Parkinson’s gait dysfunction.
Details
Motivation: Current methods for detecting gait dysfunction in Parkinson's disease are subjective or require specialized tools. This study aims to create an objective, data-driven solution.
Method: Used resting-state EEG signals and clinical variables to train a Bi-cephalic Self-Attention Model (BiSAM) on 124 participants (PD patients with/without FOG and healthy controls).
Result: Multi-modal models (EEG + clinical data) outperformed signal-only or descriptive-only models, achieving 88% accuracy with minimal EEG channels.
Conclusion: The study presents a scalable, efficient method for detecting PD-related gait dysfunction, suitable for clinical use.
Abstract: Parkinson Disease (PD) often results in motor and cognitive impairments, including gait dysfunction, particularly in patients with freezing of gait (FOG). Current detection methods are either subjective or reliant on specialized gait analysis tools. This study aims to develop an objective, data-driven, and multi-modal classification model to detect gait dysfunction in PD patients using resting-state EEG signals combined with demographic and clinical variables. We utilized a dataset of 124 participants: 42 PD patients with FOG (PDFOG+), 41 without FOG (PDFOG-), and 41 age-matched healthy controls. Features extracted from resting-state EEG and descriptive variables (age, education, disease duration) were used to train a novel Bi-cephalic Self-Attention Model (BiSAM). We tested three modalities: signal-only, descriptive-only, and multi-modal, across different EEG channel subsets (BiSAM-63, -16, -8, and -4). Signal-only and descriptive-only models showed limited performance, achieving a maximum accuracy of 55% and 68%, respectively. In contrast, the multi-modal models significantly outperformed both, with BiSAM-8 and BiSAM-4 achieving the highest classification accuracy of 88%. These results demonstrate the value of integrating EEG with objective descriptive features for robust PDFOG+ detection. This study introduces a multi-modal, attention-based architecture that objectively classifies PDFOG+ using minimal EEG channels and descriptive variables. This approach offers a scalable and efficient alternative to traditional assessments, with potential applications in routine clinical monitoring and early diagnosis of PD-related gait dysfunction.
[664] Online hierarchical partitioning of the output space in extreme multi-label data stream
Lara Neves, Afonso Lourenço, Alberto Cano, Goreti Marreiros
Main category: cs.LG
TL;DR: iHOMER is an online multi-label learning framework that dynamically clusters labels and adapts to concept drift, outperforming state-of-the-art methods by significant margins.
Details
Motivation: Addressing challenges in multi-label data streams like evolving distributions, high-dimensional label spaces, and concept drift affecting label correlations.
Method: Uses incremental divisive-agglomerative clustering and a global tree-based learner with drift detection at global and local levels.
Result: Outperforms 5 global baselines by 23% and 12 local baselines by 32% in experiments on 23 datasets.
Conclusion: iHOMER is robust for online multi-label classification, effectively handling non-stationarity and label dependencies.
Abstract: Mining data streams with multi-label outputs poses significant challenges due to evolving distributions, high-dimensional label spaces, sparse label occurrences, and complex label dependencies. Moreover, concept drift affects not only input distributions but also label correlations and imbalance ratios over time, complicating model adaptation. To address these challenges, structured learners are categorized into local and global methods. Local methods break down the task into simpler components, while global methods adapt the algorithm to the full output space, potentially yielding better predictions by exploiting label correlations. This work introduces iHOMER (Incremental Hierarchy Of Multi-label Classifiers), an online multi-label learning framework that incrementally partitions the label space into disjoint, correlated clusters without relying on predefined hierarchies. iHOMER leverages online divisive-agglomerative clustering based on Jaccard similarity and a global tree-based learner driven by a multivariate Bernoulli process to guide instance partitioning. To address non-stationarity, it integrates drift detection mechanisms at both global and local levels, enabling dynamic restructuring of label partitions and subtrees. Experiments across 23 real-world datasets show iHOMER outperforms 5 state-of-the-art global baselines, such as MLHAT, MLHT of Pruned Sets and iSOUPT, by 23%, and 12 local baselines, such as binary relevance transformations of kNN, EFDT, ARF, and ADWIN bagging/boosting ensembles, by 32%, establishing its robustness for online multi-label classification.
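The Jaccard similarity driving the divisive-agglomerative clustering can be illustrated on a co-occurrence view of labels, where each label is represented by the set of instances it appears in; the toy data below is purely illustrative.

```python
def jaccard(label_a_set, label_b_set):
    """Jaccard similarity between two labels, each represented by the
    set of instance ids in which the label occurred."""
    inter = len(label_a_set & label_b_set)
    union = len(label_a_set | label_b_set)
    return inter / union if union else 0.0

# Labels that fire on almost the same instances get grouped into one
# cluster; a drift detector can later split or merge clusters as the
# co-occurrence pattern changes in the stream.
occ = {"sports": {1, 2, 3, 5}, "football": {2, 3, 5}, "politics": {4, 6}}
print(jaccard(occ["sports"], occ["football"]))   # 0.75 -> same cluster
print(jaccard(occ["sports"], occ["politics"]))   # 0.0  -> separate
```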
[665] Modeling User Behavior from Adaptive Surveys with Supplemental Context
Aman Shukla, Daniel Patrick Scantlebury, Rishabh Kumar
Main category: cs.LG
TL;DR: LANTERN is a modular architecture for user behavior modeling, combining survey data with contextual signals to improve prediction accuracy while maintaining survey primacy.
Details
Motivation: Surveys alone are limited by user fatigue and incomplete responses, necessitating a method to enrich behavioral data with contextual signals.
Method: LANTERN uses selective gating, residual connections, and late fusion via cross-attention to integrate survey data with external modalities.
Result: LANTERN outperforms survey-only baselines in multi-label prediction and shows benefits of selective modality reliance.
Conclusion: LANTERN offers a scalable, extensible solution for behavior modeling in survey-centric applications.
Abstract: Modeling user behavior is critical across many industries where understanding preferences, intent, or decisions informs personalization, targeting, and strategic outcomes. Surveys have long served as a classical mechanism for collecting such behavioral data due to their interpretability, structure, and ease of deployment. However, surveys alone are inherently limited by user fatigue, incomplete responses, and practical constraints on their length, making them insufficient for capturing user behavior. In this work, we present LANTERN (Late-Attentive Network for Enriched Response Modeling), a modular architecture for modeling user behavior by fusing adaptive survey responses with supplemental contextual signals. We demonstrate the architectural value of maintaining survey primacy through selective gating, residual connections and late fusion via cross-attention, treating survey data as the primary signal while incorporating external modalities only when relevant. LANTERN outperforms strong survey-only baselines in multi-label prediction of survey responses. We further investigate threshold sensitivity and the benefits of selective modality reliance through ablation and rare/frequent attribute analysis. LANTERN’s modularity supports scalable integration of new encoders and evolving datasets. This work provides a practical and extensible blueprint for behavior modeling in survey-centric applications.
[666] Zero-Shot Learning with Subsequence Reordering Pretraining for Compound-Protein Interaction
Hongzhi Zhang, Zhonglie Liu, Kun Meng, Jiameng Chen, Jia Wu, Bo Du, Di Lin, Yan Che, Wenbin Hu
Main category: cs.LG
TL;DR: A novel approach for zero-shot compound-protein interaction (CPI) prediction using subsequence reordering and length-variable protein augmentation improves performance, especially in data-scarce scenarios.
Details
Motivation: Address challenges in CPI prediction, such as overlooked interdependencies in protein sequences and reliance on large datasets, to enhance scalability and efficiency.
Method: Pretrains protein representations using subsequence reordering and applies length-variable protein augmentation for small datasets.
Result: Outperforms baseline methods, showing superior performance in zero-shot and data-scarce scenarios.
Conclusion: The proposed method effectively improves CPI prediction, particularly where training data is limited.
Abstract: Given the vastness of chemical space and the ongoing emergence of previously uncharacterized proteins, zero-shot compound-protein interaction (CPI) prediction better reflects the practical challenges and requirements of real-world drug development. Although existing methods perform adequately during certain CPI tasks, they still face the following challenges: (1) Representation learning from local or complete protein sequences often overlooks the complex interdependencies between subsequences, which are essential for predicting spatial structures and binding properties. (2) Dependence on large-scale or scarce multimodal protein datasets demands significant training data and computational resources, limiting scalability and efficiency. To address these challenges, we propose a novel approach that pretrains protein representations for CPI prediction tasks using subsequence reordering, explicitly capturing the dependencies between protein subsequences. Furthermore, we apply length-variable protein augmentation to ensure excellent pretraining performance on small training datasets. To evaluate the model’s effectiveness and zero-shot learning ability, we combine it with various baseline methods. The results demonstrate that our approach can improve the baseline model’s performance on the CPI task, especially in the challenging zero-shot scenario. Compared to existing pre-training models, our model demonstrates superior performance, particularly in data-scarce scenarios where training samples are limited. Our implementation is available at https://github.com/Hoch-Zhang/PSRP-CPI.
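The pretext task is concrete enough to sketch: split a sequence into segments, shuffle them, and ask the model to recover the original order. A toy version of the augmentation, with the segment count as an assumed hyperparameter:

```python
import random

def reorder_pretext(seq: str, n_segments: int = 4):
    """Split a protein sequence into segments, shuffle them, and return the
    shuffled sequence plus the permutation the model must recover (illustrative)."""
    k, r = divmod(len(seq), n_segments)
    bounds = [i * k + min(i, r) for i in range(n_segments + 1)]
    segments = [seq[bounds[i]:bounds[i + 1]] for i in range(n_segments)]
    order = list(range(n_segments))
    random.shuffle(order)
    return "".join(segments[i] for i in order), order

shuffled, target_order = reorder_pretext("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```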
[667] Breaking the Precision Ceiling in Physics-Informed Neural Networks: A Hybrid Fourier-Neural Architecture for Ultra-High Accuracy
Wei Shan Lee, Chi Kiu Althina Chau, Kei Chon Sio, Kam Ian Leong
Main category: cs.LG
TL;DR: A hybrid Fourier-neural architecture breaks the precision ceiling of PINNs for fourth-order PDEs, achieving a 17-fold improvement in L2 error over standard PINNs and outperforming traditional methods by 15-500x.
Details
Motivation: The precision plateau of PINNs (errors of $10^{-3}$-$10^{-4}$) limits their adoption in engineering. This work aims to surpass this barrier.Method: Combines a truncated Fourier series (10 harmonics) with a deep neural network for adaptive corrections, using a two-phase optimization strategy (Adam + L-BFGS) and adaptive weight balancing.
Result: Achieves unprecedented L2 error of $1.94 \times 10^{-7}$, with training completed in under 30 minutes on GPU.
Conclusion: Proper design enables ultra-precision in machine learning for scientific computing, surpassing traditional numerical methods.
Abstract: Physics-informed neural networks (PINNs) have plateaued at errors of $10^{-3}$-$10^{-4}$ for fourth-order partial differential equations, creating a perceived precision ceiling that limits their adoption in engineering applications. We break through this barrier with a hybrid Fourier-neural architecture for the Euler-Bernoulli beam equation, achieving an unprecedented L2 error of $1.94 \times 10^{-7}$, a 17-fold improvement over standard PINNs and $15$-$500\times$ better than traditional numerical methods. Our approach synergistically combines a truncated Fourier series capturing dominant modal behavior with a deep neural network providing adaptive residual corrections. A systematic harmonic optimization study revealed a counter-intuitive discovery: exactly 10 harmonics yield optimal performance, with accuracy catastrophically degrading from $10^{-7}$ to $10^{-1}$ beyond this threshold. The two-phase optimization strategy (Adam followed by L-BFGS) and adaptive weight balancing enable stable ultra-precision convergence. GPU-accelerated implementation achieves sub-30-minute training despite fourth-order derivative complexity. By addressing 12 critical gaps in existing approaches, from architectural rigidity to optimization landscapes, this work demonstrates that ultra-precision is achievable through proper design, opening new paradigms for scientific computing where machine learning can match or exceed traditional numerical methods.
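A sketch of the hybrid ansatz under stated assumptions (a Dirichlet-style sine basis and an illustrative MLP width; the paper's exact parameterization may differ):

```python
import torch
import torch.nn as nn

class FourierNeuralAnsatz(nn.Module):
    """u(x) = truncated Fourier series + small MLP correction: a sketch of the
    hybrid architecture; coefficients, depth, and the beam length L are assumptions."""
    def __init__(self, n_harmonics: int = 10, length: float = 1.0):
        super().__init__()
        self.coeffs = nn.Parameter(torch.zeros(n_harmonics))  # modal amplitudes
        self.n = torch.arange(1, n_harmonics + 1).float()
        self.length = length
        self.mlp = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, 1)
        modes = torch.sin(torch.pi * x * self.n / self.length)  # (N, n_harmonics)
        return modes @ self.coeffs.unsqueeze(-1) + self.mlp(x)  # series + correction
```

The two-phase schedule described in the abstract would then train this module with torch.optim.Adam first and finish with torch.optim.LBFGS on the PDE residual loss.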
[668] PySHRED: A Python package for SHallow REcurrent Decoding for sparse sensing, model reduction and scientific discovery
David Ye, Jan Williams, Mars Gao, Stefano Riva, Matteo Tomasetto, David Zoro, J. Nathan Kutz
Main category: cs.LG
TL;DR: PySHRED 1.0 is a Python package implementing SHRED, a deep learning strategy for modeling high-dimensional dynamical systems, with features for robust sensing, reduced order modeling, and physics discovery.
Details
Motivation: To provide a tool for handling real-world spatiotemporal data that may be noisy, multi-scale, or high-dimensional.Method: Uses SHallow REcurrent Decoders (SHRED) and includes preprocessors and advanced methods for noisy, nonlinear data.
Result: A modular, well-documented Python package (PySHRED) with extensive examples, released under MIT license.
Conclusion: PySHRED 1.0 is a versatile, accessible tool for modeling complex dynamical systems, with potential for future extensions.
Abstract: SHallow REcurrent Decoders (SHRED) provide a deep learning strategy for modeling high-dimensional dynamical systems and/or spatiotemporal data from dynamical system snapshot observations. PySHRED is a Python package that implements SHRED and several of its major extensions, including for robust sensing, reduced order modeling and physics discovery. In this paper, we introduce the version 1.0 release of PySHRED, which includes data preprocessors and a number of cutting-edge SHRED methods specifically designed to handle real-world data that may be noisy, multi-scale, parameterized, prohibitively high-dimensional, and strongly nonlinear. The package is easy to install, thoroughly documented, supplemented with extensive code examples, and modularly structured to support future additions. The entire codebase is released under the MIT license and is available at https://github.com/pyshred-dev/pyshred.
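A minimal sketch of the SHRED architecture itself (an LSTM over a window of sparse sensor traces followed by a shallow decoder), not the PySHRED package API; layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class SHRED(nn.Module):
    """Shallow recurrent decoder: an LSTM reads a time window of a few sensor
    traces, a shallow MLP decodes the full high-dimensional state."""
    def __init__(self, n_sensors: int, state_dim: int, hidden: int = 64):
        super().__init__()
        self.rnn = nn.LSTM(n_sensors, hidden, num_layers=2, batch_first=True)
        self.decoder = nn.Sequential(
            nn.Linear(hidden, 350), nn.ReLU(), nn.Linear(350, state_dim)
        )

    def forward(self, sensor_window: torch.Tensor) -> torch.Tensor:
        # sensor_window: (batch, time, n_sensors) -> full state (batch, state_dim)
        _, (h, _) = self.rnn(sensor_window)
        return self.decoder(h[-1])
```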
[669] PROVCREATOR: Synthesizing Complex Heterogenous Graphs with Node and Edge Attributes
Tianhao Wang, Simon Klancher, Kunal Mukherjee, Josh Wiedemeier, Feng Chen, Murat Kantarcioglu, Kangkook Jee
Main category: cs.LG
TL;DR: ProvCreator is a synthetic graph framework for complex heterogeneous graphs, using transformer-based models for realistic and privacy-aware graph generation.
Details
Motivation: Addressing the challenge of synthetic graph generation for real-world graphs with complex, heterogeneous schemas, which existing methods fail to handle effectively.Method: ProvCreator formulates graph synthesis as a sequence generation task, employing a graph-to-sequence encoder-decoder with lossless encoding, efficient compression, and end-to-end learnable generation.
Result: Validated on cybersecurity provenance graphs and knowledge graphs, ProvCreator successfully captures intricate structure-semantics dependencies, generating realistic synthetic datasets.
Conclusion: ProvCreator advances synthetic graph generation by handling complex, heterogeneous graphs with high fidelity, proving effective in real-world applications.
Abstract: The rise of graph-structured data has driven interest in graph learning and synthetic data generation. While successful in text and image domains, synthetic graph generation remains challenging, especially for real-world graphs with complex, heterogeneous schemas. Existing research has focused mostly on homogeneous structures with simple attributes, limiting their usefulness and relevance for application domains requiring semantic fidelity. In this research, we introduce ProvCreator, a synthetic graph framework designed for complex heterogeneous graphs with high-dimensional node and edge attributes. ProvCreator formulates graph synthesis as a sequence generation task, enabling the use of transformer-based large language models. It features a versatile graph-to-sequence encoder-decoder that (1) losslessly encodes graph structure and attributes, (2) efficiently compresses large graphs for contextual modeling, and (3) supports end-to-end, learnable graph generation. To validate our research, we evaluate ProvCreator on two challenging domains: system provenance graphs in cybersecurity and knowledge graphs from the IntelliGraph Benchmark Dataset. In both cases, ProvCreator captures intricate dependencies between structure and semantics, enabling the generation of realistic and privacy-aware synthetic datasets.
[670] From Entanglement to Alignment: Representation Space Decomposition for Unsupervised Time Series Domain Adaptation
Rongyao Cai, Ming Jin, Qingsong Wen, Kexin Zhang
Main category: cs.LG
TL;DR: DARSD is a novel UDA framework addressing domain shift in time series by decomposing representations into transferable and non-transferable components, outperforming 12 UDA methods.
Details
Motivation: Domain shift in time series causes models to fail when applied to target domains with different distributions. Current UDA methods ignore intrinsic feature compositions.Method: DARSD uses (I) an adversarial invariant basis, (II) prototypical pseudo-labeling, and (III) hybrid contrastive optimization to disentangle and align features.
Result: DARSD outperforms 12 UDA algorithms, achieving optimal performance in 35 out of 53 cross-domain scenarios on four benchmark datasets.
Conclusion: DARSD effectively addresses domain shift by decomposing representations, offering a principled approach for UDA in time series.
Abstract: Domain shift poses a fundamental challenge in time series analysis, where models trained on a source domain often fail dramatically when applied to a target domain with a different yet similar distribution. While current unsupervised domain adaptation (UDA) methods attempt to align cross-domain feature distributions, they typically treat features as indivisible entities, ignoring the intrinsic compositions that govern domain adaptation. We introduce DARSD, a novel UDA framework with theoretical explainability that explicitly realizes UDA tasks from the perspective of representation space decomposition. Our core insight is that effective domain adaptation requires not just alignment, but principled disentanglement of transferable knowledge from mixed representations. DARSD consists of three synergistic components: (I) an adversarially learnable common invariant basis that projects original features into a domain-invariant subspace while preserving semantic content; (II) a prototypical pseudo-labeling mechanism that dynamically separates target features based on confidence, hindering error accumulation; (III) a hybrid contrastive optimization strategy that simultaneously enforces feature clustering and consistency while mitigating emerging distribution gaps. Comprehensive experiments conducted on four benchmark datasets (WISDM, HAR, HHAR, and MFD) demonstrate DARSD’s superiority against 12 UDA algorithms, achieving optimal performance in 35 out of 53 cross-domain scenarios.
[671] Model-Agnostic Gender Bias Control for Text-to-Image Generation via Sparse Autoencoder
Chao Wu, Zhenyi Wang, Kangxian Xie, Naresh Kumar Devulapally, Vishnu Suresh Lokhande, Mingchen Gao
Main category: cs.LG
TL;DR: SAE Debias is a lightweight, model-agnostic framework using a k-sparse autoencoder to mitigate gender bias in text-to-image models by identifying and suppressing biased directions in the latent space.
Details
Motivation: Addressing gender bias in T2I diffusion models, which often generate stereotypical associations between professions and gender.Method: Leverages a pre-trained k-sparse autoencoder to identify gender-relevant directions in the latent space and suppresses them during inference.
Result: Substantially reduces gender bias across multiple T2I models (e.g., Stable Diffusion variants) without compromising generation quality.
Conclusion: SAE Debias offers an interpretable, reusable, and model-agnostic solution for fairer T2I generation, advancing socially responsible AI.
Abstract: Text-to-image (T2I) diffusion models often exhibit gender bias, particularly by generating stereotypical associations between professions and gendered subjects. This paper presents SAE Debias, a lightweight and model-agnostic framework for mitigating such bias in T2I generation. Unlike prior approaches that rely on CLIP-based filtering or prompt engineering, which often require model-specific adjustments and offer limited control, SAE Debias operates directly within the feature space without retraining or architectural modifications. By leveraging a k-sparse autoencoder pre-trained on a gender bias dataset, the method identifies gender-relevant directions within the sparse latent space, capturing professional stereotypes. Specifically, a biased direction per profession is constructed from sparse latents and suppressed during inference to steer generations toward more gender-balanced outputs. Trained only once, the sparse autoencoder provides a reusable debiasing direction, offering effective control and interpretable insight into biased subspaces. Extensive evaluations across multiple T2I models, including Stable Diffusion 1.4, 1.5, 2.1, and SDXL, demonstrate that SAE Debias substantially reduces gender bias while preserving generation quality. To the best of our knowledge, this is the first work to apply sparse autoencoders for identifying and intervening in gender bias within T2I models. These findings contribute toward building socially responsible generative AI, providing an interpretable and model-agnostic tool to support fairness in text-to-image generation.
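The inference-time intervention reduces to projecting out a precomputed direction in the latent space. A hedged sketch (how the per-profession direction d is extracted from the sparse autoencoder's latents is the paper's contribution and is not shown here):

```python
import torch

def suppress_direction(z: torch.Tensor, d: torch.Tensor, strength: float = 1.0):
    """Remove the component of latent z along a precomputed biased direction d.
    z: (batch, dim) latents; d: (dim,) direction; strength=1.0 removes it fully."""
    d = d / d.norm()
    return z - strength * (z @ d).unsqueeze(-1) * d

z_debiased = suppress_direction(torch.randn(8, 512), torch.randn(512))
```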
[672] SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment
Yixin Song, Zhenliang Xue, Dongliang Wei, Feiyang Chen, Jianxiang Gao, Junchen Liu, Hangyu Liang, Guangshuo Qin, Chengrong Tian, Bo Wen, Longyu Zhao, Xinrui Zheng, Zeyu Mi, Haibo Chen
Main category: cs.LG
TL;DR: SmallThinker is a family of LLMs designed for local devices, overcoming computational, memory, and storage constraints with innovative architecture and co-designed inference, achieving high performance on consumer CPUs.
Details
Motivation: To enable LLM deployment on local devices with limited resources, challenging the reliance on GPU-powered cloud infrastructure.Method: Introduces a deployment-aware architecture with a two-level sparse structure (MoE + sparse feed-forward), pre-attention router for I/O efficiency, and NoPE-RoPE hybrid sparse attention for memory savings.
Result: SmallThinker models outperform larger LLMs, achieving high speeds (20+ tokens/s) on consumer CPUs with minimal memory usage (1GB and 8GB).
Conclusion: SmallThinker demonstrates that efficient, high-performance LLMs can run on local devices, reducing dependency on cloud infrastructure.
Abstract: While frontier large language models (LLMs) continue to push capability boundaries, their deployment remains confined to GPU-powered cloud infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs natively designed, not adapted, for the unique constraints of local devices: weak computational power, limited memory, and slow storage. Unlike traditional approaches that mainly compress existing models built for clouds, we architect SmallThinker from the ground up to thrive within these limitations. Our innovation lies in a deployment-aware architecture that transforms constraints into design principles. First, we introduce a two-level sparse structure combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks, drastically reducing computational demands without sacrificing model capacity. Second, to conquer the I/O bottleneck of slow storage, we design a pre-attention router that enables our co-designed inference engine to prefetch expert parameters from storage while computing attention, effectively hiding storage latency that would otherwise cripple on-device inference. Third, for memory efficiency, we utilize a NoPE-RoPE hybrid sparse attention mechanism to slash KV cache requirements. We release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which achieve state-of-the-art performance scores and even outperform larger LLMs. Remarkably, our co-designed system mostly eliminates the need for expensive GPU hardware: with Q4_0 quantization, both models exceed 20 tokens/s on ordinary consumer CPUs, while consuming only 1GB and 8GB of memory, respectively. SmallThinker is publicly available at hf.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and hf.co/PowerInfer/SmallThinker-21BA3B-Instruct.
[673] Personalized Treatment Effect Estimation from Unstructured Data
Henri Arno, Thomas Demeester
Main category: cs.LG
TL;DR: The paper introduces methods for estimating personalized treatment effects using unstructured data, addressing confounding and sampling biases with theoretically grounded estimators and a regression-based correction.
Details
Motivation: Existing methods rely on structured covariates, limiting their use with unstructured data like clinical notes or medical images, which have significant potential in fields like healthcare.Method: The authors propose a ‘plug-in’ method for neural representations of unstructured data and introduce two estimators to avoid confounding bias. They also add a regression-based correction for sampling bias in non-representative subsets.
Result: Experiments on benchmark datasets show the plug-in method performs well across settings, despite its simplicity.
Conclusion: The proposed methods effectively leverage unstructured data for causal inference, addressing key biases and demonstrating practical applicability.
Abstract: Existing methods for estimating personalized treatment effects typically rely on structured covariates, limiting their applicability to unstructured data. Yet, leveraging unstructured data for causal inference has considerable application potential, for instance in healthcare, where clinical notes or medical images are abundant. To this end, we first introduce an approximate ‘plug-in’ method trained directly on the neural representations of unstructured data. However, when these fail to capture all confounding information, the method may be subject to confounding bias. We therefore introduce two theoretically grounded estimators that leverage structured measurements of the confounders during training, but allow estimating personalized treatment effects purely from unstructured inputs, while avoiding confounding bias. When these structured measurements are only available for a non-representative subset of the data, these estimators may suffer from sampling bias. To address this, we further introduce a regression-based correction that accounts for the non-uniform sampling, assuming the sampling mechanism is known or can be well-estimated. Our experiments on two benchmark datasets show that the plug-in method, directly trainable on large unstructured datasets, achieves strong empirical performance across all settings, despite its simplicity.
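The plug-in baseline is simple enough to sketch: fit one outcome model per treatment arm on frozen neural representations and difference the predictions. Ridge is an assumed choice of outcome model, and phi stands for any pretrained encoder's output; as the abstract notes, this is only unbiased when the representations capture all confounding.

```python
import numpy as np
from sklearn.linear_model import Ridge

def plug_in_cate(phi: np.ndarray, t: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Plug-in CATE sketch: per-arm outcome models on frozen representations.
    phi: (n, d) neural representations; t: (n,) binary treatment; y: (n,) outcome."""
    m1 = Ridge().fit(phi[t == 1], y[t == 1])  # treated-arm outcome model
    m0 = Ridge().fit(phi[t == 0], y[t == 0])  # control-arm outcome model
    return m1.predict(phi) - m0.predict(phi)  # estimated individual effects
```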
[674] Modular Delta Merging with Orthogonal Constraints: A Scalable Framework for Continual and Reversible Model Composition
Haris Khan, Shumaila Asif, Sadia Asif
Main category: cs.LG
TL;DR: MDM-OC is a framework for scalable, interference-free, and reversible model merging in continual learning, outperforming prior methods.
Details
Motivation: Addressing issues like task interference, catastrophic forgetting, and lack of reversibility in model merging and continual learning.Method: Encodes task-specific models as orthogonal deltas from a shared base, merged via gradient-based optimization. Supports unmerging and stability techniques.
Result: Outperforms baselines in accuracy, backward transfer, and unmerge fidelity, while being memory-efficient.
Conclusion: MDM-OC provides a principled solution for modular and compliant AI system design.
Abstract: In real-world machine learning deployments, models must be continually updated, composed, and when required, selectively undone. However, existing approaches to model merging and continual learning often suffer from task interference, catastrophic forgetting, or lack of reversibility. We propose Modular Delta Merging with Orthogonal Constraints (MDM-OC), a novel framework that enables scalable, interference-free, and reversible composition of fine-tuned models. Each task-specific model is encoded as a delta from a shared base and projected into an orthogonal subspace to eliminate conflict. These projected deltas are then merged via gradient-based optimization to form a unified model that retains performance across tasks. Our approach supports continual integration of new models, structured unmerging for compliance with requirements such as the GDPR, and model stability via elastic weight consolidation and synthetic replay. Extensive experiments on vision and natural language processing benchmarks demonstrate that MDM-OC outperforms prior baselines in accuracy, backward transfer, and unmerge fidelity, while remaining memory-efficient and computationally tractable. This framework offers a principled solution for modular and compliant AI system design.
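A sketch of the core orthogonality idea, assuming a plain Gram-Schmidt pass over flattened task deltas (the paper merges via gradient-based optimization rather than a direct sum):

```python
import torch

def orthogonalize_deltas(deltas: list[torch.Tensor]) -> list[torch.Tensor]:
    """Gram-Schmidt over flattened task deltas so each new task's update is
    orthogonal to, and hence cleanly removable from, the ones already merged."""
    basis: list[torch.Tensor] = []
    out = []
    for d in deltas:
        v = d.flatten().clone()
        for b in basis:
            v -= (v @ b) * b          # remove overlap with earlier tasks
        basis.append(v / v.norm())    # store unit vector for later projections
        out.append(v.reshape(d.shape))
    return out

# merged = base + sum(orthogonalize_deltas([w_task1 - base, w_task2 - base]))
```

Orthogonality is what makes unmerging clean: subtracting one projected delta leaves the contributions of the remaining tasks untouched.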
[675] Compositional Function Networks: A High-Performance Alternative to Deep Neural Networks with Built-in Interpretability
Fang Li
Main category: cs.LG
TL;DR: CFNs introduce interpretable models using compositional mathematical functions, achieving competitive performance while maintaining transparency.
Details
Motivation: Address the black-box nature of DNNs in high-stakes domains requiring transparency.Method: Compose elementary mathematical functions with clear semantics, supporting diverse patterns (sequential, parallel, conditional) and enabling efficient gradient-based training.
Result: CFNs achieve 96.24% accuracy on CIFAR-10, outperforming interpretable models like Explainable Boosting Machines.
Conclusion: CFNs combine deep learning’s expressiveness with interpretability, ideal for performance-critical and accountable applications.
Abstract: Deep Neural Networks (DNNs) deliver impressive performance but their black-box nature limits deployment in high-stakes domains requiring transparency. We introduce Compositional Function Networks (CFNs), a novel framework that builds inherently interpretable models by composing elementary mathematical functions with clear semantics. Unlike existing interpretable approaches that are limited to simple additive structures, CFNs support diverse compositional patterns – sequential, parallel, and conditional – enabling complex feature interactions while maintaining transparency. A key innovation is that CFNs are fully differentiable, allowing efficient training through standard gradient descent. We demonstrate CFNs’ versatility across multiple domains, from symbolic regression to image classification with deep hierarchical networks. Our empirical evaluation shows CFNs achieve competitive performance against black-box models (96.24% accuracy on CIFAR-10) while outperforming state-of-the-art interpretable models like Explainable Boosting Machines. By combining the hierarchical expressiveness and efficient training of deep learning with the intrinsic interpretability of well-defined mathematical functions, CFNs offer a powerful framework for applications where both performance and accountability are paramount.
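A toy illustration of the compositional idea: elementary functions with nameable parameters, composed and trained end-to-end with a standard optimizer. The specific function library here is an assumption; the paper also supports parallel and conditional composition.

```python
import torch
import torch.nn as nn

class Sinusoid(nn.Module):
    """y = a * sin(w * x + p): an elementary function with readable parameters."""
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.randn(()))
        self.w = nn.Parameter(torch.randn(()))
        self.p = nn.Parameter(torch.randn(()))

    def forward(self, x):
        return self.a * torch.sin(self.w * x + self.p)

class Affine(nn.Module):
    """y = m * x + b: interpretable scale and offset."""
    def __init__(self):
        super().__init__()
        self.m = nn.Parameter(torch.randn(()))
        self.b = nn.Parameter(torch.randn(()))

    def forward(self, x):
        return self.m * x + self.b

# Sequential composition f(x) = affine(sinusoid(x)); every parameter stays
# human-readable, and the whole composition trains with ordinary gradients.
model = nn.Sequential(Sinusoid(), Affine())
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
```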
[676] Predicting Cognition from fMRI: A Comparative Study of Graph, Transformer, and Kernel Models Across Task and Rest Conditions
Jagruti Patel, Mikkel Schöttner, Thomas A. W. Bolton, Patric Hagmann
Main category: cs.LG
TL;DR: The study benchmarks machine learning and deep learning models for cognitive prediction using fMRI data, finding task-based fMRI outperforms resting-state fMRI. GNN combining SC and FC performed best, while TGNN showed promise for task-fMRI but struggled with RS data.
Details
Motivation: To understand neural mechanisms of cognition and improve precision medicine and early detection of neurological/psychiatric conditions.Method: Compared Kernel Ridge Regression (KRR), Graph Neural Networks (GNN), and Transformer-GNN (TGNN) using resting-state and task-based fMRI data from the Human Connectome Project.
Result: Task-based fMRI outperformed resting-state fMRI. GNN with SC and FC achieved the highest performance, though not significantly better than KRR with FC. TGNN performed well for task-fMRI but poorly for RS data.
Conclusion: Appropriate model architectures and feature representations are crucial for leveraging neuroimaging data. Multimodal DL models and Transformer-based approaches show promise for cognitive prediction.
Abstract: Predicting cognition from neuroimaging data in healthy individuals offers insights into the neural mechanisms underlying cognitive abilities, with potential applications in precision medicine and early detection of neurological and psychiatric conditions. This study systematically benchmarked classical machine learning (Kernel Ridge Regression (KRR)) and advanced deep learning (DL) models (Graph Neural Networks (GNN) and Transformer-GNN (TGNN)) for cognitive prediction using Resting-state (RS), Working Memory, and Language task fMRI data from the Human Connectome Project Young Adult dataset. Our results, based on R2 scores, Pearson correlation coefficient, and mean absolute error, revealed that task-based fMRI, eliciting neural responses directly tied to cognition, outperformed RS fMRI in predicting cognitive behavior. Among the methods compared, a GNN combining structural connectivity (SC) and functional connectivity (FC) consistently achieved the highest performance across all fMRI modalities; however, its advantage over KRR using FC alone was not statistically significant. The TGNN, designed to model temporal dynamics with SC as a prior, performed competitively with FC-based approaches for task-fMRI but struggled with RS data, where its performance aligned with the lower-performing GNN that directly used fMRI time-series data as node features. These findings emphasize the importance of selecting appropriate model architectures and feature representations to fully leverage the spatial and temporal richness of neuroimaging data. This study highlights the potential of multimodal graph-aware DL models to combine SC and FC for cognitive prediction, as well as the promise of Transformer-based approaches for capturing temporal dynamics. By providing a comprehensive comparison of models, this work serves as a guide for advancing brain-behavior modeling using fMRI, SC and DL.
[677] Behavior-Specific Filtering for Enhanced Pig Behavior Classification in Precision Livestock Farming
Zhen Zhang, Dong Sam Ha, Gota Morota, Sook Shin
Main category: cs.LG
TL;DR: A behavior-specific filtering method improves pig behavior classification accuracy in Precision Livestock Farming, outperforming traditional uniform methods.
Details
Motivation: To enhance behavior classification accuracy by addressing the limitations of uniform filtering methods.Method: Combines Wavelet Denoising with a Low Pass Filter, tailored to active and inactive pig behaviors.
Result: Achieved a peak accuracy of 94.73%, surpassing the 91.58% of traditional methods.
Conclusion: Behavior-specific filtering enhances monitoring, supporting better health management and farm efficiency.
Abstract: This study proposes a behavior-specific filtering method to improve behavior classification accuracy in Precision Livestock Farming. While traditional filtering methods, such as wavelet denoising, achieved an accuracy of 91.58%, they apply uniform processing to all behaviors. In contrast, the proposed behavior-specific filtering method combines Wavelet Denoising with a Low Pass Filter, tailored to active and inactive pig behaviors, and achieved a peak accuracy of 94.73%. These results highlight the effectiveness of behavior-specific filtering in enhancing animal behavior monitoring, supporting better health management and farm efficiency.
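A sketch of what such a behavior-specific pipeline could look like, assuming a standard soft-threshold wavelet denoiser plus a Butterworth low-pass stage; the cutoff, wavelet, and the pairing of filters to behaviors are illustrative choices, not the paper's exact settings.

```python
import numpy as np
import pywt
from scipy.signal import butter, filtfilt

def wavelet_denoise(x: np.ndarray, wavelet: str = "db4", level: int = 3) -> np.ndarray:
    """Soft-threshold wavelet denoising with the universal threshold."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745       # noise estimate
    thr = sigma * np.sqrt(2 * np.log(len(x)))
    coeffs[1:] = [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(x)]

def filter_by_behavior(x: np.ndarray, active: bool, fs: float = 50.0) -> np.ndarray:
    """Behavior-specific pipeline: wavelet denoising everywhere, plus an extra
    low-pass stage for inactive behaviors (2 Hz cutoff is illustrative)."""
    y = wavelet_denoise(x)
    if not active:
        b, a = butter(4, 2.0, btype="low", fs=fs)
        y = filtfilt(b, a, y)
    return y
```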
[678] On Using the Shapley Value for Anomaly Localization: A Statistical Investigation
Rick S. Blum, Franziska Freytag
Main category: cs.LG
TL;DR: Using a fixed term in Shapley value calculation simplifies anomaly localization in sensor data systems with no loss in accuracy for independent observations.
Details
Motivation: To simplify anomaly localization in sensor data systems while maintaining accuracy.Method: Employ a single fixed term in Shapley value calculation instead of the full Shapley value.
Result: Achieves lower complexity with the same error probability for independent observations.
Conclusion: The fixed-term approach is effective for independent cases, but its validity for dependent cases remains unproven.
Abstract: Recent publications have suggested using the Shapley value for anomaly localization for sensor data systems. Using a reasonable mathematical anomaly model that allows full control, our experiments indicate that using a single fixed term in the Shapley value calculation yields a lower-complexity anomaly localization test with the same probability of error as a test using the full Shapley value, in all cases tested. A proof demonstrates these conclusions must be true for all independent observation cases. For dependent observation cases, no proof is available.
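A sketch of the single-fixed-term idea, using feature i's marginal contribution with respect to the grand coalition and an illustrative squared-z-score value function (the paper's anomaly model is more general):

```python
import numpy as np

def anomaly_score(x: np.ndarray, mu: np.ndarray, var: np.ndarray, idx) -> float:
    """Value function v(S): summed squared z-scores over a feature subset
    (an illustrative choice of anomaly score)."""
    return float((((x[idx] - mu[idx]) ** 2) / var[idx]).sum())

def fixed_term(x: np.ndarray, mu: np.ndarray, var: np.ndarray, i: int) -> float:
    """Single fixed Shapley term: feature i's marginal contribution relative to
    the grand coalition, instead of averaging over all 2^(n-1) subsets."""
    all_idx = np.arange(len(x))
    rest = np.delete(all_idx, i)
    return anomaly_score(x, mu, var, all_idx) - anomaly_score(x, mu, var, rest)
```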
[679] Optimization Performance of Factorization Machine with Annealing under Limited Training Data
Mayumi Nakano, Yuya Seki, Shuta Kikuchi, Shu Tanaka
Main category: cs.LG
TL;DR: Proposes a sequential dataset construction method for FMA to improve optimization performance by focusing on recent data points.
Details
Motivation: Performance stagnation in FMA due to diluted impact of new data points as dataset grows.Method: Retains only the most recent data points to enhance surrogate model accuracy.
Result: Achieves lower-cost solutions with fewer function evaluations than conventional FMA.
Conclusion: The proposed method effectively addresses stagnation and improves FMA performance.
Abstract: Black-box (BB) optimization problems aim to identify an input that minimizes the output of a function (the BB function) whose input-output relationship is unknown. Factorization machine with annealing (FMA) is a promising approach to this task, employing a factorization machine (FM) as a surrogate model to iteratively guide the solution search via an Ising machine. Although FMA has demonstrated strong optimization performance across various applications, its performance often stagnates as the number of optimization iterations increases. One contributing factor to this stagnation is the growing number of data points in the dataset used to train FM. It is hypothesized that as more data points are accumulated, the contribution of newly added data points becomes diluted within the entire dataset, thereby reducing their impact on improving the prediction accuracy of FM. To address this issue, we propose a novel method for sequential dataset construction that retains at most a specified number of the most recently added data points. This strategy is designed to enhance the influence of newly added data points on the surrogate model. Numerical experiments demonstrate that the proposed FMA achieves lower-cost solutions with fewer BB function evaluations compared to the conventional FMA.
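The proposed dataset construction amounts to a sliding window over (configuration, cost) pairs; a deque with a maximum length captures it directly (the window size K is the tuning knob):

```python
from collections import deque

K = 200                      # retain at most K most recent data points
dataset: deque = deque(maxlen=K)

def record(sample, cost):
    """Append a new (configuration, cost) pair for FM retraining; once the
    window is full, the oldest point is dropped automatically."""
    dataset.append((sample, cost))
```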
[680] When Brain Foundation Model Meets Cauchy-Schwarz Divergence: A New Framework for Cross-Subject Motor Imagery Decoding
Jinzhou Wu, Baoping Tang, Qikang Li, Yi Wang, Cheng Li, Shujian Yu
Main category: cs.LG
TL;DR: A novel MSDA framework using a pretrained Brain Foundation Model (BFM) for dynamic source selection and CS/CCS divergences for feature and decision-level alignment improves MI-EEG decoding, outperforming existing methods.
Details
Motivation: Address challenges in MI-EEG decoding like inter-subject variability and limited labeled data, avoiding negative transfer and high computational costs.Method: Leverages BFM for informed source selection and uses CS/CCS divergences for joint feature and decision-level alignment.
Result: Outperforms state-of-the-art baselines on benchmark datasets, with BFM-guided selection reducing training time without performance loss.
Conclusion: The proposed framework enhances MI-EEG decoding by addressing key limitations, offering scalability and efficiency.
Abstract: Decoding motor imagery (MI) electroencephalogram (EEG) signals, a key non-invasive brain-computer interface (BCI) paradigm for controlling external systems, has been significantly advanced by deep learning. However, MI-EEG decoding remains challenging due to substantial inter-subject variability and limited labeled target data, which necessitate costly calibration for new users. Many existing multi-source domain adaptation (MSDA) methods indiscriminately incorporate all available source domains, disregarding the large inter-subject differences in EEG signals, which leads to negative transfer and excessive computational costs. Moreover, while many approaches focus on feature distribution alignment, they often neglect the explicit dependence between features and decision-level outputs, limiting their ability to preserve discriminative structures. To address these gaps, we propose a novel MSDA framework that leverages a pretrained large Brain Foundation Model (BFM) for dynamic and informed source subject selection, ensuring only relevant sources contribute to adaptation. Furthermore, we employ Cauchy-Schwarz (CS) and Conditional CS (CCS) divergences to jointly perform feature-level and decision-level alignment, enhancing domain invariance while maintaining class discriminability. Extensive evaluations on two benchmark MI-EEG datasets demonstrate that our framework outperforms a broad range of state-of-the-art baselines. Additional experiments with a large source pool validate the scalability and efficiency of BFM-guided selection, which significantly reduces training time without sacrificing performance.
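The feature-level alignment rests on the Cauchy-Schwarz divergence, which admits a simple kernel estimator. A sketch with an assumed Gaussian bandwidth (the conditional variant and the BFM-guided source selection are not shown):

```python
import torch

def cs_divergence(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Kernel estimator of the Cauchy-Schwarz divergence
    D_CS(p, q) = -log( <p,q>^2 / (<p,p> <q,q>) ) with a Gaussian kernel.
    x: (n, d) source features; y: (m, d) target features; sigma is assumed."""
    def gram(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma**2))
    pq = gram(x, y).mean()   # cross term <p, q>
    pp = gram(x, x).mean()   # norm term <p, p>
    qq = gram(y, y).mean()   # norm term <q, q>
    return -torch.log(pq.pow(2) / (pp * qq))
```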
[681] Transformers as Unrolled Inference in Probabilistic Laplacian Eigenmaps: An Interpretation and Potential Improvements
Aditya Ravuri, Neil D. Lawrence
Main category: cs.LG
TL;DR: The paper interprets transformers probabilistically using Laplacian Eigenmaps, showing they perform linear dimensionality reduction at initialization and that a graph Laplacian term arises in transformer blocks. Modifying the attention matrix improves model performance.
Details
Motivation: To provide a probabilistic interpretation of transformers and explore their behavior through the lens of Laplacian Eigenmaps, aiming to enhance understanding and performance.Method: The authors derive transformers as unrolled inference steps under a probabilistic Laplacian Eigenmaps model, analyzing initialization behavior and the role of the graph Laplacian term in transformer blocks. They modify the attention matrix by subtracting the identity to test performance.
Result: The study finds that transformers initially perform linear dimensionality reduction and that replacing the attention matrix with a graph Laplacian term (by subtracting the identity) improves validation performance in language and vision tasks.
Conclusion: The probabilistic interpretation and modification of the attention matrix offer insights into transformer behavior and demonstrate practical performance improvements.
Abstract: We propose a probabilistic interpretation of transformers as unrolled inference steps assuming a probabilistic Laplacian Eigenmaps model from the ProbDR framework. Our derivation shows that at initialisation, transformers perform “linear” dimensionality reduction. We also show that within the transformer block, a graph Laplacian term arises from our arguments, rather than an attention matrix (which we interpret as an adjacency matrix). We demonstrate that simply subtracting the identity from the attention matrix (and thereby taking a graph diffusion step) improves validation performance on a language model and a simple vision transformer.
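The paper's practical takeaway fits in a few lines: subtract the identity from the row-stochastic attention matrix before applying it to the values, turning the update into a graph-diffusion step. A sketch with single-head, unprojected tensors for brevity:

```python
import torch

def graph_diffusion_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
    """Attention with the identity subtracted from the attention matrix,
    the modification the abstract reports as improving validation performance.
    q, k, v: (batch, seq, dim)."""
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    laplacian_like = attn - torch.eye(attn.shape[-1], device=attn.device)
    return laplacian_like @ v  # a graph-diffusion step rather than pure averaging
```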
[682] Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning
Zedong Wang, Siyuan Li, Dan Xu
Main category: cs.LG
TL;DR: Rep-MTL introduces representation-level task saliency to enhance multi-task learning by balancing task-specific optimization and shared representation learning, achieving competitive performance.
Details
Motivation: Existing multi-task optimization techniques focus on conflict resolution but neglect inter-task complementarity. Rep-MTL leverages shared representation space to address this gap.Method: Rep-MTL uses representation-level task saliency, entropy-based penalization, and sample-wise cross-task alignment to mitigate negative transfer and promote complementary sharing.
Result: Experiments on four benchmarks show Rep-MTL achieves competitive gains with efficiency, balancing task-specific learning and cross-task sharing.
Conclusion: Rep-MTL effectively enhances multi-task learning by leveraging representation-level interactions, outperforming traditional conflict-focused methods.
Abstract: Despite the promise of Multi-Task Learning in leveraging complementary knowledge across tasks, existing multi-task optimization (MTO) techniques remain fixated on resolving conflicts via optimizer-centric loss scaling and gradient manipulation strategies, yet fail to deliver consistent gains. In this paper, we argue that the shared representation space, where task interactions naturally occur, offers rich information and potential for operations complementary to existing optimizers, especially for facilitating the inter-task complementarity, which is rarely explored in MTO. This intuition leads to Rep-MTL, which exploits the representation-level task saliency to quantify interactions between task-specific optimization and shared representation learning. By steering these saliencies through entropy-based penalization and sample-wise cross-task alignment, Rep-MTL aims to mitigate negative transfer by maintaining the effective training of individual tasks instead of pure conflict-solving, while explicitly promoting complementary information sharing. Experiments are conducted on four challenging MTL benchmarks covering both task-shift and domain-shift scenarios. The results show that Rep-MTL, even paired with the basic equal weighting policy, achieves competitive performance gains with favorable efficiency. Beyond standard performance metrics, Power Law exponent analysis demonstrates Rep-MTL’s efficacy in balancing task-specific learning and cross-task sharing. The project page is available at HERE.
[683] Flow Matching Policy Gradients
David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, Angjoo Kanazawa
Main category: cs.LG
TL;DR: FPO integrates flow matching into policy gradient RL, avoiding exact likelihood computation while maintaining generative capabilities, and outperforms Gaussian policies in multimodal tasks.
Details
Motivation: To leverage flow-based models' strengths in high-dimensional spaces for reinforcement learning without being tied to specific sampling methods.Method: FPO uses advantage-weighted ratio from flow matching loss within PPO-clip, making it compatible with various diffusion or flow integration methods.
Result: FPO trains diffusion-style policies effectively in continuous control tasks, capturing multimodal distributions and outperforming Gaussian policies.
Conclusion: FPO is a versatile and effective approach for integrating flow-based models into RL, enhancing performance in complex tasks.
Abstract: Flow-based generative models, including diffusion models, excel at modeling continuous distributions in high-dimensional spaces. In this work, we introduce Flow Policy Optimization (FPO), a simple on-policy reinforcement learning algorithm that brings flow matching into the policy gradient framework. FPO casts policy optimization as maximizing an advantage-weighted ratio computed from the conditional flow matching loss, in a manner compatible with the popular PPO-clip framework. It sidesteps the need for exact likelihood computation while preserving the generative capabilities of flow-based models. Unlike prior approaches for diffusion-based reinforcement learning that bind training to a specific sampling method, FPO is agnostic to the choice of diffusion or flow integration at both training and inference time. We show that FPO can train diffusion-style policies from scratch in a variety of continuous control tasks. We find that flow-based models can capture multimodal action distributions and achieve higher performance than Gaussian policies, particularly in under-conditioned settings.
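A schematic reading of the objective, under the assumption that the policy ratio is formed from the change in conditional flow matching (CFM) loss and then clipped PPO-style; this is an interpretation of the abstract, not the paper's code:

```python
import torch

def fpo_loss(cfm_loss_new: torch.Tensor, cfm_loss_old: torch.Tensor,
             advantage: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Schematic FPO surrogate: an advantage-weighted ratio derived from CFM
    losses (lower loss acts as higher likelihood), clipped as in PPO."""
    ratio = torch.exp(cfm_loss_old - cfm_loss_new)        # surrogate likelihood ratio
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()          # maximize clipped objective
```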
[684] LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models
Yupeng Su, Ziyi Guan, Xiaoqun Liu, Tianlai Jin, Dongkuan Wu, Zhengfei Chen, Graziano Chesi, Ngai Wong, Hao Yu
Main category: cs.LG
TL;DR: LLM-Barber is a one-shot pruning framework for large language models (LLMs) that rebuilds sparsity masks without retraining, using block-aware error optimization and a novel pruning metric (weights * gradients). It achieves efficient pruning with state-of-the-art results.
Details
Motivation: Existing post-training pruning methods overlook dynamic weight importance changes, leading to performance degradation. LLM-Barber addresses this by optimizing sparsity masks globally.Method: LLM-Barber uses block-aware error optimization across Self-Attention and MLP blocks and introduces weights * gradients as a pruning metric. It prunes models in one shot without retraining.
Result: LLM-Barber prunes LLaMA and OPT models (7B to 13B) in 30 minutes on a single A100 GPU, achieving top perplexity and zero-shot performance on language benchmarks.
Conclusion: LLM-Barber offers an efficient, accurate, and computationally lightweight solution for pruning LLMs, outperforming existing methods.
Abstract: Large language models (LLMs) have seen substantial growth, necessitating efficient model pruning techniques. Existing post-training pruning methods primarily measure weight importance in converged dense models, often overlooking changes in weight significance during the pruning process, leading to performance degradation. To address this issue, we present LLM-Barber (Block-Aware Rebuilder for Sparsity Mask in One-Shot), a novel one-shot pruning framework that rebuilds the sparsity mask of pruned models without any retraining or weight reconstruction. LLM-Barber incorporates block-aware error optimization across Self-Attention and MLP blocks, facilitating global performance optimization. We are the first to employ the product of weights and gradients as a pruning metric in the context of LLM post-training pruning. This enables accurate identification of weight importance in massive models and significantly reduces computational complexity compared to methods using second-order information. Our experiments show that LLM-Barber efficiently prunes models from LLaMA and OPT families (7B to 13B) on a single A100 GPU in just 30 minutes, achieving state-of-the-art results in both perplexity and zero-shot performance across various language benchmarks. Code is available at https://github.com/YupengSu/LLM-Barber.
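The pruning metric itself is a one-liner; a per-matrix sketch (the paper additionally optimizes reconstruction error block-by-block across Self-Attention and MLP blocks):

```python
import torch

def rebuild_mask(weight: torch.Tensor, grad: torch.Tensor, sparsity: float = 0.5):
    """One-shot mask rebuilding with the |weight * gradient| metric: keep the
    top-(1 - sparsity) fraction of entries, zero the rest (per-matrix variant)."""
    score = (weight * grad).abs()
    k = int(score.numel() * sparsity)          # number of entries to prune
    threshold = score.flatten().kthvalue(k).values
    return score > threshold                   # apply as weight.mul_(mask)
```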
[685] Preference learning made easy: Everything should be understood through win rate
Lily H. Zhang, Rajesh Ranganath
Main category: cs.LG
TL;DR: The paper introduces a framework for understanding preference learning, emphasizing win rate as the key metric. It categorizes methods as win rate optimization (WRO) or non-WRO, highlighting theoretical benefits of WRO and practical challenges.
Details
Motivation: To address the lack of conceptual maturity in preference learning compared to other tasks like classification, by providing a principled framework based on pairwise preference data.Method: Analyzes preference learning methods as WRO or non-WRO, presents new WRO instances, and evaluates theoretical and practical aspects.
Result: WRO methods have theoretical advantages but face optimization challenges, while non-WRO methods lack these benefits but can be mitigated.
Conclusion: Recommends aligning non-WRO methods with WRO or improving WRO optimization for better preference learning outcomes.
Abstract: Preference learning, or the task of aligning generative models to preference comparison data, has yet to reach the conceptual maturity of classification, density estimation, etc. To close this gap, this work presents a framework to understand preference learning starting from the sampling distribution of pairwise preference data. First, we prove that the only evaluation of a generative model that respects both preferences and prevalences in the data distribution is a form of win rate, justifying win rate as the focal point to understand preference learning. We then analyze preference learning methods as win rate optimization (WRO) or non-WRO. We present novel instances of WRO beyond existing examples (RLHF, NLHF) and identify two key theoretical benefits of all such methods. We prove that common non-WRO methods like DPO and SFT on preferred samples lack these properties and suggest ways to mitigate such theoretical limitations. We also show that WRO underperforms in practice due to optimization difficulties and that optimization success predicts performance better than choices which affect the objective’s solution. Our analysis highlights best practices for existing methods and provides recommendations for future research, guided by the principle that one should either align non-WRO methods more closely with WRO or improve the optimization of WRO objectives.
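One common formalization of the win rate that the framework centers on, with notation assumed here rather than taken from the paper:

```latex
\mathrm{WR}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D}}\,
    \mathbb{E}_{y \sim p_\theta(\cdot \mid x),\; y' \sim p_{\mathrm{ref}}(\cdot \mid x)}
    \left[ \mathbb{P}\big( y \succ y' \mid x \big) \right]
```

where $y \succ y'$ denotes that $y$ is preferred over $y'$ under the data distribution; the paper classifies methods such as RLHF and NLHF as optimizing this quantity (WRO), while DPO and SFT on preferred samples do not.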
[686] Geometric structure of shallow neural networks and constructive ${\mathcal L}^2$ cost minimization
Thomas Chen, Patrícia Muñoz Ewald
Main category: cs.LG
TL;DR: The paper explores cost minimization in underparametrized shallow ReLU networks by constructing upper bounds without gradient descent, focusing on geometric structures of minimizers. It proves an $O(\delta_P)$ upper bound for the cost function and identifies exact local minima in special cases.
Details
Motivation: To address cost minimization in underparametrized ReLU networks by leveraging data structure and avoiding gradient descent, while analyzing geometric properties of minimizers.Method: Constructs upper bounds for the cost function using the structure of classification data, focusing on an $\mathcal{L}^2$ cost function and large training samples. Analyzes geometric properties and derives explicit minimizers.
Result: Proves an upper bound of order $O(\delta_P)$ for the cost function and identifies exact degenerate local minima in the case $M=Q$, with a relative error $O(\delta_P^2)$. The trained network metrizes a $Q$-dimensional subspace.
Conclusion: The study provides insights into cost minimization and geometric structures in underparametrized ReLU networks, with constructive training methods and bounds on minimizers.
Abstract: In this paper, we approach the problem of cost (loss) minimization in underparametrized shallow ReLU networks through the explicit construction of upper bounds which appeal to the structure of classification data, without use of gradient descent. A key focus is on elucidating the geometric structure of approximate and precise minimizers. We consider an $\mathcal{L}^2$ cost function, input space $\mathbb{R}^M$, output space $\mathbb{R}^Q$ with $Q\leq M$, and training input sample size that can be arbitrarily large. We prove an upper bound on the minimum of the cost function of order $O(\delta_P)$ where $\delta_P$ measures the signal-to-noise ratio of training data. In the special case $M=Q$, we explicitly determine an exact degenerate local minimum of the cost function, and show that the sharp value differs from the upper bound obtained for $Q\leq M$ by a relative error $O(\delta_P^2)$. The proof of the upper bound yields a constructively trained network; we show that it metrizes a particular $Q$-dimensional subspace in the input space $\mathbb{R}^M$. We comment on the characterization of the global minimum of the cost function in the given context.
[687] Deep Unsupervised Domain Adaptation for Time Series Classification: a Benchmark
Hassan Ismail Fawaz, Ganesh Del Grosso, Tanguy Kerdoncuff, Aurelie Boisbunon, Illyyne Saffar
Main category: cs.LG
TL;DR: The paper introduces a benchmark for evaluating Unsupervised Domain Adaptation (UDA) techniques in time series classification, addressing a gap in research for this data type.
Details
Motivation: UDA is underexplored for time series data despite its real-world applications in fields like medicine and manufacturing.Method: The authors propose a comprehensive benchmark with seven datasets, using deep learning methods (e.g., Inception) to evaluate UDA techniques.
Result: The benchmark provides standardized assessments, revealing strengths and limitations of UDA approaches for time series.
Conclusion: This work advances UDA research for time series, offering a practical resource for researchers and practitioners, with code available for implementation.
Abstract: Unsupervised Domain Adaptation (UDA) aims to harness labeled source data to train models for unlabeled target data. Despite extensive research in domains like computer vision and natural language processing, UDA remains underexplored for time series data, which has widespread real-world applications ranging from medicine and manufacturing to earth observation and human activity recognition. Our paper addresses this gap by introducing a comprehensive benchmark for evaluating UDA techniques for time series classification, with a focus on deep learning methods. We provide seven new benchmark datasets covering various domain shifts and temporal dynamics, facilitating fair and standardized UDA method assessments with state-of-the-art neural network backbones (e.g. Inception) for time series data. This benchmark offers insights into the strengths and limitations of the evaluated approaches while preserving the unsupervised nature of domain adaptation, making it directly applicable to practical problems. Our paper serves as a vital resource for researchers and practitioners, advancing domain adaptation solutions for time series data and fostering innovation in this critical field. The implementation code of this benchmark is available at https://github.com/EricssonResearch/UDA-4-TSC.
[688] LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning
Gabriel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong
Main category: cs.LG
TL;DR: The paper addresses the vulnerability of aligned LLMs to safety degradation during fine-tuning, proposing LoX, a training-free method to enhance safety robustness by extrapolating low-rank safety subspaces.
Details
Motivation: Safety concerns in LLMs persist despite alignment efforts, as fine-tuning can undermine protections. The study aims to mitigate this vulnerability.Method: Proposes Low-Rank Extrapolation (LoX), a training-free technique to extrapolate safety-critical subspaces in LLM parameters.
Result: LoX reduces attack success rates by 11% to 54% against benign/malicious fine-tuning, preserving task adaptability.
Conclusion: LoX effectively enhances safety robustness by moving parameters to a flatter, less perturbation-sensitive zone.
Abstract: Large Language Models (LLMs) have become indispensable in real-world applications. However, their widespread adoption raises significant safety concerns, particularly in responding to socially harmful questions. Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning - even when the additional training data appears benign. In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), to enhance safety robustness by extrapolating the safety subspace of an aligned LLM. Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness against both benign and malicious fine-tuning attacks while preserving the model’s adaptability to new tasks. For instance, LoX leads to 11% to 54% absolute reductions in attack success rates (ASR) facing benign or malicious fine-tuning attacks. By investigating the ASR landscape of parameters, we attribute the success of LoX to that the extrapolation moves LLM parameters to a flatter zone, thereby less sensitive to perturbations. The code is available at github.com/VITA-Group/LoX.
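A sketch of the extrapolation step as the abstract describes it: identify a low-rank safety subspace from the alignment update and push further along it. The use of an SVD of the weight delta, plus the rank and alpha values, are assumptions for illustration:

```python
import torch

def lox_extrapolate(w_base: torch.Tensor, w_aligned: torch.Tensor,
                    rank: int = 8, alpha: float = 0.5) -> torch.Tensor:
    """Low-rank extrapolation sketch: treat the top singular directions of the
    alignment update as the safety-critical subspace and extrapolate along them."""
    delta = w_aligned - w_base                       # alignment update
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    low_rank = u[:, :rank] @ torch.diag(s[:rank]) @ vh[:rank]
    return w_aligned + alpha * low_rank              # push further along safety subspace
```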
[689] The Effect of Data Poisoning on Counterfactual Explanations
André Artelt, Shubham Sharma, Freddy Lecué, Barbara Hammer
Main category: cs.LG
TL;DR: The paper examines the vulnerability of counterfactual explanations to data poisoning, demonstrating their susceptibility and the failure of current defenses to detect such manipulations.
Details
Motivation: Counterfactual explanations are popular for interpreting black-box systems but are prone to manipulation, raising concerns about their reliability in critical applications.Method: The study introduces and investigates data poisoning in counterfactual explanations, analyzing its impact on recourse cost at local, sub-group, and global levels. A general poisoning mechanism is derived and tested in real-world scenarios like water distribution networks.
Result: Empirical evaluations show state-of-the-art counterfactual methods are vulnerable to data poisoning, and existing defenses fail to detect poisoned samples.
Conclusion: The findings highlight significant risks in using counterfactual explanations without robust defenses against data poisoning, especially in critical applications.
Abstract: Counterfactual explanations are a widely used approach for examining the predictions of black-box systems. They can offer the opportunity for computational recourse by suggesting actionable changes on how to alter the input to obtain a different (i.e., more favorable) system output. However, recent studies have pointed out their susceptibility to various forms of manipulation. This work studies the vulnerability of counterfactual explanations to data poisoning. We formally introduce and characterize data poisoning in the context of counterfactual explanations, where the attacker aims to increase the cost of recourse on three different levels: locally for a single instance, for a sub-group of instances, or globally for all instances. From this characterization, we derive and investigate a general data poisoning mechanism. We demonstrate the impact of such data poisoning in the critical real-world application of explaining event detections in water distribution networks. Additionally, we conduct an extensive empirical evaluation, demonstrating that state-of-the-art counterfactual generation methods and toolboxes are vulnerable to such data poisoning. Furthermore, we find that existing defense methods fail to detect those poisonous samples.
[690] Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions
Eitan Anzenberg, Arunava Samajpati, Sivasankaran Chandrasekar, Varun Kacholia
Main category: cs.LG
TL;DR: The paper benchmarks general-purpose LLMs against a proprietary hiring model (Match Score), showing Match Score outperforms in accuracy and fairness, emphasizing the need for domain-specific models and bias safeguards in hiring.
Details
Motivation: To address concerns about accuracy and algorithmic bias in using LLMs for hiring, comparing them with a specialized model to highlight the importance of fairness and domain-specific solutions.Method: Benchmarked state-of-the-art LLMs and Match Score on predictive accuracy (ROC AUC, Precision-Recall AUC, F1-score) and fairness (impact ratio across gender, race, and intersectional subgroups) using 10,000 real-world candidate-job pairs.
Result: Match Score outperformed LLMs in accuracy (ROC AUC 0.85 vs 0.77) and fairness (minimum race-wise impact ratio 0.957 vs 0.809 or lower for LLMs), demonstrating that domain-specific models can achieve both accuracy and fairness.
Conclusion: Domain-specific models with bias auditing are crucial for high-stakes hiring tasks, as off-the-shelf LLMs may propagate biases; well-designed algorithms can balance accuracy and fairness.
Abstract: The use of large language models (LLMs) in hiring promises to streamline candidate screening, but it also raises serious concerns regarding accuracy and algorithmic bias where sufficient safeguards are not in place. In this work, we benchmark several state-of-the-art foundational LLMs, including models from OpenAI, Anthropic, Google, Meta, and Deepseek, and compare them with our proprietary domain-specific hiring model (Match Score) for job candidate matching. We evaluate each model’s predictive accuracy (ROC AUC, Precision-Recall AUC, F1-score) and fairness (impact ratio of cut-off analysis across declared gender, race, and intersectional subgroups). Our experiments on a dataset of roughly 10,000 real-world recent candidate-job pairs show that Match Score outperforms the general-purpose LLMs on accuracy (ROC AUC 0.85 vs 0.77) and achieves significantly more equitable outcomes across demographic groups. Notably, Match Score attains a minimum race-wise impact ratio of 0.957 (near-parity), versus 0.809 or lower for the best LLMs (0.906 vs 0.773, respectively, on intersectional subgroups). We discuss why pretraining biases may cause LLMs with insufficient safeguards to propagate societal biases in hiring scenarios, whereas a bespoke supervised model can more effectively mitigate these biases. Our findings highlight the importance of domain-specific modeling and bias auditing when deploying AI in high-stakes domains such as hiring, and caution against relying on off-the-shelf LLMs for such tasks without extensive fairness safeguards. Furthermore, we show with empirical evidence that there shouldn’t be a dichotomy between choosing accuracy and fairness in hiring: a well-designed algorithm can achieve both accuracy in hiring and fairness in outcomes.
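The fairness metric reported above, the impact ratio at a score cutoff, is easy to pin down; a minimal sketch:

```python
import numpy as np

def impact_ratio(scores: np.ndarray, groups: np.ndarray, cutoff: float) -> float:
    """Min-over-max selection-rate ratio at a score cutoff: 1.0 is parity, and
    values below ~0.8 commonly flag adverse impact under the four-fifths rule."""
    rates = [np.mean(scores[groups == g] >= cutoff) for g in np.unique(groups)]
    return min(rates) / max(rates)
```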
[691] Critiques of World Models
Eric Xing, Mingkai Deng, Jinyu Hou, Zhiting Hu
Main category: cs.LG
TL;DR: The paper critiques existing world model theories, proposes a new architecture for a general-purpose world model, and envisions a PAN AGI system.
Details
Motivation: The rising need for virtual agents with artificial intelligence drives the exploration of world models, focusing on simulating actionable possibilities for reasoning and acting.
Method: Critiques existing world model theories, proposes a hierarchical, multi-level, mixed continuous/discrete representation, and a generative self-supervision framework.
Result: A new architecture for a general-purpose world model is introduced, aiming to enable a PAN AGI system.
Conclusion: The proposed world model architecture advances AGI development by emphasizing actionable simulation and hierarchical representation.
Abstract: World Model, the supposed algorithmic surrogate of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years because of the rising need to develop virtual agents with artificial (general) intelligence. There has been much debate on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of “hypothetical thinking” in psychology literature, we offer critiques of several schools of thought on world modeling, and argue that the primary goal of a world model is to simulate all actionable possibilities of the real world for purposeful reasoning and acting. Building on the critiques, we propose a new architecture for a general-purpose world model, based on hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervision learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.
[692] GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding
Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
Main category: cs.LG
TL;DR: GUI-G² introduces Gaussian rewards for GUI grounding, outperforming UI-TARS-72B by 24.7% on ScreenSpot-Pro.
Details
Motivation: Current reinforcement learning uses sparse binary rewards, ignoring the continuous nature of spatial interactions. Human clicking behavior, forming Gaussian distributions, inspired the solution.
Method: GUI-G² models GUI elements as Gaussian distributions, using Gaussian point rewards for localization and coverage rewards for spatial alignment. Adaptive variance handles element scales.
Result: GUI-G² outperforms UI-TARS-72B by 24.7% on ScreenSpot-Pro, showing robustness to interface variations and better generalization.
Conclusion: GUI-G² transforms GUI grounding into dense continuous optimization, setting a new paradigm for spatial reasoning in GUI tasks.
Abstract: Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$ substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.
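To make the two reward terms concrete, here is a minimal numpy sketch of what a Gaussian point reward and a coverage-style reward could look like under the abstract’s description; the exact functional forms, the variance scale k, and the Monte-Carlo overlap estimate are assumptions, not the authors’ implementation:

```python
import numpy as np

def point_reward(click, center, size, k=0.5):
    """Exponentially decaying reward centered on the element centroid,
    with variance scaled by element size (adaptive-variance assumption)."""
    (x, y), (cx, cy), (w, h) = click, center, size
    sx, sy = k * w, k * h
    return np.exp(-((x - cx) ** 2 / (2 * sx**2) + (y - cy) ** 2 / (2 * sy**2)))

def coverage_reward(center, size, box, n=10_000, k=0.5, seed=0):
    """Monte-Carlo estimate of the predicted Gaussian's probability mass
    falling inside the target box (one way to measure spatial overlap)."""
    rng = np.random.default_rng(seed)
    (cx, cy), (w, h) = center, size
    pts = rng.normal([cx, cy], [k * w, k * h], size=(n, 2))
    x0, y0, x1, y1 = box
    inside = (pts[:, 0] >= x0) & (pts[:, 0] <= x1) \
           & (pts[:, 1] >= y0) & (pts[:, 1] <= y1)
    return inside.mean()

# Hypothetical 200x80 button centered at (400, 300):
print(point_reward((390, 310), (400, 300), (200, 80)))
print(coverage_reward((400, 300), (200, 80), (300, 260, 500, 340)))
```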
[693] Group Sequence Policy Optimization
Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin
Main category: cs.LG
TL;DR: GSPO is a reinforcement learning algorithm for training large language models, outperforming GRPO in efficiency and stability, and improving Qwen3 models.
Details
Motivation: To address inefficiencies and instability in token-level importance ratio methods used by previous algorithms like GRPO.
Method: GSPO uses sequence-level likelihood for importance ratios, performing sequence-level clipping, rewarding, and optimization.
Result: GSPO achieves superior training efficiency, stabilizes MoE RL training, and simplifies RL infrastructure design.
Conclusion: GSPO is a stable, efficient, and performant algorithm, contributing to significant improvements in Qwen3 models.
Abstract: This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
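A hedged sketch of what a sequence-level clipped objective looks like, assuming a length-normalized sequence likelihood ratio; the abstract only says the ratio is based on sequence likelihood, so the normalization and constants here are illustrative:

```python
import torch

def gspo_style_loss(logp_new, logp_old, advantages, eps=0.2):
    """Sequence-level PPO-style loss.
    logp_new/logp_old: (batch, seq_len) per-token log-probs of each response.
    advantages:        (batch,) one advantage per whole sequence."""
    seq_len = logp_new.shape[1]
    # One ratio per sequence; length normalization is an assumption here.
    ratio = torch.exp((logp_new - logp_old).sum(dim=1) / seq_len)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    # Clipping happens at the sequence level, not per token.
    return -torch.minimum(ratio * advantages, clipped * advantages).mean()

logp_new = torch.randn(4, 16) * 0.01 - 2.0
logp_old = torch.randn(4, 16) * 0.01 - 2.0
adv = torch.tensor([0.5, -0.2, 1.0, 0.1])  # e.g., group-normalized rewards
print(gspo_style_loss(logp_new, logp_old, adv))
```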
[694] Training Neural Networks for Modularity aids Interpretability
Satvik Golechha, Dylan Cope, Nandi Schoots
Main category: cs.LG
TL;DR: A method to enhance neural network interpretability by training models to be modular using an “enmeshment loss” function, improving clusterability and interpretability.
Details
Motivation: Pretrained models are often unclusterable, making them hard to interpret. The goal is to improve interpretability by making models more modular.
Method: Train models with an “enmeshment loss” function to encourage non-interacting, modular clusters.
Result: The method successfully forms clusters that learn distinct, disjoint, and smaller circuits for CIFAR-10 labels.
Conclusion: This approach offers a promising way to make neural networks more interpretable by improving modularity and clusterability.
Abstract: An approach to improve network interpretability is via clusterability, i.e., splitting a model into disjoint clusters that can be studied independently. We find pretrained models to be highly unclusterable and thus train models to be more modular using an “enmeshment loss” function that encourages the formation of non-interacting clusters. Using automated interpretability measures, we show that our method finds clusters that learn different, disjoint, and smaller circuits for CIFAR-10 labels. Our approach provides a promising direction for making neural networks easier to interpret.
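The abstract does not spell out the loss, but the stated goal of non-interacting clusters suggests a penalty on weight mass that crosses cluster boundaries. One hypothetical form of such a penalty, with the cluster assignment taken as given:

```python
import torch

def enmeshment_penalty(weight: torch.Tensor, in_clusters, out_clusters):
    """Penalize the magnitude of weights connecting neurons assigned to
    different clusters (hypothetical form; the paper's loss may differ).
    weight: (out_dim, in_dim); *_clusters: integer cluster id per neuron."""
    cross = out_clusters.unsqueeze(1) != in_clusters.unsqueeze(0)  # (out, in)
    return weight.abs()[cross].sum()

w = torch.randn(8, 6)
in_c = torch.tensor([0, 0, 0, 1, 1, 1])
out_c = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
print(enmeshment_penalty(w, in_c, out_c))  # add to the task loss with a coefficient
```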
[695] Geometric Representation Condition Improves Equivariant Molecule Generation
Zian Li, Cai Zhou, Xiyuan Wang, Xingang Peng, Muhan Zhang
Main category: cs.LG
TL;DR: GeoRCG improves molecular generation by using a two-stage process with geometric representations, achieving better quality and efficiency.
Details
Motivation: Existing molecular generative models struggle with high-quality and conditional generation.
Method: Two-stage generation: first geometric representation, then molecule. Uses EDM and SemlaFlow as base generators.
Result: 50% performance improvement in conditional tasks; reduced diffusion steps (100 vs. 1,000) without quality loss.
Conclusion: GeoRCG’s geometric conditioning enhances molecular generation quality and efficiency.
Abstract: Recent advances in molecular generative models have demonstrated great promise for accelerating scientific discovery, particularly in drug design. However, these models often struggle to generate high-quality molecules, especially in conditional scenarios where specific molecular properties must be satisfied. In this work, we introduce GeoRCG, a general framework to improve molecular generative models by integrating geometric representation conditions with provable theoretical guarantees. We decompose the generation process into two stages: first, generating an informative geometric representation; second, generating a molecule conditioned on the representation. Compared with single-stage generation, the easy-to-generate representation in the first stage guides the second stage generation toward a high-quality molecule in a goal-oriented way. Leveraging EDM and SemlaFlow as base generators, we observe significant quality improvements in unconditional molecule generation on the widely used QM9 and GEOM-DRUG datasets. More notably, in the challenging conditional molecular generation task, our framework achieves an average 50% performance improvement over state-of-the-art approaches, highlighting the superiority of conditioning on semantically rich geometric representations. Furthermore, with such representation guidance, the number of diffusion steps can be reduced to as small as 100 while largely preserving the generation quality achieved with 1,000 steps, thereby significantly reducing the generation iterations needed. Code is available at https://github.com/GraphPKU/GeoRCG.
[696] Overcoming Slow Decision Frequencies in Continuous Control: Model-Based Sequence Reinforcement Learning for Model-Free Control
Devdhar Patel, Hava Siegelmann
Main category: cs.LG
TL;DR: Sequence Reinforcement Learning (SRL) introduces action sequences for lower decision frequencies, outperforming traditional RL in variable-frequency tasks.
Details
Motivation: State-of-the-art RL algorithms require impractical high-speed decision-making, limiting real-world applicability.
Method: SRL uses a model and actor-critic architecture with a “temporal recall” mechanism to learn action sequences.
Result: SRL matches state-of-the-art performance with reduced sample complexity and excels in variable-frequency tasks.
Conclusion: SRL is a practical solution for RL in real-world settings with slower decision frequencies.
Abstract: Reinforcement learning (RL) is rapidly reaching and surpassing human-level control capabilities. However, state-of-the-art RL algorithms often require timesteps and reaction times significantly faster than human capabilities, which is impractical in real-world settings and typically necessitates specialized hardware. We introduce Sequence Reinforcement Learning (SRL), an RL algorithm designed to produce a sequence of actions for a given input state, enabling effective control at lower decision frequencies. SRL addresses the challenges of learning action sequences by employing both a model and an actor-critic architecture operating at different temporal scales. We propose a “temporal recall” mechanism, where the critic uses the model to estimate intermediate states between primitive actions, providing a learning signal for each individual action within the sequence. Once training is complete, the actor can generate action sequences independently of the model, achieving model-free control at a slower frequency. We evaluate SRL on a suite of continuous control tasks, demonstrating that it achieves performance comparable to state-of-the-art algorithms while significantly reducing actor sample complexity. To better assess performance across varying decision frequencies, we introduce the Frequency-Averaged Score (FAS) metric. Our results show that SRL significantly outperforms traditional RL algorithms in terms of FAS, making it particularly suitable for applications requiring variable decision frequencies. Furthermore, we compare SRL with model-based online planning, showing that SRL achieves comparable FAS while leveraging the same model during training that online planners use for planning.
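The “temporal recall” idea can be sketched schematically: the actor emits a sequence of primitive actions from one observed state, the learned model rolls the state forward between them, and the critic scores each intermediate (state, action) pair so every action in the sequence receives a learning signal. A toy torch sketch of that objective, not the authors’ implementation:

```python
import torch
import torch.nn as nn

B, S, A, H = 32, 8, 2, 4  # batch, state dim, action dim, sequence length

actor = nn.Sequential(nn.Linear(S, 64), nn.Tanh(), nn.Linear(64, H * A))
model = nn.Sequential(nn.Linear(S + A, 64), nn.Tanh(), nn.Linear(64, S))
critic = nn.Sequential(nn.Linear(S + A, 64), nn.Tanh(), nn.Linear(64, 1))

def sequence_actor_loss(state):
    """Temporal-recall sketch: the model estimates the intermediate states
    between primitive actions so the critic can score every action."""
    actions = actor(state).view(-1, H, A)   # whole action sequence at once
    s, total_q = state, 0.0
    for j in range(H):
        a = actions[:, j]
        total_q = total_q + critic(torch.cat([s, a], dim=-1)).mean()
        s = model(torch.cat([s, a], dim=-1))  # estimated intermediate state
    return -total_q / H  # ascend on the critic's value over the sequence

print(sequence_actor_loss(torch.randn(B, S)))
```

At deployment, only the actor is needed: one forward pass yields H primitive actions, which is what allows the slower decision frequency.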
[697] Analytic Continual Test-Time Adaptation for Multi-Modality Corruption
Yufei Zhang, Yicheng Xu, Hongxin Wei, Zhiping Lin, Xiaofeng Zou, Cen Chen, Huiping Zhuang
Main category: cs.LG
TL;DR: MDAA addresses MM-CTTA challenges like catastrophic forgetting and reliability bias using Analytic Classifiers and Dynamic Late Fusion Mechanism, achieving top performance.
Details
Motivation: To bridge domain shifts in multi-modal, evolving target domains while mitigating catastrophic forgetting and reliability bias.
Method: Proposes Multi-modality Dynamic Analytic Adapter (MDAA) with Analytic Classifiers (ACs) for forgetting mitigation and Dynamic Late Fusion Mechanism (DLFM) for reliable multi-modal integration.
Result: MDAA achieves state-of-the-art performance in MM-CTTA tasks.
Conclusion: MDAA effectively tackles MM-CTTA challenges, offering a robust solution for multi-modal adaptation.
Abstract: Test-Time Adaptation (TTA) enables pre-trained models to bridge the gap between source and target datasets using unlabeled test data, addressing domain shifts caused by corruptions like weather changes, noise, or sensor malfunctions at test time. Multi-Modal Continual Test-Time Adaptation (MM-CTTA), as an extension of standard TTA, further allows models to handle multi-modal inputs and adapt to continuously evolving target domains. However, MM-CTTA faces critical challenges such as catastrophic forgetting and reliability bias, which are rarely addressed effectively under multi-modal corruption scenarios. In this paper, we propose a novel approach, Multi-modality Dynamic Analytic Adapter (MDAA), to tackle MM-CTTA tasks. MDAA introduces analytic learning, a closed-form training technique, through Analytic Classifiers (ACs) to mitigate catastrophic forgetting. Furthermore, we design the Dynamic Late Fusion Mechanism (DLFM) to dynamically select and integrate reliable information from different modalities. Extensive experiments show that MDAA achieves state-of-the-art performance across the proposed tasks.
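For background, analytic classifiers are trained in closed form, typically by regularized least squares, which avoids the iterative gradient updates that drive forgetting. A minimal sketch of that closed-form fit (the paper’s concrete classifier design is not given in the abstract):

```python
import numpy as np

def fit_analytic_classifier(X, Y, gamma=1.0):
    """Closed-form ridge solution W = (X^T X + gamma I)^{-1} X^T Y.
    X: (n, d) features from a frozen encoder; Y: (n, c) one-hot labels."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ Y)

X = np.random.randn(500, 32)
Y = np.eye(3)[np.random.randint(0, 3, size=500)]
W = fit_analytic_classifier(X, Y)
pred = (X @ W).argmax(axis=1)
print("train accuracy:", (pred == Y.argmax(axis=1)).mean())
```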
[698] Syno: Structured Synthesis for Neural Operators
Yongqi Zhuo, Zhengyuan Su, Chenggang Zhao, Mingyu Gao
Main category: cs.LG
TL;DR: Syno is an end-to-end framework for neural operator synthesis, automating the discovery of novel neural operators for better accuracy and speed. It uses fine-grained primitives and guided synthesis to achieve significant speedups with minimal accuracy loss.
Details
Motivation: The need for improved prediction accuracy and execution performance in neural networks drives the exploration of neural operator synthesis, as existing methods like NAS and tensor compilers are limited to optimizing or composing existing operators.
Method: Syno employs fine-grained primitives for tensor dimensions, expression canonicalization to avoid redundancy, and a guided synthesis flow with stochastic tree search for efficient exploration of the design space.
Result: Syno achieves average speedups of 1.37× to 2.06× on various hardware and compiler choices, with less than 1% accuracy loss, even on NAS-optimized models.
Conclusion: Syno demonstrates the potential of neural operator synthesis to outperform traditional methods, offering a practical framework for discovering novel, high-performance neural operators.
Abstract: The desire for better prediction accuracy and higher execution performance in neural networks never ends. Neural architecture search (NAS) and tensor compilers are two popular techniques to optimize these two goals, but they are both limited to composing or optimizing existing manually designed operators rather than coming up with completely new designs. In this work, we explore the less studied direction of neural operator synthesis, which aims to automatically and efficiently discover novel neural operators with better accuracy and/or speed. We develop an end-to-end framework, Syno, to realize practical neural operator synthesis. Syno makes use of a novel set of fine-grained primitives defined on tensor dimensions, which ensure various desired properties to ease model training, and also enable expression canonicalization techniques to avoid redundant candidates during search. Syno further adopts a novel guided synthesis flow to obtain valid operators matched with the specified input/output dimension sizes, and leverages efficient stochastic tree search algorithms to quickly explore the design space. We demonstrate that Syno discovers better operators with average speedups of $1.37\times$ to $2.06\times$ on various hardware and compiler choices, while keeping less than 1% accuracy loss even on NAS-optimized models.
[699] Density Ratio Estimation-based Bayesian Optimization with Semi-Supervised Learning
Jungtaek Kim
Main category: cs.LG
TL;DR: The paper proposes a semi-supervised learning approach for density ratio estimation-based Bayesian optimization to address overconfidence in supervised classifiers, demonstrating its effectiveness in experiments.
Details
Motivation: Bayesian optimization is widely used but faces challenges with overconfidence in supervised classifiers when estimating global optima.
Method: Uses semi-supervised learning with density ratio estimation to leverage unlabeled data, improving accuracy.
Result: Empirical results show the method outperforms baselines in scenarios with unlabeled data.
Conclusion: The proposed semi-supervised approach enhances Bayesian optimization by mitigating classifier overconfidence.
Abstract: Bayesian optimization has attracted huge attention from diverse research areas in science and engineering, since it is capable of efficiently finding a global optimum of an expensive-to-evaluate black-box function. In general, a probabilistic regression model is widely used as a surrogate function to model an explicit distribution over function evaluations given an input to estimate and a training dataset. Beyond the probabilistic regression-based methods, density ratio estimation-based Bayesian optimization has been suggested in order to estimate a density ratio of the groups relatively close and relatively far to a global optimum. Developing this line of research further, supervised classifiers are employed to estimate a class probability for the two groups instead of a density ratio. However, the supervised classifiers used in this strategy are prone to be overconfident for known knowledge on global solution candidates. Supposing that we have access to unlabeled points, e.g., predefined fixed-size pools, we propose density ratio estimation-based Bayesian optimization with semi-supervised learning to solve this challenge. Finally, we show the empirical results of our methods and several baseline methods in two distinct scenarios with unlabeled point sampling and a fixed-size pool, and analyze the validity of our methods in diverse experiments.
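In density-ratio (classifier-based) Bayesian optimization, observations are split into a “good” set (below a quantile of objective values, for minimization) and a “bad” set, and a classifier’s predicted probability of “good” serves as the acquisition function. A hedged sketch of one way unlabeled pool points could be folded in via pseudo-labeling; the paper’s actual semi-supervised procedure may differ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def acquisition(X_obs, y_obs, X_pool, gamma=0.25, conf=0.9):
    """Classifier-based acquisition: P(x belongs to the 'good' group)."""
    labels = (y_obs <= np.quantile(y_obs, gamma)).astype(int)  # 1 = good
    clf = LogisticRegression().fit(X_obs, labels)
    # Semi-supervised step (assumption): pseudo-label confident pool points
    # and refit, so unlabeled data tempers the classifier's overconfidence.
    proba = clf.predict_proba(X_pool)[:, 1]
    confident = (proba > conf) | (proba < 1 - conf)
    if confident.any():
        X_aug = np.vstack([X_obs, X_pool[confident]])
        y_aug = np.concatenate([labels, (proba[confident] > 0.5).astype(int)])
        clf = LogisticRegression().fit(X_aug, y_aug)
    return clf.predict_proba(X_pool)[:, 1]  # next query: argmax over the pool

X_obs = np.random.randn(30, 2)
y_obs = (X_obs ** 2).sum(axis=1)            # toy objective to minimize
X_pool = np.random.randn(200, 2)            # predefined fixed-size pool
print(X_pool[np.argmax(acquisition(X_obs, y_obs, X_pool))])
```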
[700] Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task
Maya Okawa, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka
Main category: cs.LG
TL;DR: The paper studies compositional generalization in diffusion models, revealing how data structure and frequency impact their ability to generate novel, out-of-distribution samples.
Details
Motivation: To understand why diffusion models sometimes fail unpredictably in compositional tasks and how data attributes influence their generalization.
Method: A controlled study using synthetic data, varying training data attributes to measure out-of-distribution generation capabilities.
Result: Compositional generalization depends on data structure, shows sudden emergence, and requires more optimization for low-frequency concepts.
Conclusion: The study provides foundational insights into generative models’ compositional capabilities from a data-centric view.
Abstract: Modern generative models exhibit unprecedented capabilities to generate extremely realistic data. However, given the inherent compositionality of the real world, reliable use of these models in practical applications requires that they exhibit the capability to compose a novel set of concepts to generate outputs not seen in the training data set. Prior work demonstrates that recent diffusion models do exhibit intriguing compositional generalization abilities, but also fail unpredictably. Motivated by this, we perform a controlled study for understanding compositional generalization in conditional diffusion models in a synthetic setting, varying different attributes of the training data and measuring the model’s ability to generate samples out-of-distribution. Our results show: (i) the order in which the ability to generate samples from a concept and compose them emerges is governed by the structure of the underlying data-generating process; (ii) performance on compositional tasks exhibits a sudden “emergence” due to multiplicative reliance on the performance of constituent tasks, partially explaining emergent phenomena seen in generative models; and (iii) composing concepts with lower frequency in the training data to generate out-of-distribution samples requires considerably more optimization steps compared to generating in-distribution samples. Overall, our study lays a foundation for understanding capabilities and compositionality in generative models from a data-centric perspective.
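Finding (ii) can be stated compactly: if a composition requires $k$ constituent capabilities learned to accuracies $p_1, \dots, p_k$, compositional performance behaves roughly like their product, which stays near zero until every factor is high and then rises sharply. A schematic illustration, not the paper’s exact model:

```latex
P(\text{compose}) \;\approx\; \prod_{i=1}^{k} p_i(t),
\qquad p_i(t) = \sigma\big(\alpha_i (t - t_i)\big).
```

With smooth sigmoidal learning curves $p_i(t)$, the product remains flat for most of training and then jumps, so each constituent improves gradually while the composition appears to “emerge” suddenly.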
[701] REDS: Resource-Efficient Deep Subnetworks for Dynamic Resource Constraints
Francesco Corti, Balz Maag, Joachim Schauer, Ulrich Pferschy, Olga Saukh
Main category: cs.LG
TL;DR: REDS introduces resource-efficient subnetworks for edge devices, adapting to dynamic resource constraints via structured sparsity and hardware optimizations.
Details
Motivation: Addressing the inability of current models to adapt to runtime resource variability in edge devices.
Method: Uses structured sparsity and permutation invariance for hardware-specific optimizations, skipping computational blocks and re-arranging operations.
Result: Achieves high accuracy and fast adaptation (<40µs) on multiple benchmarks and hardware platforms.
Conclusion: REDS effectively adapts to dynamic resource constraints, outperforming state-of-the-art methods.
Abstract: Deep learning models deployed on edge devices frequently encounter resource variability, which arises from fluctuating energy levels, timing constraints, or prioritization of other critical tasks within the system. State-of-the-art machine learning pipelines generate resource-agnostic models that are not capable of adapting at runtime. In this work, we introduce Resource-Efficient Deep Subnetworks (REDS) to tackle model adaptation to variable resources. In contrast to the state of the art, REDS leverages structured sparsity constructively by exploiting permutation invariance of neurons, which allows for hardware-specific optimizations. Specifically, REDS achieves computational efficiency by (1) skipping sequential computational blocks identified by a novel iterative knapsack optimizer, and (2) taking advantage of the data cache by re-arranging the order of operations in the REDS computational graph. REDS supports conventional deep networks frequently deployed on the edge and provides computational benefits even for small and simple networks. We evaluate REDS on eight benchmark architectures trained on the Visual Wake Words, Google Speech Commands, Fashion-MNIST, CIFAR-10 and ImageNet-1K datasets, and test on four off-the-shelf mobile and embedded hardware platforms. We provide a theoretical result and empirical evidence demonstrating REDS’ outstanding performance in terms of submodels’ test set accuracy, and demonstrate an adaptation time in response to dynamic resource constraints of under 40$\mu$s, utilizing a fully-connected network on Arduino Nano 33 BLE.
[702] The Feature Speed Formula: a flexible approach to scale hyper-parameters of deep neural networks
Lénaïc Chizat, Praneeth Netrapalli
Main category: cs.LG
TL;DR: The paper introduces the angle θ_ℓ between feature updates and backward passes to predict and control feature learning in deep networks, providing a feature speed formula to adjust hyperparameters for desired dynamics.
Details
Motivation: Hyperparameters indirectly control hierarchical feature learning in deep networks. The paper aims to directly predict and control feature learning by analyzing the angle θ_ℓ.
Method: The feature speed formula relates feature updates to θ_ℓ, loss decay, and backward pass magnitude. The angle θ_ℓ is analyzed in ReLU MLPs and ResNets, focusing on initialization and large width/depth limits.
Result: ReLU MLPs show angle degeneration with depth (cos(θ_ℓ) = Θ(1/√ℓ)), while ResNets maintain a non-degenerate angle (cos(θ_ℓ) = Θ(1)). Insights lead to new hyperparameter scaling for ReLU MLPs.
Conclusion: The angle θ_ℓ and feature speed formula offer a principled way to control feature learning dynamics, recovering known hyperparameter scalings and introducing new ones for improved performance.
Abstract: Deep learning succeeds by doing hierarchical feature learning, yet tuning hyper-parameters (HP) such as initialization scales, learning rates etc., only give indirect control over this behavior. In this paper, we introduce a key notion to predict and control feature learning: the angle $\theta_\ell$ between the feature updates and the backward pass (at layer index $\ell$). We show that the magnitude of feature updates after one GD step, at any training time, can be expressed via a simple and general \emph{feature speed formula} in terms of this angle $\theta_\ell$, the loss decay, and the magnitude of the backward pass. This angle $\theta_\ell$ is controlled by the conditioning of the layer-to-layer Jacobians and at random initialization, it is determined by the spectrum of a certain kernel, which coincides with the Neural Tangent Kernel when $\ell=\text{depth}$. Given $\theta_\ell$, the feature speed formula provides us with rules to adjust HPs (scales and learning rates) so as to satisfy certain dynamical properties, such as feature learning and loss decay. We investigate the implications of our approach for ReLU MLPs and ResNets in the large width-then-depth limit. Relying on prior work, we show that in ReLU MLPs with iid initialization, the angle degenerates with depth as $\cos(\theta_\ell)=\Theta(1/\sqrt{\ell})$. In contrast, ResNets with branch scale $O(1/\sqrt{\text{depth}})$ maintain a non-degenerate angle $\cos(\theta_\ell)=\Theta(1)$. We use these insights to recover key properties of known HP scalings and also to introduce a new HP scaling for large depth ReLU MLPs with favorable theoretical properties.
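The angle $\theta_\ell$ is directly measurable: record a layer’s features and their incoming gradient (the backward pass) on a batch, take one GD step, recompute the features on the same batch, and compare the update with the stored gradient. A minimal torch sketch; comparing against the descent direction $-g$ is a sign convention chosen here:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 1))
x, y = torch.randn(128, 16), torch.randn(128, 1)
opt = torch.optim.SGD(net.parameters(), lr=1e-2)

h = net[:3](x)                    # features at layer l (2nd Linear output)
h.retain_grad()
loss = ((net[3:](h) - y) ** 2).mean()
loss.backward()
g = h.grad.detach().flatten()     # the backward pass at layer l
opt.step()                        # one GD step

with torch.no_grad():
    delta = (net[:3](x) - h).flatten()   # feature update on the same batch
cos_theta = torch.dot(delta, -g) / (delta.norm() * g.norm())
print(f"cos(theta_l) = {cos_theta.item():.3f}")
```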
[703] On the rates of convergence for learning with convolutional neural networks
Yunfei Yang, Han Feng, Ding-Xuan Zhou
Main category: cs.LG
TL;DR: The paper analyzes the approximation and learning capacities of CNNs with specific constraints, deriving improved bounds and optimal convergence rates for regression and classification tasks.
Details
Motivation: To understand and improve the theoretical foundations of CNNs, particularly in approximation and learning tasks, by addressing gaps in existing literature.
Method: The study uses approximation bounds and covering number analysis for CNNs, focusing on weight constraints and deriving convergence rates for regression and classification.
Result: Improved approximation bounds and covering number analysis for CNNs, leading to minimax optimal convergence rates in regression and classification.
Conclusion: The paper provides theoretical advancements for CNNs, demonstrating their effectiveness in learning tasks with optimal performance guarantees.
Abstract: We study approximation and learning capacities of convolutional neural networks (CNNs) with one-side zero-padding and multiple channels. Our first result proves a new approximation bound for CNNs with certain constraint on the weights. Our second result gives new analysis on the covering number of feed-forward neural networks with CNNs as special cases. The analysis carefully takes into account the size of the weights and hence gives better bounds than the existing literature in some situations. Using these two results, we are able to derive rates of convergence for estimators based on CNNs in many learning problems. In particular, we establish minimax optimal convergence rates of the least squares based on CNNs for learning smooth functions in the nonparametric regression setting. For binary classification, we derive convergence rates for CNN classifiers with hinge loss and logistic loss. It is also shown that the obtained rates for classification are minimax optimal in some common settings.
[704] On the Robustness of Global Feature Effect Explanations
Hubert Baniecki, Giuseppe Casalicchio, Bernd Bischl, Przemyslaw Biecek
Main category: cs.LG
TL;DR: The paper examines the robustness of global post-hoc explanations (like partial dependence plots and accumulated local effects) for predictive models on tabular data, providing theoretical bounds and experimental results to quantify their vulnerability to perturbations.
Details
Motivation: Understanding the robustness of global explanations is crucial for model debugging and scientific discovery, but their vulnerability to data and model perturbations is not well-studied.
Method: The authors introduce theoretical bounds for evaluating robustness and conduct experiments using synthetic and real-world datasets.
Result: The experiments quantify the gap between best and worst-case scenarios of interpreting machine learning predictions globally.
Conclusion: The study highlights the need for caution when relying on global post-hoc explanations due to their potential vulnerability to perturbations.
Abstract: We study the robustness of global post-hoc explanations for predictive models trained on tabular data. Effects of predictor features in black-box supervised learning are an essential diagnostic tool for model debugging and scientific discovery in applied sciences. However, how vulnerable they are to data and model perturbations remains an open research question. We introduce several theoretical bounds for evaluating the robustness of partial dependence plots and accumulated local effects. Our experimental results with synthetic and real-world datasets quantify the gap between the best and worst-case scenarios of (mis)interpreting machine learning predictions globally.
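For reference, the partial dependence of a model $f$ on feature $j$ at value $v$ is the average prediction with that feature clamped, $\mathrm{PD}_j(v) = \frac{1}{n}\sum_i f(x_i^{(j \leftarrow v)})$; the paper’s bounds concern how far such curves can move under data and model perturbations. A minimal sketch of the estimator itself:

```python
import numpy as np

def partial_dependence(predict, X, j, grid):
    """PD_j(v) = mean prediction over the data with feature j set to v."""
    pd = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v                 # clamp feature j for every row
        pd.append(predict(Xv).mean())
    return np.asarray(pd)

X = np.random.randn(500, 3)
f = lambda X: X[:, 0] ** 2 + 0.5 * X[:, 1]      # toy black-box model
grid = np.linspace(-2, 2, 9)
print(partial_dependence(f, X, j=0, grid=grid))  # ~ v**2 + constant
```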
[705] NeuSemSlice: Towards Effective DNN Model Maintenance via Neuron-level Semantic Slicing
Shide Zhou, Tianlin Li, Yihao Huang, Ling Shi, Kailong Wang, Yang Liu, Haoyu Wang
Main category: cs.LG
TL;DR: NeuSemSlice introduces semantic slicing to identify neuron-level semantic components in DNNs, enhancing model maintenance tasks like restructure, re-adaptation, and incremental development.
Details
Motivation: DNNs' monolithic architecture complicates maintenance tasks. Layer-level approaches lack precision for neuron-level manipulation, necessitating finer-grained solutions.
Method: NeuSemSlice uses semantic slicing to identify, categorize, and merge critical neurons by semantic similarity, enabling flexible maintenance strategies.
Result: NeuSemSlice outperforms baselines in model restructure, re-adaptation, and incremental development tasks.
Conclusion: NeuSemSlice provides an effective neuron-level framework for semantic-aware DNN maintenance, addressing limitations of prior layer-level approaches.
Abstract: Deep Neural Networks (DNNs), extensively applied across diverse disciplines, are characterized by their integrated and monolithic architectures, setting them apart from conventional software systems. This architectural difference introduces particular challenges to maintenance tasks, such as model restructure (e.g., model compression), re-adaptation (e.g., fitting new samples), and incremental development (e.g., continual knowledge accumulation). Prior research addresses these challenges by identifying task-critical neuron layers, and dividing neural networks into semantically-similar sequential modules. However, such layer-level approaches fail to precisely identify and manipulate neuron-level semantic components, restricting their applicability to finer-grained model maintenance tasks. In this work, we implement NeuSemSlice, a novel framework that introduces the semantic slicing technique to effectively identify critical neuron-level semantic components in DNN models for semantic-aware model maintenance tasks. Specifically, semantic slicing identifies, categorizes and merges critical neurons across different categories and layers according to their semantic similarity, enabling their flexibility and effectiveness in the subsequent tasks. For semantic-aware model maintenance tasks, we provide a series of novel strategies based on semantic slicing to enhance NeuSemSlice. They include semantic components (i.e., critical neurons) preservation for model restructure, critical neuron tuning for model re-adaptation, and non-critical neuron training for model incremental development. A thorough evaluation has demonstrated that NeuSemSlice significantly outperforms baselines in all three tasks.
[706] ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models
Chao Zeng, Songwei Liu, Yusheng Xie, Hong Liu, Xiaojian Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei
Main category: cs.LG
TL;DR: ABQ-LLM introduces an arbitrary-bit quantization algorithm and framework to address performance degradation and GPU acceleration limitations in LLM inference, achieving superior results.
Details
Motivation: Overcome the challenges of low-bit quantization performance degradation and restricted GPU acceleration for quantized matrix operations in LLMs.
Method: Proposes a distribution correction method, bit balance strategy, and a novel quantization acceleration framework using BTC equivalents for arbitrary-precision inference.
Result: Achieved a WikiText2 perplexity of 7.59 (2.17↓ vs AffineQuant) and 1.6× acceleration improvement with 2.7× memory compression gain over SmoothQuant.
Conclusion: ABQ-LLM effectively mitigates quantization issues and enhances LLM inference efficiency, making it practical for deployment.
Abstract: Large Language Models (LLMs) have revolutionized natural language processing tasks. However, their practical application is constrained by substantial memory and computational demands. Post-training quantization (PTQ) is considered an effective method to accelerate LLM inference. Despite its growing popularity in LLM model compression, PTQ deployment faces two major challenges. First, low-bit quantization leads to performance degradation. Second, restricted by the limited integer computing unit types on GPUs, quantized matrix operations with different precisions cannot be effectively accelerated. To address these issues, we introduce a novel arbitrary-bit quantization algorithm and inference framework, ABQ-LLM. It achieves superior performance across various quantization settings and enables efficient arbitrary-precision quantized inference on the GPU. ABQ-LLM introduces several key innovations: (1) a distribution correction method for transformer blocks to mitigate distribution differences caused by full quantization of weights and activations, improving performance at low bit-widths; (2) a bit balance strategy to counteract performance degradation from asymmetric distribution issues at very low bit-widths (e.g., 2-bit); (3) an innovative quantization acceleration framework that reconstructs the quantization matrix multiplication of arbitrary precision combinations based on BTC (Binary TensorCore) equivalents, getting rid of the limitations of INT4/INT8 computing units. ABQ-LLM can convert each component bit-width gain into an actual acceleration gain, maximizing performance under mixed precision (e.g., W6A6, W2A8). Based on the W2*A8 quantization configuration on the LLaMA-7B model, it achieved a WikiText2 perplexity of 7.59 (2.17$\downarrow $ vs 9.76 in AffineQuant). Compared to SmoothQuant, we realized 1.6$\times$ acceleration improvement and 2.7$\times$ memory compression gain.
[707] CoSTI: Consistency Models for (a faster) Spatio-Temporal Imputation
Javier Solís-García, Belén Vega-Márquez, Juan A. Nepomuceno, Isabel A. Nepomuceno-Chamorro
Main category: cs.LG
TL;DR: CoSTI, an adaptation of Consistency Models, offers efficient multivariate time series imputation with performance matching diffusion models but much faster.
Details
Motivation: Addressing the high computational cost and time inefficiency of existing methods like DDPMs in multivariate time series imputation.
Method: Proposes CoSTI, using Consistency Training to reduce inference times while maintaining accuracy comparable to DDPMs.
Result: Achieves up to 98% reduction in imputation time with performance on par with diffusion models.
Conclusion: CoSTI bridges efficiency and accuracy in generative imputation, suitable for real-time applications.
Abstract: Multivariate Time Series Imputation (MTSI) is crucial for many applications, such as healthcare monitoring and traffic management, where incomplete data can compromise decision-making. Existing state-of-the-art methods, like Denoising Diffusion Probabilistic Models (DDPMs), achieve high imputation accuracy; however, they suffer from significant computational costs and are notably time-consuming due to their iterative nature. In this work, we propose CoSTI, an innovative adaptation of Consistency Models (CMs) for the MTSI domain. CoSTI employs Consistency Training to achieve comparable imputation quality to DDPMs while drastically reducing inference times, making it more suitable for real-time applications. We evaluate CoSTI across multiple datasets and missing data scenarios, demonstrating up to a 98% reduction in imputation time with performance on par with diffusion-based models. This work bridges the gap between efficiency and accuracy in generative imputation tasks, providing a scalable solution for handling missing data in critical spatio-temporal systems.
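Consistency Training, as introduced for consistency models, teaches a network to map adjacent points on the same noise trajectory to the same output, with an EMA copy as the target; this is what removes the long iterative denoising chain at inference. A generic torch sketch of that loss, not CoSTI’s spatio-temporal architecture:

```python
import copy
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(9, 64), nn.ReLU(), nn.Linear(64, 8))  # f(x, t)
f_ema = copy.deepcopy(f).requires_grad_(False)                    # target net

def consistency_training_loss(x, sigmas):
    """d( f(x + t_{n+1} z, t_{n+1}), f_ema(x + t_n z, t_n) ) for random n."""
    n = torch.randint(0, len(sigmas) - 1, (x.shape[0],))
    t0, t1 = sigmas[n].unsqueeze(1), sigmas[n + 1].unsqueeze(1)
    z = torch.randn_like(x)                      # shared noise direction
    out = f(torch.cat([x + t1 * z, t1], dim=1))
    with torch.no_grad():
        tgt = f_ema(torch.cat([x + t0 * z, t0], dim=1))
    return ((out - tgt) ** 2).mean()   # after each step, EMA-update f_ema

sigmas = torch.linspace(0.01, 1.0, 20)           # discretized noise schedule
print(consistency_training_loss(torch.randn(32, 8), sigmas))
```

Because the trained network maps any noise level straight to the data estimate, imputation takes one (or a few) forward passes instead of hundreds of diffusion steps, which is the source of the reported speedup.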
[708] Persistent Backdoor Attacks in Continual Learning
Zhen Guo, Abhinav Kumar, Reza Tourani
Main category: cs.LG
TL;DR: The paper introduces two persistent backdoor attacks for neural networks in continual learning, demonstrating their effectiveness and evasion of defenses.
Details
Motivation: Backdoor attacks in neural networks are a critical threat, but their persistence in continual learning is understudied.
Method: Two attacks are proposed: Blind Task Backdoor (subtle loss computation alteration) and Latent Task Backdoor (single-task influence).
Result: Both attacks achieve high success rates across continual learning algorithms and evade defenses like SentiNet and I-BAU.
Conclusion: The study highlights the practicality and persistence of backdoor attacks in continual learning, calling for improved defenses.
Abstract: Backdoor attacks pose a significant threat to neural networks, enabling adversaries to manipulate model outputs on specific inputs, often with devastating consequences, especially in critical applications. While backdoor attacks have been studied in various contexts, little attention has been given to their practicality and persistence in continual learning, particularly in understanding how the continual updates to model parameters, as new data distributions are learned and integrated, impact the effectiveness of these attacks over time. To address this gap, we introduce two persistent backdoor attacks, Blind Task Backdoor and Latent Task Backdoor, each leveraging minimal adversarial influence. Our blind task backdoor subtly alters the loss computation without direct control over the training process, while the latent task backdoor influences only a single task’s training, with all other tasks trained benignly. We evaluate these attacks under various configurations, demonstrating their efficacy with static, dynamic, physical, and semantic triggers. Our results show that both attacks consistently achieve high success rates across different continual learning algorithms, while effectively evading state-of-the-art defenses, such as SentiNet and I-BAU.
[709] Hypergraph Neural Networks Reveal Spatial Domains from Single-cell Transcriptomics Data
Mehrad Soltani, Luis Rueda
Main category: cs.LG
TL;DR: The paper introduces a Hypergraph Neural Network (HGNN) model for spatial transcriptomics clustering, outperforming traditional GNNs by capturing implicit cell connections and achieving superior performance metrics.
Details
Motivation: Spatial clustering is crucial for analyzing tissue subpopulations, but traditional GNNs fail to capture implicit cell connections, limiting their effectiveness.
Method: The proposed HGNN model uses hyperedges to capture complex cell relationships and autoencoders for unsupervised learning.
Result: The model achieved the highest iLISI score (1.843), ARI (0.51), and Leiden score (0.60), indicating superior performance in cell type diversity and clustering.
Conclusion: HGNNs are more effective than GNNs for spatial transcriptomics, offering better performance in capturing implicit connections and clustering accuracy.
Abstract: The task of spatial clustering of transcriptomics data is of paramount importance. It enables the classification of tissue samples into diverse subpopulations of cells, which, in turn, facilitates the analysis of the biological functions of clusters, tissue reconstruction, and cell-cell interactions. Many approaches leverage gene expressions, spatial locations, and histological images to detect spatial domains; however, Graph Neural Networks (GNNs), as state-of-the-art models, are limited by their assumption of pairwise connections between nodes. In the case of domain detection in spatial transcriptomics, some cells are found to be not directly related. Still, they are grouped as the same domain, which shows the inability of GNNs to capture implicit connections among the cells. While graph edges connect only two nodes, hyperedges connect an arbitrary number of nodes, which lets Hypergraph Neural Networks (HGNNs) capture and utilize richer and more complex structural information than traditional GNNs. We use autoencoders to address the limitation of not having the actual labels, which are well-suited for unsupervised learning. Our model has demonstrated exceptional performance, achieving the highest iLISI score of 1.843 compared to other methods. This score indicates the greatest diversity of cell types identified by our method. Furthermore, our model outperforms other methods in downstream clustering, achieving the highest ARI values of 0.51 and Leiden score of 0.60.
[710] Extended Histogram-based Outlier Score (EHBOS)
Tanvir Islam
Main category: cs.LG
TL;DR: EHBOS extends HBOS by using 2D histograms to capture feature dependencies, improving anomaly detection in datasets with critical feature interactions.
Details
Motivation: HBOS's assumption of feature independence limits its effectiveness in detecting anomalies where feature interactions matter.
Method: Proposes EHBOS, incorporating two-dimensional histograms to model dependencies between feature pairs.
Result: EHBOS outperforms HBOS on datasets with critical feature interactions, showing improved ROC AUC.
Conclusion: EHBOS is a valuable extension to HBOS, effective for detecting contextual or relational anomalies.
Abstract: Histogram-Based Outlier Score (HBOS) is a widely used outlier or anomaly detection method known for its computational efficiency and simplicity. However, its assumption of feature independence limits its ability to detect anomalies in datasets where interactions between features are critical. In this paper, we propose the Extended Histogram-Based Outlier Score (EHBOS), which enhances HBOS by incorporating two-dimensional histograms to capture dependencies between feature pairs. This extension allows EHBOS to identify contextual and dependency-driven anomalies that HBOS fails to detect. We evaluate EHBOS on 17 benchmark datasets, demonstrating its effectiveness and robustness across diverse anomaly detection scenarios. EHBOS outperforms HBOS on several datasets, particularly those where feature interactions are critical in defining the anomaly structure, achieving notable improvements in ROC AUC. These results highlight that EHBOS can be a valuable extension to HBOS, with the ability to model complex feature dependencies. EHBOS offers a powerful new tool for anomaly detection, particularly in datasets where contextual or relational anomalies play a significant role.
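HBOS scores a point by summing negative log densities of per-feature 1-D histograms; EHBOS replaces these with 2-D histograms over feature pairs, so a point whose individual values are ordinary but whose combination is rare still scores high. A minimal sketch:

```python
import numpy as np
from itertools import combinations

def ehbos_scores(X, bins=10, eps=1e-9):
    """Extended HBOS sketch: sum of negative log densities over 2-D
    histograms of all feature pairs (plain HBOS uses 1-D histograms)."""
    n, d = X.shape
    scores = np.zeros(n)
    for i, j in combinations(range(d), 2):
        hist, xe, ye = np.histogram2d(X[:, i], X[:, j], bins=bins, density=True)
        xi = np.clip(np.digitize(X[:, i], xe) - 1, 0, bins - 1)
        yi = np.clip(np.digitize(X[:, j], ye) - 1, 0, bins - 1)
        scores += -np.log(hist[xi, yi] + eps)
    return scores  # higher = more anomalous

X = np.random.randn(1000, 2)
X[:, 1] = X[:, 0] + 0.1 * np.random.randn(1000)  # strongly dependent pair
X[0] = [1.5, -1.5]  # each value is ordinary alone; the pair is anomalous
scores = ehbos_scores(X)
print(scores[0], scores[1:].mean())  # the dependency-violating point stands out
```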
[711] Does equivariance matter at scale?
Johann Brehmer, Sönke Behrends, Pim de Haan, Taco Cohen
Main category: cs.LG
TL;DR: Equivariant models improve data efficiency and outperform non-equivariant ones with compute scaling, but data augmentation can bridge the gap. Optimal compute allocation differs between the two.
Details
Motivation: To determine whether designing neural architectures for problem-specific symmetries is more beneficial than learning them from data.
Method: Empirical study on rigid-body interactions using equivariant and non-equivariant transformer architectures, varying model size, training steps, and dataset size.
Result: Equivariance enhances data efficiency; non-equivariant models catch up with data augmentation. Equivariant models outperform in compute scaling. Compute allocation strategies differ.
Conclusion: Equivariant architectures are advantageous for data efficiency and compute scaling, but non-equivariant models can compensate with augmentation. Optimal compute use varies by model type.
Abstract: Given large datasets and sufficient compute, is it beneficial to design neural architectures for the structure and symmetries of each problem? Or is it more efficient to learn them from data? We study empirically how equivariant and non-equivariant networks scale with compute and training samples. Focusing on a benchmark problem of rigid-body interactions and on general-purpose transformer architectures, we perform a series of experiments, varying the model size, training steps, and dataset size. We find evidence for three conclusions. First, equivariance improves data efficiency, but training non-equivariant models with data augmentation can close this gap given sufficient epochs. Second, scaling with compute follows a power law, with equivariant models outperforming non-equivariant ones at each tested compute budget. Finally, the optimal allocation of a compute budget onto model size and training duration differs between equivariant and non-equivariant models.
[712] Lagrangian neural networks for nonholonomic mechanics
Viviana Alejandra Diaz, Leandro Martin Salomone, Marcela Zuccalli
Main category: cs.LG
TL;DR: Lagrangian Neural Networks (LNNs) are adapted for mechanical systems with nonholonomic constraints, improving trajectory accuracy, constraint adherence, and energy behavior.
Details
Motivation: To extend LNNs' effectiveness to systems with nonholonomic constraints, which are common in real-world mechanical systems.
Method: Adapt LNN techniques to incorporate nonholonomic constraints, testing on known examples.
Result: Improved trajectory estimation, better constraint adherence, and enhanced energy behavior compared to unconstrained LNNs.
Conclusion: LNNs can effectively handle nonholonomic constraints, offering practical benefits for constrained mechanical systems.
Abstract: Lagrangian Neural Networks (LNNs) are a powerful tool for addressing physical systems, particularly those governed by conservation laws. LNNs can parametrize the Lagrangian of a system to predict trajectories with nearly conserved energy. These techniques have proven effective in unconstrained systems as well as those with holonomic constraints. In this work, we adapt LNN techniques to mechanical systems with nonholonomic constraints. We test our approach on some well-known examples with nonholonomic constraints, showing that incorporating these restrictions into the neural network’s learning improves not only trajectory estimation accuracy but also ensures adherence to constraints and exhibits better energy behavior compared to the unconstrained counterpart.
[713] NestQuant: Nested Lattice Quantization for Matrix Products and LLMs
Semyon Savkin, Eitan Porat, Or Ordentlich, Yury Polyanskiy
Main category: cs.LG
TL;DR: NestQuant is a novel PTQ scheme using self-similar nested lattices, achieving superior efficiency for LLMs with low-precision quantization.
Details
Motivation: Efficient deployment of large language models (LLMs) requires effective post-training quantization (PTQ) techniques.
Method: Proposes NestQuant, a practical low-complexity PTQ scheme based on the Gosset lattice for weights and activations.
Result: Quantizes Llama-3-8B to 4 bits with perplexity 6.6, outperforming state-of-the-art methods like SpinQuant, OstQuant, and QuaRot.
Conclusion: NestQuant demonstrates uniform superiority across models and benchmarks, significantly reducing perplexity gaps.
Abstract: Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Recent works have mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement a practical low-complexity version of NestQuant based on the Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLP etc). For example, NestQuant quantizes weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving perplexity of 6.6 on wikitext2. This represents more than a 55% reduction in the perplexity gap with respect to the unquantized model (perplexity of 6.14), compared to state-of-the-art methods such as Meta’s SpinQuant (perplexity 7.3), OstQuant (7.3) and QuaRot (8.2). Comparisons on bigger models (up to 70B) and on various LLM evaluation benchmarks confirm the uniform superiority of NestQuant.
[714] Generalized Trusted Multi-view Classification Framework with Hierarchical Opinion Aggregation
Long Shi, Chuanqing Tang, Huangyi Deng, Cai Xu, Lei Xing, Badong Chen
Main category: cs.LG
TL;DR: A hierarchical trusted multi-view classification framework is proposed, improving upon existing methods by incorporating intra-view and inter-view aggregation for better decision-making.
Details
Motivation: Existing trusted multi-view learning methods focus only on inter-view aggregation, neglecting intra-view information, which limits their effectiveness.
Method: The framework uses a two-phase hierarchical aggregation: intra-view (common and specific information) and inter-view (evidence-level attention mechanism).
Result: The model outperforms state-of-the-art trust-related baselines in experiments.
Conclusion: The hierarchical aggregation framework enhances trusted multi-view learning by leveraging both intra-view and inter-view information.
Abstract: Recently, multi-view learning has witnessed considerable interest in research on trusted decision-making. Previous methods are mainly inspired by an important paper published by Han et al. in 2021, which formulates a Trusted Multi-view Classification (TMC) framework that aggregates evidence from different views based on Dempster’s combination rule. All these methods only consider inter-view aggregation, lacking exploitation of intra-view information. In this paper, we propose a generalized trusted multi-view classification framework with hierarchical opinion aggregation. This hierarchical framework includes a two-phase aggregation process: the intra-view and inter-view aggregation hierarchies. In the intra-view aggregation, we assume that each view is comprised of common information shared with other views, as well as its specific information. We then aggregate both the common and specific information. This aggregation phase helps to eliminate the feature noise inherent to the view itself, thereby improving the view quality. In the inter-view aggregation, we design an attention mechanism at the evidence level to facilitate opinion aggregation from different views. To the best of our knowledge, this is one of the pioneering efforts to formulate a hierarchical aggregation framework in the trusted multi-view learning domain. Extensive experiments show that our model outperforms some state-of-the-art trust-related baselines. One can access the source code on https://github.com/lshi91/GTMC-HOA.
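For background, the Dempster-rule fusion inherited from TMC combines two per-view opinions (class beliefs b plus an uncertainty mass u, with b.sum() + u = 1) by multiplying agreeing masses and renormalizing away the conflict. A minimal sketch of that reduced combination rule as commonly written in TMC-style work; the hierarchical intra-view phase proposed here is not shown:

```python
import numpy as np

def combine_opinions(b1, u1, b2, u2):
    """Reduced Dempster's rule over two opinions (per-class beliefs b and
    uncertainty u, with b.sum() + u == 1), as used in TMC-style fusion."""
    # Conflict: belief mass the two views place on different classes.
    conflict = np.sum(np.outer(b1, b2)) - np.sum(b1 * b2)
    s = 1.0 - conflict
    b = (b1 * b2 + b1 * u2 + b2 * u1) / s
    u = (u1 * u2) / s
    return b, u

b1, u1 = np.array([0.7, 0.1, 0.0]), 0.2   # confident view
b2, u2 = np.array([0.3, 0.2, 0.1]), 0.4   # uncertain view
b, u = combine_opinions(b1, u1, b2, u2)
print(b, u, b.sum() + u)  # fused opinion; masses still sum to 1
```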
[715] On the Role of Discrete Representation in Sparse Mixture of Experts
Giang Do, Kha Pham, Hung Le, Truyen Tran
Main category: cs.LG
TL;DR: VQMoE replaces traditional routers in SMoE with vector quantization for expert assignment, improving robustness by 28% without compromising fine-tuning performance.
Details
Motivation: Address routing inconsistencies and representation collapse in SMoE by avoiding direct router fixes and using discrete input representations.
Method: Assign experts via indirection using discrete input representations learned through vector quantization (VQMoE).
Result: 28% improvement in robustness over other SMoE routing methods, with strong fine-tuning performance.
Conclusion: VQMoE effectively overcomes SMoE router weaknesses, offering a robust alternative for scaling model capacity.
Abstract: Sparse mixture of experts (SMoE) is an effective solution for scaling up model capacity without increasing the computational costs. A crucial component of SMoE is the router, responsible for directing the input to relevant experts; however, it also presents a major weakness, leading to routing inconsistencies and representation collapse issues. Instead of fixing the router like previous works, we propose an alternative that assigns experts to input via indirection, which employs the discrete representation of input that points to the expert. The discrete representations are learnt via vector quantization, resulting in a new architecture dubbed Vector-Quantized Mixture of Experts (VQMoE). We provide theoretical support and empirical evidence demonstrating the VQMoE’s ability to overcome the challenges present in traditional routers. Through extensive evaluations on both large language models and vision tasks for pre-training and fine-tuning, we show that VQMoE achieves a 28% improvement in robustness compared to other SMoE routing methods, while maintaining strong performance in fine-tuning tasks.
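The routing-by-indirection idea: instead of a learned softmax router, each token is snapped to its nearest codebook vector and the code index picks the expert. A minimal sketch of the assignment step only; how VQMoE actually learns the codebook (e.g., commitment terms) is omitted:

```python
import torch
import torch.nn as nn

d_model, n_experts = 64, 8
codebook = nn.Parameter(torch.randn(n_experts, d_model))
experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))

def vq_route(x):
    """Route each token to the expert whose codebook vector is nearest,
    i.e., routing via a discrete, vector-quantized representation."""
    dists = torch.cdist(x, codebook)   # (tokens, n_experts)
    idx = dists.argmin(dim=1)          # discrete code doubles as expert id
    out = torch.empty_like(x)
    for e in range(n_experts):
        mask = idx == e
        if mask.any():
            out[mask] = experts[e](x[mask])
    return out, idx

y, idx = vq_route(torch.randn(32, d_model))
print(y.shape, torch.bincount(idx, minlength=n_experts))  # expert loads
```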
[716] Semi-Supervised Risk Control via Prediction-Powered Inference
Bat-Sheva Einbinder, Liran Ringel, Yaniv Romano
Main category: cs.LG
TL;DR: The paper introduces a semi-supervised calibration procedure to improve the RCPS framework by leveraging unlabeled data, reducing conservatism in error rates caused by limited hold-out data.
Details
Motivation: The RCPS framework's reliance on limited hold-out calibration data leads to noisy hyper-parameter tuning and overly conservative error rates.
Method: A semi-supervised calibration procedure using unlabeled data, adapted from prediction-powered inference, is proposed to refine hyper-parameter tuning.
Result: The method is validated through experiments in few-shot image classification and early time series classification, showing improved error rate control.
Conclusion: The semi-supervised approach effectively addresses the sample-size limitation of RCPS, enhancing its practicality and performance.
Abstract: The risk-controlling prediction sets (RCPS) framework is a general tool for transforming the output of any machine learning model to design a predictive rule with rigorous error rate control. The key idea behind this framework is to use labeled hold-out calibration data to tune a hyper-parameter that affects the error rate of the resulting prediction rule. However, the limitation of such a calibration scheme is that with limited hold-out data, the tuned hyper-parameter becomes noisy and leads to a prediction rule with an error rate that is often unnecessarily conservative. To overcome this sample-size barrier, we introduce a semi-supervised calibration procedure that leverages unlabeled data to rigorously tune the hyper-parameter without compromising statistical validity. Our procedure builds upon the prediction-powered inference framework, carefully tailoring it to risk-controlling tasks. We demonstrate the benefits and validity of our proposal through two real-data experiments: few-shot image classification and early time series classification.
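In RCPS, the calibrated hyper-parameter is the most permissive value whose risk upper confidence bound, computed on hold-out data, stays below the target level; scarce hold-out data inflates that bound, which is exactly the conservatism this paper attacks. A minimal sketch of the baseline supervised calibration with a plain Hoeffding bound (RCPS supports tighter bounds, and the proposed method replaces the risk estimate inside it with a prediction-powered one):

```python
import numpy as np

def rcps_calibrate(loss_fn, lambdas, cal_data, alpha=0.1, delta=0.05):
    """Return the smallest lambda whose risk UCB stays below alpha.
    loss_fn(lam, example) must be a bounded loss in [0, 1]."""
    n = len(cal_data)
    hoeffding = np.sqrt(np.log(1 / delta) / (2 * n))  # shrinks as n grows
    for lam in sorted(lambdas):  # assumes risk is non-increasing in lambda
        risk_hat = np.mean([loss_fn(lam, ex) for ex in cal_data])
        if risk_hat + hoeffding <= alpha:
            return lam
    return max(lambdas)  # fall back to the most conservative setting

# Toy example: lambda is a prediction-set threshold; loss = miscoverage.
rng = np.random.default_rng(0)
cal = rng.normal(size=200)                 # "scores" of the true labels
loss = lambda lam, s: float(abs(s) > lam)  # 1 if the set misses the label
print(rcps_calibrate(loss, np.linspace(0.5, 4, 36), cal))
```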
[717] Improving Open-world Continual Learning under the Constraints of Scarce Labeled Data
Yujie Li, Xiangkun Wang, Xin Yang, Marcello Bonsangue, Junbo Zhang, Tianrui Li
Main category: cs.LG
TL;DR: The paper introduces Open-World Few-Shot Continual Learning (OFCL), addressing challenges in learning with limited labeled data, open detection, and knowledge transfer. It proposes a framework with token augmentation, margin-based boundaries, and adaptive knowledge space, showing superior performance.
Details
Motivation: Existing open-world continual learning (OWCL) methods require extensive labeled data, which is impractical. This paper tackles the more realistic scenario of few-shot training samples in OWCL.
Method: Proposes an OFCL framework with three components: instance-wise token augmentation (ITA), margin-based open boundary (MOB), and adaptive knowledge space (AKS).
Result: The framework outperforms baselines, demonstrating practical importance and reproducibility.
Conclusion: The OFCL framework effectively addresses challenges in few-shot continual learning, offering a scalable and adaptable solution.
Abstract: Open-world continual learning (OWCL) adapts to sequential tasks with open samples, learning knowledge incrementally while preventing forgetting. However, existing OWCL still requires a large amount of labeled data for training, which is often impractical in real-world applications. Given that new categories/entities typically come with limited annotations and are in small quantities, a more realistic situation is OWCL with scarce labeled data, i.e., few-shot training samples. Hence, this paper investigates the problem of open-world few-shot continual learning (OFCL), which is challenging in (i) learning unbounded tasks without forgetting previous knowledge and avoiding overfitting, (ii) constructing compact decision boundaries for open detection with limited labeled data, and (iii) transferring knowledge about knowns and unknowns, and even updating the unknowns to knowns once the labels of open samples are learned. In response, we propose a novel OFCL framework that integrates three key components: (1) an instance-wise token augmentation (ITA) that represents and enriches sample representations with additional knowledge, (2) a margin-based open boundary (MOB) that supports open detection as new tasks emerge over time, and (3) an adaptive knowledge space (AKS) that endows unknowns with knowledge for updating unknowns to knowns. Finally, extensive experiments show that the proposed OFCL framework outperforms all baselines remarkably, with practical importance and reproducibility. The source code is released at https://github.com/liyj1201/OFCL.
[718] GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference
Chao Zeng, Songwei Liu, Shu Yang, Fangmin Chen, Lean Fu, Xing Mei
Main category: cs.LG
TL;DR: GQSA combines quantization and sparsification for efficient LLM compression, outperforming traditional methods in accuracy and speed.
Details
Motivation: Traditional compression methods (quantization or sparsification alone) lead to performance loss at high compression rates. GQSA integrates both for better results.
Method: GQSA uses GPU-friendly structured group sparsity and quantization, with a two-stage sparse optimization strategy and a 'task-centric' parallel strategy.
Result: GQSA W4S50% outperforms 2:4 pruning and W2 quantization in accuracy and speed (1.26× faster than W2, 2.35× faster than 2:4 pruning).
Conclusion: GQSA is a superior compression technique for LLMs, offering flexibility, higher compression rates, and efficient inference.
Abstract: Model compression has emerged as a mainstream solution to reduce memory usage and computational overhead. This paper presents Group Quantization and Sparse Acceleration (GQSA), a novel compression technique tailored for LLMs. Traditional methods typically focus exclusively on either quantization or sparsification, but relying on a single strategy often results in significant performance loss at high compression rates. In contrast, GQSA integrates quantization and sparsification in a tightly coupled manner, leveraging GPU-friendly structured group sparsity and quantization for efficient acceleration. Building upon system-algorithm co-design principles, we propose a two-stage sparse optimization strategy that ensures the performance superiority of the compressed model. On the engine side, we introduce a “task-centric” parallel strategy, which, to the best of our knowledge, is the first application in the domain of sparse computing. Compared to the traditional 2:4 sparse method, the GQSA offers a more flexible and adjustable sparsity rate, as well as a higher weight compression rate, and is efficiently compatible with weight-only quantization methods. Experimental results demonstrate that, under the GQSA W4S50% compression setting, the model’s accuracy surpasses that of both 2:4 pruning and W2 quantization. Furthermore, at the inference level, GQSA outperforms W2 by 1.26$\times$ and 2:4 pruning by 2.35$\times$ in terms of speed.
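The coupling of group sparsity and quantization can be illustrated with a plain prune-then-quantize pass over weight groups (a naive reference point, not the paper's GPU kernel or its two-stage optimization; group size and bit width are placeholders):

```python
import torch

def group_sparse_quantize(w, group_size=64, keep_ratio=0.5, bits=4):
    """Zero the weakest weight groups, then uniformly quantize survivors per group."""
    flat = w.flatten()
    pad = (-flat.numel()) % group_size
    flat = torch.cat([flat, flat.new_zeros(pad)])
    groups = flat.view(-1, group_size)
    # structured group sparsity: keep only the highest-norm groups
    k = int(keep_ratio * groups.shape[0])
    keep = groups.norm(dim=1).argsort(descending=True)[:k]
    mask = torch.zeros(groups.shape[0], dtype=torch.bool, device=w.device)
    mask[keep] = True
    groups[~mask] = 0
    # per-group symmetric uniform quantization of the surviving weights
    qmax = 2 ** (bits - 1) - 1
    scale = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp((groups / scale).round(), -qmax - 1, qmax)
    return (q * scale).flatten()[: w.numel()].view_as(w)
```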
[719] Growing Neural Networks: Dynamic Evolution through Gradient Descent
Anil Radhakrishnan, John F. Lindner, Scott T. Miller, Sudeshna Sinha, William L. Ditto
Main category: cs.LG
TL;DR: The paper introduces two methods for dynamically growing neural networks during training, outperforming static networks of the same size.
Details
Motivation: To address the inefficiency of static neural networks by enabling dynamic growth during training, optimizing size and performance.
Method: Two approaches: one uses an auxiliary weight to control size, the other a controller-generated mask to modulate neuron participation. Both optimize size via gradient descent.
Result: Growing networks outperform static ones in regression and classification tasks. Scaling relations are explored for hyperparameters.
Conclusion: Starting small and growing networks may be more efficient than starting large, especially as networks grow in size and energy use.
Abstract: In contrast to conventional artificial neural networks, which are structurally static, we present two approaches for evolving small networks into larger ones during training. The first method employs an auxiliary weight that directly controls network size, while the second uses a controller-generated mask to modulate neuron participation. Both approaches optimize network size through the same gradient-descent algorithm that updates the network’s weights and biases. We evaluate these growing networks on nonlinear regression and classification tasks, where they consistently outperform static networks of equivalent final size. We then explore the hyperparameter space of these networks to find associated scaling relations relative to their static counterparts. Our results suggest that starting small and growing naturally may be preferable to simply starting large, particularly as neural networks continue to grow in size and energy consumption.
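The second approach in the abstract, a controller-generated mask modulating neuron participation, can be sketched as a trainable gate on an over-provisioned layer (a guess at the general shape, not the authors' parameterization):

```python
import torch
import torch.nn as nn

class GrowingLayer(nn.Module):
    """Soft participation mask: gradient descent can open gates to 'grow' the layer."""
    def __init__(self, in_dim, max_width):
        super().__init__()
        self.linear = nn.Linear(in_dim, max_width)
        # initialize so early neurons are active and later ones nearly off (start small)
        self.gate_logits = nn.Parameter(torch.linspace(3.0, -3.0, max_width))

    def forward(self, x):
        mask = torch.sigmoid(self.gate_logits)   # in (0, 1); effective width is learned
        return torch.relu(self.linear(x)) * mask
```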
[720] Categorical Schrödinger Bridge Matching
Grigoriy Ksenofontov, Alexander Korotin
Main category: cs.LG
TL;DR: The paper introduces a theoretical and algorithmic foundation for solving the Schrödinger Bridge (SB) problem in discrete spaces using Iterative Markovian Fitting (IMF), proposing a practical algorithm called Categorical Schrödinger Bridge Matching (CSBM).
Details
Motivation: Most SB research focuses on continuous data, leaving gaps in theory and methods for discrete spaces (e.g., VQ representations, text tokens). This work addresses these gaps.
Method: The authors justify the convergence of discrete-time IMF (D-IMF) for SB in discrete spaces and develop CSBM, a practical algorithm for solving SB in such settings.
Result: CSBM is validated through experiments on synthetic data and VQ representations of images, demonstrating its effectiveness.
Conclusion: The paper provides a robust framework for SB in discrete spaces, with CSBM offering a practical solution, supported by theoretical and empirical evidence.
Abstract: The Schrödinger Bridge (SB) is a powerful framework for solving generative modeling tasks such as unpaired domain translation. Most SB-related research focuses on continuous data space $\mathbb{R}^{D}$ and leaves open theoretical and algorithmic questions about applying SB methods to discrete data, e.g., on finite spaces $\mathbb{S}^{D}$. Notable examples of such sets $\mathbb{S}$ are codebooks of vector-quantized (VQ) representations of modern autoencoders, tokens in texts, categories of atoms in molecules, etc. In this paper, we provide a theoretical and algorithmic foundation for solving SB in discrete spaces using the recently introduced Iterative Markovian Fitting (IMF) procedure. Specifically, we theoretically justify the convergence of discrete-time IMF (D-IMF) to SB in discrete spaces. This enables us to develop a practical computational algorithm for SB, which we call Categorical Schrödinger Bridge Matching (CSBM). We show the performance of CSBM via a series of experiments with synthetic data and VQ representations of images. The code of CSBM is available at https://github.com/gregkseno/csbm.
[721] Position: Untrained Machine Learning for Anomaly Detection by using 3D Point Cloud Data
Juan Du, Dongheng Chen
Main category: cs.LG
TL;DR: The paper addresses untrained anomaly detection in 3D point clouds, proposing three frameworks for accurate anomaly identification without relying on historical data or labels.
Details
Motivation: The research is driven by real-world needs in personalized manufacturing and healthcare, where only one sample is available, making traditional anomaly detection methods impractical.
Method: Three frameworks are introduced: Latent Variable Inference (probabilistic modeling), Decomposition (sparse learning), and Local Geometry (neighborhood-based).
Result: Untrained methods achieve competitive performance with a 15-fold speed increase, proving effective in data-scarce scenarios.
Conclusion: The proposed methods offer practical solutions for industries where collecting multiple samples is infeasible, bridging a critical gap in anomaly detection.
Abstract: Anomaly detection based on 3D point cloud data is an important research problem that has received increasing attention recently. Untrained anomaly detection based on only one sample is an emerging research problem motivated by real manufacturing industries such as personalized manufacturing, where only one sample can be collected without any additional labels and historical datasets. Identifying anomalies accurately based on one 3D point cloud sample is a critical challenge in both industrial applications and the field of machine learning. This paper aims to provide a formal definition of the untrained anomaly detection problem based on 3D point cloud data, and to discuss the differences between untrained anomaly detection and current unsupervised anomaly detection problems. Unlike trained unsupervised learning, untrained unsupervised learning does not rely on any data, including unlabeled data. Instead, it leverages prior knowledge about the surfaces and anomalies. We propose three complementary methodological frameworks: the Latent Variable Inference Framework that employs probabilistic modeling to distinguish anomalies; the Decomposition Framework that separates point clouds into reference, anomaly, and noise components through sparse learning; and the Local Geometry Framework that leverages neighborhood information for anomaly identification. Experimental results demonstrate that untrained methods achieve competitive detection performance while offering significant computational advantages, demonstrating up to a 15-fold increase in execution speed. The proposed methods provide viable solutions for scenarios with extreme data scarcity, addressing critical challenges in personalized manufacturing and healthcare applications where collecting multiple samples or historical data is infeasible.
[722] TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research
Abir Harrasse, Philip Quirke, Clement Neo, Dhruv Nathawani, Luke Marks, Amir Abdullah
Main category: cs.LG
TL;DR: The paper proposes text-to-SQL generation as a task to bridge the gap between simple toy tasks and complex large models, using TinySQL dataset and interpretability techniques to analyze SQL circuits.
Details
Motivation: To address the gap in mechanistic interpretability research between simple circuits in toy tasks and features in large models.
Method: Introduces TinySQL dataset, trains models (33M to 1B parameters), and applies interpretability techniques like Edge Attribution Patching and Sparse Autoencoders to analyze SQL circuits.
Result: Identifies minimal circuits for SQL subskills, evaluates their properties, and reveals how models compose queries layerwise.
Conclusion: Provides a framework for comparing interpretability methods in structured, complex settings.
Abstract: Mechanistic interpretability research faces a gap between analyzing simple circuits in toy tasks and discovering features in large models. To bridge this gap, we propose text-to-SQL generation as an ideal task to study, as it combines the formal structure of toy tasks with real-world complexity. We introduce TinySQL, a synthetic dataset, progressing from basic to advanced SQL operations, and train models ranging from 33M to 1B parameters to establish a comprehensive testbed for interpretability. We apply multiple complementary interpretability techniques, including Edge Attribution Patching and Sparse Autoencoders, to identify minimal circuits and components supporting SQL generation. We compare circuits for different SQL subskills, evaluating their minimality, reliability, and identifiability. Finally, we conduct a layerwise logit lens analysis to reveal how models compose SQL queries across layers: from intent recognition to schema resolution to structured generation. Our work provides a robust framework for probing and comparing interpretability methods in a structured, progressively complex setting.
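The layerwise logit lens analysis mentioned above is a standard probe: each layer's residual stream is pushed through the final norm and unembedding to read off what the model "believes" mid-computation. A generic sketch of the technique (not the paper's code; the callables are assumed to come from the model under study):

```python
import torch

@torch.no_grad()
def logit_lens(hidden_states, final_norm, unembed):
    """hidden_states: per-layer residual streams, each (batch, seq, d_model).
    Returns per-layer vocabulary logits for inspecting intermediate predictions."""
    return [unembed(final_norm(h)) for h in hidden_states]
```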
[723] Finite-Time Analysis of Discrete-Time Stochastic Interpolants
Yuhao Liu, Yu Chen, Rui Hu, Longbo Huang
Main category: cs.LG
TL;DR: Discrete-time analysis of the stochastic interpolant framework, introducing a new sampler and quantifying factors affecting convergence rates.
Details
Motivation: Prior work focused on continuous-time settings with perfect solutions; this study addresses the gap by analyzing discrete-time scenarios.
Method: Introduces an innovative discrete-time sampler and derives a finite-time upper bound on distribution estimation error.
Result: Provides a novel quantification of factors like distribution distance and estimation accuracy, influencing convergence rates.
Conclusion: Offers a principled way to design efficient schedules for convergence acceleration, supported by numerical experiments.
Abstract: The stochastic interpolant framework offers a powerful approach for constructing generative models based on ordinary differential equations (ODEs) or stochastic differential equations (SDEs) to transform arbitrary data distributions. However, prior analyses of this framework have primarily focused on the continuous-time setting, assuming a perfect solution of the underlying equations. In this work, we present the first discrete-time analysis of the stochastic interpolant framework, where we introduce an innovative discrete-time sampler and derive a finite-time upper bound on its distribution estimation error. Our result provides a novel quantification of how different factors, including the distance between source and target distributions and estimation accuracy, affect the convergence rate and also offers a new principled way to design efficient schedules for convergence acceleration. Finally, numerical experiments are conducted on the discrete-time sampler to corroborate our theoretical findings.
[724] Preconditioned Inexact Stochastic ADMM for Deep Model
Shenglong Zhou, Ouya Wang, Ziyan Luo, Yongxu Zhu, Geoffrey Ye Li
Main category: cs.LG
TL;DR: PISA is a new optimizer for foundation models, addressing slow convergence and data heterogeneity with scalable parallel computing and flexible preconditions. It outperforms state-of-the-art methods in diverse applications.
Details
Motivation: Existing optimizers like SGD face limitations in convergence and struggle with data heterogeneity, especially in distributed settings. PISA aims to overcome these challenges.
Method: PISA uses preconditioned inexact stochastic ADMM, supporting various preconditions (e.g., second-order info, momentum) and requiring only Lipschitz continuity for convergence.
Result: PISA shows superior performance in training/fine-tuning diverse models (vision, LLMs, RL, GANs, RNNs) compared to other optimizers.
Conclusion: PISA effectively addresses data heterogeneity and convergence issues, offering a robust solution for training foundation models.
Abstract: The recent advancement of foundation models (FMs) has brought about a paradigm shift, revolutionizing various sectors worldwide. The popular optimizers used to train these models are stochastic gradient descent-based algorithms, which face inherent limitations, such as slow convergence and stringent assumptions for convergence. In particular, data heterogeneity arising from distributed settings poses significant challenges to their theoretical and numerical performance. This paper develops an algorithm, PISA (Preconditioned Inexact Stochastic Alternating Direction Method of Multipliers), which enables scalable parallel computing and supports various preconditions, such as second-order information, second moment, and orthogonalized momentum by Newton-Schulz iterations. Grounded in rigorous theoretical guarantees, the algorithm converges under the sole assumption of Lipschitz continuity of the gradient on a bounded region, thereby removing the need for other conditions commonly imposed by stochastic methods. This capability enables PISA to tackle the challenge of data heterogeneity effectively. Comprehensive experimental evaluations for training or fine-tuning diverse deep models, including vision models, large language models, reinforcement learning models, generative adversarial networks, and recurrent neural networks, demonstrate its superior numerical performance compared to various state-of-the-art optimizers.
[725] Adversarial Combinatorial Semi-bandits with Graph Feedback
Yuxiao Wen
Main category: cs.LG
TL;DR: Extends adversarial combinatorial semi-bandits to graph feedback and establishes the optimal regret $\widetilde{\Theta}(S\sqrt{T}+\sqrt{\alpha ST})$, interpolating between the known full-information and semi-bandit rates.
Details
Motivation: Optimal regret was known under full information and under semi-bandit feedback, but the intermediate setting, where rewards of all neighboring arms in a feedback graph are observed, remained open.
Method: Realizes a convexified action using a random decision vector with negative correlations, and contrasts this with online stochastic mirror descent (OSMD), which realizes convexified actions only in expectation.
Result: The optimal regret over horizon $T$ scales as $\widetilde{\Theta}(S\sqrt{T}+\sqrt{\alpha ST})$, with $S$ the decision size and $\alpha$ the independence number of the feedback graph; OSMD is shown to be suboptimal.
Conclusion: Graph feedback interpolates between full information and semi-bandit feedback, and the techniques also yield an improved regret upper bound for combinatorial semi-bandits with general capacity.
Abstract: In combinatorial semi-bandits, a learner repeatedly selects from a combinatorial decision set of arms, receives the realized sum of rewards, and observes the rewards of the individual selected arms as feedback. In this paper, we extend this framework to include graph feedback, where the learner observes the rewards of all neighboring arms of the selected arms in a feedback graph $G$. We establish that the optimal regret over a time horizon $T$ scales as $\widetilde{\Theta}(S\sqrt{T}+\sqrt{\alpha ST})$, where $S$ is the size of the combinatorial decisions and $\alpha$ is the independence number of $G$. This result interpolates between the known regrets $\widetilde{\Theta}(S\sqrt{T})$ under full information (i.e., $G$ is complete) and $\widetilde{\Theta}(\sqrt{KST})$ under semi-bandit feedback (i.e., $G$ has only self-loops), where $K$ is the total number of arms. A key technical ingredient is to realize a convexified action using a random decision vector with negative correlations. We also show that online stochastic mirror descent (OSMD) that only realizes convexified actions in expectation is suboptimal. In addition, we describe the problem of combinatorial semi-bandits with general capacity and apply our results to derive an improved regret upper bound, which may be of independent interest.
[726] Satellite-Surface-Area Machine-Learning Models for Reservoir Storage Estimation: Regime-Sensitive Evaluation and Operational Deployment at Loskop Dam, South Africa
Hugo Retief, Kayathri Vigneswaran, Surajit Ghosh, Mariangel Garcia Andarcia, Chris Dickens
Main category: cs.LG
TL;DR: The paper develops data-driven models for reliable reservoir storage estimates at Loskop Dam, using a 40-year DEA surface area archive and gauged water levels. Ridge regression outperforms other algorithms, and a stacked ensemble further reduces errors. Recommendations include using Gradient Boosting for daily operations, Ridge for drought warnings, and the ensemble for dashboards.
Details
Motivation: Accurate reservoir storage estimates are crucial for water allocation and drought response in semiarid regions, but conventional methods are unreliable due to sedimentation and drawdown.
Method: A 40-year DEA surface area archive (1984-2024) was combined with gauged water levels to develop volume predictors. Four feature sets and five algorithms (GB, RF, RI, LA, EN) were tested using random search and a time-series split. Errors were analyzed in low and high storage regimes.
Result: Ridge regression achieved the lowest RMSE (12.3 x 10^6 cubic meters). A stacked ensemble (GB, RF, Ridge) reduced RMSE to ~11 MCM (~3% of live capacity). Ridge was best in the low-storage regime and tied RF in the high-storage regime.
Conclusion: Recommendations include using GB for daily operations, Ridge for drought warnings, and the stacked ensemble for dashboards. Quarterly retraining and regime-specific metrics are advised to maintain accuracy below a 5% threshold.
Abstract: Reliable daily estimates of reservoir storage are pivotal for water allocation and drought response decisions in semiarid regions. Conventional rating curves at Loskop Dam, the primary storage on South Africa's Olifants River, have become increasingly uncertain owing to sedimentation and episodic drawdown. A 40-year Digital Earth Africa (DEA) surface area archive (1984-2024) was fused with gauged water levels to develop data-driven volume predictors that operate under a maximum 9.14%, 90-day drawdown constraint. Four nested feature sets were examined: (i) raw water area, (ii) +a power-law “calculated volume” proxy, (iii) +six river geometry metrics, and (iv) +full supply elevation. Five candidate algorithms, Gradient Boosting (GB), Random Forest (RF), Ridge (RI), Lasso (LA) and Elastic Net (EN), were tuned using a 20-draw random search and assessed with a five-fold time-series split to eliminate look-ahead bias. Prediction errors were decomposed into two regimes: Low (<250 x 10^6 cubic meters) and High (>250 x 10^6 cubic meters) storage. Ridge regression achieved the lowest cross-validated RMSE (12.3 x 10^6 cubic meters), outperforming GB by 16% and RF by 7%. In regime terms, Ridge was superior in the Low band (18.0 vs. 22.7 MCM for GB) and tied RF in the High band (12 MCM). In-sample diagnostics showed GB's apparent dominance (6.8-5.4 MCM) to be an artefact of overfitting. A Ridge meta-stacked ensemble combining GB, RF, and Ridge reduced full-series RMSE to ~11 MCM (~3% of live capacity). We recommend (i) GB retrained daily for routine operations, (ii) Ridge for drought early warning, and (iii) the stacked blend for all-weather dashboards. Quarterly rolling retraining and regime-specific metrics are advised to maintain operational accuracy below the 5% threshold mandated by the Department of Water and Sanitation.
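The Ridge meta-stacked blend and the time-series evaluation protocol map directly onto standard scikit-learn components; a minimal sketch (hyperparameters are placeholders, not the tuned values from the study):

```python
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Ridge meta-learner blending GB, RF, and Ridge base models
stack = StackingRegressor(
    estimators=[
        ("gb", GradientBoostingRegressor()),
        ("rf", RandomForestRegressor()),
        ("ridge", Ridge()),
    ],
    final_estimator=Ridge(),
)
# A five-fold time-series split avoids look-ahead bias, as in the evaluation above:
# scores = cross_val_score(stack, X, y, cv=TimeSeriesSplit(n_splits=5),
#                          scoring="neg_root_mean_squared_error")
```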
[727] Minimax Optimal Reinforcement Learning with Quasi-Optimism
Harin Lee, Min-hwan Oh
Main category: cs.LG
TL;DR: EQO (Exploration via Quasi-Optimism) is a new RL algorithm that avoids empirical variances, uses a simple bonus term, and achieves optimal regret bounds with practical efficiency.
Details
Motivation: To develop a practical and provably optimal RL algorithm that simplifies exploration while maintaining theoretical guarantees.
Method: EQO introduces quasi-optimism, avoiding full optimism, and uses a bonus term proportional to the inverse of state-action visit counts.
Result: EQO achieves the sharpest known regret bound for tabular RL under mild assumptions and outperforms existing algorithms in regret and computational efficiency.
Conclusion: EQO successfully combines theoretical soundness with practical effectiveness, offering a simpler yet highly efficient RL solution.
Abstract: In our quest for a reinforcement learning (RL) algorithm that is both practical and provably optimal, we introduce EQO (Exploration via Quasi-Optimism). Unlike existing minimax optimal approaches, EQO avoids reliance on empirical variances and employs a simple bonus term proportional to the inverse of the state-action visit count. Central to EQO is the concept of quasi-optimism, where estimated values need not be fully optimistic, allowing for a simpler yet effective exploration strategy. The algorithm achieves the sharpest known regret bound for tabular RL under the mildest assumptions, proving that fast convergence can be attained with a practical and computationally efficient approach. Empirical evaluations demonstrate that EQO consistently outperforms existing algorithms in both regret performance and computational efficiency, providing the best of both theoretical soundness and practical effectiveness.
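The abstract pins down EQO's exploration device precisely: a bonus proportional to the inverse visit count, added to the value estimate. In code this is one line (the constant of proportionality is our placeholder):

```python
import numpy as np

def quasi_optimistic_value(q_hat, visit_counts, c=1.0):
    """Value estimate plus a bonus proportional to 1 / N(s, a); note this is 1/N,
    not the 1/sqrt(N) used by classic UCB-style optimism."""
    return q_hat + c / np.maximum(visit_counts, 1)
```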
[728] Learning to Unlearn while Retaining: Combating Gradient Conflicts in Machine Unlearning
Gaurav Patel, Qiang Qiu
Main category: cs.LG
TL;DR: Proposes ‘Learning to Unlearn while Retaining’ to balance unlearning and retention by avoiding gradient conflicts, outperforming existing methods.
Details
Motivation: Address the challenge of conflicting gradients in machine unlearning, which degrades performance when removing specific data while retaining model utility.
Method: Uses implicit gradient regularization to prevent conflicts between unlearning and retention objectives.
Result: Effective unlearning without performance loss on remaining data, validated across discriminative and generative tasks.
Conclusion: Avoiding gradient conflicts improves unlearning effectiveness and model utility, outperforming prior methods.
Abstract: Machine Unlearning has recently garnered significant attention, aiming to selectively remove knowledge associated with specific data while preserving the model’s performance on the remaining data. A fundamental challenge in this process is balancing effective unlearning with knowledge retention, as naive optimization of these competing objectives can lead to conflicting gradients, hindering convergence and degrading overall performance. To address this issue, we propose Learning to Unlearn while Retaining, aimed to mitigate gradient conflicts between unlearning and retention objectives. Our approach strategically avoids conflicts through an implicit gradient regularization mechanism that emerges naturally within the proposed framework. This prevents conflicting gradients between unlearning and retention, leading to effective unlearning while preserving the model’s utility. We validate our approach across both discriminative and generative tasks, demonstrating its effectiveness in achieving unlearning without compromising performance on remaining data. Our results highlight the advantages of avoiding such gradient conflicts, outperforming existing methods that fail to account for these interactions.
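The conflict the paper targets is a negative inner product between the unlearning and retention gradients. A common way to picture it is PCGrad-style gradient surgery, shown below; note the paper instead avoids conflicts through an implicit regularizer, so this is an illustrative stand-in, not their method:

```python
import torch

def resolve_conflict(g_unlearn, g_retain):
    """If the unlearning gradient opposes retention (negative dot product),
    remove its component along the retention direction."""
    dot = torch.dot(g_unlearn, g_retain)
    if dot < 0:  # conflict: the two objectives pull in opposing directions
        g_unlearn = g_unlearn - dot / g_retain.pow(2).sum() * g_retain
    return g_unlearn
```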
[729] The Pitfalls of Imitation Learning when Actions are Continuous
Max Simchowitz, Daniel Pfrommer, Ali Jadbabaie
Main category: cs.LG
TL;DR: The paper demonstrates that smooth, deterministic imitator policies in control systems suffer exponentially larger errors compared to expert training data, unless policies are non-smooth, non-Markovian, or state-dependent stochastic.
Details
Motivation: To understand the limitations of imitation learning in control systems and explore the necessity of complex policy parameterizations for effective imitation.
Method: Theoretical analysis of imitation learning under exponential stability, supported by experimental evidence of complex policy parameterizations like action-chunking and diffusion policies.
Result: Smooth, deterministic imitators perform poorly unless policies are improper or expert data is spread. Complex parameterizations show benefits.
Conclusion: Effective imitation in control systems requires non-smooth or stochastic policies, highlighting the limitations of traditional methods.
Abstract: We study the problem of imitating an expert demonstrator in a discrete-time, continuous state-and-action control system. We show that, even if the dynamics satisfy a control-theoretic property called exponential stability (i.e. the effects of perturbations decay exponentially quickly), and the expert is smooth and deterministic, any smooth, deterministic imitator policy necessarily suffers error on execution that is exponentially larger, as a function of problem horizon, than the error under the distribution of expert training data. Our negative result applies to any algorithm which learns solely from expert data, including both behavior cloning and offline-RL algorithms, unless the algorithm produces highly “improper” imitator policies–those which are non-smooth, non-Markovian, or which exhibit highly state-dependent stochasticity–or unless the expert trajectory distribution is sufficiently “spread.” We provide experimental evidence of the benefits of these more complex policy parameterizations, explicating the benefits of today’s popular policy parameterizations in robot learning (e.g. action-chunking and diffusion policies). We also establish a host of complementary negative and positive results for imitation in control systems.
[730] Architecture-Aware Minimization (A$^2$M): How to Find Flat Minima in Neural Architecture Search
Matteo Gambella, Fabrizio Pittorino, Manuel Roveri
Main category: cs.LG
TL;DR: The paper explores geometric properties of neural architecture spaces in NAS, introduces flatness metrics, and proposes A²M, a method to improve generalization by biasing gradients toward flat minima.
Details
Motivation: To understand the geometric structure of neural architecture spaces and improve the generalization of differentiable NAS methods.
Method: Defines flatness metrics (neighborhoods, loss barriers) and proposes A²M, an algorithmic framework biasing gradients toward flat minima.
Result: A²M improves test accuracy by +3.60% on CIFAR-10, +4.60% on CIFAR-100, and +3.64% on ImageNet16-120.
Conclusion: A²M is effective, versatile, and easily integrable into existing NAS frameworks, advancing automated machine learning.
Abstract: Neural Architecture Search (NAS) has become an essential tool for designing effective and efficient neural networks. In this paper, we investigate the geometric properties of neural architecture spaces commonly used in differentiable NAS methods, specifically NAS-Bench-201 and DARTS. By defining flatness metrics such as neighborhoods and loss barriers along paths in architecture space, we reveal locality and flatness characteristics analogous to the well-known properties of neural network loss landscapes in weight space. In particular, we find that highly accurate architectures cluster together in flat regions, while suboptimal architectures remain isolated, unveiling the detailed geometrical structure of the architecture search landscape. Building on these insights, we propose Architecture-Aware Minimization (A$^2$M), a novel analytically derived algorithmic framework that explicitly biases, for the first time, the gradient of differentiable NAS methods towards flat minima in architecture space. A$^2$M consistently improves generalization over state-of-the-art DARTS-based algorithms on benchmark datasets including CIFAR-10, CIFAR-100, and ImageNet16-120, across both NAS-Bench-201 and DARTS search spaces. Notably, A$^2$M is able to increase the test accuracy, on average across different differentiable NAS methods, by +3.60% on CIFAR-10, +4.60% on CIFAR-100, and +3.64% on ImageNet16-120, demonstrating its superior effectiveness in practice. A$^2$M can be easily integrated into existing differentiable NAS frameworks, offering a versatile tool for future research and applications in automated machine learning. We open-source our code at https://github.com/AI-Tech-Research-Lab/AsquaredM.
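Biasing a gradient toward flat minima is most familiar from sharpness-aware minimization in weight space; the sketch below transplants that recipe to a vector of architecture parameters purely as an analogy (A$^2$M's actual update is analytically derived in the paper and may differ):

```python
import torch

def flatness_biased_grad(arch_params, loss_fn, rho=0.05):
    """arch_params: tensor with requires_grad=True. Evaluate the gradient at the
    worst nearby point in architecture space and use it for the update."""
    grad = torch.autograd.grad(loss_fn(arch_params), arch_params)[0]
    eps = rho * grad / (grad.norm() + 1e-12)            # ascent step of radius rho
    perturbed = (arch_params + eps).detach().requires_grad_(True)
    return torch.autograd.grad(loss_fn(perturbed), perturbed)[0]
```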
[731] Interpretable Graph Kolmogorov-Arnold Networks for Multi-Cancer Classification and Biomarker Identification using Multi-Omics Data
Fadi Alharbi, Nishant Budhiraja, Aleksandar Vakanski, Boyu Zhang, Murtada K. Elbashir, Harshith Guduru, Mohanad Mohammed
Main category: cs.LG
TL;DR: MOGKAN, a deep learning framework, integrates multi-omics data (mRNA, micro-RNA, DNA methylation, PPI networks) for cancer classification, achieving 96.28% accuracy and validated biomarkers.
Details
Motivation: Addressing the challenge of integrating heterogeneous multi-omics datasets for precision cancer diagnostics.
Method: Combines DESeq2, LIMMA, LASSO for dimensionality reduction, and uses Kolmogorov-Arnold theorem-based architecture for interpretability.
Result: 96.28% classification accuracy, low variability, and validated cancer-related biomarkers.
Conclusion: MOGKAN offers robust predictive performance and interpretability for clinical cancer diagnostics.
Abstract: The integration of heterogeneous multi-omics datasets at a systems level remains a central challenge for developing analytical and computational models in precision cancer diagnostics. This paper introduces Multi-Omics Graph Kolmogorov-Arnold Network (MOGKAN), a deep learning framework that utilizes messenger-RNA, micro-RNA sequences, and DNA methylation samples together with Protein-Protein Interaction (PPI) networks for cancer classification across 31 different cancer types. The proposed approach combines differential gene expression with DESeq2, Linear Models for Microarray (LIMMA), and Least Absolute Shrinkage and Selection Operator (LASSO) regression to reduce multi-omics data dimensionality while preserving relevant biological features. The model architecture is based on the Kolmogorov-Arnold theorem principle and uses trainable univariate functions to enhance interpretability and feature analysis. MOGKAN achieves classification accuracy of 96.28 percent and exhibits low experimental variability in comparison to related deep learning-based models. The biomarkers identified by MOGKAN were validated as cancer-related markers through Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. By integrating multi-omics data with graph-based deep learning, our proposed approach demonstrates robust predictive performance and interpretability with potential to enhance the translation of complex multi-omics data into clinically actionable cancer diagnostics.
[732] A Low-complexity Structured Neural Network to Realize States of Dynamical Systems
Hansaka Aluvihare, Levi Lingsch, Xianqi Li, Sirani M. Perera
Main category: cs.LG
TL;DR: The paper proposes a structured neural network (StNN) based on the Hankel operator to improve computational efficiency in solving dynamical systems, outperforming conventional methods like SINDy and HAVOK.
Details
Motivation: Addressing computational inefficiencies in data-driven learning for dynamical systems derived from nonlinear ODEs.
Method: Utilizing StNN with the Hankel operator to reduce parameters and complexity, validated through numerical simulations on Lotka-Volterra and Lorenz systems.
Result: StNN reduces computational complexity and outperforms conventional neural networks, SINDy, and HAVOK.
Conclusion: StNN offers a low-complexity solution for predicting and understanding dynamical systems, advancing data-driven learning.
Abstract: Data-driven learning is rapidly evolving and places a new perspective on realizing state-space dynamical systems. However, dynamical systems derived from nonlinear ordinary differential equations (ODEs) suffer from limitations in computational efficiency. Thus, this paper stems from data-driven learning to advance states of dynamical systems utilizing a structured neural network (StNN). The proposed learning technique also seeks to identify an optimal, low-complexity operator to solve dynamical systems, the so-called Hankel operator, derived from time-delay measurements. Thus, we utilize the StNN based on the Hankel operator to solve dynamical systems as an alternative to existing data-driven techniques. We show that the proposed StNN reduces the number of parameters and computational complexity compared with the conventional neural networks and also with the classical data-driven techniques, such as Sparse Identification of Nonlinear Dynamics (SINDy) and Hankel Alternative view of Koopman (HAVOK), which is commonly known as delay-Dynamic Mode Decomposition(DMD) or Hankel-DMD. More specifically, we present numerical simulations to solve dynamical systems utilizing the StNN based on the Hankel operator beginning from the fundamental Lotka-Volterra model, where we compare the StNN with the LEarning Across Dynamical Systems (LEADS), and extend our analysis to highly nonlinear and chaotic Lorenz systems, comparing the StNN with conventional neural networks, SINDy, and HAVOK. Hence, we show that the proposed StNN paves the way for realizing state-space dynamical systems with a low-complexity learning algorithm, enabling prediction and understanding of future states.
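The Hankel operator at the heart of the StNN is built from time-delay measurements; the embedding itself is standard and shared with HAVOK/Hankel-DMD. A minimal construction (dimensions are illustrative):

```python
import numpy as np

def hankel_embedding(y, rows):
    """Stack delayed copies of a scalar time series y into a Hankel matrix:
    H[i, j] = y[i + j], shape (rows, len(y) - rows + 1)."""
    cols = len(y) - rows + 1
    return np.stack([y[i:i + cols] for i in range(rows)])
```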
[733] Surrogate modeling of Cellular-Potts Agent-Based Models as a segmentation task using the U-Net neural network architecture
Tien Comlekoglu, J. Quetzalcóatl Toledo-Marín, Tina Comlekoglu, Douglas W. DeSimone, Shayn M. Peirce, Geoffrey Fox, James A. Glazier
Main category: cs.LG
TL;DR: A CNN surrogate model using U-Net architecture accelerates Cellular-Potts model (CPM) simulations by 590x, capturing emergent behaviors like vasculogenesis.
Details
Motivation: CPMs are computationally expensive due to interactions among agents and PDEs, limiting scalability.
Method: Developed a U-Net CNN surrogate model to predict 100 Monte-Carlo steps ahead, trained on a mechanistic CPM for vasculogenesis.
Result: Achieved 590x speedup while accurately capturing vessel sprouting, extension, anastomosis, and lacunae contraction.
Conclusion: Deep learning can efficiently replace CPMs, enabling faster, larger-scale simulations of biological processes.
Abstract: The Cellular-Potts model is a powerful and ubiquitous framework for developing computational models for simulating complex multicellular biological systems. Cellular-Potts models (CPMs) are often computationally expensive due to the explicit modeling of interactions among large numbers of individual model agents and diffusive fields described by partial differential equations (PDEs). In this work, we develop a convolutional neural network (CNN) surrogate model using a U-Net architecture that accounts for periodic boundary conditions. We use this model to accelerate the evaluation of a mechanistic CPM previously used to investigate in vitro vasculogenesis. The surrogate model was trained to predict 100 computational steps ahead (Monte-Carlo steps, MCS), accelerating simulation evaluations by a factor of 590 times compared to CPM code execution. Over multiple recursive evaluations, our model effectively captures the emergent behaviors demonstrated by the original Cellular-Potts model, such as vessel sprouting, extension and anastomosis, and contraction of vascular lacunae. This approach demonstrates the potential for deep learning to serve as an efficient surrogate model for CPM simulations, enabling faster evaluation of computationally expensive CPMs of biological processes at greater spatial and temporal scales.
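The recursive evaluation described above amounts to feeding the surrogate its own output; each call stands in for 100 Monte-Carlo steps of the original CPM. A schematic rollout loop (function and variable names are ours):

```python
def rollout(surrogate, field, n_calls):
    """Advance the simulated cell field by 100 * n_calls MCS using the U-Net surrogate."""
    trajectory = [field]
    for _ in range(n_calls):
        field = surrogate(field)   # predicted state 100 MCS ahead of the previous one
        trajectory.append(field)
    return trajectory
```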
[734] MIAT: Maneuver-Intention-Aware Transformer for Spatio-Temporal Trajectory Prediction
Chandra Raskoti, Iftekharul Islam, Xuan Wang, Weizi Li
Main category: cs.LG
TL;DR: The paper introduces MIAT, a Transformer-based model for vehicle trajectory prediction, improving accuracy by integrating maneuver intention awareness and spatiotemporal interaction modeling.
Details
Motivation: Accurate trajectory prediction is crucial for autonomous driving in mixed traffic, but uncertainties from human driving behaviors make it challenging.
Method: MIAT combines maneuver intention awareness with spatiotemporal modeling, tested on the NGSIM dataset against transformer- and LSTM-based methods.
Result: MIAT improves short-horizon predictions by 4.7% and long-horizon by 1.6%, with an 11.1% boost in long-horizon performance using intention awareness.
Conclusion: MIAT effectively enhances trajectory prediction accuracy, especially in long-horizon scenarios, and is publicly available for further research.
Abstract: Accurate vehicle trajectory prediction is critical for safe and efficient autonomous driving, especially in mixed traffic environments when both human-driven and autonomous vehicles co-exist. However, uncertainties introduced by inherent driving behaviors – such as acceleration, deceleration, and left and right maneuvers – pose significant challenges for reliable trajectory prediction. We introduce a Maneuver-Intention-Aware Transformer (MIAT) architecture, which integrates a maneuver intention awareness control mechanism with spatiotemporal interaction modeling to enhance long-horizon trajectory predictions. We systematically investigate the impact of varying awareness of maneuver intention on both short- and long-horizon trajectory predictions. Evaluated on the real-world NGSIM dataset and benchmarked against various transformer- and LSTM-based methods, our approach achieves an improvement of up to 4.7% in short-horizon predictions and 1.6% in long-horizon predictions compared to other intention-aware benchmark methods. Moreover, by leveraging the intention awareness control mechanism, MIAT realizes an 11.1% performance boost in long-horizon predictions, with a modest drop in short-horizon performance. The source code and datasets are available at https://github.com/cpraskoti/MIAT.
[735] ReCA: A Parametric ReLU Composite Activation Function
John Chidiac, Danielle Azar
Main category: cs.LG
TL;DR: The paper introduces ReCA, a novel parametric activation function based on ReLU, demonstrating superior performance over existing baselines on advanced datasets and architectures.
Details
Motivation: The optimal activation function for deep neural networks is still unresolved, despite ReLU's dominance. This paper aims to address this gap by proposing a better alternative.
Method: The authors develop ReCA, a parametric activation function derived from ReLU, and evaluate it on state-of-the-art datasets using various complex neural network architectures.
Result: ReCA outperforms all baseline activation functions in the experiments conducted.
Conclusion: ReCA is a promising alternative to ReLU, offering improved performance for deep neural networks.
Abstract: Activation functions have been shown to affect the performance of deep neural networks significantly. While the Rectified Linear Unit (ReLU) remains the dominant choice in practice, the optimal activation function for deep neural networks remains an open research question. In this paper, we propose a novel parametric activation function, ReCA, based on ReLU, which has been shown to outperform all baselines on state-of-the-art datasets using different complex neural network architectures.
[736] NbBench: Benchmarking Language Models for Comprehensive Nanobody Tasks
Yiming Zhang, Koji Tsuda
Main category: cs.LG
TL;DR: NbBench is introduced as the first comprehensive benchmark for nanobody representation learning, evaluating 11 models across 8 tasks, revealing no single model excels in all areas.
Details
Motivation: Nanobody-specific modeling lacks a unified benchmark despite their advantages in therapeutics and diagnostics.
Method: NbBench spans 8 tasks across 9 datasets, evaluating 11 models in a frozen setting.
Result: Antibody language models perform well in antigen-related tasks, but regression tasks like thermostability remain challenging. No model consistently outperforms others.
Conclusion: NbBench provides a standardized, reproducible foundation for advancing nanobody modeling.
Abstract: Nanobodies – single-domain antibody fragments derived from camelid heavy-chain-only antibodies – exhibit unique advantages such as compact size, high stability, and strong binding affinity, making them valuable tools in therapeutics and diagnostics. While recent advances in pretrained protein and antibody language models (PPLMs and PALMs) have greatly enhanced biomolecular understanding, nanobody-specific modeling remains underexplored and lacks a unified benchmark. To address this gap, we introduce NbBench, the first comprehensive benchmark suite for nanobody representation learning. Spanning eight biologically meaningful tasks across nine curated datasets, NbBench encompasses structure annotation, binding prediction, and developability assessment. We systematically evaluate eleven representative models – including general-purpose protein LMs, antibody-specific LMs, and nanobody-specific LMs – in a frozen setting. Our analysis reveals that antibody language models excel in antigen-related tasks, while performance on regression tasks such as thermostability and affinity remains challenging across all models. Notably, no single model consistently outperforms others across all tasks. By standardizing datasets, task definitions, and evaluation protocols, NbBench offers a reproducible foundation for assessing and advancing nanobody modeling.
[737] Dragonfly: a modular deep reinforcement learning library
Jonathan Viquerat, Paul Garnier, Amirhossein Bateni, Elie Hachem
Main category: cs.LG
TL;DR: Dragonfly is a modular deep reinforcement learning library designed for easy experimentation, parameter sweeps, and CPU-intensive tasks, with competitive performance on benchmarks.
Details
Motivation: To simplify experimentation and development in deep reinforcement learning by providing a modular and maintainable framework.
Method: Uses JSON serialization for swapping building blocks and parameter sweeps, with features optimized for CPU-intensive environments.
Result: Performs favorably compared to existing literature on standard benchmarks.
Conclusion: Dragonfly offers a practical and efficient solution for deep reinforcement learning research and applications.
Abstract: Dragonfly is a deep reinforcement learning library focused on modularity, in order to ease experimentation and development. It relies on a JSON serialization that allows building blocks to be swapped and parameter sweeps to be performed, while minimizing code maintenance. Some of its features are specifically designed for CPU-intensive environments, such as numerical simulations. Its performance on standard agents using common benchmarks compares favorably with the literature.
[738] Guide your favorite protein sequence generative model
Junhao Xiong, Hunter Nisonoff, Maria Lukarska, Ishan Gaur, Luke M. Oltrogge, David F. Savage, Jennifer Listgarten
Main category: cs.LG
TL;DR: ProteinGuide is a framework for conditioning protein generative models on auxiliary data, demonstrated with models like ProteinMPNN and ESM3 for designing sequences with specific properties.
Details
Motivation: Lack of a principled framework for conditioning protein generative models on auxiliary data like experimental results.
Method: Unifies protein generative models under a single framework (ProteinGuide) and applies it to models like ProteinMPNN and ESM3, conditioning on properties like stability, enzyme classes, and folds.
Result: Successfully designed sequences with specified properties, including high-activity adenine base editors.
Conclusion: ProteinGuide provides a versatile and effective method for conditioning protein generative models on diverse auxiliary data.
Abstract: Generative machine learning models on sequences are transforming protein engineering. However, no principled framework exists for conditioning these models on auxiliary information, such as experimental data, in a plug-and-play manner. Herein, we present ProteinGuide – a principled and general method for conditioning – by unifying a broad class of protein generative models under a single framework. We demonstrate the applicability of ProteinGuide by guiding two protein generative models, ProteinMPNN and ESM3, to generate amino acid and structure token sequences, conditioned on several user-specified properties such as enhanced stability, enzyme classes, and CATH-labeled folds. We also used ProteinGuide with inverse folding models and our own experimental assay to design adenine base editor sequences for high activity.
[739] GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance
Jinuk Kim, Marwa El Halabi, Wonpyo Park, Clemens JS Schaefer, Deokjae Lee, Yeonhong Park, Jae W. Lee, Hyun Oh Song
Main category: cs.LG
TL;DR: GuidedQuant improves post-training quantization by integrating gradient information and preserving weight dependencies, outperforming existing methods.
Details
Motivation: Existing quantization methods either ignore feature importance or neglect weight interactions, limiting performance.
Method: GuidedQuant incorporates end-loss gradients and preserves cross-weight dependencies. It also introduces a non-uniform scalar quantization algorithm.
Result: GuidedQuant enhances performance across various quantization types and outperforms existing non-uniform scalar methods.
Conclusion: GuidedQuant is a superior quantization approach, with code available for public use.
Abstract: Post-training quantization is a key technique for reducing the memory and inference latency of large language models by quantizing weights and activations without requiring retraining. However, existing methods either (1) fail to account for the varying importance of hidden features to the end loss or, when incorporating end loss, (2) neglect the critical interactions between model weights. To address these limitations, we propose GuidedQuant, a novel quantization approach that integrates gradient information from the end loss into the quantization objective while preserving cross-weight dependencies within output channels. GuidedQuant consistently boosts the performance of state-of-the-art quantization methods across weight-only scalar, weight-only vector, and weight-and-activation quantization. Additionally, we introduce a novel non-uniform scalar quantization algorithm, which is guaranteed to monotonically decrease the quantization objective value, and outperforms existing methods in this category. We release the code at https://github.com/snu-mllab/GuidedQuant.
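The core idea, weighting quantization error by how much the end loss cares about each weight, can be written in reduced form as below; the actual GuidedQuant objective additionally models cross-weight dependencies within output channels, which this diagonal sketch ignores:

```python
import torch

def guided_quant_error(w, w_q, g):
    """Squared quantization error weighted per weight by end-loss gradient magnitude;
    w_q is a candidate quantized tensor, g the gradient of the end loss w.r.t. w."""
    return (g.abs() * (w - w_q) ** 2).sum()
```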
[740] SEAL: Searching Expandable Architectures for Incremental Learning
Matteo Gambella, Manuel Roveri
Main category: cs.LG
TL;DR: SEAL is a NAS-based framework for incremental learning that dynamically adapts model structure, reducing forgetting and improving accuracy while keeping model size small.
Details
Motivation: Addressing the challenge of balancing plasticity and stability in incremental learning, especially in resource-constrained environments where existing NAS-based methods expand models unnecessarily.
Method: SEAL dynamically expands the model only when necessary using a capacity estimation metric and preserves stability via cross-distillation training. The NAS component searches for architecture and expansion policy jointly.
Result: SEAL reduces forgetting, enhances accuracy, and maintains a lower model size compared to prior methods across multiple benchmarks.
Conclusion: Combining NAS with selective expansion in SEAL shows promise for efficient and adaptive incremental learning.
Abstract: Incremental learning is a machine learning paradigm where a model learns from a sequential stream of tasks. This setting poses a key challenge: balancing plasticity (learning new tasks) and stability (preserving past knowledge). Neural Architecture Search (NAS), a branch of AutoML, automates the design of the architecture of Deep Neural Networks and has shown success in static settings. However, existing NAS-based approaches to incremental learning often rely on expanding the model at every task, making them impractical in resource-constrained environments. In this work, we introduce SEAL, a NAS-based framework tailored for data-incremental learning, a scenario where disjoint data samples arrive sequentially and are not stored for future access. SEAL adapts the model structure dynamically by expanding it only when necessary, based on a capacity estimation metric. Stability is preserved through cross-distillation training after each expansion step. The NAS component jointly searches for both the architecture and the optimal expansion policy. Experiments across multiple benchmarks demonstrate that SEAL effectively reduces forgetting and enhances accuracy while maintaining a lower model size compared to prior methods. These results highlight the promise of combining NAS and selective expansion for efficient, adaptive learning in incremental scenarios.
[741] Token Reduction Should Go Beyond Efficiency in Generative Models – From Vision, Language to Multimodality
Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik
Main category: cs.LG
TL;DR: Token reduction in transformers is traditionally seen as an efficiency strategy, but this paper argues it should be a fundamental principle in generative modeling, influencing architecture and applications.
Details
Motivation: To reposition token reduction beyond efficiency, highlighting its broader impact on generative models, including multimodal integration, coherence, and training stability.
Method: The paper reframes token reduction as a core principle, exploring its potential in vision, language, and multimodal systems.
Result: Token reduction can enhance multimodal alignment, reduce hallucinations, maintain coherence, and improve training stability.
Conclusion: Token reduction should drive future model architectures and learning strategies, improving robustness and interpretability in generative modeling.
Abstract: In Transformer architectures, tokens, discrete units derived from raw data, are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Due to the quadratic computational complexity of transformer self-attention mechanisms, token reduction has primarily been used as an efficiency strategy. This is especially true in single vision and language domains, where it helps balance computational costs, memory usage, and inference latency. Despite these advances, this paper argues that token reduction should transcend its traditional efficiency-oriented role in the era of large generative models. Instead, we position it as a fundamental principle in generative modeling, critically influencing both model architecture and broader applications. Specifically, we contend that across vision, language, and multimodal systems, token reduction can: (i) facilitate deeper multimodal integration and alignment, (ii) mitigate “overthinking” and hallucinations, (iii) maintain coherence over long inputs, and (iv) enhance training stability, etc. We reframe token reduction as more than an efficiency measure. By doing so, we outline promising future directions, including algorithm design, reinforcement learning-guided token reduction, token optimization for in-context learning, and broader ML and scientific domains. We highlight its potential to drive new model architectures and learning strategies that improve robustness, increase interpretability, and better align with the objectives of generative modeling.
[742] Beyond Self-Repellent Kernels: History-Driven Target Towards Efficient Nonlinear MCMC on General Graphs
Jie Hu, Yi-Ting Ma, Do Young Eun
Main category: cs.LG
TL;DR: The paper introduces a history-driven target (HDT) framework for MCMC to enhance random walk algorithms on discrete state spaces, improving efficiency and compatibility with reversible and non-reversible methods.
Details
Motivation: Existing methods like SRRW have computational overhead and exclude non-reversible MCMC. HDT aims to overcome these limitations.
Method: HDT replaces the original target distribution with a history-dependent one, using local information and an LRU cache for scalability.
Result: HDT achieves near-zero variance, unbiased sampling, and compatibility with advanced MCMC methods, with demonstrated performance gains.
Conclusion: HDT provides a scalable, efficient solution for graph sampling, outperforming existing methods like SRRW.
Abstract: We propose a history-driven target (HDT) framework in Markov Chain Monte Carlo (MCMC) to improve any random walk algorithm on discrete state spaces, such as general undirected graphs, for efficient sampling from target distribution $\boldsymbol{\mu}$. With broad applications in network science and distributed optimization, recent innovations like the self-repellent random walk (SRRW) achieve near-zero variance by prioritizing under-sampled states through transition kernel modifications based on past visit frequencies. However, SRRW’s reliance on explicit computation of transition probabilities for all neighbors at each step introduces substantial computational overhead, while its strict dependence on time-reversible Markov chains excludes advanced non-reversible MCMC methods. To overcome these limitations, instead of direct modification of transition kernel, HDT introduces a history-dependent target distribution $\boldsymbol{\pi}[\mathbf{x}]$ to replace the original target $\boldsymbol{\mu}$ in any graph sampler, where $\mathbf{x}$ represents the empirical measure of past visits. This design preserves lightweight implementation by requiring only local information between the current and proposed states and achieves compatibility with both reversible and non-reversible MCMC samplers, while retaining unbiased samples with target distribution $\boldsymbol{\mu}$ and near-zero variance performance. Extensive experiments in graph sampling demonstrate consistent performance gains, and a memory-efficient Least Recently Used (LRU) cache ensures scalability to large general graphs.
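The following toy sampler illustrates the history-driven idea under an assumed SRRW-style polynomial form for the history-dependent target; the paper's exact construction of $\boldsymbol{\pi}[\mathbf{x}]$ and its LRU caching are richer than this. Note the acceptance ratio only needs local information about the current and proposed states.

```python
import random
from collections import defaultdict

def hdt_walk(neighbors, mu, steps=100_000, alpha=2.0):
    """Toy history-driven Metropolis-Hastings walk on an undirected graph.

    neighbors: dict node -> list of adjacent nodes
    mu:        dict node -> target probability
    The history-dependent target pi_i[x] ~ mu_i * (x_i / mu_i)^(-alpha),
    with x the empirical visit measure, is an assumed SRRW-style form;
    the paper's construction may differ.
    """
    visits = defaultdict(lambda: 1)     # smoothed visit counts
    total = len(mu)
    cur = random.choice(list(mu))
    samples = []
    for _ in range(steps):
        nxt = random.choice(neighbors[cur])
        def pi(i):  # under-sampled states (low x_i / mu_i) get boosted
            return mu[i] * ((visits[i] / total) / mu[i]) ** (-alpha)
        # Hastings correction for the uniform-neighbor proposal
        a = (pi(nxt) / pi(cur)) * (len(neighbors[cur]) / len(neighbors[nxt]))
        if random.random() < min(1.0, a):
            cur = nxt
        visits[cur] += 1
        total += 1
        samples.append(cur)
    return samples

ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
mu = {i: (i + 1) / 55 for i in range(10)}          # target proportional to i+1
out = hdt_walk(ring, mu, steps=50_000)
print(out.count(9) / len(out))                     # approaches mu[9] = 10/55
```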
[743] $K^2$VAE: A Koopman-Kalman Enhanced Variational AutoEncoder for Probabilistic Time Series Forecasting
Xingjian Wu, Xiangfei Qiu, Hongfan Gao, Jilin Hu, Bin Yang, Chenjuan Guo
Main category: cs.LG
TL;DR: The paper introduces $K^2$VAE, a VAE-based model for long-term probabilistic time series forecasting, addressing challenges in accuracy and efficiency by combining KoopmanNet and KalmanNet.
Details
Motivation: Existing methods for probabilistic time series forecasting struggle with long-term predictions due to nonlinear dynamics and inefficiency in generative models.Method: $K^2$VAE uses a KoopmanNet to linearize nonlinear time series and a KalmanNet to refine predictions and model uncertainty in the linear system.
Result: Experiments show $K^2$VAE outperforms state-of-the-art methods in both short- and long-term forecasting.
Conclusion: $K^2$VAE provides an efficient and accurate solution for long-term probabilistic time series forecasting.
Abstract: Probabilistic Time Series Forecasting (PTSF) plays a crucial role in decision-making across various fields, including economics, energy, and transportation. Most existing methods excel at short-term forecasting, while overlooking the hurdles of Long-term Probabilistic Time Series Forecasting (LPTSF). As the forecast horizon extends, the inherent nonlinear dynamics have a significant adverse effect on prediction accuracy, and make generative models inefficient by increasing the cost of each iteration. To overcome these limitations, we introduce $K^2$VAE, an efficient VAE-based generative model that leverages a KoopmanNet to transform nonlinear time series into a linear dynamical system, and devises a KalmanNet to refine predictions and model uncertainty in such a linear system, which reduces error accumulation in long-term forecasting. Extensive experiments demonstrate that $K^2$VAE outperforms state-of-the-art methods in both short- and long-term PTSF, providing a more efficient and accurate solution.
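The Koopman idea at the heart of this design can be sketched in a few lines: encode into a latent state, advance it with a purely linear map, and decode. This is our minimal illustration (names are ours); the paper's KoopmanNet/KalmanNet pairing is far richer.

```python
import torch
import torch.nn as nn

class KoopmanForecaster(nn.Module):
    """Minimal Koopman-style forecaster: nonlinear encoder/decoder around
    a linear latent dynamical system."""
    def __init__(self, in_dim, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, latent_dim))
        self.K = nn.Linear(latent_dim, latent_dim, bias=False)  # Koopman operator
        self.dec = nn.Linear(latent_dim, in_dim)

    def forward(self, x, horizon):
        z = self.enc(x)                      # (batch, latent_dim)
        preds = []
        for _ in range(horizon):
            z = self.K(z)                    # linear dynamics in latent space
            preds.append(self.dec(z))
        return torch.stack(preds, dim=1)     # (batch, horizon, in_dim)

model = KoopmanForecaster(in_dim=8)
print(model(torch.randn(4, 8), horizon=12).shape)  # torch.Size([4, 12, 8])
```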
[744] Leveraging Analytic Gradients in Provably Safe Reinforcement Learning
Tim Walter, Hannah Markgraf, Jonathan Külz, Matthias Althoff
Main category: cs.LG
TL;DR: The paper introduces the first safeguard for analytic gradient-based reinforcement learning, ensuring safety without compromising performance.
Details
Motivation: Autonomous robots in safety-critical applications need provable safety guarantees, but existing safeguards are limited to sampling-based methods, leaving gradient-based methods unsupported.Method: The authors analyze and adapt differentiable safeguards, integrate them with a state-of-the-art learning algorithm and differentiable simulation, and test on three control tasks.
Result: Safeguarded training is achieved without performance loss, demonstrating effectiveness.
Conclusion: The work successfully bridges the gap in safeguarding gradient-based reinforcement learning, providing a viable solution for safety-critical applications.
Abstract: The deployment of autonomous robots in safety-critical applications requires safety guarantees. Provably safe reinforcement learning is an active field of research that aims to provide such guarantees using safeguards. These safeguards should be integrated during training to reduce the sim-to-real gap. While there are several approaches for safeguarding sampling-based reinforcement learning, analytic gradient-based reinforcement learning often achieves superior performance from fewer environment interactions. However, there is no safeguarding approach for this learning paradigm yet. Our work addresses this gap by developing the first effective safeguard for analytic gradient-based reinforcement learning. We analyse existing, differentiable safeguards, adapt them through modified mappings and gradient formulations, and integrate them with a state-of-the-art learning algorithm and a differentiable simulation. Using numerical experiments on three control tasks, we evaluate how different safeguards affect learning. The results demonstrate safeguarded training without compromising performance.
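As one concrete instance of a differentiable safeguard, the sketch below maps unconstrained policy outputs into an action box with a smooth surjection, so analytic gradients pass through the safety layer. This is a minimal illustration; the paper's safeguards handle richer safe sets and modified gradient formulations.

```python
import torch

def safe_project(action, low, high):
    """Differentiable box-constraint safeguard. A tanh squashing keeps every
    output inside [low, high] while leaving gradients smooth, which is what
    analytic gradient-based RL needs (a hard clamp has zero gradient outside
    the box)."""
    mid = (high + low) / 2
    half = (high - low) / 2
    return mid + half * torch.tanh(action)

a = torch.randn(5, requires_grad=True)
safe = safe_project(a, torch.tensor(-1.0), torch.tensor(1.0))
safe.sum().backward()
print(a.grad)   # nonzero everywhere, unlike a hard clamp
```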
[745] Come Together, But Not Right Now: A Progressive Strategy to Boost Low-Rank Adaptation
Zhan Zhuang, Xiequn Wang, Wei Li, Yulong Zhang, Qiushi Huang, Shuhao Chen, Xuehao Wang, Yanbin Wei, Yuhe Nie, Kede Ma, Yu Zhang, Ying Wei
Main category: cs.LG
TL;DR: CoTo is a progressive training strategy for LoRA that stochastically deactivates adapters to improve optimization and generalization.
Details
Motivation: LoRA often locks adapters into suboptimal minima, limiting generalization and downstream operations like merging and pruning.Method: CoTo gradually increases adapters’ activation probability during fine-tuning, encouraging balanced optimization and broader loss landscape exploration.
Result: CoTo improves single-task performance, multi-task merging accuracy, pruning robustness, and reduces training overhead.
Conclusion: CoTo enhances LoRA’s effectiveness while remaining compatible with its variants, offering a practical solution for parameter-efficient fine-tuning.
Abstract: Low-rank adaptation (LoRA) has emerged as a leading parameter-efficient fine-tuning technique for adapting large foundation models, yet it often locks adapters into suboptimal minima near their initialization. This hampers model generalization and limits downstream operators such as adapter merging and pruning. Here, we propose CoTo, a progressive training strategy that gradually increases adapters’ activation probability over the course of fine-tuning. By stochastically deactivating adapters, CoTo encourages more balanced optimization and broader exploration of the loss landscape. We provide a theoretical analysis showing that CoTo promotes layer-wise dropout stability and linear mode connectivity, and we adopt a cooperative-game approach to quantify each adapter’s marginal contribution. Extensive experiments demonstrate that CoTo consistently boosts single-task performance, enhances multi-task merging accuracy, improves pruning robustness, and reduces training overhead, all while remaining compatible with diverse LoRA variants. Code is available at https://github.com/zwebzone/coto.
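The core mechanism is easy to sketch: sample per-adapter keep/drop masks whose activation probability grows over training. The linear ramp below is our assumption; the paper defines its own schedule.

```python
import torch

def coto_gate(num_adapters, progress):
    """Sample per-adapter keep/drop masks for one training step.

    progress in [0, 1]: fraction of fine-tuning completed. The ramp
    p = min(1, 2 * progress) -- adapters often inactive early on,
    always active in the second half -- is an assumed schedule.
    """
    p = min(1.0, 2.0 * progress)
    return torch.rand(num_adapters) < p   # True = adapter active this step

# e.g. scale each LoRA delta by its mask during the forward pass:
# out = base(x) + mask[i] * lora_i(x)
print(coto_gate(4, progress=0.1), coto_gate(4, progress=0.9))
```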
[746] Variational Inference Optimized Using the Curved Geometry of Coupled Free Energy
Kenric Nelson, Igor Oliveira, Amenah Al-Najafi, Fode Zhang, Hon Keung Tony Ng
Main category: cs.LG
TL;DR: A framework for variational inference using coupled free energy improves robustness and accuracy, especially for heavy-tailed distributions, with a 3% improvement in image reconstruction.
Details
Motivation: Extend variational inference to handle heavy-tailed distributions like generalized Pareto and Student's t by leveraging coupled free energy and curved geometry.Method: Introduces coupled free energy and applies it to a coupled variational autoencoder (CVAE), modifying distributions and cost functions for robustness.
Result: CVAE shows 3% better performance in image reconstruction (CelebA dataset) compared to standard VAE after 5 epochs.
Conclusion: The framework enhances model robustness against outliers and ensures stable training, validated by improved reconstruction metrics.
Abstract: We introduce an optimization framework for variational inference based on the coupled free energy, extending variational inference techniques to account for the curved geometry of the coupled exponential family. This family includes important heavy-tailed distributions such as the generalized Pareto and the Student’s t. By leveraging the coupled free energy, which is equal to the coupled evidence lower bound (ELBO) of the inverted probabilities, we improve the accuracy and robustness of the learned model. The framework builds on the coupled generalization of the Fisher information metric and the affine connection. The method is applied to the design of a coupled variational autoencoder (CVAE). By using the coupling for both the distributions and cost functions, the reconstruction metric is derived to still be the mean-square average loss with modified constants. The novelty comes from sampling the heavy-tailed latent distribution with its associated coupled probability, which has faster decaying tails. The result is the ability to train a model robust against severe outliers, while assuring that the training process is stable. The Wasserstein-2 or Fréchet Inception Distance of the reconstructed CelebA images shows the CVAE has a 3% improvement over the VAE after 5 epochs of training.
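One ingredient that can be sketched directly is reparameterized sampling from a heavy-tailed Student-t latent, built here from a Normal and a Chi-squared draw so gradients flow; the coupled-probability tail correction the paper applies on top is not reproduced.

```python
import torch
from torch.distributions import Normal, Chi2

def sample_student_t(df, shape):
    """Reparameterized Student-t draw: z = eps / sqrt(v / df) with
    eps ~ N(0, 1) and v ~ Chi2(df). Small df gives heavy tails."""
    eps = Normal(0.0, 1.0).rsample(shape)
    v = Chi2(df).rsample(shape)
    return eps / torch.sqrt(v / df)

z = sample_student_t(df=torch.tensor(3.0), shape=(10_000,))
print(z.abs().max().item())   # occasional large outliers, unlike a Gaussian
```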
[747] LaMAGIC2: Advanced Circuit Formulations for Language Model-Based Analog Topology Generation
Chen-Chia Chang, Wan-Hsuan Lin, Yikang Shen, Yiran Chen, Xin Zhang
Main category: cs.LG
TL;DR: LaMAGIC2 introduces a succinct float-input canonical formulation (SFCI) for analog topology generation, improving efficiency and precision over prior methods.
Details
Motivation: Automating analog topology design is essential due to manual efforts and inefficiencies in existing methods, such as high token length and low numeric precision sensitivity.Method: LaMAGIC2 uses identifier-based representations (SFCI) to reduce token length complexity to O(|V|) and enhance numeric precision sensitivity.
Result: LaMAGIC2 achieves 34% higher success rates under tight tolerances and 10X lower MSEs, with improved transferability for larger circuits.
Conclusion: LaMAGIC2 is a robust framework for analog topology generation, addressing key challenges in efficiency and precision.
Abstract: Automation of analog topology design is crucial given the customized requirements of modern applications and the heavy manual engineering effort they currently demand. The state-of-the-art work applies a sequence-to-sequence approach and supervised finetuning on language models to generate topologies given user specifications. However, its circuit formulation is inefficient due to $O(|V|^2)$ token length and suffers from low precision sensitivity to numeric inputs. In this work, we introduce LaMAGIC2, a succinct float-input canonical formulation with identifier (SFCI) for language model-based analog topology generation. SFCI addresses these challenges by improving component-type recognition through identifier-based representations, reducing token length complexity to $O(|V|)$, and enhancing numeric precision sensitivity for better performance under tight tolerances. Our experiments demonstrate that LaMAGIC2 achieves 34% higher success rates under a tight tolerance of 0.01 and 10X lower MSEs compared to a prior method. LaMAGIC2 also exhibits better transferability for circuits with more vertices with up to 58.5% improvement. These advancements establish LaMAGIC2 as a robust framework for analog topology generation.
[748] A Free Probabilistic Framework for Analyzing the Transformer-based Language Models
Swagatam Das
Main category: cs.LG
TL;DR: The paper develops an operator-theoretic framework based on free probability theory for analyzing Transformer language models, reinterpreting attention as non-commutative convolution.
Details
Motivation: To provide a principled theoretical account of representation dynamics in Transformer-based language models.Method: Models token embeddings and attention mechanisms as self-adjoint operators in a tracial $W^*$-probability space, describing representation propagation via free additive convolution.
Result: Yields a spectral dynamic system interpretation of deep Transformers and entropy-based generalization bounds under freeness assumptions.
Conclusion: Offers a principled, though theoretical, perspective on positional encoding, spectral evolution, and representational complexity in large language models.
Abstract: We present a formal operator-theoretic framework for analyzing Transformer-based language models using free probability theory. By modeling token embeddings and attention mechanisms as self-adjoint operators in a tracial $W^*$-probability space, we reinterpret attention as non-commutative convolution and describe representation propagation via free additive convolution. This leads to a spectral dynamic system interpretation of deep Transformers. We derive entropy-based generalization bounds under freeness assumptions and provide insight into positional encoding, spectral evolution, and representational complexity. This work offers a principled, though theoretical, perspective on structural dynamics in large language models.
[749] Machine Learning Model Integration with Open World Temporal Logic for Process Automation
Dyuman Aditya, Colton Payne, Mario Leiva, Paulo Shakarian
Main category: cs.LG
TL;DR: The paper introduces a method to integrate ML model outputs with PyReason, a temporal logic programming engine, for adaptive decision-making in complex workflows.
Details
Motivation: The challenge lies in translating ML model outputs into actionable decisions in operational workflows.Method: Integrates ML model outputs with PyReason, converting them into logical facts for dynamic reasoning and decision-making.
Result: Enables real-time adaptive decision-making with temporal reasoning, knowledge graph integration, and explainability.
Conclusion: The integration of ML and PyReason offers a powerful system for automating complex processes across various domains.
Abstract: Recent advancements in Machine Learning (ML) have yielded powerful models capable of extracting structured information from diverse and complex data sources. However, a significant challenge lies in translating these perceptual or extractive outputs into actionable, reasoned decisions within complex operational workflows. To address these challenges, this paper introduces a novel approach that integrates the outputs from various machine learning models directly with the PyReason framework, an open-world temporal logic programming reasoning engine. PyReason’s foundation in generalized annotated logic allows for the seamless incorporation of real-valued outputs (e.g., probabilities, confidence scores) from diverse ML models, treating them as truth intervals within its logical framework. Crucially, PyReason provides mechanisms, implemented in Python, to continuously poll ML model outputs, convert them into logical facts, and dynamically recompute the minimal model, ensuring real-time adaptive decision-making. Furthermore, its native support for temporal reasoning, knowledge graph integration, and fully explainable interface traces enables sophisticated analysis over time-sensitive process data and existing organizational knowledge. By combining the strengths of perception and extraction from ML models with the logical deduction and transparency of PyReason, we aim to create a powerful system for automating complex processes. This integration finds utility across numerous domains, including manufacturing, healthcare, and business operations.
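A schematic of the polling loop described above: model confidences become truth intervals attached to logical facts. All names here (`add_fact`, the predicate, the margin) are placeholders; this does not reproduce PyReason's actual API.

```python
import time

def to_truth_interval(prob, margin=0.05):
    """Map a model confidence to a [lower, upper] truth interval."""
    return (max(0.0, prob - margin), min(1.0, prob + margin))

def poll_and_assert(model, stream, add_fact):
    """Continuously turn ML outputs into logical facts.

    `model`, `stream`, and `add_fact` are hypothetical stand-ins: in the
    paper's system, facts would be handed to the reasoning engine."""
    for item in stream:
        prob = model(item)                          # e.g. defect probability
        lower, upper = to_truth_interval(prob)
        add_fact(f"defective({item['id']})", lower, upper,
                 timestep=int(time.time()))

# toy usage
facts = []
poll_and_assert(model=lambda it: 0.93,
                stream=[{"id": "part_17"}],
                add_fact=lambda atom, lo, hi, timestep: facts.append((atom, lo, hi)))
print(facts)   # [('defective(part_17)', 0.88, 0.98)] up to float rounding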
[750] Tractable Representation Learning with Probabilistic Circuits
Steven Braun, Sahil Sidheekh, Antonio Vergari, Martin Mundt, Sriraam Natarajan, Kristian Kersting
Main category: cs.LG
TL;DR: APCs introduce a novel framework for representation learning using probabilistic circuits, outperforming existing methods in reconstruction, embedding quality, and robustness to missing data.
Details
Motivation: Address the underexplored area of representation learning in probabilistic circuits (PCs) by leveraging their tractable inference capabilities.Method: Develop autoencoding probabilistic circuits (APCs) to model data and embeddings jointly, using tractable probabilistic inference and integrating a neural decoder for hybrid training.
Result: APCs outperform PC-based autoencoding methods in reconstruction, generate competitive embeddings, and handle missing data better than neural autoencoders.
Conclusion: APCs are a powerful, flexible method for representation learning, showcasing the potential of PCs for robust inference, out-of-distribution detection, and knowledge distillation.
Abstract: Probabilistic circuits (PCs) are powerful probabilistic models that enable exact and tractable inference, making them highly suitable for probabilistic reasoning and inference tasks. While dominant in neural networks, representation learning with PCs remains underexplored, with prior approaches relying on external neural embeddings or activation-based encodings. To address this gap, we introduce autoencoding probabilistic circuits (APCs), a novel framework leveraging the tractability of PCs to model probabilistic embeddings explicitly. APCs extend PCs by jointly modeling data and embeddings, obtaining embedding representations through tractable probabilistic inference. The PC encoder allows the framework to natively handle arbitrary missing data and is seamlessly integrated with a neural decoder in a hybrid, end-to-end trainable architecture enabled by differentiable sampling. Our empirical evaluation demonstrates that APCs outperform existing PC-based autoencoding methods in reconstruction quality, generate embeddings competitive with, and exhibit superior robustness in handling missing data compared to neural autoencoders. These results highlight APCs as a powerful and flexible representation learning method that exploits the probabilistic inference capabilities of PCs, showing promising directions for robust inference, out-of-distribution detection, and knowledge distillation.
[751] BEAVER: Building Environments with Assessable Variation for Evaluating Multi-Objective Reinforcement Learning
Ruohong Liu, Jack Umenberger, Yize Chen
Main category: cs.LG
TL;DR: The paper explores the scalability and generalization challenges of RL-based agents in building energy management, proposing a multi-objective contextual RL framework and benchmarking its performance.
Details
Motivation: To address the lack of scalability and generalization in RL approaches for building energy management across diverse environments and operational scenarios.Method: Formalizes the generalization space for cross-environment, multi-objective tasks, parameterizes contextual information, and constructs a benchmark for evaluating RL algorithms.
Result: Existing multi-objective RL methods achieve reasonable trade-offs but degrade under certain environmental variations, highlighting the need for dynamics-dependent contextual learning.
Conclusion: Incorporating contextual information into policy learning is crucial for improving the generalization of RL methods in building energy management.
Abstract: Recent years have seen significant advancements in designing reinforcement learning (RL)-based agents for building energy management. While individual success is observed in simulated or controlled environments, the scalability of RL approaches in terms of efficiency and generalization across building dynamics and operational scenarios remains an open question. In this work, we formally characterize the generalization space for the cross-environment, multi-objective building energy management task, and formulate the multi-objective contextual RL problem. Such a formulation helps understand the challenges of transferring learned policies across varied operational contexts such as climate and heat convection dynamics under multiple control objectives such as comfort level and energy consumption. We provide a principled way to parameterize such contextual information in realistic building RL environments, and construct a novel benchmark to facilitate the evaluation of generalizable RL algorithms in practical building control tasks. Our results show that existing multi-objective RL methods are capable of achieving reasonable trade-offs between conflicting objectives. However, their performance degrades under certain environment variations, underscoring the importance of incorporating dynamics-dependent contextual information into the policy learning process.
[752] Imitation Learning in Continuous Action Spaces: Mitigating Compounding Error without Interaction
Thomas T. Zhang, Daniel Pfrommer, Nikolai Matni, Max Simchowitz
Main category: cs.LG
TL;DR: The paper addresses imitation learning in continuous systems, proposing minimal interventions like ‘action chunking’ and ’noise injection’ to mitigate compounding errors.
Details
Motivation: Imitation learning in physical systems (e.g., robotics) is complex due to compounding errors, requiring advanced solutions beyond expert trajectories.Method: Proposes ‘action chunking’ for stable systems and ’noise injection’ for unstable systems, drawing from control theory and reinforcement learning.
Result: The interventions provably mitigate compounding errors, aligning with practical robot learning methods but offering distinct benefits.
Conclusion: The work bridges control theory and reinforcement learning, revealing novel insights for stable imitation learning in continuous systems.
Abstract: We study the problem of imitating an expert demonstrator in a continuous state-and-action dynamical system. While imitation learning in discrete settings such as autoregressive language modeling has seen immense success and popularity in recent years, imitation in physical settings such as autonomous driving and robot learning has proven comparably more complex due to the compounding errors problem, often requiring elaborate set-ups to perform stably. Recent work has demonstrated that even in benign settings, exponential compounding errors are unavoidable when learning solely from expert-controlled trajectories, suggesting the need for more advanced policy parameterizations or data augmentation. To this end, we present minimal interventions that provably mitigate compounding errors in continuous state-and-action imitation learning. When the system is open-loop stable, we prescribe “action chunking,” i.e., predicting and playing sequences of actions in open-loop; when the system is possibly unstable, we prescribe “noise injection,” i.e., adding noise during expert demonstrations. These interventions align with popular choices in modern robot learning, though the benefits we derive are distinct from the effects they were designed to target. Our results draw insights and tools from both control theory and reinforcement learning; however, our analysis reveals novel considerations that do not naturally arise when either literature is considered in isolation.
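Action chunking itself is easy to state in code: query the policy for a sequence and play it open-loop before re-observing. A minimal sketch with hypothetical `policy` and `env_step` callables.

```python
import numpy as np

def rollout_chunked(policy, env_step, obs, horizon, chunk=8):
    """Execute a policy with action chunking: query the policy once, then
    play `chunk` actions open-loop before re-observing. `policy` returns a
    (chunk, act_dim) array; `env_step` returns the next observation."""
    t = 0
    while t < horizon:
        actions = policy(obs)              # predict a whole chunk at once
        for a in actions[: min(chunk, horizon - t)]:
            obs = env_step(a)              # no replanning inside the chunk
            t += 1
    return obs

# toy usage: a "policy" that emits zeros, an "env" that echoes the action
obs = rollout_chunked(policy=lambda o: np.zeros((8, 2)),
                      env_step=lambda a: a, obs=np.zeros(2), horizon=20)
```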
[753] Fast Last-Iterate Convergence of SGD in the Smooth Interpolation Regime
Amit Attia, Matan Schliserman, Uri Sherman, Tomer Koren
Main category: cs.LG
TL;DR: The paper analyzes the convergence of SGD for smooth convex objectives in the interpolation regime, focusing on the last iterate’s behavior with large stepsizes. It provides improved convergence rates under specific conditions.
Details
Motivation: Understanding SGD's last iterate behavior is crucial for over-parameterized models, continual learning, and solving linear systems.Method: Analyzes SGD on β-smooth convex loss functions with stepsize 0 < η < 2/β, deriving expected excess risk bounds.
Result: Establishes improved convergence rates, including a near-optimal $\widetilde{O}(1/T + \sigma_\star/\sqrt{T})$ rate for well-tuned stepsizes and $O(1/\sqrt{T})$ when $\sigma_\star = 0$.
Conclusion: The work extends and improves existing results, offering tighter bounds for SGD convergence in the interpolation regime.
Abstract: We study population convergence guarantees of stochastic gradient descent (SGD) for smooth convex objectives in the interpolation regime, where the noise at optimum is zero or near zero. The behavior of the last iterate of SGD in this setting – particularly with large (constant) stepsizes – has received growing attention in recent years due to implications for the training of over-parameterized models, as well as to analyzing forgetting in continual learning and to understanding the convergence of the randomized Kaczmarz method for solving linear systems. We establish that after $T$ steps of SGD on $\beta$-smooth convex loss functions with stepsize $0 < \eta < 2/\beta$, the last iterate exhibits expected excess risk $\widetilde{O}(\frac{1}{\eta (2-\beta \eta) T^{1-\beta\eta/2}} + \frac{\eta}{(2-\beta\eta)^2} T^{\beta\eta/2} \sigma_\star^2)$, where $\sigma_\star^2$ denotes the variance of the stochastic gradients at the optimum. In particular, for a well-tuned stepsize we obtain a near optimal $\widetilde{O}(1/T + \sigma_\star/\sqrt{T})$ rate for the last iterate, extending the results of Varre et al. (2021) beyond least squares regression; and when $\sigma_\star=0$ we obtain a rate of $\smash{O(1/\sqrt T)}$ with $\eta=1/\beta$, improving upon the best-known $\smash{O(T^{-1/4})}$ rate recently established by Evron et al. (2025) in the special case of realizable linear regression.
[754] Vidar: Embodied Video Diffusion Model for Generalist Bimanual Manipulation
Yao Feng, Hengkai Tan, Xinyi Mao, Guodong Liu, Shuhe Huang, Chendong Xiang, Hang Su, Jun Zhu
Main category: cs.LG
TL;DR: Vidar is a two-stage framework using video diffusion pre-training and masked inverse dynamics for bimanual robotic manipulation, achieving strong generalization with minimal human demonstrations.
Details
Motivation: Data scarcity and embodiment heterogeneity hinder scaling in bimanual robotic manipulation, necessitating a scalable and generalizable solution.Method: Vidar combines large-scale video diffusion pre-training on 750K multi-view videos and a masked inverse dynamics model for action prediction without pixel-level labels.
Result: Vidar generalizes to unseen tasks and backgrounds with only 20 minutes of human demonstrations (1% of typical data), outperforming state-of-the-art methods.
Conclusion: Video foundation models with masked action prediction can enable scalable and generalizable robotic manipulation in diverse real-world settings.
Abstract: Bimanual robotic manipulation, which involves the coordinated control of two robotic arms, is foundational for solving challenging tasks. Despite recent progress in general-purpose manipulation, data scarcity and embodiment heterogeneity remain serious obstacles to further scaling up in bimanual settings. In this paper, we introduce Video Diffusion for Action Reasoning (Vidar), a two-stage framework that leverages large-scale, diffusion-based video pre-training and a novel masked inverse dynamics model for action prediction. We pre-train the video diffusion model on 750K multi-view videos from three real-world bimanual robot platforms, utilizing a unified observation space that encodes robot, camera, task, and scene contexts. Our masked inverse dynamics model learns masks to extract action-relevant information from generated trajectories without requiring pixel-level labels, and the masks can effectively generalize to unseen backgrounds. Our experiments demonstrate that with only 20 minutes of human demonstrations on an unseen robot platform (only 1% of typical data requirements), Vidar generalizes to unseen tasks and backgrounds with strong semantic understanding, surpassing state-of-the-art methods. Our findings highlight the potential of video foundation models, coupled with masked action prediction, to enable scalable and generalizable robotic manipulation in diverse real-world settings.
[755] Deep RL Dual Sourcing Inventory Management with Supply and Capacity Risk Awareness
Defeng Liu, Ying Liu, Carson Eisenach
Main category: cs.LG
TL;DR: The paper proposes using reinforcement learning (RL) with intervention models and pre-trained deep learning (DL) models to solve large-scale stochastic optimization problems, demonstrated on a multi-sourcing inventory management problem.
Details
Motivation: To efficiently explore solution spaces in stochastic optimization by leveraging DL models for simulating and composing stochastic processes.Method: Employs deep RL models for learning and forecasting supply chain processes, introduces a constraint coordination mechanism for dual cost forecasting, and modularizes complex constraints into scalable DL modules.
Result: Improved performance on large real-world datasets by breaking down complex supply chain processes into composable DL modules.
Conclusion: The approach shows promise for large-scale stochastic optimization but highlights open problems for future research to further validate such models.
Abstract: In this work, we study how to efficiently apply reinforcement learning (RL) for solving large-scale stochastic optimization problems by leveraging intervention models. The key to the proposed methodology is to better explore the solution space by simulating and composing the stochastic processes using pre-trained deep learning (DL) models. We demonstrate our approach on a challenging real-world application, the multi-sourcing multi-period inventory management problem in supply chain optimization. In particular, we employ deep RL models for learning and forecasting the stochastic supply chain processes under a range of assumptions. Moreover, we also introduce a constraint coordination mechanism, designed to forecast dual costs given the cross-products constraints in the inventory network. We highlight that instead of directly modeling the complex physical constraints into the RL optimization problem and solving the stochastic problem as a whole, our approach breaks down those supply chain processes into scalable and composable DL modules, leading to improved performance on large real-world datasets. We also outline open problems for future research to further investigate the efficacy of such models.
[756] The Origin of Self-Attention: Pairwise Affinity Matrices in Feature Selection and the Emergence of Self-Attention
Giorgio Roffo
Main category: cs.LG
TL;DR: The paper connects self-attention in Transformers to the broader concept of affinity-based computation, highlighting Infinite Feature Selection (Inf-FS) as a foundational approach.
Details
Motivation: To unify self-attention mechanisms across domains by tracing their origins to affinity matrices and demonstrating their shared computational principles.Method: Comparative analysis of self-attention and Inf-FS, focusing on how affinity matrices are defined and applied in each.
Result: Self-attention is shown as a special case of Inf-FS, with both relying on pairwise relationships but differing in affinity matrix construction and propagation.
Conclusion: The paper unifies diverse machine learning models under the affinity-based computation paradigm, emphasizing their common mathematical foundation.
Abstract: The self-attention mechanism, now central to deep learning architectures such as Transformers, is a modern instance of a more general computational principle: learning and using pairwise affinity matrices to control how information flows through a model. This paper traces the conceptual origins of self-attention across multiple domains, including computer vision, natural language processing, and graph learning, through their shared reliance on an affinity matrix, denoted as A. We highlight Infinite Feature Selection (Inf-FS) as a foundational approach that generalizes the idea of affinity-based weighting. Unlike the fixed dot-product structure used in Transformers, Inf-FS defines A either through domain knowledge or by learning, and computes feature relevance through multi-hop propagation over the affinity graph. From this perspective, self-attention can be seen as a special case of Inf-FS: it uses a single-hop affinity computation where A is dynamically built from token similarities. We argue that the underlying structure, reasoning over pairwise relationships, is preserved across both approaches, and the key differences lie in how the affinity matrix is defined and applied. By situating self-attention within the broader paradigm of affinity-based computation, we unify several strands of machine learning research and highlight a common mathematical foundation that underpins diverse models and tasks.
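The single-hop vs. multi-hop distinction can be shown in a few lines: self-attention applies a similarity matrix A once, while Inf-FS aggregates walks of every length via the convergent series $\sum_{l \ge 1} \beta^l A^l = (I - \beta A)^{-1} - I$. A minimal numerical sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))            # 6 features/tokens, 4 dimensions

# Single-hop: a softmax similarity matrix, as in self-attention.
logits = X @ X.T / np.sqrt(X.shape[1])
A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Multi-hop (Inf-FS style): sum over all walk lengths,
# S = (I - bA)^{-1} - I, convergent when b < 1 / spectral_radius(A).
b = 0.9 / np.abs(np.linalg.eigvals(A)).max()
S = np.linalg.inv(np.eye(len(A)) - b * A) - np.eye(len(A))
relevance = S.sum(axis=1)                  # affinity aggregated over all path lengths
print(relevance.round(3))
```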
[757] GASPnet: Global Agreement to Synchronize Phases
Andrea Alamia, Sabine Muzellec, Thomas Serre, Rufin VanRullen
Main category: cs.LG
TL;DR: A novel brain-inspired mechanism combines Transformer-like attention with neuroscience’s binding by synchrony theory, improving noise robustness and generalization in neural networks.
Details
Motivation: To address the limitations of previous global routing mechanisms in multi-classification tasks by integrating neuroscience insights, specifically binding by synchrony, into neural networks.Method: Incorporates angular phases into convolutional networks, aligns phases using Kuramoto dynamics, and enhances neuron interactions based on phase similarity.
Result: Outperforms CNNs in accuracy, noise robustness, and generalization on datasets with superimposed images.
Conclusion: The proposed mechanism effectively tackles the visual binding problem by merging neuroscience and machine learning principles.
Abstract: In recent years, Transformer architectures have revolutionized most fields of artificial intelligence, relying on an attentional mechanism based on the agreement between keys and queries to select and route information in the network. In previous work, we introduced a novel, brain-inspired architecture that leverages a similar implementation to achieve a global ‘routing by agreement’ mechanism. Such a system modulates the network’s activity by matching each neuron’s key with a single global query, pooled across the entire network. Acting as a global attentional system, this mechanism improves noise robustness over baseline levels but is insufficient for multi-classification tasks. Here, we improve on this work by proposing a novel mechanism that combines aspects of the Transformer attentional operations with a compelling neuroscience theory, namely, binding by synchrony. This theory proposes that the brain binds together features by synchronizing the temporal activity of neurons encoding those features. This allows the binding of features from the same object while efficiently disentangling those from distinct objects. We drew inspiration from this theory and incorporated angular phases into all layers of a convolutional network. After achieving phase alignment via Kuramoto dynamics, we use this approach to enhance operations between neurons with similar phases and suppress those with opposite phases. We test the benefits of this mechanism on two datasets: one composed of pairs of digits and one composed of a combination of an MNIST item superimposed on a CIFAR-10 image. Our results reveal better accuracy than CNN networks, proving more robust to noise and with better generalization abilities. Overall, we propose a novel mechanism that addresses the visual binding problem in neural networks by leveraging the synergy between neuroscience and machine learning.
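The Kuramoto dynamics underlying the phase alignment are compact: each phase moves toward the phases of units it is strongly coupled to. A minimal sketch of the synchronization step (not GASPnet's full architecture):

```python
import numpy as np

def kuramoto_step(theta, K, eps=0.1):
    """One Kuramoto update. theta: (n,) phases, K: (n, n) coupling matrix.
    Positive coupling pulls phases together; negative pushes them apart."""
    diff = theta[None, :] - theta[:, None]        # pairwise phase differences
    return theta + eps * (K * np.sin(diff)).sum(axis=1)

theta = np.random.uniform(0, 2 * np.pi, 8)
K = np.ones((8, 8)) / 8                           # uniform positive coupling
for _ in range(200):
    theta = kuramoto_step(theta, K)
r = np.abs(np.exp(1j * theta).mean())             # Kuramoto order parameter
print(round(r, 3))                                # near 1.0: phases synchronized
```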
[758] PyG 2.0: Scalable Learning on Real World Graphs
Matthias Fey, Jinu Sunil, Akihiro Nitta, Rishi Puri, Manan Shah, Blaž Stojanovič, Ramona Bendias, Alexandria Barghi, Vid Kocijan, Zecheng Zhang, Xinwei He, Jan Eric Lenssen, Jure Leskovec
Main category: cs.LG
TL;DR: PyG 2.0 introduces major updates for scalability and real-world applications, supporting heterogeneous/temporal graphs and large-scale learning.
Details
Motivation: To enhance PyG's capabilities for handling large-scale and diverse graph learning tasks.Method: Updated framework architecture with support for heterogeneous/temporal graphs, scalable stores, and optimizations.
Result: Improved scalability and application support, enabling efficient large-scale graph learning.
Conclusion: PyG 2.0 is a significant advancement, empowering researchers and practitioners in graph learning.
Abstract: PyG (PyTorch Geometric) has evolved significantly since its initial release, establishing itself as a leading framework for Graph Neural Networks. In this paper, we present PyG 2.0 (and its subsequent minor versions), a comprehensive update that introduces substantial improvements in scalability and real-world application capabilities. We detail the framework’s enhanced architecture, including support for heterogeneous and temporal graphs, scalable feature/graph stores, and various optimizations, enabling researchers and practitioners to tackle large-scale graph learning problems efficiently. Over the recent years, PyG has been supporting graph learning in a large variety of application areas, which we will summarize, while providing a deep dive into the important areas of relational deep learning and large language modeling.
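For readers new to the heterogeneous-graph support, a minimal example with PyG's HeteroData container (a real PyG class; the toy sizes and node/edge types are ours):

```python
import torch
from torch_geometric.data import HeteroData

data = HeteroData()
data['paper'].x = torch.randn(100, 16)        # 100 papers, 16 features each
data['author'].x = torch.randn(40, 8)         # 40 authors, 8 features each
data['author', 'writes', 'paper'].edge_index = torch.stack([
    torch.randint(0, 40, (300,)),             # source author ids
    torch.randint(0, 100, (300,)),            # target paper ids
])
print(data)                                   # node/edge types and tensor shapes
```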
[759] A Learning-based Domain Decomposition Method
Rui Wu, Nikola Kovachki, Burigede Liu
Main category: cs.LG
TL;DR: A learning-based domain decomposition method (L-DDM) is proposed to efficiently model complex geometries using neural operators, outperforming traditional methods.
Details
Motivation: The need for scalable and efficient modeling of large, complex structures in engineering, where traditional methods like FEM struggle with computational cost and complexity.Method: Uses a pre-trained neural operator (PPNO) within a domain decomposition scheme to approximate solutions for complex PDEs, including those with discontinuous microstructures.
Result: The method outperforms state-of-the-art techniques, offering resolution-invariance and generalization to unseen microstructural patterns.
Conclusion: L-DDM provides a scalable and efficient solution for complex PDEs, bridging the gap between neural networks and real-world engineering problems.
Abstract: Recent developments in mechanical, aerospace, and structural engineering have driven a growing need for efficient ways to model and analyse structures at much larger and more complex scales than before. While established numerical methods like the Finite Element Method remain reliable, they often struggle with computational cost and scalability when dealing with large and geometrically intricate problems. In recent years, neural network-based methods have shown promise because of their ability to efficiently approximate nonlinear mappings. However, most existing neural approaches are still largely limited to simple domains, which makes it difficult to apply to real-world PDEs involving complex geometries. In this paper, we propose a learning-based domain decomposition method (L-DDM) that addresses this gap. Our approach uses a single, pre-trained neural operator, originally trained on simple domains, as a surrogate model within a domain decomposition scheme, allowing us to tackle large and complicated domains efficiently. We provide a general theoretical result on the existence of neural operator approximations in the context of domain decomposition solution of abstract PDEs. We then demonstrate our method by accurately approximating solutions to elliptic PDEs with discontinuous microstructures in complex geometries, using a physics-pretrained neural operator (PPNO). Our results show that this approach not only outperforms current state-of-the-art methods on these challenging problems, but also offers resolution-invariance and strong generalization to microstructural patterns unseen during training.
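The domain decomposition skeleton can be demonstrated on a 1D Poisson problem, with a direct tridiagonal solver standing in for the pretrained neural operator; this is a classical alternating Schwarz sketch, not the paper's L-DDM.

```python
import numpy as np

def local_solve(f, left, right, h):
    """Direct solve of -u'' = f on a subdomain with Dirichlet data; stands
    in for the pretrained local surrogate."""
    n = len(f)
    A = (np.diag(np.full(n, 2.0)) - np.diag(np.ones(n - 1), 1)
         - np.diag(np.ones(n - 1), -1)) / h**2
    b = f.copy()
    b[0] += left / h**2
    b[-1] += right / h**2
    return np.linalg.solve(A, b)

# -u'' = 1 on (0, 1), u(0) = u(1) = 0; exact solution x(1 - x)/2
n, h = 99, 1.0 / 100
x = np.linspace(h, 1 - h, n)
f = np.ones(n)
u = np.zeros(n)
lo, hi = 60, 40          # overlapping subdomains: [0, lo) and [hi, n)
for _ in range(30):      # alternating Schwarz sweeps
    u[:lo] = local_solve(f[:lo], 0.0, u[lo], h)     # boundary from current iterate
    u[hi:] = local_solve(f[hi:], u[hi - 1], 0.0, h)
print(np.abs(u - x * (1 - x) / 2).max())            # converges to the exact solution
```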
[760] SETOL: A Semi-Empirical Theory of (Deep) Learning
Charles H Martin, Christopher Hinrichs
Main category: cs.LG
TL;DR: SETOL explains SOTA NN performance using heavy-tailed metrics, introduces ERG, and validates with MLP and SOTA models.
Details
Motivation: To formally explain the origin of heavy-tailed metrics in NN performance prediction without needing training/testing data.Method: Uses statistical mechanics, random matrix theory, and quantum chemistry to derive SETOL, introducing ERG as a new metric.
Result: Validates SETOL on a 3-layer MLP and SOTA NNs, showing alignment between HTSR alpha and SETOL ERG metrics.
Conclusion: SETOL provides a theoretical framework for NN performance prediction and introduces ERG as a key metric.
Abstract: We present a SemiEmpirical Theory of Learning (SETOL) that explains the remarkable performance of State-Of-The-Art (SOTA) Neural Networks (NNs). We provide a formal explanation of the origin of the fundamental quantities in the phenomenological theory of Heavy-Tailed Self-Regularization (HTSR): the heavy-tailed power-law layer quality metrics, alpha and alpha-hat. In prior work, these metrics have been shown to predict trends in the test accuracies of pretrained SOTA NN models, importantly, without needing access to either testing or training data. Our SETOL uses techniques from statistical mechanics as well as advanced methods from random matrix theory and quantum chemistry. The derivation suggests new mathematical preconditions for ideal learning, including a new metric, ERG, which is equivalent to applying a single step of the Wilson Exact Renormalization Group. We test the assumptions and predictions of SETOL on a simple 3-layer multilayer perceptron (MLP), demonstrating excellent agreement with the key theoretical assumptions. For SOTA NN models, we show how to estimate the individual layer qualities of a trained NN by simply computing the empirical spectral density (ESD) of the layer weight matrices and plugging this ESD into our SETOL formulas. Notably, we examine the performance of the HTSR alpha and the SETOL ERG layer quality metrics, and find that they align remarkably well, both on our MLP and on SOTA NNs.
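A rough way to reproduce the layer-quality pipeline: form the ESD of $W^TW$ and fit a tail exponent, here with a simple Hill estimator. The HTSR/SETOL papers fit the power law far more carefully, and a random Gaussian layer as below is not actually heavy-tailed; this only sketches the mechanics.

```python
import numpy as np

def layer_alpha(W, tail_frac=0.2):
    """Crude HTSR-style layer quality: Hill estimate of the power-law
    exponent on the upper tail of the ESD of W^T W."""
    evals = np.sort(np.linalg.eigvalsh(W.T @ W))       # empirical spectral density
    tail = evals[-max(2, int(tail_frac * len(evals))):]
    # Hill estimator: alpha = 1 + n / sum(log(x_i / x_min))
    return 1.0 + len(tail[1:]) / np.log(tail[1:] / tail[0]).sum()

W = np.random.standard_normal((300, 100)) / np.sqrt(300)
print(round(layer_alpha(W), 2))   # well-trained layers typically report alpha ~ [2, 6]
```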
[761] Weak-to-Strong Generalization with Failure Trajectories: A Tree-based Approach to Elicit Optimal Policy in Strong Models
Ruimeng Ye, Zihan Wang, Yang Xiao, Zinan Ling, Manling Li, Bo Hui
Main category: cs.LG
TL;DR: The paper extends Weak-to-Strong Generalization (W2SG) to complex interactive decision-making by fine-tuning strong models with weak model trajectories, including failures, and using trajectory trees with MCTS for optimization.
Details
Motivation: To enhance the capabilities of strong models by leveraging weak model supervision, inspired by human learning from both successes and failures.Method: Fine-tune strong models with weak model trajectories, generalize failure experience, and optimize using trajectory trees and MCTS.
Result: Empirical evaluations show improved reasoning and decision-making, with theoretical guarantees for effectiveness.
Conclusion: The framework is scalable and robust, significantly advancing W2SG in complex environments.
Abstract: Weak-to-Strong generalization (W2SG) is a new trend to elicit the full capabilities of a strong model with supervision from a weak model. While existing W2SG studies focus on simple tasks like binary classification, we extend this paradigm to complex interactive decision-making environments. Specifically, we fine-tune a strong model with trajectories of intermediate actions generated by a weak model. Motivated by the human learning process, we propose to generalize not only success knowledge but also failure experience so that the strong model can learn from failed trajectories accumulated by weak models. To effectively and efficiently elicit the potential of strong agents, we further construct "trajectory trees," a hierarchical representation that organizes weak model-generated action trajectories, coupled with Monte Carlo Tree Search (MCTS) to optimize the strong model. Through theoretical analysis, we provide formal guarantees for the effectiveness of our method in improving W2SG performance. Our empirical evaluations demonstrate substantial improvements in reasoning and decision-making capabilities across diverse task domains, validating the scalability and robustness of our proposed framework.
[762] Secure Best Arm Identification in the Presence of a Copycat
Asaf Cohen, Onur Günlü
Main category: cs.LG
TL;DR: The paper addresses best arm identification in stochastic linear bandits with a security constraint, proposing a secure algorithm using coded arms to hide the best arm from an observer while maintaining performance.
Details
Motivation: The problem involves identifying the best arm in a bandit setup while ensuring an observer (copycat Chloe) remains ignorant of the best arm, balancing performance and security.Method: The proposed algorithm uses coded arms without cryptographic primitives, ensuring security by obfuscating the best arm while maintaining a competitive error exponent.
Result: The algorithm achieves an Ω(T/log²(d)) error exponent, outperforming naive secure methods (Ω(T/d)) and revealing minimal information about the best arm.
Conclusion: The secure algorithm effectively balances identification performance and security, offering a practical solution for scenarios requiring privacy in bandit problems.
Abstract: Consider the problem of best arm identification with a security constraint. Specifically, assume a setup of stochastic linear bandits with $K$ arms of dimension $d$. In each arm pull, the player receives a reward that is the sum of the dot product of the arm with an unknown parameter vector and independent noise. The player’s goal is to identify the best arm after $T$ arm pulls. Moreover, assume a copycat Chloe is observing the arm pulls. The player wishes to keep Chloe ignorant of the best arm. While a minimax–optimal algorithm identifies the best arm with an $\Omega\left(\frac{T}{\log(d)}\right)$ error exponent, it easily reveals its best-arm estimate to an outside observer, as the best arms are played more frequently. A naive secure algorithm that plays all arms equally results in an $\Omega\left(\frac{T}{d}\right)$ exponent. In this paper, we propose a secure algorithm that plays with \emph{coded arms}. The algorithm does not require any key or cryptographic primitives, yet achieves an $\Omega\left(\frac{T}{\log^2(d)}\right)$ exponent while revealing almost no information on the best arm.
cs.MA
[763] Towards Multi-Agent Economies: Enhancing the A2A Protocol with Ledger-Anchored Identities and x402 Micropayments for AI Agents
Awid Vaziry, Sandro Rodriguez Garzon, Axel Küpper
Main category: cs.MA
TL;DR: A novel architecture integrates DLT to enhance A2A communication by improving agent discoverability and enabling micropayments, advancing autonomous economic interactions.
Details
Motivation: Address limitations in A2A protocols, specifically decentralized agent discoverability and micropayments, to enable secure and scalable multi-agent economies.Method: Integrates DLT for tamper-proof AgentCards as smart contracts and extends A2A with the x402 standard for HTTP-based micropayments.
Result: Demonstrates feasibility of DLT-based agent discovery and micropayments, enabling seamless agent interactions across boundaries.
Conclusion: The approach lays groundwork for secure, scalable, and economically viable multi-agent ecosystems, advancing agentic AI.
Abstract: This research article presents a novel architecture to empower multi-agent economies by addressing two critical limitations of the emerging Agent2Agent (A2A) communication protocol: decentralized agent discoverability and agent-to-agent micropayments. By integrating distributed ledger technology (DLT), this architecture enables tamper-proof, on-chain publishing of AgentCards as smart contracts, providing secure and verifiable agent identities. The architecture further extends A2A with the x402 open standard, facilitating blockchain-agnostic, HTTP-based micropayments via the HTTP 402 status code. This enables autonomous agents to seamlessly discover, authenticate, and compensate each other across organizational boundaries. This work further presents a comprehensive technical implementation and evaluation, demonstrating the feasibility of DLT-based agent discovery and micropayments. The proposed approach lays the groundwork for secure, scalable, and economically viable multi-agent ecosystems, advancing the field of agentic AI toward trusted, autonomous economic interactions.
[764] MLC-Agent: Cognitive Model based on Memory-Learning Collaboration in LLM Empowered Agent Simulation Environment
Ming Zhang, Yiling Xuan, Qun Ma, Yuwei Guo
Main category: cs.MA
TL;DR: The paper proposes an individual agent model with a memory-learning collaboration mechanism to improve modeling accuracy in artificial societies by addressing long-term memory effects ignored in current methods.
Details
Motivation: Current individual agent models often overlook the long-term accumulative effects of memory mechanisms, leading to deviations from real-world system characteristics.Method: The model uses hierarchical memory modeling (individual, group, buffer pool) and a multi-indicator evaluation mechanism for dynamic memory updates and collaborative decision-making.
Result: Agents built with this model show better decision-making quality and adaptability compared to existing methods.
Conclusion: The proposed model effectively enhances individual-level modeling quality and anthropomorphic characteristics in artificial societies.
Abstract: Many real-world systems, such as transportation systems, ecological systems, and Internet systems, are complex systems. As an important tool for studying complex systems, computational experiments can map them into artificial society models that are computable and reproducible within computers, thereby providing digital and computational methods for quantitative analysis. In current research, the construction of individual agent models often ignores the long-term accumulative effect of memory mechanisms in the development process of agents, which to some extent causes the constructed models to deviate from the real characteristics of real-world systems. To address this challenge, this paper proposes an individual agent model based on a memory-learning collaboration mechanism, which implements hierarchical modeling of the memory mechanism and a multi-indicator evaluation mechanism. Through hierarchical modeling of the individual memory repository, the group memory repository, and the memory buffer pool, memory can be effectively managed, and knowledge sharing and dissemination between individuals and groups can be promoted. At the same time, the multi-indicator evaluation mechanism enables dynamic evaluation of memory information, allowing dynamic updates of information in the memory set and promoting collaborative decision-making between memory and learning. Experimental results show that, compared with existing memory modeling methods, the agents constructed by the proposed model demonstrate better decision-making quality and adaptability within the system. This verifies the effectiveness of the individual agent model based on the memory-learning collaboration mechanism proposed in this paper in improving the quality of individual-level modeling in artificial society modeling and achieving anthropomorphic characteristics.
[765] Contrastive learning-based agent modeling for deep reinforcement learning
Wenhao Ma, Yu-Cheng Chang, Jie Yang, Yu-Kai Wang, Chin-Teng Lin
Main category: cs.MA
TL;DR: The paper introduces CLAM, a contrastive learning-based method for agent modeling in multi-agent systems, removing restrictive assumptions and achieving state-of-the-art performance.
Details
Motivation: Agent modeling is crucial for adaptive policies in multi-agent systems, but existing methods rely on restrictive assumptions like local observations from other agents or long trajectories.Method: Proposes CLAM, a contrastive learning-based approach using only the ego agent’s local observations to generate high-quality policy representations in real-time.
Result: CLAM outperforms existing methods in both cooperative and competitive multi-agent environments, achieving state-of-the-art results.
Conclusion: Contrastive learning-based agent modeling (CLAM) enhances reinforcement learning by improving policy adaptability and performance in diverse multi-agent settings.
Abstract: Multi-agent systems often require agents to collaborate with or compete against other agents with diverse goals, behaviors, or strategies. Agent modeling is essential when designing adaptive policies for intelligent machine agents in multiagent systems, as this is the means by which the ego agent understands other agents’ behavior and extracts their meaningful policy representations. These representations can be used to enhance the ego agent’s adaptive policy which is trained by reinforcement learning. However, existing agent modeling approaches typically assume the availability of local observations from other agents (modeled agents) during training or a long observation trajectory for policy adaption. To remove these constrictive assumptions and improve agent modeling performance, we devised a Contrastive Learning-based Agent Modeling (CLAM) method that relies only on the local observations from the ego agent during training and execution. With these observations, CLAM is capable of generating consistent high-quality policy representations in real-time right from the beginning of each episode. We evaluated the efficacy of our approach in both cooperative and competitive multi-agent environments. Our experiments demonstrate that our approach achieves state-of-the-art on both cooperative and competitive tasks, highlighting the potential of contrastive learning-based agent modeling for enhancing reinforcement learning.
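A generic contrastive objective of the kind such agent-modeling methods build on, using only ego-agent observation embeddings; CLAM's exact loss and encoder are defined in the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    """InfoNCE loss: pull embeddings of ego observations from the same
    episode (same partner policy) together, push other episodes apart.
    anchor, positive: (batch, dim) embeddings of two observation segments."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature          # (batch, batch) similarity matrix
    labels = torch.arange(len(a))           # matching pairs lie on the diagonal
    return F.cross_entropy(logits, labels)

print(info_nce(torch.randn(32, 128), torch.randn(32, 128)).item())
```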
[766] Real-Time LaCAM for Real-Time MAPF
Runzhe Liang, Rishi Veerapaneni, Daniel Harabor, Jiaoyang Li, Maxim Likhachev
Main category: cs.MA
TL;DR: Real-Time LaCAM is the first Real-Time MAPF method with provable completeness, addressing the impracticality of full-horizon planning by using incremental LaCAM for real-time execution.
Details
Motivation: Full-horizon MAPF planning is impractical for real-world applications due to time constraints, and existing real-time methods lack completeness guarantees, leading to livelock or deadlock.Method: Leverages LaCAM incrementally for real-time planning, allowing iterative planning in congested environments with millisecond cutoff times.
Result: Achieves the same success rate as full-horizon LaCAM while enabling real-time execution and compatibility with learned MAPF policies.
Conclusion: Real-Time LaCAM provides a practical, provably complete solution for real-time MAPF, balancing efficiency and reliability.
Abstract: The vast majority of Multi-Agent Path Finding (MAPF) methods with completeness guarantees require planning full-horizon paths. However, planning full-horizon paths can take too long and be impractical in real-world applications. Instead, real-time planning and execution, which only allows the planner a finite amount of time before executing and replanning, is more practical for real-world multi-agent systems. Several methods utilize real-time planning schemes but none are provably complete, which leads to livelock or deadlock. Our main contribution is Real-Time LaCAM, the first Real-Time MAPF method with provable completeness guarantees. We do this by leveraging LaCAM (Okumura 2023) in an incremental fashion. Our results show how we can iteratively plan for congested environments with a cutoff time of milliseconds while still maintaining the same success rate as full-horizon LaCAM. We also show how it can be used with a single-step learned MAPF policy.
[767] ADL: A Declarative Language for Agent-Based Chatbots
Sirui Zeng, Xifeng Yan
Main category: cs.MA
TL;DR: ADL is a declarative language for customer service chatbots, abstracting implementation details to ease maintenance and debugging. It supports natural language programming and integrates with custom functions and third-party agents. MICA, an open-source system, executes ADL programs.
Details
Motivation: Existing frameworks tightly couple Python programming with agent declaration, complicating maintenance and optimization. ADL aims to simplify agent definition and interaction.Method: ADL abstracts implementation details, provides a declarative language for agent definition, and incorporates natural language programming. It supports four agent types and integrates with custom tools and third-party agents. MICA interprets and executes ADL programs.
Result: ADL simplifies chatbot design and maintenance. MICA, an open-source system, successfully executes ADL programs, as demonstrated by its availability and documentation.
Conclusion: ADL offers a practical solution for declarative agent definition in chatbots, improving maintainability and ease of use. MICA’s open-source release facilitates adoption and further development.
Abstract: There are numerous frameworks capable of creating and orchestrating agents to address complex tasks. However, most of them tightly couple Python programming with agent declaration, making maintenance and runtime optimization difficult. In this work, we introduce ADL, an agent declarative language for customer service chatbots. ADL abstracts away implementation details, offering a declarative way to define agents and their interactions, which could ease maintenance and debugging. It also incorporates natural language programming at its core to simplify the specification and communication of chatbot designs. ADL includes four basic types of agents and supports integration with custom functions, tool use, and third-party agents. MICA, a multi-agent system designed to interpret and execute ADL programs, has been developed and is now available as an open-source project at https://github.com/Mica-labs/MICA. Its documentation can be found at https://mica-labs.github.io/.
[768] Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation
Shiyuan Li, Yixin Liu, Qingsong Wen, Chengqi Zhang, Shirui Pan
Main category: cs.MA
TL;DR: ARG-Designer reframes multi-agent system (MAS) design as a conditional autoregressive graph generation task, enabling dynamic and task-specific collaboration topologies.
Details
Motivation: Existing MAS designs rely on rigid templates, limiting adaptability to task-specific needs.Method: Proposes ARG-Designer, an autoregressive model that generates collaboration graphs from scratch, determining agents, roles, and links dynamically.
Result: Achieves state-of-the-art performance, token efficiency, and extensibility across six benchmarks.
Conclusion: ARG-Designer offers a flexible and extensible solution for MAS design, tailored to diverse tasks.
Abstract: Multi-agent systems (MAS) based on large language models (LLMs) have emerged as a powerful solution for dealing with complex problems across diverse domains. The effectiveness of MAS is critically dependent on its collaboration topology, which has become a focal point for automated design research. However, existing approaches are fundamentally constrained by their reliance on a template graph modification paradigm with a predefined set of agents and hard-coded interaction structures, significantly limiting their adaptability to task-specific requirements. To address these limitations, we reframe MAS design as a conditional autoregressive graph generation task, where both the system composition and structure are designed jointly. We propose ARG-Designer, a novel autoregressive model that operationalizes this paradigm by constructing the collaboration graph from scratch. Conditioned on a natural language task query, ARG-Designer sequentially and dynamically determines the required number of agents, selects their appropriate roles from an extensible pool, and establishes the optimal communication links between them. This generative approach creates a customized topology in a flexible and extensible manner, precisely tailored to the unique demands of different tasks. Extensive experiments across six diverse benchmarks demonstrate that ARG-Designer not only achieves state-of-the-art performance but also enjoys significantly greater token efficiency and enhanced extensibility. The source code of ARG-Designer is available at https://github.com/Shiy-Li/ARG-Designer.
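To make the autoregressive decoding idea concrete, the toy loop below sequentially chooses an agent count, per-agent roles, and pairwise communication links. The random logit heads and role pool are placeholders for ARG-Designer’s trained, query-conditioned model.

```python
# Toy autoregressive topology decoding in the spirit of ARG-Designer:
# pick the agent count, then a role per agent, then per-pair link decisions.
# The heads here are random stand-ins; the real model conditions every
# decision on the task embedding and on prior choices.
import torch

torch.manual_seed(0)
ROLE_POOL = ["planner", "coder", "critic", "retriever"]

def decode_topology(task_emb, max_agents=5):
    # 1) agent count: a categorical choice over 1..max_agents
    n_logits = torch.randn(max_agents)            # stand-in for a learned head
    n_agents = int(torch.argmax(n_logits)) + 1
    # 2) roles, sampled one step at a time
    roles = []
    for _ in range(n_agents):
        probs = torch.softmax(torch.randn(len(ROLE_POOL)), dim=0)
        roles.append(ROLE_POOL[int(torch.multinomial(probs, 1))])
    # 3) directed communication links among the generated agents
    edges = [(i, j) for i in range(n_agents) for j in range(n_agents)
             if i != j and torch.rand(()).item() > 0.5]
    return roles, edges

roles, edges = decode_topology(torch.randn(16))
print(roles, edges)
```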
cs.MM
[769] Anchoring Trends: Mitigating Social Media Popularity Prediction Drift via Feature Clustering and Expansion
Chia-Ming Lee, Bo-Cheng Qiu, Cheng-Jun Kang, Yi-Hsuan Wu, Jun-Lin Chen, Yu-Fan Lin, Yi-Shiuan Chou, Chih-Chung Hsu
Main category: cs.MM
TL;DR: AMCFG framework addresses prediction drift in video popularity by using multi-modal clustering and LLM-generated semantic anchors for stable, accurate predictions.
Details
Motivation: Prediction drift due to evolving trends and user behaviors challenges video popularity prediction.Method: Uses multi-modal clustering and LLMs to generate temporally-invariant semantic anchors (e.g., demographics, themes) and statistical features.
Result: AMCFG improves accuracy and robustness, performing well on out-of-distribution data.
Conclusion: AMCFG offers a viable solution for real-world video popularity prediction by focusing on stable patterns.
Abstract: Predicting online video popularity faces a critical challenge: prediction drift, where models trained on historical data rapidly degrade due to evolving viral trends and user behaviors. To address this temporal distribution shift, we propose an Anchored Multi-modal Clustering and Feature Generation (AMCFG) framework that discovers temporally-invariant patterns across data distributions. Our approach employs multi-modal clustering to reveal content structure, then leverages Large Language Models (LLMs) to generate semantic Anchor Features, such as audience demographics, content themes, and engagement patterns that transcend superficial trend variations. These semantic anchors, combined with cluster-derived statistical features, enable prediction based on stable principles rather than ephemeral signals. Experiments demonstrate that AMCFG significantly enhances both predictive accuracy and temporal robustness, achieving superior performance on out-of-distribution data and providing a viable solution for real-world video popularity prediction.
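As a rough illustration of the anchoring pipeline, the sketch below clusters fused embeddings, derives cluster-level statistical features, and joins them with per-cluster semantic vectors; the anchor table here is a synthetic stand-in for LLM-generated anchor features.

```python
# Anchored feature construction in outline: cluster multi-modal embeddings,
# attach cluster statistics, and join with semantic anchor vectors. All data
# and the anchor table are synthetic placeholders.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
emb = rng.normal(size=(300, 32))                    # fused video embeddings

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(emb)
centers = km.cluster_centers_[km.labels_]           # per-sample cluster center
dist = np.linalg.norm(emb - centers, axis=1, keepdims=True)

# stand-in for LLM anchor features: one semantic vector per cluster
anchor_table = rng.normal(size=(4, 8))
anchors = anchor_table[km.labels_]

features = np.hstack([emb, centers, dist, anchors]) # anchored feature set
print(features.shape)                               # (300, 32+32+1+8)
```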
[770] Controllable Video-to-Music Generation with Multiple Time-Varying Conditions
Junxian Wu, Weitao You, Heda Zuo, Dengming Zhang, Pei Chen, Lingyun Sun
Main category: cs.MM
TL;DR: A novel multi-condition guided V2M framework improves music generation by incorporating time-varying conditions and a two-stage training strategy for better control and alignment with user expectations.
Details
Motivation: Existing V2M methods lack control and often fail to meet user expectations due to their black-box nature, prompting the need for a more controllable and aligned approach.Method: The proposed framework uses a two-stage training strategy: (1) fine-grained feature selection and temporal alignment, and (2) dynamic conditional fusion and control-guided decoding to integrate multiple conditions.
Result: The method outperforms existing V2M pipelines in subjective and objective evaluations, offering enhanced control and alignment with user needs.
Conclusion: The multi-condition guided V2M framework successfully addresses the limitations of existing methods, providing a more controllable and user-aligned music generation process.
Abstract: Music enhances video narratives and emotions, driving demand for automatic video-to-music (V2M) generation. However, existing V2M methods relying solely on visual features or supplementary textual inputs generate music in a black-box manner, often failing to meet user expectations. To address this challenge, we propose a novel multi-condition guided V2M generation framework that incorporates multiple time-varying conditions for enhanced control over music generation. Our method uses a two-stage training strategy that enables learning of V2M fundamentals and audiovisual temporal synchronization while meeting users’ needs for multi-condition control. In the first stage, we introduce a fine-grained feature selection module and a progressive temporal alignment attention mechanism to ensure flexible feature alignment. For the second stage, we develop a dynamic conditional fusion module and a control-guided decoder module to integrate multiple conditions and accurately guide the music composition process. Extensive experiments demonstrate that our method outperforms existing V2M pipelines in both subjective and objective evaluations, significantly enhancing control and alignment with user expectations.
[771] Dark Side of Modalities: Reinforced Multimodal Distillation for Multimodal Knowledge Graph Reasoning
Yu Zhao, Ying Zhang, Xuhui Sui, Baohang Zhou, Haoze Zhu, Jeff Z. Pan, Xiaojie Yuan
Main category: cs.MM
TL;DR: The paper introduces a Reinforced Multimodal Distillation framework to address limitations in multimodal knowledge graph reasoning by leveraging dark knowledge from non-target entities and dynamically selecting optimal modalities.
Details
Motivation: Existing MKGR approaches neglect probabilistic correlations of entity labels and statically incorporate all modalities, leading to inefficiencies.Method: Proposes logit distillation for unimodal KGR training and a reinforced teacher combination mechanism to dynamically select helpful modalities.
Result: Demonstrates effectiveness on 5 MKGR datasets.
Conclusion: The DSoM framework successfully addresses the issues of label correlations and modality selection, improving MKGR performance.
Abstract: The multimodal knowledge graph reasoning (MKGR) task aims to predict the missing facts in incomplete MKGs by leveraging auxiliary images and descriptions of entities. Existing approaches are trained with single-target objectives, which neglect the probabilistic correlations of entity labels, especially in non-target entities. Moreover, previous studies incorporate all modalities statically or adaptively, overlooking the negative impact of irrelevant or misleading information from weaker modalities. To address these issues, we introduce a novel Reinforced Multimodal Distillation framework, exploiting the Dark Side of Modalities (DSoM) from two perspectives: (1) Dark knowledge from non-target entities: We propose to train a unimodal KGR model through logit distillation to mimic the multimodal soft labels provided by pre-trained multimodal teacher models. The multimodal soft labels can provide rich supervision signals with subtle correlations among both target and non-target entities from multiple perspectives. We further decouple the logits into neighbor and non-neighbor entities, separating these two types of correlations. (2) Dark side in unhelpful modalities: To exclude the adverse effects of unhelpful modalities, we introduce a reinforced teacher combination mechanism that dynamically selects the optimal set of multimodal teachers for each triple. The agent is trained to maximize the rewards, which are only assigned to the beneficial multimodal combination strategies for the student model. Comprehensive experiments demonstrate the effectiveness of the DSoM framework on 5 MKGR datasets. Codes are available at github.com/OreOZhao/DSoM.
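The first ingredient, logit distillation onto multimodal soft labels, follows the standard knowledge-distillation recipe. A minimal sketch, with illustrative shapes and temperature:

```python
# Minimal logit-distillation step: a unimodal student mimics the soft label
# distribution over candidate entities produced by a multimodal teacher.
# Shapes and the temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T: float = 2.0):
    # KL(teacher || student) on temperature-softened distributions; the T**2
    # factor keeps gradient magnitudes comparable across temperatures.
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

num_entities = 1000
student_logits = torch.randn(8, num_entities, requires_grad=True)
teacher_logits = torch.randn(8, num_entities)  # from a frozen multimodal teacher
distill_loss(student_logits, teacher_logits).backward()
```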
[772] PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning
Yibo Lyu, Rui Shao, Gongwei Chen, Yijie Zhu, Weili Guan, Liqiang Nie
Main category: cs.MM
TL;DR: PUMA proposes a layer-pruned MLLM with modality-adaptive learning for efficient unified multimodal retrieval, reducing resource usage without sacrificing performance.
Details
Motivation: The high training costs and low inference efficiency of large MLLMs hinder their practical use in unified multimodal retrieval (UMR).Method: PUMA uses Layer-Pruned Self-Distillation to prune MLLMs and Modality-Adaptive Contrastive Learning Loss (MAC-Loss) to enhance learning efficiency.
Result: The method significantly reduces resource usage while maintaining strong retrieval performance.
Conclusion: PUMA offers an efficient solution for UMR by balancing structural pruning and adaptive learning.
Abstract: As multimedia content expands, the demand for unified multimodal retrieval (UMR) in real-world applications increases. Recent work leverages multimodal large language models (MLLMs) to tackle this task. However, their large parameter size results in high training costs and low inference efficiency. To address this, we propose PUMA: a Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning. Our approach improves UMR from both structural and learning perspectives. (1) Structurally, we propose Layer-Pruned Self-Distillation, which prunes MLLMs by keeping only shallow layers while distilling features from dropped deep layers as teacher signals. This reduces parameters and preserves representation capability. (2) On the learning side, we introduce Modality-Adaptive Contrastive Learning Loss (MAC-Loss), which separates in-batch negatives into harder intra-modality and easier inter-modality groups based on the target modality, assigning different temperature strategies to enhance learning efficiency. Experiments show our method significantly reduces resource usage while maintaining strong performance.
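A sketch of a MAC-Loss-style objective is below: in-batch negatives that share the target’s modality are scored with a sharper (lower) temperature than cross-modality negatives. The specific temperatures and masking scheme are assumptions for illustration.

```python
# Modality-adaptive contrastive loss in the spirit of MAC-Loss: negatives in
# the target's own modality are "hard" and get a lower temperature than
# cross-modality negatives. Temperatures and masking are illustrative.
import torch
import torch.nn.functional as F

def mac_style_loss(query, target, target_modality,
                   t_intra: float = 0.05, t_inter: float = 0.1):
    # query/target: (B, D) L2-normalized; target_modality: (B,) int labels.
    sim = query @ target.t()                              # (B, B)
    same_mod = target_modality[None, :] == target_modality[:, None]
    # per-pair temperature: sharper for same-modality (harder) negatives
    temp = torch.where(same_mod, torch.tensor(t_intra), torch.tensor(t_inter))
    logits = sim / temp
    labels = torch.arange(query.size(0))
    return F.cross_entropy(logits, labels)

q = F.normalize(torch.randn(8, 64), dim=-1)
t = F.normalize(torch.randn(8, 64), dim=-1)
mods = torch.randint(0, 2, (8,))                          # 0=image, 1=text
print(mac_style_loss(q, t, mods))
```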
eess.AS
[773] Binaural Speech Enhancement Using Complex Convolutional Recurrent Networks
Vikas Tokala, Eric Grinstein, Mike Brookes, Simon Doclo, Jesper Jensen, Patrick A. Naylor
Main category: eess.AS
TL;DR: An end-to-end binaural speech enhancement method using a complex recurrent convolutional network improves speech intelligibility and noise reduction while preserving spatial information.
Details
Motivation: To enhance speech intelligibility and listening comfort in binaural devices like hearing aids and AR/VR, addressing noise reduction and spatial information preservation.Method: Uses a complex recurrent convolutional network with encoder-decoder architecture and complex LSTM, estimating individual complex ratio masks for left and right-ear channels.
Result: Significantly improves speech intelligibility and noise reduction while preserving spatial information compared to baseline algorithms.
Conclusion: The proposed method effectively enhances binaural speech in noisy environments, maintaining spatial cues for better listening experiences.
Abstract: From hearing aids to augmented and virtual reality devices, binaural speech enhancement algorithms have been established as state-of-the-art techniques to improve speech intelligibility and listening comfort. In this paper, we present an end-to-end binaural speech enhancement method using a complex recurrent convolutional network with an encoder-decoder architecture and a complex LSTM recurrent block placed between the encoder and decoder. A loss function that focuses on the preservation of spatial information in addition to speech intelligibility improvement and noise reduction is introduced. The network estimates individual complex ratio masks for the left and right-ear channels of a binaural hearing device in the time-frequency domain. We show that, compared to other baseline algorithms, the proposed method significantly improves the estimated speech intelligibility and reduces the noise while preserving the spatial information of the binaural signals in acoustic situations with a single target speaker and isotropic noise of various types.
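The output stage of such a system applies one complex ratio mask per ear channel in the time-frequency domain, as the minimal sketch below shows; the random mask values stand in for network predictions.

```python
# Applying per-channel complex ratio masks in the STFT domain. The random
# masks here are placeholders for the network's predicted left/right masks.
import torch

n_fft, hop = 512, 128
stereo = torch.randn(2, 16000)                 # left/right ear, 1 s at 16 kHz
window = torch.hann_window(n_fft)
spec = torch.stft(stereo, n_fft, hop, window=window, return_complex=True)

# one complex mask per ear; shape matches the spectrogram
mask = torch.complex(torch.randn_like(spec.real), torch.randn_like(spec.imag))
enhanced = torch.istft(spec * mask, n_fft, hop, window=window)
print(enhanced.shape)  # (2, ~16000): enhanced left/right channels
```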
[774] Binaural Localization Model for Speech in Noise
Vikas Tokala, Eric Grinstein, Rory Brooks, Mike Brookes, Simon Doclo, Jesper Jensen, Patrick A. Naylor
Main category: eess.AS
TL;DR: The paper presents a lightweight convolutional recurrent network for binaural speech localization in noisy, reverberant conditions, comparing it with human performance and traditional methods.
Details
Motivation: Binaural acoustic source localization is crucial for human spatial awareness, communication, and safety.Method: An end-to-end binaural localization model using a convolutional recurrent network, incorporating internal ear noise to mimic human hearing thresholds.
Result: The model’s performance is compared with the steered response power algorithm and human localization in noisy conditions.
Conclusion: The model serves as a tool to evaluate interaural cue preservation in binaural speech enhancement methods.
Abstract: Binaural acoustic source localization is important to human listeners for spatial awareness, communication and safety. In this paper, an end-to-end binaural localization model for speech in noise is presented. A lightweight convolutional recurrent network that localizes sound in the frontal azimuthal plane for noisy reverberant binaural signals is introduced. The model incorporates additive internal ear noise to represent the frequency-dependent hearing threshold of a typical listener. The localization performance of the model is compared with the steered response power algorithm, and the use of the model as a measure of interaural cue preservation for binaural speech enhancement methods is studied. A listening test was performed to compare the performance of the model with human localization of speech in noisy conditions.
[775] Binaural Sound Event Localization and Detection based on HRTF Cues for Humanoid Robots
Gyeong-Tae Lee, Hyeonuk Nam, Yong-Hwa Park
Main category: eess.AS
TL;DR: The paper introduces BiSELD, a task for detecting and localizing sound events using binaural audio, and proposes a new feature (BTFF) and model (BiSELDnet) to achieve human-like auditory perception.
Details
Motivation: To mimic human spatial hearing by jointly detecting and localizing sound events using binaural audio.Method: Proposes BTFF, an 8-channel feature encoding ITD, ILD, and SC cues, and BiSELDnet, a CRNN-based model to learn from BTFF.
Result: Achieves 87.1% F-score and 4.4° localization error on the Binaural Set dataset.
Conclusion: The framework effectively mimics human auditory perception, with each BTFF sub-feature contributing to performance.
Abstract: This paper introduces Binaural Sound Event Localization and Detection (BiSELD), a task that aims to jointly detect and localize multiple sound events using binaural audio, inspired by the spatial hearing mechanism of humans. To support this task, we present a synthetic benchmark dataset, called the Binaural Set, which simulates realistic auditory scenes using measured head-related transfer functions (HRTFs) and diverse sound events. To effectively address the BiSELD task, we propose a new input feature representation called the Binaural Time-Frequency Feature (BTFF), which encodes interaural time difference (ITD), interaural level difference (ILD), and high-frequency spectral cues (SC) from binaural signals. BTFF is composed of eight channels, including left and right mel-spectrograms, velocity-maps, SC-maps, and ITD-/ILD-maps, designed to cover different spatial cues across frequency bands and spatial axes. A CRNN-based model, BiSELDnet, is then developed to learn both spectro-temporal patterns and HRTF-based localization cues from BTFF. Experiments on the Binaural Set show that each BTFF sub-feature enhances task performance: V-map improves detection, ITD-/ILD-maps enable accurate horizontal localization, and SC-map captures vertical spatial cues. The final system achieves a SELD error of 0.110 with 87.1% F-score and 4.4° localization error, demonstrating the effectiveness of the proposed framework in mimicking human-like auditory perception.
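For readers unfamiliar with the interaural cues BTFF encodes, the sketch below computes classical broadband ILD and ITD estimates from a synthetic stereo pair; real ITD-/ILD-maps would be computed per time-frequency bin.

```python
# Broadband ILD (level ratio in dB) and ITD (cross-correlation peak lag)
# from a synthetic binaural pair. Signal and parameters are illustrative.
import numpy as np

fs = 16000
delay = 20                                   # samples of interaural delay
rng = np.random.default_rng(0)
left = rng.normal(size=fs)                   # broadband source, near ear
right = 0.7 * np.roll(left, delay)           # far ear: quieter and delayed

# ILD: broadband level difference in dB
ild_db = 20 * np.log10(np.sqrt(np.mean(left**2)) / np.sqrt(np.mean(right**2)))

# ITD: lag of the cross-correlation peak (sign tells which ear leads;
# here it is negative because the right ear lags)
corr = np.correlate(left, right, mode="full")
lag = int(np.argmax(corr)) - (len(left) - 1)
print(f"ILD ~ {ild_db:.1f} dB, ITD ~ {1e6 * lag / fs:.0f} us")
```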
[776] MIMII-Agent: Leveraging LLMs with Function Calling for Relative Evaluation of Anomalous Sound Detection
Harsh Purohit, Tomoya Nishida, Kota Dohi, Takashi Endo, Yohei Kawaguchi
Main category: eess.AS
TL;DR: The paper introduces an LLM-based method to generate synthetic anomalies for evaluating UASD systems, overcoming limitations of manual and data-dependent approaches.
Details
Motivation: Existing methods for anomaly sound generation are either unrealistic (keyword-based) or require anomalous data (MIMII-Gen). The goal is to enable scalable evaluation of UASD systems without real anomaly data.Method: Uses LLMs to interpret fault descriptions and select audio transformations, converting normal sounds into diverse anomalies. Validated on five machine types with real and synthetic anomalies.
Result: Synthetic anomalies show consistent detection difficulty trends with real anomalies, validating the method’s effectiveness.
Conclusion: The LLM-based approach is effective for relative evaluation of UASD systems, especially when real anomaly data is scarce.
Abstract: This paper proposes a method for generating machine-type-specific anomalies to evaluate the relative performance of unsupervised anomalous sound detection (UASD) systems across different machine types, even in the absence of real anomaly sound data. Conventional keyword-based data augmentation methods often produce unrealistic sounds due to their reliance on manually defined labels, limiting scalability as machine types and anomaly patterns diversify. Advanced audio generative models, such as MIMII-Gen, show promise but typically depend on anomalous training data, making them less effective when diverse anomalous examples are unavailable. To address these limitations, we propose a novel synthesis approach leveraging large language models (LLMs) to interpret textual descriptions of faults and automatically select audio transformation functions, converting normal machine sounds into diverse and plausible anomalous sounds. We validate this approach by evaluating a UASD system trained only on normal sounds from five machine types, using both real and synthetic anomaly data. Experimental results reveal consistent trends in relative detection difficulty across machine types between synthetic and real anomalies. This finding supports our hypothesis and highlights the effectiveness of the proposed LLM-based synthesis approach for relative evaluation of UASD systems.
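The function-calling pattern the paper describes can be sketched as an LLM choosing, by name, one transform from a registered set; the transforms, JSON contract, and `call_llm` stub below are illustrative assumptions rather than the paper’s implementation.

```python
# Sketch of LLM function calling for anomaly synthesis: the model reads a
# fault description and names one registered audio transform to apply.
# Transforms, prompt, and the `call_llm` stub are illustrative assumptions.
import json
import numpy as np

def add_harmonics(x, fs):      # e.g. bearing wear -> extra tonal components
    t = np.arange(len(x)) / fs
    return x + 0.1 * np.sin(2 * np.pi * 173.0 * t)

def amplitude_modulate(x, fs): # e.g. loose part -> periodic level fluctuation
    t = np.arange(len(x)) / fs
    return x * (1.0 + 0.3 * np.sin(2 * np.pi * 4.0 * t))

TRANSFORMS = {"add_harmonics": add_harmonics,
              "amplitude_modulate": amplitude_modulate}

def synthesize_anomaly(normal_audio, fs, fault_description, call_llm):
    reply = call_llm(
        f"Fault: {fault_description}. "
        f"Choose one of {list(TRANSFORMS)} and answer as JSON: "
        '{"transform": "<name>"}'
    )
    name = json.loads(reply)["transform"]
    return TRANSFORMS[name](normal_audio, fs)

# Stub LLM for a runnable demo; a real system would call an actual model.
fake_llm = lambda prompt: '{"transform": "amplitude_modulate"}'
fs = 16000
anomalous = synthesize_anomaly(np.random.randn(fs), fs, "loose fan mount", fake_llm)
print(anomalous.shape)
```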
[777] End-to-End DOA-Guided Speech Extraction in Noisy Multi-Talker Scenarios
Kangqi Jing, Wenbin Zhang, Yu Gao
Main category: eess.AS
TL;DR: An end-to-end TSE model using DOA and beamwidth embeddings effectively extracts target speech in noisy, multi-speaker environments, improving ASR performance.
Details
Motivation: Enhancing speech signals in noisy and multi-speaker environments is critical for real-world applications.Method: Incorporates Direction of Arrival (DOA) and beamwidth embeddings to extract speech from a specified spatial region.
Result: Significantly enhances target speech within the beamwidth, suppresses interference, and improves ASR tasks.
Conclusion: The model is robust and suitable for real-world applications, offering clear target voice extraction.
Abstract: Target Speaker Extraction (TSE) plays a critical role in enhancing speech signals in noisy and multi-speaker environments. This paper presents an end-to-end TSE model that incorporates Direction of Arrival (DOA) and beamwidth embeddings to extract speech from a specified spatial region centered around the DOA. Our approach efficiently captures spatial and temporal features, enabling robust performance in highly complex scenarios with multiple simultaneous speakers. Experimental results demonstrate that the proposed model not only significantly enhances the target speech within the defined beamwidth but also effectively suppresses interference from other directions, producing a clear and isolated target voice. Furthermore, the model achieves remarkable improvements in downstream Automatic Speech Recognition (ASR) tasks, making it particularly suitable for real-world applications.
[778] Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment
Tien-Hong Lo, Meng-Ting Tsai, Yao-Ting Sung, Berlin Chen
Main category: eess.AS
TL;DR: The study investigates using zero-shot text-to-speech (ZS-TTS) to generate learner-specific golden speech for improving L2 pronunciation assessment, showing performance improvements on benchmark datasets.
Details
Motivation: To enhance L2 pronunciation learning by leveraging personalized golden speech and exploring its role in automatic pronunciation assessment (APA).Method: Develops a framework to evaluate ZS-TTS for golden speech generation and tests its effectiveness in APA using L2-ARCTIC and Speechocean762 datasets.
Result: Proposed modeling outperforms prior methods in assessment metrics, demonstrating the potential of golden speech in APA.
Conclusion: First study to integrate golden speech in ZS-TTS and APA, offering a new approach for computer-assisted pronunciation training (CAPT).
Abstract: Second language (L2) learners can improve their pronunciation by imitating golden speech, especially when that speech aligns with their own speech characteristics. This study explores the hypothesis that learner-specific golden speech generated with zero-shot text-to-speech (ZS-TTS) techniques can be harnessed as an effective metric for measuring the pronunciation proficiency of L2 learners. Building on this exploration, the contributions of this study are at least two-fold: 1) the design and development of a systematic framework for assessing the ability of a synthesis model to generate golden speech, and 2) an in-depth investigation of the effectiveness of using golden speech in automatic pronunciation assessment (APA). Comprehensive experiments conducted on the L2-ARCTIC and Speechocean762 benchmark datasets suggest that our proposed modeling can yield significant performance improvements with respect to various assessment metrics relative to prior art. To our knowledge, this study is the first to explore the role of golden speech in both ZS-TTS and APA, offering a promising approach for computer-assisted pronunciation training (CAPT).
[779] RADE: A Neural Codec for Transmitting Speech over HF Radio Channels
David Rowe, Jean-Marc Valin
Main category: eess.AS
TL;DR: The paper introduces a neural autoencoder for speech compression over HF radio, replacing traditional signal processing with a network trained for robustness to noise and multipath, outperforming existing systems in intelligibility.
Details
Motivation: To improve speech intelligibility over HF radio by replacing classical signal processing elements with a neural network, addressing challenges like noise and multipath.Method: An autoencoder converts vocoder features to QAM symbols, transmitted via OFDM over HF radio, and decodes back to features for synthesis, trained for robustness and low PAPR.
Result: The system achieves higher speech intelligibility than analog and digital radio systems across various SNRs in simulated and real-world HF channels.
Conclusion: The neural autoencoder effectively replaces traditional methods, offering superior performance in speech transmission over HF radio.
Abstract: Speech compression is commonly used to send voice over radio channels in applications such as mobile telephony and two-way push-to-talk (PTT) radio. In classical systems, the speech codec is combined with forward error correction, modulation and radio hardware. In this paper we describe an autoencoder that replaces many of the traditional signal processing elements with a neural network. The encoder takes a vocoder feature set (short term spectrum, pitch, voicing), and produces discrete time, but continuously valued quadrature amplitude modulation (QAM) symbols. We use orthogonal frequency domain multiplexing (OFDM) to send and receive these symbols over high frequency (HF) radio channels. The decoder converts received QAM symbols to vocoder features suitable for synthesis. The autoencoder has been trained to be robust to additive Gaussian noise and multipath channel impairments while simultaneously maintaining a Peak To Average Power Ratio (PAPR) of less than 1 dB. Over simulated and real world HF radio channels we have achieved output speech intelligibility that clearly surpasses existing analog and digital radio systems over a range of SNRs.
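The PAPR constraint mentioned above is straightforward to compute over a block of modem symbols, as a minimal sketch (with random stand-in symbols) shows:

```python
# Peak-to-average power ratio (PAPR) of a block of modem symbols, the
# quantity the autoencoder is constrained to keep under about 1 dB. The
# random QAM-like symbols here are placeholders for encoder outputs.
import numpy as np

def papr_db(symbols: np.ndarray) -> float:
    power = np.abs(symbols) ** 2
    return 10 * np.log10(power.max() / power.mean())

rng = np.random.default_rng(0)
symbols = rng.normal(size=256) + 1j * rng.normal(size=256)  # complex baseband
print(f"PAPR = {papr_db(symbols):.2f} dB")  # Gaussian symbols: typically >8 dB
```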
[780] Neural Spectral Band Generation for Audio Coding
Woongjib Choi, Byeong Hyeon Kim, Hyungseob Lim, Inseon Jang, Hong-Goo Kang
Main category: eess.AS
TL;DR: The paper proposes a DNN-based method (n-SBG) for high-frequency band generation in audio coding, outperforming HE-AAC-v1 in perceptual quality with less side information.
Details
Motivation: SBR's subband-wise replication limits adaptability to diverse signals, prompting exploration of a DNN-based generative approach for better performance.Method: A DNN encoder-decoder extracts and quantizes high-frequency side information, generating components using both side info and decoded core-band signals, optimized with adversarial criteria.
Result: The method achieves superior perceptual quality compared to HE-AAC-v1 while requiring less side information.
Conclusion: n-SBG offers a promising alternative to SBR for high-frequency band generation in audio coding, enhancing perceptual quality efficiently.
Abstract: Spectral band replication (SBR) enables bit-efficient coding by generating high-frequency bands from the low-frequency ones. However, it only utilizes coarse spectral features upon a subband-wise signal replication, limiting adaptability to diverse acoustic signals. In this paper, we explore the efficacy of a deep neural network (DNN)-based generative approach for coding the high-frequency bands, which we call neural spectral band generation (n-SBG). Specifically, we propose a DNN-based encoder-decoder structure to extract and quantize the side information related to the high-frequency components and generate the components given both the side information and the decoded core-band signals. The whole coding pipeline is optimized with generative adversarial criteria to enable the generation of perceptually plausible sound. From experiments using AAC as the core codec, we show that the proposed method achieves a better perceptual quality than HE-AAC-v1 with much less side information.
[781] SpecASR: Accelerating LLM-based Automatic Speech Recognition via Speculative Decoding
Linye Wei, Shuzhang Zhong, Songqiang Xu, Runsheng Wang, Ru Huang, Meng Li
Main category: eess.AS
TL;DR: SpecASR is a novel speculative decoding framework for ASR that reduces latency by dynamically adjusting draft sequence length and recycling sequences, achieving significant speedup without accuracy loss.
Details
Motivation: High decoding latency in LLM-based ASR challenges real-time requirements, and existing speculative decoding methods overlook ASR-specific characteristics.Method: SpecASR uses adaptive draft sequence generation, draft sequence recycling, and a two-pass sparse token tree algorithm to optimize latency.
Result: SpecASR achieves 3.04x-3.79x speedup over autoregressive decoding and 1.25x-1.84x over speculative decoding, with no accuracy loss.
Conclusion: SpecASR effectively addresses ASR latency issues by leveraging ASR-specific optimizations, offering a practical solution for real-time applications.
Abstract: Large language model (LLM)-based automatic speech recognition (ASR) has recently attracted a lot of attention due to its high recognition accuracy and enhanced multi-dialect support. However, the high decoding latency of LLMs challenges the real-time ASR requirements. Although speculative decoding has been explored for better decoding efficiency, existing methods usually ignore the key characteristics of the ASR task and achieve limited speedup. To further reduce the real-time ASR latency, in this paper, we propose a novel speculative decoding framework specialized for ASR, dubbed SpecASR. SpecASR is developed based on our core observation that ASR decoding is audio-conditioned, which results in high output alignment between small and large ASR models, even given output mismatches in intermediate decoding steps. Therefore, SpecASR features an adaptive draft sequence generation process that dynamically modifies the draft sequence length to maximize the token acceptance length. SpecASR further proposes a draft sequence recycling strategy that reuses the previously generated draft sequence to reduce the draft ASR model latency. Moreover, a two-pass sparse token tree generation algorithm is also proposed to balance the latency of draft and target ASR models. With extensive experimental results, we demonstrate SpecASR achieves 3.04x-3.79x and 1.25x-1.84x speedup over the baseline autoregressive decoding and speculative decoding, respectively, without any loss in recognition accuracy.
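For context, the generic draft-then-verify loop that speculative decoding frameworks like SpecASR build on can be sketched as follows; the greedy acceptance rule and toy models below are simplifications of SpecASR’s adaptive mechanisms.

```python
# Generic speculative decoding step: a small draft model proposes k tokens,
# the large model verifies them in one pass, and generation resumes from the
# first mismatch. Greedy acceptance and the toy models are simplifications.
import torch

def speculative_step(draft_model, target_model, prefix, k: int = 4):
    # 1) draft k tokens autoregressively with the cheap model
    draft = prefix.clone()
    for _ in range(k):
        next_tok = draft_model(draft)[-1].argmax().unsqueeze(0)
        draft = torch.cat([draft, next_tok])
    proposed = draft[len(prefix):]
    # 2) a single target-model pass scores every proposed position at once
    target_pred = target_model(draft[:-1]).argmax(-1)[len(prefix) - 1:]
    # 3) keep the longest agreeing prefix, then emit one corrected token
    agree = (proposed == target_pred).int()
    n_ok = int(agree.cumprod(0).sum())
    return torch.cat([prefix, proposed[:n_ok], target_pred[n_ok:n_ok + 1]])

# Toy "models": return a (seq_len, vocab) logit matrix for a token sequence.
vocab = 10
draft_model = lambda ids: torch.randn(len(ids), vocab)
target_model = lambda ids: torch.randn(len(ids), vocab)
print(speculative_step(draft_model, target_model, torch.tensor([1, 2, 3])))
```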
[782] Should Top-Down Clustering Affect Boundaries in Unsupervised Word Discovery?
Simon Malan, Benjamin van Niekerk, Herman Kamper
Main category: eess.AS
TL;DR: The paper compares bottom-up and top-down methods for segmenting unlabeled speech into word-like units and clustering them into a lexicon, finding both achieve similar state-of-the-art results, with bottom-up being faster.
Details
Motivation: To determine if top-down information is necessary for improving speech segmentation or if simpler bottom-up methods suffice.Method: Two approaches: a bottom-up method using self-supervised features for boundary prediction and clustering, and a top-down method (ES-KMeans) iteratively updating boundaries with K-means.
Result: Both methods perform comparably on ZeroSpeech benchmarks, with bottom-up being faster. Top-down benefits depend on factors like candidate boundaries.
Conclusion: Future work should focus on improving clustering techniques and learning more discriminative word-like representations, as clustering is a limiting factor.
Abstract: We investigate the problem of segmenting unlabeled speech into word-like units and clustering these to create a lexicon. Prior work can be categorized into two frameworks. Bottom-up methods first determine boundaries and then cluster the fixed segmented words into a lexicon. In contrast, top-down methods incorporate information from the clustered words to inform boundary selection. However, it is unclear whether top-down information is necessary to improve segmentation. To explore this, we look at two similar approaches that differ in whether top-down clustering informs boundary selection. Our simple bottom-up strategy predicts word boundaries using the dissimilarity between adjacent self-supervised features, then clusters the resulting segments to construct a lexicon. Our top-down system is an updated version of the ES-KMeans dynamic programming method that iteratively uses K-means to update its boundaries. On the five-language ZeroSpeech benchmarks, both approaches achieve comparable state-of-the-art results, with the bottom-up system being nearly five times faster. Through detailed analyses, we show that the top-down influence of ES-KMeans can be beneficial (depending on factors like the candidate boundaries), but in many cases the simple bottom-up method performs just as well. For both methods, we show that the clustering step is a limiting factor. Therefore, we recommend that future work focus on improved clustering techniques and learning more discriminative word-like representations. Project code repository: https://github.com/s-malan/prom-seg-clus.
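The bottom-up recipe can be sketched in a few lines: detect boundaries at peaks of adjacent-frame dissimilarity, average-pool each segment, and cluster the segments. Random features below stand in for real self-supervised speech representations, and the peak-picking parameters are illustrative.

```python
# Bottom-up word discovery in miniature: boundaries at peaks of
# adjacent-frame cosine dissimilarity, then K-means over segment embeddings.
import numpy as np
from scipy.signal import find_peaks
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))              # (frames, dim)

# cosine dissimilarity between consecutive frames
unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
dissim = 1.0 - np.sum(unit[:-1] * unit[1:], axis=1)
boundaries, _ = find_peaks(dissim, distance=5)  # local peaks >= 5 frames apart

# average-pool each segment, then cluster segments into word types
edges = np.concatenate([[0], boundaries, [len(feats)]])
segments = np.stack([feats[a:b].mean(0) for a, b in zip(edges[:-1], edges[1:])])
lexicon = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(segments)
print(len(segments), "segments ->", lexicon)
```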
eess.IV
[783] SLENet: A Novel Multiscale CNN-Based Network for Detecting the Rats Estrous Cycle
Qinyang Wang, Hoileong Lee, Xiaodi Pu, Yuanming Lai, Yiming Ma
Main category: eess.IV
TL;DR: SLENet, a modified EfficientNet with SECA and Non-local attention, achieves 96.31% accuracy in classifying rat estrous cycles from microscopy images, outperforming baseline models.
Details
Motivation: Manual estrous cycle identification in rats is costly, time-consuming, and subjective, necessitating an automated solution.Method: SLENet modifies EfficientNet by replacing SE with SECA and adding Non-local attention to capture long-range dependencies.
Result: SLENet achieves 96.31% accuracy on 2,655 images, surpassing EfficientNet’s 94.2%.
Conclusion: SLENet is effective for rat estrous cycle classification but requires multi-modal inputs for broader application.
Abstract: In clinical medicine, rats are commonly used as experimental subjects. However, their estrous cycle significantly impacts their biological responses, leading to differences in experimental results. Therefore, accurately determining the estrous cycle is crucial for minimizing interference. Manually identifying the estrous cycle in rats presents several challenges, including high costs, long training periods, and subjectivity. To address these issues, this paper proposes a classification network, Spatial Long-distance EfficientNet (SLENet). This network is designed based on EfficientNet, specifically modifying the Mobile Inverted Bottleneck Convolution (MBConv) module by introducing a novel Spatial Efficient Channel Attention (SECA) mechanism to replace the original Squeeze Excitation (SE) module. Additionally, a Non-local attention mechanism is incorporated after the last convolutional layer to enhance the network’s ability to capture long-range dependencies. The dataset comprised 2,655 microscopic images of rat vaginal epithelial cells, with 531 images in the test set. Experimental results indicate that SLENet achieved an accuracy of 96.31%, outperforming the baseline EfficientNet model (94.2%). This finding provides practical value for optimizing experimental design in rat-based studies such as reproductive and pharmacological research. However, this study is limited to microscopy image data and does not consider other factors such as temporal patterns; incorporating multi-modal input will therefore be necessary for future applications.
[784] Multisession Longitudinal Dynamic MRI Incorporating Patient-Specific Prior Image Information Across Time
Jingjia Chen, Hersh Chandarana, Daniel K. Sodickson, Li Feng
Main category: eess.IV
TL;DR: Proposes longitudinal dynamic MRI using prior images to improve reconstruction quality and reduce scan time across sessions.
Details
Motivation: Existing MRI methods process sessions independently, missing shared anatomical and motion information.Method: Uses 4D GRASP MRI, concatenates multi-session data, and applies low-rank subspace-based reconstruction.
Result: Outperforms single-session reconstruction in image quality while preserving inter-session variations.
Conclusion: Introduces a context-aware paradigm for faster imaging with repeated sessions.
Abstract: Serial Magnetic Resonance Imaging (MRI) exams are often performed in clinical practice, offering shared anatomical and motion information across imaging sessions. However, existing reconstruction methods process each session independently without leveraging this valuable longitudinal information. In this work, we propose a novel concept of longitudinal dynamic MRI, which incorporates patient-specific prior images to exploit temporal correlations across sessions. This framework enables progressive acceleration of data acquisition and reduction of scan time as more imaging sessions become available. The concept is demonstrated using the 4D Golden-angle RAdial Sparse Parallel (GRASP) MRI, a state-of-the-art dynamic imaging technique. Longitudinal reconstruction is performed by concatenating multi-session time-resolved 4D GRASP datasets into an extended dynamic series, followed by a low-rank subspace-based reconstruction algorithm. A series of experiments were conducted to evaluate the feasibility and performance of the proposed method. Results show that longitudinal 4D GRASP reconstruction consistently outperforms standard single-session reconstruction in image quality, while preserving inter-session variations. The approach demonstrated robustness to changes in anatomy, imaging intervals, and body contour, highlighting its potential for improving imaging efficiency and consistency in longitudinal MRI applications. More generally, this work suggests a new context-aware imaging paradigm in which the more we see a patient, the faster we can image.
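The low-rank subspace idea can be illustrated on a synthetic Casorati matrix: stacking all sessions’ frames as columns and truncating the SVD recovers the components shared across the extended series.

```python
# Low-rank subspace reconstruction in miniature: a Casorati matrix (voxels x
# frames, all sessions concatenated) projected onto its top singular
# components, which capture anatomy/motion shared across sessions.
import numpy as np

rng = np.random.default_rng(1)
n_vox, n_frames, rank = 500, 120, 8

# ground truth: a few spatial modes modulated by temporal basis functions
U = rng.normal(size=(n_vox, rank))
V = rng.normal(size=(rank, n_frames))
casorati = U @ V + 0.5 * rng.normal(size=(n_vox, n_frames))  # noisy series

# truncated SVD = projection onto an r-dimensional temporal subspace
u, s, vt = np.linalg.svd(casorati, full_matrices=False)
recon = u[:, :rank] @ np.diag(s[:rank]) @ vt[:rank]

err_noisy = np.linalg.norm(casorati - U @ V) / np.linalg.norm(U @ V)
err_lr = np.linalg.norm(recon - U @ V) / np.linalg.norm(U @ V)
print(f"noisy: {err_noisy:.3f}, rank-{rank} recon: {err_lr:.3f}")  # recon closer
```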
[785] A Metabolic-Imaging Integrated Model for Prognostic Prediction in Colorectal Liver Metastases
Qinlong Li, Pu Sun, Guanlin Zhu, Tianjiao Liang, Honggang QI
Main category: eess.IV
TL;DR: A machine learning model was developed to predict postoperative recurrence risk in colorectal liver metastases (CRLM) patients, focusing on preoperative data to avoid data leakage. The 3-month recurrence model showed strong performance (AUC 0.723) and clinical utility.
Details
Motivation: Current clinical models for CRLM prognosis lack accuracy, necessitating a more reliable predictive tool.Method: Used preoperative clinical parameters and radiomic features from CT imaging to develop and validate a machine learning model, avoiding postoperative data to prevent leakage.
Result: The 3-month recurrence prediction model achieved an AUC of 0.723 and outperformed ’treat-all’ or ’treat-none’ strategies in decision curve analysis.
Conclusion: The study provides a clinically useful predictive model for early CRLM recurrence and emphasizes the importance of avoiding data leakage in prognostic modeling.
Abstract: Prognostic evaluation in patients with colorectal liver metastases (CRLM) remains challenging due to suboptimal accuracy of conventional clinical models. This study developed and validated a robust machine learning model for predicting postoperative recurrence risk. Preliminary ensemble models achieved exceptionally high performance (AUC $>$ 0.98) but incorporated postoperative features, introducing data leakage risks. To enhance clinical applicability, we restricted input variables to preoperative baseline clinical parameters and radiomic features from contrast-enhanced CT imaging, specifically targeting recurrence prediction at 3, 6, and 12 months postoperatively. The 3-month recurrence prediction model demonstrated optimal performance with an AUC of 0.723 in cross-validation. Decision curve analysis revealed that across threshold probabilities of 0.55-0.95, the model consistently provided greater net benefit than “treat-all” or “treat-none” strategies, supporting its utility in postoperative surveillance and therapeutic decision-making. This study successfully developed a robust predictive model for early CRLM recurrence with confirmed clinical utility. Importantly, it highlights the critical risk of data leakage in clinical prognostic modeling and proposes a rigorous framework to mitigate this issue, enhancing model reliability and translational value in real-world settings.
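The decision curve analysis reported above rests on the standard net-benefit quantity, NB(p_t) = TP/n - FP/n * p_t/(1 - p_t). A small sketch with synthetic labels and scores:

```python
# Net benefit as used in decision curve analysis, compared against the
# treat-all strategy (treat-none is zero by definition). Data is synthetic.
import numpy as np

def net_benefit(y_true, y_prob, p_t):
    treat = y_prob >= p_t
    n = len(y_true)
    tp = np.sum(treat & (y_true == 1)) / n
    fp = np.sum(treat & (y_true == 0)) / n
    return tp - fp * p_t / (1 - p_t)

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=500)                       # 30% recurrence
scores = np.clip(0.3 + 0.35 * (y - 0.3) + 0.15 * rng.normal(size=500), 0, 1)

for p_t in (0.55, 0.75, 0.95):
    nb_model = net_benefit(y, scores, p_t)
    nb_all = net_benefit(y, np.ones(500), p_t)           # treat everyone
    print(f"p_t={p_t}: model={nb_model:+.3f}, treat-all={nb_all:+.3f}")
```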
[786] SpecBPP: A Self-Supervised Learning Approach for Hyperspectral Representation and Soil Organic Carbon Estimation
Daniel La’ah Ayuba, Jean-Yves Guillemaut, Belen Marti-Cardona, Oscar Mendez Maldonado
Main category: eess.IV
TL;DR: Proposes Spectral Band Permutation Prediction (SpecBPP), a self-supervised learning method for hyperspectral imagery (HSI), achieving state-of-the-art results in Soil Organic Carbon estimation.
Details
Motivation: Self-supervised learning is underexplored for HSI despite its unique spectral structure. SpecBPP aims to leverage spectral continuity for better representation learning.Method: SpecBPP shuffles spectral segments and trains models to recover the correct order, using a curriculum-based strategy to manage permutation complexity.
Result: Achieves R² of 0.9456, RMSE of 1.1053%, and RPD of 4.19, outperforming masked autoencoder and joint-embedding baselines.
Conclusion: Spectral order prediction is a powerful pretext task for HSI, advancing representation learning in remote sensing.
Abstract: Self-supervised learning has revolutionized representation learning in vision and language, but remains underexplored for hyperspectral imagery (HSI), where the sequential structure of spectral bands offers unique opportunities. In this work, we propose Spectral Band Permutation Prediction (SpecBPP), a novel self-supervised learning framework that leverages the inherent spectral continuity in HSI. Instead of reconstructing masked bands, SpecBPP challenges a model to recover the correct order of shuffled spectral segments, encouraging global spectral understanding. We implement a curriculum-based training strategy that progressively increases permutation difficulty to manage the factorial complexity of the permutation space. Applied to Soil Organic Carbon (SOC) estimation using EnMAP satellite data, our method achieves state-of-the-art results, outperforming both masked autoencoder (MAE) and joint-embedding predictive (JEPA) baselines. Fine-tuned on limited labeled samples, our model yields an $R^2$ of 0.9456, RMSE of 1.1053%, and RPD of 4.19, significantly surpassing traditional and self-supervised benchmarks. Our results demonstrate that spectral order prediction is a powerful pretext task for hyperspectral understanding, opening new avenues for scientific representation learning in remote sensing and beyond.
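The pretext task is easy to prototype: split the spectral axis into segments, shuffle them with a known permutation, and train a classifier to recover the permutation index. The toy model and data below are stand-ins; the paper’s curriculum would grow the segment count over training.

```python
# SpecBPP-style pretext task in miniature: shuffle spectral segments and
# classify which permutation was applied. Model and data are toy stand-ins.
import itertools
import torch
import torch.nn as nn

n_bands, n_seg = 30, 3                          # 3 segments -> 3! = 6 classes
perms = list(itertools.permutations(range(n_seg)))

def make_example(spectrum):
    segs = spectrum.chunk(n_seg)                # split the spectral axis
    label = torch.randint(len(perms), ())
    shuffled = torch.cat([segs[i] for i in perms[label]])
    return shuffled, label

model = nn.Sequential(nn.Linear(n_bands, 64), nn.ReLU(),
                      nn.Linear(64, len(perms)))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    # cumsum gives spectra some continuity, so segment order is learnable
    batch = [make_example(torch.randn(n_bands).cumsum(0)) for _ in range(32)]
    x = torch.stack([b[0] for b in batch])
    y = torch.stack([b[1] for b in batch])
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final loss: {loss.item():.3f}")
```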
[787] Hybrid Deep Learning and Handcrafted Feature Fusion for Mammographic Breast Cancer Classification
Maximilian Tschuchnig, Michael Gadermayr, Khalifa Djemal
Main category: eess.IV
TL;DR: A hybrid framework combining ResNet-50, handcrafted features, and transformer embeddings improves breast cancer classification, achieving competitive performance with simplicity.
Details
Motivation: Automated breast cancer classification is challenging due to subtle differences between benign and malignant tissue.Method: A hybrid framework fuses deep ResNet-50 features, handcrafted descriptors, and DINOv2 transformer embeddings, tested on the CBIS-DDSM dataset.
Result: AUC improved from 78.1% (ResNet-50 baseline) to 79.6%, with peak recall of 80.5% and F1 score of 67.4%. Handcrafted features enhanced performance beyond transformer embeddings.
Conclusion: The hybrid approach is practical, computationally efficient, and comparable to state-of-the-art methods, suitable for clinical decision support.
Abstract: Automated breast cancer classification from mammography remains a significant challenge due to subtle distinctions between benign and malignant tissue. In this work, we present a hybrid framework combining deep convolutional features from a ResNet-50 backbone with handcrafted descriptors and transformer-based embeddings. Using the CBIS-DDSM dataset, we benchmark our ResNet-50 baseline (AUC: 78.1%) and demonstrate that fusing handcrafted features with deep ResNet-50 and DINOv2 features improves AUC to 79.6% (setup d1), with a peak recall of 80.5% (setup d1) and highest F1 score of 67.4% (setup d1). Our experiments show that handcrafted features not only complement deep representations but also enhance performance beyond transformer-based embeddings. This hybrid fusion approach achieves results comparable to state-of-the-art methods while maintaining architectural simplicity and computational efficiency, making it a practical and effective solution for clinical decision support.
[788] Taming Domain Shift in Multi-source CT-Scan Classification via Input-Space Standardization
Chia-Ming Lee, Bo-Cheng Qiu, Ting-Yao Chen, Ming-Han Sun, Fang-Ying Lin, Jung-Tse Tsai, I-An Tsai, Yu-Fan Lin, Chih-Chung Hsu
Main category: eess.IV
TL;DR: The paper analyzes how SSFL++ and KDS preprocessing improves multi-source CT-scan classification by reducing domain shifts and enhancing cross-source generalization.
Details
Motivation: Multi-source CT-scan classification faces challenges due to domain shifts, and existing methods' robustness mechanisms are not well understood.Method: The study uses SSFL++ and KDS for spatial and temporal standardization to align inputs into a consistent target space, mitigating domain shifts.
Result: The pipeline improves performance across architectures and won a competitive challenge, validating its effectiveness.
Conclusion: Input-space standardization via SSFL++ and KDS is a robust solution for multi-institutional medical imaging.
Abstract: Multi-source CT-scan classification suffers from domain shifts that impair cross-source generalization. While preprocessing pipelines combining Spatial-Slice Feature Learning (SSFL++) and Kernel-Density-based Slice Sampling (KDS) have shown empirical success, the mechanisms underlying their domain robustness remain underexplored. This study analyzes how this input-space standardization manages the trade-off between local discriminability and cross-source generalization. The SSFL++ and KDS pipeline performs spatial and temporal standardization to reduce inter-source variance, effectively mapping disparate inputs into a consistent target space. This preemptive alignment mitigates domain shift and simplifies the learning task for network optimization. Experimental validation demonstrates consistent improvements across architectures, proving the benefits stem from the preprocessing itself. The approach’s effectiveness was validated by securing first place in a competitive challenge, supporting input-space standardization as a robust and practical solution for multi-institutional medical imaging.
[789] SkinDualGen: Prompt-Driven Diffusion for Simultaneous Image-Mask Generation in Skin Lesions
Zhaobin Xu
Main category: eess.IV
TL;DR: A novel method uses Stable Diffusion-2.0 to generate synthetic skin lesion images and masks, improving model performance by 8-15% in accuracy and F1-score.
Details
Motivation: Addressing data scarcity and class imbalance in medical image analysis for early disease diagnosis.Method: Adapts Stable Diffusion-2.0 via LoRA fine-tuning and multi-objective loss optimization to generate images and masks in one step.
Result: Generated images closely resemble real ones (validated by FID scores), and hybrid datasets boost model performance significantly.
Conclusion: The method provides a scalable solution for medical imaging challenges, enhancing diagnostic accuracy for rare diseases.
Abstract: Medical image analysis plays a pivotal role in the early diagnosis of diseases such as skin lesions. However, the scarcity of data and the class imbalance significantly hinder the performance of deep learning models. We propose a novel method that leverages the pretrained Stable Diffusion-2.0 model to generate high-quality synthetic skin lesion images and corresponding segmentation masks. This approach augments training datasets for classification and segmentation tasks. We adapt Stable Diffusion-2.0 through domain-specific Low-Rank Adaptation (LoRA) fine-tuning and joint optimization of multi-objective loss functions, enabling the model to simultaneously generate clinically relevant images and segmentation masks conditioned on textual descriptions in a single step. Experimental results show that the generated images, validated by FID scores, closely resemble real images in quality. A hybrid dataset combining real and synthetic data markedly enhances the performance of classification and segmentation models, achieving substantial improvements in accuracy and F1-score of 8% to 15%, with additional positive gains in other key metrics such as the Dice coefficient and IoU. Our approach offers a scalable solution to address the challenges of medical imaging data, contributing to improved accuracy and reliability in diagnosing rare diseases.
[790] On Uncertainty Prediction for Deep-Learning-based Particle Image Velocimetry
Wei Wang, Jeremiah Hu, Jia Ai, Yong Lee
Main category: eess.IV
TL;DR: The paper explores three methods (UNN, MM, MT) for uncertainty quantification in deep learning-based PIV, finding UNN as the best performer.
Details
Motivation: Reliable uncertainty quantification in deep learning-based PIV is a critical but overlooked challenge.Method: Three methods (UNN, MM, MT) are evaluated across datasets for uncertainty quantification.
Result: All methods perform well under mild perturbations, with UNN consistently achieving the best performance.
Conclusion: The study provides a framework for uncertainty quantification in PIV, highlighting UNN’s potential for future research and implementation.
Abstract: Particle Image Velocimetry (PIV) is a widely used technique for flow measurement that traditionally relies on cross-correlation to track the displacement. Recent advances in deep learning-based methods have significantly improved the accuracy and efficiency of PIV measurements. However, despite its importance, reliable uncertainty quantification for deep learning-based PIV remains a critical and largely overlooked challenge. This paper explores three methods for quantifying uncertainty in deep learning-based PIV: the Uncertainty neural network (UNN), Multiple models (MM), and Multiple transforms (MT). We evaluate the three methods across multiple datasets. The results show that all three methods perform well under mild perturbations. Among the three evaluation metrics, the UNN method consistently achieves the best performance, providing accurate uncertainty estimates and demonstrating strong potential for uncertainty quantification in deep learning-based PIV. This study provides a comprehensive framework for uncertainty quantification in PIV, offering insights for future research and practical implementation.
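Of the three methods, the multiple-models (MM) baseline is the simplest to sketch: train several independently initialized regressors and read uncertainty off the spread of their predictions. The tiny 1-D setup below stands in for full PIV displacement estimators.

```python
# Multiple-models (ensemble) uncertainty in miniature: the per-point standard
# deviation across independently trained regressors is the uncertainty.
import torch
import torch.nn as nn

def make_model():
    return nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

x = torch.linspace(-1, 1, 200).unsqueeze(1)
y = torch.sin(3 * x) + 0.05 * torch.randn_like(x)      # noisy "displacements"

ensemble = []
for seed in range(5):
    torch.manual_seed(seed)                            # independent inits
    model = make_model()
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(300):
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    ensemble.append(model)

with torch.no_grad():
    preds = torch.stack([m(x) for m in ensemble])      # (5, 200, 1)
mean, std = preds.mean(0), preds.std(0)                # std = MM uncertainty
print(f"mean uncertainty: {std.mean().item():.4f}")
```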
[791] Implicit Spatiotemporal Bandwidth Enhancement Filter by Sine-activated Deep Learning Model for Fast 3D Photoacoustic Tomography
I Gede Eka Sulistyawan, Takuro Ishii, Riku Suzuki, Yoshifumi Saijo
Main category: eess.IV
TL;DR: A deep learning model with sine activation is introduced to enhance 3D photoacoustic tomography by restoring broadband signals from sparse, bandlimited data, improving image quality and practical imaging speed.
Details
Motivation: Overcome limitations of sparse and bandlimited sensors in 3D-PAT, which degrade image quality, by leveraging deep learning to restore broadband signals.Method: Use a sine-activated DL model trained on simulated random spherical absorbers to restore broadband PARF signals, emphasizing bandwidth learning over memorization.
Result: Improved image quality with clearer vascular structures, full bandwidth recovery, higher contrast-to-noise ratio, and minimal structural similarity loss. Achieved 2 volumes-per-second imaging speed.
Conclusion: The sine-activated DL model effectively enhances 3D-PAT by restoring broadband signals, enabling high-quality, fast imaging for practical applications.
Abstract: 3D photoacoustic tomography (3D-PAT) using high-frequency hemispherical transducers offers near-omnidirectional reception and enhanced sensitivity to the finer structural details encoded in the high-frequency components of the broadband photoacoustic (PA) signal. However, practical constraints such as a limited number of channels with bandlimited sampling rates often result in sparse and bandlimited sensors that degrade image quality. To address this, we revisit the 2D deep learning (DL) approach applied directly to sensor-wise PA radio-frequency (PARF) data. Specifically, we introduce sine activation into the DL model to restore the broadband nature of PARF signals given the observed band-limited and high-frequency PARF data. Given the scarcity of 3D training data, we employ simplified training strategies by simulating random spherical absorbers. This combination of a sine-activated model and randomized training is designed to emphasize bandwidth learning over dataset memorization. Our model was evaluated on a leaf skeleton phantom, a micro-CT-verified 3D spiral phantom, and in-vivo human palm vasculature. The results showed that the proposed training mechanism for the sine-activated model generalized well across the different tests by effectively increasing the sensor density and recovering the spatiotemporal bandwidth. Qualitatively, the sine-activated model uniquely enhanced high-frequency content, producing clearer vascular structure with fewer artefacts. Quantitatively, the sine-activated model exhibits full bandwidth at -12 dB spectrum and a significantly higher contrast-to-noise ratio with minimal loss of structural similarity index. Lastly, we optimized our approach to enable fast enhanced 3D-PAT at 2 volumes-per-second for better practical imaging of free-moving targets.
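The sine activation at the heart of the model follows the SIREN family: replacing pointwise nonlinearities with sin(omega0 * Wx + b) biases the network toward broadband, oscillatory signal content. A minimal sketch with the common SIREN initialization (details here are illustrative):

```python
# SIREN-style sine-activated layer; the initialization follows the common
# SIREN recipe, and the overall stack here is illustrative.
import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    def __init__(self, in_dim, out_dim, omega0: float = 30.0, first=False):
        super().__init__()
        self.omega0 = omega0
        self.linear = nn.Linear(in_dim, out_dim)
        with torch.no_grad():
            bound = 1 / in_dim if first else math.sqrt(6 / in_dim) / omega0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega0 * self.linear(x))

net = nn.Sequential(SineLayer(1, 64, first=True), SineLayer(64, 64),
                    nn.Linear(64, 1))
t = torch.linspace(-1, 1, 1024).unsqueeze(1)
print(net(t).shape)  # (1024, 1): a broadband-capable signal mapping
```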
[792] Onboard Hyperspectral Super-Resolution with Deep Pushbroom Neural Network
Davide Piccinini, Diego Valsesia, Enrico Magli
Main category: eess.IV
TL;DR: A lightweight neural network (DPSR) is proposed for real-time hyperspectral image super-resolution onboard satellites, matching pushbroom sensor acquisition and outperforming complex methods.
Details
Motivation: Hyperspectral images have fine spectral but limited spatial resolution. Enhancing spatial resolution is crucial for better detection, especially for onboard satellite processing requiring lightweight methods.Method: DPSR processes images line-by-line in the along-track direction using a causal memory mechanism, reducing memory and computational demands.
Result: DPSR achieves real-time performance on low-power hardware, with super-resolved image quality competitive or superior to state-of-the-art methods.
Conclusion: DPSR is an efficient solution for onboard hyperspectral image super-resolution, balancing performance and computational constraints.
Abstract: Hyperspectral imagers on satellites obtain the fine spectral signatures essential for distinguishing one material from another at the expense of limited spatial resolution. Enhancing the latter is thus a desirable preprocessing step in order to further improve the detection capabilities offered by hyperspectral images on downstream tasks. At the same time, there is a growing interest towards deploying inference methods directly onboard of satellites, which calls for lightweight image super-resolution methods that can be run on the payload in real time. In this paper, we present a novel neural network design, called Deep Pushbroom Super-Resolution (DPSR) that matches the pushbroom acquisition of hyperspectral sensors by processing an image line by line in the along-track direction with a causal memory mechanism to exploit previously acquired lines. This design greatly limits memory requirements and computational complexity, achieving onboard real-time performance, i.e., the ability to super-resolve a line in the time it takes to acquire the next one, on low-power hardware. Experiments show that the quality of the super-resolved images is competitive or even outperforms state-of-the-art methods that are significantly more complex.
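The causal, line-by-line design can be sketched with a recurrent memory that sees only previously acquired lines; the GRU cell and upsampling head below are illustrative stand-ins for the DPSR architecture, and only across-track upsampling is shown.

```python
# Causal line-by-line processing in the pushbroom spirit: each across-track
# line is super-resolved using only a memory of earlier lines, so a line can
# be processed while the next is being sensed. Architecture is illustrative.
import torch
import torch.nn as nn

class CausalLineSR(nn.Module):
    def __init__(self, bands=16, width=64, scale=2, hidden=128):
        super().__init__()
        self.bands, self.width, self.scale = bands, width, scale
        self.rnn = nn.GRUCell(bands * width, hidden)
        self.head = nn.Linear(hidden, bands * width * scale)

    def forward(self, lines):                     # (T, bands, width)
        h = torch.zeros(1, self.rnn.hidden_size)
        out = []
        for line in lines:                        # strictly along-track causal
            h = self.rnn(line.reshape(1, -1), h)
            out.append(self.head(h).view(self.bands, self.width * self.scale))
        return torch.stack(out)

sr = CausalLineSR()
print(sr(torch.randn(10, 16, 64)).shape)          # (10, 16, 128)
```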
[793] AutoLungDx: A Hybrid Deep Learning Approach for Early Lung Cancer Diagnosis Using 3D Res-U-Net, YOLOv5, and Vision Transformers
Samiul Based Shuvo, Tasnia Binte Mamun
Main category: eess.IV
TL;DR: An automated deep learning framework for early lung nodule detection and classification, achieving high accuracy in segmentation, detection, and classification, especially for low-resource settings.
Details
Motivation: Early lung cancer detection is critical but challenging in low-resource settings due to limited access to medical resources and radiologists.
Method: A three-stage framework: lung segmentation (3D Res-U-Net), nodule detection (YOLO-v5), and classification (Vision Transformer), evaluated on the LUNA16 dataset.
Result: Achieved 98.82% segmentation dice score, 0.76 mAP@50 for detection, and 93.57% classification accuracy, outperforming state-of-the-art methods.
Conclusion: The framework effectively improves lung cancer screening accuracy and efficiency in low-resource settings, enhancing patient outcomes.
Abstract: Lung cancer is a leading cause of cancer-related deaths worldwide, and early detection is crucial for improving patient outcomes. Nevertheless, early diagnosis of cancer is a major challenge, particularly in low-resource settings where access to medical resources and trained radiologists is limited. The objective of this study is to propose an automated end-to-end deep learning-based framework for the early detection and classification of lung nodules, specifically for low-resource settings. The proposed framework consists of three stages: lung segmentation using a modified 3D U-Net named 3D Res-U-Net, nodule detection using YOLO-v5, and classification with a Vision Transformer-based architecture. We evaluated the proposed framework on a publicly available dataset, LUNA16. The proposed framework's performance was measured using the respective domain's evaluation metrics. The proposed framework achieved a 98.82% lung segmentation Dice score and detected lung nodules from the segmented lung with a 0.76 mAP@50 at a low false-positive rate. The performance of both networks was compared with other studies and found to outperform them in segmentation and detection accuracy. Additionally, our proposed Vision Transformer network obtained an accuracy of 93.57%, which is 1.21% higher than state-of-the-art networks. Our proposed end-to-end deep learning-based framework can effectively segment lungs and detect and classify lung nodules, specifically in low-resource settings with limited access to radiologists. The proposed framework outperforms existing studies regarding all the respective evaluation metrics. The proposed framework can potentially improve the accuracy and efficiency of lung cancer screening in low-resource settings, ultimately leading to better patient outcomes.
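To make the three-stage data flow concrete, here is a schematic sketch of how segmentation, detection, and classification chain together. The model callables are toy stand-ins, not the authors' API; real trained models replace them.

```python
import numpy as np

def autolungdx_pipeline(ct_volume, seg_model, det_model, cls_model):
    # Stage 1: 3D lung segmentation masks out non-lung voxels
    lungs = ct_volume * (seg_model(ct_volume) > 0.5)
    # Stage 2: slice-wise nodule detection returns (x0, y0, x1, y1) boxes
    boxes = [(z, b) for z, sl in enumerate(lungs) for b in det_model(sl)]
    # Stage 3: each candidate crop is classified benign vs malignant
    return [cls_model(lungs[z][y0:y1, x0:x1]) for z, (x0, y0, x1, y1) in boxes]

vol = np.random.rand(4, 64, 64)
preds = autolungdx_pipeline(
    vol,
    seg_model=lambda v: np.ones_like(v),       # trivial "all lung" mask
    det_model=lambda sl: [(8, 8, 24, 24)],     # one fixed candidate box
    cls_model=lambda crop: float(crop.mean() > 0.5),
)
print(preds)
```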
[794] Synomaly Noise and Multi-Stage Diffusion: A Novel Approach for Unsupervised Anomaly Detection in Medical Images
Yuan Bi, Lucie Huang, Ricarda Clarenbach, Reza Ghotbi, Angelos Karlas, Nassir Navab, Zhongliang Jiang
Main category: eess.IV
TL;DR: Proposes an unsupervised anomaly detection framework using a diffusion model with synthetic anomaly noise and multi-stage diffusion, achieving results comparable to supervised methods without needing annotated anomalies.
Details
Motivation: Addresses the scarcity of expert annotations and anatomical complexity in medical imaging by enabling unsupervised anomaly detection.
Method: Uses a diffusion model with Synomaly noise for synthetic anomalies and multi-stage diffusion for detailed, high-fidelity reconstructions.
Result: Outperforms existing unsupervised methods and matches supervised models in some datasets, validated on brain MRI, liver CT, and carotid US.
Conclusion: The framework is a robust, annotation-efficient alternative for medical anomaly detection, enhancing interpretability and clinical decision-making.
Abstract: Anomaly detection in medical imaging plays a crucial role in identifying pathological regions across various imaging modalities, such as brain MRI, liver CT, and carotid ultrasound (US). However, training fully supervised segmentation models is often hindered by the scarcity of expert annotations and the complexity of diverse anatomical structures. To address these issues, we propose a novel unsupervised anomaly detection framework based on a diffusion model that incorporates a synthetic anomaly (Synomaly) noise function and a multi-stage diffusion process. Synomaly noise introduces synthetic anomalies into healthy images during training, allowing the model to effectively learn anomaly removal. The multi-stage diffusion process is introduced to progressively denoise images, preserving fine details while improving the quality of anomaly-free reconstructions. The generated high-fidelity counterfactual healthy images can further enhance the interpretability of the segmentation models, as well as provide a reliable baseline for evaluating the extent of anomalies and supporting clinical decision-making. Notably, the unsupervised anomaly detection model is trained purely on healthy images, eliminating the need for anomalous training samples and pixel-level annotations. We validate the proposed approach on brain MRI, liver CT datasets, and carotid US. The experimental results demonstrate that the proposed framework outperforms existing state-of-the-art unsupervised anomaly detection methods, achieving performance comparable to fully supervised segmentation models in the US dataset. Ablation studies further highlight the contributions of Synomaly noise and the multi-stage diffusion process in improving anomaly segmentation. These findings underscore the potential of our approach as a robust and annotation-efficient alternative for medical anomaly detection.
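The key trick is corrupting healthy training images with synthetic anomalies so the model learns anomaly removal. A minimal sketch of that idea follows; the soft-edged ellipse blob and intensity shift are illustrative choices, not the paper's exact Synomaly noise function.

```python
import numpy as np

def add_synthetic_anomaly(img, rng):
    """Corrupt a healthy image with a random soft-edged intensity blob."""
    h, w = img.shape
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    ry, rx = rng.integers(4, h // 4), rng.integers(4, w // 4)
    yy, xx = np.ogrid[:h, :w]
    blob = np.exp(-(((yy - cy) / ry) ** 2 + ((xx - cx) / rx) ** 2))
    shift = rng.uniform(0.2, 0.5) * rng.choice([-1, 1])  # random intensity change
    return np.clip(img + shift * blob, 0.0, 1.0), blob > 0.1

rng = np.random.default_rng(0)
healthy = rng.uniform(0.3, 0.7, size=(64, 64))
corrupted, anomaly_mask = add_synthetic_anomaly(healthy, rng)
# Training pair: input = corrupted, target = healthy; the diffusion model
# thus learns anomaly removal without any annotated pathological data.
print(corrupted.shape, anomaly_mask.sum() > 0)
```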
[795] GDSR: Global-Detail Integration through Dual-Branch Network with Wavelet Losses for Remote Sensing Image Super-Resolution
Qiwei Zhu, Kai Li, Guojing Zhang, Xiaoying Wang, Jianqiang Huang, Xilai Li
Main category: eess.IV
TL;DR: The paper introduces GDSR, a dual-branch model combining RWKV and convolutional operations for RSI-SR, addressing global-local dependency gaps and computational inefficiency. It outperforms HAT in PSNR, efficiency, and speed.
Details
Motivation: Existing SR methods fail to balance global and local dependencies efficiently, especially for large-scale RSIs, leading to suboptimal performance and high computational costs.
Method: Proposes GDSR with RWKV for long-range dependencies and convolutional operations for local features, linked by GDRM. Introduces a wavelet-domain loss for better reconstruction.
Result: GDSR surpasses HAT by 0.09 dB in PSNR while using only 63% of its parameters and 51% of its FLOPs, and is 3.2x faster.
Conclusion: GDSR effectively balances global-local features, reduces computational burden, and improves RSI-SR performance.
Abstract: In recent years, deep neural networks, including Convolutional Neural Networks, Transformers, and State Space Models, have achieved significant progress in Remote Sensing Image (RSI) Super-Resolution (SR). However, existing SR methods typically overlook the complementary relationship between global and local dependencies. These methods either focus on capturing local information or prioritize global information, which results in models that are unable to effectively capture both global and local features simultaneously. Moreover, their computational cost becomes prohibitive when applied to large-scale RSIs. To address these challenges, we introduce the novel application of Receptance Weighted Key Value (RWKV) to RSI-SR, which captures long-range dependencies with linear complexity. To simultaneously model global and local features, we propose the Global-Detail dual-branch structure, GDSR, which performs SR by running RWKV and convolutional operations in parallel to handle large-scale RSIs. Furthermore, we introduce the Global-Detail Reconstruction Module (GDRM) as an intermediary between the two branches to bridge their complementary roles. In addition, we propose the Dual-Group Multi-Scale Wavelet Loss, a wavelet-domain constraint mechanism via dual-group subband strategy and cross-resolution frequency alignment for enhanced reconstruction fidelity in RSI-SR. Extensive experiments under two degradation methods on several benchmarks, including AID, UCMerced, and RSSRD-QH, demonstrate that GDSR outperforms the state-of-the-art Transformer-based method HAT by an average of 0.09 dB in PSNR, while using only 63% of its parameters and 51% of its FLOPs, achieving an inference speed 3.2 times faster.
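A structural sketch of the dual-branch idea may help: a global-context branch runs in parallel with a convolutional detail branch, and a bridging module fuses them (GDRM's role in the paper). The global branch below is a simple pooled channel-MLP stand-in, not RWKV, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    def __init__(self, ch=48):
        super().__init__()
        self.global_branch = nn.Sequential(   # stand-in for RWKV global mixing
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch, 1), nn.GELU(), nn.Conv2d(ch, ch, 1))
        self.detail_branch = nn.Sequential(   # local feature extraction
            nn.Conv2d(ch, ch, 3, padding=1), nn.GELU(),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.fuse = nn.Conv2d(2 * ch, ch, 1)  # GDRM-like bridge between branches

    def forward(self, x):
        g = self.global_branch(x) * x         # broadcast global context over the map
        d = self.detail_branch(x)
        return x + self.fuse(torch.cat([g, d], dim=1))

x = torch.randn(1, 48, 64, 64)
print(DualBranchBlock()(x).shape)  # torch.Size([1, 48, 64, 64])
```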
[796] Compressed Image Generation with Denoising Diffusion Codebook Models
Guy Ohayon, Hila Manor, Tomer Michaeli, Michael Elad
Main category: eess.IV
TL;DR: A novel generative method, Denoising Diffusion Codebook Model (DDCM), replaces Gaussian noise in diffusion models with pre-defined codebooks, enabling high-quality image generation and lossless compression. It also extends to conditional tasks like image restoration.
Details
Motivation: To improve image compression and generation by integrating pre-defined noise codebooks into diffusion models, retaining quality while enabling efficient compression.
Method: Replaces standard Gaussian noise in reverse diffusion with noise samples from fixed codebooks, creating DDCM. This allows lossless compression and extends to conditional generation tasks.
Result: DDCM maintains sample quality and diversity even with small codebooks, achieving state-of-the-art perceptual image compression. It also works for conditional tasks like restoration.
Conclusion: DDCM effectively combines generative modeling with compression, offering versatile applications in image generation and restoration while maintaining high quality.
Abstract: We present a novel generative approach based on Denoising Diffusion Models (DDMs), which produces high-quality image samples along with their losslessly compressed bit-stream representations. This is obtained by replacing the standard Gaussian noise sampling in the reverse diffusion with a selection of noise samples from pre-defined codebooks of fixed i.i.d. Gaussian vectors. Surprisingly, we find that our method, termed Denoising Diffusion Codebook Model (DDCM), retains the sample quality and diversity of standard DDMs, even for extremely small codebooks. We leverage DDCM and pick the noises from the codebooks that best match a given image, converting our generative model into a highly effective lossy image codec achieving state-of-the-art perceptual image compression results. More generally, by setting other noise selection rules, we extend our compression method to any conditional image generation task (e.g., image restoration), where the generated images are produced jointly with their condensed bit-stream representations. Our work is accompanied by a mathematical interpretation of the proposed compressed conditional generation schemes, establishing a connection with score-based approximations of posterior samplers for the tasks considered.
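Roughly, one reverse step under the compression rule picks, from a fixed codebook, the noise that steers the trajectory toward the target image, and records its index as bits. The toy sketch below illustrates that selection step; `denoise_mean` stands in for the trained DDM's posterior mean, and all sizes and the step rule are illustrative assumptions, not the paper's exact sampler.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, T = 64, 256, 10
codebooks = rng.standard_normal((T, K, D))  # fixed i.i.d. Gaussian codebooks

def ddcm_step(x_t, t, sigma_t, target, denoise_mean):
    mu = denoise_mean(x_t, t)                 # learned posterior mean (stand-in here)
    candidates = mu + sigma_t * codebooks[t]  # (K, D) possible next states
    # Compression rule: choose the noise that best matches the target image;
    # the chosen index contributes log2(K) bits per step.
    idx = int(np.argmin(np.linalg.norm(candidates - target, axis=1)))
    return candidates[idx], idx

target = rng.standard_normal(D)               # image to compress (flattened)
x, bits = rng.standard_normal(D), []
for t in range(T - 1, -1, -1):
    x, idx = ddcm_step(x, t, sigma_t=0.5, target=target,
                       denoise_mean=lambda x_t, t: 0.9 * x_t + 0.1 * target)
    bits.append(idx)
print(len(bits), "indices =", len(bits) * int(np.log2(K)), "bits")
```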
[797] Text-guided multi-stage cross-perception network for medical image segmentation
Gaoyu Chen, Haixia Pan
Main category: eess.IV
TL;DR: The paper proposes a Text-guided Multi-stage Cross-perception network (TMC) to improve medical image segmentation by enhancing cross-modal interaction and feature expression, achieving superior performance on three datasets.
Details
Motivation: Existing medical image segmentation methods struggle with weak semantic expression due to low contrast between target and non-target regions. Text prompts can help, but current methods lack effective cross-modal interaction.
Method: TMC introduces a multi-stage cross-attention module for better semantic detail understanding and a multi-stage alignment loss for cross-modal consistency.
Result: TMC outperforms UNet-based and text-guided methods with Dice scores of 84.77%, 78.50%, and 88.73% on QaTa-COV19, MosMedData, and Breast datasets.
Conclusion: TMC effectively addresses limitations in medical image segmentation by leveraging text prompts and cross-modal enhancements, demonstrating significant performance improvements.
Abstract: Medical image segmentation plays a crucial role in clinical medicine, serving as a tool for auxiliary diagnosis, treatment planning, and disease monitoring, thus facilitating physicians in the study and treatment of diseases. However, existing medical image segmentation methods are limited by the weak semantic expression of the target segmentation regions, which is caused by the low contrast between the target and non-target segmentation regions. To address this limitation, text prompt information has great potential for capturing the lesion location. However, existing text-guided methods suffer from insufficient cross-modal interaction and inadequate cross-modal feature expression. To resolve these issues, we propose the Text-guided Multi-stage Cross-perception network (TMC). In TMC, we introduce a multi-stage cross-attention module to enhance the model's understanding of semantic details and a multi-stage alignment loss to improve the consistency of cross-modal semantics. Experimental results demonstrate that TMC achieves superior performance, with Dice scores of 84.77%, 78.50%, and 88.73% on three public datasets (QaTa-COV19, MosMedData, and Breast), outperforming UNet-based networks and existing text-guided methods.
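A minimal sketch of one text-to-image cross-perception stage follows, assuming standard multi-head cross-attention; the paper applies such a module (plus an alignment loss) at multiple stages, and the dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class CrossPerception(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, text_tokens):
        # image features attend to the text prompt embeddings
        att, _ = self.attn(query=img_tokens, key=text_tokens, value=text_tokens)
        return self.norm(img_tokens + att)      # residual + normalization

img = torch.randn(2, 196, 256)   # flattened image feature map (B, HW, C)
txt = torch.randn(2, 12, 256)    # text prompt embeddings (B, tokens, C)
print(CrossPerception()(img, txt).shape)  # torch.Size([2, 196, 256])
```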
[798] ADAgent: LLM Agent for Alzheimer’s Disease Analysis with Collaborative Coordinator
Wenlong Hou, Guangqian Yang, Ye Du, Yeung Lau, Lihao Liu, Junjun He, Ling Long, Shujun Wang
Main category: eess.IV
TL;DR: ADAgent, an AI agent for Alzheimer’s disease analysis, integrates multi-modal data and outperforms state-of-the-art methods in diagnosis and prognosis tasks.
Details
Motivation: Early and precise AD diagnosis is critical, but existing methods lack flexibility for multi-modal or missing data. ADAgent addresses this gap.
Method: Built on a large language model, ADAgent combines a reasoning engine, medical tools, and a coordinator for multi-modal AD tasks.
Result: ADAgent achieves significant accuracy improvements: 2.7% in multi-modal diagnosis, 0.7% in prognosis, and better MRI/PET diagnosis.
Conclusion: ADAgent is a versatile and effective AI solution for AD analysis, enhancing diagnostic and prognostic accuracy.
Abstract: Alzheimer’s disease (AD) is a progressive and irreversible neurodegenerative disease. Early and precise diagnosis of AD is crucial for timely intervention and treatment planning to alleviate the progressive neurodegeneration. However, most existing methods rely on single-modality data, which contrasts with the multifaceted approach used by medical experts. While some deep learning approaches process multi-modal data, they are limited to specific tasks with a small set of input modalities and cannot handle arbitrary combinations. This highlights the need for a system that can address diverse AD-related tasks, process multi-modal or missing input, and integrate multiple advanced methods for improved performance. In this paper, we propose ADAgent, the first specialized AI agent for AD analysis, built on a large language model (LLM) to address user queries and support decision-making. ADAgent integrates a reasoning engine, specialized medical tools, and a collaborative outcome coordinator to facilitate multi-modal diagnosis and prognosis tasks in AD. Extensive experiments demonstrate that ADAgent outperforms SOTA methods, achieving significant improvements in accuracy, including a 2.7% increase in multi-modal diagnosis, a 0.7% improvement in multi-modal prognosis, and enhancements in MRI and PET diagnosis tasks.
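A schematic sketch of the coordinator idea, handling arbitrary (possibly partial) modality combinations by routing each available input to a matching specialist tool and aggregating the outputs. The tool names and the confidence-weighted voting rule are illustrative assumptions, not the authors' design.

```python
def coordinate(query, tools):
    """tools: dict mapping modality name -> callable returning (label, confidence)."""
    results = [tools[m](data) for m, data in query["inputs"].items() if m in tools]
    if not results:
        raise ValueError("no tool available for the provided modalities")
    # pick the most confident specialist over whatever modalities were supplied
    return max(results, key=lambda r: r[1])

tools = {"mri": lambda x: ("AD", 0.81), "pet": lambda x: ("MCI", 0.64)}
print(coordinate({"inputs": {"mri": "scan.nii", "pet": "scan.pet"}}, tools))
```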
[799] Efficacy of Image Similarity as a Metric for Augmenting Small Dataset Retinal Image Segmentation
Thomas Wallace, Ik Siong Heng, Senad Subasic, Chris Messenger
Main category: eess.IV
TL;DR: The study evaluates how synthetic images, measured by FID, improve DME segmentation by a U-Net. Lower FID (more similar datasets) leads to better performance, with synthetic data outperforming standard augmentation.
Details
Motivation: To assess the effectiveness of synthetic images (generated by PGGAN) in augmenting limited medical datasets for improving U-Net segmentation of DME.
Method: Used FID to measure dataset similarity and tested synthetic (PGGAN) vs. standard augmentation on U-Net segmentation with limited training data.
Result: Lower FID (more similar datasets) significantly improves segmentation. Synthetic data follows a separate, more effective trend than standard augmentation.
Conclusion: Synthetic images with lower FID enhance U-Net performance, but improvement may require sufficient dissimilarity. Synthetic augmentation is more effective than standard methods.
Abstract: Synthetic images are an option for augmenting limited medical imaging datasets to improve the performance of various machine learning models. A common metric for evaluating synthetic image quality is the Fréchet Inception Distance (FID) which measures the similarity of two image datasets. In this study we evaluate the relationship between this metric and the improvement which synthetic images, generated by a Progressively Growing Generative Adversarial Network (PGGAN), grant when augmenting Diabetes-related Macular Edema (DME) intraretinal fluid segmentation performed by a U-Net model with limited amounts of training data. We find that the behaviour of augmenting with standard and synthetic images agrees with previously conducted experiments. Additionally, we show that dissimilar (high FID) datasets do not improve segmentation significantly. As FID between the training and augmenting datasets decreases, the augmentation datasets are shown to contribute to significant and robust improvements in image segmentation. Finally, we find that there is significant evidence to suggest that synthetic and standard augmentations follow separate log-normal trends between FID and improvements in model performance, with synthetic data proving more effective than standard augmentation techniques. Our findings show that more similar datasets (lower FID) will be more effective at improving U-Net performance, however, the results also suggest that this improvement may only occur when images are sufficiently dissimilar.
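Since FID is the study's central quantity, a minimal sketch of its computation may be useful: fit Gaussian statistics to Inception features of each dataset and compare them. Random vectors stand in for Inception-v3 activations here.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_a, feats_b):
    """FID = ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):            # drop tiny numerical imaginary parts
        covmean = covmean.real
    return np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a + cov_b - 2 * covmean)

rng = np.random.default_rng(0)
a = rng.standard_normal((500, 64))          # stand-in feature sets
b = rng.standard_normal((500, 64)) + 0.1    # slightly shifted distribution
print(round(float(fid(a, b)), 2))
```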
[800] Beyond Manual Annotation: A Human-AI Collaborative Framework for Medical Image Segmentation Using Only “Better or Worse” Expert Feedback
Yizhe Zhang
Main category: eess.IV
TL;DR: A novel Human-AI framework reduces manual annotation in medical image segmentation by using binary preference feedback instead of pixel-level labeling.
Details
Motivation: Manual annotation is labor-intensive and slows AI development in medical imaging.
Method: The framework includes a foundation model, label propagation, a clicking agent learning from feedback, and multi-round segmentation learning.
Result: Competitive segmentation performance achieved on three datasets with minimal human input.
Conclusion: The approach effectively reduces annotation burden while maintaining performance.
Abstract: Manual annotation of medical images is a labor-intensive and time-consuming process, posing a significant bottleneck in the development and deployment of robust medical imaging AI systems. This paper introduces a novel hands-free Human-AI collaborative framework for medical image segmentation that substantially reduces the annotation burden by eliminating the need for explicit manual pixel-level labeling. The core innovation lies in a preference learning paradigm, where human experts provide minimal, intuitive feedback – simply indicating whether an AI-generated segmentation is better or worse than a previous version. The framework comprises four key components: (1) an adaptable foundation model (FM) for feature extraction, (2) label propagation based on feature similarity, (3) a clicking agent that learns from human better-or-worse feedback to decide where to click and with which label, and (4) a multi-round segmentation learning procedure that trains a state-of-the-art segmentation network using pseudo-labels generated by the clicking agent and FM-based label propagation. Experiments on three public datasets demonstrate that the proposed approach achieves competitive segmentation performance using only binary preference feedback, without requiring experts to directly manually annotate the images.
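The preference-learning loop reduces to: propose an edit, collect one bit of "better or worse" feedback, and keep the edit only if judged better. The sketch below illustrates that loop with toy stand-ins; `propose_edit` and the expert oracle are hypothetical placeholders, not the paper's clicking agent.

```python
import numpy as np

def refine(image, seg, propose_edit, expert_prefers, rounds=10):
    for _ in range(rounds):
        candidate = propose_edit(image, seg)   # agent's next proposed segmentation
        if expert_prefers(candidate, seg):     # single-bit "better?" feedback
            seg = candidate                    # keep it; otherwise discard
    return seg

# Toy demo: random edits are accepted only when an oracle (standing in for
# the expert) judges them closer to a hidden ground truth.
rng = np.random.default_rng(0)
gt = np.zeros((32, 32), dtype=bool); gt[8:24, 8:24] = True
iou = lambda a, b: (a & b).sum() / max((a | b).sum(), 1)
flip = lambda img, s: np.logical_xor(s, rng.random(s.shape) < 0.05)
better = lambda c, s: iou(c, gt) > iou(s, gt)
seg = refine(None, np.zeros_like(gt), flip, better, rounds=200)
print(round(iou(seg, gt), 2))  # improves from 0.0 without any pixel labels
```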
[801] IM-LUT: Interpolation Mixing Look-Up Tables for Image Super-Resolution
Sejin Park, Sangmin Lee, Kyong Hwan Jin, Seung-Won Jung
Main category: eess.IV
TL;DR: The paper introduces IM-LUT, a framework for arbitrary-scale image super-resolution (ASISR) that blends interpolation functions efficiently using LUTs, balancing quality and computational cost.
Details
Motivation: Existing LUT-based SR methods are limited to fixed scales, while ASISR techniques are computationally expensive. IM-LUT aims to bridge this gap.
Method: Proposes IM-Net to predict mixing weights for interpolation functions, then transforms it into IM-LUT using LUTs for efficient CPU inference.
Result: IM-LUT achieves superior quality-efficiency balance on benchmark datasets compared to existing methods.
Conclusion: IM-LUT is a promising solution for resource-constrained ASISR applications due to its lightweight and fast inference.
Abstract: Super-resolution (SR) has been a pivotal task in image processing, aimed at enhancing image resolution across various applications. Recently, look-up table (LUT)-based approaches have attracted interest due to their efficiency and performance. However, these methods are typically designed for fixed scale factors, making them unsuitable for arbitrary-scale image SR (ASISR). Existing ASISR techniques often employ implicit neural representations, which come with considerable computational cost and memory demands. To address these limitations, we propose Interpolation Mixing LUT (IM-LUT), a novel framework that operates ASISR by learning to blend multiple interpolation functions to maximize their representational capacity. Specifically, we introduce IM-Net, a network trained to predict mixing weights for interpolation functions based on local image patterns and the target scale factor. To enhance the efficiency of interpolation-based methods, IM-Net is transformed into IM-LUT, where LUTs are employed to replace computationally expensive operations, enabling lightweight and fast inference on CPUs while preserving reconstruction quality. Experimental results on several benchmark datasets demonstrate that IM-LUT consistently achieves a superior balance between image quality and efficiency compared to existing methods, highlighting its potential as a promising solution for resource-constrained applications.
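A minimal sketch of the interpolation-mixing idea: several fixed interpolators upscale the image and predicted per-pixel weights blend the results. The uniform weights below stand in for IM-Net's learned, pattern- and scale-dependent predictions (which the paper then bakes into LUTs).

```python
import numpy as np
from scipy.ndimage import zoom

def mix_interpolations(img, scale, weights):
    # spline orders 0/1/3 = nearest / linear / cubic interpolation
    ups = np.stack([zoom(img, scale, order=o) for o in (0, 1, 3)])
    return (weights * ups).sum(axis=0)  # weights: (3, H*s, W*s), summing to 1

rng = np.random.default_rng(0)
img = rng.random((16, 16))
w = np.full((3, 32, 32), 1 / 3)         # stand-in for IM-Net's output
print(mix_interpolations(img, 2, w).shape)  # (32, 32)
```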
[802] OrthoInsight: Rib Fracture Diagnosis and Report Generation Based on Multi-Modal Large Models
Ningyong Wu, Jinzhi Wang, Wenhong Zhao, Chenzhan Yu, Zhigang Xiu, Duwei Dai
Main category: eess.IV
TL;DR: OrthoInsight, a multi-modal deep learning framework, integrates YOLOv9 for rib fracture detection, a medical knowledge graph, and LLaVA for report generation, outperforming models like GPT-4 and Claude-3 in diagnostic accuracy and clinical utility.
Details
Motivation: The increasing volume of medical imaging data and the inefficiency of manual interpretation necessitate automated tools for diagnosing musculoskeletal injuries like rib fractures.
Method: OrthoInsight combines YOLOv9 for fracture detection, a medical knowledge graph for clinical context, and a fine-tuned LLaVA model for report generation, merging visual and textual data.
Result: Evaluated on 28,675 CT images and expert reports, OrthoInsight scores 4.28 on average, excelling in diagnostic accuracy, content completeness, logical coherence, and clinical guidance.
Conclusion: OrthoInsight showcases the potential of multi-modal learning in medical image analysis, offering valuable support for radiologists.
Abstract: The growing volume of medical imaging data has increased the need for automated diagnostic tools, especially for musculoskeletal injuries like rib fractures, commonly detected via CT scans. Manual interpretation is time-consuming and error-prone. We propose OrthoInsight, a multi-modal deep learning framework for rib fracture diagnosis and report generation. It integrates a YOLOv9 model for fracture detection, a medical knowledge graph for retrieving clinical context, and a fine-tuned LLaVA language model for generating diagnostic reports. OrthoInsight combines visual features from CT images with expert textual data to deliver clinically useful outputs. Evaluated on 28,675 annotated CT images and expert reports, it achieves high performance across Diagnostic Accuracy, Content Completeness, Logical Coherence, and Clinical Guidance Value, with an average score of 4.28, outperforming models like GPT-4 and Claude-3. This study demonstrates the potential of multi-modal learning in transforming medical image analysis and providing effective support for radiologists.
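A schematic sketch of the detect-retrieve-generate flow the abstract describes. All callables are hypothetical placeholders standing in for the YOLOv9 detector, knowledge-graph lookup, and LLaVA model, not the authors' API.

```python
def generate_report(ct_image, detector, kg_lookup, vlm):
    findings = detector(ct_image)                        # YOLOv9-style labels/boxes
    context = "; ".join(kg_lookup(f) for f in findings)  # knowledge-graph retrieval
    prompt = (f"Findings: {', '.join(findings)}.\n"
              f"Clinical context: {context}.\n"
              "Write a structured rib-fracture report.")
    return vlm(ct_image, prompt)                         # LLaVA-style VLM call

# Toy stand-ins to show the flow; real models replace these callables.
report = generate_report(
    "chest.ct",
    detector=lambda img: ["displaced fracture, rib 5"],
    kg_lookup=lambda f: "displaced fractures may injure the pleura",
    vlm=lambda img, prompt: f"[report conditioned on: {prompt!r}]",
)
print(report)
```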
[803] A multi-dynamic low-rank deep image prior (ML-DIP) for real-time 3D cardiovascular MRI
Chong Chen, Marc Vornehm, Preethi Chandrasekaran, Muhammad A. Sultan, Syed M. Arshad, Yingmin Liu, Yuchi Han, Rizwan Ahmad
Main category: eess.IV
TL;DR: A framework (ML-DIP) for 3D real-time CMR reconstruction from highly undersampled data, achieving high-quality results without fully sampled training data.
Details
Motivation: To enable high-quality 3D real-time CMR without needing fully sampled training datasets, addressing challenges like irregular heartbeats and motion artifacts.
Method: ML-DIP uses separate neural networks for spatial and temporal modeling, optimized per scan to reconstruct dynamic images from undersampled k-space data. Evaluated on a phantom, healthy subjects, and patients with PVCs.
Result: Achieved PSNR > 29 dB and SSIM > 0.90 in phantoms; comparable functional measurements to 2D cine in healthy subjects; preserved beat-to-beat variability in PVC patients.
Conclusion: ML-DIP enables high-quality 3D real-time CMR with high acceleration factors, learning directly from undersampled data.
Abstract: Purpose: To develop a reconstruction framework for 3D real-time cine cardiovascular magnetic resonance (CMR) from highly undersampled data without requiring fully sampled training data. Methods: We developed a multi-dynamic low-rank deep image prior (ML-DIP) framework that models spatial image content and temporal deformation fields using separate neural networks. These networks are optimized per scan to reconstruct the dynamic image series directly from undersampled k-space data. ML-DIP was evaluated on (i) a 3D cine digital phantom with simulated premature ventricular contractions (PVCs), (ii) ten healthy subjects (including two scanned during both rest and exercise), and (iii) five patients with PVCs. Phantom results were assessed using peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM). In vivo performance was evaluated by comparing left-ventricular function quantification (against 2D real-time cine) and image quality (against 2D real-time cine and binning-based 5D-Cine). Results: In the phantom study, ML-DIP achieved PSNR > 29 dB and SSIM > 0.90 for scan times as short as two minutes, while recovering cardiac motion, respiratory motion, and PVC events. In healthy subjects, ML-DIP yielded functional measurements comparable to 2D cine and higher image quality than 5D-Cine, including during exercise with high heart rates and bulk motion. In PVC patients, ML-DIP preserved beat-to-beat variability and reconstructed irregular beats, whereas 5D-Cine showed motion artifacts and information loss due to binning. Conclusion: ML-DIP enables high-quality 3D real-time CMR with acceleration factors exceeding 1,000 by learning low-rank spatial and temporal representations from undersampled data, without relying on external fully sampled training datasets.
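A structural sketch of per-scan fitting in the spirit of ML-DIP: a spatial content model and a temporal motion model are optimized jointly so that the simulated undersampled k-space of the warped content matches the measured data. The 1D object, Fourier-shift motion model, and plain parameter vectors below are toy stand-ins for the paper's 3D networks and MRI physics.

```python
import math
import torch

torch.manual_seed(0)
N, T = 64, 8
k = torch.arange(N, dtype=torch.float32)
true = torch.zeros(N); true[20:40] = 1.0                  # toy 1D object
true_shifts = torch.linspace(0, 4, T)                     # toy per-frame motion
y = torch.stack([torch.fft.fft(true) * torch.exp(-2j * math.pi * k * s / N)
                 for s in true_shifts])                   # "measured" k-space
mask = torch.rand(T, N) < 0.4                             # undersampling pattern

content = torch.zeros(N, requires_grad=True)              # spatial content model
shifts = torch.zeros(T, requires_grad=True)               # temporal motion model
opt = torch.optim.Adam([content, shifts], lr=0.1)
for step in range(300):
    opt.zero_grad()
    # Fourier shift theorem implements a differentiable per-frame translation
    pred = torch.stack([torch.fft.fft(content) * torch.exp(-2j * math.pi * k * s / N)
                        for s in shifts])
    loss = ((pred - y)[mask].abs() ** 2).mean()           # data consistency on sampled k
    loss.backward()
    opt.step()
print(f"final data-consistency loss: {loss.item():.4f}")
```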
[804] Multi-Attention Stacked Ensemble for Lung Cancer Detection in CT Scans
Uzzal Saha, Surya Prakash
Main category: eess.IV
TL;DR: A multi-level attention stacked ensemble of deep neural networks is proposed for binary lung nodule classification, achieving high accuracy and AUC with balanced performance.
Details
Motivation: The challenge of accurately classifying benign vs malignant lung nodules in CT images, especially in cases of radiologist disagreement, motivates the development of a robust automated system.
Method: Three pretrained backbones (EfficientNet V2 S, MobileViT XXS, DenseNet201) are adapted with custom classification heads. A two-stage attention mechanism and meta-learner refine predictions. Techniques like dynamic focal loss, MixUp augmentation, and test-time augmentation are used.
Result: Achieves 98.09% accuracy, 0.9961 AUC, and balanced sensitivity (98.73%) and specificity (98.96%), with a 35% error reduction compared to state-of-the-art methods.
Conclusion: The model serves as a robust automated aid for radiologists in lung cancer screening, demonstrating significant improvements in performance.
Abstract: In this work, we address the challenge of binary lung nodule classification (benign vs malignant) using CT images by proposing a multi-level attention stacked ensemble of deep neural networks. Three pretrained backbones – EfficientNet V2 S, MobileViT XXS, and DenseNet201 – are each adapted with a custom classification head tailored to 96 x 96 pixel inputs. A two-stage attention mechanism learns both model-wise and class-wise importance scores from concatenated logits, and a lightweight meta-learner refines the final prediction. To mitigate class imbalance and improve generalization, we employ dynamic focal loss with empirically calculated class weights, MixUp augmentation during training, and test-time augmentation at inference. Experiments on the LIDC-IDRI dataset demonstrate exceptional performance, achieving 98.09% accuracy and 0.9961 AUC, representing a 35% reduction in error rate compared to state-of-the-art methods. The model exhibits balanced performance across sensitivity (98.73%) and specificity (98.96%), with particularly strong results on challenging cases where radiologist disagreement was high. Statistical significance testing confirms the robustness of these improvements across multiple experimental runs. Our approach can serve as a robust, automated aid for radiologists in lung cancer screening.
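A minimal sketch of attention-stacked ensembling as described: logits from the three backbones are concatenated, a small attention network scores each model's contribution, and a meta-learner produces the final prediction. The backbones are stand-in random tensors here, and the module sizes are illustrative.

```python
import torch
import torch.nn as nn

class AttentionStack(nn.Module):
    def __init__(self, n_models=3, n_classes=2):
        super().__init__()
        self.attn = nn.Sequential(            # model-wise importance scores
            nn.Linear(n_models * n_classes, n_models), nn.Softmax(dim=-1))
        self.meta = nn.Linear(n_models * n_classes, n_classes)  # meta-learner

    def forward(self, logits_list):
        cat = torch.cat(logits_list, dim=-1)  # (B, n_models * n_classes)
        w = self.attn(cat)                    # (B, n_models) attention weights
        weighted = [w[:, i:i + 1] * l for i, l in enumerate(logits_list)]
        return self.meta(torch.cat(weighted, dim=-1))

logits = [torch.randn(4, 2) for _ in range(3)]  # stand-in backbone outputs
print(AttentionStack()(logits).shape)           # torch.Size([4, 2])
```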